Recommended data structure for 1 million+ ordered collection in .NET 3.5
My data structures knowledge is way rusty, and to be honest it was never my strongest point.
Right now we're about to build a queue-like component that has the following requirements:
- Has to be able to enqueue, dequeue, and find a specific item by key.
- Each item will be a structure or class whose key is another class with 5 different category-like properties. Assume something like: MasterCategoryId, ChildCategoryId, TimeId, PriorityId, GroupId.
- It has to be a sorted collection.
- Usually the collection will hold anywhere from 5k to 10k objects, but to cover the worst-case scenario we're testing our current prototype with about a million objects.
- Right now it will not be multithreaded.
- About 90-95% of the insertions (queueing) will happen when the component is created, but the component is used like a tree, in the sense that we dequeue the last item in the collection, do calculations on it, and then it reports its result to its parent, which may or may not be in the collection already. If it isn't, the queue method used to find the parent will have to insert the item.
- Since the component works like a queue that gets processed, the collection will be empty after everything is dequeued.
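For reference, a composite key like the one described might be sketched as follows (in Java, as a rough analogy to the C# key class; the property names come from the list above, everything else is illustrative). Ordering the five ids field by field is what makes the key usable in a sorted collection:

```java
// Illustrative composite key: five category-like ids, compared field by field.
// The property names are from the question; the class itself is a sketch.
class ItemKey implements Comparable<ItemKey> {
    final int masterCategoryId, childCategoryId, timeId, priorityId, groupId;

    ItemKey(int master, int child, int time, int priority, int group) {
        this.masterCategoryId = master;
        this.childCategoryId = child;
        this.timeId = time;
        this.priorityId = priority;
        this.groupId = group;
    }

    @Override
    public int compareTo(ItemKey other) {
        // Compare level by level; the first differing id decides the order.
        int c = Integer.compare(masterCategoryId, other.masterCategoryId);
        if (c != 0) return c;
        c = Integer.compare(childCategoryId, other.childCategoryId);
        if (c != 0) return c;
        c = Integer.compare(timeId, other.timeId);
        if (c != 0) return c;
        c = Integer.compare(priorityId, other.priorityId);
        if (c != 0) return c;
        return Integer.compare(groupId, other.groupId);
    }

    public static void main(String[] args) {
        ItemKey a = new ItemKey(1, 2, 3, 4, 5);
        ItemKey b = new ItemKey(2, 0, 0, 0, 0);
        System.out.println(a.compareTo(b) < 0); // prints "true": master category compares first
    }
}
```

In .NET terms the equivalent would implement IComparable<T>, so the key can be used directly in a SortedDictionary or SortedList.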
I think that sums it up. So, a single list or ordered list is obviously out of the question: every time we add or remove objects from the collection it gets re-sorted, and doing that on a single collection with a million objects is slow.
We tested a couple of approaches in the past, like linked lists, which proved fast for queueing but slow for finding items (and we do have that scenario).
Right now we've come up with a structure like
SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, ..
You get the idea.
It's kind of a sweet spot of grouping levels, keeping each collection relatively small (around 300 items per dictionary).
So, for the first level we have a SortedDictionary whose keys are the ids of the master categories; its values are SortedDictionary instances whose keys are the child category ids... and so on.
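As a rough illustration of that nesting (sketched in Java, where TreeMap plays the role of SortedDictionary; the key values and payload are made up for the example):

```java
import java.util.TreeMap;

// Nested sorted maps: master -> child -> time -> priority -> group -> payload.
// Each inner map stays small, so inserts and lookups stay cheap.
class NestedMapsDemo {
    static String demo() {
        TreeMap<Integer, TreeMap<Integer, TreeMap<Integer, TreeMap<Integer, TreeMap<Integer, String>>>>> root =
                new TreeMap<>();

        // Queue an item under key (1, 2, 3, 4, 5), creating levels on demand.
        root.computeIfAbsent(1, k -> new TreeMap<>())
            .computeIfAbsent(2, k -> new TreeMap<>())
            .computeIfAbsent(3, k -> new TreeMap<>())
            .computeIfAbsent(4, k -> new TreeMap<>())
            .put(5, "payload");

        // Find walks the same five levels; dequeuing the last item would
        // follow lastEntry() at each level instead.
        return root.get(1).get(2).get(3).get(4).get(5);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "payload"
    }
}
```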
Right now we've tested with 100, 1,000, 10,000, 100,000 and 1,000,000 items.
For the smaller ranges, up to 100k, the solution is fast: it can queue/dequeue/find in under a second even for up to 300k items, which is well above the 80-90% of the scenarios we will handle.
When it comes to a million, it does become slower, taking about 3-4 seconds to queue the whole thing, and up to 10 seconds to deplete the queue.
So, my questions are:
- Is there a more suited collection or approach for our specific scenario?
- I had never worked with this number of items in collections before. Are these timings reasonable for such high numbers, or not? I ask because I've read tweets from people doing 200k operations per second on things like MSMQ or NServiceBus (which I know isn't related to this; I'm just trying to understand and compare my results).
- The objects I'm using in the prototype right now are just mock classes: just the composite object key and a single property. Would my results be affected when I use the real classes, or not? I'm guessing not, since all the framework does is store a reference to the object, but I just want to confirm, since like I said, data structures have never been my strongest area.
- As a bit of a separate topic: if I wanted to prepare this to be multithreaded, what considerations would I have to take into account?
A few remarks and general suggestions (I'm sorry that I don't have the time to test myself):
I believe that your general approach - nested (sorted)dictionaries - is sound. I use similar structures very often, to my satisfaction - not for performance reasons because my cases are always small, but for clarity and flexibility.
In your case it also addresses one of the performance issues, because sorting (at inserting and at deleting) takes time and smaller (sub)dictionaries obviously sort faster.
Yes, having class-instances as values (or keys) only uses the reference, so in that respect it doesn't really matter what size or structure your class has.
The increasing time for deleting (and adding) presumably is caused (primarily) by the resorting that is done every time you remove (or add) an item.
As regards the performance of adding items:
If your case enables you to feed the dictionaries with items in a sorted (ascending) order, you may want to switch to a SortedLIST, rather than a SortedDICTIONARY, because in the list adding items is O(1) rather than O(log n) if the new items will wind up at the end of the sorted collection.
A collection has an initial CAPACITY - the reserved space for items. Adding items beyond the current capacity of the collection results in (a) increasing the collection's capacity value; (b) reallocating storage for the (whole) new capacity; (c) copying the values from the original (smaller) storage to the new one. Therefore, if you have some idea of how large your collections are going to be, initialize the collection with the appropriate capacity. This is not (yet) possible with a SortedDictionary, but SortedList supports that feature.
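Both points can be sketched together (in Java, as a rough analogy: an ArrayList stands in for the array-backed SortedList; the 1,000,000 figure is just the worst case from the question):

```java
import java.util.ArrayList;

// Sketch of the two points above. Assumption: keys arrive in ascending order,
// as described in the question's "90-95% at creation" scenario.
class PresortedAppendDemo {
    static ArrayList<Integer> fill(int expectedItems) {
        // Pre-sizing reserves capacity once, so the backing array is never
        // reallocated and copied while filling.
        ArrayList<Integer> keys = new ArrayList<>(expectedItems);
        for (int key = 0; key < expectedItems; key++) {
            // Ascending input means each key lands at the end: an O(1) append,
            // with no tree rebalancing and no shifting of existing elements.
            keys.add(key);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(fill(1_000_000).size()); // prints 1000000
    }
}
```

The same idea carries over to SortedList: pass the expected size to its capacity constructor, and feed it keys in ascending order so every insert is an append.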
As regards the performance of deleting items:
You may want to consider using an approach that uses a custom-class wrapping the sorted collection, such that it doesn't "really" delete items (thereby avoiding the resorting), until the whole collection is empty.
All in all, have a try with something along the following lines (using VB syntax; I'm sure you can translate it to C#, C++, or whatever language you wish to use).
Public Class MySortedCollection(Of myKeyType, myValueType)

    Private MyValues As SortedList(Of myKeyType, myValueType)
    Private ValidItems As Dictionary(Of myKeyType, Boolean)
    Private LiveItemsCount As Integer

    Public Sub New(Optional myCapacity As Integer = 0)
        If myCapacity <= 0 Then
            MyValues = New SortedList(Of myKeyType, myValueType)
            ValidItems = New Dictionary(Of myKeyType, Boolean)
        Else
            MyValues = New SortedList(Of myKeyType, myValueType)(myCapacity)
            ValidItems = New Dictionary(Of myKeyType, Boolean)(myCapacity)
        End If
        LiveItemsCount = 0
    End Sub

    Public ReadOnly Property Count As Integer
        Get
            Return LiveItemsCount
        End Get
    End Property

    Public Sub Clear()
        MyValues.Clear()
        ValidItems.Clear()
        LiveItemsCount = 0
    End Sub

    Public Sub Add(key As myKeyType, value As myValueType)
        MyValues.Add(key, value)
        ValidItems.Add(key, True)
        LiveItemsCount += 1
    End Sub

    Public Function Remove(key As myKeyType) As Integer
        Dim isLive As Boolean
        ' Guard against keys that were never added.
        If ValidItems.TryGetValue(key, isLive) AndAlso isLive Then
            ValidItems(key) = False
            LiveItemsCount -= 1
            If LiveItemsCount = 0 Then
                MyValues.Clear()
                ValidItems.Clear()
            End If
        End If
        Return LiveItemsCount
    End Function

    Public Function TryGetValue(key As myKeyType, ByRef value As myValueType) As Boolean
        If MyValues.TryGetValue(key, value) Then
            If ValidItems(key) Then
                Return True
            Else
                value = Nothing
            End If
        End If
        Return False
    End Function

    Public Function TryGetAndDelete(key As myKeyType, ByRef value As myValueType) As Boolean
        If Me.TryGetValue(key, value) Then
            ValidItems(key) = False
            LiveItemsCount -= 1
            If LiveItemsCount = 0 Then
                MyValues.Clear()
                ValidItems.Clear()
            End If
            Return True
        End If
        Return False
    End Function

    ' Add more collection-wrapping methods as needed.

End Class
You "pay" performance-wise for the overhead of the wrapping class as well as for the auxiliary dictionary that is used inside to keep track of the "real" items versus those that are considered deleted. You save however on the repeated sorting upon deleting an item. It depends of course on the actual case whether it will help (or harm...). And again, I haven't tested it myself.