Why the Excess Memory for Strings in Delphi?
I'm reading in a large text file with 1.4 million lines that is 24 MB in size (average 17 characters a line).
I'm using Delphi 2009 and the file is ANSI but gets converted to Unicode upon reading, so it's fair to say the text, once converted, is 48 MB in size.
( Edit: I found a much simpler example ... )
I'm loading this text into a simple StringList:
    AllLines := TStringList.Create;
    AllLines.LoadFromFile(Filename);
I found that the lines of data seem to take much more memory than their 48 MB.
In fact, they use 155 MB of memory.
I don't mind Delphi using 48 MB or even as much as 60 MB allowing for some memory management overhead. But 155 MB seems excessive.
This is not a fault of StringList. I previously tried loading the lines into a record structure, and I got the same result (160 MB).
I don't see or understand what could be causing Delphi or the FastMM memory manager to use 3 times the amount of memory necessary to store the strings. Heap allocation can't be that inefficient, can it?
I've debugged this and researched it as far as I can. Any ideas as to why this might be happening, or ideas that might help me reduce the excess usage would be much appreciated.
Note: I am using this "smaller" file as an example. I am really trying to load a 320 MB file, but Delphi is asking for over 2 GB of RAM and running out of memory because of this excess string requirement.
Addendum: Marco Cantu just came out with a white paper on Delphi and Unicode. Delphi 2009 has increased the overhead per string from 8 bytes to 12 bytes (plus maybe 4 more for the actual pointer to the string). An extra 16 bytes per 17 x 2 = 34 byte line adds almost 50%. But I'm seeing over 200% overhead. What could the extra 150% be?
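To make that arithmetic concrete, here is a minimal sketch that uses only the figures quoted above (1.4 million lines, 17 characters each, a 12-byte header and a 4-byte pointer). The program and constant names are my own, and it deliberately ignores anything the memory manager adds on top, so it only reproduces the "almost 50%" estimate and says nothing about where the rest goes.

```delphi
program StringOverheadEstimate;
{$APPTYPE CONSOLE}

const
  LineCount    = 1400000; // lines in the file, from the question
  CharsPerLine = 17;      // average characters per line, from the question
  BytesPerChar = 2;       // UnicodeString stores 2-byte UTF-16 code units
  HeaderBytes  = 12;      // per-string header in Delphi 2009, per the white paper
  PointerBytes = 4;       // 32-bit reference to each string

var
  PayloadBytes, OverheadBytes: Int64;
begin
  // Bytes used by the characters themselves.
  PayloadBytes := Int64(LineCount) * CharsPerLine * BytesPerChar;
  // Bytes used by the string headers and the pointers to the strings.
  OverheadBytes := Int64(LineCount) * (HeaderBytes + PointerBytes);

  Writeln('Character data:   ', PayloadBytes div 1000000, ' MB');
  Writeln('Header + pointer: ', OverheadBytes div 1000000, ' MB');
  Writeln('Overhead:         ', (OverheadBytes * 100) div PayloadBytes, ' %');
end.
```

That works out to roughly 47.6 MB of character data plus 22.4 MB of headers and pointers, or about 70 MB in total, which is still less than half of the 155 MB observed.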
Success!! Thanks to all of you for your suggestions. You all got me thinking. But I'll have to give Jan Goyvaerts credit for the answer, since he asked:
...why are you using TStringList? Must the file really be stored in memory as separate lines?
That led me to the solution: instead of loading the 24 MB file as a 1.4 million line StringList, I can group my lines into natural groups my program knows about. This resulted in 127,000 lines loaded into the string list.
Now each line averages 190 characters instead of 17. The overhead per StringList line is the same but now there are many fewer lines.
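A rough sketch of that kind of grouping might look like the following. The rule for where a group starts is purely hypothetical (the real rule depends on what the program knows about its data), and `IsGroupStart` and `LoadGrouped` are names invented for this illustration. It reads the file line by line with classic Pascal text I/O and stores one string per group rather than one per line.

```delphi
program GroupLines;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical rule, not taken from the original program:
// a line whose first character is '0' opens a new group.
function IsGroupStart(const Line: string): Boolean;
begin
  Result := (Line <> '') and (Line[1] = '0');
end;

// Reads FileName line by line and adds one entry per group to Groups.
// Lines inside a group are joined with a line feed (#10).
procedure LoadGrouped(const FileName: string; Groups: TStrings);
var
  F: TextFile;
  Line, Current: string;
  InGroup: Boolean;
begin
  AssignFile(F, FileName);
  Reset(F);
  try
    InGroup := False;
    while not Eof(F) do
    begin
      Readln(F, Line);
      if IsGroupStart(Line) or not InGroup then
      begin
        // Store the finished group before starting the next one.
        if InGroup then
          Groups.Add(Current);
        Current := Line;
        InGroup := True;
      end
      else
        Current := Current + #10 + Line;
    end;
    // Store the last group, if the file was not empty.
    if InGroup then
      Groups.Add(Current);
  finally
    CloseFile(F);
  end;
end;

var
  Groups: TStringList;
begin
  Groups := TStringList.Create;
  try
    LoadGrouped(ParamStr(1), Groups); // file name passed on the command line
    Writeln(Groups.Count, ' groups loaded');
  finally
    Groups.Free;
  end;
end.
```

The per-string cost is paid once per group instead of once per line, at the price of splitting a group on #10 whenever its individual lines are needed.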
When I apply this to the 320 MB file, it no longer runs out of memory and now loads in less than 1 GB of RAM. (And it only takes about 10 seconds to load, which is pretty good!)
There will be a little extra processing to parse the grouped lines, but it shouldn't be noticeable in real-time processing of each group.
(In case you were wondering, this is a genealogy program, and this may be the last step I needed to allow it to load all the data about one million people in a 32-bit address space in less than 30 seconds. So I've still got a 20 second buffer to play with to add the indexes into the data that will be required to allow display and editing of the data.)
You asked me personally to answer your question here. I don't know the precise reason why you're seeing such high memory usage, but you need to remember that TStringList does a lot more than just loading your file. Each of these steps requires memory that may result in memory fragmentation. TStringList needs to load your file into memory, convert it from Ansi to Unicode, split it into one string for each line, and stuff those lines into an array that will be reallocated many times.
My question to you is why are you using TStringList? Must the file really be stored in memory as separate lines? Are you going to modify the file in-memory, or just display parts of it? Keeping the file in memory as one big chunk and scanning the whole thing with regular expressions that match the parts you want will be more memory efficient than storing separate lines.
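As a minimal sketch of the "one big chunk" idea, the program below reads the whole file into a single AnsiString and walks it without creating a string per line. It uses a plain byte scan rather than regular expressions, simply to avoid depending on a regex library. `LoadFileAsAnsi` and `CountLines` are names made up for this example, and it assumes the file fits in one allocation.

```delphi
program OneBigChunk;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Reads the entire file into one AnsiString, with no Unicode conversion
// and no per-line allocation.
function LoadFileAsAnsi(const FileName: string): AnsiString;
var
  Stream: TFileStream;
  Size: Integer;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Size := Integer(Stream.Size); // assumes a file well under 2 GB
    SetLength(Result, Size);
    if Size > 0 then
      Stream.ReadBuffer(Result[1], Size);
  finally
    Stream.Free;
  end;
end;

// Counts lines by scanning for line feeds in place.
function CountLines(const Data: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Data) do
    if Data[I] = #10 then
      Inc(Result);
  // A final line without a trailing line feed still counts.
  if (Data <> '') and (Data[Length(Data)] <> #10) then
    Inc(Result);
end;

var
  Data: AnsiString;
begin
  Data := LoadFileAsAnsi(ParamStr(1)); // file name passed on the command line
  Writeln(Length(Data), ' bytes, ', CountLines(Data), ' lines');
end.
```

The same loop could record the byte offset of each line or group start in an array of Integer, which costs 4 bytes per entry instead of a full string header and heap block.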
Also, must the whole file be converted to Unicode? While your application is Unicode, your file is Ansi. My general recommendation is to convert Ansi input to Unicode as soon as possible, because doing so saves CPU cycles. But when you have 320 MB of Ansi data that will stay as Ansi data, memory consumption will be the bottleneck. Try keeping the file as Ansi in memory, and convert only the parts you'll be displaying to the user to Unicode.
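Converting on demand could be as small as the sketch below. `SliceToUnicode` is a made-up helper name, and the sketch assumes the data is in the system's default Ansi code page, which is what a plain AnsiString-to-string cast uses in Delphi 2009.

```delphi
program ConvertOnDemand;
{$APPTYPE CONSOLE}

// Returns Len bytes of Data, starting at the 1-based position Start,
// converted to a UnicodeString. Only this slice is converted; the rest
// of Data stays as Ansi.
function SliceToUnicode(const Data: AnsiString; Start, Len: Integer): string;
begin
  Result := string(Copy(Data, Start, Len));
end;

var
  Data: AnsiString;
begin
  // Stand-in for the file contents held in memory as Ansi.
  Data := 'first line'#10'second line';
  // Convert just the second line (it starts at byte 12 and is 11 bytes long).
  Writeln(SliceToUnicode(Data, 12, 11));
end.
```

The bulk of the data then costs one byte per character, and the two-byte form exists only for the text currently on screen.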
If the 320 MB file isn't a data file you're extracting certain information from, but a data set you want to modify, consider converting it into a relational database, and let the database engine worry how to manage the huge set of data with limited RAM.