Compute, convert and compile XML to CSV using Java
I need to convert multiple XML files (all in the same standard format) and compile them into a single CSV file. Because I also need to perform computations on some of the imported elements, XSLT is not an option (Stackoverflow: XML to CSV Using XSLT) unless I perform the computations afterwards on each converted CSV file.
XPath has been suggested as an alternative to SAX2 (Stackoverflow: Convert XML file to CSV), but because the final CSV output is large (it is based on over 100 XML files) I am hesitant to use arrays.
Using SAX2 I have been somewhat successful in extracting the tag elements.
If I could append the output for each individual file to the final CSV file, I assume that I would have a more memory-stable application.
I hope others would benefit from knowing the answer to the question: How can I efficiently handle computations in conjunction with XML-CSV conversions for large-scale data?
XML file 1
<element id="1">
    <info>Yes</info>
    <startValue>0</startValue> <!-- Value entered twice, ignore -->
    <startValue>256</startValue>
    <stopValue>64</stopValue>
</element>
<element id="2">
    <info>No</info>
    <startValue>50</startValue>
    <stopValue>25</stopValue>
</element>
<....
XML file 2
<element id="1">
    <info>No</info>
    <startValue>128</startValue>
    <stopValue>100</stopValue>
</element>
<....
for all files
    get ID
    get info
    for all stop and start values
        ignore wrong values: use counter
        difference = startValue(i) - stopValue(j) = 192, 28
    append (ID, info and difference) to file "outputfile.csv"
CSV Output Example
File    ID     Info    Difference    Etc
_________________________________________________
0       1      Yes     192           ....
0       2      No      25            ....
1       1      No      28            ....
.       ...    ...     ....
.       ...    ...     ....
nfiles
I would recommend using JDOM to read the XML into memory. Then you can very easily access it programmatically using regular Java syntax. After that, you can use any library to easily create a CSV file. Personally I use opencsv.
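To illustrate the "read it into memory, then use plain Java" idea, here is a minimal sketch. It uses the JDK's built-in DOM parser (`javax.xml.parsers`, `org.w3c.dom`) as a stand-in for JDOM, so it is not JDOM's API, and it builds the CSV lines by hand rather than with opencsv. The class name `ElementDifferences`, the helper `lastText`, and the wrapping root tag `<elements>` are assumptions made for the example, as is the rule that the last `startValue` wins when a value was entered twice.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ElementDifferences {

    /** Returns one "id,info,difference" line per <element> in the given XML text. */
    static List<String> toCsvLines(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList elements = doc.getElementsByTagName("element");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < elements.getLength(); i++) {
            Element e = (Element) elements.item(i);
            String id = e.getAttribute("id");
            String info = lastText(e, "info");
            // Assumption: when a value was entered twice, the last entry is the valid one.
            long start = Long.parseLong(lastText(e, "startValue"));
            long stop = Long.parseLong(lastText(e, "stopValue"));
            lines.add(id + "," + info + "," + (start - stop));
        }
        return lines;
    }

    /** Text of the last child with the given tag (hypothetical helper, no error handling). */
    private static String lastText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.item(nodes.getLength() - 1).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<elements>"
                + "<element id=\"1\"><info>Yes</info><startValue>0</startValue>"
                + "<startValue>256</startValue><stopValue>64</stopValue></element>"
                + "<element id=\"2\"><info>No</info><startValue>50</startValue>"
                + "<stopValue>25</stopValue></element>"
                + "</elements>";
        // Prints 1,Yes,192 and then 2,No,25
        toCsvLines(xml).forEach(System.out::println);
    }
}
```

If the info field can contain commas or quotes, a CSV library such as opencsv is the safer way to write the lines, since the string concatenation above does no escaping.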
If your concern is memory usage, the biggest thing is to keep as few XML files in memory at one time as possible. If you read the files one by one and then store only the information you need in some other data structure, you should be fine. For example, you could create a Map of start values keyed by ID and a Map of stop values keyed by ID.
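The one-file-at-a-time approach with maps keyed by ID could look roughly like the sketch below. To keep it free of external dependencies it uses the JDK's streaming StAX reader (`javax.xml.stream`) instead of JDOM, and a plain `BufferedWriter` opened in append mode instead of opencsv. The class name `XmlFilesToCsv`, the method `convert`, and the column order (file index, ID, info, difference) are assumptions for illustration. It also assumes each file has a single root element around the `<element>` entries, that a repeated `startValue` should be replaced by its last occurrence, and that no field contains a comma.

```java
import java.io.BufferedWriter;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class XmlFilesToCsv {

    /** Reads each XML file in turn and appends its rows to the CSV file. */
    static void convert(List<Path> xmlFiles, Path csv) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (BufferedWriter out = Files.newBufferedWriter(csv, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int fileIndex = 0; fileIndex < xmlFiles.size(); fileIndex++) {
                // Per-file maps: only one file's values are held in memory at a time.
                Map<String, String> infoById = new LinkedHashMap<>();
                Map<String, Long> startById = new LinkedHashMap<>();
                Map<String, Long> stopById = new LinkedHashMap<>();
                try (InputStream in = Files.newInputStream(xmlFiles.get(fileIndex))) {
                    XMLStreamReader reader = factory.createXMLStreamReader(in);
                    String currentId = null;
                    while (reader.hasNext()) {
                        if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                            continue; // skip text, comments and end tags
                        }
                        switch (reader.getLocalName()) {
                            case "element":
                                currentId = reader.getAttributeValue(null, "id");
                                break;
                            case "info":
                                infoById.put(currentId, reader.getElementText().trim());
                                break;
                            case "startValue":
                                // put() overwrites, so a value entered twice keeps its last entry.
                                startById.put(currentId, Long.parseLong(reader.getElementText().trim()));
                                break;
                            case "stopValue":
                                stopById.put(currentId, Long.parseLong(reader.getElementText().trim()));
                                break;
                            default:
                                break; // root element or tags this sketch does not use
                        }
                    }
                    reader.close();
                }
                for (Map.Entry<String, Long> start : startById.entrySet()) {
                    String id = start.getKey();
                    Long stop = stopById.get(id);
                    if (stop == null) {
                        continue; // no stop value, so no difference to compute
                    }
                    out.write(fileIndex + "," + id + "," + infoById.getOrDefault(id, "")
                            + "," + (start.getValue() - stop));
                    out.newLine();
                }
            }
        }
    }
}
```

Because the maps are created inside the loop and the writer appends as it goes, memory use depends on the largest single file rather than on the total number of files.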