Update large XML document

The XML:

<database totalkeys="172" totalvalues="98014">
    <key id="k1" name="Key1" valuecount="3">
        <value id="v1" name="Value1"/>
        <value id="v2" name="Value2"/>
        <value id="v3" name="Value3"/>
    </key>
    <key id="k2" name="Key2" valuecount="3">
        <value id="v1" name="Value1"/>
        <value id="v2" name="Value2"/>
        <value id="v3" name="Value3"/>
    </key>
    ...
</database>

The actual XML is much larger, as the totalkeys and totalvalues attributes show. Each <key> has anywhere from 5 to 19,000 values.

To update the XML I have to gather information from three separate sources, from which I build three dictionaries:

  1. Dictionary<string, List<string>> --> <keyId, List<valueIds>>
  2. Dictionary<string, string> --> <keyId, keyName>
  3. Dictionary<string, string> --> <valueId, valueName>

How do I update the XML without individually checking whether each <key> and <value> already exists? Currently I call SelectSingleNode for each one, and if it returns null I create the node and append it to the XML. This is very slow. Is there a faster way to go about this? Is XML even the right choice for a database this size?
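For reference, the per-node check described above looks roughly like this (a sketch only; the sample dictionaries and variable names such as `keyValues` are illustrative, not the poster's actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Illustrative stand-ins for the three dictionaries from the question.
var keyValues  = new Dictionary<string, List<string>> { ["k2"] = new() { "v1", "v4" } };
var keyNames   = new Dictionary<string, string> { ["k2"] = "Key2" };
var valueNames = new Dictionary<string, string> { ["v1"] = "Value1", ["v4"] = "Value4" };

var doc = new XmlDocument();
doc.LoadXml("<database><key id=\"k1\" name=\"Key1\"/></database>");

foreach (var pair in keyValues)
{
    // SelectSingleNode walks the tree from the top on every call: O(n) per lookup.
    var keyNode = (XmlElement)doc.SelectSingleNode($"/database/key[@id='{pair.Key}']");
    if (keyNode == null)
    {
        keyNode = doc.CreateElement("key");
        keyNode.SetAttribute("id", pair.Key);
        keyNode.SetAttribute("name", keyNames[pair.Key]);
        doc.DocumentElement.AppendChild(keyNode);
    }

    foreach (var valueId in pair.Value)
    {
        // Another scan per value, nested inside the per-key scan.
        if (keyNode.SelectSingleNode($"value[@id='{valueId}']") == null)
        {
            var v = doc.CreateElement("value");
            v.SetAttribute("id", valueId);
            v.SetAttribute("name", valueNames[valueId]);
            keyNode.AppendChild(v);
        }
    }
}
```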


Yeah, that's going to be slow. The XML file is not indexed, so every SelectSingleNode query has to start at the beginning of the document, check each key element, and then check each child element of that key. XML was not designed to be easily searchable, or to be used like a database.

As @Matthew Haugen suggested in a comment, you could parse the XML into a dictionary when you read it in. Then checking whether a key exists becomes a constant-time lookup. This only makes sense if you have enough updates to make that the one-time cost of parsing the entire file is less than the cost of all the repeated searches, and it will take a good bit of memory as well.
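A sketch of that index-first approach using LINQ to XML (sample data and names like `keyIndex` are illustrative, assuming the three dictionaries from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Illustrative stand-ins for the three dictionaries from the question.
var keyValues  = new Dictionary<string, List<string>>
    { ["k1"] = new() { "v1", "v2" }, ["k2"] = new() { "v3" } };
var keyNames   = new Dictionary<string, string> { ["k1"] = "Key1", ["k2"] = "Key2" };
var valueNames = new Dictionary<string, string>
    { ["v1"] = "Value1", ["v2"] = "Value2", ["v3"] = "Value3" };

var doc = XDocument.Parse(
    "<database><key id='k1' name='Key1'><value id='v1' name='Value1'/></key></database>");

// One pass over the document builds both indexes; after this, every
// existence check is an O(1) dictionary/set lookup instead of a scan.
var keyIndex = doc.Root.Elements("key")
    .ToDictionary(k => (string)k.Attribute("id"));
var valueIndex = keyIndex.ToDictionary(
    p => p.Key,
    p => new HashSet<string>(
        p.Value.Elements("value").Select(v => (string)v.Attribute("id"))));

foreach (var pair in keyValues)
{
    if (!keyIndex.TryGetValue(pair.Key, out var keyElem))
    {
        keyElem = new XElement("key",
            new XAttribute("id", pair.Key),
            new XAttribute("name", keyNames[pair.Key]));
        doc.Root.Add(keyElem);
        keyIndex[pair.Key] = keyElem;
        valueIndex[pair.Key] = new HashSet<string>();
    }

    foreach (var valueId in pair.Value)
    {
        // HashSet.Add returns false when the id is already present,
        // so the "does it exist?" check and the bookkeeping are one call.
        if (valueIndex[pair.Key].Add(valueId))
            keyElem.Add(new XElement("value",
                new XAttribute("id", valueId),
                new XAttribute("name", valueNames[valueId])));
    }
}

doc.Save("database.xml"); // write the merged result back out once
```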

But the root problem here is that XML was not made to be a database, and as you are finding, large XML files are slow. It looks like you are trying to re-implement a relational database in XML, so you should look at storing your data in a SQL database instead. You could use an in-process engine like SQLite, or even a Microsoft Access database, so you don't have to set up a server.
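For example, with SQLite via the Microsoft.Data.Sqlite NuGet package (a sketch under assumed names; the table layout below is one way to model the key/value shape, not a prescribed schema):

```csharp
using System;
using Microsoft.Data.Sqlite; // NuGet package Microsoft.Data.Sqlite, assumed available

using var conn = new SqliteConnection("Data Source=:memory:");
conn.Open();

// One row per key and one row per (key, value) pair, mirroring the XML shape.
var ddl = conn.CreateCommand();
ddl.CommandText = @"
    CREATE TABLE keys (
        id   TEXT PRIMARY KEY,
        name TEXT);
    CREATE TABLE key_values (
        key_id TEXT,
        id     TEXT,
        name   TEXT,
        PRIMARY KEY (key_id, id));";
ddl.ExecuteNonQuery();

// INSERT OR IGNORE makes the does-it-exist check implicit: the primary-key
// index rejects duplicates, so there is no per-row SelectSingleNode-style scan.
using var tx = conn.BeginTransaction();
var ins = conn.CreateCommand();
ins.Transaction = tx;
ins.CommandText = "INSERT OR IGNORE INTO keys (id, name) VALUES ($id, $name)";
ins.Parameters.AddWithValue("$id", "k1");
ins.Parameters.AddWithValue("$name", "Key1");
ins.ExecuteNonQuery();
ins.ExecuteNonQuery(); // duplicate insert is silently ignored
tx.Commit();
```

Wrapping the bulk inserts in a single transaction, as above, matters for performance: SQLite commits per statement otherwise, which is very slow for tens of thousands of rows.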
