Database to store sparse matrix
I have a very large and very sparse matrix composed only of 0s and 1s, so in practice I just handle (row-column) pairs. There are at most 10k pairs per row or column.
My needs are the following:
- Parallel insertion of (row-column) pairs
- Quick retrieval of an entire row or column
- Quick check for the existence of a (row-column) pair
- A Ruby client if possible
Are there existing databases suited to this kind of constraint?
If not, what would give me the best performance:
- A SQL database, with a table like this:
row(indexed) | column(indexed) (but the indexes would have to be constantly refreshed)
- A NoSQL key-value store, with two tables like this:
row => columns ordered list
column => rows ordered list
(but with parallel insertion of elements to the lists)
- Something else
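To make the first option concrete, here is a minimal sketch in Python using SQLite from the standard library. The table name `cell`, the column names `r` and `c`, and the helper functions are placeholders chosen for illustration. SQLite serializes writers, so this only shows the schema and the three queries, not parallel insertion. The composite primary key covers row retrieval and pair lookup, and the secondary index on (c, r) covers column retrieval.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cell (
        r INTEGER NOT NULL,
        c INTEGER NOT NULL,
        PRIMARY KEY (r, c)                    -- serves row scans and pair lookups
    ) WITHOUT ROWID;
    CREATE INDEX cell_by_col ON cell (c, r);  -- serves column scans
""")

def add_pair(r, c):
    # INSERT OR IGNORE makes inserting the same pair twice harmless.
    with conn:
        conn.execute("INSERT OR IGNORE INTO cell (r, c) VALUES (?, ?)", (r, c))

def get_row(r):
    # All columns set to 1 in row r, in ascending order.
    query = "SELECT c FROM cell WHERE r = ? ORDER BY c"
    return [c for (c,) in conn.execute(query, (r,))]

def get_column(c):
    # All rows set to 1 in column c, in ascending order.
    query = "SELECT r FROM cell WHERE c = ? ORDER BY r"
    return [r for (r,) in conn.execute(query, (c,))]

def has_pair(r, c):
    # True if the (r, c) entry of the matrix is 1.
    query = "SELECT 1 FROM cell WHERE r = ? AND c = ?"
    return conn.execute(query, (r, c)).fetchone() is not None
```

The indexes are maintained incrementally on each insert, so the open question is really how much write throughput that maintenance costs under parallel load.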
Thanks for your help!
A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem and a graph database would suit your needs.
Graph databases such as Neo4j are very good at fast graph traversal, because retrieving the neighbors of a vertex takes time proportional to the number of neighbors of that vertex, not to the number of vertices in the whole graph. Neo4j is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.
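To illustrate why the lookup cost depends only on the number of neighbors, here is a minimal in-memory sketch in Python. It is not how Neo4j stores data; the class `SparseBinaryMatrix` and its method names are made up for this example. Each row keeps the set of its columns and each column keeps the set of its rows, which is the adjacency-list view of the matrix.

```python
import threading
from collections import defaultdict

class SparseBinaryMatrix:
    """Adjacency sets in both directions for a 0/1 matrix."""

    def __init__(self):
        self._rows = defaultdict(set)   # row -> set of columns
        self._cols = defaultdict(set)   # column -> set of rows
        self._lock = threading.Lock()   # keeps both indexes consistent

    def add(self, r, c):
        # Update both directions under one lock so concurrent writers
        # never leave a pair recorded in only one index.
        with self._lock:
            self._rows[r].add(c)
            self._cols[c].add(r)

    def has(self, r, c):
        # One dictionary lookup plus one set membership test.
        return c in self._rows.get(r, ())

    def row(self, r):
        # Work depends only on how many entries row r has.
        return sorted(self._rows.get(r, ()))

    def column(self, c):
        # Work depends only on how many entries column c has.
        return sorted(self._cols.get(c, ()))
```

With at most 10k pairs per row or column, each retrieval touches at most 10k entries regardless of the overall matrix size.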
On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to run that analysis once in a while and just make the results available, you could try a graph processing framework based on Google's Pregel paper. It's a little bit like MapReduce, but aimed at graph processing. There are already several open-source implementations of that paper.
However, if neither a graph database nor a graph processing framework suits your needs, I recommend taking a look at HBase, an open-source, column-oriented data store based on Google's Bigtable. Its data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and it does not require you to retrieve a whole row just to check whether a certain pair exists. There are some Ruby libraries for HBase, but I imagine it would be safer to use JRuby rather than MRI to interact with it.