Why are key-value pair NoSQL DBs faster than traditional relational DBs?
It has been recommended to me that I investigate Key/Value pair data systems to replace a relational database I have been using.
What I am not quite understanding is how this improves query efficiency. From what I understand, you are throwing away a lot of information that would help make queries more efficient, by turning your structured database into one big long list of keys and values?
Have I missed the point completely?
The key advantage of a relational database is the ability to relate and index information. Most 'NoSQL' systems don't provide a relational algebra or a great query language.
What you need to ask yourself is, does switching make sense for my intended use case?
You have kind of missed the point. The point is that you sometimes don't have an index (at least not in the way you do with a general relational DB). Even when you do have an index, the ability to relate data together is difficult, and it's what relational databases excel at. NoSQL solutions have a number of novel structures which make many use cases trivially easy. For example, Redis is a data-structure-oriented DB well-suited to rapidly building anything involving queues or its pub/sub architecture. MongoDB is a freeform document database which stores documents as JSON (BSON) and excels at rapid development. BigTable solutions are a little less structured than that, but expand the idea of a row to include families of columns: key-value pairs contained in each row and arranged efficiently on disk. You can build an inverted index on top of this with a technology like ElasticSearch.
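To make the data-structure orientation concrete: in Redis, a work queue is just a list you `LPUSH` onto and `RPOP` from. Here is a rough, in-process Python analogue (the `TinyQueueStore` class is purely illustrative; real Redis adds networking, persistence, and blocking pops):

```python
from collections import deque

class TinyQueueStore:
    """Toy analogue of Redis's list commands: each key maps to a deque."""
    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        # Push onto the left end of the list stored at `key`.
        self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        # Pop from the right end; returns None if the list is empty or missing.
        q = self._lists.get(key)
        return q.pop() if q else None

store = TinyQueueStore()
store.lpush("jobs", "resize-image-1")
store.lpush("jobs", "resize-image-2")
print(store.rpop("jobs"))  # → resize-image-1 (FIFO: first pushed, first popped)
```

The point is that the store's primitive *is* the queue; there is no schema or query plan between you and the data structure.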
Not everything needs the consistency guarantees or disk layout of a traditional RDBMS. Another major use case of NoSQL is massive scalability: many solutions (e.g. BigTable-style stores such as HBase and Cassandra) are designed to shard and scale horizontally easily (not so easy with SQL!). Cassandra in particular is designed for no single point of failure (SPOF). Further, column-oriented datastores are meant to optimize disk speeds via sequential reads (and reduce write amplification). That being said, unless you really need it, a traditional SQL server is generally good enough.
There are advantages and disadvantages to each. Personally, I use a mix of both. Use the right tool for the right job, which may end up being PostgreSQL or MySQL more often than not.
You can liken a basic key-value system to an SQL table with two columns: a unique key and a value. This is quite fast, since there is no need to do any relations, correlations, or collation of data. Just find the value and return it. This is an oversimplification, of course; NoSQL databases have a lot of interesting functionality and applications beyond simple K/V stores.
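That two-column analogy can be made literal with SQLite (a sketch only; a real key-value store skips the SQL layer entirely, which is part of where the speed comes from):

```python
import sqlite3

# An in-memory SQLite database acting as a bare key-value table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO kv VALUES (?, ?)", ("user:42", '{"name": "alice"}'))
conn.commit()

# The entire "query language" of a basic KV store: get one value by key.
row = conn.execute("SELECT value FROM kv WHERE key = ?", ("user:42",)).fetchone()
print(row[0])  # → {"name": "alice"}
```

Every lookup is a single primary-key probe; there is nothing to join, sort, or collate.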
I don't know if your scientific data is well suited to most NoSQL implementations; that depends on the data. If you look at HBase or Cassandra, it may well suit a scientist's needs (with proper row-key design: the timestamp must not come first; check out OpenTSDB). I know of many companies that store sensor readings in Cassandra by using a random-order partitioner and the UUID of the sensor to roll up readings into daily fat rows. Every day new databases are created around specific use cases, so that answer may change. For specific use cases, you can reap huge rewards by using specialized datastores, at the cost of flexibility and tooling.
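The "daily fat row" pattern can be sketched in plain Python. This is only an illustration of the layout idea, not Cassandra's actual storage format; the `(sensor_uuid, day)` row key and the `record` helper are invented for the example:

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Each wide row is keyed by (sensor_uuid, day); its "columns" are
# (timestamp, reading) pairs appended in arrival order.
rows = defaultdict(list)

def record(sensor_id, ts, value):
    day = ts.date().isoformat()
    rows[(sensor_id, day)].append((ts.isoformat(), value))

sensor = uuid.uuid4()
record(sensor, datetime(2023, 5, 1, 9, 0, tzinfo=timezone.utc), 21.5)
record(sensor, datetime(2023, 5, 1, 9, 5, tzinfo=timezone.utc), 21.7)

# Fetching one sensor-day is a single row lookup, not a table scan.
print(len(rows[(sensor, "2023-05-01")]))  # → 2
```

Putting the UUID (not the timestamp) first in the key is what spreads rows across partitions while keeping one day's readings physically together.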
The efficiency comes from three main areas:
- The database has far fewer functions: there is no concept of a join, and transactional-integrity requirements are lessened or absent. Less functionality means less work, which means faster, on the server side at least.
- Another design principle is that the data store lives in a cloud of servers, so your request may have multiple respondents. These systems also claim that the multi-server design improves fault tolerance through replication.
- It is fully buzzword compliant, using a bunch of ideas and descriptions that are not wholly invented yet. For example, Amazon is currently giving their services away in order to better understand how people might use them and get some experience to refine the specification.
To my eye, someone coming to you with a requirement that "our new data will be too much for our RDBMS" ought either to have numbers to back that assertion up or to admit they just want to try the new shiny. Is NoSQL meritless? Probably not. Is it going to turn the world upside-down the way Java 1.0 was hyped to? Probably not.
There's no harm in investigating new things; just don't bet the farm on them at the expense of 50-year-old, well-established, well-understood technology.
Here I'm assuming that you want to optimize one particular query, which is simply looking up a record by key. One example of this might be looking up a userinfo record by username. For some systems a query like that has to be incredibly fast and all other queries are unimportant.
The biggest factor in database performance is the number of I/O operations required to read or write data. Most database systems use similar data structures (i.e. B-trees), which can retrieve uncached data in O(log n) I/Os. To make updates durable, the data has to be written to disk; most systems do that sequentially, which is the fastest way.
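The O(log n) behavior of a B-tree lookup can be approximated with binary search over a sorted key array (a real B-tree additionally groups keys into disk pages to cut I/Os, which this in-memory sketch ignores):

```python
import bisect

# Sorted key array standing in for a B-tree index; values stored alongside.
N = 100_000
keys = [f"user:{i:06d}" for i in range(N)]
values = [f"record-{i}" for i in range(N)]

def lookup(key):
    # Binary search: ~17 comparisons for 100,000 keys, vs. 100,000 for a scan.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

print(lookup("user:012345"))  # → record-12345
```

Whether the engine is SQL or NoSQL, both typically pay this same logarithmic cost per uncached point lookup, which is why the differences lie elsewhere.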
So, where can a Key-Value store get efficiencies?
- Non-normalized data. Putting all the data in one row means no joins are needed.
- Low CPU overhead. A key-value store avoids the CPU cost of query processing/optimization, security checks, constraint checks, etc.
- It is easier to have the store run in-process (as opposed to an SQL server running as a separate service), which eliminates IPC overhead.
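The first point, denormalization eliminating joins, can be seen side by side in SQLite (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
    -- Normalized: answering "order plus customer name" requires a join.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 99.5);

    -- Denormalized: everything for one order lives in one row, keyed by order id.
    CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, customer_name TEXT, total REAL);
    INSERT INTO orders_denorm VALUES (10, 'alice', 99.5);
""")

joined = conn.execute("""
    SELECT c.name, o.total FROM orders o
    JOIN customers c ON c.id = o.customer_id WHERE o.id = 10
""").fetchone()

flat = conn.execute(
    "SELECT customer_name, total FROM orders_denorm WHERE id = 10"
).fetchone()

print(joined == flat)  # → True: same answer, but the flat version is one lookup
```

The trade-off is the usual one: the denormalized row is faster to read but must be rewritten everywhere if the customer's name changes.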
Most RDBMS systems are built on top of something which looks like a key-value store so you could view this as cutting out the middleman.
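That layering can be sketched as follows: a table row becomes a serialized value filed under a composite key. The `table:primary_key` key format and both helper functions are illustrative only; real storage engines such as InnoDB use B-tree pages keyed by primary key rather than a flat map:

```python
import json

# The "storage engine": nothing but a key-value map.
kv = {}

def insert_row(table, pk, row):
    # A row is serialized and stored under "table:primary_key".
    kv[f"{table}:{pk}"] = json.dumps(row)

def select_by_pk(table, pk):
    raw = kv.get(f"{table}:{pk}")
    return json.loads(raw) if raw is not None else None

insert_row("users", 42, {"name": "alice", "email": "alice@example.com"})
print(select_by_pk("users", 42)["name"])  # → alice
```

Everything an RDBMS adds on top of this substrate (parsing, planning, joins, constraints, transactions) is exactly the "middleman" a bare key-value store cuts out.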
There are a lot of good observations above, and sometimes a little too much passion from proponents on both sides. Let's get back to your original question.

Suppose you do a design on Cassandra and an identical design on an RDBMS. Say you have a set of KV pairs in Cassandra, and you build an identical set of KV pairs on the relational side (it is actually possible to do this, e.g. as a fully denormalized name/value-pair table). Even so, the relational version will run slower, simply because of the overhead of the relational DBMS: logging, catalog access, integrity checking, transaction atomicity, and so on. In addition, in a column-family data store the data is lexicographically sorted; in a relational database it is not. I believe several of the social networking sites did exactly this comparison: they built identical structures on both, and relational was slower.

It is important to remember that after a user queries the product database, looks at who also bought this or that, and builds their shopping cart and wishlist (all of which may well be done on NoSQL), the moment they hit the checkout button, the transaction will be run on a relational database. Why can't we so-called experts realize it is not one versus the other in this database debate? There is a place for relational, just as there is for NoSQL, graph, inverted-column, and multidimensional databases, and even plain files.