How should I store extremely large amounts of traffic data for easy retrieval?
for a traffic accounting system I need to store large amounts of datasets about internet packets sent through our gateway router (containing timestamp, user id, destination or source ip, number of bytes, etc.).
This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.
What is a good way to do this? I already have some ideas:
Create a file for each user and day and append every dataset to it.
- Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
- Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
Use a database
- Advantage: It's very easy to find specific data with the right SQL query.
- Disadvantage: I'm not sure if there is a database engine that can efficiently handle a table with possibly hundreds of millions datasets.
Perhaps it's possible to combine the two approaches: Using an SQLite database file for each user.
- Advantage: It would be easy to get information for one user using SQL queries on his file.
- Disadvantage: Getting overall information would still be difficult.
But perhaps someone else has a very good idea?
Thanks very much in advance.
First, get The Data Warehouse Toolkit before you do anything.
You're doing a data warehousing job, you need to tackle it like a data warehousing job. You'll need to read up on the proper design patterns for this kind of thing.
[Note Data Warehouse does not mean crazy big or expensive or complex. It means Star Schema and smart ways to handle high volumes of data that's never updated.]
SQL databases are slow, but that slow is good for flexible retrieval.
The filesystem is fast. It's a terrible thing for updating, but you're not updating, you're just accumulating.
A typical DW approach for this is to do this.
Define the "Star Schema" for your data. The measurable facts and the attributes ("dimensions") of those facts. Your fact appear to be # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.
Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.) Each dimension will have all the attributes you might ever want to know. This grows, people are always adding attributes to dimensions.
Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.
Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.
When someone wants to do analysis, build them a datamart.
For the selected IP address or time frame or whatever, get all the relevant facts, plus the associated master dimension data and bulk load a datamart.
You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SELECT SUM(*) with various GROUP BY and HAVING and WHERE clauses.
I think the proper answer really depends on the definition of a "dataset". As you mention in your question you are storing individual sets of information for each record; timestamp, userid, destination ip, source ip, number of bytes etc..
SQL Server is perfectly capable of handing this type of data storage with hundreds of millions of records without any real difficulty. Granted this type of logging is going to require some good hardware to handle it, but it shouldn't be too complex.
Any other solution in my opinion is going to make reporting very hard, and from the sounds of it that is an important requirement.
So you are in one of the cases where you have much more write activity than read, you want your writes not to block you, and you want your reads to be "reasonably fast" but not critical. It's a typical business intelligence use case.
You should probably use a database and store your data in as a "denormalized" schema to avoid complex joins and multiple inserts for each record. Think of your table as a huge log file.
In this case, some of the "new and fancy" NoSQL databases are probably what you're looking for: they provide relaxed ACID constraints, which you should not terribly mind here (in case of crash, you can loose the last lines of your log), but they perform much better for insertion, because they don't have to sync journals on disk at each transaction.