ETL Processing Design and Performance
I am working on an ETL process for a data warehouse in C#; it has to support both SQL Server and Oracle. During development I have been writing stored procedures that synchronize data from one database to another. The stored procedure code is rather ugly because it involves dynamic SQL: since the database name is dynamic, the SQL strings have to be built at runtime.
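For context, the string building looks roughly like this (sketched in C# with hypothetical table and column names; the same pattern lives inside the T-SQL procedures via sp_executesql):

```csharp
using System;

static class SyncSqlBuilder
{
    // Builds an UPDATE that copies changed rows from a source database
    // into the warehouse. The database name cannot be passed as a SQL
    // parameter, so it must be spliced into the statement text itself.
    public static string BuildUpdate(string sourceDb, string table)
    {
        // NOTE: in real code the database/table names must be validated
        // or quoted to avoid SQL injection.
        return
            $"UPDATE tgt SET tgt.Name = src.Name, tgt.Amount = src.Amount " +
            $"FROM dbo.{table} tgt " +
            $"JOIN [{sourceDb}].dbo.{table} src ON src.Id = tgt.Id " +
            $"WHERE src.ModifiedAt > tgt.ModifiedAt;";
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(SyncSqlBuilder.BuildUpdate("StagingDb", "Orders"));
    }
}
```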
My team lead wants to use C# code to do the ETL. We have code generation that automatically generates new classes when the database definition changes. That is also why I decided not to use Rhino ETL.
Here are the pros and cons:

Stored procedures:
- fast loading process, everything is handled by the database
- easy deployment, no compiling needed
- poor readability due to dynamic SQL
- need to maintain both T-SQL and PL/SQL scripts when the database definition changes
- slow development because there is no IntelliSense when writing dynamic SQL

Application code (C#):
- easier to develop the ETL process because we get IntelliSense from our generated classes
- easier to maintain because of the generated classes
- better logging and error handling
- slow performance compared with stored procedures
I would prefer to use application code for the ETL process, but the performance was horrible compared with the stored procedures. In one test, when I tried to update 10,000 rows, the stored procedure took only 1 second, while my ETL code took 70 seconds. Even if I somehow manage to reduce the overhead, 20% of those 70 seconds is spent purely on issuing the update statements from application code.
Could someone offer suggestions or comments on how to speed up the ETL process using application code?
My next idea is to try a parallel ETL process: open multiple database connections and perform the updates and inserts concurrently.
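That parallel idea can be sketched as partitioning the key range and giving each worker its own connection (hypothetical names; the per-row database work is simulated here):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        int[] rowIds = Enumerable.Range(1, 10_000).ToArray();
        int workers = 4;
        var processed = new ConcurrentBag<int>();

        // Each partition would open its own SqlConnection/OracleConnection
        // and run its updates independently; here the work is simulated.
        Parallel.ForEach(
            Partitioner.Create(0, rowIds.Length),
            new ParallelOptions { MaxDegreeOfParallelism = workers },
            range =>
            {
                // using var conn = new SqlConnection(...); conn.Open();
                for (int i = range.Item1; i < range.Item2; i++)
                    processed.Add(rowIds[i]); // stand-in for one UPDATE
            });

        Console.WriteLine(processed.Count); // every row processed exactly once
    }
}
```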
You say you have code generation that automatically generates new classes; why don't you have code generation that automatically generates new stored procedures?
That should give you the best of both worlds: encapsulate it in a few nice classes that can inspect the database and update things as necessary. You would not exactly increase readability, but you would hide the dynamic SQL (and you would not need to update the SPs manually).
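A minimal sketch of that idea, generating a MERGE-based sync procedure from a simple table description (hypothetical metadata model; a real generator would read the catalog views and emit both T-SQL and PL/SQL variants):

```csharp
using System;
using System.Linq;

static class ProcGenerator
{
    // Generates the text of a T-SQL synchronization procedure for one table.
    public static string GenerateSyncProc(string table, string key, string[] columns)
    {
        string setList = string.Join(", ", columns.Select(c => $"tgt.{c} = src.{c}"));
        string colList = string.Join(", ", columns);
        string srcList = string.Join(", ", columns.Select(c => $"src.{c}"));

        return
$@"CREATE OR ALTER PROCEDURE dbo.Sync_{table}
    @SourceDb sysname
AS
BEGIN
    DECLARE @sql nvarchar(max) = N'
        MERGE dbo.{table} AS tgt
        USING [' + @SourceDb + N'].dbo.{table} AS src
            ON src.{key} = tgt.{key}
        WHEN MATCHED THEN UPDATE SET {setList}
        WHEN NOT MATCHED THEN INSERT ({key}, {colList})
            VALUES (src.{key}, {srcList});';
    EXEC sp_executesql @sql;
END";
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(ProcGenerator.GenerateSyncProc(
            "Orders", "Id", new[] { "Name", "Amount" }));
    }
}
```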
Also, the difference should not be so huge; it sounds as if you are doing something wrong (not reusing connections, moving data unnecessarily from the server to the application, or processing rows one by one instead of in batches?).
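On the batching point: instead of one round trip per row, group rows into batches and send each batch as a single command (or bulk-load into a staging table and run one set-based MERGE). A sketch of the chunking, with the database call stubbed out:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Batching
{
    // Splits a sequence into consecutive batches of at most `size` items.
    public static IEnumerable<List<T>> Chunk<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size) { yield return batch; batch = new List<T>(size); }
        }
        if (batch.Count > 0) yield return batch;
    }
}

class Program
{
    static void Main()
    {
        var rows = Enumerable.Range(1, 10_000);
        int roundTrips = 0;

        foreach (var batch in Batching.Chunk(rows, 500))
        {
            // In real code: build ONE multi-row statement (or pass a
            // table-valued parameter) for the whole batch and execute it
            // on an open, reused connection inside one transaction.
            roundTrips++;
        }

        Console.WriteLine(roundTrips); // 20 round trips instead of 10,000
    }
}
```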
Also, regarding better logging: care to elaborate on that? You can have logging at the database layer too, or you can design your SPs so that the application layer can still do the logging.
If your C# code is already slow with 10,000 rows, I cannot imagine it in a real environment...
Most ETL is done within the database (stored procedures, packages, or even code compiled inside the database, such as PL/SQL or Java for Oracle). These can handle millions of rows.
Alternatively, professional tools can be used (Informatica, among others); they will still be slower than stored procedures, but easier to manage.
So my conclusion is: if you want to come anywhere close to stored-procedure performance, you will have to write an application as good as the professional ones on the market, which took years to develop and mature... Do you think you can?
Plus, if you have to handle different database types (SQL Server, Oracle), you CANNOT make a generic application AND optimize it at the same time; it is a choice, because Oracle does not work the same way SQL Server does.
To give you an idea, ETLs for Oracle use hints (such as the parallel execution hints), and some indexes may be dropped or integrity constraints disabled temporarily to optimize the load.
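For illustration, an Oracle-flavored sketch of those techniques (hypothetical table and index names):

```sql
-- Parallel DML plus hints on the insert; APPEND requests a direct-path load.
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(tgt, 4) */ INTO warehouse_orders tgt
SELECT /*+ PARALLEL(src, 4) */ * FROM staging_orders src;

-- Make an index unusable during the load, rebuild it afterwards.
ALTER INDEX warehouse_orders_ix UNUSABLE;
-- ... load ...
ALTER INDEX warehouse_orders_ix REBUILD;

-- Temporarily disable a constraint, re-enable it after the load.
ALTER TABLE warehouse_orders DISABLE CONSTRAINT fk_orders_customer;
-- ... load ...
ALTER TABLE warehouse_orders ENABLE CONSTRAINT fk_orders_customer;
```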
There is no way that I know of to do the exact same thing in SQL Server (it may have similar options, but with different syntax). So "one ETL for all databases" can hardly be done without losing efficiency and speed.
So I think your pros and cons are very accurate: you have to choose between speed and ease of development; you cannot have both.