Wednesday, April 11, 2012

ETL or ELT, That Is The Question

The ETL process is at the heart of the data warehouse, turning messy, scattered data into something organized and centralized. But my experience has been that ETL is slower than ELT. Beyond the swapping of two letters, I will describe what that difference means and why it matters.

First, the basics: ETL stands for "Extract, Transform and Load". It is the process by which data is extracted from the source systems (database, file), transformed into the proper format and structure, and loaded into the data warehouse. Since you're reading this article, you probably know that already.

Can you guess what ELT stands for? Don't think too hard: it means "Extract, Load and Transform". This swapping of letters has a few important implications. For one, it means the data goes from source to destination without being transformed, and transformation takes place in the database. Because the load step can use database bulk loaders and native data formats, data movement is very fast. Transforming data in the database also narrows performance tuning and maintenance down to a smaller set of tools, which gains efficiency and requires fewer specialized skills. Finally, it opens the door to keeping a copy of the source data in the reporting databases, in addition to the data warehouse, because the data has already been copied; this may or may not be interesting for your needs, and you get to decide.
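To make this concrete, here is a minimal ELT sketch in Python with SQLite. Every table, column and file name is hypothetical, and a real warehouse would use its own bulk loader rather than executemany; the point is simply that the load step copies raw rows untouched, and the transform step is one set-based SQL statement running inside the database.

import csv
import sqlite3

# Minimal ELT sketch; all table, column and file names are hypothetical.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS stg_orders "
             "(order_id TEXT, customer_id TEXT, amount TEXT, order_date TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
             "(order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT)")

# L: load the extracted file into the staging table as-is, no transformation yet.
with open("orders_extract.csv", newline="") as f:
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", csv.reader(f))

# T: transform inside the database, casting and filtering in one statement.
conn.execute("""
    INSERT INTO fact_orders (order_id, customer_id, amount, order_date)
    SELECT CAST(order_id AS INTEGER),
           CAST(customer_id AS INTEGER),
           CAST(amount AS REAL),
           DATE(order_date)
    FROM stg_orders
    WHERE order_id IS NOT NULL
""")
conn.commit()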

Moving data from source to staging can be done in a number of ways: real-time replication, nightly batch extraction, intra-day batches, snapshot backups, etc. The process of acquiring the data depends a lot on volume: small volumes work well in batch, but large volumes need more care. It also depends on business requirements: if intra-day data is needed, then multiple batches or real-time replication become necessary. At the end of the process, a day's (or partial day's) worth of data has been copied from the source systems to a staging area, ready for processing.
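On the extraction side, a nightly batch can be as simple as the following sketch, which pulls yesterday's rows out of a hypothetical source database and writes them to a flat file for the bulk loader to pick up. Real-time replication or snapshot backups would replace this script entirely.

import csv
import sqlite3
from datetime import date, timedelta

# Hypothetical nightly extract: yesterday's rows from the source system
# go to a flat file, ready for bulk loading into the staging area.
yesterday = (date.today() - timedelta(days=1)).isoformat()

src = sqlite3.connect("source_system.db")
cur = src.execute(
    "SELECT order_id, customer_id, amount, order_date "
    "FROM orders WHERE order_date = ?",
    (yesterday,))

with open("orders_extract.csv", "w", newline="") as f:
    csv.writer(f).writerows(cur)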

The transformation part is where the magic happens: cleanup, lookups, rollups, aggregation, validation and error checking; this is where the data is prepared for the data warehouse. Because the data is already in the database, lookups can be done with joins, efficient stored procedures can be written, and indexing and other performance tuning techniques can be used. By comparison, a standard ETL flow would have the ETL tool iteratively run through transformations outside of the database, storing lookup data in memory, and writing the resulting data to the data warehouse at the end.
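Here is what a join-based lookup might look like, again with hypothetical table and column names. Resolving a dimension's surrogate key is a single set-based statement, where an external ETL tool would iterate over the rows and hold the lookup table in memory.

import sqlite3

# Hypothetical set-based lookup: resolve the customer surrogate key with a
# join against the dimension table instead of a row-by-row lookup.
conn = sqlite3.connect("warehouse.db")
conn.execute("""
    INSERT INTO fact_orders (order_id, customer_key, amount, order_date)
    SELECT s.order_id, d.customer_key, s.amount, s.order_date
    FROM stg_orders AS s
    JOIN dim_customer AS d
      ON d.source_customer_id = s.customer_id
""")
conn.commit()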

If budget and staffing levels allow it, the combination of mainstream ETL software and powerful hardware is certainly more flexible than the solution I am describing here. Maintaining SQL-based data transformations is tedious, and you still need a script to control the flow.
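To give a taste of what that control script involves, here is a bare-bones sketch with entirely hypothetical step names: it runs the transformation steps in order and stops at the first failure. An ETL tool gives you this sequencing, restartability and logging through its graphical interface; with ELT, you write and maintain it yourself.

import sqlite3

# Bare-bones flow control for SQL transformations; step names are hypothetical.
STEPS = [
    ("scrub staging", "DELETE FROM stg_orders WHERE order_id IS NULL"),
    ("load facts", "INSERT INTO fact_orders SELECT * FROM stg_orders"),
    ("clear staging", "DELETE FROM stg_orders"),
]

conn = sqlite3.connect("warehouse.db")
for name, sql in STEPS:
    try:
        conn.execute(sql)
        conn.commit()
        print("step ok:", name)
    except sqlite3.Error as exc:
        conn.rollback()
        raise SystemExit("step failed: %s: %s" % (name, exc))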

But when budgets are tight and compromises must be made, the ELT approach has been shown to work very effectively with multi-terabyte data sets, at the cost of more difficult maintenance. It's a trade-off that is acceptable in smaller organizations, but may not scale well as the business grows.

As you evaluate an ETL tool for your project, or if you are faced with performance issues that cannot be resolved without resorting to big hardware, consider looking into an E-L-T data flow. I have used it with more than 4TB of data, adding 30GB daily, and performance has been very good. The cost you pay is slower maintenance compared to a graphical interface, but this is easily alleviated by strong programming standards. At the very least it will give you another option, and you may realize it is all you needed. Plus, if you can afford the additional disks needed to store a copy of the source data, you gain the flexibility of having both the data warehouse and the source data in one place. Win-win!
