About The Data: April 2012

Wednesday, April 18, 2012

What In The Universe?...

This is a discussion about giving business users the ability to create their own reports, and the benefits of using a metadata layer in your business intelligence platform. Business Objects calls this a Universe; other products have other names for it.

So you built a great data warehousing system, data is flowing in and is well organized, and now you are thinking about delivering this data to the business users. (If you waited until the data warehouse was built to think about information delivery, it is a bit late; this should have come up at the very beginning. Moving on.)

You write reports using SQL queries, and you deliver those reports by email, or maybe have the reports published on a web portal. And you do this for all reports, and those reports are very useful, and the report requests keep coming in. More users from more departments start asking for information, and soon you become a specialist of all the data, all the metrics, all the partnerships, all the details and exceptions of every aspect of the company. You also become the bottleneck, as you cannot keep up with the requests, and business users become impatient.

Sound familiar? Don't worry, it is quite common; the success of a business intelligence platform can be overwhelming at first.

One way to address this situation is to hire more people to write more reports. It works for a while, but it only delays the problem, and actually makes it worse because it allows the concentration of a lot of business knowledge within a small group of people, who are not the ones making decisions in all those areas. It is inevitable that the business intelligence professionals within your organisation will acquire a lot of business knowledge, but this knowledge needs to be distributed in the respective departments.

A better way to reduce the bottleneck, and at the same time limit the concentration of knowledge, is to give the users the ability to create their own reports, and train them on how to do it effectively. Read this again: give the ability to create reports, and train users. If the users are not trained, they will not know how to do it. This is just common sense.

So you install a database query tool on the computer of the marketing manager, product development manager, and VP of finance, and they start writing SQL queries against the data warehouse, and you have solved the problem. Right? Wrong. Writing SQL is your job. Your business users need a different tool, something more meaningful than data types, codes, joins and group bys. Your users need a graphical interface.

Most reporting software has a built-in graphical tool to write queries. Business Objects, MicroStrategy, Tableau, Cognos: they all offer a way of writing queries using a drag-and-drop interface. But that still requires technical knowledge of the underlying database, with all the complexities and exceptions.

This is where the metadata layer comes in. In short, the metadata layer (also called the business layer, or semantic layer) is a piece of software that sits between the database and the reporting tool, and replaces the details of the database with meaningful business metrics. Dragging the labels "product revenue" and "order date" is meaningful; users of the data will understand it immediately. The metadata layer standardizes the metrics and the names used to describe them, and reduces (and hopefully eliminates) instances of "my data says this" and "my numbers show that". All users see the same information, consistently. Also, the metadata layer allows you to add descriptions to the metrics, or filters, or external data, all without adding complexity for the users.
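To make the idea concrete, here is a minimal sketch of what such a layer does behind the scenes: it maps business labels to SQL fragments and assembles the query for the user. It is not how Business Objects or any other product implements it. The `METRICS`, `DIMENSIONS` and `build_query` names are hypothetical, and so are the `orders` table and its `order_date` column.

```python
# Hypothetical metadata layer: business labels mapped to SQL fragments.
# Table and column names (orders, order_date) are assumptions for illustration.
METRICS = {
    "product revenue": "SUM(oi.quantity * oi.price)",
}
DIMENSIONS = {
    "order date": "o.order_date",
}
FROM_CLAUSE = "order_item oi JOIN orders o ON o.id = oi.order_id"


def build_query(metric_labels, dimension_labels):
    """Translate the labels a user dragged into a single SQL statement."""
    dims = [DIMENSIONS[label] for label in dimension_labels]
    select_parts = [f'{DIMENSIONS[label]} AS "{label}"' for label in dimension_labels]
    select_parts += [f'{METRICS[label]} AS "{label}"' for label in metric_labels]
    sql = f"SELECT {', '.join(select_parts)} FROM {FROM_CLAUSE}"
    if dims:
        # Metrics are aggregates, so every dimension goes into GROUP BY.
        sql += f" GROUP BY {', '.join(dims)}"
    return sql


print(build_query(["product revenue"], ["order date"]))
```

Because the revenue formula lives in one place, every report that uses the "product revenue" label computes it the same way.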

Another benefit of the metadata layer is that you and your team will also be using it, and it accelerates the creation of reports and analyses. What, you did not think all the reports would be created by business users from now on, did you? Users will now have a tool to access the data as it relates to their area of expertise, but for company-wide metrics you should retain control of report creation. The metadata layer will make this process easier, too. Plus, as most reporting tools on the market allow you to do, you will still have the ability to bend the metadata rules by writing SQL queries directly against the database, and display the results in reports, which the users of the graphical tool will most likely not be able to do. That's the combination of easier work and job security!

A business metadata layer is a good way of improving productivity of your business users and your own developers by hiding the complexities of the data. It can be used to do a lot of fancy things behind the scenes, but mostly will enable more people to gain access to the data in a meaningful way, and take ownership of the data most relevant to their area of expertise. If you are not currently using a reporting tool that includes a metadata layer, take the time to do some research as it can really help you.

Wednesday, April 11, 2012

ETL or ELT, That Is The Question

The ETL process is at the heart of the data warehouse, turning data from messy and scattered to organized and centralized. But my experience has been that ETL is slower than ELT. There is more to it than swapping two letters, so I will describe what that means and why it matters.

First, the basics: ETL stands for "Extract, Transform and Load". It is the process by which data is extracted from the source systems (database, file), transformed into the proper format and structure, and loaded into the data warehouse. Since you're reading this article, you probably know that already.

Can you guess what ELT stands for? Don't think too much, it means "Extract, Load and Transform". This swapping of letters has a few important implications. For one, it means the data will go from source to destination without being transformed, and transformation will take place in the database. Using database bulk loaders and native data formats, this results in very fast data movement. Transforming data in the database narrows the performance tuning and maintenance to a smaller set of tools, thus gaining efficiency and requiring fewer specialized skills. Also, it opens the door to keeping a copy of the source data in the reporting databases, in addition to the data warehouse, because data has already been copied; this may or may not be interesting for your needs, you get to decide.

Moving data from source to staging can be done in a number of ways: real-time replication, nightly batch extraction, intra-day batch, snap backup, etc. The process of acquiring the data depends a lot on volume; small volume works well in batch but large volume needs more care. It also depends on business requirements: if intraday data is needed, then multiple batches or real-time replication are necessary. At the end of the process, a day's (or partial day's) worth of data has been copied from source systems to a staging area, ready for processing.

The transformation part is where the magic happens: cleanup, lookup, rollup, aggregation, validation and error checking, this is where the data is prepared for the data warehouse. Because data is already in the database, lookups can be done using joins, efficient stored procedures can be written, indexing and other performance tuning techniques can be used. By comparison, a standard ETL flow would have the ETL tool iteratively run through transformations outside of the database, storing lookup data in memory, and write the resulting data to the data warehouse at the end.
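Here is a small sketch of that load-then-transform flow, using SQLite through Python only so that it can be run anywhere; a real warehouse would use its own bulk loader and stored procedures. All table and column names (`stg_order_item`, `dim_product`, `fact_order_item`) are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: a raw, untransformed copy of the source rows.
    CREATE TABLE stg_order_item (order_id INTEGER, product_code TEXT,
                                 quantity INTEGER, price REAL);
    -- Lookup table that already lives in the warehouse.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_code TEXT);
    -- Target table in the warehouse.
    CREATE TABLE fact_order_item (order_id INTEGER, product_key INTEGER,
                                  revenue REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "WIDGET"), (2, "GADGET")])

# Extract + Load: rows land in staging as-is (a bulk loader in practice).
source_rows = [(100, "WIDGET", 2, 9.5), (100, "GADGET", 1, 20.0),
               (101, "UNKNOWN", 1, 5.0)]
conn.executemany("INSERT INTO stg_order_item VALUES (?, ?, ?, ?)", source_rows)

# Transform: the lookup is a join and runs inside the database.
conn.execute("""
    INSERT INTO fact_order_item (order_id, product_key, revenue)
    SELECT s.order_id, p.product_key, s.quantity * s.price
    FROM stg_order_item s
    JOIN dim_product p ON p.product_code = s.product_code
""")

# Validation: staged rows whose lookup failed are reported for review.
rejected = conn.execute("""
    SELECT s.order_id, s.product_code
    FROM stg_order_item s
    LEFT JOIN dim_product p ON p.product_code = s.product_code
    WHERE p.product_key IS NULL
""").fetchall()

loaded = conn.execute(
    "SELECT order_id, product_key, revenue FROM fact_order_item ORDER BY product_key"
).fetchall()
print(loaded)
print(rejected)
```

The two SQL statements in the middle are the whole transformation step: nothing is pulled out of the database to be looked up in memory.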

If budget and staffing level allow it, the combination of mainstream ETL software and powerful hardware is certainly more flexible than the solution I am describing here. Maintaining SQL-based data transformation is tedious, and still requires a script to control the flow.

But when budgets are tight and compromises must be made, the ELT approach has been shown to work with multi-terabyte data sets very effectively, at the cost of more difficult maintenance. It's a trade-off that is acceptable in smaller organizations, but may not scale well as business gets larger.

As you evaluate an ETL tool for your project, or if you are faced with performance issues that cannot be resolved without using big hardware, consider looking into an E-L-T data flow. I have used it with more than 4TB of data, adding 30GB daily, and performance has been very good. The cost you pay is slower maintenance compared to a graphical interface, but this is easily alleviated by using strong programming standards. At the very least it will give you another option, and you may realize this is all you needed. Plus, if you can afford the additional disks needed to store a copy of the source data, you gain the additional flexibility of having both the data warehouse and source data in one place. Win-win!

Monday, April 2, 2012

Expanding Your Horizons

Horizontal partitioning is a great way of storing and retrieving data from very large tables by breaking them into smaller tables, providing both predictable performance and simple querying. Some vendors offer built-in tools to handle partitions, other vendors have plug-ins available. It can also be implemented as a simple set of tables with management scripts, as I will be describing.

Imagine a retail store or e-commerce site taking many orders, often with multiple line items for each order. This is a very common scenario. The database table that stores the line items quickly grows in size, and after a few months can reach several million rows in a moderately successful store. The web site or point-of-sale software should have no difficulty writing new rows to the database, but the large table slows down ETL and analytics queries. This is the scenario we are addressing.

Once the basic indexing and optimization techniques are no longer sufficient, a very large table becomes a good candidate for partitioning. In short, the large table will be replaced with a number of smaller tables called partitions, and the partitions will be reassembled using a view so it can be queried as a single table. Turning the large table into partitions is a one-time event, and from that point forward there needs to be some process that will maintain the partitions and the view as new data is added.

Let's take the example of table order_item:

order_item
------------
id
order_id
inventory_id
quantity
price
record_create_date
record_update_date

Let's say this table has grown to 75 million records after 2 years of operations. On the production database, the table may be truncated and the data archived, or treated in some other way that does not cause a performance issue for single record inserts and updates. But in the analytics database this table is now causing some slowness, and it is decided to partition it.

The partitions will be called order_item_001, order_item_002, order_item_003, etc. The size of each partition determines the total number of partitions. Let's pick a size of 20 million for this example. The order_item table will be broken into 4 partitions: three full partitions of 20 million records each, and a last one, the "current" partition, with the remaining 15 million records. It can be called order_item_004, or order_item_current, or anything that will allow identifying the last partition. One benefit of using the name order_item_current is that it will remain constant as new partitions are added, which makes inserts seamless for ETL.
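The arithmetic for this example can be checked with a few lines. The `plan_partitions` helper is a hypothetical name used only for this illustration.

```python
def plan_partitions(total_rows, partition_size):
    """Return (number of full partitions, rows left for the current partition)."""
    full, remainder = divmod(total_rows, partition_size)
    return full, remainder


full, current_rows = plan_partitions(75_000_000, 20_000_000)
# Numbered partitions first, then the constant name used for inserts.
names = [f"order_item_{n:03d}" for n in range(1, full + 1)] + ["order_item_current"]
print(names)
print(current_rows)
```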

In order to query the partitioned table, a view is used to reassemble the partitions, like so:

create view order_item
as
select *, partition = NULL from order_item_current
union all
select *, partition = 1 from order_item_001
union all
select *, partition = 2 from order_item_002
union all
select *, partition = 3 from order_item_003

The view is now taking the place of the table as far as existing SQL queries and scripts are concerned. For optimal performance, it is important that all partitions use the same indexes; this makes the query optimizer's job much easier and gets the best possible performance. The partition number is added in the view instead of being added to the partition table so that the table structure remains unchanged. Adding the partition number is necessary to identify which partition to update during ETL.

As new records are inserted and the count reaches 20 million in the current partition, it will be turned into the next numbered partition, order_item_004, the view adjusted accordingly, and new records will continue to be inserted in the now empty order_item_current table. Because the partitions are of equal size, analytic queries that return a range of data tend to take the same time to execute, and this eliminates the slowness caused by the growing table. In the case of retail transactions, reports are usually retrieving a quarter, a year, or last year plus current year, and the execution time becomes predictable despite data growth and seasonality.

Existing records cannot be updated through the view (a view built with UNION ALL is generally not updatable), which is why adding the partition number is important. A simple query that matches the keys and returns the partition number can be used to identify which partitions need updating, and the update can then be applied to only those partitions. A bit of dynamic SQL can be used here to craft queries on the fly.
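A rough sketch of that lookup-then-update step follows, again using SQLite through Python so it can be run as-is. This is one possible way to do it, not the exact script I use. The table is trimmed to two columns, the helper names `partition_table` and `update_quantities` are hypothetical, and the view column is named `partition_no` because PARTITION is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_item_001     (id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE order_item_current (id INTEGER PRIMARY KEY, quantity INTEGER);
    INSERT INTO order_item_001     VALUES (1, 1), (2, 1);
    INSERT INTO order_item_current VALUES (3, 1);
    CREATE VIEW order_item AS
        SELECT *, NULL AS partition_no FROM order_item_current
        UNION ALL
        SELECT *, 1 AS partition_no FROM order_item_001;
""")


def partition_table(partition_no):
    """Map a partition number from the view to its physical table name."""
    if partition_no is None:
        return "order_item_current"
    return f"order_item_{int(partition_no):03d}"


def update_quantities(conn, changes):
    """Apply changes (order_item id -> new quantity) partition by partition."""
    ids = list(changes)
    marks = ",".join("?" * len(ids))
    # Step 1: match the keys through the view to find the partitions involved.
    rows = conn.execute(
        f"SELECT id, partition_no FROM order_item WHERE id IN ({marks})", ids
    ).fetchall()
    by_table = {}
    for item_id, partition_no in rows:
        by_table.setdefault(partition_table(partition_no), []).append(item_id)
    # Step 2: dynamic SQL, one UPDATE statement per partition that has matches.
    for table, item_ids in by_table.items():
        conn.executemany(
            f"UPDATE {table} SET quantity = ? WHERE id = ?",
            [(changes[item_id], item_id) for item_id in item_ids],
        )
    return sorted(by_table)


touched = update_quantities(conn, {2: 5, 3: 7})
result = conn.execute("SELECT id, quantity FROM order_item ORDER BY id").fetchall()
print(touched)
print(result)
```

Only the partitions that actually hold matching keys are touched; the others are never written to.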

The key to successfully implementing horizontal partitioning is to carefully select the logical separation of the data. The end result needs to be a set of partitions that are similar in size, so that performance remains constant for similar size result sets. Also, data that is likely to be accessed together should remain in as few partitions as possible. Using time to separate the data sounds good at first, but is not a great approach because spikes like seasonality (Q4 retail season and other holidays for example) can result in partitions of very different sizes.

Instead, I have used rowcount for tables as large as 6 billion rows, with partition size ranging from 20 to 75 million, with very good and constant performance. A weekly management task is all that is needed to iterate through the partitioned tables:

determine if "current" partition has exceeded the threshold
if so
    create the next numbered partition
    adjust the view to include new partition
if not, move on to next table

This is a great example of minimal effort with maximum impact.
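The weekly task described above could look something like the sketch below. It uses SQLite through Python purely for illustration; in practice this would be a scheduled stored procedure in your own database. The function names and the tiny threshold are assumptions, and the view column is called `partition_no` because PARTITION is a keyword in SQLite.

```python
import sqlite3

THRESHOLD = 3  # tiny on purpose; real partitions hold 20 to 75 million rows


def numbered_partitions(conn, base):
    """Return the existing numbered partitions of a base table, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name GLOB ?",
        (f"{base}_[0-9][0-9][0-9]",),
    ).fetchall()
    return sorted(name for (name,) in rows)


def rebuild_view(conn, base):
    """Recreate the view: current partition UNION ALL every numbered partition."""
    parts = [f"SELECT *, NULL AS partition_no FROM {base}_current"]
    for name in numbered_partitions(conn, base):
        parts.append(f"SELECT *, {int(name[-3:])} AS partition_no FROM {name}")
    conn.execute(f"DROP VIEW IF EXISTS {base}")
    conn.execute(f"CREATE VIEW {base} AS " + " UNION ALL ".join(parts))


def rotate_if_full(conn, base, threshold=THRESHOLD):
    """Turn the current partition into the next numbered one once it is full."""
    current = f"{base}_current"
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {current}").fetchone()
    if count < threshold:
        return None  # not full yet: move on to the next table
    next_name = f"{base}_{len(numbered_partitions(conn, base)) + 1:03d}"
    conn.execute(f"DROP VIEW IF EXISTS {base}")
    conn.execute(f"ALTER TABLE {current} RENAME TO {next_name}")
    # Empty copy of the columns only; indexes would be recreated here as well.
    conn.execute(f"CREATE TABLE {current} AS SELECT * FROM {next_name} WHERE 0")
    rebuild_view(conn, base)
    return next_name


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_item_current (id INTEGER, quantity INTEGER)")
rebuild_view(conn, "order_item")
conn.executemany("INSERT INTO order_item_current VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 1)])

rotated = rotate_if_full(conn, "order_item")
conn.execute("INSERT INTO order_item_current VALUES (4, 1)")
rows = conn.execute("SELECT id, partition_no FROM order_item ORDER BY id").fetchall()
print(rotated)
print(rows)
```

After the rotation, new rows keep going into order_item_current, and queries against the view see both the old and the new data without any change.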

If your data warehouse is stored in a relational database, and you have the ability to write and schedule stored procedures, then you already have everything you need to implement horizontal partitioning of large tables. It is a great way of solving a common performance problem, is very cost effective for both implementation and maintenance, and will free you up for more pressing tasks, like keeping up with the incessant flow of changes in the source systems feeding your data warehouse.