Monday, March 26, 2012

Think Vertical

Columnar databases are all the rage for data warehousing and analytics systems, and there is a good reason for that: they are fast. There are several flavors and technologies available, and they all share the same column-wise organization of the data. Here's how it works.

As I discussed before, the main bottleneck of a data warehouse system is disk I/O. Writing and updating data during ETL, retrieving data when refreshing reports, and exporting lists for mailings or invoice reconciliation all require disk I/O; the database server is constantly writing data to storage or reading it back. Disks are much slower than memory and CPU, so they become the pinch point. This is nothing new; it is common to all database systems, and to all computers for that matter. But it quickly becomes a problem for data warehousing because of the large data sets. An operational database usually inserts, updates or deletes a single record at a time; loading a fact table or refreshing a business report will routinely handle millions of records in a single query.

Columnar database engines address the disk I/O bottleneck directly by storing data differently: in columns instead of rows. This allows the engine to retrieve a specific column for a large number of records without having to read all the other columns. A traditional database engine reads every column of each record and then throws away what is not needed; a columnar database reads only what is needed, without wasting precious disk I/O.
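To make that concrete, here is a rough sketch (not specific to any product, with made-up values) of how the same two orders might sit on disk under each model:

Row store:
  [Jane Doe | 12 Main St, Los Angeles | CA | 2012-03-01 | 19.99]
  [John Smith | 8 Elm St, Albany | NY | 2012-03-02 | 5.49]

Column store:
  name:       Jane Doe, John Smith
  address:    12 Main St Los Angeles, 8 Elm St Albany
  state:      CA, NY
  order_date: 2012-03-01, 2012-03-02
  item_price: 19.99, 5.49

Scanning just the state and item_price columns in the second layout means the wide name and address values are never touched, instead of being read past in every single record.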

This is all done under the hood, invisible to the database programmer, and standard SQL queries are used to manipulate the data. Let's look at an example: a query that calculates the total revenue by month for the state of California, running against this fictitious order table:
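The table is purely illustrative; a plausible layout, using only the columns referenced below plus the wide name and address fields, might look like this (assuming the table is called orders):

create table orders (
    name        varchar(100),
    address     varchar(200),
    state       char(2),
    order_date  date,
    item_price  decimal(10,2)
);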



This table is not part of a star schema, as it does not have any keys; it is just used to illustrate the concept. The query might look like this:
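A sketch of that query (the exact date functions vary from one product to another, and the orders table name matches the sketch above) might be:

select
    year(order_date)  as order_year,
    month(order_date) as order_month,
    sum(item_price)   as total_revenue
from orders
where state = 'CA'
group by year(order_date), month(order_date)
order by order_year, order_month;

The where clause filters on state, the group by buckets the matching orders by month, and the sum aggregates item_price into the revenue figure.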


Only three columns are involved in this query: order_date, state and item_price. A column-wise database engine can use an index on the state column (if present) to find the California orders, then retrieve order_date and item_price to complete the query. The name and address fields are never read from disk, which saves a lot of disk activity since these fields tend to be wide. By comparison, a traditional database engine reads the entire record and discards the unnecessary columns, wasting I/O. For a single record this is inconsequential, but for millions of records it makes a big difference in performance.

In addition to reading only the data needed, other techniques can take advantage of the columnar organization, such as data compression and more efficient indexing. Because each column holds values of a single type, often with many repeats, it compresses very well; a state column containing millions of 'CA' values, for example, can be stored far more compactly than the same values scattered across full records. This is still a relatively new technology, and vendors are exploring the possibilities.

Organizing the data into facts and dimensions to reduce joins and pre-calculate common metrics will have a major impact on performance with any database engine. Combined with a fast engine, the star schema becomes a very powerful data warehousing tool. Columnar databases provide that speed, and they do not require the powerful hardware of traditional databases to deliver it, which keeps costs reasonable.

Several vendors offer columnar databases. Some are big names, some are newcomers, and some use open source technologies. I am a big fan of Sybase IQ, which I have used for the last 7 years: performance is fantastic, hardware requirements are minimal, and maintenance is very easy.

(Before you ask: Sybase/SAP is not paying to place their product on my blog. I am not an employee of Sybase or SAP. I just like the product; it has been a great tool for me.)

If you are not currently using a columnar database, you should look into it. Compared to a traditional database, the performance gained on commodity hardware makes it a great option for growing data warehouses.
