About The Data

Friday, May 31, 2013

Sybase IQ 16 Sets A New Data Load Record!

If you read this blog, you probably know I am a big fan of columnar databases for analytics, and specifically of Sybase IQ. Using very small commodity hardware, it provides amazing performance on several TB of data, and maintenance is minimal, which is a big deal in small and medium organizations.

This press release just came out from SAP with some impressive data load performance using Sybase IQ 16: 34.3TB per hour! That is pretty fast by any standard, it takes full advantage of the revamped parallel loading feature, and makes me confident I will be recommending IQ on future projects.

See the press release here:

http://scn.sap.com/community/sybase-iq/blog/2013/05/30/sap-sybase-iq-16-sets-guinness-world-record-with-343tb-per-hour-for-big-data-loading

Wednesday, April 18, 2012

What In The Universe?...

This is a discussion about giving business users the ability to create their own reports, and the benefits of using a metadata layer in your business intelligence platform. Business Objects calls this a Universe, other software have other names for it.

So you built a great data warehousing system, data is flowing in and is well organized, and now you are thinking about delivering this data to the business users. (If you waited until the data warehouse was built to think about information delivery, it is a bit late, this should have come up at the very beginning. Moving on.)

You write reports using SQL queries, and you deliver those reports by email, or maybe have the reports published on a web portal. And you do this for all reports, and those reports are very useful, and the report requests keep coming in. More users from more departments start asking for information, and soon you become a specialist of all the data, all the metrics, all the partnerships, all the details and exceptions of every aspect of the company. You also become the bottleneck, as you cannot keep up with the requests, and business users become impatient.

Sounds familiar? Don't worry, it is quite common, the success of a business intelligence platform can be overwhelming at first.

One way to address this situation is to hire more people to write more reports. It works for a while, but it only delays the problem, and actually makes it worse because it allows the concentration of a lot of business knowledge within a small group of people, who are not the ones making decisions in all those areas. It is inevitable that the business intelligence professionals within your organisation will acquire a lot of business knowledge, but this knowledge needs to be distributed in the respective departments.

A better way to reduce the bottleneck, and at the same time limit the concentration of knowledge, is to give the users the ability to create their own reports, and train them on how to do it effectively. Read this again: give ability to create reports, and train users. If the users are not trained, they will not know how do it. This is just common sense.

So you install a database query tool on the computer of the marketing manager, product development manager, and VP of finance, and they start writing SQL queries against the data warehouse, and you have solved the problem. Right? Wrong. Writing SQL is your job. Your business users need a different tool, something more meaningful than data types, codes, joins and group bys. Your users need a graphical interface.

Most reporting software have a built-in graphical tool to write queries. Business Objects, Microstrategy, Tableau, Cognos, they all offer a way of writing queries using a drag-and-drop interface. But that still requires technical knowledge of the underlying database, with all the complexities and exceptions.

This is where the metadata layer comes in. In short, the metadata layer (also called the business layer, or semantic layer) is a piece of software that sits between the database and the reporting tool, and replaces the details of the database with meaningful business metrics. Dragging the labels "product revenue" and "order date" is meaningful, users of the data will understand it immediately. It standardizes the metrics and names used to describe them, and reduces (and hopefully eliminates) instances of "my data says this" and "my numbers show that". All users see the same information, consistently. Also, the metadata layer allows you to add descriptions to the metrics, or filters, or external data, all without adding complexity for the users.

Another benefit of the metadata layer is that you and your team will also be using it, and it accelerates the creation of reports and analyzes. What, you did not think all the reports would be created by business users from now on, did you? Users will now have a tool to access the data as it relates to their area of expertise, but for company-wide metrics you should retain control of report creation. The metadata layer will make this process easier, too. Plus, as most reporting tools on the market allow you to do, you will still have the ability to bend the metadata rules by writing SQL queries directly against the database, and display it in reports, which the users of the graphical tool will most likely not be able to do. That's the combination of easier work and job security!

A business metadata layer is a good way of improving productivity of your business users and your own developers by hiding the complexities of the data. It can be used to do a lot of fancy things behind the scenes, but mostly will enable more people to gain access to the data in a meaningful way, and take ownership of the data most relevant to their area of expertise. If you are not currently using a reporting tool that includes a metadata layer, take the time to do some research as it can really help you.

Wednesday, April 11, 2012

ETL or ELT, That Is The Question

The ETL process is at the heart of the data warehouse, turning data from messy and scattered to organized and centralized. But my experience has been that ETL is slower than ELT. Besides swapping two letters, I will describe what that means and why it matters.

First, the basics: ETL stands for "Extract, Transform and Load". It is the process by which data is extracted from the source systems (database, file), transformed into the proper format and structure, and loaded into the data warehouse. Since you're reading this article, you probably know that already.

Can you guess what ELT stands for? Don't think too much, it means "Extract, Load and Transform". This swapping of letters has a few important implications. For one, it means the data will go from source to destination without being transformed, and transformation will take place in the database. Using database bulk loaders and native data formats, this results in very fast data movement. Transforming data in the database narrows the performance tuning and maintenance to a smaller set of tools, thus gaining efficiency and requiring fewer specialized skills. Also, it opens the door to keeping a copy of the source data in the reporting databases, in addition to the data warehouse, because data has already been copied; this may or may not be interesting for your needs, you get to decide.

Moving data from source to staging can be done in a number of ways: real-time replication, nightly batch extraction, intra-day batch, snap backup, etc. The process of acquiring the data depends a lot on volume; small volume works well in batch but large volume needs more care. It also depends on business requirements, if intraday data is needed then multiple batches or real-time replication are necessary. At the end of the process, a day's (or partial day) worth of data has been copied from source systems to a staging area, ready for processing.

The transformation part is where the magic happens: cleanup, lookup, rollup, aggregation, validation and error checking, this is where the data is prepared for the data warehouse. Because data is already in the database, lookups can be done using joins, efficient stored procedures can be written, indexing and other performance tuning techniques can be used. By comparison, a standard ETL flow would have the ETL tool iteratively run through transformations outside of the database, storing lookup data in memory, and write the resulting data to the data warehouse at the end.

If budget and staffing level allow it, the combination of mainstream ETL software and powerful hardware is certainly more flexible than the solution I am describing here. Maintaining SQL-based data transformation is tedious, and still requires a script to control the flow.

But when budgets are tight and compromises must be made, the ELT approach has been shown to work with multi-terabyte data sets very effectively, at the cost of more difficult maintenance. It's a trade-off that is acceptable in smaller organizations, but may not scale well as business gets larger.

As you evaluate an ETL tool for your project, or if you are faced with performance issues that cannot be resolved without using big hardware, consider looking into an E-L-T data flow. I have used it with more than 4TB of data, adding 30GB daily, and performance has been very good. The cost you pay is slower maintenance compared to a graphical interface, but this is easily alleviated by using strong programming standards. At the very least it will give you another option, and you may realize this is all you needed. Plus, if you can afford the additional disks needed to store a copy of the source data, you gain the additional flexibility of having both the data warehouse and source data in one place. Win-win!

Monday, April 2, 2012

Expanding Your Horizons

Horizontal partitioning is a great way of storing and retrieving data from very large tables by breaking them into smaller tables, providing both predictable performance and simple querying. Some vendors offer built-in tools to handle partitions, other vendors have plug-ins available. It can also be implemented as a simple set of tables with management scripts, as I will be describing.

Imagine a retail store or e-commerce site taking many orders, often with multiple line items for each order. This is a very common scenario. The database table that stores the line items quickly grows in size, and after a few months can reach several million rows in a moderately successful store. The web site or point-of-sale software should have no difficulty writing new rows to the database, but the large table slows down ETL and analytics queries. This is the scenario we are addressing.

Once the basic indexing and optimization techniques are no longer sufficient, a very large table becomes a good candidate for partitioning. In short, the large table will be replaced with a number of smaller tables called partitions, and the partitions will be reassembled using a view so it can be queried as a single table. Turning the large table into partitions is a one-time event, and from that point forward there needs to be some process that will maintain the partitions and the view as new data is added.

Let's take the example of table order_item:

order_item
------------
id
order_id
inventory_id
quantity
price

record_create_date

record_update_date

Let's say this table has grown to 75 million records after 2 years of operations. On the production database, the table may be truncated and the data archived, or treated in some other way that does not cause a performance issue for single record inserts and updates. But in the analytics database this table is now causing some slowness, and it is decided to partition it.

The partitions will be called order_item_001, order_item_002, order_item_003, etc. The size of each partition determines the total number of partitions. Let's pick a size of 20 million for this example. The order_item table will be broken into 4 partitions, the last one being the "current" partitions with 15 million records. It can be called order_item_004, or order_item_current, or anything that will allow identifying the last partition. One benefit of using the name order_item_current is that it will remain constant as new partitions are added, which makes inserts seamless for ETL.

In order to query the partitioned table, a view is used to reassemble the partitions, like so:

create view order_item
as
select *, partition = NULL from order_item_current
union all
select *, partition = 1 from order_item_001
union all
select *, partition = 2 from order_item_002
union all
select *, partition = 3 from order_item_003

The view is now taking the place of the table as far as existing SQL queries and scripts are concerned. For optimal performance, it is important that all partitions use the same indexes, it makes the query optimizer's job much easier and gets the best possible performance. The partition number is added in the view instead of being added to the partition table so that the table structure remains unchanged. Adding the partition number is necessary to identify which partition to update during ETL.

As new records are inserted and the count reaches 20 million in the current partition, it will be turned into the next numbered partition, order_item_004, the view adjusted accordingly, and new records will continue to be inserted in the now empty order_item_current table. Because the partitions are of equal size, analytic queries that return a range of data tend to take the same time to execute, and this eliminates the slowness caused by the growing table. In the case of retail transactions, reports are usually retrieving a quarter, a year, or last year plus current year, and the execution time becomes predictable despite data growth and seasonality.

Updating of existing records cannot be done through the view (SQL does not allow it), which is why adding the partition number is important. A simple query that matches the keys and returns the partition number can be used to identify which partitions need updating, and the update can then be applied to only those partitions. A bit of dynamic SQL can be used here to craft queries on the fly.

The key to successfully implementing horizontal partitioning is to carefully select the logical separation of the data. The end result needs to be a set of partitions that are similar in size, so that performance remains constant for similar size result sets. Also, data that is likely to be accessed together should remain in as few partitions as possible. Using time to separate the data sounds good at first, but is not a great approach because spikes like seasonality (Q4 retail season and other holidays for example) can result of partitions of very different size.

Instead, I have used rowcount for tables as large as 6 billion rows, with partition size ranging from 20 to 75 million, with very good and constant performance. A weekly management task is all that is needed to iterate through the partitioned tables:

determine if "current" partition has exceeded the threshold
if so
create the next numbered partition
adjust the view to include new partition
if not, move on to next table

This is a great example of minimal effort with maximum impact.

If your data warehouse is stored in a relational database, and you have the ability to write and schedule stored procedures, then you already have everything you need to implement horizontal partitioning of large tables. It is a great way of solving a common performance problem, is very cost effective for both implementation and maintenance, and will free you up for more pressing tasks, like keeping up with the incessant flow of changes in the source systems feeding your data warehouse.

Monday, March 26, 2012

Think Vertical

Columnar databases are all the rage for data warehousing and analytics systems, and there is a good reason for that: they are fast. There are several flavors and technologies available, and they all share the same column-wise organization of the data. Here's how it works.

As I discussed before, the main bottleneck of a data warehouse system is disk i/o. Writing and updating data in the database during ETL, retrieving data when refreshing reports, and exporting lists for mailings or invoice reconciliation, all of that requires disk i/o. The database server is constantly writing or retrieving data from storage. Disks are much slower than memory and cpu, and they become the pinch point. This is nothing new, it is common to all database systems, and to all computers for that matter. But it quickly becomes a problem for data warehousing because of the large data sets. An operational database usually inserts, updates or deletes a single record at a time; loading a fact table or refreshing a business report will routinely handle millions of records in a single query.

Columnar database engines directly address the disk i/o bottleneck by storing data differently, in columns instead of rows. It allows retrieval of a specific column for a large number of records, without having to read all the other columns. A traditional database engine reads all columns for the record, then throws away what is not needed; a columnar database only reads what is needed without wasting precious disk i/o.

This is all done under the hood and invisible to the database programmer, and standard SQL queries are used to manipulate the data. Let's look at the example of a query that calculates the total revenue by month for the state of California, running against this fictitious order table:

This table is not part of a star schema, as it does not have any keys, it is just used to illustrate the concept. The query might look like this:

Only 3 columns are involved in this query: order_date, state and item_price. A column-wise database engine can use an index on the state column (if present) to find the California orders, then retrieve the order_date and item_price to complete the query. The name and address fields are not retrieved by the database, which saves a lot of disk activity as these fields tend to be wide. By comparison, a traditional database engine will read the entire record and discard the unnecessary columns, thus wasting i/o. For a single record it is inconsequential, but for millions of records it makes a big difference on performance.

In addition to reading only the data needed, other techniques can be used to take advantage of the columnar organization, such as data compression and more efficient indexing. This is a relatively new technology and vendors are exploring the possibilities.

Organizing the data into facts and dimensions to reduce the joins and pre-calculate common metrics will have a major impact on performance on any database engine. Combined with a fast database, the star schema becomes a very powerful data warehousing tool. Columnar databases offer this fast performance, and do not require the powerful hardware of traditional databases in order to deliver good performance, thus keeping costs reasonable.

There are several vendors offering a columnar database. Some are big names, some are newcomers, some use open source technologies. I am a big fan of Sybase IQ, I have used it for the last 7 years. Performance is fantastic, hardware requirements are minimal, and maintenance is very easy.

(Before you ask: Sybase/SAP is not paying to place their product on my blog. I am not an employee of Sybase or SAP. I just like the product, it has been a great tool for me.)

If you are not currently using a columnar database, you should look into it. Compared to a traditional database, the gain in performance using commodity hardware makes this a great option for growing data warehouses.

Monday, March 19, 2012

Behind The Backoffice, Part 3: Backups and DR

In part 3 of this article, I will go over some simple backup hardware and disaster recovery techniques I have used in the past. You can read part 1 for a review of networking and servers, and part 2 for a discussion of storage hardware.

Getting ready for a disaster is not exactly a thrilling thing to do, it ranks up there with buying life insurance on the fun-o-meter. But unless you are prepared, a simple disk failure can take unexpected proportions. All the more reason to do it right the first time so you don't have to do it again, and make sure it works when disaster strikes. Which it will, sooner or later.

Disaster can take many forms: power outage, hard drive failure, network hiccup, or more serious such as lightning strike, fire and flooding. Most of the time, the event will be disruptive but will not cause any real or long-term problem. The disk array will nicely recover, the network packets will be retransmitted, and all is good.

But once in a while, the timing is just right and the event turns into something bad. In the case of a database server, the large number of disks becomes a weak spot, and data loss can occur. Or, a component can fail and the server becomes unusable. Those are the two instances most likely to affect a data warehousing platform, and there are simple steps that can be taken to prevent most of the drama.

The first line of defense is to have spare parts on hand. Disks are the first components to fail, and while manufacturers will warranty their product for anywhere from one to five years, it takes some time to process the replacement. Having a few extra disks of the right type and capacity will make for a quick replacement and reduce chances of data loss.

Moving on to data, the first thing that comes to mind is a backup. What to include in a backup depends entirely on the platform used, but should include the data plus any information needed to recreate the database server if it were to vanish, like system configurations and ETL scripts. Same for the reporting server, report definitions and report instances need to be backed up, as well as configurations and any information needed for rebuild.

Backup technology is constantly evolving, and takes advantage of improvements in other areas such as faster disks and networking. While tape was once the most common type of backup (names like DLT, Exabyte and LTO come to mind), this technology is all but extinct in the corporate world. Keeping multiple copies in weekly rotation, storing boxes of tapes at an offsite secure storage, lost and damaged media... for all the benefits of backup, tape was a lot of hassle.

These days, a backup server is essentially a very low power server attached to a disk storage system, or an all-inclusive disk appliance, connected to a fast network. An example is the QNAP storage appliance, which I have used with success for database and system backup. There are many vendors making similar devices with all sorts of features.

The main function of the backup server is to present disks over the network, which are accessed remotely. The gigabit network is leveraged here to make it possible to back up a large database in a reasonable time. Also, and this is important, the backup hardware should be in a different location from the database server. The backup will prevent data loss, but having the backup server in a different location will also prevent a serious disaster such as fire or lightning strike to destroy both systems.

Following that train of thought, another option for backing up data is to use an online backup service, there are many available (Carbonite, Mozy, CrashPlan, DropBox are common names). Some will sing the merits of their distributed data centers, some will talk about cloud storage, but in the end they are all the same thing: storage over the internet. This is a great option for smallish data sets, but difficult to implement with terabytes of data. Still, it is worth taking a look.

That was for the data. But what happens if the server has a component failure? The data is safely backed up, but the database server is unavailable. For the business, the result is the same: the data warehouse is unavailable, decisions cannot be made, money is being lost.

Again, simpler is better. If a week of downtime is acceptable to the business, then the easiest approach is to carefully document the server specs, all the components, all the configurations, all the database and application settings, and be ready to quickly order a replacement server and rebuild the database server. Depending on the skills available in the company, this can usually be done in a few days or a week.

Most organizations will not accept a week of downtime, and in those cases a spare server is a better choice. This can be done in several ways, limited by imagination and the specific needs of the company: a second server kept in storage, a development server that is promoted to production server while a replacement is ordered, a multi-node database server where the nodes can be reassigned to other tasks, an unrelated server that has similar specs and is identified as a suitable replacement if needed, etc. I have used most of these solutions at various times, they all worked. The most convenient was the development server promoted to production, since it was already online, already running the database, and required only a few settings to be changed in order to be active. Also, it makes it easier to budget for a spare server if that server does double-duty.

Backups and disaster recovery planning is not glamorous, but needs to be done. Never accept to work on a platform that does not have backups or is not ready to deal with events, that level of risk is too high. Taking simple steps goes a long way.

Sunday, March 18, 2012

Behind The Backoffice, Part 2: Storage

This article discusses the hardware typically used in data warehousing platforms. In part 2, I am covering storage hardware. Servers and networking were covered in part 1.

Let me start by making this statement: the quality of the storage hardware used in the data warehousing platform is a critical success factor of the project, just as important as careful design of the data model. Pretty bold, right? But it is true.

For a large scale data warehouse, counted in terabytes, disk i/o is the main bottleneck. Not the network, not the cpus, not the memory, but reads and writes to disk. Many technologies and configurations can be used to alleviate this problem, but i/o remains the bottleneck. It will slow down data processing, make queries run longer, and can make the entire analytics platform appear sluggish if not addressed. It is worth spending some of your budget on faster storage, and take the time to configure it for maximum performance.

If you have been working on smaller data warehouses in the past, chances are the storage consisted of a few disks inserted in the server. That's what I did. Some servers are larger and allow as many as 4 or 6 disks or more, so the amount of storage may be sufficient for small amounts of data. With terabyte and even 2-terabyte drives, this can represent a lot of storage space. As far as capacity goes, it may be all you need. "Six 2-terabyte drives, that's 12TB of available storage, I'll never need more than that!" Wrong, you will need more, and the problem is not capacity. The problem is i/o.

In order to maximize data throughput, what you need is a lot of spinning disks. Using a lot of disks to create a larger partition of data, the data is then broken in small chunks and written/read to multiple disks at the same time. This is faster than writing the same amount of data to a single disk, and this is how faster disk i/o is achieved. Storage hardware is built for that very specific purpose: maximize disk i/o by spreading data across many disks.

The actual implementation varies, but almost all vendors use some level of RAID technology. RAID was originally invented to reduce the risk of data loss caused by the failure of a disk, by spreading data over multiple disks, and it does a spectacular job of it. The side benefit that quickly emerged is faster disk access, and we take full advantage of it by using disk arrays attached to the server. Many disk arrays have 16 disks, some have a lot more. Management of the disks is done through some interface where disks are grouped in logical drives using RAID, then presented to the server. In all cases, there is also a spare drive that will be used by the disk array in case an active drive fails. The market of storage technology is very crowded, new vendors emerge all the time as new technologies are created. The price range for storage hardware varies greatly, from a few thousands to half a million dollars or more. Speed, reliability and capacity are the main drivers. Fortunately, you do not need to spend half a million dollars to have a good solution, newer-faster-better hardware is always available.

If you are not familiar with fibre channels, you should be. Fibre channels (often called FC) are fiber optic cables used to connect a server with storage equipment. They use light to transmit data, and they are fast, faster than a gigabit network. They work best over short distance, 100' or so, which is not a problem in a typical server room where all equipment is in racks. Most storage hardware vendors offer fibre channels on all but the lowest end of their products. The server must be equipped with a host-bus adapter (HBA) to connect the fibre channel, it can be added as an expansion card, and is sometimes built into the server back panel along with the network interface.

The price of fibre channel hardware (HBA, cables, switches) can be steep. A decent single-port HBA can cost $1000 or more, fibre switches easily reach $3000 for small models. This is where your IT department can really help select the best components and work with their preferred suppliers. A switch is only needed in more complex setups, where storage is shared, FC management is centralized, or the equipment is located in multiple server rooms. The HBA may be added to the server by the manufacturer, but it may not be the most cost-effective way to go.

In part 3 of this article, I will discuss the need for backup and disaster recovery.