Monday, March 26, 2012

Think Vertical

Columnar databases are all the rage for data warehousing and analytics systems, and there is a good reason for that: they are fast. There are several flavors and technologies available, and they all share the same column-wise organization of the data. Here's how it works.

As I discussed before, the main bottleneck of a data warehouse system is disk i/o. Writing and updating data in the database during ETL, retrieving data when refreshing reports, and exporting lists for mailings or invoice reconciliation, all of that requires disk i/o. The database server is constantly writing or retrieving data from storage. Disks are much slower than memory and cpu, and they become the pinch point. This is nothing new, it is common to all database systems, and to all computers for that matter. But it quickly becomes a problem for data warehousing because of the large data sets. An operational database usually inserts, updates or deletes a single record at a time; loading a fact table or refreshing a business report will routinely handle millions of records in a single query.

Columnar database engines directly address the disk i/o bottleneck by storing data differently, in columns instead of rows. It allows retrieval of a specific column for a large number of records, without having to read all the other columns. A traditional database engine reads all columns for the record, then throws away what is not needed; a columnar database only reads what is needed without wasting precious disk i/o.

This is all done under the hood and invisible to the database programmer, and standard SQL queries are used to manipulate the data. Let's look at the example of a query that calculates the total revenue by month for the state of California, running against this fictitious order table:

This table is not part of a star schema, as it does not have any keys, it is just used to illustrate the concept. The query might look like this:

Only 3 columns are involved in this query: order_date, state and item_price. A column-wise database engine can use an index on the state column (if present) to find the California orders, then retrieve the order_date and item_price to complete the query. The name and address fields are not retrieved by the database, which saves a lot of disk activity as these fields tend to be wide. By comparison, a traditional database engine will read the entire record and discard the unnecessary columns, thus wasting i/o. For a single record it is inconsequential, but for millions of records it makes a big difference on performance.

In addition to reading only the data needed, other techniques can be used to take advantage of the columnar organization, such as data compression and more efficient indexing. This is a relatively new technology and vendors are exploring the possibilities.

Organizing the data into facts and dimensions to reduce the joins and pre-calculate common metrics will have a major impact on performance on any database engine. Combined with a fast database, the star schema becomes a very powerful data warehousing tool. Columnar databases offer this fast performance, and do not require the powerful hardware of traditional databases in order to deliver good performance, thus keeping costs reasonable.

There are several vendors offering a columnar database. Some are big names, some are newcomers, some use open source technologies. I am a big fan of Sybase IQ, I have used it for the last 7 years. Performance is fantastic, hardware requirements are minimal, and maintenance is very easy.

(Before you ask: Sybase/SAP is not paying to place their product on my blog. I am not an employee of Sybase or SAP. I just like the product, it has been a great tool for me.)

If you are not currently using a columnar database, you should look into it. Compared to a traditional database, the gain in performance using commodity hardware makes this a great option for growing data warehouses.

Monday, March 19, 2012

Behind The Backoffice, Part 3: Backups and DR

In part 3 of this article, I will go over some simple backup hardware and disaster recovery techniques I have used in the past. You can read part 1 for a review of networking and servers, and part 2 for a discussion of storage hardware.

Getting ready for a disaster is not exactly a thrilling thing to do, it ranks up there with buying life insurance on the fun-o-meter. But unless you are prepared, a simple disk failure can take unexpected proportions. All the more reason to do it right the first time so you don't have to do it again, and make sure it works when disaster strikes. Which it will, sooner or later.

Disaster can take many forms: power outage, hard drive failure, network hiccup, or more serious such as lightning strike, fire and flooding. Most of the time, the event will be disruptive but will not cause any real or long-term problem. The disk array will nicely recover, the network packets will be retransmitted, and all is good.

But once in a while, the timing is just right and the event turns into something bad. In the case of a database server, the large number of disks becomes a weak spot, and data loss can occur. Or, a component can fail and the server becomes unusable. Those are the two instances most likely to affect a data warehousing platform, and there are simple steps that can be taken to prevent most of the drama.

The first line of defense is to have spare parts on hand. Disks are the first components to fail, and while manufacturers will warranty their product for anywhere from one to five years, it takes some time to process the replacement. Having a few extra disks of the right type and capacity will make for a quick replacement and reduce chances of data loss.

Moving on to data, the first thing that comes to mind is a backup. What to include in a backup depends entirely on the platform used, but should include the data plus any information needed to recreate the database server if it were to vanish, like system configurations and ETL scripts. Same for the reporting server, report definitions and report instances need to be backed up, as well as configurations and any information needed for rebuild.

Backup technology is constantly evolving, and takes advantage of improvements in other areas such as faster disks and networking. While tape was once the most common type of backup (names like DLT, Exabyte and LTO come to mind), this technology is all but extinct in the corporate world. Keeping multiple copies in weekly rotation, storing boxes of tapes at an offsite secure storage, lost and damaged media... for all the benefits of backup, tape was a lot of hassle.

These days, a backup server is essentially a very low power server attached to a disk storage system, or an all-inclusive disk appliance, connected to a fast network. An example is the QNAP storage appliance, which I have used with success for database and system backup. There are many vendors making similar devices with all sorts of features.

The main function of the backup server is to present disks over the network, which are accessed remotely. The gigabit network is leveraged here to make it possible to back up a large database in a reasonable time. Also, and this is important, the backup hardware should be in a different location from the database server. The backup will prevent data loss, but having the backup server in a different location will also prevent a serious disaster such as fire or lightning strike to destroy both systems.

Following that train of thought, another option for backing up data is to use an online backup service, there are many available (Carbonite, Mozy, CrashPlan, DropBox are common names). Some will sing the merits of their distributed data centers, some will talk about cloud storage, but in the end they are all the same thing: storage over the internet. This is a great option for smallish data sets, but difficult to implement with terabytes of data. Still, it is worth taking a look.

That was for the data. But what happens if the server has a component failure? The data is safely backed up, but the database server is unavailable. For the business, the result is the same: the data warehouse is unavailable, decisions cannot be made, money is being lost.

Again, simpler is better. If a week of downtime is acceptable to the business, then the easiest approach is to carefully document the server specs, all the components, all the configurations, all the database and application settings, and be ready to quickly order a replacement server and rebuild the database server. Depending on the skills available in the company, this can usually be done in a few days or a week.

Most organizations will not accept a week of downtime, and in those cases a spare server is a better choice. This can be done in several ways, limited by imagination and the specific needs of the company: a second server kept in storage, a development server that is promoted to production server while a replacement is ordered, a multi-node database server where the nodes can be reassigned to other tasks, an unrelated server that has similar specs and is identified as a suitable replacement if needed, etc. I have used most of these solutions at various times, they all worked. The most convenient was the development server promoted to production, since it was already online, already running the database, and required only a few settings to be changed in order to be active. Also, it makes it easier to budget for a spare server if that server does double-duty.

Backups and disaster recovery planning is not glamorous, but needs to be done. Never accept to work on a platform that does not have backups or is not ready to deal with events, that level of risk is too high. Taking simple steps goes a long way.

Sunday, March 18, 2012

Behind The Backoffice, Part 2: Storage

This article discusses the hardware typically used in data warehousing platforms. In part 2, I am covering storage hardware. Servers and networking were covered in part 1.

Let me start by making this statement: the quality of the storage hardware used in the data warehousing platform is a critical success factor of the project, just as important as careful design of the data model. Pretty bold, right? But it is true.

For a large scale data warehouse, counted in terabytes, disk i/o is the main bottleneck. Not the network, not the cpus, not the memory, but reads and writes to disk. Many technologies and configurations can be used to alleviate this problem, but i/o remains the bottleneck. It will slow down data processing, make queries run longer, and can make the entire analytics platform appear sluggish if not addressed. It is worth spending some of your budget on faster storage, and take the time to configure it for maximum performance.

If you have been working on smaller data warehouses in the past, chances are the storage consisted of a few disks inserted in the server. That's what I did. Some servers are larger and allow as many as 4 or 6 disks or more, so the amount of storage may be sufficient for small amounts of data. With terabyte and even 2-terabyte drives, this can represent a lot of storage space. As far as capacity goes, it may be all you need. "Six 2-terabyte drives, that's 12TB of available storage, I'll never need more than that!" Wrong, you will need more, and the problem is not capacity. The problem is i/o.

In order to maximize data throughput, what you need is a lot of spinning disks. Using a lot of disks to create a larger partition of data, the data is then broken in small chunks and written/read to multiple disks at the same time. This is faster than writing the same amount of data to a single disk, and this is how faster disk i/o is achieved. Storage hardware is built for that very specific purpose: maximize disk i/o by spreading data across many disks.

The actual implementation varies, but almost all vendors use some level of RAID technology. RAID was originally invented to reduce the risk of data loss caused by the failure of a disk, by spreading data over multiple disks, and it does a spectacular job of it. The side benefit that quickly emerged is faster disk access, and we take full advantage of it by using disk arrays attached to the server. Many disk arrays have 16 disks, some have a lot more. Management of the disks is done through some interface where disks are grouped in logical drives using RAID, then presented to the server. In all cases, there is also a spare drive that will be used by the disk array in case an active drive fails. The market of storage technology is very crowded, new vendors emerge all the time as new technologies are created. The price range for storage hardware varies greatly, from a few thousands to half a million dollars or more. Speed, reliability and capacity are the main drivers. Fortunately, you do not need to spend half a million dollars to have a good solution, newer-faster-better hardware is always available.

If you are not familiar with fibre channels, you should be. Fibre channels (often called FC) are fiber optic cables used to connect a server with storage equipment. They use light to transmit data, and they are fast, faster than a gigabit network. They work best over short distance, 100' or so, which is not a problem in a typical server room where all equipment is in racks. Most storage hardware vendors offer fibre channels on all but the lowest end of their products. The server must be equipped with a host-bus adapter (HBA) to connect the fibre channel, it can be added as an expansion card, and is sometimes built into the server back panel along with the network interface.

The price of fibre channel hardware (HBA, cables, switches) can be steep. A decent single-port HBA can cost $1000 or more, fibre switches easily reach $3000 for small models. This is where your IT department can really help select the best components and work with their preferred suppliers. A switch is only needed in more complex setups, where storage is shared, FC management is centralized, or the equipment is located in multiple server rooms. The HBA may be added to the server by the manufacturer, but it may not be the most cost-effective way to go.

In part 3 of this article, I will discuss the need for backup and disaster recovery.

Saturday, March 17, 2012

Behind The Backoffice, Part 1: Servers and Network

Let's talk about hardware.

Servers, storage, networking, disaster recovery, that kind of hardware. I know, I know, many data warehousing professionals would rather "let the IT guys take care of this part, they know best". Indeed, IT departments are specialists at identifying the best hardware for each situation. Well, most IT departments are.

But knowing the hardware behind your analytics platform is important. You need to know how much data you can throw at it, what improvements can be made when new technologies become available, etc. As your level of responsibility over a project increases, hardware will become part of the project planning, and being able to speak intelligently about servers and storage is important. There is much to be said about hardware, so I will make this a multi-part article.

The hardware used will vary from project to project, and will typically fit within the following categories: database server, database storage, ETL or data processing server, reporting/analytics server, networking, and backups. The backup servers can be a simple file server, a standby database server, a duplication of the production equipment, or any combination.

To get started, the easiest thing to do is to think about the flow of the data from its source all the way to the user screen. This approach works when designing an analytics platform, a data warehouse, and also when selecting the hardware.

The source of data is usually an OLTP database server. The data will be extracted, and moved to a data processing server for transformation; loaded in a database server; read by a reporting server to create or refresh a report; and displayed on the user computer. Some of those steps may be happening in a different order.

Moving data over a network is simple enough, but can be a challenge as volume increases. Having a gigabit network connection between all the servers will make things much smoother. Most servers available these days come equipped with one and often two gigabit interfaces, but not all company networks are fully gigabit enabled, you may need to negotiate with IT departments or spend some of your own budget to purchase and set up the missing gigabit network wiring and routers. It is not as glamorous as other parts of the business intelligence project, but well worth it.

The first server you are likely to look into is the database server. As is the case with all hardware, bigger and faster is better, but prices go up quickly. For simplicity, let's consider the main attributes of a server: cpu speed, cpu cores, amount of memory, internal disks. All other attributes, such as bus speed, rack-mounted vs. tower, expansion slots, redundant power supplies, etc. are secondary.

Because prices can go up quickly when choosing more powerful components, and because the licensing model of some database software vendors is based on the number of cpu cores, the selection of the database server is tied to the database software being used. This can be seen as a constraint, but I prefer to see it as an opportunity. Using a columnar database, for example, allows the use of a less powerful server, but it will take advantage of faster i/o, which is tied to storage. Spending less on the database server allows spending more on higher end storage. There is an upcoming article on columnar databases where I will discuss this in more detail.

I have had a lot of success with both Dell and Sun servers in the past. For my purposes, Dell servers were used with 32-bit Windows Server, and Sun servers were used with 64-bit Solaris. The most powerful Sun server had a single quad-core cpu at 2.8GHz, 32GB of memory, and about 250GB of internal disk storage. A typical Dell server had a single-core cpu at 3.6GHz, 4GB of memory and 73GB of disks. Compared to most other servers in the data center, these were low-end machines. Yet, by combining such a machine with a good database engine, good storage hardware, and with careful configuration, it becomes a very powerful analytics database server, capable of crunching through billions of rows of data in little time.

In part 2 of this article, I will discuss storage hardware.

Friday, March 16, 2012

What Is Data?

So, what is data, exactly?

From my perspective, data is any piece of information produced by a company, purchased from a vendor, or shared by a partner, that can be used to make a decision of any kind by anyone in the organization. It is very broad. For our purposes, we will only consider data recorded in electronic form, which represents the vast majority of data available to a business today.

Your company runs an e-commerce website? Then your list of products, customers, orders, promotions, the web traffic, affiliate websites, all of that is data. If those orders are fulfilled from a warehouse, the inventory of that warehouse is data. Those orders are shipped by mail or some other carrier, bam, more data.

Cash registers in a brick-and-mortar store produce a lot of detailed data about the content of the shopping cart. As far as data goes, the point-of-sale software is essentially doing the same thing as an e-commerce website. The manufacturing process of a plant produces data about materials, quality, duration, etc. Some products come with warranties, the issuance and claims on those warranties are another source of data. The accounting and finance functions of just about any company are major sources of data.

Some data is less clearly defined. For example, the content of a phone call in a customer service center is not recorded in electronic form, but the customer service agent typically uses a software application to capture information during the call, complete with categories, duration, contact information and more. This becomes another source of data.

Invoices from service providers such as phone companies are another source of data, which typically flow into the accounting process.

Knowing what data is available to an organization is an important success factor for data warehousing and business intelligence professionals. Making a piece of data from a different department available in the data warehouse can make a big difference in how useful the data warehouse will be.

For example, consider a data warehouse being created for a mid-size winery that wants to analyze its store sales. Wine making is complex, it combines many specialized skills. It would be normal to start the analysis of sales by looking at the financials, the marketing data, maybe the demographics. That's what I would do. The manufacturing process (the actual wine making) also plays a big role in the sales of wine, it determines fundamental attributes of the product, in this case the quality and price point. But there are so many areas of wine making, what piece of information is important? Does the type of barrel used to age the wine have an impact on the sales? Probably. What about the type of ink used to print the label? Maybe, it is hard to tell without spending some time looking into it.

My approach has always been that you can only analyze the data if you have it. In the case of the winery, I did include information about the ink in the data warehouse (it turns out the type of ink used to print the label has an impact on the perception of quality by the consumer, which varies by region and price point; also, in the US, there are laws controlling the labels, different for every state). Each piece of data on its own may or may not be useful, but putting this data together makes every data element more powerful.

And THAT is what a data warehouse is all about.

Thursday, March 15, 2012

Welcome to About The Data blog!

Welcome to this blog about data warehousing, data management, and business analytics.

My name is Eric Lagacé, I have been working in this industry for over 15 years. I have worked on many projects of varying scale, from week-long quick fixes to year-long major endeavors, and there have been many ideas along the way, both good and bad, and many stories.

In this blog, I will chronicle those ideas and tell those stories, sometimes revisiting solutions to problems, sometimes addressing current issues with ongoing projects, sometimes discussing improvements that can be made. My goal is to share information, possibly help others in their projects, and hopefully start interesting conversations.

Much of my work experience is in the retail industry, both online and offline, with some amount of high-tech, medical and manufacturing. I usually get involved in every aspect of the projects, from analyzing requirements with business users, to designing and implementing the data warehouse, selecting the hardware and software, and ensuring successful long-term operation. I plan on touching on all of these topics.

If that sounds interesting, keep an eye on this page, and let's get the conversation started.