Sunday, March 18, 2012

Behind The Backoffice, Part 2: Storage

This article discusses the hardware typically used in data warehousing platforms. In part 2, I cover storage hardware; servers and networking were covered in part 1.

Let me start by making this statement: the quality of the storage hardware used in the data warehousing platform is a critical success factor for the project, just as important as careful design of the data model. Pretty bold, right? But it is true.

For a large-scale data warehouse, measured in terabytes, disk I/O is the main bottleneck. Not the network, not the CPUs, not the memory, but reads and writes to disk. Many technologies and configurations can be used to alleviate the problem, but I/O remains the bottleneck. It will slow down data processing, make queries run longer, and can make the entire analytics platform feel sluggish if it is not addressed. It is worth spending some of your budget on faster storage, and taking the time to configure it for maximum performance.
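If you want to see for yourself how much throughput your current storage delivers, a quick and dirty check is enough to start the conversation. Here is a minimal Python sketch, assuming a large file already sits on the warehouse volume (the path is hypothetical); the file should be bigger than the server's RAM, or the operating system cache will hide the disks and the numbers will look far too good.

```python
import time

def measure_sequential_read(path, block_size=8 * 1024 * 1024):
    """Time a sequential read of an existing large file and report MB/s."""
    total_bytes = 0
    start = time.time()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.time() - start
    mb = total_bytes / (1024 * 1024)
    print(f"Read {mb:.0f} MB in {elapsed:.1f} s -> {mb / elapsed:.0f} MB/s")

# Hypothetical example: point it at a large extract file on the warehouse volume.
# measure_sequential_read("/data/staging/large_extract.dat")
```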

If you have been working on smaller data warehouses in the past, chances are the storage consisted of a few disks inserted in the server. That's what I did. Larger servers accept four, six, or more disks, so the amount of storage may be sufficient for small amounts of data. With 1-terabyte and even 2-terabyte drives, this can represent a lot of storage space. As far as capacity goes, it may be all you need. "Six 2-terabyte drives, that's 12TB of available storage, I'll never need more than that!" Wrong, you will need more, and the problem is not capacity. The problem is I/O.
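To make the point concrete, here is a back-of-the-envelope calculation in Python. The 100 MB/s per-drive figure is my assumption for a single SATA drive of that era, the 2 TB scan is a made-up workload, and the scaling is idealized (no RAID or controller overhead), so treat the output as rough orders of magnitude only.

```python
# Rough scan times: capacity is not the issue, throughput is.
DATA_TB = 2                      # assumed size of a full table scan, in TB
PER_DRIVE_MB_S = 100             # assumed sequential throughput of one drive

def scan_hours(data_tb, drives, per_drive_mb_s=PER_DRIVE_MB_S):
    data_mb = data_tb * 1024 * 1024
    return data_mb / (drives * per_drive_mb_s) / 3600

for drives in (1, 6, 16, 48):
    print(f"{drives:>2} drives: {scan_hours(DATA_TB, drives):5.1f} hours")
# 1 drive ~5.8 h, 6 drives ~1.0 h, 16 drives ~0.4 h, 48 drives ~0.1 h
```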

In order to maximize data throughput, what you need is a lot of spinning disks. Many disks are grouped into one large volume, and the data is broken into small chunks that are written to and read from multiple disks at the same time. This is faster than writing the same amount of data to a single disk, and it is how faster disk I/O is achieved. Storage hardware is built for that very specific purpose: maximizing disk I/O by spreading data across many disks.
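As a toy illustration of the idea (not how any real controller is implemented), the Python sketch below splits a buffer into fixed-size chunks and deals them round-robin across a set of "disks". The 64 KB stripe unit and the six-disk count are arbitrary assumptions; real arrays do this in the controller, in hardware.

```python
# Toy illustration of striping: chunks are assigned round-robin across disks,
# so a large read or write can use all spindles at once.
CHUNK_SIZE = 64 * 1024           # assumed 64 KB stripe unit
NUM_DISKS = 6

def stripe(data, num_disks=NUM_DISKS, chunk_size=CHUNK_SIZE):
    """Split data into chunks and assign each chunk to a disk, round-robin."""
    disks = [[] for _ in range(num_disks)]
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        disks[(i // chunk_size) % num_disks].append(chunk)
    return disks

data = b"x" * (10 * 1024 * 1024)             # 10 MB of dummy data
layout = stripe(data)
for d, chunks in enumerate(layout):
    print(f"disk {d}: {len(chunks)} chunks, {sum(len(c) for c in chunks)} bytes")
```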

The actual implementation varies, but almost all vendors use some level of RAID technology. RAID was originally invented to reduce the risk of data loss caused by the failure of a disk, by spreading data redundantly over multiple disks, and it does a spectacular job of it. The side benefit that quickly emerged is faster disk access, and we take full advantage of it by using disk arrays attached to the server. Many disk arrays hold 16 disks; some hold a lot more. The disks are managed through an interface where they are grouped into logical drives using RAID, then presented to the server. Most configurations also include a hot spare drive that the array puts into service if an active drive fails.

The market for storage technology is very crowded; new vendors emerge all the time as new technologies are created. The price range for storage hardware varies greatly, from a few thousand dollars to half a million or more. Speed, reliability, and capacity are the main drivers. Fortunately, you do not need to spend half a million dollars to get a good solution; newer, faster, better hardware is always available.
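When comparing configurations, it also helps to know roughly how much usable capacity each RAID level leaves you. The sketch below is a simplified calculator, assuming a 16-bay array of 2 TB drives with one hot spare; the formulas are the standard textbook ones (striping, parity, mirroring), and real arrays lose a bit more to formatting and metadata.

```python
# Rough usable-capacity figures for common RAID levels in an assumed 16-bay
# array of 2 TB drives with one hot spare.
def usable_tb(total_drives, drive_tb, raid_level, hot_spares=1):
    active = total_drives - hot_spares
    if raid_level == 0:                    # striping only, no redundancy
        return active * drive_tb
    if raid_level == 5:                    # one drive's worth of parity
        return (active - 1) * drive_tb
    if raid_level == 6:                    # two drives' worth of parity
        return (active - 2) * drive_tb
    if raid_level == 10:                   # mirrored pairs, half the capacity
        return (active // 2) * drive_tb
    raise ValueError("unsupported RAID level")

for level in (0, 5, 6, 10):
    print(f"RAID {level:>2}: {usable_tb(16, 2, level):4.0f} TB usable")
# RAID 0: 30 TB, RAID 5: 28 TB, RAID 6: 26 TB, RAID 10: 14 TB
```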

If you are not familiar with Fibre Channel, you should be. Fibre Channel (often abbreviated FC) is a storage connection technology that typically runs over fiber optic cables linking a server to storage equipment. It uses light to transmit data, and it is fast, faster than a gigabit network. It works best over short distances, 100 feet or so, which is not a problem in a typical server room where all the equipment sits in racks. Most storage hardware vendors offer Fibre Channel on all but the lowest end of their products. The server must be equipped with a host bus adapter (HBA) to connect to Fibre Channel; the HBA can be added as an expansion card and is sometimes built into the server's back panel along with the network interface.
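To put "faster than a gigabit network" in numbers, here is a small comparison of nominal link speeds. The FC generations listed are the ones I believe were common at the time, and the figures ignore encoding and protocol overhead, so real-world throughput is somewhat lower; treat them as ceilings.

```python
# Nominal link speeds converted to MB/s (decimal: 1 Gb/s ~ 125 MB/s).
LINKS_GBPS = {
    "Gigabit Ethernet": 1,
    "2 Gb FC": 2,
    "4 Gb FC": 4,
    "8 Gb FC": 8,
}

for name, gbps in LINKS_GBPS.items():
    mb_per_s = gbps * 1000 / 8
    print(f"{name:>18}: ~{mb_per_s:4.0f} MB/s nominal")
```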

The price of Fibre Channel hardware (HBAs, cables, switches) can be steep. A decent single-port HBA can cost $1,000 or more, and fibre switches easily reach $3,000 even for small models. This is where your IT department can really help: selecting the best components and working with their preferred suppliers. A switch is only needed in more complex setups, where storage is shared, FC management is centralized, or the equipment is located in multiple server rooms. The HBA can be added to the server by the manufacturer, but that may not be the most cost-effective way to go.

In part 3 of this article, I will discuss the need for backup and disaster recovery.
