Saturday, March 17, 2012

Behind The Backoffice, Part 1: Servers and Network

Let's talk about hardware.

Servers, storage, networking, disaster recovery, that kind of hardware. I know, I know, many data warehousing professionals would rather "let the IT guys take care of this part, they know best". Indeed, IT departments are specialists at identifying the best hardware for each situation. Well, most IT departments are.

But knowing the hardware behind your analytics platform is important. You need to know how much data you can throw at it and what improvements can be made when new technologies become available. As your level of responsibility over a project increases, hardware will become part of the project planning, and you will need to speak intelligently about servers and storage. There is much to be said about hardware, so I will make this a multi-part article.

The hardware used will vary from project to project, and will typically fit within the following categories: database server, database storage, ETL or data processing server, reporting/analytics server, networking, and backups. The backup servers can be a simple file server, a standby database server, a duplication of the production equipment, or any combination.

To get started, the easiest thing to do is to think about the flow of the data from its source all the way to the user's screen. This approach works when designing an analytics platform or a data warehouse, and also when selecting the hardware.

The source of data is usually an OLTP database server. The data will be extracted and moved to a data processing server for transformation, loaded into a database server, read by a reporting server to create or refresh a report, and displayed on the user's computer. Some of those steps may happen in a different order.

Moving data over a network is simple enough, but can be a challenge as volume increases. Having a gigabit network connection between all the servers will make things much smoother. Most servers available these days come equipped with one and often two gigabit interfaces, but not all company networks are fully gigabit enabled. You may need to negotiate with IT departments or spend some of your own budget to purchase and set up the missing gigabit network wiring and routers. It is not as glamorous as other parts of the business intelligence project, but well worth it.
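To get a feel for why the gigabit link matters, here is a rough back-of-the-envelope sketch in Python. The function name, the 50 GB nightly extract, and the 80% link efficiency are all assumptions made up for illustration; measure your own network before relying on numbers like these.

```python
def transfer_seconds(data_gigabytes, link_megabits_per_s, efficiency=0.8):
    """Estimate how long it takes to move data over a network link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually usable (protocol overhead, other traffic); 0.8 is a guess.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    data_megabits = data_gigabytes * 8 * 1000  # decimal GB -> megabits
    return data_megabits / (link_megabits_per_s * efficiency)


if __name__ == "__main__":
    extract_gb = 50  # hypothetical nightly extract size
    for name, speed in [("100 Mbit/s", 100), ("1 Gbit/s", 1000)]:
        minutes = transfer_seconds(extract_gb, speed) / 60
        print(f"{name}: about {minutes:.0f} minutes")
```

Under those assumptions, the same extract takes well over an hour on a 100 Mbit/s link and under ten minutes on gigabit, and that difference is paid on every hop between servers.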

The first server you are likely to look into is the database server. As is the case with all hardware, bigger and faster is better, but prices go up quickly. For simplicity, let's consider the main attributes of a server: CPU speed, number of CPU cores, amount of memory, and internal disks. All other attributes, such as bus speed, rack-mounted vs. tower, expansion slots, and redundant power supplies, are secondary.

Because prices can go up quickly when choosing more powerful components, and because the licensing model of some database software vendors is based on the number of CPU cores, the selection of the database server is tied to the database software being used. This can be seen as a constraint, but I prefer to see it as an opportunity. Using a columnar database, for example, allows the use of a less powerful server, but it will take advantage of faster I/O, which is tied to storage. Spending less on the database server allows spending more on higher-end storage. There is an upcoming article on columnar databases where I will discuss this in more detail.
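The trade-off between server, per-core licenses, and storage can be sketched with simple arithmetic. Every figure below (the budget, the server prices, the per-core license fee) is a hypothetical placeholder, not a quote from any vendor, and the function names are my own; plug in your real numbers.

```python
def server_and_license_cost(server_price, cores, license_per_core):
    """Total cost of a database server under per-core licensing."""
    return server_price + cores * license_per_core


def storage_budget_left(total_budget, server_price, cores, license_per_core):
    """What remains for storage once server and licenses are paid for."""
    return total_budget - server_and_license_cost(
        server_price, cores, license_per_core
    )


if __name__ == "__main__":
    budget = 100_000          # hypothetical project hardware budget
    license_per_core = 4_000  # hypothetical per-core license fee

    # Hypothetical options: a large server vs. a modest one.
    options = {
        "16-core server": (20_000, 16),
        "4-core server": (8_000, 4),
    }
    for name, (price, cores) in options.items():
        left = storage_budget_left(budget, price, cores, license_per_core)
        print(f"{name}: {left} left for storage")
```

With these made-up figures, the smaller server leaves several times more money for storage, which is the opportunity described above when the database engine is bound by I/O rather than by CPU.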

I have had a lot of success with both Dell and Sun servers in the past. For my purposes, Dell servers were used with 32-bit Windows Server, and Sun servers were used with 64-bit Solaris. The most powerful Sun server had a single quad-core CPU at 2.8GHz, 32GB of memory, and about 250GB of internal disk storage. A typical Dell server had a single-core CPU at 3.6GHz, 4GB of memory, and 73GB of disks. Compared to most other servers in the data center, these were low-end machines. Yet combining such a machine with a good database engine, good storage hardware, and careful configuration turns it into a very powerful analytics database server, capable of crunching through billions of rows of data in little time.

In part 2 of this article, I will discuss storage hardware.
