Monday, March 19, 2012

Behind The Backoffice, Part 3: Backups and DR

In part 3 of this article, I will go over some simple backup hardware and disaster recovery techniques I have used in the past. You can read part 1 for a review of networking and servers, and part 2 for a discussion of storage hardware.

Getting ready for a disaster is not exactly thrilling; it ranks up there with buying life insurance on the fun-o-meter. But unless you are prepared, a simple disk failure can grow out of all proportion. All the more reason to do it right the first time, so you don't have to do it again, and to make sure it works when disaster strikes. Which it will, sooner or later.

Disaster can take many forms: a power outage, a hard drive failure, a network hiccup, or something more serious such as a lightning strike, fire or flooding. Most of the time, the event will be disruptive but will not cause any real or long-term problem. The disk array will recover, the network packets will be retransmitted, and all is good.

But once in a while, the timing is just right and the event turns into something bad. In the case of a database server, the large number of disks becomes a weak spot, and data loss can occur. Or, a component can fail and the server becomes unusable. Those are the two instances most likely to affect a data warehousing platform, and there are simple steps that can be taken to prevent most of the drama.

The first line of defense is to have spare parts on hand. Disks are the first components to fail, and while manufacturers will warranty their product for anywhere from one to five years, it takes some time to process the replacement. Having a few extra disks of the right type and capacity will make for a quick replacement and reduce chances of data loss.

Moving on to data, the first thing that comes to mind is a backup. What to include in a backup depends entirely on the platform used, but it should cover the data plus any information needed to recreate the database server if it were to vanish, such as system configurations and ETL scripts. The same goes for the reporting server: report definitions and report instances need to be backed up, as well as configurations and any information needed for a rebuild.
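As a rough illustration of bundling the "everything but the data" part of a backup, here is a minimal Python sketch. The function names and the directories passed in are hypothetical, and the database itself would normally be dumped with the platform's own backup tool rather than copied this way.

```python
import hashlib
import tarfile
from pathlib import Path


def archive_paths(paths, archive_path):
    """Bundle files and directories into a gzip-compressed tar archive.

    Paths that do not exist are skipped, so one missing directory does not
    abort the whole run. Returns the archive path.
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in map(Path, paths):
            if p.exists():
                # Store each item under its final path component only.
                tar.add(p, arcname=p.name)
    return archive_path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the checksum next to the archive gives a simple way to confirm, after the copy to the backup server, that the file arrived intact.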

Backup technology is constantly evolving, and takes advantage of improvements in other areas such as faster disks and networking. While tape was once the most common type of backup (names like DLT, Exabyte and LTO come to mind), this technology is all but extinct in the corporate world. Keeping multiple copies in weekly rotation, storing boxes of tapes at an offsite secure storage, lost and damaged media... for all the benefits of backup, tape was a lot of hassle.

These days, a backup server is essentially a very low power server attached to a disk storage system, or an all-inclusive disk appliance, connected to a fast network. An example is the QNAP storage appliance, which I have used with success for database and system backup. There are many vendors making similar devices with all sorts of features.

The main function of the backup server is to present disks over the network, which are accessed remotely. The gigabit network is what makes it possible to back up a large database in a reasonable time. Also, and this is important, the backup hardware should be in a different location from the database server. The backup itself prevents data loss, but keeping the backup server in a different location also prevents a serious disaster, such as a fire or lightning strike, from destroying both systems.



Following that train of thought, another option for backing up data is to use an online backup service. There are many available (Carbonite, Mozy, CrashPlan and Dropbox are common names). Some will sing the merits of their distributed data centers, some will talk about cloud storage, but in the end they are all the same thing: storage over the internet. This is a great option for smallish data sets, but difficult to implement with terabytes of data. Still, it is worth taking a look.
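To put rough numbers on the difference between a gigabit LAN and an internet uplink, here is a back-of-the-envelope sketch in Python. The function name, the 70% efficiency factor and the 20 Mbit/s uplink are assumptions chosen for illustration, not measurements.

```python
def transfer_hours(size_gb, link_mbps=1000, efficiency=0.7):
    """Estimate the hours needed to copy size_gb gigabytes over a link.

    size_gb    -- data size in decimal gigabytes (10**9 bytes)
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed fraction of the nominal speed actually achieved
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, speed > 0, efficiency in (0, 1]")
    bits = size_gb * 8 * 1000**3
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600


# A 2 TB database over gigabit Ethernet: roughly 6 hours.
lan_hours = transfer_hours(2000)

# The same 2 TB over an assumed 20 Mbit/s uplink: roughly 13 days.
wan_days = transfer_hours(2000, link_mbps=20) / 24
```

Under these assumptions a full backup fits in a nightly window on the local network, while the same data set over the internet does not, which is why online services suit smaller data sets better.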

That was for the data. But what happens if the server has a component failure? The data is safely backed up, but the database server is unavailable. For the business, the result is the same: the data warehouse is unavailable, decisions cannot be made, money is being lost.

Again, simpler is better. If a week of downtime is acceptable to the business, then the easiest approach is to carefully document the server specs, all the components, all the configurations, all the database and application settings, and be ready to quickly order a replacement server and rebuild the database server. Depending on the skills available in the company, this can usually be done in a few days or a week.
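Part of that documentation can be collected automatically. The following Python sketch gathers a few basic facts about the machine using only the standard library and saves them as JSON; the function names and the free-form "notes" field are hypothetical, and details such as RAID layout or database settings would still have to be added by hand or pulled from vendor tools.

```python
import json
import os
import platform
import socket
from datetime import datetime, timezone


def collect_inventory(extra=None):
    """Return a dictionary describing the machine this script runs on.

    extra -- optional dict of hand-written notes (hardware details,
             database settings) to store alongside the automatic fields.
    """
    return {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        "cpu_count": os.cpu_count(),
        "notes": dict(extra or {}),
    }


def write_inventory(path, extra=None):
    """Write the inventory to a JSON file for inclusion in the backup."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(collect_inventory(extra), fh, indent=2, sort_keys=True)
```

Storing this file with the backups, rather than on the server it describes, keeps it available when the server is not.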

Most organizations will not accept a week of downtime, and in those cases a spare server is a better choice. This can be done in several ways, limited only by imagination and the specific needs of the company: a second server kept in storage, a development server that is promoted to production while a replacement is ordered, a multi-node database server where the nodes can be reassigned to other tasks, an unrelated server that has similar specs and is identified as a suitable replacement if needed, and so on. I have used most of these solutions at various times, and they all worked. The most convenient was the development server promoted to production, since it was already online, already running the database, and required only a few settings to be changed in order to become active. It is also easier to budget for a spare server if that server does double duty.

Backups and disaster recovery planning are not glamorous, but they need to be done. Never agree to work on a platform that has no backups or is not ready to deal with such events; that level of risk is too high. Taking simple steps goes a long way.
