About The Data: 2012

Wednesday, April 18, 2012

What In The Universe?...

This is a discussion about giving business users the ability to create their own reports, and the benefits of using a metadata layer in your business intelligence platform. Business Objects calls this a Universe, other software have other names for it.

So you built a great data warehousing system, data is flowing in and is well organized, and now you are thinking about delivering this data to the business users. (If you waited until the data warehouse was built to think about information delivery, it is a bit late, this should have come up at the very beginning. Moving on.)

You write reports using SQL queries, and you deliver those reports by email, or maybe have the reports published on a web portal. And you do this for all reports, and those reports are very useful, and the report requests keep coming in. More users from more departments start asking for information, and soon you become a specialist of all the data, all the metrics, all the partnerships, all the details and exceptions of every aspect of the company. You also become the bottleneck, as you cannot keep up with the requests, and business users become impatient.

Sounds familiar? Don't worry, it is quite common, the success of a business intelligence platform can be overwhelming at first.

One way to address this situation is to hire more people to write more reports. It works for a while, but it only delays the problem, and actually makes it worse because it allows the concentration of a lot of business knowledge within a small group of people, who are not the ones making decisions in all those areas. It is inevitable that the business intelligence professionals within your organisation will acquire a lot of business knowledge, but this knowledge needs to be distributed in the respective departments.

A better way to reduce the bottleneck, and at the same time limit the concentration of knowledge, is to give the users the ability to create their own reports, and train them on how to do it effectively. Read this again: give ability to create reports, and train users. If the users are not trained, they will not know how do it. This is just common sense.

So you install a database query tool on the computer of the marketing manager, product development manager, and VP of finance, and they start writing SQL queries against the data warehouse, and you have solved the problem. Right? Wrong. Writing SQL is your job. Your business users need a different tool, something more meaningful than data types, codes, joins and group bys. Your users need a graphical interface.

Most reporting software have a built-in graphical tool to write queries. Business Objects, Microstrategy, Tableau, Cognos, they all offer a way of writing queries using a drag-and-drop interface. But that still requires technical knowledge of the underlying database, with all the complexities and exceptions.

This is where the metadata layer comes in. In short, the metadata layer (also called the business layer, or semantic layer) is a piece of software that sits between the database and the reporting tool, and replaces the details of the database with meaningful business metrics. Dragging the labels "product revenue" and "order date" is meaningful, users of the data will understand it immediately. It standardizes the metrics and names used to describe them, and reduces (and hopefully eliminates) instances of "my data says this" and "my numbers show that". All users see the same information, consistently. Also, the metadata layer allows you to add descriptions to the metrics, or filters, or external data, all without adding complexity for the users.

Another benefit of the metadata layer is that you and your team will also be using it, and it accelerates the creation of reports and analyzes. What, you did not think all the reports would be created by business users from now on, did you? Users will now have a tool to access the data as it relates to their area of expertise, but for company-wide metrics you should retain control of report creation. The metadata layer will make this process easier, too. Plus, as most reporting tools on the market allow you to do, you will still have the ability to bend the metadata rules by writing SQL queries directly against the database, and display it in reports, which the users of the graphical tool will most likely not be able to do. That's the combination of easier work and job security!

A business metadata layer is a good way of improving productivity of your business users and your own developers by hiding the complexities of the data. It can be used to do a lot of fancy things behind the scenes, but mostly will enable more people to gain access to the data in a meaningful way, and take ownership of the data most relevant to their area of expertise. If you are not currently using a reporting tool that includes a metadata layer, take the time to do some research as it can really help you.

Wednesday, April 11, 2012

ETL or ELT, That Is The Question

The ETL process is at the heart of the data warehouse, turning data from messy and scattered to organized and centralized. But my experience has been that ETL is slower than ELT. Besides swapping two letters, I will describe what that means and why it matters.

First, the basics: ETL stands for "Extract, Transform and Load". It is the process by which data is extracted from the source systems (database, file), transformed into the proper format and structure, and loaded into the data warehouse. Since you're reading this article, you probably know that already.

Can you guess what ELT stands for? Don't think too much, it means "Extract, Load and Transform". This swapping of letters has a few important implications. For one, it means the data will go from source to destination without being transformed, and transformation will take place in the database. Using database bulk loaders and native data formats, this results in very fast data movement. Transforming data in the database narrows the performance tuning and maintenance to a smaller set of tools, thus gaining efficiency and requiring fewer specialized skills. Also, it opens the door to keeping a copy of the source data in the reporting databases, in addition to the data warehouse, because data has already been copied; this may or may not be interesting for your needs, you get to decide.

Moving data from source to staging can be done in a number of ways: real-time replication, nightly batch extraction, intra-day batch, snap backup, etc. The process of acquiring the data depends a lot on volume; small volume works well in batch but large volume needs more care. It also depends on business requirements, if intraday data is needed then multiple batches or real-time replication are necessary. At the end of the process, a day's (or partial day) worth of data has been copied from source systems to a staging area, ready for processing.

The transformation part is where the magic happens: cleanup, lookup, rollup, aggregation, validation and error checking, this is where the data is prepared for the data warehouse. Because data is already in the database, lookups can be done using joins, efficient stored procedures can be written, indexing and other performance tuning techniques can be used. By comparison, a standard ETL flow would have the ETL tool iteratively run through transformations outside of the database, storing lookup data in memory, and write the resulting data to the data warehouse at the end.

If budget and staffing level allow it, the combination of mainstream ETL software and powerful hardware is certainly more flexible than the solution I am describing here. Maintaining SQL-based data transformation is tedious, and still requires a script to control the flow.

But when budgets are tight and compromises must be made, the ELT approach has been shown to work with multi-terabyte data sets very effectively, at the cost of more difficult maintenance. It's a trade-off that is acceptable in smaller organizations, but may not scale well as business gets larger.

As you evaluate an ETL tool for your project, or if you are faced with performance issues that cannot be resolved without using big hardware, consider looking into an E-L-T data flow. I have used it with more than 4TB of data, adding 30GB daily, and performance has been very good. The cost you pay is slower maintenance compared to a graphical interface, but this is easily alleviated by using strong programming standards. At the very least it will give you another option, and you may realize this is all you needed. Plus, if you can afford the additional disks needed to store a copy of the source data, you gain the additional flexibility of having both the data warehouse and source data in one place. Win-win!

Monday, April 2, 2012

Expanding Your Horizons

Horizontal partitioning is a great way of storing and retrieving data from very large tables by breaking them into smaller tables, providing both predictable performance and simple querying. Some vendors offer built-in tools to handle partitions, other vendors have plug-ins available. It can also be implemented as a simple set of tables with management scripts, as I will be describing.

Imagine a retail store or e-commerce site taking many orders, often with multiple line items for each order. This is a very common scenario. The database table that stores the line items quickly grows in size, and after a few months can reach several million rows in a moderately successful store. The web site or point-of-sale software should have no difficulty writing new rows to the database, but the large table slows down ETL and analytics queries. This is the scenario we are addressing.

Once the basic indexing and optimization techniques are no longer sufficient, a very large table becomes a good candidate for partitioning. In short, the large table will be replaced with a number of smaller tables called partitions, and the partitions will be reassembled using a view so it can be queried as a single table. Turning the large table into partitions is a one-time event, and from that point forward there needs to be some process that will maintain the partitions and the view as new data is added.

Let's take the example of table order_item:

order_item
------------
id
order_id
inventory_id
quantity
price

record_create_date

record_update_date

Let's say this table has grown to 75 million records after 2 years of operations. On the production database, the table may be truncated and the data archived, or treated in some other way that does not cause a performance issue for single record inserts and updates. But in the analytics database this table is now causing some slowness, and it is decided to partition it.

The partitions will be called order_item_001, order_item_002, order_item_003, etc. The size of each partition determines the total number of partitions. Let's pick a size of 20 million for this example. The order_item table will be broken into 4 partitions, the last one being the "current" partitions with 15 million records. It can be called order_item_004, or order_item_current, or anything that will allow identifying the last partition. One benefit of using the name order_item_current is that it will remain constant as new partitions are added, which makes inserts seamless for ETL.

In order to query the partitioned table, a view is used to reassemble the partitions, like so:

create view order_item
as
select *, partition = NULL from order_item_current
union all
select *, partition = 1 from order_item_001
union all
select *, partition = 2 from order_item_002
union all
select *, partition = 3 from order_item_003

The view is now taking the place of the table as far as existing SQL queries and scripts are concerned. For optimal performance, it is important that all partitions use the same indexes, it makes the query optimizer's job much easier and gets the best possible performance. The partition number is added in the view instead of being added to the partition table so that the table structure remains unchanged. Adding the partition number is necessary to identify which partition to update during ETL.

As new records are inserted and the count reaches 20 million in the current partition, it will be turned into the next numbered partition, order_item_004, the view adjusted accordingly, and new records will continue to be inserted in the now empty order_item_current table. Because the partitions are of equal size, analytic queries that return a range of data tend to take the same time to execute, and this eliminates the slowness caused by the growing table. In the case of retail transactions, reports are usually retrieving a quarter, a year, or last year plus current year, and the execution time becomes predictable despite data growth and seasonality.

Updating of existing records cannot be done through the view (SQL does not allow it), which is why adding the partition number is important. A simple query that matches the keys and returns the partition number can be used to identify which partitions need updating, and the update can then be applied to only those partitions. A bit of dynamic SQL can be used here to craft queries on the fly.

The key to successfully implementing horizontal partitioning is to carefully select the logical separation of the data. The end result needs to be a set of partitions that are similar in size, so that performance remains constant for similar size result sets. Also, data that is likely to be accessed together should remain in as few partitions as possible. Using time to separate the data sounds good at first, but is not a great approach because spikes like seasonality (Q4 retail season and other holidays for example) can result of partitions of very different size.

Instead, I have used rowcount for tables as large as 6 billion rows, with partition size ranging from 20 to 75 million, with very good and constant performance. A weekly management task is all that is needed to iterate through the partitioned tables:

determine if "current" partition has exceeded the threshold
if so
create the next numbered partition
adjust the view to include new partition
if not, move on to next table

This is a great example of minimal effort with maximum impact.

If your data warehouse is stored in a relational database, and you have the ability to write and schedule stored procedures, then you already have everything you need to implement horizontal partitioning of large tables. It is a great way of solving a common performance problem, is very cost effective for both implementation and maintenance, and will free you up for more pressing tasks, like keeping up with the incessant flow of changes in the source systems feeding your data warehouse.

Monday, March 26, 2012

Think Vertical

Columnar databases are all the rage for data warehousing and analytics systems, and there is a good reason for that: they are fast. There are several flavors and technologies available, and they all share the same column-wise organization of the data. Here's how it works.

As I discussed before, the main bottleneck of a data warehouse system is disk i/o. Writing and updating data in the database during ETL, retrieving data when refreshing reports, and exporting lists for mailings or invoice reconciliation, all of that requires disk i/o. The database server is constantly writing or retrieving data from storage. Disks are much slower than memory and cpu, and they become the pinch point. This is nothing new, it is common to all database systems, and to all computers for that matter. But it quickly becomes a problem for data warehousing because of the large data sets. An operational database usually inserts, updates or deletes a single record at a time; loading a fact table or refreshing a business report will routinely handle millions of records in a single query.

Columnar database engines directly address the disk i/o bottleneck by storing data differently, in columns instead of rows. It allows retrieval of a specific column for a large number of records, without having to read all the other columns. A traditional database engine reads all columns for the record, then throws away what is not needed; a columnar database only reads what is needed without wasting precious disk i/o.

This is all done under the hood and invisible to the database programmer, and standard SQL queries are used to manipulate the data. Let's look at the example of a query that calculates the total revenue by month for the state of California, running against this fictitious order table:

This table is not part of a star schema, as it does not have any keys, it is just used to illustrate the concept. The query might look like this:

Only 3 columns are involved in this query: order_date, state and item_price. A column-wise database engine can use an index on the state column (if present) to find the California orders, then retrieve the order_date and item_price to complete the query. The name and address fields are not retrieved by the database, which saves a lot of disk activity as these fields tend to be wide. By comparison, a traditional database engine will read the entire record and discard the unnecessary columns, thus wasting i/o. For a single record it is inconsequential, but for millions of records it makes a big difference on performance.

In addition to reading only the data needed, other techniques can be used to take advantage of the columnar organization, such as data compression and more efficient indexing. This is a relatively new technology and vendors are exploring the possibilities.

Organizing the data into facts and dimensions to reduce the joins and pre-calculate common metrics will have a major impact on performance on any database engine. Combined with a fast database, the star schema becomes a very powerful data warehousing tool. Columnar databases offer this fast performance, and do not require the powerful hardware of traditional databases in order to deliver good performance, thus keeping costs reasonable.

There are several vendors offering a columnar database. Some are big names, some are newcomers, some use open source technologies. I am a big fan of Sybase IQ, I have used it for the last 7 years. Performance is fantastic, hardware requirements are minimal, and maintenance is very easy.

(Before you ask: Sybase/SAP is not paying to place their product on my blog. I am not an employee of Sybase or SAP. I just like the product, it has been a great tool for me.)

If you are not currently using a columnar database, you should look into it. Compared to a traditional database, the gain in performance using commodity hardware makes this a great option for growing data warehouses.

Monday, March 19, 2012

Behind The Backoffice, Part 3: Backups and DR

In part 3 of this article, I will go over some simple backup hardware and disaster recovery techniques I have used in the past. You can read part 1 for a review of networking and servers, and part 2 for a discussion of storage hardware.

Getting ready for a disaster is not exactly a thrilling thing to do, it ranks up there with buying life insurance on the fun-o-meter. But unless you are prepared, a simple disk failure can take unexpected proportions. All the more reason to do it right the first time so you don't have to do it again, and make sure it works when disaster strikes. Which it will, sooner or later.

Disaster can take many forms: power outage, hard drive failure, network hiccup, or more serious such as lightning strike, fire and flooding. Most of the time, the event will be disruptive but will not cause any real or long-term problem. The disk array will nicely recover, the network packets will be retransmitted, and all is good.

But once in a while, the timing is just right and the event turns into something bad. In the case of a database server, the large number of disks becomes a weak spot, and data loss can occur. Or, a component can fail and the server becomes unusable. Those are the two instances most likely to affect a data warehousing platform, and there are simple steps that can be taken to prevent most of the drama.

The first line of defense is to have spare parts on hand. Disks are the first components to fail, and while manufacturers will warranty their product for anywhere from one to five years, it takes some time to process the replacement. Having a few extra disks of the right type and capacity will make for a quick replacement and reduce chances of data loss.

Moving on to data, the first thing that comes to mind is a backup. What to include in a backup depends entirely on the platform used, but should include the data plus any information needed to recreate the database server if it were to vanish, like system configurations and ETL scripts. Same for the reporting server, report definitions and report instances need to be backed up, as well as configurations and any information needed for rebuild.

Backup technology is constantly evolving, and takes advantage of improvements in other areas such as faster disks and networking. While tape was once the most common type of backup (names like DLT, Exabyte and LTO come to mind), this technology is all but extinct in the corporate world. Keeping multiple copies in weekly rotation, storing boxes of tapes at an offsite secure storage, lost and damaged media... for all the benefits of backup, tape was a lot of hassle.

These days, a backup server is essentially a very low power server attached to a disk storage system, or an all-inclusive disk appliance, connected to a fast network. An example is the QNAP storage appliance, which I have used with success for database and system backup. There are many vendors making similar devices with all sorts of features.

The main function of the backup server is to present disks over the network, which are accessed remotely. The gigabit network is leveraged here to make it possible to back up a large database in a reasonable time. Also, and this is important, the backup hardware should be in a different location from the database server. The backup will prevent data loss, but having the backup server in a different location will also prevent a serious disaster such as fire or lightning strike to destroy both systems.

Following that train of thought, another option for backing up data is to use an online backup service, there are many available (Carbonite, Mozy, CrashPlan, DropBox are common names). Some will sing the merits of their distributed data centers, some will talk about cloud storage, but in the end they are all the same thing: storage over the internet. This is a great option for smallish data sets, but difficult to implement with terabytes of data. Still, it is worth taking a look.

That was for the data. But what happens if the server has a component failure? The data is safely backed up, but the database server is unavailable. For the business, the result is the same: the data warehouse is unavailable, decisions cannot be made, money is being lost.

Again, simpler is better. If a week of downtime is acceptable to the business, then the easiest approach is to carefully document the server specs, all the components, all the configurations, all the database and application settings, and be ready to quickly order a replacement server and rebuild the database server. Depending on the skills available in the company, this can usually be done in a few days or a week.

Most organizations will not accept a week of downtime, and in those cases a spare server is a better choice. This can be done in several ways, limited by imagination and the specific needs of the company: a second server kept in storage, a development server that is promoted to production server while a replacement is ordered, a multi-node database server where the nodes can be reassigned to other tasks, an unrelated server that has similar specs and is identified as a suitable replacement if needed, etc. I have used most of these solutions at various times, they all worked. The most convenient was the development server promoted to production, since it was already online, already running the database, and required only a few settings to be changed in order to be active. Also, it makes it easier to budget for a spare server if that server does double-duty.

Backups and disaster recovery planning is not glamorous, but needs to be done. Never accept to work on a platform that does not have backups or is not ready to deal with events, that level of risk is too high. Taking simple steps goes a long way.

Sunday, March 18, 2012

Behind The Backoffice, Part 2: Storage

This article discusses the hardware typically used in data warehousing platforms. In part 2, I am covering storage hardware. Servers and networking were covered in part 1.

Let me start by making this statement: the quality of the storage hardware used in the data warehousing platform is a critical success factor of the project, just as important as careful design of the data model. Pretty bold, right? But it is true.

For a large scale data warehouse, counted in terabytes, disk i/o is the main bottleneck. Not the network, not the cpus, not the memory, but reads and writes to disk. Many technologies and configurations can be used to alleviate this problem, but i/o remains the bottleneck. It will slow down data processing, make queries run longer, and can make the entire analytics platform appear sluggish if not addressed. It is worth spending some of your budget on faster storage, and take the time to configure it for maximum performance.

If you have been working on smaller data warehouses in the past, chances are the storage consisted of a few disks inserted in the server. That's what I did. Some servers are larger and allow as many as 4 or 6 disks or more, so the amount of storage may be sufficient for small amounts of data. With terabyte and even 2-terabyte drives, this can represent a lot of storage space. As far as capacity goes, it may be all you need. "Six 2-terabyte drives, that's 12TB of available storage, I'll never need more than that!" Wrong, you will need more, and the problem is not capacity. The problem is i/o.

In order to maximize data throughput, what you need is a lot of spinning disks. Using a lot of disks to create a larger partition of data, the data is then broken in small chunks and written/read to multiple disks at the same time. This is faster than writing the same amount of data to a single disk, and this is how faster disk i/o is achieved. Storage hardware is built for that very specific purpose: maximize disk i/o by spreading data across many disks.

The actual implementation varies, but almost all vendors use some level of RAID technology. RAID was originally invented to reduce the risk of data loss caused by the failure of a disk, by spreading data over multiple disks, and it does a spectacular job of it. The side benefit that quickly emerged is faster disk access, and we take full advantage of it by using disk arrays attached to the server. Many disk arrays have 16 disks, some have a lot more. Management of the disks is done through some interface where disks are grouped in logical drives using RAID, then presented to the server. In all cases, there is also a spare drive that will be used by the disk array in case an active drive fails. The market of storage technology is very crowded, new vendors emerge all the time as new technologies are created. The price range for storage hardware varies greatly, from a few thousands to half a million dollars or more. Speed, reliability and capacity are the main drivers. Fortunately, you do not need to spend half a million dollars to have a good solution, newer-faster-better hardware is always available.

If you are not familiar with fibre channels, you should be. Fibre channels (often called FC) are fiber optic cables used to connect a server with storage equipment. They use light to transmit data, and they are fast, faster than a gigabit network. They work best over short distance, 100' or so, which is not a problem in a typical server room where all equipment is in racks. Most storage hardware vendors offer fibre channels on all but the lowest end of their products. The server must be equipped with a host-bus adapter (HBA) to connect the fibre channel, it can be added as an expansion card, and is sometimes built into the server back panel along with the network interface.

The price of fibre channel hardware (HBA, cables, switches) can be steep. A decent single-port HBA can cost $1000 or more, fibre switches easily reach $3000 for small models. This is where your IT department can really help select the best components and work with their preferred suppliers. A switch is only needed in more complex setups, where storage is shared, FC management is centralized, or the equipment is located in multiple server rooms. The HBA may be added to the server by the manufacturer, but it may not be the most cost-effective way to go.

In part 3 of this article, I will discuss the need for backup and disaster recovery.

Saturday, March 17, 2012

Behind The Backoffice, Part 1: Servers and Network

Let's talk about hardware.

Servers, storage, networking, disaster recovery, that kind of hardware. I know, I know, many data warehousing professionals would rather "let the IT guys take care of this part, they know best". Indeed, IT departments are specialists at identifying the best hardware for each situation. Well, most IT departments are.

But knowing the hardware behind your analytics platform is important. You need to know how much data you can throw at it, what improvements can be made when new technologies become available, etc. As your level of responsibility over a project increases, hardware will become part of the project planning, and being able to speak intelligently about servers and storage is important. There is much to be said about hardware, so I will make this a multi-part article.

The hardware used will vary from project to project, and will typically fit within the following categories: database server, database storage, ETL or data processing server, reporting/analytics server, networking, and backups. The backup servers can be a simple file server, a standby database server, a duplication of the production equipment, or any combination.

To get started, the easiest thing to do is to think about the flow of the data from its source all the way to the user screen. This approach works when designing an analytics platform, a data warehouse, and also when selecting the hardware.

The source of data is usually an OLTP database server. The data will be extracted, and moved to a data processing server for transformation; loaded in a database server; read by a reporting server to create or refresh a report; and displayed on the user computer. Some of those steps may be happening in a different order.

Moving data over a network is simple enough, but can be a challenge as volume increases. Having a gigabit network connection between all the servers will make things much smoother. Most servers available these days come equipped with one and often two gigabit interfaces, but not all company networks are fully gigabit enabled, you may need to negotiate with IT departments or spend some of your own budget to purchase and set up the missing gigabit network wiring and routers. It is not as glamorous as other parts of the business intelligence project, but well worth it.

The first server you are likely to look into is the database server. As is the case with all hardware, bigger and faster is better, but prices go up quickly. For simplicity, let's consider the main attributes of a server: cpu speed, cpu cores, amount of memory, internal disks. All other attributes, such as bus speed, rack-mounted vs. tower, expansion slots, redundant power supplies, etc. are secondary.

Because prices can go up quickly when choosing more powerful components, and because the licensing model of some database software vendors is based on the number of cpu cores, the selection of the database server is tied to the database software being used. This can be seen as a constraint, but I prefer to see it as an opportunity. Using a columnar database, for example, allows the use of a less powerful server, but it will take advantage of faster i/o, which is tied to storage. Spending less on the database server allows spending more on higher end storage. There is an upcoming article on columnar databases where I will discuss this in more detail.

I have had a lot of success with both Dell and Sun servers in the past. For my purposes, Dell servers were used with 32-bit Windows Server, and Sun servers were used with 64-bit Solaris. The most powerful Sun server had a single quad-core cpu at 2.8GHz, 32GB of memory, and about 250GB of internal disk storage. A typical Dell server had a single-core cpu at 3.6GHz, 4GB of memory and 73GB of disks. Compared to most other servers in the data center, these were low-end machines. Yet, by combining such a machine with a good database engine, good storage hardware, and with careful configuration, it becomes a very powerful analytics database server, capable of crunching through billions of rows of data in little time.

In part 2 of this article, I will discuss storage hardware.

Friday, March 16, 2012

What Is Data?

So, what is data, exactly?

From my perspective, data is any piece of information produced by a company, purchased from a vendor, or shared by a partner, that can be used to make a decision of any kind by anyone in the organization. It is very broad. For our purposes, we will only consider data recorded in electronic form, which represents the vast majority of data available to a business today.

Your company runs an e-commerce website? Then your list of products, customers, orders, promotions, the web traffic, affiliate websites, all of that is data. If those orders are fulfilled from a warehouse, the inventory of that warehouse is data. Those orders are shipped by mail or some other carrier, bam, more data.

Cash registers in a brick-and-mortar store produce a lot of detailed data about the content of the shopping cart. As far as data goes, the point-of-sale software is essentially doing the same thing as an e-commerce website. The manufacturing process of a plant produces data about materials, quality, duration, etc. Some products come with warranties, the issuance and claims on those warranties are another source of data. The accounting and finance functions of just about any company are major sources of data.

Some data is less clearly defined. For example, the content of a phone call in a customer service center is not recorded in electronic form, but the customer service agent typically uses a software application to capture information during the call, complete with categories, duration, contact information and more. This becomes another source of data.

Invoices from service providers such as phone companies are another source of data, which typically flow into the accounting process.

Knowing what data is available to an organization is an important success factor for data warehousing and business intelligence professionals. Making a piece of data from a different department available in the data warehouse can make a big difference in how useful the data warehouse will be.

For example, consider a data warehouse being created for a mid-size winery that wants to analyze its store sales. Wine making is complex, it combines many specialized skills. It would be normal to start the analysis of sales by looking at the financials, the marketing data, maybe the demographics. That's what I would do. The manufacturing process (the actual wine making) also plays a big role in the sales of wine, it determines fundamental attributes of the product, in this case the quality and price point. But there are so many areas of wine making, what piece of information is important? Does the type of barrel used to age the wine have an impact on the sales? Probably. What about the type of ink used to print the label? Maybe, it is hard to tell without spending some time looking into it.

My approach has always been that you can only analyze the data if you have it. In the case of the winery, I did include information about the ink in the data warehouse (it turns out the type of ink used to print the label has an impact on the perception of quality by the consumer, which varies by region and price point; also, in the US, there are laws controlling the labels, different for every state). Each piece of data on its own may or may not be useful, but putting this data together makes every data element more powerful.

And THAT is what a data warehouse is all about.

Thursday, March 15, 2012

Welcome to About The Data blog!

Welcome to this blog about data warehousing, data management, and business analytics.

My name is Eric Lagacé, I have been working in this industry for over 15 years. I have worked on many projects of varying scale, from week-long quick fixes to year-long major endeavors, and there have been many ideas along the way, both good and bad, and many stories.

In this blog, I will chronicle those ideas and tell those stories, sometimes revisiting solutions to problems, sometimes addressing current issues with ongoing projects, sometimes discussing improvements that can be made. My goal is to share information, possibly help others in their projects, and hopefully start interesting conversations.

Much of my work experience is in the retail industry, both online and offline, with some amount of high-tech, medical and manufacturing. I usually get involved in every aspect of the projects, from analyzing requirements with business users, to designing and implementing the data warehouse, selecting the hardware and software, and ensuring successful long-term operation. I plan on touching on all of these topics.

If that sounds interesting, keep an eye on this page, and let's get the conversation started.