Archive

Archive for the ‘ETL Best Practices’ Category

On the lighter side…

November 27, 2011 1 comment

There are a couple good comments from the readers recently, such as this one:

….You can not say that truncating a target table is a bad practice in general…

I totally agree with the above statement. Actually, truncating a target table is used in many of my ETL processes. Many of the processes I designed also do incremental loading only. So the reader is correct that how we design our ETL process really depends on the characteristics of the data we are loading. It becomes a bad practice when we use wrong design for the wrong data. In the particular example I gave in ETL #70 – The worst possible way to truncate your target table, truncating the target table was a bad practice indeed.

OK, enough about such serious topic. Let’s move on to something lighter. I recently moved to a different group within the same company. I’ve been in the same group for quite awhile, and feel pretty sad about leaving my co-workers and friends. My friend Hamid Y composed a poem for me. I am “publishing” the poem here with Hamid’s consent.

Being a SQL/BI developer for so long, but I still cannot explain to my family what exactly what I do. I recently told an Engineer wife that I work for a bank. She immediately started to consult me with her most frustrating banking questions, such as if she can join a credit union to avoid fees.

I am not sure if a x-ray machine or a sewing machine will help me to answer her question or not, but at least it’s on the lighter side of our daily SQL/BI work. Here it goes:

Dear Sherry,

I wrote this little poem for you…hope you like it..

You’re one of the smartest people I’ve seen
Pumping out SQL code and arithmetic mean

You found program bugs like a x-ray machine
And you saved this company more money than a teller machine

You kept this department together like a sewing machine
And making it run like a pinball machine

So now that you’re leaving our team
I’ll try not to make a scene

Just know that you’re loved and well-respected
By everybody in this department you affected

I am already missing people I’ve being working with day in and out. Thanks friends.

Categories: ETL Best Practices

ETL #70 – The worst possible way to truncate your target table

April 22, 2011 2 comments

In my previous post, Why truncating a target table is a bad practice in ETL?, I strongly suggest that we do not truncate our target tables.

In this blog, I’ll tell you the worst possible way I’ve seen to truncate a target table.

Before I get into that, I’ll share with you the experience a business user has shared with me:

….Sherry, can you check the report again? It has failed with the new data load today again.

….What caused it to do that? Has been happening sporadically lately where the whole report just goes down.

….I thought the issue has been resolved since last change.

Before the “last change”, the report has no staging data. It’s pulling large amount of data directly from our data source via OPENQUERY (you can read my other posts to know my opinions about  OPENQUERY). If you are lucky, the report will send back data within 5 minutes.

So the last change was made is to pull data every morning and store it in a table.

Here is how this table (in my opinion, this is your target table, not a staging table) is populated every morning:

image

It is truncated first, then with an attempt to populate it. Unfortunately, almost 5 out of 10 days, the OPENQUERY failed, leaving the target table empty 5 days out of 10. In those 5 lucky days, the target table will remain empty for about 10 minutes, which is how long it takes for the table to be fully populated.

That explains the business users’ experience with the report.

Do you still need more convincing not to truncate your target tables?

ETL Toolkit– Error Event Table and Audit Dimension

March 28, 2011 2 comments

Chapter 4 “Cleaning and Conforming” from Kimball’s The Data Warehouse ETL Toolkit takes about 50 pages, not an extremely long chapter. Here is a warning from the beginning of the chapter.

Please stay with us in this chapter. It is enormously important. This chapter
makes a serious effort to provide specific techniques and structure for an
often amorphous topic. The chapter is long, and you should probably read
it twice, but we think it will reward you with useful guidance for building
the data cleaning and conforming steps of your ETL system.

I have to confess that I read it twice, but never finished reading the entire chapter each time. That doesn’t mean that I totally ignored the data quality issue in my ETL design. Although I didn’t finish reading the entire chapter, I focused on these three sections:

image

Without the effort and hard work during the data profiling phase (or I called it data discovery phase), cleansing and auditing are not going anywhere. Here is an example from the IT Asset Management system from my previous project.

Network auto-discovery software are installed on servers. Multiple such software can be installed on the same server, collecting duplicate data or conflicting data. Only after lots of hard work of data profiling, we were able to create a strategy to resolve duplicate or conflicting data.

Looking at the Error event table schema suggested from the ETL Toolkit, it doesn’t not take long to come with a Screen Type of “Duplicate Composite Key” as a Screen Type for my data exception table. I called it data exception, instead of error event table.

By the way, the suggested schema here is a generic schema. I’d think that you can design your table any way you want to fit into your specific project.

image

As for the #2 Cleaning Deliverable “Audit Dimension”, my understanding is that it is just a summary from your data exception screens, with a score for each exception category. I’ve designed an audit table to summarize all the important measures for data exception and also for normal row counts, but I’ve never created scores for them. I guess I really didn’t know how to score them.

image

Define the grain of fact tables with unique key constraints

March 13, 2011 Leave a comment

I have been working with SQL Server, and Integration Services and Reporting Services for years. Data from our data warehouse in DB2 are ETL’d to the SQL server to be ready to be consumed by SSRS report. Until recently, I’ve only been using unique key constrains on the fact tables sparsely. I’ve started to do so on all the fact tables, and dimension tables now.

Out of three basic fact tables, transactional fact table, fact table for periodic snapshots, and for accumulating snapshots, we very often deal with the first type of transactional fact table, which holds data at the most detailed level. Generally speaking, this type of fact table will have the most number of dimensions associated with it. Fortunately, it’s not that many, in the type of fact tables I deal with. The most common grain of the fact tables I deal with involves account number, transaction date, employee login code, type of product, type of actions (this can grow to a few sometimes). If it’s an accumulating fact table, several milestone dates can also be part of the grain.

OK, back to the creation of the unique key constraints. Most of the unique key constraints will just be a composite key that is composed of all the dimensions in a grain. The grain of a fact table represents the most atomic level by which the facts may be uniquely defined. There might be also some dimensions (such as employee’s location) that do not contribute to the uniqueness of the fact records. Do not include those dimensions in the unique composite key.

A Habit of Writing Definition Document

February 24, 2011 Leave a comment

Reporting for a financial institute can be very difficult due to complex business rules. Reporting can be even harder when the data is about paying employees incentives/bonuses when the rules can be even more complex.

I got into a habit of writing a report/ETL Definition document for every reporting/ETL I do. Excel and Visio documents are my choices.

I don’t call this document “Requirement” or “Specification”. To me, “Requirements” should come from the requester, and ‘”Specification” should come from a business analyst.

Since I am the author of the document, my focus is what I have done in the reporting/ETL, and what business rules I have used.

 image

Categories: ETL Best Practices

Incremental Data Loading – 3 day rule

February 24, 2011 Leave a comment

Incremental loading is not a easy topic. I am just concentrating on a very narrow focus here.

Here is the context of this blog.

  1. Data Source: relational database in DB2
  2. Data Destination: a table in SQL database
  3. Reporting: a SSRS report with direct data pull from the above table
  4. ETL: Implemented in a SSIS package with some business rules built-in
  5. Daily Process: newly added data from the data source needs to be loaded into the destination table

Challenge of the ETL Process:

  1. Need to be re-startable: without any manual setting
  2. Need to be fast: the source can contain large number of records
  3. Need to be self-correctable: if source data was corrected and back dated, the destination data should be self-correctable

What it really means is to choose among the following options regarding the incremental daily loading:

  1. truncate and re-populate the destination table daily
  2. start from the last date from the destination table: I will need to pass the last date from my SQL table. I can either read it in as a user variable in my SSIS package, and pass it to my query, or I can just “ETL” the last date to the source DB2 database.
  3. just pull the previous day’s data from the source
  4. pull the previous 3 days’ data from the source

The first choice is simple enough, but will suffer from poor performance.

The second choice can be a little messy, but it sounds like a good choice.

The third choice is simple enough, but not a good one when considering that our SQL job can potentially fail every day, in which case, data will be missing due to job failure.

The last choice is simple, and it will pull data for the last 3 days even when your job failed in the last 2 days. And it also should be fast. Fortunately SQL jobs are fixed Monday to Friday in our environment.

For the sake of simplicity, I picked the last option for some of my ETL processes. So far, the 3 day rule has worked pretty smoothly. 

Categories: ETL Best Practices

Metadata Searching – Continued

February 18, 2011 Leave a comment

It’s straightforward enough to get all the steps inside a SQL job, and also all the SQL jobs.

image

Categories: ETL Best Practices

Metadata Searching – SQL Jobs, SSIS, SSRS, and Procedures

February 18, 2011 Leave a comment

Is your SQL environment growing so big and intertwined that your memory cannot tell you right away which SSIS package is populating which tables, and witch tables are used by which SSRS reports, and which SQL job is executing which SSIS package or procedure?  Welcome to the real SQL world!

A DBA in my group has built a few life-saving tables populated daily by a few very slick SSIS packages. Now I can search anything that my memory cannot reach.

The basic format he used is:

image

For searching a SSRS report, it’s a little less straightforward.

To be continued on this topic.

Categories: ETL Best Practices

Why are we so afraid to touch business rules built in our ETL process?

February 18, 2011 Leave a comment

When I got my old job back, naturally I’ve noticed many of the existing ETL processes have been modified with new business rules.

However, many of the business rules were added in a way that existing rules were barely touched. This has caused issues in several ways:

1. Existing rules might be in contradiction with the new rule. Without removing the existing rules, newly added rules are either overwritten or generate mixed/incorrect result.

2. Some times there is no need to add a new rule. The clean way is to modify the existing rules.

This leads to the question, why are we so afraid to touch the existing rules.

Some might say, it’s because existing rules weren’t documented. Others might say, the process is so complicated that it’s better not to touch them.

I have to say that none of these is good enough reason. If you have access to the existing process, you have the rules documented. It might not be in the way you prefer, but it’s documented. Many companies also have project management or change management systems. They are also a form of documentation.

As a matter of fact, the first step we should take is to examine the existing rules and articulate them. No change should be made without a clear understanding of where the new rules stand in terms of the existing rules.

I recently got involved in two issues where business rules got piled on over the time, and the data is no longer valid.

I examined the code, and realized that developers never touched existing rules when new rules were added. In another word, bandages were applied over time, and nobody can articulate the rules anymore.

Don’t be afraid. It’s just code anyway.

Categories: ETL Best Practices

QA your own ETL process – no more, no less

January 21, 2011 Leave a comment

The most common task in an ETL process is to determine what to update, what to delete and what to insert using criteria that are specific to what you are doing.

One of the routine tasks in the asset management master database ETL project is to use a composite key of host name + serial number to determine if an asset needs to be updated from a network auto-discovery tool, or needs to be created as a new asset. No assets will be deleted though. They can be de-commissioned in variety of ways if they meet certain criteria.

One QA technique I use routinely is based on an ETL principal, that is “ETL process should not create new data”, or what I called “no more, no less”.

Suppose I have the following 5 milestone points and staging.

  1. Extracting raw data
  2. Cleansing and standardizing and integrating
  3. Conforming data to the master database
  4. Delivering to the target
  5. Process reporting

If I start from the stage 1 and get these row counts from the delivering stage:

  1. Insert: 2,000
  2. Update: 300,000

Now if I re-start from stage 2 (skip stage 1), my process should not create any new data, since my source data remains the same. I’d expect my row counts look like this:

  1. Insert: 0
  2. Update: 302,000 (300,000 + 2,000) 

This simple QA technique helped me tremendously in

1) debugging my own process

2) also discovering new patterns in the data.

Categories: ETL Best Practices
Follow

Get every new post delivered to your Inbox.

Join 26 other followers