Archive

Archive for March, 2011

SSRS #40 – Use Table Filter to Conditionally Show Rows

March 30, 2011 Leave a comment

I recently had a need to show three different view in a employee productivity report with the same data set, two summary views and one account level detail view.

For the two summary views, employees are listed regardless they have data or not in the date range. This will allow the report to do “forced ranking. I’ll blog more about “forced ranking” later. Because I had to show all the FTEs in different grouping, I had to create some dummy rows. I’ve written a post, SSRS – Drill down or Drill through ‏ 2? discussing the approach.

Now if I use the same data set to show the data in account level details, there is no need for forced ranking any more. Now I want to hide those dummy rows.

This is a simple task in SSRS. There is no need for a new data set. Create a new table in SSRS, using the same data set, with a filter.

image

The Expression for the table filter is this:

=IIF(IsNothing(Fields!DECISION_DT.Value),"HideRow","ShowRow")

Note: IsNothing is VB’s equivalent to IsNull in T-SQL.

With the above filter, only rows that have a DECISION_DT will be shown in the account level detail view, and all the dummy rows will not.

ETL Toolkit– Error Event Table and Audit Dimension

March 28, 2011 2 comments

Chapter 4 “Cleaning and Conforming” from Kimball’s The Data Warehouse ETL Toolkit takes about 50 pages, not an extremely long chapter. Here is a warning from the beginning of the chapter.

Please stay with us in this chapter. It is enormously important. This chapter
makes a serious effort to provide specific techniques and structure for an
often amorphous topic. The chapter is long, and you should probably read
it twice, but we think it will reward you with useful guidance for building
the data cleaning and conforming steps of your ETL system.

I have to confess that I read it twice, but never finished reading the entire chapter each time. That doesn’t mean that I totally ignored the data quality issue in my ETL design. Although I didn’t finish reading the entire chapter, I focused on these three sections:

image

Without the effort and hard work during the data profiling phase (or I called it data discovery phase), cleansing and auditing are not going anywhere. Here is an example from the IT Asset Management system from my previous project.

Network auto-discovery software are installed on servers. Multiple such software can be installed on the same server, collecting duplicate data or conflicting data. Only after lots of hard work of data profiling, we were able to create a strategy to resolve duplicate or conflicting data.

Looking at the Error event table schema suggested from the ETL Toolkit, it doesn’t not take long to come with a Screen Type of “Duplicate Composite Key” as a Screen Type for my data exception table. I called it data exception, instead of error event table.

By the way, the suggested schema here is a generic schema. I’d think that you can design your table any way you want to fit into your specific project.

image

As for the #2 Cleaning Deliverable “Audit Dimension”, my understanding is that it is just a summary from your data exception screens, with a score for each exception category. I’ve designed an audit table to summarize all the important measures for data exception and also for normal row counts, but I’ve never created scores for them. I guess I really didn’t know how to score them.

image

Why truncating a target table is a bad practice in ETL?

March 28, 2011 5 comments

I’ve done ETL design and development in both integration projects and in reporting environment.

In master data integration projects, it’s not hard to distinguish staging area from the target area. In the staging area, truncating tables with old data and loading new data is a pretty common practice. I’ have not seen any truncating tables in the target area. This is a good thing. It means that developers understand the role of staging, and understand that loading data into target area needs to be carefully designed, or “reconciled”, as a consultant called it. Depending on the business rules, we will end up either creating new records in the target, or updating existing records, or deleting obsolete records.

In the reporting environment, however, developers have very different attitude and very different understanding of staging VS. target.

This is a very typical architecture I’ve seen so far. The flaw in this design is that there is no target area, only staging. When developers truncate the table in the staging, the report shows empty data between the time when table is truncated and new data finished loading. Or even worse, the report hangs when the loading takes a long time to finish.

image    

You might argue the chance  of the coincidence that users run the report at the same time when the data is in the middle of refreshing. Based on my experience, if I hear about it once from the business users, then the chance is big, because the possibility for it to happen again is always there.

Well, the above is not the only type of architectural design I’ve seen. Here is another one. There is no physical staging area in this design. The report is accessing the data source directly with T-SQL code with OPENQUERY. The developer argued that this type of design will eliminate the need of maintaining the staging area on the SQL server. That part is true. But it does not eliminate the emails and phone calls we get from the business users when they are so frustrated because the report takes for ever to load the data. Reporting Services routinely report time out when trying to retrieve large amount of data via OPENQUERY.

image  

The third type of architectural design can avoid many of the problems of long-running reports, or reports with empty data. By separating target from staging, and not truncating the target tables, the reports will always look good. 

image

Back to the title of this post, “why truncating target tables is a bad practice in ETL”. I have to confess that I was one of those developers who made such mistakes until I received enough emails and phone calls from our business users.

There is more to discuss on how to move data from staging to target. Stay tuned.

Which one to choose – Foreach Loop Container or WHILE Loop in T-SQL

March 27, 2011 Leave a comment

I’ve written a post SSIS – Use “Foreach Loop Container” and pass a variable to a OLE DB Source in a Data Flow. I have to confess that I don’t use Foreach loop container very often. Actually I’ve only used it once in my entire SQL development life. I know how wonderful they are, and how much other developers like it. Deep down, I am a SQL developer. I still resort to T-SQL whenever I feel a bit dizzy thinking about SSIS packages.

I start to feel dizzy when I need to loop through the Foreach loop container more than 10 times. I think that’s my arbitrary limit. Looping through 10 days means connecting to the data sources 10 times. I bet Integration Services does some tricks on pooling connections. Still I don’t quite like it, especially when the data volume is large.

Here is what I would rather do. I’d bring all 10 days ( or even more, depending on the project) worth of data from my data source. The use WHILE loop in T-SQL to process data for each day.

You might wonder what would drive me to process data  for each day in a WHILE loop. That will be the topic for my next post.

SSIS – More secrets about package configuration

March 27, 2011 Leave a comment

I’ve written one post about SSIS – Enable Package Configuration for Deployment, and another one about  SSIS – Add Configuration Files for Deployment.

I am afraid that after reading these two posts you are quite confused. You are probably screaming silently. Why is she talking about environment variables, together with configuration files? Do we need both, or just one of them? Why did my job still fail after I changed my password to DB2? Questions, one after another…

So here I have more writing to do to clarify the confusion.

First of all, after reading those two posts, you already know:

1) how to create the environment variables,

2) how to create the configuration files in XML, and

3) how to enable package configurations in BIDS.

4) You also know how to point to a configuration file for your SQL job step when executing a SSIS package.

The first 3 “hows” are important as a SSIS developer. There are a few more”secretes” regarding package configuration you will also need to know as an experienced SSIS developer.

Here are those “secretes” (assuming that you have done the first 3 hows in BIDS).

1)  There are a couple of ways to test your package in BIDS. Here is one, by right-clicking on the package name in the Solution Explorer.

image

Have you ever thought about how you are connecting to your sources when you test you package in this manner?

Is the connection done through one of the configuration files?

image

Or by the login credential you entered in the Connection Manager?

image

The answer is by the package configuration files you’ve specified in your  environment variables.

If this is your habit to test your package locally, then you will need to maintain your own package configuration files with the correct user id and password (or through Integrated Security mode).

2) Well, the above is not my favorite way for testing locally. I very often test just parts of my package at a time by right-clicking on a sequence container. When I test in this manner, how is my package connecting to the sources?

The correct answer this time is through the connection manager. So you better make sure that you have the correct credentials for each of your connection manager.

3) so what’s the third secrete? The third one has to do with the SQL Agent job you’ve created to execute your package.  

Again, I assume that you have done all the right things in your BIDS locally, and have successfully uploaded your package to the Integration Services, and have created a job step to execute your package.

I will also assume that your server administrator has also done on the server exactly the same things you have done locally. 

The secrete is that you do nothing in the job step. After all the hard work you have put in, you deserve a break.

By nothing, I mean putting nothing on the Configurations tab.

image

I also mean doing nothing on the Data sources tab. Do not check any of the check boxes for the connection managers.

image

By doing nothing, your job step will use the environment variables on the server to point to the correct package configuration files (DTSConfig files we’ve created), which have the correct credentials to connect to the data sources.

4) Last secrete. This is a secrete I’d rather not know. Adding package configuration files for your job step, is only necessary if you want to overwrite the package configuration files defined by your environment variables. There are case where you want to do this, but I am hoping that you will only need to do this for very few of the packages you’ve designed.

image

I am hoping that this post will leave you feel better about package configuration files and package deployment.

Dimension Role-Playing in Employee Comp Reporting

March 17, 2011 Leave a comment

The employee hierarchy table is just one physical table, with a date range to keep employee’s employment history. This same employee hierarchy table can be used sometimes to show different business functional teams, for example, LAS VS. Underwriters on the same fact table. 

I can simply create two SQL views to handle this, or just use two sub-queries if I don’t want to create too many SQL views.

image

image

This is called dimension role-playing. Role-playing in a data warehouse occurs when a single dimension simultaneously appears several times in the same fact table. The underlying dimension exists as a single physical table, but each of the roles should be presented in a separated labeled view.  

Another common example in the employee productivity reporting, is the date dimension. We will just need one single date dimension table, but very often, we need to have a second date, such as application date, besides the primary decision date, on the same fact table. Again we only need to create two different SQL views to handle this, with only one underlying physical date dimension table.

Introduction to Employee Compensation and Productivity Reporting

March 16, 2011 Leave a comment

I came back to my old job, but am not working on exactly same projects. I am very happy that I have new puzzles to solve now.

Obviously employee compensation and productivity reporting has become a hot topic in the business recently.

I am going to blog on this subject as I get to understand more about the business needs.

After completing several ETL processes and reports so far, I gradually start to put all the puzzles together.

In its simplistic form, I can envision a subset of the data warehouse bus matrix. This matrix will revolve as I understand more.

image

The dimensions will involve hierarchies. All the facts basically are from a transactional table. But as we put together all the fact tables for reporting, we will get all three types of fact tables, i.e. the transactional fact tables, snapshot fact tables, and also accumulating fact tables. The third type will be the most common.

The natural granularity for our business transactional fact table will be one row for each action a team member does for each account. For the purpose of employee compensation and productivity reporting, the facts associated with the business process typically just include the counts of the actions, such as the counts all “declines”, or the counts of all “approvals”.

In the next blog I’ll talk about whether I need to normalize my fact tables by fact types.  

DB2 – User defined function to convert list to table ListToTable

March 13, 2011 Leave a comment

Continuing from my previous post SSIS – Pass a list of values to a data flow source.

Step 3: create a user defined function on DB2 to convert the list of the Status IDs into rows.

Now I need to have a function on DB2 to convert the list of the Status IDs into rows.

Actually I really need 2 user defined functions.

First, I need one to create a table to tell me where each comma starts and ends.

image

Run this query.

SELECT * FROM TABLE(mySchema.ListToTable_IDX(’1,101,203′)) AS A

Here is the result telling me where each comma starts and ends.

clip_image002

Second, I  need a function to return me back the actual values as rows in a table.

image

Run this query.

SELECT * FROM TABLE(mySchema.ListToTable(’1,101,203′)) AS A

Here is the result telling me what each value is as rows in a table.

clip_image002[4]

Here is an example how you can call the user defined function in DB2.

image

Now if I need to use this user defined function in a data source of a data flow step from my SSIS package, taking a parameter, I would write the query link this:

image

 

(The script)

SSIS – Pass a list of values to a data flow source

March 13, 2011 1 comment

I wrote a blog post SSIS – Use “Foreach Loop Container” and pass a variable to a OLE DB Source in a Data Flow a few days ago.

But when do we really need the Foreach Loop Container? I find that I do not use it very often. The reason is that I can only pass one at a time the value of the variable to the data source of the data flow inside the Foreach Loop Container. I don’t have a benchmark number of how much overhead that adds to my run time, but if I have 100 values to pass, I start to feel uneasy.

Scenario

I have a list of status IDs I maintain on the SQL server side. Status IDs is one of the dimensions for my fact table. I want to limit the number of records I bring down from our DB2 data warehouse using the list of the Status IDs.

1) Can I maintain the status ID dimension table on DB2? Yes, I can. But I prefer not to do that because by default all our development is supposed to be on SQL server.

2) Can I use MERGE in the data flow, merging my source from DB2 with this Status ID table on SQL Server? Not really. My query on the DB2 side uses several temp tables due to the complexity the query. Without physically staging those temp data, I cannot really use MERGE.

3) Can I create a package variable to hold a dynamic SQL query with a parameter? No. The Status IDs I need to use is not just one value, but a list of many values.

4) How about hard coding those Status IDs in my DB2 query. No. Thanks. No hard coding in this case.

Here is the solution I came up with, and so far it is working perfectly for me.

Step 1: create a SQL table to hold those Status IDs on the SQL Server side

Step 2: in SSIS package, create an Execute SQL Task to read in the Status ID values, and concatenate all the rows into a comma-delimited string, and store the string as a single value into a user defined variable

Step 3: create a user defined function on DB2 to convert the list of the Status IDs into rows. (see my next blog post on how I did that)

Step 4: create a data flow with a parameterized data source, and pass the user defined variable into the data source as the parameter.

More to come on this…

Define the grain of fact tables with unique key constraints

March 13, 2011 Leave a comment

I have been working with SQL Server, and Integration Services and Reporting Services for years. Data from our data warehouse in DB2 are ETL’d to the SQL server to be ready to be consumed by SSRS report. Until recently, I’ve only been using unique key constrains on the fact tables sparsely. I’ve started to do so on all the fact tables, and dimension tables now.

Out of three basic fact tables, transactional fact table, fact table for periodic snapshots, and for accumulating snapshots, we very often deal with the first type of transactional fact table, which holds data at the most detailed level. Generally speaking, this type of fact table will have the most number of dimensions associated with it. Fortunately, it’s not that many, in the type of fact tables I deal with. The most common grain of the fact tables I deal with involves account number, transaction date, employee login code, type of product, type of actions (this can grow to a few sometimes). If it’s an accumulating fact table, several milestone dates can also be part of the grain.

OK, back to the creation of the unique key constraints. Most of the unique key constraints will just be a composite key that is composed of all the dimensions in a grain. The grain of a fact table represents the most atomic level by which the facts may be uniquely defined. There might be also some dimensions (such as employee’s location) that do not contribute to the uniqueness of the fact records. Do not include those dimensions in the unique composite key.

Follow

Get every new post delivered to your Inbox.

Join 26 other followers