SSIS #114 – How many ways can you do INSERT-UPDATE?
For the purpose of this blog, my answer is 4.
I’ve always wanted to do a benchmark comparison so I can proudly say that my “normal” way is the best in terms of run time.
- STAGE-TSQL: Use the data flow to bring the raw data into staging, then do the INSERT-UPDATE in TSQL. This is my “normal” way, and the best way according to my testing. The name STAGE-TSQL implies: 1) two steps are involved, 2) raw data is staged first, 3) the INSERT and UPDATE are done in TSQL only.
- UPDATE-ALL: Only one step is involved. The INSERT-UPDATE is done in a single data flow: 1) the OLE DB Command transformation is used for the UPDATE, 2) the OLE DB Destination is used for the INSERT.
- UPDATE-STAGE: Two steps are involved. 1) The INSERT is done in the data flow step, 2) but the matching rows are saved to a staging table, and the UPDATE is done in TSQL against those matching rows.
- UPDATE-DELTA: Similar to number 2, UPDATE-ALL. Instead of sending all matching rows directly to the OLE DB Command transformation for the UPDATE, a Script Component transformation is used to determine which matching rows have actually changed. Data is sent to the OLE DB Command transformation only when a matching row contains changes.
Here is the Control Flow.
There is nothing exciting about the Data Flow. All it does is bring the source data into a staging table.
The TARGET_TSQL task does all the wonderful work of INSERT-UPDATE. TARGET_TSQL is the winner with a run time of only 3 seconds and 885 milliseconds.
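As a minimal sketch, the TSQL inside a task like TARGET_TSQL usually follows an UPDATE-then-INSERT pattern against the staging table. The table names (dbo.STG_MyChangingTable, dbo.TARGET_MyChangingTable) and columns (ID, Col1, Col2) here are placeholders I made up for illustration, not the post’s actual schema.

```sql
-- Hypothetical schema: STG_MyChangingTable is the staging table loaded by the
-- data flow; TARGET_MyChangingTable is the destination. Columns are placeholders.

-- Step 1: UPDATE the rows that already exist in the target
UPDATE t
SET    t.Col1 = s.Col1,
       t.Col2 = s.Col2
FROM   dbo.TARGET_MyChangingTable AS t
JOIN   dbo.STG_MyChangingTable    AS s ON s.ID = t.ID;

-- Step 2: INSERT the rows that do not exist in the target yet
INSERT INTO dbo.TARGET_MyChangingTable (ID, Col1, Col2)
SELECT s.ID, s.Col1, s.Col2
FROM   dbo.STG_MyChangingTable AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.TARGET_MyChangingTable AS t
                   WHERE t.ID = s.ID);
```

A single MERGE statement could express the same logic; the two-statement form above is the more conservative choice and is easy to reason about.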
The Control Flow is very simple. Only one Data flow is used. No truncating staging tables because no staging tables are used.
But the Data Flow step has two more transformation components: Lookup and OLE DB Command. The Lookup transformation is used to split the rows into a No Match Output and a Match Output. The No Match Output contains the rows that will be inserted, and the Match Output contains the rows that will be updated.
This seems to be an elegant solution, but the run time is the worst: almost 3 minutes, compared to only about 4 seconds for STAGE-TSQL.
The Transformation OLE DB Command is to blame for the poor performance here.
Since I blamed the OLE DB Command transformation for the poor performance, I am going to swap it out for a staging table and finish the UPDATE in a TSQL task.
In the Data Flow step, the matching rows are no longer sent to the OLE DB Command transformation; instead, they are saved into a staging table. The actual UPDATE is done in the TSQL task TARGET_MyChangingTable_UPD.
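The point of this swap is that one set-based UPDATE replaces the row-by-row calls the OLE DB Command transformation makes. A sketch of what a task like TARGET_MyChangingTable_UPD might run, assuming a made-up staging table dbo.STG_MatchedRows holding the Match Output rows (the real table and column names are not given in the post):

```sql
-- Hypothetical names: STG_MatchedRows holds the Match Output rows written by
-- the data flow; TARGET_MyChangingTable is the destination table.
-- One set-based UPDATE instead of one statement per row.
UPDATE t
SET    t.Col1 = s.Col1,
       t.Col2 = s.Col2
FROM   dbo.TARGET_MyChangingTable AS t
JOIN   dbo.STG_MatchedRows        AS s ON s.ID = t.ID;
```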
The run time has improved from almost 3 minutes to only 8 seconds and 291 milliseconds.
Here is another way to improve that 3 minutes run time.
The Control Flow looks the same, but I’ve added a Script Component (SC) transformation. The SC is used to create two outputs: only the rows with changed data are sent to the UpdateRow path; the rest are sent to a Row Sampling transformation and ignored. Adding the SC has indeed improved the run time from almost 3 minutes to only 7 seconds and 494 milliseconds.
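The Script Component itself is written in C# or VB.NET and its code is not shown in the post. As a hedged T-SQL analogue of the same changed-rows-only idea, the filter it performs amounts to a NULL-safe column comparison like the one below (table and column names are again placeholders, not the actual schema):

```sql
-- Only touch rows whose values actually differ from the target.
-- The WHERE clause is a NULL-safe "has this row changed?" check for one column;
-- a real table would repeat the pattern for each compared column.
UPDATE t
SET    t.Col1 = s.Col1
FROM   dbo.TARGET_MyChangingTable AS t
JOIN   dbo.STG_MatchedRows        AS s ON s.ID = t.ID
WHERE  t.Col1 <> s.Col1
   OR (t.Col1 IS NULL AND s.Col1 IS NOT NULL)
   OR (t.Col1 IS NOT NULL AND s.Col1 IS NULL);
```

Either way, the win comes from the same source: fewer rows ever reach the slow row-by-row OLE DB Command transformation.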
5. STAGE-TSQL is the winner!
It’s a relief knowing that our skills in TSQL are serving us well.
- My testing involved only INSERT-UPDATE. No DELETE is done in the data flow or in TSQL.
- I’ve made sure that all four runs are against the same Source data and Target data, so the run time comparison is valid.
- I don’t think the data size matters much in my testing. No data size is given.
- Using staging tables means more disk space is needed. Most teams would prefer a faster run time over saving disk space for their ETL processes.