Skip to main content

Incremental Load - TSQL

For my first technical post in a short while I thought I would look at something I have worked on over the summer that I thought would be good for me to document and share. I needed to develop a relatively simple incremental load process for a very small web app that took the contents of a spreadsheet and loaded that spreadsheet into a SQL Server database table.

At this stage I have to point out that I learned many of the techniques I used here from reading Andy Leonards' (Blog | Twitter) Chapter in SQL Server MVP Deep Dives - The Autonomy of an Incremental Load.

I'll provide a brief overview of my process before I get down to the nitty gritty of the load itself.

I take the spreadsheet that holds the data I want to load and load it into a staging table in my database. This was done using SSIS.

I then compare the staging table with actual live table and decide if any updates need to be made.

If they are identical I simply send an email to the 'keepers of the spreadsheet' informing them.

If there are differences between the staging table and live then I want to:

Insert any new rows in staging but not in the destination table

Update any rows in the destination table where the keys match rows in both tables but expected date column is different.

If rows exist in the destination but don't exist in the staging table then it has been deleted from the spreadsheet and should be removed from the destination table.

The 'Keepers of the spreadsheet' should then be emailed informing them that the database table has been updated with the changes that they made.

Before we begin lets create a database, a staging table and a destination table for us to work from:

--CREATE Database
USE master
go

CREATE DATABASE IncLoad
GO

USE IncLoad
GO

--Create Staging Table
CREATE TABLE [dbo].[StagingTable]
(
[ID] INT NOT NULL,
[EXPECTEDCOMPLETION] [nvarchar](255) NOT NULL
)
ON [PRIMARY]

GO

--CREATE Source TABLE


CREATE TABLE [dbo].[DestTable]
(
[ID] [nvarchar](50) NOT NULL,
[EXPECTEDCOMPLETION] [nvarchar](255) NULL,
CONSTRAINT [PK_PER_PatentCalc] PRIMARY KEY CLUSTERED ( [ID] ASC )
WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON ) ON [PRIMARY]
)
ON [PRIMARY]

GO



 



We now have a database and a staging and destination table.



Next let's insert some rows into the staging table so I can demonstrate how new rows are inserted.



--Insert some rows into staging to demonstrate how the incremental load handles new records.


INSERT INTO dbo.StagingTable
(
ID,
ExpectedCompletion
)
VALUES ( 1, 'Today' )
INSERT INTO dbo.StagingTable
(
ID,
ExpectedCompletion
)
VALUES ( 2, 'Tomorrow' )
INSERT INTO dbo.StagingTable
(
ID,
ExpectedCompletion
)
VALUES ( 3, 'Next Week' )
INSERT INTO dbo.StagingTable
(
ID,
ExpectedCompletion
)
VALUES ( 4, 'Next Month' )
INSERT INTO dbo.StagingTable
(
ID,
ExpectedCompletion
)
VALUES ( 5, 'Within 6 Months' )
INSERT INTO dbo.StagingTable
(
ID,
ExpectedCompletion
)
VALUES ( 6, 'Next Year' )



 



So we now have data in the source that doesn't exist in the destination table. Running a simple SELECT * against both tables will show us this:



--SELECT * shows the differences between the tables
SELECT * FROM dbo.StagingTable
SELECT * FROM dbo.DestTable



 



Part of my requirement was to email the 'Keepers of the Spreadsheet' if the no updates were needed. I decided that it would be good to check if the incremental needs to be run by checking for differences in the staging and destination tables first. If there are differences then I run the incremental load, If there are no differences my script won't do anything but email the necessary people saying there were no changes made.



To do this I stumbled across a function that I have not previously used called CHECKSUM_AGG. Books Online says the following about this function:



CHECKSUM_AGG can be used to detect changes in a table.



The order of the rows in the table does not affect the result of CHECKSUM_AGG. Also, CHECKSUM_AGG functions may be used with the DISTINCT keyword and the GROUP BY clause.



If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change.





If you run the following script against the database tables and data created above you should get a printed message that says there are differences between the source and destination:



DECLARE @DestCheckSum int
DECLARE @SrcCheckSum int
--DECLARE @SCRCount INT

--SET Variable Values
SELECT @DestCheckSum = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM dbo.DestTable
SELECT @SrcCheckSum = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM dbo.StagingTable

IF (@DestCheckSum <> @SrcCheckSum OR @DestCheckSum IS NULL)
BEGIN
PRINT 'There are difference between source and destination tables'
END

ELSE
PRINT 'No changes to make'



 



We have run our check which tells us we need to run the load so what do we do, well if rows exist in the staging table but not in the Destination table we want to insert those rows into the destination table. So how do we do that with TSQL? The answer is an INSERT INTO, a SELECT and a LEFT JOIN.



INSERT  INTO dbo.DestTable
(
ID,
EXPECTEDCOMPLETION
)
SELECT s.ID,
s.ExpectedCompletion
FROM dbo.StagingTable s
LEFT JOIN dbo.DestTable d ON s.id = d.id
WHERE d.id IS NULL


We insert into the destination table the result of the select from the source table joined against the destination table with a left join. The left join means that all the rows in the left table (the source table) are returned regardless of whether a match is made on the join key (id) in the other table. If no record exists or is not known for that key in the destination table then a NULL value is returned in the left join result set for the rows in the destination table (the table in the right hand side of the JOIN), The WHERE clause



WHERE   d.id IS NULL



Ensure that the query only pulls out only rows that exist in the source table, as a NULL for the ID column are not allowed in the destination table we can safely assume that these only exist in the source table and should be inserted into the destination table.



If we run the insert above we get (6 row(s) affected) If we run it again straight after we get (0 row(s) affected) because the rows exist in both tables.



To confirm that your tables are now identical you can run the checksum_agg script above. I get a " No changes to make" message.



The next thing we need to do is update columns in the destination table that have the same key as rows in the staging table but the other field has a different value.



Lets update a row in the staging table, so we have the same record ID but different ExpectedCompletion date in the staging and destination tables.



UPDATE  dbo.StagingTable
SET EXPECTEDCOMPLETION = 'Within 6 Months'
WHERE id = 6



 



We then need to run the incremental load but this time not inserting new rows but update existing records where the ID exists but the expectedcompletion date is different between the staging table and the destination table.



This is achieved by running an update statement against the destination table which includes an inner join with the staging table on the ID fields. If the ID's match in both tables but the expected completion date is different, we update the destination table record (d) with the corresponding value for the same record in staging. Here's the TSQL:



--Update changed Records        
UPDATE d
SET d.ID = s.ID,
d.expectedcompletion = s.EXPECTEDCOMPLETION
FROM dbo.DestTable d
INNER JOIN dbo.StagingTable s ON s.ID = d.ID
WHERE s.EXPECTEDCOMPLETION != d.EXPECTEDCOMPLETION



 



We have shown how to load new records and we have shown how to update records that have changed, all that's left is to delete records that have been removed from the spreadsheet and hence needs to be removed from the destination table as part of the incremental load.



Again we will use our staging table and destination table as an example and we will delete a record from the staging table so I can then demonstrate how the incremental load handles deletes. In reality this is very similar to the insertion of new records using an outer join.



First up, lets delete a record from staging so we are in a situation where we need to remove a record from our destination table.




DELETE FROM dbo.StagingTable
WHERE id = 1



 



We now have a record in the destination table that no longer exists in the spreadsheet and staging table. How does the incremental load cope with this? The following TSQL show's how to delete the records from the destination table where the corresponding record in the staging table no longer exists:



--Delete removed records 
DELETE FROM dbo.DestTable
FROM dbo.DestTable d
LEFT JOIN dbo.StagingTable s ON d.ID = s.ID
WHERE s.id IS NULL



 



So there we have it, the components of my incremental load laid out, from inserting new records, updating changed records and deleting removed records. Andy Leonards' chapter in SQL Server Deep Dives also demonstrates how to perform this using SSIS. If you are interested in this I suggest that you check out the book and Andy's chapter.


Comments

Popular posts from this blog

SQL Server 2012 and Virtual Service Accounts

This post is written by David Postlethwaite
If you are using SQL Server 2012 you will probably have noticed that the default account for the SQL services has changed from that used in previous versions. With SQL 2005 and 2008 the default account for SQL service and SQL Agent service was “NT Authority\System”. This is one the built in accounts on a Windows machine, managed by the machine and selectable from a dedicated dropdown list

The Network Service account was introduced in Windows 2003 as an alternative to using the LocalSystem account, which has full local system privileges on the local machine, a major security concern.
The Network Service has limited local privileges easing these security concerns but when many services on a machine use the Network Service account it becomes harder to track which service is actually accessing resources and performing actions, because all the services are using the one Network Service account.
Also, this account, by default, has sysadmin per…

Always Encrypted

By David Postlethwaite

Always Encrypted is new features in SQL Server 2016 and it is also available in Azure SQL Database. Here you can encrypt columns in a table with a master key and a certificate so that they will appear as encrypted strings to those who don’t have the required certificate installed on their pc.
Once the certificate is installed on the computer then the unencrypted data can then be seen as normal.

The data passes from database to your application as the encrypted value, only the application with the correct certificate can unencrypt the data so it is secure across the wire. This will go some way to resolving the concern of people worried about putting their sensitive data on a shared server in the cloud such as Microsoft Azure and accessing the data across the Internet.

At the time of writing Always Encrypted is only supported with ADO.NET 4.6, JDBC 6.0 and ODBC 13.1 but expect other driver to become available.

The calling application (including SSMS) must also hav…

How to Setup Kerberos Correctly

David was in Copenhagen this weekend delivering his Kerberos talk Taming the Beast: Kerberos for the SQL DBA to SQL Saturday Denmark. I have had a quick chat with him via email since he got back and he said he had a great time. The event was very well attended with 280+ attendees and his talk was well attended.

I think David is planning submitting a few sessions to SQL Saturday events in Europe in the next few months so look out for him there and we'll keep you posted as to his whereabouts when schedules get finalised later in the year.

David has pre-recorded his Kerberos talk. You can watch on you tube and I have also embedded it in this post if you want to see what his kerberos talk  covers...



If we can help you with a SQL Sever problem visit our SQL Server Consulting page or contact us