Large scale heavy DML

Using T-SQL to Insert, Update, Delete Millions of Rows

Challenges of Large Scale DML using T-SQL

Using T-SQL to insert, update, or delete large amounts of data from a table will results in some unexpected difficulties if you’ve never taken it to task.

Let’s say you have a table in which you want to delete millions of records. If the goal was to remove all then we could simply use TRUNCATE. However, if we want to remove records which meet some certain criteria then executing something like this will cause more trouble that it is worth.

Why not run the following in a production environment?

  • Transaction log growth – expect a large transaction like this to cause the transaction log to expand. Calls to expand the database files are some of the most expensive operations SQL Server makes to the OS. Hopefully the initial sizing was good and the growth settings are reasonable.
  • Blocking – concurrency is going to increase the likelihood of lock escalation. That will lead to the possibility of excessive blocking.
  • Rollbacks – if the SPID is killed then the transaction will enter a rollback state. For very large transactions this can take a lot of time. Don’t even think about killing or stopping SQL Services during a rollback unless you want a corrupt database!

What can we do to improve this?

We follow an important maxim of computer science – break large problems down into smaller problems and solve them.

We break up the transaction into smaller batches to get the job done. Each of the above points can be relived in this manner. Let’s take a look at some examples.

How to Use T-SQL to Delete Millions of Rows

Combine the top operator with a while loop to specify the batches of tuples we will delete. Tracking progress will be easier by keeping track of iterations and either printing them or loading to a tracking table.

Let’s setup an example and work from simple to more complex.

Now let’s print some indication of progress.

A better way is to store progress in a table instead of printing to the screen.

Finally let’s incorporate dynamic SQL.

How to Use T-SQL to Insert Millions of Rows

Similar principles can be applied to inserting large amounts from one table to another. We cannot use the same code as above with just replacing the DELETE statement with an INSERT statement.

What do you see wrong here?

The problem is a logic error – we’ve created an infinite loop!

To avoid that we will need to keep track of what we are inserting. Here are a few examples:

Assume here we want to migrate a table – move the records in batches and delete from the source as we go.

How to Use T-SQL to Update Millions of Rows

We can employ similar logic to update large tables.

Concluding Thoughts

Breaking down one large transaction into many smaller ones can get job done with less impact to concurrency and the transaction log.

This is just a start. Consider what we left out:

  • T-SQL is not the only way to move data. We can also consider bcp, SSIS, C#, etc.
  • When going across linked servers the waits for network may be the greatest. Keep that in mind as you consider the right method.
  • Joins play a role – whether local or remote. Be mindful of foreign keys and referential integrity.
  • Indexes can make a difference in performance
  • Dynamic SQL and cursors can be useful if you need to iterate through sys.tables to perform operations on many tables

 


UPDATE – 11/3/18

I want to clarify some things about this post. I got good feedback from random internet strangers and want to make sure everyone understands this. But first…

WARNING! If you are not careful when batching you can actually make things worse!

WARNING! These are contrived examples meant to demonstrate a methodology. Please do NOT copy them and run in PROD!

WARNING! There is no “one size fits all” way to do this. Please test this and measure performance before running in PROD! Don’t just take code blindly from the web without understanding it.

Pointing out a couple things:

  • Be mindful of your indexes. Sometimes it can be better to drop indexes before large scale DML operations.
  • The INSERT piece – it is kind of a migration. You can use an output statement on the delete then insert. Can also move the row number calculation out of the loop so it is only executed once.
  • Here is a good example you should read to illustrate testing the performance of batches.

Like what you are reading? Tell your friends. Tell your foes. Please subscribe!

Yes I want to Subscribe!

Just enter your email below and you're part of the club.

4 thoughts on “Using T-SQL to Insert, Update, Delete Millions of Rows

  1. There is a bug in the batch update code. The line ‘update Colors’ should be ‘update cte’ in order for the code to work.

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.