How to (efficiently) update millions of records in a SQL table

You need to update one or more columns in a table with millions of records with data from another table, what’s a good way of going this with low risk and in an optimized way? One of my favorite ways of dealing with millions of records in a table is processing inserts, deletes, or updates in batches.

Updating data in batches of 10,000 records at a time and using a transaction is a simple and efficient way of performing updates on millions of records. The key is to batch out your data to allow you to process a smaller number of records at a time.

Example:

use DatabaseName
go

DECLARE @batchId INT
DECLARE @batchSize INT
DECLARE @results INT

SET @results = 1
SET @batchSize = 10000
SET @batchId = 0

-- when 0 rows returned, exit the loop
WHILE (@results > 0)
	BEGIN
		BEGIN TRAN;

		UPDATE Table1 SET columnA = Table2.columnA
		FROM Table2
		INNER JOIN Table1 ON Table2.Id = Table1.Id
		WHERE Table1.columnA is null
		AND (Table2.Id > @batchId
		AND Table2.Id <= @batchId + @batchSize)

		SET @results = @@ROWCOUNT
	
		-- next batch
		SET @batchId = @batchId + @batchSize

		COMMIT TRAN;
	END

-- the sql below is just to measure the performance of this update, it is not needed to update your data.

DECLARE @startTime DATETIME

SELECT  @startTime = GETDATE()
SET STATISTICS PROFILE ON
SELECT  COUNT(*)
FROM Table1 first OPTION (RECOMPILE)
SET STATISTICS PROFILE OFF
SELECT  'It took ' + CONVERT(VARCHAR(8), DATEDIFF(ms, @startTime, GETDATE())) + ' ms'
GO

In the query above we declare the following variables to do the batching:

  • @results – this variable will hold the number of records updated; when zero, the query will stop. This is to avoid an infinite loop.
  • @batchId – this is set to zero initially, and it is used to compare the table id against it, and after each update, it is set to the id plus the batch size. This allows splitting the millions of records into batches.
  • @batchSize – the number of records to update at a time.

It is important to know that the above algorithm will only work when your table’s primary keys are of type int. If your table’s primary keys are guids, the above will not work.

The section at the end of the query is only used to review the performance of your updates, it is not necessary to update the data and it is a nice way to see how the query performs overall.

I hope the information in this post is helpful, and it helps you learn a simple way to update millions of records in batches, making it easier and lower risk than attempting to do it all at once.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.