
Thursday, April 5, 2018

Sort SQL Server tables into similarly sized buckets

You need to do something to all of the tables in SQL Server. That something can be anything: reindex/reorg, export the data, perform some other maintenance; it really doesn't matter. What does matter is that you'd like to get it done sooner rather than later. If time is no consideration, then you'd likely just do one table at a time until you've done them all. Sometimes, though, a maximum degree of parallelization of one is less than ideal. You're paying for more than one processor core, so you might as well use them. The devil in splitting a workload out is ensuring the tasks are well balanced. When I'm staging data in SSIS, I often use row count as an approximation for time cost. It's not perfect: a million-row table that's 430 columns wide might actually take longer than a 250 million row key-value table.

A sincere tip of the hat to Daniel Hutmacher (b|t) for his answer on this StackExchange post. He has some great logic for sorting tables into approximately equally sized bins, and it performs reasonably well.
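His approach deals the tables out round-robin by size (ROW_NUMBER() OVER (ORDER BY RowsCounted) % @bucketCount) and then keeps swapping a table from a heavier-than-average group with one from a lighter-than-average group until no swap improves the balance. If the modulo round-robin isn't obvious, here's a throwaway illustration of mine (not part of the script below) showing how the group numbers cycle when there are 6 buckets:

-- Standalone illustration only: row numbers 1 through 8 modulo 6
-- yield group numbers 1, 2, 3, 4, 5, 0, 1, 2 - a round-robin deal.
SELECT
    v.n AS RowNumberValue
,   v.n % 6 AS GroupNumber
FROM
    (VALUES (1), (2), (3), (4), (5), (6), (7), (8)) AS v(n);

With that out of the way, on to the script itself.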

SET NOCOUNT ON;
DECLARE
    @bucketCount tinyint = 6;

IF OBJECT_ID('tempdb..#work') IS NOT NULL
BEGIN
    DROP TABLE #work;
END

CREATE TABLE #work (
    _row    int IDENTITY(1, 1) NOT NULL,
    [SchemaName] sysname,
    [TableName] sysname,
    [RowsCounted]  bigint NOT NULL,
    GroupNumber     int NOT NULL,
    moved   tinyint NOT NULL,
    PRIMARY KEY CLUSTERED ([RowsCounted], _row)
);

-- Gather the current row count for every table and deal the tables out,
-- smallest to largest, round-robin across @bucketCount groups
WITH cte AS (
SELECT B.RowsCounted
,   B.SchemaName
,   B.TableName
    FROM
    (
        SELECT
            s.[Name] as [SchemaName]
        ,   t.[name] as [TableName]
        ,   SUM(p.rows) as [RowsCounted]
        FROM
            sys.schemas s
            LEFT OUTER JOIN 
                sys.tables t
                ON s.schema_id = t.schema_id
            LEFT OUTER JOIN 
                sys.partitions p
                ON t.object_id = p.object_id
            LEFT OUTER JOIN  
                sys.allocation_units a
                ON p.partition_id = a.container_id
        WHERE
            p.index_id IN (0,1)
            AND p.rows IS NOT NULL
            AND a.type = 1
        GROUP BY 
            s.[Name]
        ,   t.[name]
    ) B
)

INSERT INTO #work ([RowsCounted], SchemaName, TableName, GroupNumber, moved)
SELECT [RowsCounted], SchemaName, TableName, ROW_NUMBER() OVER (ORDER BY [RowsCounted]) % @bucketCount AS GroupNumber, 0
FROM cte;


-- Keep swapping one table from an over-full group with one from an under-full group
-- until no swap improves the balance (rows already moved @bucketCount times are left alone)
WHILE (@@ROWCOUNT != 0)
WITH cte AS
(
    SELECT
        *
    ,   SUM(RowsCounted) OVER (PARTITION BY GroupNumber) - SUM(RowsCounted) OVER (PARTITION BY (SELECT NULL)) / @bucketCount AS _GroupNumberoffset
    FROM
        #work
)
UPDATE
    w
SET
    w.GroupNumber = (CASE w._row
                 WHEN x._pos_row THEN x._neg_GroupNumber
                 ELSE x._pos_GroupNumber
             END
            )
,   w.moved = w.moved + 1
FROM
    #work AS w
    INNER JOIN
    (
        SELECT TOP 1
            pos._row AS _pos_row
        ,   pos.GroupNumber AS _pos_GroupNumber
        ,   neg._row AS _neg_row
        ,   neg.GroupNumber AS _neg_GroupNumber
        FROM
            cte AS pos
            INNER JOIN
                cte AS neg
                ON pos._GroupNumberoffset > 0
                   AND neg._GroupNumberoffset < 0
                   AND
            --- To prevent infinite recursion:
            pos.moved < @bucketCount
                   AND neg.moved < @bucketCount
        WHERE --- must improve positive side's offset:
            ABS(pos._GroupNumberoffset - pos.RowsCounted + neg.RowsCounted) <= pos._GroupNumberoffset
            AND
            --- must improve negative side's offset:
            ABS(neg._GroupNumberoffset - neg.RowsCounted + pos.RowsCounted) <= ABS(neg._GroupNumberoffset)
        --- Largest changes first:
        ORDER BY
            ABS(pos.RowsCounted - neg.RowsCounted) DESC
    ) AS x
    ON w._row IN
       (
           x._pos_row
       ,   x._neg_row
       );
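Before digging into the distribution, I like a quick sanity check on how the balancing went. This bit is my own addition rather than part of Daniel's script, and it assumes it runs in the same batch so that @bucketCount is still in scope: it shows how far each group's total sits from the ideal share (total rows divided by @bucketCount) and how much shuffling its members went through.

-- How far is each group from the ideal share, and how many moves did it take?
SELECT
    w.GroupNumber
,   SUM(w.RowsCounted) AS GroupTotalRows
,   SUM(w.RowsCounted) - (SELECT SUM(RowsCounted) FROM #work) / @bucketCount AS OffsetFromIdeal
,   SUM(w.moved) AS TotalMoves
FROM
    #work AS w
GROUP BY
    w.GroupNumber
ORDER BY
    w.GroupNumber;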

Now what? Let's look at the results. Run the script against AdventureWorks and AdventureWorksDW, then use the following two queries to see how the groups shook out.

SELECT
    W.GroupNumber
,   COUNT_BIG(1) AS TotalTables
,   SUM(W.RowsCounted) AS GroupTotalRows
FROM
    #work AS W
GROUP BY
    W.GroupNumber
ORDER BY
    W.GroupNumber;


SELECT
    W.GroupNumber
,   W.SchemaName
,   W.TableName
,   W.RowsCounted
,   COUNT_BIG(1) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS TotalTables
,   SUM(W.RowsCounted) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS GroupTotalRows
FROM
    #work AS W
ORDER BY
    W.GroupNumber;

For AdventureWorks (2014), I get a nice distribution across my 6 groups: 12 to 13 tables in each bucket and a total row count between 125,777 and 128,003. That's less than a 2% variance between the high and the low - I'll take it.

If you rerun it against AdventureWorksDW, things get a little more interesting. Our 6 groups are again filled with 5 to 6 tables each, but this time group 1 is heavily skewed by the fact that FactProductInventory accounts for 73% of all the rows in the entire database. The other 5 tables in that group are the five smallest tables in the database.

I then ran this against our data warehouse-like environment. We had 1,206 tables in there totaling 3,283,983,766 rows (roughly 3.3 billion). The query went from instantaneous to about 15 minutes, but now I've got a starting point for bucketing my tables into similarly sized groups.
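In case it helps, here's roughly how I'd picture consuming a bucket once it's built. This is a hypothetical sketch of mine, not part of the script above: it generates an index reorganize command for every table in group 0, and you'd substitute whatever your "something" actually is.

-- Hypothetical consumer: build one command per table in bucket 0.
-- Swap the ALTER INDEX ... REORGANIZE for whatever work needs doing.
SELECT
    N'ALTER INDEX ALL ON '
    + QUOTENAME(w.SchemaName)
    + N'.'
    + QUOTENAME(w.TableName)
    + N' REORGANIZE;' AS MaintenanceCommand
FROM
    #work AS w
WHERE
    w.GroupNumber = 0;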

What do you think? How do you plan to use this? Do you have a different approach for figuring this out? I looked at R, but without knowing what this activity is called, I couldn't find a function to perform the calculations.
