In this post I want to go through the Basic operations with Clustered Columnstore Index (CCSI) and show their internal views:
Creation
Will create a test table with CCSI:Use TestDB; GO DROP TABLE IF EXISTS tbl_Test_Columnstore; GO CREATE TABLE tbl_Test_Columnstore( Row_Value INT NOT NULL, Row_Description NVARCHAR (50), Row_Type NVARCHAR(50), Row_KEY int, INDEX idx_Test_Columnstore CLUSTERED COLUMNSTORE ); GO
After table creation we immediately can see it's internals:
Script #1:
SELECT ix.name as Index_Name , ix.type_desc , c.Name as Column_Name , ic.is_included_column FROM sys.indexes as ix INNER JOIN sys.index_columns as ic ON ix.object_id = ic.object_id and ic.index_id = ix.index_id INNER JOIN sys.columns as c ON ix.object_id = c.object_id and ic.index_column_id = c.column_id WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore'); GO
Here we can see that ALL four columns of our table are "columnstored" and even though Columnstore Indexes (CSI) do not support "Included" columns, all of them marked as "included".
Along with CCSI itself, SQL Server creates "COLUMN_STORE_DELETE_BITMAP" for the new index:
Script #2:
SELECT ix.name as Index_Name , ip.internal_object_type_desc , ip.row_group_id , ip.rows , ip.data_compression_desc as Compression_Type FROM sys.indexes as ix INNER JOIN sys.internal_partitions as ip ON ix.object_id = ip.object_id WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore'); GO
"COLUMN_STORE_DELETE_BITMAP" - tracks deleted rows, obviously, number it is still zero.
Inserting rows into CSI Delta Store
At first will try to insert only one row into the tableScript #3:
DECLARE @i INT = 10 WHILE @i > 0 BEGIN INSERT INTO tbl_Test_Columnstore WITH (TABLOCK) SELECT @i, 'Row Description: ' + CAST(@i as varchar) , 'Row Type: ' + CAST(@i as varchar) , Row_KEY = @i % 10000000 ; SET @i -= 1; END GO
SQL Server created "Delta store" for our CCSI and inserted 10 new uncompressed rows.
Lets change variable @i to "100000" in Script #3 and run it again. Script #2 will show following:
As you can see all 100K rows also get into the Delta Store
If we try different method of inserting records we can get pretty interesting results:
Script #4:
DELETE FROM tbl_Test_Columnstore; GO ;WITH Pass0 as (select 1 as C union all select 1), Pass1 as (select 1 as C from Pass0 as A, Pass0 as B), Pass2 as (select 1 as C from Pass1 as A, Pass1 as B), Pass3 as (select 1 as C from Pass2 as A, Pass2 as B), Pass4 as (select 1 as C from Pass3 as A, Pass3 as B), Pass5 as (select 1 as C from Pass4 as A, Pass4 as B), DataSet as ( SELECT top 100000 F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT) FROM Pass5) INSERT INTO tbl_Test_Columnstore WITH (TABLOCK) SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 FROM DataSet; GO
(Run script #2 right after)
SQL Server calculated an amount of data to insert and decided to use Parallelism for CSI insertion.
Parallel CSI insert used all four CPUs and loaded data equally into 4 Delta Store groups.
Inserting rows into Column Store using Bulk Insert
At first, clean CSI:
DELETE FROM tbl_Test_Columnstore; GO
Accordingly to MSDN "Columnstore Indexes Data Loading" document, in order to avoid placing your data into the Delta Store and move them directly to Compressed storage you have to keep your batch size equal or bigger than 102,400 rows.
Lets test that. I've modified script #4 to insert 102400 records into CSI:
DELETE FROM tbl_Test_Columnstore GO ;WITH Pass0 as (select 1 as C union all select 1), Pass1 as (select 1 as C from Pass0 as A, Pass0 as B), Pass2 as (select 1 as C from Pass1 as A, Pass1 as B), Pass3 as (select 1 as C from Pass2 as A, Pass2 as B), Pass4 as (select 1 as C from Pass3 as A, Pass3 as B), Pass5 as (select 1 as C from Pass4 as A, Pass4 as B), DataSet as ( SELECT top 102400 F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT) FROM Pass5) INSERT INTO tbl_Test_Columnstore WITH (TABLOCK) SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 FROM DataSet; GO
As you can see, SQL Server split all records across your CPUs and as a result you still have your records in the Delta Store.
There are 2 ways to overcome that situation:
1. Make batch size bigger than number of CPUs used by your SQL Server multiplied by 102,400. In my case that number is 4 x 102,400 = 409,600;
2. Lover degree of parallelism. Yes if you want to load data into smaller chunks the process will be slower.
So, lets test the second scenario. Delete all records again from "tbl_Test_Columnstore" table and run following modified script:
Script #5:
;WITH Pass0 as (select 1 as C union all select 1), Pass1 as (select 1 as C from Pass0 as A, Pass0 as B), Pass2 as (select 1 as C from Pass1 as A, Pass1 as B), Pass3 as (select 1 as C from Pass2 as A, Pass2 as B), Pass4 as (select 1 as C from Pass3 as A, Pass3 as B), Pass5 as (select 1 as C from Pass4 as A, Pass4 as B), DataSet as ( SELECT top 102400 F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT) FROM Pass5) INSERT INTO tbl_Test_Columnstore WITH (TABLOCK) SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 FROM DataSet OPTION (MAXDOP 1); GO
Now we expect to have some data in our Columnstore. So, we need a new script to report it:
Script #6:
SELECT ix.name as Index_Name , CASE WHEN gps.delta_store_hobt_id Is Null THEN 'Column Store' ELSE 'Delta Store' END Group_Type , gps.state_desc , gps.total_rows , gps.deleted_rows , gps.row_group_id , gps.size_in_bytes FROM sys.indexes as ix INNER JOIN sys.dm_db_column_store_row_group_physical_stats as gps ON ix.object_id = gps.object_id and ix.index_id = gps.index_id WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore'); GO
As you can see, this time all 102,400 rows went directly to the Columnstore. Also, you can notice that actual columnstore uses almost 10 times less space than the same data in Delta Store.
Deleting rows from CSI
Will use very simple script to delete rows:DELETE TOP (1000) FROM tbl_Test_Columnstore; GO
All 1000 rows were deleted from the Columnstore!
Lets delete more and run script #6 again:
DELETE TOP (101400) FROM tbl_Test_Columnstore; GO
Continue deleting:
DELETE TOP (102400) FROM tbl_Test_Columnstore; GO
Try to re-insert records using script #5 and see what will happen:
SQL Server Created New Columnstore row group to keep newly inserted records.
Updating data in CSI
Will do very simple update and see what happen:UPDATE tbl_Test_Columnstore SET Row_Description = 'New ' + Row_Description; GO
Moving Data from Delta Store to Column Store
That is pretty simple "ALTER INDEX" operation:ALTER INDEX idx_Test_Columnstore ON tbl_Test_Columnstore REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON); GO
As you can see, data from Delta Store row group #3 were moved to Column Store row group #6 and row groups 3,4, & 5 were marked for system deletion by "TOMBSTONE".
Some time later, marked row groups disappear:
Merging two Column Stores
In order to generate new Column Store will run script #5 again:In order to merge two small row groups into a larger one we can just use reorganize without any parameters:
ALTER INDEX idx_Test_Columnstore ON tbl_Test_Columnstore REORGANIZE; GO
Reporting CSI columns' space allocation
Script #7:SELECT ix.name as Index_NAme , c.name as Column_Name , css.segment_id , css.row_count , css.on_disk_size FROM sys.indexes as ix INNER JOIN sys.index_columns as ic ON ix.object_id = ic.object_id and ic.index_id = ix.index_id INNER JOIN sys.columns as c ON c.object_id = ix.object_id and ic.column_id = c.column_id INNER JOIN sys.partitions as p ON p.object_id = ix.object_id and p.index_id = ix.index_id INNER JOIN sys.column_store_segments as css ON css.hobt_id = p.hobt_id and ic.column_id = css.column_id WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore'); GO
If you sum the size of all columns you'll get the same number as the one, given by script #6.