Wednesday, June 22, 2016

Basics of Clustered Columnstore Index in SQL Server 2016


In this post I want to walk through the basic operations on a Clustered Columnstore Index (CCSI) and show their internals:

Creation

First, we will create a test table with a CCSI:

Use TestDB;
GO
DROP TABLE IF EXISTS tbl_Test_Columnstore;
GO
CREATE TABLE tbl_Test_Columnstore(  
     Row_Value INT NOT NULL,  
     Row_Description NVARCHAR (50),  
     Row_Type NVARCHAR(50),  
     Row_KEY int,  
     INDEX idx_Test_Columnstore CLUSTERED COLUMNSTORE  
);
GO

After creating the table, we can immediately see its internals:
Script #1:
SELECT ix.name as Index_Name
 , ix.type_desc 
 , c.Name as Column_Name
 , ic.is_included_column
FROM sys.indexes as ix
INNER JOIN sys.index_columns as ic
ON ix.object_id = ic.object_id and ic.index_id = ix.index_id
INNER JOIN sys.columns as c
ON ix.object_id = c.object_id and ic.column_id = c.column_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO

Here we can see that ALL four columns of our table are "columnstored" and, even though Columnstore Indexes (CSI) do not support "included" columns, all of them are marked as "included".

Along with the CCSI itself, SQL Server creates a "COLUMN_STORE_DELETE_BITMAP" for the new index:
Script #2:
SELECT ix.name as Index_Name
 , ip.internal_object_type_desc
 , ip.row_group_id
 , ip.rows
 , ip.data_compression_desc as Compression_Type
FROM sys.indexes as ix
INNER JOIN sys.internal_partitions as ip
ON ix.object_id = ip.object_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO

"COLUMN_STORE_DELETE_BITMAP" tracks deleted rows; since nothing has been deleted yet, its row count is still zero.

Inserting rows into the CSI Delta Store

First, we will insert rows one at a time (ten single-row inserts in a loop):
Script #3:
DECLARE @i INT = 10
WHILE @i > 0
BEGIN 
 INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
 SELECT @i, 'Row Description: ' + CAST(@i as varchar)
  , 'Row Type: ' + CAST(@i as varchar)
  , Row_KEY = @i % 10000000 ;
 SET @i -= 1;
END
GO
Then immediately run Script #2 again and get the following result:

SQL Server created "Delta store" for our CCSI and inserted 10 new uncompressed rows.

Let's change the variable @i to 100,000 in Script #3 and run it again. Script #2 will show the following:
As you can see, all 100K rows also went into the Delta Store.

If we try a different method of inserting records, we get pretty interesting results:
Script #4:

DELETE FROM tbl_Test_Columnstore; 
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 100000
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet;
GO

(Run script #2 right after)


SQL Server estimated the amount of data to insert and decided to use parallelism for the CSI insert.
The parallel CSI insert used all four CPUs and loaded the data evenly into 4 Delta Store row groups.

Inserting rows into the Column Store using Bulk Insert


First, clean out the CSI:

DELETE FROM tbl_Test_Columnstore; 
GO

According to the MSDN "Columnstore Indexes Data Loading" document, to bypass the Delta Store and load data directly into compressed storage you have to keep your batch size at or above 102,400 rows.
Let's test that. I've modified Script #4 to insert 102,400 records into the CSI:

DELETE FROM tbl_Test_Columnstore 
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 102400
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet;
GO
Run Script #2 after that:

As you can see, SQL Server split the records across your CPUs, and as a result the rows still landed in the Delta Store.
There are 2 ways to overcome that situation:
1. Make the batch size bigger than the number of CPUs used by your SQL Server multiplied by 102,400. In my case that number is 4 x 102,400 = 409,600;
2. Lower the degree of parallelism. Yes, if you want to load data in smaller chunks, the process will be slower.
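The first option can be sketched by raising the TOP in Script #4 to the DOP multiple (a sketch only; the 409,600 figure assumes 4 CPUs, so adjust it for your machine):

DELETE FROM tbl_Test_Columnstore;
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 409600 -- 4 CPUs x 102,400 rows: each parallel stream gets a full batch
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet;
GO

With a batch that size, each of the four parallel streams receives at least 102,400 rows, so all of them should compress directly instead of landing in the Delta Store.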
So, let's test the second scenario. Delete all records from the "tbl_Test_Columnstore" table again and run the following modified script:
Script #5:
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 102400
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet
OPTION (MAXDOP 1);
GO

Now we expect to have some data in the compressed Columnstore, so we need a new script to report it:

Script #6:
SELECT ix.name as Index_Name
 , CASE WHEN gps.delta_store_hobt_id  Is Null 
  THEN 'Column Store' ELSE 'Delta Store' END Group_Type
 , gps.state_desc
 , gps.total_rows
 , gps.deleted_rows
 , gps.row_group_id
 , gps.size_in_bytes
FROM sys.indexes as ix
INNER JOIN sys.dm_db_column_store_row_group_physical_stats as gps
ON ix.object_id = gps.object_id and ix.index_id = gps.index_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO
Here is the result:

As you can see, this time all 102,400 rows went directly into the Columnstore. You can also notice that the compressed columnstore uses almost 10 times less space than the same data took in the Delta Store.

Deleting rows from CSI

We will use a very simple script to delete rows:
DELETE TOP (1000) FROM tbl_Test_Columnstore;
GO
After running Script #6 we notice the following:
All 1000 rows were deleted from the Columnstore!


Lets delete more and run script #6 again:
DELETE TOP (101400) FROM tbl_Test_Columnstore;
GO
We have now deleted 102,400 rows in total, and all were marked as deleted in the Columnstore, without touching the Delta Store. Notice that the size of the CSI has not changed.

Continue deleting:
DELETE TOP (102400) FROM tbl_Test_Columnstore;
GO
Now you can see that the Delta Store has significantly decreased in size.

Re-insert records using Script #5 and see what happens:

SQL Server created a new Columnstore row group to hold the newly inserted records.

Updating data in CSI

We will do a very simple update and see what happens:
UPDATE tbl_Test_Columnstore 
SET Row_Description = 'New ' + Row_Description;
GO
All rows were marked as deleted and re-inserted into the Delta Store. I would guess it was a single-threaded operation.

Moving Data from Delta Store to Column Store

That is pretty simple "ALTER INDEX" operation:
ALTER INDEX idx_Test_Columnstore ON tbl_Test_Columnstore   
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
GO
After the data are moved out of the Delta Store, it looks like this:

As you can see, the data from Delta Store row group #3 were moved to Column Store row group #6, and row groups 3, 4, and 5 were marked for system deletion as "TOMBSTONE".

Some time later, the marked row groups disappear:

Merging two Column Stores

To generate a new Column Store row group, we will run Script #5 again:

To merge two small row groups into a larger one, we can just use REORGANIZE without any parameters:
ALTER INDEX idx_Test_Columnstore ON tbl_Test_Columnstore REORGANIZE;
GO
It merged the two row groups and marked the originals for deletion.

Reporting CSI columns' space allocation

Script #7:
SELECT ix.name as Index_Name
 , c.name as Column_Name
 , css.segment_id
 , css.row_count
 , css.on_disk_size
FROM sys.indexes as ix
INNER JOIN sys.index_columns as ic
ON ix.object_id = ic.object_id and ic.index_id = ix.index_id
INNER JOIN sys.columns as c
ON c.object_id = ix.object_id and ic.column_id = c.column_id
INNER JOIN sys.partitions as p 
ON p.object_id = ix.object_id and p.index_id = ix.index_id
INNER JOIN sys.column_store_segments as css
ON css.hobt_id = p.hobt_id and ic.column_id = css.column_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO
That script reports the space allocation within the CSI by individual column within individual row groups.
If you sum the sizes of all columns, you'll get the same number as the one given by Script #6.


Sunday, June 19, 2016

Most of what you need to know about Columnstore Indexes in SQL Server 2016

In SQL Server 2016, Columnstore Indexes have really matured, and it is time to start using them.
To start using something, you need simple guidance, a map which shows you The Big Picture.

And here is a little "pocket" guide with all the major highlights:

Columnstore Indexes in SQL Server 2016.

Description

Columnstore Indexes (CSI) are data structures which store data organized not by rows, but by columns.
They are not very good for Index Seek operations (OLTP), but very good for Index Scans (OLAP).
CSI are modifiable, but CRUD operations behave differently:
  • Bulk Insert: when inserting 102,400 rows or more, data go directly to the Columnstore;
  • Insert: when the number of inserted rows is less than 102,400, new rows are inserted into the Delta Store;
  • Delete: rows are not actually deleted from the CSI, they are just marked as deleted;
  • Update: rows are not actually updated. Updated rows are marked as deleted and then new, modified rows are inserted into the Delta Store;

Categorizing

  • Updatable CSI can be Clustered and Non-Clustered;
  • CSI can be built on a Disk-Based table and on In-Memory optimized table.

Main Features

  • It can be Filtered;
  • Can include all or just specified columns;
  • For Disk-Based tables it can have Compression Delay (in Minutes). Good for OLTP, might decrease fragmentation.
  • Max degree of parallelism can be controlled for the index usage;
  • CSI can be partitioned;
  • “COLUMNSTORE” is the default compression level. Old data can be further compressed with “COLUMNSTORE_ARCHIVE”;
  • Can be created on a temporary table;
  • CSI can be disabled or rebuilt by “ALTER INDEX” command;
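
Several of these features can be combined in one statement. The sketch below assumes a hypothetical dbo.tbl_Orders table (the table, column, and index names are made up for illustration):

-- Filtered non-clustered CSI with a compression delay, per the 2016 syntax
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Orders_Closed
ON dbo.tbl_Orders (Order_Date, Order_Amount)
WHERE Order_Status = 'Closed'
WITH (COMPRESSION_DELAY = 10); -- keep hot rows uncompressed for 10 minutes
GO
-- Re-compress old data with the archive algorithm
ALTER INDEX ncci_Orders_Closed ON dbo.tbl_Orders
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);
GO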

Limitations

  • Can have no more than 1024 columns;
  • Constraints are not included into non-clustered CSI;
  • Can be created only on a table, not on a view;
  • CSI can’t be Altered. For any changes it has to be re-created;
  • CSI can’t have Included columns;
  • CSI can’t have Sparse columns;
  • CSI are sorted by the compression algorithm, not by ASC/DESC;
  • CSI can’t be combined with Page/Row compression, Replication, Filestream;

CSI Timeline

SQL Server 2012

  • Read-only non-clustered CSI;

SQL Server 2014

  • Updateable clustered CSI;

SQL Server 2016

  • Updateable non-clustered CSI;
  • Btree index on a clustered CSI;
  • Snapshot isolation and read-committed snapshot isolation;
  • CSI on a memory-optimized table;
  • CSI supports filtering;

Reference Links

Columnstore Indexes Guide :

https://msdn.microsoft.com/en-us/library/gg492088.aspx?f=255&MSPPError=-2147217396

Columnstore Indexes Data Loading

https://msdn.microsoft.com/en-us/library/dn935008.aspx?f=255&MSPPError=-2147217396

Get started with Columnstore for real time operational analytics

https://msdn.microsoft.com/en-us/library/dn817827.aspx?f=255&MSPPError=-2147217396

Columnstore Indexes Defragmentation

https://msdn.microsoft.com/en-us/library/dn935013.aspx

Columnstore Indexes Versioned Feature Summary

https://msdn.microsoft.com/en-us/library/dn934994.aspx?f=255&MSPPError=-2147217396

Columnstore Indexes Query Performance

https://msdn.microsoft.com/en-us/library/dn935005.aspx?f=255&MSPPError=-2147217396

Saturday, June 18, 2016

Usage of Sparse columns

I found an interesting SQL Server feature, "Sparse columns" (see BOL: https://msdn.microsoft.com/en-us/library/cc280604%28v=sql.105%29.aspx?f=255&MSPPError=-2147217396), which has been in SQL Server for a long while, since 2008.

What is "Sparsing"?
"Sparsing" is the way SQL Server optimizes storage for NULL values, at the cost of extra overhead for non-NULL values.
In other words, if you expect a column to contain more NULLs than non-NULLs, you can SPARSE that column to optimize the space.

I've seen situations where a lot of columns in Data Mart tables were almost completely filled with NULLs, and I started wondering if "SPARSE" could be a good tool to regain some space.

I've tested it on a small table:


CREATE TABLE tbl_TestSparse (ID INT IDENTITY(1,1)
       CONSTRAINT PK_ID PRIMARY KEY WITH (FILLFACTOR=100)
       , Sparsed_Column INT SPARSE NULL);


I inserted a million rows into that table and started my testing by changing the "Sparsed_Column" data type and the percentage of NULLs in that column. Here are the results I got, in megabytes:


As you can see, SPARSE became profitable only for the NVARCHAR(10) column, and only when 50% of its values were NULLs. I'd argue with even this gain. If you know you'll have many NULLs in text columns, make them VARCHAR/NVARCHAR, because as you probably noticed, those data types are handled by SQL Server so well that SPARSE does not have any positive impact on them.

The only use case for SPARSE I see is really large tables where fixed-size columns, such as INT, BIGINT, MONEY, or DATETIME, mostly hold NULL values.

Say our table had a billion records; then by sparsing a mostly empty INT column we could potentially save 3-3.5 GB. That might not be a very big percentage of the entire table, but it is still a pretty big number from the disk space perspective.
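
As a back-of-envelope check of that figure (plain arithmetic, not a measurement against a real table):

-- 1 billion non-sparse NULL INT values still occupy 4 bytes each
SELECT CAST(1000000000 AS BIGINT) * 4 / (1024.0 * 1024 * 1024) AS Approx_GB;
-- roughly 3.7 GB, before subtracting the sparse overhead on the non-NULL values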

If you are not sure how many NULLs you'll have in a column, do not use "SPARSE". It adds complexity and might negatively impact performance in the future.

In any case, if you see a column where 99% of the values are NULLs, it could be a sign of bad design or bad data.




Friday, June 17, 2016

Three ways of Inserting empty values into a table with Single Identity Column

I know, it is a very unusual situation when you need to fill a table that has only one identity column.
However, when you do need it, you need to do it quickly and efficiently.

First, we will create a table:
CREATE TABLE tbl_Single_Identity(ID INT IDENTITY(1,1) 
 CONSTRAINT PK_ID PRIMARY KEY WITH (FILLFACTOR=100));


Here is how you would do this:

Approach #1


INSERT INTO tbl_Single_Identity DEFAULT VALUES;
That is easy and fast, but what would you do if you needed to insert more than one value?
Click F5 again and again?

Approach #2

INSERT INTO tbl_Single_Identity DEFAULT VALUES;
GO 1000
That is a quick and dirty way to insert 1,000 records, but what if you need to insert a million or more?
That "GO" will be too slow. So, approach #3 comes in.

Approach #3

The third approach is to load all values in bulk, and I currently see two pretty easy ways of doing it:

A. Add an extra column, fill it in, and then drop it.


ALTER TABLE tbl_Single_Identity ADD Dummy TINYINT NOT NULL;
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B)
INSERT INTO tbl_Single_Identity(Dummy)
SELECT TOP (1000000) C FROM Pass5;
GO
ALTER TABLE tbl_Single_Identity DROP COLUMN Dummy;
GO
ALTER INDEX PK_ID ON tbl_Single_Identity REBUILD WITH (FILLFACTOR=100);
GO

I do not think it is the most effective way, because it requires extra space for the column and then an index rebuild to gain that space back.

B. Allow insertion into the identity column
SET IDENTITY_INSERT tbl_Single_Identity ON;
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B)
INSERT INTO tbl_Single_Identity(ID)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM Pass5;
GO
SET IDENTITY_INSERT tbl_Single_Identity OFF;
GO

Both inserts A and B have the very same result and produce the same size of data. But surprisingly, even though approach 3.A looks uglier and requires an index rebuild, it works much faster than approach 3.B, because of the window function "ROW_NUMBER" and the tempdb spill associated with its sort operation.

Thursday, June 16, 2016

Parameterizing Connections and Variables in SSIS

Being at an SSIS presentation recently, I realized that a lot of people who have been working with SSIS for years still do not know what "parameterizing" is or how to do it.

SSIS changed a lot in SQL Server 2012, where Microsoft introduced the "Project Deployment Model". Since then you can deploy a project and assign parameters to it, which can be passed in at execution time. Before that, developers used Configurations to supply values for internal variables and connections.


For this test I've created one "Project"-scoped database connection, "AdventureWorks2016", which has to be available to all packages within my project, and a "Package"-scoped variable, "TestVariable":


Parameterizing Connection

1. Right-click on the "AdventureWorks2016" connection and choose "Parameterize..."

2. Then you'll see the "Parameterize" window. There you can change the name of your new parameter, or assign an existing one instead of creating a new one. You can also choose which property of your connection you want to parameterize. By default it offers "ConnectionString".

3. After the necessary changes are done, press "OK". Then you'll see a little "function" sign near your connection.


4. To see your newly created parameter, just double-click on "Project.Params" in the "Solution Explorer":

It will open a new "Project.Params" window with our new parameter:

Parameterizing a Variable

This procedure is even easier.
1. In your opened "Project.Params" window, click the "New" icon:


2. Rename it to "TestParameter", change the type to "String", and assign the value "Test Value":


3. Then switch back to your "Package" window and click the "Expression Builder" button for "TestVariable" in the "Variables" window:

4. You'll see the "Expression Builder" window open:

5. Expand the "Variables and Parameters" section and drag-and-drop "TestParameter" into the small Expression window. Please note the difference between variables and parameters: parameters have the "$Project" prefix.

6. Then click OK and we will see the parameterized value for our "TestVariable":
Note the "function" sign assigned to the variable.


As you can see, it is very easy to parameterize.