
Saturday, June 18, 2016

Usage of Sparse Columns

I found an interesting SQL Server feature, "Sparse Columns" (see BOL: https://msdn.microsoft.com/en-us/library/cc280604%28v=sql.105%29.aspx?f=255&MSPPError=-2147217396), which has been around for a long while, since SQL Server 2008.

What is "Sparsing"?
"Sparsing" is the way SQL Server optimizes spacing for NULL values at the cost of overhead for non-NULL values.
In other words, if you expect having in your column more nulls than not nulls you can SPARSE that column to optimize the space.

I've seen situations where a lot of columns in Data Mart tables were almost completely filled with NULLs, and I started wondering if "SPARSE" could be a good tool to gain some space.

I've tested it on a small table:


CREATE TABLE tbl_TestSparse (ID INT IDENTITY(1,1)
       CONSTRAINT PK_ID PRIMARY KEY WITH (FILLFACTOR=100)
       , Sparsed_Column INT SPARSE NULL);


I inserted a million rows into that table and started my testing by changing the "Sparsed_Column" data type and the percentage of NULLs in that column.
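In case you want to reproduce it, here is a minimal sketch of one test iteration, assuming a 50/50 NULL distribution (the row generator and the rebuild step are my assumptions, not the original harness):

-- Fill the table with 1M rows; every other row gets NULL in the sparse column
INSERT INTO tbl_TestSparse (Sparsed_Column)
SELECT CASE WHEN N % 2 = 0 THEN N END -- 50% NULLs
FROM (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
      FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b) AS x;
GO
-- Rebuild for clean page allocation, then measure the space used
ALTER INDEX PK_ID ON tbl_TestSparse REBUILD;
EXEC sp_spaceused 'tbl_TestSparse';
GO

Here are the results, in megabytes: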


As you can see, SPARSE became profitable only for the NVARCHAR(10) column, and only when 50% of its values were NULLs. Even that gain is arguable: if you know you'll have many NULLs in text columns, just make them VARCHAR/NVARCHAR, because, as you probably noticed, SQL Server handles those variable-length data types so well that SPARSE has no positive impact on them.

The only use case for SPARSE I see is really large tables where fixed-size columns, such as INT, BIGINT, MONEY, or DATETIME, mostly contain NULL values.

Say our table had a billion records. A NULL in a non-sparse fixed-size INT column still occupies its 4 bytes in every row, so by sparsing a mostly empty INT column we could potentially save 3-3.5 GB. That might not be a big percentage of the entire table, but it is still a pretty big number from the disk space perspective.

If you are not sure how many NULLs you'll have in a column, do not use "SPARSE". It adds complexity and might negatively impact performance in the future.
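Before deciding, it is easy to check how sparse a column actually is with a simple count (a sketch; the table and column names are hypothetical):

SELECT COUNT(*) AS Total_Rows
 , COUNT(SomeColumn) AS NonNull_Rows -- COUNT(column) skips NULLs
 , 100.0 * (COUNT(*) - COUNT(SomeColumn)) / COUNT(*) AS Null_Percent
FROM dbo.SomeTable; -- hypothetical names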

In any case, if you see a column where 99% of the values are NULLs, it could be a sign of bad design or bad data.




Thursday, February 25, 2016

Simple Table Partitioning

I recently presented my SSMS Query Shortcuts and got a question from the audience about whether my queries cover database partitions and partitioned tables.

I know everybody wants that, and I've been meaning to write these queries for a long time. So now is the time.

To begin, we will establish a partitioning case.

File Partitioning


First, we will create a test database with additional filegroups and files.

USE master
GO
CREATE DATABASE [TestPartitions] ON  PRIMARY 
 ( NAME = N'TestPartitions1', 
 FILENAME = N'D:\SS2014\Data\TestPartitionsPrimary1.mdf', SIZE = 20MB),
 ( NAME = N'TestPartitions2', 
 FILENAME = N'D:\SS2014\Data\TestPartitionsPrimary2.ndf', SIZE = 1MB),
FILEGROUP FileGroup1 
 ( NAME = N'Partition1', 
 FILENAME = N'D:\SS2014\Data\TestPartitions1.ndf', SIZE = 1MB),
FILEGROUP FileGroup2 
 ( NAME = N'Partition2', 
 FILENAME = N'D:\SS2014\Data\TestPartitions2.ndf', SIZE = 1MB),
FILEGROUP FileGroup3 
 ( NAME = N'Partition3', 
 FILENAME = N'D:\SS2014\Data\TestPartitions3.ndf', SIZE = 1MB),
 ( NAME = N'Partition3A', 
 FILENAME = N'D:\SS2014\Data\TestPartitions3A.ndf', SIZE = 1MB)
LOG ON 
 ( NAME = N'TestPartitions_log', 
 FILENAME = N'D:\SS2014\Data\TestPartitions_log.ldf', SIZE = 1MB)
GO
With that code we create a new database, "TestPartitions", with two files in the PRIMARY filegroup. Besides PRIMARY, we created two filegroups with one file each and a third filegroup with two files.
Here is how the result of that creation looks in SSMS:
But we do not want to dig for that information manually in Management Studio for each database on every server. Right?

To automate the process, we will write a query:

USE [TestPartitions]
GO
SELECT ISNULL(fg.name,'LOG') AS File_Group_Name
, CASE fg.is_read_only WHEN 0 THEN 'No' 
 WHEN 1 THEN 'Yes' ELSE 'N/A' END AS [Group Read Only]
, CASE fg.is_default WHEN 0 THEN 'No' 
 WHEN 1 THEN 'Yes' ELSE 'N/A' END AS [Default Group]
, mf.file_id
, mf.name AS [File_Name]
, mf.type_desc AS [File_Type]
, mf.physical_name
, mf.state_desc AS [File_State]
, CAST(mf.size/131072. AS DECIMAL(7,3)) AS [File_Size_Gb]
, CASE mf.max_size WHEN 0 THEN 'No Growth' WHEN -1 THEN 'Full Disk' 
 ELSE CAST(CAST(mf.max_size/131072. AS DECIMAL(7,3)) AS VARCHAR) END 
 AS [Max_Size_Gb]
, CASE mf.is_percent_growth WHEN 0 
 THEN CAST(CAST(mf.growth/128. AS DECIMAL(10,3)) as VARCHAR) + ' Mb'
 ELSE CAST(growth as VARCHAR) + ' %' END as [AutoGrowth]
, CASE mf.is_read_only WHEN 0 THEN 'No' ELSE 'Yes' END AS [File Read Only]
, CAST(ROUND(( fs.io_stall_read_ms / ( 1.0 + fs.num_of_reads ) ),3) as FLOAT) 
 as [Avg Read Wait, ms]
, CAST(ROUND(( fs.io_stall_write_ms / ( 1.0 + fs.num_of_writes ) ),3) as FLOAT) 
 as [Avg Write Wait, ms]
FROM sys.master_files AS mf
INNER JOIN sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS fs 
 ON fs.file_id = mf.file_id AND fs.database_id = mf.database_id
LEFT JOIN sys.filegroups AS fg 
 ON mf.data_space_id = fg.data_space_id
ORDER BY mf.data_space_id, mf.file_id
OPTION (RECOMPILE);
Here we can see not only how the filegroups and files were set up, but also some operational information, like file_id, physical file name, file state, and average read/write file statistics.
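By the way, the query above is scoped to the current database by DB_ID(). If you want the same picture from every database on a server, one option is the undocumented sp_MSforeachdb procedure (just a sketch; the inner statement is a placeholder for the query above):

EXEC sp_MSforeachdb N'USE [?];
PRINT DB_NAME(); -- replace this line with the file query above
';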

Table Partitioning

For table partitioning we have to create partition functions and partition schemes.
In this example I will create the partition functions on a DATETIME field.
USE [TestPartitions]
GO
CREATE PARTITION FUNCTION
prt_Test_Left (DATETIME) AS RANGE LEFT FOR VALUES('2014-01-01','2016-01-01');
GO
CREATE PARTITION SCHEME sc_Test_Left
AS PARTITION prt_Test_Left TO (FileGroup1,FileGroup2,FileGroup3);
GO
CREATE PARTITION FUNCTION
prt_Test_Right (DATETIME) AS RANGE RIGHT FOR VALUES('2014-01-01','2016-01-01');
GO
CREATE PARTITION SCHEME sc_Test_Right
AS PARTITION prt_Test_Right TO (FileGroup3,FileGroup2,FileGroup1);
GO
I've created two functions, one with "LEFT" and one with "RIGHT" side boundaries, each with three ranges:
Range 1: Before 1/1/2014
Range 2: Between 1/1/2014 and 1/1/2016
Range 3: After 1/1/2016

As you can see, in the second partition scheme, just for the sake of the experiment, I've purposely listed the filegroups in reverse order.

Here is what we can see in SSMS for the functions and schemes we've just created:
That is really not much. It means we need a query for more details and some automation:
SELECT ps.name as Partition_Schema, pf.name as Partition_Function
, pf.modify_date as Last_Modified
, CASE pf.boundary_value_on_right 
 WHEN 0 THEN 'LEFT' ELSE 'RIGHT' END as Function_Type
, R1.value as Min_Border_Value, R2.value as Max_Border_Value
, ds.destination_id as Partition_Order
, FG.name as [FileGroup_Name]
, SUM(IsNull(AU.total_pages,0)) as total_pages
, SUM(IsNull(AU.used_pages,0)) as used_pages
, SUM(IsNull(AU.data_pages,0)) as data_pages
, sf.name as [File_Name], sf.filename as Physical_File_Name
FROM sys.partition_schemes AS ps WITH (NOLOCK)
INNER JOIN sys.destination_data_spaces AS ds WITH (NOLOCK)
 ON ps.data_space_id = ds.partition_scheme_id
INNER JOIN sys.partition_functions AS pf WITH (NOLOCK)
 ON pf.function_id = ps.function_id 
INNER JOIN sys.filegroups AS FG WITH (NOLOCK)
 ON FG.data_space_id = ds.data_space_id
INNER JOIN sys.sysfiles AS sf  WITH (NOLOCK)
 ON sf.groupid = ds.data_space_id
LEFT JOIN sys.partition_range_values AS R1 WITH (NOLOCK)
 ON R1.function_id = pf.function_id 
  and R1.boundary_id + 1 = ds.destination_id
LEFT JOIN sys.partition_range_values AS R2 WITH (NOLOCK)
 ON R2.function_id = pf.function_id 
  and R2.boundary_id = ds.destination_id
LEFT JOIN sys.allocation_units AS AU WITH (NOLOCK)
 ON AU.data_space_id = ds.data_space_id
GROUP BY  ps.name, pf.name, pf.modify_date, pf.boundary_value_on_right
 , R1.value, R2.value, ds.destination_id, FG.name, sf.name, sf.filename
ORDER BY ps.name, ds.destination_id
OPTION (RECOMPILE);
As the result we can see the following:

We can see here:
1. Function-to-scheme relationships;
2. Function range type;
3. Boundary values;
4. Associated filegroups and partitions;
5. Space usage statistics per file/partition.

Create Partitioned Tables

We will create two tables, one on each scheme, and populate them with similar data:
CREATE TABLE tbl_Test_Left(F1 DATETIME, F2 VARCHAR(MAX)) ON sc_Test_Left(F1);
GO
CREATE TABLE tbl_Test_Right(F1 DATETIME, F2 VARCHAR(MAX)) ON sc_Test_Right(F1);
GO
INSERT INTO tbl_Test_Left(F1) VALUES ('2013-06-01')
GO 15
INSERT INTO tbl_Test_Left(F1) VALUES ('2014-06-01')
GO 20
INSERT INTO tbl_Test_Left(F1) VALUES ('2015-06-01')
GO 10
INSERT INTO tbl_Test_Left(F1) VALUES ('2016-06-01')
GO 5
INSERT INTO tbl_Test_Right(F1) VALUES ('2013-06-01')
GO 15
INSERT INTO tbl_Test_Right(F1) VALUES ('2014-06-01')
GO 20
INSERT INTO tbl_Test_Right(F1) VALUES ('2015-06-01')
GO 10
INSERT INTO tbl_Test_Right(F1) VALUES ('2016-06-01')
GO 5
The following insertions were performed:
1. 15 records dated before 1/1/2014 - supposed to go to Range 1;
2. 20+10 records dated between 1/1/2014 and 1/1/2016 - supposed to go to Range 2;
3. 5 records dated after 1/1/2016 - supposed to go to Range 3.

To see the records' allocation, we can run the following query:
SELECT s.name as [Schema]
 , o.name as Table_Name
 , IsNull(i.name,'HEAP') as Index_Name
 , ps.name as Partition_Schema
 , pf.name as Partition_Function
 , pf.modify_date as Last_Modified
 , PA.partition_number as [Partition]
 , CASE pf.boundary_value_on_right 
  WHEN 0 THEN 'LEFT' ELSE 'RIGHT' END as Function_Type
 , R1.value as Min_Border_Value
 , R2.value as Max_Border_Value
 , FG.name as [FileGroup_Name]
 , PA.rows
 , SUM(AU.total_pages) as total_pages
 , SUM(AU.used_pages) as used_pages
 , SUM(AU.data_pages) as data_pages
 , sf.name as [File_Name]
 , sf.filename as Physical_File_Name
FROM sys.indexes AS i WITH (NOLOCK)
INNER JOIN sys.partition_schemes AS ps WITH (NOLOCK)
 ON ps.data_space_id = i.data_space_id
INNER JOIN sys.partition_functions AS pf WITH (NOLOCK)
 ON pf.function_id = ps.function_id 
INNER JOIN sys.partitions AS PA WITH (NOLOCK) 
 ON PA.object_id = i.object_id AND PA.index_id = i.index_id 
INNER JOIN sys.allocation_units AS AU WITH (NOLOCK) 
 ON (AU.type IN (1, 3) AND AU.container_id = PA.hobt_id) 
  OR (AU.type = 2 AND AU.container_id = PA.partition_id) 
INNER JOIN sys.objects AS o WITH (NOLOCK)
 ON i.object_id = o.object_id 
INNER JOIN sys.schemas AS s WITH (NOLOCK)
 ON o.schema_id = s.schema_id 
INNER JOIN sys.filegroups AS FG WITH (NOLOCK)
 ON FG.data_space_id = AU.data_space_id
INNER JOIN sys.sysfiles AS sf WITH (NOLOCK)
 ON sf.groupid = AU.data_space_id
LEFT JOIN sys.partition_range_values as R1 WITH (NOLOCK)
 ON R1.function_id = pf.function_id 
  AND R1.boundary_id + 1 = PA.partition_number
LEFT JOIN sys.partition_range_values as R2 WITH (NOLOCK)
 ON R2.function_id = pf.function_id 
  AND R2.boundary_id = PA.partition_number
GROUP BY s.name, o.name, i.name, PA.partition_number, R1.value, R2.value
 , ps.name, pf.name, pf.boundary_value_on_right, pf.modify_date
 , FG.name, PA.rows, sf.name, sf.filename
ORDER BY o.name, PA.partition_number 
OPTION (RECOMPILE);
Unfortunately, because we use more than one file per filegroup, our row and page counts are aggregated per filegroup, and that duplication might bring some inconvenience.
The duplication can be avoided by using "sys.dm_db_database_page_allocations", but for the huge tables we usually partition that function works too slowly, and I wouldn't recommend it.
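For a small test table like ours it is harmless, though; here is a minimal sketch of counting pages per file with it (LIMITED is the cheaper scan mode):

SELECT allocated_page_file_id, COUNT(*) AS page_count
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('tbl_Test_Left'), NULL, NULL, 'LIMITED')
GROUP BY allocated_page_file_id;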

Left and Right Ranges

I did not dwell on this before because I wanted to demonstrate it with an example. Just do a couple more inserts and then rerun the previous query.
INSERT INTO tbl_Test_Left(F1) VALUES ('2014-01-01'), ('2016-01-01')
GO
INSERT INTO tbl_Test_Right(F1) VALUES ('2014-01-01'), ('2016-01-01')
GO
As you can see, the dates 1/1/2014 and 1/1/2016 landed in the earlier partitions of the tbl_Test_Left table and in the later partitions of the tbl_Test_Right table.
That is the difference: with RANGE LEFT a boundary value belongs to the partition on its left, and with RANGE RIGHT to the partition on its right.
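If you just want to check where a value would land without inserting anything, the $PARTITION function can be used (a quick sketch):

SELECT $PARTITION.prt_Test_Left('2014-01-01') AS Left_Function_Partition -- returns 1: the boundary belongs to the partition on its left
 , $PARTITION.prt_Test_Right('2014-01-01') AS Right_Function_Partition; -- returns 2: the boundary belongs to the partition on its right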

Splitting the Range

Usually new partition ranges are added at the end of a function/scheme, but in my example I'll demonstrate how to split a middle partition.
First we create a new filegroup, then mark it as the next one to be used within the scheme, and then split the function:
ALTER DATABASE TestPartitions ADD FILEGROUP FileGroup4
GO
ALTER DATABASE TestPartitions 
 ADD FILE ( NAME = N'Partition4', 
 FILENAME = N'D:\SS2014\Data\TestPartitions4.ndf', SIZE = 1024KB) 
TO FILEGROUP FileGroup4;
GO
ALTER PARTITION SCHEME sc_Test_Left NEXT USED FileGroup4;
GO
ALTER PARTITION FUNCTION prt_Test_Left () SPLIT RANGE ('2015-01-01');
GO
All records dated after 1/1/2014 and before 1/1/2015 were moved to the new partition.
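For completeness, the reverse operation, merging ranges back together, is just as simple (a sketch; note that the rows from the removed partition are physically moved into the neighboring one):

-- Remove the '2015-01-01' boundary and merge the two middle ranges back into one
ALTER PARTITION FUNCTION prt_Test_Left () MERGE RANGE ('2015-01-01');
GO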

I hope you liked this post, and if you know any other interesting ways to play with partitions, please let me know.

Thursday, October 29, 2015

Cloud, Data Lake .... What Next?

At PASS Summit 2015 I had to drink a lot of coffee, and I kept looking at the bottom of my cup hoping to see where the IT and database industry is currently moving.

The major trends are Cloud and Hadoop, but my curiosity went much further.
When all of us have accomplished the move to Big Data in the Cloud, what will be next?

At PASS Summit I caught a new HOT word, which will be the BUZZ word for the next few years.
That word is "DATA LAKE".

What is "Data Lake"?

How do we see our current infrastructure?
We have our application/database on premises. We possibly stretch our data to the Cloud.
And now we consolidate ALL OUR data in the Cloud, do the transformations, and produce new instances of the data completely in the Cloud.
All of these databases, data warehouses, and data systems in OUR Cloud for OUR company we can call OUR "Data Lake".

That is our current trend, but what next???

Now my coffee starts to play its role.
We have a Cloud, then we have a Data Lake .... Do you see the pattern?

We have rivers, moving from Lakes to the Ocean!

"Data Ocean" - is it a buzz word from 2020's?

What will that be? Can we think about something that does not exist yet and is pictured completely artificially by a caffeinated imagination?

Why not?

What is an Ocean?
It is something where ALL rivers go and where ALL Clouds get their content.
Did you get the idea?

Companies' Clouds and Data Lakes will collaborate with each other.
For sure, it won't be free for one company to use somebody else's data, but that will create a completely new service industry.

We are seeing the beginning of the Data Ocean era now: companies are already providing data as a service via APIs. The perfect example is Google Maps. That is just a little stream of data. In 10-15 years it will grow, if not into an entire Ocean, then at least into a flow as wide as the Amazon River.





Friday, September 11, 2015

How many Clustered Indexes can you have?


Have you ever gotten that question in a job interview?

You've been smart and answered: "- Only One!"
Have you?

Is that the right answer?

I like tricky questions, and that one is one of them.
First, you might have no clustered indexes at all, and second, the question did not specify "a single table". That means you can have as many clustered indexes in your database as you have tables.
Even more! Do not forget about views: you can have clustered indexes on views as well.
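If you have never tried it, here is a minimal sketch of a clustered index on a view (the names are hypothetical; the view must be schema-bound and its first index must be unique and clustered):

CREATE VIEW dbo.vw_Test WITH SCHEMABINDING AS
SELECT ID, SomeValue FROM dbo.SomeTable; -- hypothetical table with an ID column
GO
CREATE UNIQUE CLUSTERED INDEX CLIX_vw_Test ON dbo.vw_Test (ID);
GO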

So, the really true answer is: "- As many as you want, but no more than one per table or view."


Why is that?
Let's look at the nature of a clustered index. Forget everything you know about indexes for a second and just imagine a clustered index as your table's data set sorted in a certain order. Say you sorted your data by Salesperson ID and Sale Date, and the data lies in that order - that is your clustered index. You physically can't have the same data set sorted in more than one way; otherwise it would be a different data set.
That is the reason why we can't have more than one clustered index per table or view.
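In DDL that idea is a single statement (a sketch with hypothetical names):

-- The clustered index defines the one and only physical ordering of the table's rows
CREATE CLUSTERED INDEX CLIX_Sales ON dbo.Sales (SalesPersonID, SaleDate); -- hypothetical table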

Can we live without a clustered index at all?
Sure we can. A table without a clustered index is called a heap.
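A quick way to list all heaps in the current database (a minimal sketch):

SELECT OBJECT_SCHEMA_NAME(object_id) AS [Schema]
 , OBJECT_NAME(object_id) AS Table_Name
FROM sys.indexes
WHERE index_id = 0; -- index_id 0 = heap, 1 = clustered index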

Do we always have to have a clustered index?
Not at all. For huge tables with multiple indexes and a lot of lookup operations it might even be beneficial NOT to have a clustered index!

Look at two same-size tables in the sample AdventureWorks2014 database:
- Production.ProductProductPhoto - 504 records, without a clustered index
- Production.Product - 504 records, with a clustered index
One of them has a clustered index and the other does not.

Run simple selects against these tables:

SET STATISTICS IO ON  
SELECT * FROM Production.ProductProductPhoto
WHERE ProductID = 1 AND ProductPhotoID = 1;
GO
SELECT * FROM Production.Product  
WHERE ProductNumber = 'AR-5381';  
SET STATISTICS IO OFF


They have almost identical execution plans, with the only difference being that the heap uses a RID (Row Identifier) Lookup while the clustered table uses a Key Lookup.
When we look at the "Messages" tab, we can see how many IO operations were used for each select:

(1 row(s) affected)
Table 'ProductProductPhoto'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
Table 'Product'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The select against the table with the clustered index took 4 logical reads (2 reads for the nonclustered index + 2 reads for the clustered index), while the select against the heap took only 3 (2 reads for the index + ONLY 1 read for the heap!).
These are extremely small tables; with 6-7 clustered index levels you would pay all those extra reads versus only one read for a heap.

Did I just say a heap is better than a clustered index? - Absolutely NOT!
I said that you should not blindly follow a "Must Have Clustered Index" strategy and should be more flexible.

Heaps also have their own caveats. The biggest one: if you need, and have, only one index on a table, it is better to make it clustered.

In this case you will avoid extra lookup operations and won't fall into a nightmare of RID Lookups.

In addition to this, there is a not very well known heap issue: extra spacing.

Run a script that creates a simple table with only one VARCHAR column and inserts 700 records into it:

USE TestDB;
GO
CREATE TABLE tbl_Test_Heap( TextField VARCHAR(1000) );
GO
;WITH a1 AS (SELECT 1 AS a UNION ALL SELECT 1),
        a2 AS (SELECT a FROM a1 UNION ALL SELECT a FROM a1),
        a3 AS (SELECT a FROM a2 UNION ALL SELECT a FROM a2)
INSERT INTO dbo.tbl_Test_Heap ( TextField )
SELECT TOP 7 REPLICATE('A',950) FROM a3;
GO 100
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats
WHERE OBJECT_NAME(object_id) = 'tbl_Test_Heap';
GO
As a result we have 101 pages of used space.

Now we will add a new integer ID column, which is supposed to add some space to the table, and build a clustered index on it:

ALTER TABLE tbl_Test_Heap ADD ID INT IDENTITY(1,1);
GO
CREATE CLUSTERED INDEX CLIX_Test_Heap ON tbl_Test_Heap(ID);
GO
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats
WHERE OBJECT_NAME(object_id) = 'tbl_Test_Heap';
GO
DROP TABLE tbl_Test_Heap;
GO
As a result, the size of our table was reduced by 10%!
Tables with clustered indexes can be a little bit smaller than heaps.

Clustered indexes have their own problems. Large, rapidly growing tables can fall into fragmentation and page-split issues unless the leading field of the index is an ever-increasing value, such as DATETIME, IDENTITY, or anything else that grows over time. At the same time, that exact solution can create a last-page contention problem, which can lead to blocking and even deadlocks.

Do you see how many different problems and questions clustered indexes bring?
Do you still want to have one on every table, even if it saves some space?