Friday, September 11, 2015

How Many Clustered Indexes Can You Have?


Have you ever gotten that question in a job interview?

You were smart and answered: "- Only One!"
Have you?

Is that the right answer?

I like tricky questions, and that one is one of them.
First, you might have no clustered indexes at all. Second, the question did not specify "a single table". That means you can have as many clustered indexes in your database as you have tables.
Even more! Do not forget about views: you can have clustered indexes on views as well.

So, the truly correct answer is: "- As many as you want, but no more than one per table or view."


Why is that?
Let's look at the nature of a clustered index. Forget everything you know about indexes for a second and imagine a clustered index simply as your table's data set sorted in a certain order. Say you sorted your data by Salesperson ID and Sale Date, and the rows are stored in that order - that is your clustered index. You physically can't have the same data set sorted in more than one way; otherwise it would be a different data set.
That is the reason we can't have more than one clustered index per table or view.
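
To make that concrete, here is a minimal sketch (the table and column names are made up for illustration). The second CREATE CLUSTERED INDEX statement would simply fail, because it would ask for a second physical sort order:

CREATE TABLE dbo.tbl_Sales (
    SalesPersonID INT NOT NULL,
    SaleDate DATETIME NOT NULL,
    Amount MONEY NOT NULL
);
GO
-- After this, the rows themselves are stored sorted by SalesPersonID, SaleDate:
CREATE CLUSTERED INDEX CLIX_Sales ON dbo.tbl_Sales (SalesPersonID, SaleDate);
GO
-- A second clustered index cannot be created - SQL Server raises an error:
-- CREATE CLUSTERED INDEX CLIX_Sales2 ON dbo.tbl_Sales (SaleDate);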

Can we live without a clustered index at all?
Sure we can. A table without a clustered index is called a heap.

Do we always have to have a clustered index?
Not at all. For huge tables with multiple indexes and a lot of lookup operations, it might even be beneficial not to have a clustered index!

Look at two same-size tables in the sample AdventureWorks2014 database:
- Production.ProductProductPhoto - 504 records, without a clustered index
- Production.Product - 504 records, with a clustered index
One of them has a clustered index and the other does not.

Run simple selects against these tables:

SET STATISTICS IO ON  
SELECT * FROM Production.ProductProductPhoto
WHERE ProductID = 1 AND ProductPhotoID = 1;
GO
SELECT * FROM Production.Product  
WHERE ProductNumber = 'AR-5381';  
SET STATISTICS IO OFF


They have almost identical execution plans; the only difference is that the heap uses a RID (Row Identifier) Lookup while the clustered table uses a Key Lookup.
When we look at the "Messages" tab, we can see how many IO operations were used for each select:

(1 row(s) affected)
Table 'ProductProductPhoto'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
Table 'Product'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The select against the table with a clustered index took 4 logical reads (2 reads for the nonclustered index + 2 reads for the clustered index), while the heap needed only 3 (2 reads for the nonclustered index + ONLY 1 read for the heap!).
These are extremely small tables; with a clustered index 6-7 levels deep, you would pay all of those extra reads versus a single read for the heap.

Did I just say a heap is better than a clustered index? Absolutely NOT!
I said that you should not blindly follow a "Must Have Clustered Index" strategy and should be more flexible.

Heaps also have their own caveats. The biggest one: if you need, and have, only one index on a table, it is better to make it clustered.

In this case you avoid extra lookup operations and won't fall into a nightmare of RID Lookups.
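
A minimal sketch of that advice (hypothetical table and index names): when the table needs exactly one index, making it clustered removes the lookup step entirely.

CREATE TABLE dbo.tbl_Orders (
    OrderID INT NOT NULL,
    OrderDate DATETIME NOT NULL
);
GO
-- A heap plus a nonclustered index would pay a RID Lookup on every key search:
-- CREATE INDEX IX_Orders ON dbo.tbl_Orders (OrderID);
-- With a clustered index the leaf level IS the data, so no lookup is needed:
CREATE CLUSTERED INDEX CLIX_Orders ON dbo.tbl_Orders (OrderID);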

In addition to this, there is a not very well known heap issue: extra spacing.

Run a script that creates a simple table with only one varchar column and inserts 700 records into it:

USE TestDB;
GO
CREATE TABLE tbl_Test_Heap( TextField VARCHAR(1000) );
GO
;WITH a1 AS (SELECT 1 AS a UNION ALL SELECT 1),              -- 2 rows
        a2 AS (SELECT a FROM a1 UNION ALL SELECT a FROM a1), -- 4 rows
        a3 AS (SELECT a FROM a2 UNION ALL SELECT a FROM a2)  -- 8 rows
INSERT INTO dbo.tbl_Test_Heap ( TextField )
SELECT TOP 7 REPLICATE('A',950) FROM a3;  -- 7 rows per batch
GO 100
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats st
WHERE OBJECT_NAME(object_id) = 'tbl_Test_Heap';
GO
As a result, we have 101 pages of used space.

Now we will add a new integer ID column, which should add some space to the table, and build a clustered index on it:

ALTER TABLE tbl_Test_Heap ADD ID INT IDENTITY(1,1);
GO
CREATE CLUSTERED INDEX CLIX_Test_Heap ON tbl_Test_Heap(ID);
GO
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats st
WHERE OBJECT_NAME(object_id) = 'tbl_Test_Heap';
GO
DROP TABLE tbl_Test_Heap;
GO
As a result, the size of our table was reduced by 10%!
Tables with clustered indexes can be a little bit smaller than heaps.

Clustered indexes have their own problems. Large, rapidly growing tables can fall into fragmentation and page-split issues unless the leading column of the index is an ever-growing value such as a DATETIME, an IDENTITY, or anything else that increases over time. At the same time, that exact solution can create a page contention problem on the last page, which can lead to blocking and even deadlocks.
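
If you want to see whether a clustered index is suffering from those page splits, fragmentation can be checked with a standard DMV query; here is a sketch against the Production.Product table used above:

SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('Production.Product'), NULL, NULL, 'LIMITED') AS ps
INNER JOIN sys.indexes AS i
       ON i.object_id = ps.object_id AND i.index_id = ps.index_id;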

Do you see how many different problems and questions clustered indexes bring?
Do you still want to have one on every table, even if it saves space?

Wednesday, September 9, 2015

Defining a monthly period for an application

Sometimes an application's business logic requires you to define a period that is associated with a particular month.

Currently I see at least three possible ways of implementing it:

Solution 1: Use the first day of the month.

This solution is the simplest one: you use a DATE column to identify a period and always store the first day of the given month in it.
Pros:
- Simple to implement: only one column with a simple basic data type;
- The DATE data type uses only 3 bytes;
Cons:
- It is possible to store a value other than the first day of the month in that column. If that happens, your search criteria or joins won't work correctly; a CHECK constraint, sketched below, can close that loophole.
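
A minimal sketch of that safeguard (the table and column names are hypothetical):

CREATE TABLE dbo.tbl_SalesByMonth (
    PeriodMonth DATE NOT NULL,
    Amount MONEY NOT NULL,
    -- reject anything that is not the first day of a month:
    CONSTRAINT CK_SalesByMonth_FirstDay CHECK (DAY(PeriodMonth) = 1)
);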

Solution 2: SMALLINT data type for monthly period identification.

If your application starts its periods after the year 2000, the 21st century gives you a unique opportunity to start making new mistakes from scratch. As before the Year 2000 problem, we can use just two (or eventually three) digits to identify a year and two digits to identify a month. For instance, September 2015 can be coded as the number 1509. Because the SMALLINT limit is about 32K, your application's periods should be fine until the end of December 2327.
Pros:
- Only 2 bytes to store the period value, which is good for huge data sets and multi-joins;
- A simple, readable form;
Cons:
- A user acceptance barrier: some people can't easily switch from the form "20150901" to the simple "1509";
- The need to transform the value into a DATE for most reporting. For the "YYYYMMDD" format, you can create a computed column with a formula like "CAST(200000 + Period AS VARCHAR) + '01'" (see the sketch after this list);
- It is possible to insert a period value with a month number bigger than 12. To prevent this, you have to add a check constraint like "(Period % 100 BETWEEN 1 AND 12)";
- Partial inability to represent periods before the year 2000. Sure, you can use negative values, but that loses the benefit of simplicity.
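
Putting the computed column and the check constraint together, a sketch of such a table might look like this (all names are illustrative):

CREATE TABLE dbo.tbl_MonthlySales (
    Period SMALLINT NOT NULL,   -- e.g. 1509 = September 2015
    Amount MONEY NOT NULL,
    -- 200000 + 1509 = 201509, then append '01' -> '20150901':
    PeriodDate AS CONVERT(DATE, CAST(200000 + Period AS VARCHAR(6)) + '01'),
    -- reject month numbers outside 1..12:
    CONSTRAINT CK_MonthlySales_Month CHECK (Period % 100 BETWEEN 1 AND 12)
);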

Solution 3: Use reference table.

That is a heavyweight solution: you create a table in which you define your periods. Here is a possible table design:
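
A sketch of such a table (the column names and extra attributes are my illustration, not a prescribed design):

CREATE TABLE dbo.tbl_Periods (
    PeriodID SMALLINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PeriodStartDate DATE NOT NULL,
    PeriodEndDate DATE NOT NULL,
    PeriodName VARCHAR(30) NOT NULL,   -- e.g. 'September 2015'
    IsClosed BIT NOT NULL DEFAULT (0)  -- an example of an individual attribute
);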
Pros:
- You get a very informative reference table, which you can use for any kind of filtering or calculations;
- You can attach any number of individual attributes to your periods;
- Potentially, you can later switch to periods of a different length and move away from "monthly" periods entirely;
Cons: 
- Reference table requires at least minimal maintenance;
- The period ID is not informative by itself, and you always have to join the "Period" table to get a date.

Summary:

There can also be other solutions for storing monthly periods, as well as combinations of the above.
For instance, my favorite would be a combination of #2 and #3:
adding a foreign key constraint gives you the full benefits of both solutions while mitigating their weaknesses.

Then sample data would look like this:
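
Here is a sketch of the combined design with a couple of sample rows (the names are illustrative): the SMALLINT period code from #2 becomes the key of the reference table from #3, and the foreign key keeps fact rows honest.

CREATE TABLE dbo.tbl_PeriodCodes (
    Period SMALLINT NOT NULL PRIMARY KEY,  -- e.g. 1509 = September 2015
    PeriodStartDate DATE NOT NULL,
    PeriodName VARCHAR(30) NOT NULL
);
GO
CREATE TABLE dbo.tbl_SalesFacts (
    SaleID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Period SMALLINT NOT NULL
        CONSTRAINT FK_SalesFacts_PeriodCodes REFERENCES dbo.tbl_PeriodCodes (Period),
    Amount MONEY NOT NULL
);
GO
INSERT INTO dbo.tbl_PeriodCodes (Period, PeriodStartDate, PeriodName)
VALUES (1509, '20150901', 'September 2015'),
       (1510, '20151001', 'October 2015');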


Tuesday, September 8, 2015

SQL Server 2016 Query Store

I already described some of the "Query Store" functionality last November in my blog post "Newest SQL Server Feature – “Query Store”", when SQL Server 2016 was not available yet and nobody could try it.
Now I want to try that functionality live and make sure all my expectations hold true.

First, we will enable the Query Store functionality for a database in three easy steps:
Step 1. In SSMS 2016 CTP2, create a test database and go to its properties:

Step 2. In Properties, select the last option in the list, "Query Store":

Step 3. In the "General" section, change "Enable" to "True". That will fill in all the default values.

You can get the same result as steps 1-2-3 just by running the following script:
USE [AdventureWorks2014];
GO
ALTER DATABASE [AdventureWorks2014] SET QUERY_STORE = ON;
GO
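
Enabling it with the defaults is enough here, but the Query Store options can also be set explicitly in the same statement. A sketch (the values are examples; the option names follow the released SQL Server 2016 syntax and may differ slightly in CTP2):

ALTER DATABASE [AdventureWorks2014]
SET QUERY_STORE (
    OPERATION_MODE = READ_WRITE,
    MAX_STORAGE_SIZE_MB = 100,
    INTERVAL_LENGTH_MINUTES = 60
);
GO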

Now let's review the case:

Problem Description:

Company "ABC" has a database where they place thousands of sales orders on a daily basis.
At the same time, a couple of hundred clients send heavy query requests to the SQL Server every few seconds to monitor these sales orders.

Problem Solution:

Because the data changes over time, after the first time the server went down somebody very smart said: "-We have to update statistics!" and somebody else added: "-And we will schedule that statistics update to prevent any problems in the future!"

Still have a problem:

Even though statistics were updated on a regular basis, the performance problem now occurred even more often! Why?

Problem Simulation:

I will try to simulate that particular problem using the very small AdventureWorks database. To get some visible results, I will use one month as the period.

In preparation, I will run the following script to enable the Query Store, clear it (just in case), re-create one additional index, and clear the query cache (NEVER do this in PRODUCTION), finishing with UPDATE STATISTICS to simulate the nightly maintenance job:

USE [AdventureWorks2014];
GO
ALTER DATABASE [AdventureWorks2014] SET QUERY_STORE = ON;
GO
ALTER DATABASE [AdventureWorks2014] SET QUERY_STORE CLEAR ALL;
GO
IF Exists (SELECT TOP 1 1 FROM sys.indexes WHERE name = 'IX_SalesOrderHeader_Date')
    DROP INDEX IX_SalesOrderHeader_Date ON Sales.SalesOrderHeader;
GO
CREATE INDEX IX_SalesOrderHeader_Date ON Sales.SalesOrderHeader (OrderDate);
GO
DBCC FREEPROCCACHE;
GO
UPDATE STATISTICS Sales.SalesOrderHeader; 
GO 

Mr. Martin comes to work pretty early and executes his query first:

DECLARE @SQL NVARCHAR(1000) = '
     SELECT h.AccountNumber FROM Sales.SalesOrderHeader AS h
     INNER JOIN Person.Person AS p ON h.CustomerID = p.BusinessEntityID
     WHERE p.LastName = @LastName and h.OrderDate >= @OrderDate;';
DECLARE @LastName NVARCHAR(100) = N'Martin';
DECLARE @OrderDate DATETIME = '2014-07-01';

SET STATISTICS IO ON
EXECUTE sp_executesql @SQL,  
     N'@LastName NVARCHAR(100), @OrderDate DATETIME',
     @LastName = @LastName, @OrderDate = @OrderDate;
SET STATISTICS IO OFF
You might have noticed that I added the SET STATISTICS IO ON command to capture IO stats:
Very nice. At the beginning of the period there were only 2 logical reads, because the period is still empty.
The maintenance job updated statistics and everything works perfectly. Everybody is happy.

Now let's add some data to the period:
INSERT INTO Sales.SalesOrderHeader(
    RevisionNumber, OrderDate, DueDate, ShipDate, Status,
    OnlineOrderFlag, PurchaseOrderNumber,
    AccountNumber, CustomerID, SalesPersonID, TerritoryID,
    BillToAddressID, ShipToAddressID, ShipMethodID,
    CreditCardID, CreditCardApprovalCode, CurrencyRateID,
    SubTotal, TaxAmt, Freight, Comment,
    rowguid, ModifiedDate)
SELECT     RevisionNumber
    , DATEADD(MONTH,1,OrderDate)
    , DATEADD(MONTH,1,DueDate)
    , DATEADD(MONTH,1,ShipDate)
    , Status, OnlineOrderFlag, PurchaseOrderNumber,
    AccountNumber, CustomerID, SalesPersonID, TerritoryID,
    BillToAddressID, ShipToAddressID, ShipMethodID,
    CreditCardID, CreditCardApprovalCode, CurrencyRateID,
    SubTotal, TaxAmt, Freight, Comment,
    NewId(), ModifiedDate
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '2014-06-01';

Now we will re-run Mr. Martin's script. We shouldn't be surprised to see big new numbers:
the number of reads increased almost 3000 times! And that is in a very small database with no other activity.
Now imagine a busy server with hundreds of clients...
It could mean the DBA team having to use the Dedicated Administrator Connection to bring the server back to life.

Guess what they do to fix the problem....

UPDATE STATISTICS Sales.SalesOrderHeader;

Let's simulate their activity too, and after that run Mr. Martin's query again.
  
What do we see? The number of logical reads dropped to a level at which SQL Server can handle hundreds of requests.
Did "UPDATE STATISTICS" help? Yep! ....

Everything looks bright until the next period, when the maintenance job updates statistics again...

Let's describe the real problem: updating statistics sometimes helps, but sometimes it hurts.
In order to fix this issue once and for all, we will use the "Query Store".

Open the AdventureWorks database, where you can notice some new items in the list:

Let's briefly review these new reports:
The "Regressed Queries" report contains the list of regressed queries in the Query Store along with their plans. In my case, I have only one.
The "Overall Resource Consumption" report shows resource consumption for all of your queries.
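
The data behind these reports also lives in new catalog views, so a "top consumers" list can be pulled with plain T-SQL as well. A sketch over the documented Query Store views:

SELECT TOP 10
    q.query_id,
    t.query_sql_text,
    stats.total_logical_reads
FROM (
    -- total logical reads per query, across all of its plans and intervals
    SELECT p.query_id,
           SUM(rs.count_executions * rs.avg_logical_io_reads) AS total_logical_reads
    FROM sys.query_store_plan AS p
    INNER JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
    GROUP BY p.query_id
) AS stats
INNER JOIN sys.query_store_query AS q ON q.query_id = stats.query_id
INNER JOIN sys.query_store_query_text AS t ON t.query_text_id = q.query_text_id
ORDER BY stats.total_logical_reads DESC;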

And here is our "winner" in the "Top Resource Consuming Queries" report:


You can notice that our slow query #17 has two plans associated with it: #17 and #20.



Why is that?
Plan #17 was created by SQL Server when there were no sales orders yet in July 2014, and SQL Server decided it would be less expensive to build a plan using the index on Order Date with an associated Key Lookup. However, as the number of records for the period grew, that plan became more and more expensive.
Updating statistics invalidated that plan, and when we ran the query again SQL Server created plan #20, which looks up the person's last name first and then does a full scan over the sales table. For that particular query, plan #20 is more appropriate in the long run.

What would you do in this kind of situation in previous versions of SQL Server?
- You might specify the Last Name index in a hint, or use the OPTIMIZE FOR UNKNOWN option.
The problem with these old solutions is that they are applicable only when you have full control over the SQL code; with a third-party application, you have nothing left but to update statistics whenever the problem occurs.

In SQL Server 2016, we have an option to force a particular execution plan to be used for a query.
Let's press the "Force Plan" button for plan #20.
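
The button has a T-SQL equivalent - a documented pair of stored procedures (the IDs below are the ones from my report):

EXEC sp_query_store_force_plan @query_id = 17, @plan_id = 20;
-- and, if you change your mind later:
-- EXEC sp_query_store_unforce_plan @query_id = 17, @plan_id = 20;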

Now we will prove that it is working.
First, enable the actual execution plan (Ctrl+M), update statistics again, and execute our query for the next period - August 2014:

UPDATE STATISTICS Sales.SalesOrderHeader; 
GO
DECLARE @SQL NVARCHAR(1000) = '
     SELECT h.AccountNumber FROM Sales.SalesOrderHeader AS h
     INNER JOIN Person.Person AS p ON h.CustomerID = p.BusinessEntityID
     WHERE p.LastName = @LastName and h.OrderDate >= @OrderDate;';
DECLARE @LastName NVARCHAR(100) = N'Martin';
DECLARE @OrderDate DATETIME = '2014-08-01';

SET STATISTICS IO ON
EXECUTE sp_executesql @SQL,  
     N'@LastName NVARCHAR(100), @OrderDate DATETIME',
     @LastName = @LastName, @OrderDate = @OrderDate;
SET STATISTICS IO OFF

Here is the plan for our query, which is supposed to be plan #20, forced by the Query Store. However, it isn't. It is a very similar plan, but it does not have the Filter operation on the sales data.

That is strange. Let's go to the fourth report provided by the Query Store: "Tracked Queries".

At first it is completely empty. We have to specify query #17 to see its stats and plans:

As you can see, SQL Server came up with a third plan, #26, for that query.
MSDN says: "When a plan is forced for a particular query, every time SQL Server encounters the query, it tries to force the plan in the optimizer. If plan forcing fails, a XEvent is fired and the optimizer is instructed to optimize in the normal way."

I tried to look at Extended Events and there was nothing - absolutely no indication of plan #20 failing.
I do not know what the cause might be. It could be a CTP 2 bug, or maybe a feature.
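
Besides the XEvent, plan forcing failures are also recorded in the Query Store catalog views, so they can be checked like this (a sketch; the columns are documented in sys.query_store_plan):

SELECT plan_id, query_id, is_forced_plan,
       force_failure_count, last_force_failure_reason_desc
FROM sys.query_store_plan
WHERE query_id = 17;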
I clicked on the "Compare Plans" button and got all three plans at a glance:

It seems that plan #26 is just a little bit better than #20, and SQL Server preferred it without firing any "plan forcing failure" event in Extended Events.


So far, here is what we've learned about the "Query Store":

1. The "Query Store" collects all query execution plans along with their statistics and presents them visually in a pretty friendly interface. I predict it will shake the position of the major monitoring tools unless they figure out better strategies.

2. We can force a specific plan to be used for a query.

3. (Undocumented) It looks like when we force a plan, SQL Server still tries to improve it if possible.

Other cool "Query Store" features will be covered in my future posts.


Monday, September 7, 2015

Cool New Feature: "Include Live Query Statistics"

Exploring SQL Server Management Studio Version 2016 CTP 2.

It has a very cool feature: "Include Live Query Statistics".

It has a new icon in the toolbar, right next to the "Actual Execution Plan" icon:




What is cool about that feature?
It gives a live view of the current query execution.

I tried to come up with a query that would demonstrate it:

USE AdventureWorks2014;
GO
DBCC DROPCLEANBUFFERS;
GO
;WITH Prices AS (
    -- the scalar UDF call makes this query slow enough to watch live
    SELECT dbo.ufnGetProductDealerPrice(d.ProductID, h.OrderDate) AS Price,
        ROW_NUMBER() OVER(ORDER BY p.MiddleName) rn, p.PersonType, pr.Color
    FROM Sales.SalesOrderDetail AS d
    INNER JOIN Sales.SalesOrderHeader AS h ON h.SalesOrderID = d.SalesOrderID
    INNER JOIN Person.Person AS p ON h.CustomerID = p.BusinessEntityID
    INNER JOIN Production.Product AS pr ON d.ProductID = pr.ProductID
) SELECT * FROM Prices;

And I was able to capture a moment on the screen when the Hash Match operation was already done and the Sort operation had just started.

In this case it might not be so useful, but for complex queries that run for several minutes, it would be very useful to see what SQL Server is doing right now and where the bottleneck might be.

It looks like Microsoft finally decided to put more improvements into SSMS.
SSMS Rocks!