Wednesday, October 23, 2013

SQL DB Architecturing: How to reuse one child table for two parent tables.

The question raised long time ago:
How to reuse one child table for two different parent tables?

The Classic example is address table. Would say in our database we handle companies and individuals separately and we have to keep their addresses in outside reference table.

CREATE TABLE Company(
  Company_ID INT IDENTITY(1,1),
  Company_Name VARCHAR(100),
  CONSTRAINT PK_Company PRIMARY KEY CLUSTERED (Company_ID ASC)
)
GO
CREATE TABLE Person(
  Person_ID INT IDENTITY(1,1),
  Person_Name VARCHAR(100),
  CONSTRAINT PK_Person PRIMARY KEY CLUSTERED (Person_ID ASC)
)
GO
CREATE TABLE Address(
  Address_ID INT IDENTITY(1,1),
  Address VARCHAR(100),
  Address_Type VARCHAR(20),
  CONSTRAINT PK_Address PRIMARY KEY CLUSTERED (Address_ID ASC)
)
GO
* For the sake of simplicity I have only ID and a name


What choices we have to establish relationship from Company and Person to an Address?
1. First option is to create separate address tables:
2. Second option is to create reference tables:


I really did not like both options and wanted to do something like this:
The problem with that is that I can't build a foreign key that will be related to both parent tables at once.

At firs, I tried to use aggregated foreign key against a view that would be a combination of "Company" and "Person" tables.
However I badly failed on that.
I tried unionize these tables, built a CTE query, query with FULL OUTER JOIN - nothing helped I couldn't build a clustered index over that view.

So, I decided to go other way.
I've created TWO reference ID columns in the address table and added Check Constraint to prevent these two values being NULL or NOT NULL:
ALTER TABLE Address ADD Company_ID INT 
GO 
ALTER TABLE Address ADD Person_ID INT 
GO 
ALTER TABLE Address ADD CONSTRAINT CHK_Address
CHECK (NOT (Company_ID is null and Person_ID is null) 
        and NOT (Company_ID is not null and Person_ID is not null))

Then I've created Foreign Key relationships to parent table:
ALTER TABLE Address ADD CONSTRAINT FK_Address_Company
       FOREIGN KEY (Company_ID) REFERENCES Company (Company_ID)
              ON UPDATE NO ACTION ON DELETE NO ACTION
GO
ALTER TABLE
Address ADD CONSTRAINT FK_Address_Person
       FOREIGN KEY (Person_ID) REFERENCES Person (Person_ID)
              ON UPDATE NO ACTION ON DELETE NO ACTION
GO

Then I insert The data
INSERT INTO Company (Company_Name) VALUES ('Microsoft'),('Oracle'),('Dell'),('Idera'),('HortonWorks')  
GO
INSERT INTO Person (Person_Name) VALUES ('Bill Gates'),('John Doe'),('Joe the Plumber')  
GO
INSERT INTO Address(Company_ID,Address, Address_Type)
VALUES (1,'33 Oak St.',('Headquarter'),
       (2,'1235 Limpopo Ave.',('Main Office'),
       (3,'1547 Caiman Isl.',('Mailing'),
       (4,'WWW.IDERA.COM',('Web'),
       (5,'127.0.0.1',('Home')
GO
INSERT INTO Address(Person_ID,Address, Address_Type)  
VALUES (1,'Hawaii',('Vacational'),
       (2,'1507 Malholland Dr.',('Second Home'),
       (3,'Corner of 5-th Ave. and 7-th St.',('Sleeping Place')
GO

And here is a query to extract the data:
SELECT 
     CASE WHEN c.Company_ID IS NULL THEN 'Person' ELSE 'Company' END as Client_Type,
      a.Address as Client_Address, a.Address_Type  
FROM Address as a
LEFT JOIN Company as c on c.Company_ID = a.Company_ID
LEFT JOIN Person as p on p.Person_ID = a.Person_ID  
GO

Below are results of the query:


Now will try to justify if my solution is bulletproof enough and can keep database integrity.
Will run following queries trying to violate all constraints:

GO
-- Error 1: Trying to insert address without corresponding Company or Person
INSERT INTO Address(Address, Address_Type)
VALUES ('4905 Ocean Pkwy.', 'Office')
GO
-- Error 2: Trying to insert address for two parents at onceINSERT INTO Address(Company_ID, Person_ID, Address, Address_Type)
VALUES (1,1,'4905 Ocean Pkwy.', 'Office')
GO
-- Error 3: Trying to insert address for not existing CompanyINSERT INTO Address(Company_ID, Address, Address_Type)
VALUES (100,'4905 Ocean Pkwy.', 'Office')
GO
-- Error 4: Trying to insert address for not existing Person INSERT INTO Address(Person_ID, Address, Address_Type)
VALUES (100,'4905 Ocean Pkwy.', 'Office')
GO
-- Error 5: Trying to Delete a Company which has linked address DELETE FROM Company WHERE Company_ID = 5
GO
-- Error 6: Trying to Delete a Person which has linked address DELETE FROM Person WHERE Person_ID = 2
GO
-- Error 7: Trying to Drop Company table DROP TABLE Company
GO
-- Error 8: Trying to Drop Person table DROP TABLE Person
GO
-- Here is an example of successful deletes: DELETE FROM Address WHERE Company_ID = 3 or Person_ID = 3
GO
DELETE FROM Company WHERE Company_ID = 3
GO
DELETE FROM Person WHERE Person_ID = 3
GO

Here are results of that query:


Congratulations to myself! Everything work as expected.

Disclaimer:
That solution has never been implemented into production. So, please be careful to use it in your system.

I know the solution looks ugly, but it works. Please criticize it. Than together we can define areas of it's proper usage.

Thanks.

Wednesday, October 2, 2013

Why do we need clustered index



You've probably heard it many times that you have to have Clustered index on each table in your environment and I've also had.


But, I'm mistrustful regarding everything. So, I want to proof or disproof the concept of Clustered Index Necessity.
(All samples are applicable for on premise SQL Server systems. Windows Azure requires to have Clustered index to allow update a tavle)

Test 1. Looking for Data Size.

Use following script to create and populate a table:




/* Script #1 */
 
USE TEMPDB;
GO
SET NOCOUNT ON
GO
IF EXISTS (
  SELECT TOP 1 1 FROM sys.tables
  WHERE object_id = OBJECT_ID('tbl_Test_Table_A') )

DROP TABLE tbl_Test_Table_A;
GO
CREATE TABLE tbl_Test_Table_A(RID int IDENTITY(1,1), F1 char(1), F2 char(1))
GO
INSERT INTO tbl_Test_Table_A(F1,F2)
VALUES ('0','J'),('1','I'),('2','H'),
('3','G'),('4','F'),('5','E'),
('6','D'),('7','C'),('8','B'),('9','A')
GO 53

Now will look at the table size before and after Clustered Index



/* Script #2 */



SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
i.name AS TableIndexName,
ps.index_id,
ps.index_type_desc,
ps.index_depth,
ps.index_level,
ps.page_count,
ps.record_count,
ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(),
OBJECT_ID('tbl_Test_Table_A'),
NULL, NULL, 'DETAILED') ps
INNER JOIN sys.indexes i
ON i.OBJECT_ID = ps.OBJECT_ID
AND i.index_id = ps.index_id;
GO
CREATE CLUSTERED INDEX CLIX_Test_Table_A
ON tbl_Test_Table_A(RID);
GO
SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
i.name AS TableIndexName,
ps.index_id,
ps.index_type_desc,
ps.index_depth,
ps.index_level,
ps.page_count,
ps.record_count,
ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(),
OBJECT_ID('tbl_Test_Table_A'),
NULL, NULL, 'DETAILED') ps
INNER JOIN sys.indexes i
ON i.OBJECT_ID = ps.OBJECT_ID
AND i.index_id = ps.index_id;
GO

Here is a result of that query:


As you can see, the number of used pages has decreased.
That means Clustered Index is more compact than a HEAP.

I've looked at the page structure of the table and discovered that HEAP can't handle more than 520 records of that type on a page. It still has free space for extra records, but does not use it.

Were these records compressed by clustered index?  Lets check it.

I will drop clustered index and check table size again:


/* Script #3 */
SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
i.name AS TableIndexName,
ps.index_id,
ps.index_type_desc,
ps.index_depth,
ps.index_level,
ps.page_count,
ps.record_count,
ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(),
OBJECT_ID('tbl_Test_Table_A'),
NULL, NULL, 'DETAILED') ps
INNER JOIN sys.indexes i
ON i.OBJECT_ID = ps.OBJECT_ID
AND i.index_id = ps.index_id;
GO
DROP INDEX [CLIX_Test_Table_A] ON [dbo].[tbl_Test_Table_A]
GO
SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
i.name AS TableIndexName,
ps.index_id,
ps.index_type_desc,
ps.index_depth,
ps.index_level,
ps.page_count,
ps.record_count,
ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(),
OBJECT_ID('tbl_Test_Table_A'),
NULL, NULL, 'DETAILED') ps
INNER JOIN sys.indexes i
ON i.OBJECT_ID = ps.OBJECT_ID
AND i.index_id = ps.index_id;
GO

Here is a result:

As you can see, now 530 records perfectly fit into one page.

Right now I do not have an explanation for that phenomena and might come back to it later.

Test 2. Looking for Index Size.

Run following script to re-create and populate a table by bigger number of records (50K).
That script also creates non-clustered index on column "F2".
/* Script #4 */
 
USE TEMPDB
GO
SET NOCOUNT ON
GO
IF EXISTS (
  SELECT TOP 1 1 FROM sys.tables
  WHERE object_id = OBJECT_ID('tbl_Test_Table_A') )
DROP TABLE tbl_Test_Table_A
GO
CREATE TABLE tbl_Test_Table_A(RID int IDENTITY(1,1), F1 char(1), F2 char(1))
GO
CREATE INDEX IX_Test_Table_A ON tbl_Test_Table_A(F2);
GO
INSERT INTO tbl_Test_Table_A(F1,F2)
VALUES ('0','J'),('1','I'),('2','H'),
('3','G'),('4','F'),('5','E'),
('6','D'),('7','C'),('8','B'),('9','A')
GO 5000

After that script I run the script #2 and here are it's results:



As you can see, clustered index added one extra page on a tree level 1, but reduced number of data pages from 97 to 93. So, total reduction of space used by the data is about 3% - not a big deal.

However, number of pages used by non-clustered index dropped significantly..
So, just having clustered index on a table reduces sizes of all other non clustered indexes by up to 60%.

Test 3. In the first two tests I looked mostly at advantages of clustered indexes. Now I want to show disadvantage of HEAP structure.


Lets recreate a table and non-clustered index

/* Script #5 */
 
USE TEMPDB
GO
SET NOCOUNT ON
GO
IF EXISTS (
  SELECT TOP 1 1 FROM sys.tables
  WHERE object_id = OBJECT_ID('tbl_Test_Table_A') )
DROP TABLE tbl_Test_Table_A
GO
CREATE TABLE tbl_Test_Table_A(


RID int IDENTITY(1,1), 
F1 char(1), 
F2 varchar(7000))
GO
INSERT INTO tbl_Test_Table_A(F1,F2)
VALUES ('0','J'),('1','I'),('2','H'),
('3','G'),('4','F'),('5','E'),
('6','D'),('7','C'),('8','B'),('9','A')
GO 52
GO
CREATE INDEX IX_Test_Table_A ON tbl_Test_Table_A(F1);
GO
*Notice that field F2 now is a varchar(7000).


Now I will update only one record by increasing the size of data to 7000 characters.


/* Script #6 */

SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
  i.name AS TableIndexName,
  ps.index_id,
  ps.index_type_desc,
  ps.index_depth,
  ps.index_level,
  ps.page_count,
  ps.record_count,
  ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(),
  OBJECT_ID('tbl_Test_Table_A'),
  NULL, NULL, 'DETAILED') ps
INNER JOIN sys.indexes i
  ON i.OBJECT_ID = ps.OBJECT_ID
  AND i.index_id = ps.index_id;
GO
UPDATE tbl_Test_Table_A

SET F1 = 'Y', F2 = REPLICATE('Z',7000)

WHERE RID = 1
GO
SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
  i.name AS TableIndexName,
  ps.index_id,
  ps.index_type_desc,
  ps.index_depth,
  ps.index_level,
  ps.page_count,
  ps.record_count,
  ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(),
  OBJECT_ID('tbl_Test_Table_A'),
  NULL, NULL, 'DETAILED') ps
INNER JOIN sys.indexes i
  ON i.OBJECT_ID = ps.OBJECT_ID
  AND i.index_id = ps.index_id;
GO

Here is a result of that script:





As you can see, new page has been created, that is normal.
What is unexpected that number of records in the table has increased by one.

Now I will investigate what was happened.
Run following script to determine Page ID of index page


DBCC IND('Tempdb','tbl_Test_Table_A',2)
Here is a result. Our Index page Id is 339


Now we will look at records within that table


DBCC TRACEON(3604)
GO
DBCC PAGE('Tempdb',1,339,2)

At first I got a header:
PAGE: (1:339)


BUFFER:


BUF @0x0000000080749F00

bpage = 0x00000000286D8000          bhash = 0x0000000000000000          bpageno = (1:339)
bdbid = 2                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 47223                       bstat = 0x10b
blog = 0xccdcccdc                   bnext = 0x0000000000000000          

PAGE HEADER:


Page @0x00000000286D8000

m_pageId = (1:339)                  m_headerVersion = 1                 m_type = 2
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 2731977                                    m_indexId (AllocUnitId.idInd) = 5120
Metadata: AllocUnitId = 1441152059801403392                              
Metadata: PartitionId = 1657324842033217536                              Metadata: IndexId = 2
Metadata: ObjectId = 949578421      m_prevPage = (0:0)                  m_nextPage = (0:0)
pminlen = 10                        m_slotCnt = 100                     m_freeCnt = 6596
m_freeData = 1409                   m_reservedCnt = 0                   m_lsn = (0:119329:0)
m_xactReserved = 0                  m_xdesId = (0:76179)                m_ghostRecCnt = 0
m_tornBits = 0                      DB Frag ID = 1                      

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED              
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = NOT CHANGED
ML (1:7) = NOT MIN_LOGGED           
Index page has 100 records


Record we need is supposed to be the last on that page because 'Y' > '9'
OFFSET TABLE:

Row - Offset                      
99 (0x63) - 1396 (0x574)          
98 (0x62) - 1383 (0x567)          
97 (0x61) - 1370 (0x55a)
Offset for the last record is 0x574

Here is the dump:
000000000080A564:   02000016 39510100 00010063 00020000 16595101  ....9Q.....c.....YQ.
000000000080A578:   00000100 00000200 00000053 01000001 00510002  ...........S.....Q..
We got it!
- The whole index record is in bold.
- 'Y' is in red
- Reference page number and slot on that page is in blue.
So, we expect our record to be on page 0x151 (337)

Now will prove it:


DBCC IND('Tempdb','tbl_Test_Table_A',0)
Here is a result
:
And page 337 is there.


DBCC PAGE('Tempdb',1,337,2)

 Lets check what is on that page. Header first:


PAGE: (1:337)


BUFFER:


BUF @0x000000008074A080

bpage = 0x00000000286D6000          bhash = 0x0000000000000000          bpageno = (1:337)
bdbid = 2                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 48341                       bstat = 0x10b
blog = 0xcccccccc                   bnext = 0x0000000000000000          

PAGE HEADER:


Page @0x00000000286D6000

m_pageId = (1:337)                  m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 2731352                                    m_indexId (AllocUnitId.idInd) = 4864
Metadata: AllocUnitId = 1369094465722515456                              
Metadata: PartitionId = 1585267247954329600                              Metadata: IndexId = 0
Metadata: ObjectId = 949578421      m_prevPage = (0:0)                  m_nextPage = (0:0)
pminlen = 9                         m_slotCnt = 100                     m_freeCnt = 6204
m_freeData = 1796                   m_reservedCnt = 8                   m_lsn = (543:365:443)
m_xactReserved = 8                  m_xdesId = (0:76179)                m_ghostRecCnt = 0
m_tornBits = 0                      DB Frag ID = 1                      

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED              
PFS (1:1) = 0x61 MIXED_EXT ALLOCATED  50_PCT_FULL                        DIFF (1:6) = NOT CHANGED
ML (1:7) = NOT MIN_LOGGED           
As you can see there are also 100 records on the page

Look for a slot #1 (which is actually 0x00)
2 (0x2) - 130 (0x82)              
1 (0x1) - 113 (0x71)              
0 (0x0) - 96 (0x60)
It is pointed to the first record on the page with offset 0x60
Here is the record:
000000000060A050:   00000000 00000000 00000000 00000000 04560100  .................V..
000000000060A064:   00010000 00030000 01001100 4a300009 00020000  ............J0.    ....
Record type in this case is 04 - which is reference
- Page Address 0x156 (342) - which is the next page
- Slot #1 on that page

Lets look for page 342

DBCC PAGE('Tempdb',1,342,2)
Page has only one record that starts like this:
000000000080A050:   00000000 00000000 00000000 00000000 32000900  ................2.    .
000000000080A064:   01000000 59030000 02006a1b 749b5a5a 5a5a5a5a  ....Y.....j.t›ZZZZZZ
Record type 32 (referenced record)
- Our 'Y' and all 'Zs' are in red
- Offset 0x1b6a (7018) to the end of the record is in blue.

Now we know that extra record that appears in the data set is not a record, but a reference to a real record.
That demonstrates that not having clustered index can generate unnecessary extra reads and corresponding CPU usage.

Hope I've convinced everybody in the necessary of having clustered index.