Friday, July 19, 2019

SSIS Data Profiling of Multiple data sets in multiple databases

      I assume you hit my post because you are already know how to do SSIS Data Profiling for a single table, but now you want to know how to do it for all tables in your database, or even more, Profile ALL data on ALL Databases on your Server and maybe profile whole data universe of your SQL environment.

      If you just need to Profile several of your tables, than there is pretty good old solution in the Internet: http://agilebi.com/jwelch/2008/03/11/using-the-data-profiling-task-to-profile-all-the-tables-in-a-database/
In this post I'll go further using "step-by-step" methodology.

Step 1. Preparing variables.

After you created new SSIS package set necesary variables first.
Most of the variables are regular strings, some are calculations and one object:

Strings: 

"SQLServer" - Name of your target SQL Server - has to be set;
"SourceDatabase" - Name of a currently targeted Database;
"ResultPath" - Folder, where all XML results will be placed. In my Example that is:
                "C:\Projects\DataProfiling\";
"ProfileXMLHeader" - XML Header of Data Profiling Request - must be set;
"ProfileXMLFooter" - XML Footer of Data Profiling Request - must be set;
"ProfileXMLBody" - XML Body of Current Data Profiling Request;

Calculatables:

"ResultFolder" - Folder, which will be created for your SQL Server instance. The formula is:
                "@[User::ResultPath] +  @[User::SQLServer]"
"ResultFile" - File name, which will be created for an individual database. The formula is:
                "@[User::ResultFolder] + "\\" +  @[User::SourceDatabase] + ".xml""
"ProfileXML" - Current XML Data Profiling Request. The formula is:
                "@[User::ProfileXMLHeader] +  @[User::ProfileXMLBody] +  @[User::ProfileXMLFooter]"

Object:

"DatabaseList" - That will be collection of database names on your server to loop through;

Variables settings:

Besides of "SQLServer" and "ResultPath" you have to setup Header and Footer of your XML request. For Visual Studio Version 16.0.1 there should be following values, but you can extract and use them from your own requests:
"ProfileXMLHeader"
<?xml version="1.0" encoding="utf-16"?>
<DataProfile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://schemas.microsoft.com/sqlserver/2008/DataDebugger/">
  <ProfileVersion>1.0</ProfileVersion>
  <DataSources />
  <DataProfileInput>
    <ProfileMode>Exact</ProfileMode>
    <Timeout>0</Timeout>
    <Requests>
"ProfileXMLFooter"
    </Requests>
  </DataProfileInput>
  <DataProfileOutput>
    <Profiles />
  </DataProfileOutput>
</DataProfile>
As you can see, the Body of your XML Data Profiling Request will be placed between the tags "<Requests>" and "</Requests>"

As a result, your variables' settings should look like this:

Step 2. Preparing Connections

You will need only two connections in that project: to your SQL Server and to an output file.
First connection is "SearchedDataSource" - SQL Server connection:

After you created the connection you have to go to its properties and set Expressions as following:
"InitialCatalog" = @[User::SourceDatabase]
"ServerName" = @[User::SQLServer]
Setting "ServerName" as a variable will give you a freedom to point to different servers and possibly to go through their list you you wish.

Second connection is "ResearchResult" - File connection:
 
For that connection you also set an Expression in its properties:
"ConnectionString" = @[User::ResultFile]

Step 3. Building SSIS Package

Here is what you are going to build:

"Create Folder for SQL Server" - File System task to create a new folder for your data files:

"Get List of databases" - "Execute SQL Task" Fills out "DatabaseList" object with list of databases to process.
The most important Points are:
- Connection type: ADO.NET
- ResultSet: Full result set
- Result Set: 0 = User::DatabaseList
- SQL Statement: List of your databases. In this example I've excluded system databases, but you can do your own exclusions or build custom lists.

SELECT name FROM sys.databases WHERE database_id > 4

"Foreach Loop Container" - Loops through the list of your databases. There you use your object "DatabaseList" as Looping source.
Besides of looping, you need to extract current database name into "SourceDatabase" variable:

Inside of the loop you need to build SQL Task "Get List of tables" as following.
The most important Points are:
- Connection type: ADO.NET
- ResultSet: Single Row
- Result Set: 0 = User::ProfileXMLBody
- SQL Statement: The most crucial part of the project and I'll explain it below.


Here is not too complicated, but very important SQL Script.
At first, it does "DBCC UPDATEUSAGE(0);" for your database. That is important to prevent SSIS bug described in my previous post (https://slavasql.blogspot.com/2019/07/ssis-Data-Profiling-UPDATEUSAGE.html)
At second we build list of all tables in the database. As you can see in the "WHERE" clause I exclude system tables and empty ones. Also, it might be necessary to exclude tables, which are too big and won't fit into the memory.
The third part is just looping through the tables and building XML body for the request.
SET NOCOUNT ON
DBCC UPDATEUSAGE(0);

DECLARE @xml NVARCHAR(MAX)=N'';
DECLARE  @i INT, @m INT;
DECLARE @tbls TABLE (ID INT IDENTITY(1,1), [schema] SYSNAME, [name] SYSNAME)
INSERT INTO @tbls([schema], [name])
SELECT DISTINCT SCHEMA_NAME(t.schema_id), t.name 
FROM sys.dm_db_partition_stats as s
INNER JOIN sys.tables as t on t.object_id = s.object_id
WHERE t.type = 'U' and s.row_count > 0;

SELECT @m = MAX(ID), @i = MIN(ID) FROM @tbls;
WHILE @i <= @m
BEGIN
SELECT TOP 1 @xml += N'
<ColumnStatisticsProfileRequest ID="StatisticsReq' + CAST(@i as NVARCHAR) + N'"><DataSourceID>SearchedDataSource</DataSourceID><Table Schema="' + [schema] + N'" Table="' + [name] + N'" /><Column IsWildCard="true" /></ColumnStatisticsProfileRequest>
<ColumnValueDistributionProfileRequest ID="ValueDistReq' + CAST(ID as NVARCHAR) + N'"><DataSourceID>SearchedDataSource</DataSourceID><Table Schema="' + [schema] + N'" Table="' + [name] + N'" /><Column IsWildCard="true" /><Option>FrequentValues</Option><FrequentValueThreshold>0.001</FrequentValueThreshold></ColumnValueDistributionProfileRequest>
<ColumnPatternProfileRequest ID="PatternReq' + CAST(ID as NVARCHAR) + N'"><DataSourceID>SearchedDataSource</DataSourceID><Table Schema="' + [schema] + N'" Table="' + [name] + N'" /><Column IsWildCard="true" /><MaxNumberOfPatterns>20</MaxNumberOfPatterns><PercentageDataCoverageDesired>95</PercentageDataCoverageDesired><CaseSensitive>false</CaseSensitive><Delimiters> \t\r\n</Delimiters><Symbols>,.;:-"''`~=&amp;/\\@!?()&lt;&gt;[]{}|#*^%</Symbols><TagTableName /></ColumnPatternProfileRequest>
<ColumnNullRatioProfileRequest ID="NullRatioReq' + CAST(@i as NVARCHAR) + N'"><DataSourceID>SearchedDataSource</DataSourceID><Table Schema="' + [schema] + N'" Table="' + [name] + N'" /><Column IsWildCard="true" /></ColumnNullRatioProfileRequest>
<ColumnLengthDistributionProfileRequest ID="LengthDistReq' + CAST(@i as NVARCHAR) + N'"><DataSourceID>SearchedDataSource</DataSourceID><Table Schema="' + [schema] + N'" Table="' + [name] + N'" /><Column IsWildCard="true" /><IgnoreLeadingSpace>false</IgnoreLeadingSpace><IgnoreTrailingSpace>true</IgnoreTrailingSpace></ColumnLengthDistributionProfileRequest>'
+ '<CandidateKeyProfileRequest ID="KeyReq' + CAST(@i as NVARCHAR) + N'"><DataSourceID>SearchedDataSource</DataSourceID><Table Schema="' + [schema] + N'" Table="' + [name] + N'" /><KeyColumns><Column IsWildCard="true" /></KeyColumns><ThresholdSetting>Specified</ThresholdSetting><KeyStrengthThreshold>0.95</KeyStrengthThreshold><VerifyOutputInFastMode>false</VerifyOutputInFastMode><MaxNumberOfViolations>100</MaxNumberOfViolations></CandidateKeyProfileRequest>'
FROM @tbls
WHERE ID = @i;
SET @i += 1;
END
SELECT ProfileXMLBody = @xml;

The Last part is "Data Profiling Task". As Destination, you have to specify FileConnection "ResearchResult". The most important here is to set an expression "ProfileInputXML" as "@[User::ProfileXML]" variable. Also, very important to set Property "DelayValidation" to True.



Step 4. Run the Package!

That might take a while or even error out.

Step 5. Analyze the results

You can analyze individual Database Profiling results using Microsoft's "DataProfileViewer.exe" utility, which can be found in "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\SSIS\150\Binn\" folder. 
The result should look like this:

Caveats of that method:

1. If you have a lot of big databases on your server, it might take forever to scan all of them. The only way to monitor the progress to look in the newly created folder for that server for separate XML files for each already scanned database.
2. If data sets are too big for your computer to comprehend you might get 'System.OutOfMemoryException' error. Separating by single table and task does not help. Only solution I've found is to run that package on a computer with a lot of free memory.
3. SSIS does not scan BLOB columns like VARCHAR(MAX), NVARCHAR(MAX), XML, Etc. It simply ignores them. So, if you still need to scan those columns you need to use another Profiling tool.


5 comments:

  1. Hi Slava, could you share this package with me ..it is super helpful blog

    ReplyDelete
    Replies
    1. Sorry, I do not think I have it since I've done it.
      It should be easy to reconstruct by given instructions.

      Delete
    2. Hi slava, i am getting sytem out of memory exception when trying to open the xml that is created..any ideas

      Delete
  2. Hi slava, i am getting sytem out of memory exception when trying to open the xml that is created..any ideas

    ReplyDelete
    Replies
    1. Obviously, your DB is too big or you do not have enough memory to process it.
      What you can do is to limit result by only certain tables or even just one.

      Delete