Tuesday, September 1, 2015

Identify Unicode columns to convert to Non-Unicode

In my yesterday's post I reviewed the situation when Unicode data type is not necessary for string values in SQL Server database.
In today's post I want to present the script, to identify all Unicode columns in your database, which do not contained ANY unicode characters and might be eligible to be converted to Non-Unicode.

SELECT o.object_id
    , s.name AS SchemaName
    , o.name AS TableName
    , c.name AS ColumnName
    , t.name AS ColumnType
    , 0 AS CanNotConvert
    , CAST(0 AS FLOAT) AS CanSaveMB
INTO #ColumnList
FROM sys.schemas AS s
INNER JOIN sys.tables AS o
    ON o.schema_id = s.schema_id
INNER JOIN sys.columns AS c
    ON c.object_id = o.object_id
INNER JOIN sys.types AS t
    ON t.system_type_id = c.system_type_id
    AND t.name in ('nchar','nvarchar')
GO
DECLARE @Schema SYSNAME, @Table SYSNAME, @Column SYSNAME, @Type SYSNAME;
DECLARE @SQL NVARCHAR(1000), @Count INT, @BUsed INT, @object_id INT;

SELECT @object_id = MIN(object_id) FROM #ColumnList;

WHILE @object_id is NOT NULL
BEGIN
    SELECT @SQL = N'SELECT @Count = SUM(CASE WHEN ['
        + ColumnName + N'] != CAST(CAST(['
        + ColumnName + N'] AS ' + SUBSTRING(ColumnType,2,7)
        + N') AS ' + ColumnType + N') THEN 1 ELSE 0 END), @BUsed = SUM(LEN(['
        + ColumnName + N'])) FROM ['
        + SchemaName + N'].[' + TableName + N'];'
    FROM #ColumnList WHERE object_id = @object_id;

    PRINT @SQL;
    EXEC sp_executesql @SQL,
        N'@Count INT OUTPUT, @BUsed INT OUTPUT',
        @Count=@Count OUTPUT, @BUsed=@BUsed OUTPUT;
   
    UPDATE #ColumnList
    SET CanNotConvert = IsNull(@Count,0), CanSaveMB = IsNull(@BUsed/1048576.,0)
    WHERE object_id = @object_id;
   
    ;WITH NextID AS (
        SELECT MIN(object_id) AS NextID
        FROM #ColumnList
        WHERE object_id > @object_id
    )
    SELECT @object_id = NextID
    FROM NextID;
       
END
GO
SELECT *
FROM #ColumnList
WHERE CanNotConvert = 0 and CanSaveMB > 0
ORDER BY CanSaveMB DESC;
GO
DROP TABLE #ColumnList;
GO

Column "CanSaveMB" shows the number of Megabytes, which could be saved by using Non-Unicode data type.


Monday, August 31, 2015

Why do you need a Unicode?

Databases do not store only transnational data, sometimes that happens that the very big chunk of the data is a text.
You might try to normalize it, but it is not always possible and you dump text into VARCHAR columns.

Sometimes it happen that majority of the data in a huge table are just those text data.
Just imagine, every time you are doing table scan request your server reads all these data from disk into memory to process it.
It is not only the disk issue, these heavy read operations might push out of memory other database components, which may cause even more future reads.

On the top of that problem your database architect decided to use Unicode for all CHARs and VARCHARs...

What are database architect's reasons for using Unicode?
- Customers say they will always put it in English now, but they have worldwide client. What if you want to store a customer name which uses non-English characters? Or the name of a place in another country?
- Unicode supports Scientific, Technical and Mathematical character sets;

- To have ALL text columns in Unicode is much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on;
- It's easier to support Unicode from the beginning, than try to retrofit it later;
- I do not want to put time bomb in the application. It is sound always better to be safe than sorry.

All these arguments are from the category "WHAT IF".
There are sometimes reasonable "WHAT IFs", then that has to be addressed in business requirements, but if business is not require something explicitly, that should be avoided.

I know a lot of people will disagree with me and claim that I'm not experienced enough in international database development.
Just want to answer them that I actually do a lot of international database development experience.
I've started my IT career as non-English speaker in not localized DOS 3.0 environment. I went through a character coding nightmare, which native English speakers can't even imagine and I know what I'm talking about.

Let's start.

You've asked to use Unicode because the company might have clients with foreign names, addresses and product names.

You have to ask the question: Will you have in you have in your database any international charsets from the list below?:
- Indian: "কখগঘঙচছজঝঞট";
- Arabic: "ݐݑݒݓݔݕݖݗݘݙ";
- Georgian: "ᦍᦎᦏᦐᦑᦒᦓᦔᦕᦖᦗᦘᦙ";
- Cyrillic: "БГДЖЗЛФХЦЧШЩЪЫЭЮЯ";
- Korean: "ᄒᄓᄔᄕᄖᄗᄘᄙᄚᄛᄜᄝᄞᄟ"
- Japanese: "༡༢༣༤༥༦༧༨༩༪༫༬༭༮༯";
- Chinese: "樂洛烙珞落酪駱亂卵欄爛蘭";
- Braille: "⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙"
- Mathematical: "∭∛∜∮∯∰∱∲≉≊≋≌≍";

If his/her answer will be: "- No, we must not allow any of those to be allowed as Names, Addresses or Products. Otherwise nobody will be able to read it!"

Hold on, but you can have a business in Canada, Mexico, Germany or any other country, which uses Latin alphabet with extra non-English characters. Do you still need Unicode?

Here is a demo:
USE TestDB
GO
CREATE TABLE
tbl_Test_Unicode(
    [Language_ID] TINYINT IDENTITY (1,1),
    [Language] VARCHAR(20) NOT NULL,
    [Unicode] NVARCHAR(100) NOT NULL,
    [NonUnicode] VARCHAR((100) NULL
);
GO
INSERT INTO
tbl_Test_Unicode([Language], [Unicode])
VALUES ('Spanish',N'ÁÉÍÑÓÚÜ¿¡áéíñóúü'),
('French',N'ÙÛÜŸÀÂÆÇÉÈÊËÏÎÔŒùûüÿàâæçéèêëïîôœ'),
('German',N'ÉËÏÓÖÜéëïóöü'),
('Norwegian',N'ÅÆÂÉÈÊØÓÒÔåæâéèêøóòô'),('Italian',N'ÀÈÉÌÒÓÙàèéìòóù'),('Chech',N'ÚŮÝŽÁČĎÉĚÍŇÓŘŠŤúůýžáčďéěíňóřšť'),('Polish',N'ĄĆĘŁŃÓŚŹŻąćęłńóśźż'),('Belarusian',N'АаБбВвГ㥴ДдЖжЕеЁёЖжЗзЬьІіЙйКкЛлМмНнОоПпРрСсТтУуЎўФфХхЦцЧчШшЫыЭэЮюЯя');
GO
UPDATE tbl_Test_Unicode
SET [NonUnicode] = [Unicode];
GO
SELECT *,
    CASE [NonUnicode] WHEN [Unicode] THEN 'Y' ELSE 'N' END "Are they equal?"
FROM tbl_Test_Unicode;
GO
DROP TABLE tbl_Test_Unicode;
GO

Here is a result:

The secret is that default SQL Server Collation and code page already contain all necessary characters. You can use following script to prove it:
DECLARE @b TINYINT = 0, @String VARCHAR(300) = '';
WHILE @b < 255 SELECT @b += 1, @String += CASE @b % 32 WHEN 0 THEN CHAR(10) ELSE '' END + CHAR(@b);
PRINT @String;

As you can see Spanish, French, German, Norwegian, Italian are NOT REQUIRE Unicode!
Discrimination starts with Eastern Europe languages, but they are translated pretty well and still readable.
Problem only starts with Cyrillic, Arabic, Chinese, etc.
Their symbols are replaced by question marks, but 99% of people in your company can't read it anyway, right?

Is that acceptable? Can you live with that?

If answer is "Yes" - you are lucky: can use Non-Unicode and save a lot of disk space!


Friday, August 28, 2015

Two and a half ways to do "Safe" delete.

Sometimes you have to manually delete some data in your database and sometimes things go wrong.

Once I had to delete the latest period from production database, but simply did not highlighted WHERE clause and executed statement for deletion of ALL data within the table.
Fortunately for me, table was huge and after 5 minutes of waiting I realized the problem and cancelled the transaction. Since then I'm using ONLY "Safe" deletes.

I put word "safe" in quotes because there is no safe way to delete. If you crazy enough you can skip all safety rules and still make a damage, but if you know what you are doing couple of my hints can save your day and maybe your resume from update.

The simplest way to make your deletion safe is to make sure you are deleting what you want before the action. That means you just do a select with the same WHERE clause parameters as you plan do it in the delete statement.

For example:

USE [AdventureWorks2014];
GO
SET NOCOUNT OFF
GO
 
SELECT * -- DELETE
FROM dbo.DatabaseLog WHERE DatabaseLogID = 1;

At first, you highlight the select statement and only if it returns what you want to delete you highlight it starting from the word DELETE. That you will be 100% sure that you are using the exactly same parameters in the WHERE clause.

For medium data sets you might use record count to ensure correct deletion. For that reason I specified "SET NOCOUNT OFF" in case it turned ON.

The second "Safe" way to delete is using TRANSACTIONS. In this case you verify your deletion after the fact and COMMIT your transaction if everything is OK and ROLLBACK it if not.

USE [AdventureWorks2014];
GO
SET NOCOUNT OFF
GO
 
BEGIN TRAN

DELETE FROM
dbo.DatabaseLog
OUTPUT DELETED.*
WHERE DatabaseLogID = 1

-- COMMIT
-- ROLLBACK

OUTPUT clause in that command returns you the list of deleted records. Then you make the decision if you like it or not and then execute the following COMMIT or ROLLBACK.

That method is little safer than the first, but still not perfect.
OUTPUT clause has some limitations. It can't run if:
- Table has enabled triggers;
- Table is involved in Foreign Key relationships on either side;
- Table has check constraints or enabled rules.

In these cases you would have to combine first and second method together like this:


USE [AdventureWorks2014];
GO
SET NOCOUNT OFF
GO
 
BEGIN TRAN

SELECT *
FROM dbo.DatabaseLog  
WHERE DatabaseLogID = 1;

-- DELETE FROM dbo.DatabaseLog WHERE DatabaseLogID = 1

-- COMMIT
-- ROLLBACK

There were the demonstrations of two safety tactics. What is "a half"?

The "half" was embedded into these queries and maybe you already noticed that.
Just as a general rule: Put double dashes in front of all irreversible actions  such as DELETE, UPDATE, COMMIT.

That will prevent any unexpected unfortunate event ещ happen in case you run the script in the window without selection.


Decimal vs Money. What consumes less space?

I love to use SMALLMONEY data type. it holds decimal amounts up to about 200K and have 4 digits in the scale and for all these features it consumes only four bytes, same as a regular integer.

But "WHAT IF?"....

What if your system will hit a value over the magic number?
What if you specified it for a "Price" columns and inflation in your country has driven prices to the roof?

Would you use MONEY data type instead?

Yes, you can, but in this case your column will consume four additional bytes. If your table is small it it nothing to worry about, but if it is huge.

Here are the data type limits for MONEY data types:


Let's try to use DECIMAL data type instead.
Here are the data type limits for DECIMAL/NUMERIC data types:

As you can see, SMALLMONEY data type in translation to DECIMAL is already goes to Precision of 10 and accordingly to the second line will consume 9 bytes instead of 4. That is even more than MONEY.

Is it the end?

Actually not.

You will be surprised, that DECIMAL value of "429496.7295", which can not fit into SMALLMONEY still uses 5 bytes!

Now, would say your currency has only 2 digits after decimal point and you do not need four digits, which SMALLMONEY provides.
Then you can go beyond and your "5 bytes Limit" provides ability to store a number up to 42,949,672.95!

42 Millions!!!!

Would say you need a very good precision for financial operations and still need four digits.

You have a contradiction: You need to keep size as small as possible, but it might cause an issue.
In that case you can use data type from the second decimal line "DECIMAL(19,4)"!

Even though MSDN says it consumes 9 bytes it still consumes only 5 until it reaches magic number of  "429,496.7295".

Here is the proof:
DECLARE @d DECIMAL(19,4) = 429496.7295
SELECT @d, DATALENGTH(@d);





So, if you plan to have mostly numbers below of SMALLMONEY limit, but sometimes little bit higher than it might worth to use "DECIMAL(19,4)" instead. Yes, you will consume 25% more space, but you will be ensured from any surprises and that is much smaller insurance than 100% of MONEY data type


Wednesday, August 26, 2015

Caveats of "CASE" statement testing for "NULL" values.

"CASE" statement syntax seems pretty easy.

There is only two different cases of usage:



Here are couple of samples
;WITH SampleData AS (
    SELECT 'AZ' AS TestColumn1, 1 AS TestColumn2
    UNION SELECT 'BL', 2 UNION SELECT 'CK', 3
    )
SELECT TestColumn1,
    CASE
        WHEN LEFT(TestColumn1,1) = 'A' THEN TestColumn2*10
        WHEN LEFT(TestColumn1,1) = 'B' THEN TestColumn2*100
        WHEN LEFT(TestColumn1,1) = 'C' THEN TestColumn2*1000
   END AS TestMultiplication
FROM SampleData;

;WITH SampleData AS (
    SELECT 'AZ' AS TestColumn1, 1 AS TestColumn2
    UNION SELECT 'BL', 2 UNION SELECT 'CK', 3
    )
SELECT TestColumn1,
    CASE LEFT(TestColumn1,1)
        WHEN 'A' THEN TestColumn2*10
        WHEN 'B' THEN TestColumn2*100
        WHEN 'C' THEN TestColumn2*1000
   END AS TestMultiplication
FROM SampleData;


These two scripts produce the same result, but the second one is much smaler.
Moreover, if instead of "LEFT" function you are using user defined function (which BTW is extremely bad practice for large data sets) then that function will be executed individually for every "WHEN" clause.

Is it the reason to use shorter version?

Usually it is not. In the second version you loose ability to search by ranges and able to search for only one parameter/field.

And that is not everything.

In certain situations it even can produce wrong results:

;WITH SampleData AS (SELECT NULL AS TestColumn UNION SELECT 1)
SELECT TestColumn
    , CASE
            WHEN TestColumn IS NULL THEN 'It is Null'
        ELSE 'It is Not Null'END AS Regular_Case
    , CASE TestColumn
            WHEN NULL THEN 'It is Null'
        ELSE 'It is Not Null'END AS Search_Case
FROM SampleData;


As you can see, the "Search" did not recognized "NULL" value

Why that was happen?

When SQL Server compares any value to a NULL it returns NULL, which immediately go to the "ELSE" portion of the "CASE" statement.

By knowing that behavior you will avoid these kind of collisions in your CASE statements.

Tuesday, August 25, 2015

Problem of "IsNumeric" Function

IsNumeric is wonderful function, it allows you easily separate numbers and text within a column

Look for the following example:

;WITH SampleData AS (SELECT '1' AS TestColumn UNION SELECT 'It is not a number')
SELECT CASE ISNUMERIC(TestColumn) WHEN 1 THEN CAST(TestColumn AS INT) END NumericColumn
    , CASE ISNUMERIC(TestColumn) WHEN 0 THEN TestColumn END NonNumericColumn
    , ISNUMERIC(TestColumn) AS ISNUMERIC_FUNCTION
FROM SampleData;


However, that not always works. The following script will produce an error.

;WITH SampleData AS (SELECT '1' AS TestColumn UNION SELECT '.' UNION SELECT '-' UNION SELECT '    ')
SELECT CASE ISNUMERIC(TestColumn) WHEN 1 THEN CAST(TestColumn AS FLOAT) END NumericColumn
    , CASE ISNUMERIC(TestColumn) WHEN 0 THEN TestColumn END NonNumericColumn
    , ISNUMERIC(TestColumn) AS ISNUMERIC_FUNCTION
FROM SampleData;

Why?
Just replace CASE Statements and see what happens

;WITH SampleData AS (SELECT '1' AS TestColumn UNION SELECT '.' UNION SELECT '-' UNION SELECT '    ')
SELECT TestColumn, ISNUMERIC(TestColumn) AS ISNUMERIC_FUNCTION
FROM SampleData;


What the hell? Why ISNUMERIC returns "1" for not really numeric strings?

Let's do a research on what SQL Server claims as "numbers":
DECLARE @i INT = 256
WHILE @i > 0
BEGIN
  SET @i -= 1;
    IF @i NOT BETWEEN 48 AND 57 AND    ISNUMERIC(CHAR(@i)) = 1
        PRINT CAST(@i AS VARCHAR) + ' = "' + CHAR(@i) + '"';
END

Now that makes sense.
At first SQL considered as numbers all values, which might represent currency: "¥","¤","£","¢","$","€"

It also considered as numbers characters, which can be used in numbers: ",",".","-","+"

Still it is not very clear why "\", "TAB", "Carriage return", etc. counted as numbers.
These symbols do not impact numbers only if they are used as the very first symbol

The best way to overcome IsNumeric function problem is to use TRY_CONVERT function:

;WITH SampleData AS (SELECT '1' AS TestColumn UNION SELECT '.' UNION SELECT '2A' UNION SELECT '-5' UNION SELECT '0.3')
SELECT TestColumn, ISNUMERIC(TestColumn) AS ISNUMERIC_FUNCTION
    , CASE WHEN TRY_CONVERT(FLOAT, TestColumn) IS NULL THEN 'It Is Not a Number' ELSE 'It Is Number' END AS Is_It_Number
    , TRY_CONVERT(FLOAT, TestColumn)
    , ISNULL(TRY_CONVERT(FLOAT, TestColumn),0)
FROM SampleData;

You can make determination if Cell value numeric or not,
You can immediately convert value to a number with having "Null" for not-numeric values
Or you can replace non-numeric values by "0" or any other number using "ISNULL" function.

Wonderful?
Sure!

But not perfect. TRY_CONVERT function is available only starting SQL Server version 2012.
For old SQL Servers developers still have to write special case statements

Also, do not forget, even if you are running latest SQL Server edition that functionality won't work if your current database is still in old compatibility level
Your compatibility level has to be bigger than 100:

SELECT compatibility_level
FROM sys.databases
WHERE DATABASE_id = DB_ID();

Monday, August 24, 2015

SSIS conversion from Unicode to Non-Unicode characters data set.

During extraction of data from MySQL and moving data to SQL Server database I hit a problem of having extra Unicode characters within VARCHAR fields.

Usually, project managers and lazy team leads make simple decision of using Unicode (NVARCHAR) for SQL Server fields.
As a result, Accounts, Social Security Numbers, and all other 100% non-unicode character fields take double space on disk and in memory. Their arguments are simple: It is easier/faster/cheaper to have all unicodes, than deal with unicode conversion problems.

As a SQL Server consultant, who makes money on performance troubleshooting, I appreciate their intention to make their databases and SQL Server slower. That might be my place of work in the future to make their system more reliable and faster.

However, as Database architect I have to make sure that all columns have appropriate data type and that data type uses minimum storage. For that purpose I have to do conversion of unicode strings to non-unicode.

In this blog post for simplicity I will use SQL server as a data source, but it can be MySQL, Oracle, Text or Excel file or anything else.

Disclaimer:
In my assumption, if you are reading this blog you have pretty good SSIS skills and I do not have to explain how to create a package, data flow task, connection and add script component to your task.

Preparation.

At first, in TestDB database (NON-PRODUCTION) we will create source table "tbl_Test_Unicode", target table "tbl_Test_NonUnicode" and insert some bad data into the source:
USE TestDB
GO
CREATE TABLE tbl_Test_Unicode(UColumn NVARCHAR(10));
GO
CREATE TABLE tbl_Test_NonUnicode(NUColumn VARCHAR(10));
GO
INSERT INTO tbl_Test_Unicode(UColumn) SELECT N'ABC' + CAST(0xC296 AS NVARCHAR) + N'DEF';
GO
SELECT * FROM tbl_Test_Unicode
GO

As a result of that script we will see following:


Then will create Simple SSIS package with only one data flow task:



Problem 1.

Will create OLE DB Source and OLE DB Destination for tables "tbl_Test_Unicode" and "tbl_Test_NonUnicode":

As you can see we immediately have a problem:
"Validation error. Data Flow Task 1: Data Flow Task 1: Columns "UColumn" and "NUColumn" cannot convert between unicode and non-unicode string data types."


That was expected and we will easily solve that problem:

Problem 2.

We solve problem #1 in three easy steps:
(That method is described in more details by Greg Robidoux in MSSQLTips)
First: Add "Data Conversion" task between Source and Destination:
Second: Inside "Data Conversion" task we check "UColumn" column for data Input.
 Then will change data type to "string [DT_STR]" and correct column's length to 10 characters.
 Leave "Output Alias" as "Copy of UColumn" by default and press OK.

 As a third step we open OLE DB destination and redirect new column:

 As a result we have data transformation task without any validation errors:

However, when we try to execute it we immediately have an error during conversion:
How come? Why SSIS can't convert unicode string into non-unicode?

Look at the error:

"[Data Conversion [39]] Error: Data conversion failed while converting column "UColumn" (34) to column "Copy of UColumn" (47).  The conversion returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page."."

SSIS simply can't translate unicode character to the default dataset. That is understandable, but how to get rid of that annoying extra symbol? Why just simply not ignore it?

Problem 3.

Will try to solve problem #1 differently, by adding "Derived Column" task instead of "Data Conversion":
 Will use formula "(DT_STR, 10, 1252) [UColumn]" and give column name for a new column how it was before "Copy of UColumn".
After little adjustments within OLE DB Destination our task has to look like this:

After we try to execute the package we got an error on the conversion step:


Error is little bit different. This time it does not say anything about mismatch, only about truncation:
"[Derived Column [52]] Error: The "component "Derived Column" (52)" failed because truncation occurred, and the truncation row disposition on "output column "Copy of UColumn" (62)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component."

Why it is saying about truncation???
I've tried following code: "(DT_STR,10,1252)SUBSTRING(UColumn,1,4)", which also failed, while "(DT_STR,10,1252)SUBSTRING(UColumn,1,3)" worked perfectly.

That means during "Derived Column" conversion, SSIS can't even correctly recognize the problem.

Solution.

At this point somebody would offer to use a query to convert the data on the source.
That totally works with SQL Server, MySQL and Oracle, but if you have Text or Excel file that would be a problem.
Also, somebody can give up and change column's data type in the source to unicode, or just make a trick and use unicode-based staging table for two step conversion.

However, I wouldn't write this blog if I hadn't have a solution.
In my case to solve that problem I've used "Script Component" - one of the the most scariest components in SSIS.

When you place "Script Component" into "Data Flow" task it immediately asked about type of component you want. In our case we need third option: transformation:

Then put new component between source and destination like this:
Click on the "Script Component" and in "Input Columns" section select input column "UColumn"

Then in "Inputs and Outputs" select "Output Columns" and click "Add Column".  

After new column is added rename it to "NUColumn", change it's type to "string "DT_STR" and fix string length by making it "10"

After that you are ready for "Script" section specify "Visual Basic" and click Edit Script.

When Visual Studio window will open in a procedure called "Input0_ProcessInputRow" add a line of code: "Row.NUColumn = Row.UColumn.ToString()" like this:


Then close Visual Studio and click OK in "Script Component".

 After that "OLE DB Destination" will need some adjustments, but when it is fixed your package should work perfectly:


Package finished without any error, but destination table still has some problems:

The unicode character has been replaced by question mark.
You might be OK with that or you can add one more step to replace it by empty space if you are 100% sure that you do not have question marks in your text.

Anyway, that is not easy, but possible way to get rid of unicode characters in SSIS data source.