Monday, August 31, 2015

Why do you need a Unicode?

Databases do not store only transnational data, sometimes that happens that the very big chunk of the data is a text.
You might try to normalize it, but it is not always possible and you dump text into VARCHAR columns.

Sometimes it happen that majority of the data in a huge table are just those text data.
Just imagine, every time you are doing table scan request your server reads all these data from disk into memory to process it.
It is not only the disk issue, these heavy read operations might push out of memory other database components, which may cause even more future reads.

On the top of that problem your database architect decided to use Unicode for all CHARs and VARCHARs...

What are database architect's reasons for using Unicode?
- Customers say they will always put it in English now, but they have worldwide client. What if you want to store a customer name which uses non-English characters? Or the name of a place in another country?
- Unicode supports Scientific, Technical and Mathematical character sets;

- To have ALL text columns in Unicode is much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on;
- It's easier to support Unicode from the beginning, than try to retrofit it later;
- I do not want to put time bomb in the application. It is sound always better to be safe than sorry.

All these arguments are from the category "WHAT IF".
There are sometimes reasonable "WHAT IFs", then that has to be addressed in business requirements, but if business is not require something explicitly, that should be avoided.

I know a lot of people will disagree with me and claim that I'm not experienced enough in international database development.
Just want to answer them that I actually do a lot of international database development experience.
I've started my IT career as non-English speaker in not localized DOS 3.0 environment. I went through a character coding nightmare, which native English speakers can't even imagine and I know what I'm talking about.

Let's start.

You've asked to use Unicode because the company might have clients with foreign names, addresses and product names.

You have to ask the question: Will you have in you have in your database any international charsets from the list below?:
- Indian: "কখগঘঙচছজঝঞট";
- Arabic: "ݐݑݒݓݔݕݖݗݘݙ";
- Georgian: "ᦍᦎᦏᦐᦑᦒᦓᦔᦕᦖᦗᦘᦙ";
- Korean: "ᄒᄓᄔᄕᄖᄗᄘᄙᄚᄛᄜᄝᄞᄟ"
- Japanese: "༡༢༣༤༥༦༧༨༩༪༫༬༭༮༯";
- Chinese: "樂洛烙珞落酪駱亂卵欄爛蘭";
- Braille: "⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙"
- Mathematical: "∭∛∜∮∯∰∱∲≉≊≋≌≍";

If his/her answer will be: "- No, we must not allow any of those to be allowed as Names, Addresses or Products. Otherwise nobody will be able to read it!"

Hold on, but you can have a business in Canada, Mexico, Germany or any other country, which uses Latin alphabet with extra non-English characters. Do you still need Unicode?

Here is a demo:
    [Language_ID] TINYINT IDENTITY (1,1),
    [Language] VARCHAR(20) NOT NULL,
    [Unicode] NVARCHAR(100) NOT NULL,
    [NonUnicode] VARCHAR((100) NULL
tbl_Test_Unicode([Language], [Unicode])
VALUES ('Spanish',N'ÁÉÍÑÓÚÜ¿¡áéíñóúü'),
UPDATE tbl_Test_Unicode
SET [NonUnicode] = [Unicode];
    CASE [NonUnicode] WHEN [Unicode] THEN 'Y' ELSE 'N' END "Are they equal?"
FROM tbl_Test_Unicode;
DROP TABLE tbl_Test_Unicode;

Here is a result:

The secret is that default SQL Server Collation and code page already contain all necessary characters. You can use following script to prove it:
DECLARE @b TINYINT = 0, @String VARCHAR(300) = '';
WHILE @b < 255 SELECT @b += 1, @String += CASE @b % 32 WHEN 0 THEN CHAR(10) ELSE '' END + CHAR(@b);
PRINT @String;

As you can see Spanish, French, German, Norwegian, Italian are NOT REQUIRE Unicode!
Discrimination starts with Eastern Europe languages, but they are translated pretty well and still readable.
Problem only starts with Cyrillic, Arabic, Chinese, etc.
Their symbols are replaced by question marks, but 99% of people in your company can't read it anyway, right?

Is that acceptable? Can you live with that?

If answer is "Yes" - you are lucky: can use Non-Unicode and save a lot of disk space!

Friday, August 28, 2015

Two and a half ways to do "Safe" delete.

Sometimes you have to manually delete some data in your database and sometimes things go wrong.

Once I had to delete the latest period from production database, but simply did not highlighted WHERE clause and executed statement for deletion of ALL data within the table.
Fortunately for me, table was huge and after 5 minutes of waiting I realized the problem and cancelled the transaction. Since then I'm using ONLY "Safe" deletes.

I put word "safe" in quotes because there is no safe way to delete. If you crazy enough you can skip all safety rules and still make a damage, but if you know what you are doing couple of my hints can save your day and maybe your resume from update.

The simplest way to make your deletion safe is to make sure you are deleting what you want before the action. That means you just do a select with the same WHERE clause parameters as you plan do it in the delete statement.

For example:

USE [AdventureWorks2014];
FROM dbo.DatabaseLog WHERE DatabaseLogID = 1;

At first, you highlight the select statement and only if it returns what you want to delete you highlight it starting from the word DELETE. That you will be 100% sure that you are using the exactly same parameters in the WHERE clause.

For medium data sets you might use record count to ensure correct deletion. For that reason I specified "SET NOCOUNT OFF" in case it turned ON.

The second "Safe" way to delete is using TRANSACTIONS. In this case you verify your deletion after the fact and COMMIT your transaction if everything is OK and ROLLBACK it if not.

USE [AdventureWorks2014];

WHERE DatabaseLogID = 1


OUTPUT clause in that command returns you the list of deleted records. Then you make the decision if you like it or not and then execute the following COMMIT or ROLLBACK.

That method is little safer than the first, but still not perfect.
OUTPUT clause has some limitations. It can't run if:
- Table has enabled triggers;
- Table is involved in Foreign Key relationships on either side;
- Table has check constraints or enabled rules.

In these cases you would have to combine first and second method together like this:

USE [AdventureWorks2014];

FROM dbo.DatabaseLog  
WHERE DatabaseLogID = 1;

-- DELETE FROM dbo.DatabaseLog WHERE DatabaseLogID = 1


There were the demonstrations of two safety tactics. What is "a half"?

The "half" was embedded into these queries and maybe you already noticed that.
Just as a general rule: Put double dashes in front of all irreversible actions  such as DELETE, UPDATE, COMMIT.

That will prevent any unexpected unfortunate event ещ happen in case you run the script in the window without selection.

Decimal vs Money. What consumes less space?

I love to use SMALLMONEY data type. it holds decimal amounts up to about 200K and have 4 digits in the scale and for all these features it consumes only four bytes, same as a regular integer.

But "WHAT IF?"....

What if your system will hit a value over the magic number?
What if you specified it for a "Price" columns and inflation in your country has driven prices to the roof?

Would you use MONEY data type instead?

Yes, you can, but in this case your column will consume four additional bytes. If your table is small it it nothing to worry about, but if it is huge.

Here are the data type limits for MONEY data types:

Let's try to use DECIMAL data type instead.
Here are the data type limits for DECIMAL/NUMERIC data types:

As you can see, SMALLMONEY data type in translation to DECIMAL is already goes to Precision of 10 and accordingly to the second line will consume 9 bytes instead of 4. That is even more than MONEY.

Is it the end?

Actually not.

You will be surprised, that DECIMAL value of "429496.7295", which can not fit into SMALLMONEY still uses 5 bytes!

Now, would say your currency has only 2 digits after decimal point and you do not need four digits, which SMALLMONEY provides.
Then you can go beyond and your "5 bytes Limit" provides ability to store a number up to 42,949,672.95!

42 Millions!!!!

Would say you need a very good precision for financial operations and still need four digits.

You have a contradiction: You need to keep size as small as possible, but it might cause an issue.
In that case you can use data type from the second decimal line "DECIMAL(19,4)"!

Even though MSDN says it consumes 9 bytes it still consumes only 5 until it reaches magic number of  "429,496.7295".

Here is the proof:
DECLARE @d DECIMAL(19,4) = 429496.7295

So, if you plan to have mostly numbers below of SMALLMONEY limit, but sometimes little bit higher than it might worth to use "DECIMAL(19,4)" instead. Yes, you will consume 25% more space, but you will be ensured from any surprises and that is much smaller insurance than 100% of MONEY data type

Wednesday, August 26, 2015

Caveats of "CASE" statement testing for "NULL" values.

"CASE" statement syntax seems pretty easy.

There is only two different cases of usage:

Here are couple of samples
;WITH SampleData AS (
    SELECT 'AZ' AS TestColumn1, 1 AS TestColumn2
SELECT TestColumn1,
        WHEN LEFT(TestColumn1,1) = 'A' THEN TestColumn2*10
        WHEN LEFT(TestColumn1,1) = 'B' THEN TestColumn2*100
        WHEN LEFT(TestColumn1,1) = 'C' THEN TestColumn2*1000
   END AS TestMultiplication
FROM SampleData;

;WITH SampleData AS (
    SELECT 'AZ' AS TestColumn1, 1 AS TestColumn2
SELECT TestColumn1,
    CASE LEFT(TestColumn1,1)
        WHEN 'A' THEN TestColumn2*10
        WHEN 'B' THEN TestColumn2*100
        WHEN 'C' THEN TestColumn2*1000
   END AS TestMultiplication
FROM SampleData;

These two scripts produce the same result, but the second one is much smaler.
Moreover, if instead of "LEFT" function you are using user defined function (which BTW is extremely bad practice for large data sets) then that function will be executed individually for every "WHEN" clause.

Is it the reason to use shorter version?

Usually it is not. In the second version you loose ability to search by ranges and able to search for only one parameter/field.

And that is not everything.

In certain situations it even can produce wrong results:

SELECT TestColumn
    , CASE
            WHEN TestColumn IS NULL THEN 'It is Null'
        ELSE 'It is Not Null'END AS Regular_Case
    , CASE TestColumn
            WHEN NULL THEN 'It is Null'
        ELSE 'It is Not Null'END AS Search_Case
FROM SampleData;

As you can see, the "Search" did not recognized "NULL" value

Why that was happen?

When SQL Server compares any value to a NULL it returns NULL, which immediately go to the "ELSE" portion of the "CASE" statement.

By knowing that behavior you will avoid these kind of collisions in your CASE statements.

Tuesday, August 25, 2015

Problem of "IsNumeric" Function

IsNumeric is wonderful function, it allows you easily separate numbers and text within a column

Look for the following example:

;WITH SampleData AS (SELECT '1' AS TestColumn UNION SELECT 'It is not a number')
    , CASE ISNUMERIC(TestColumn) WHEN 0 THEN TestColumn END NonNumericColumn
FROM SampleData;

However, that not always works. The following script will produce an error.

    , CASE ISNUMERIC(TestColumn) WHEN 0 THEN TestColumn END NonNumericColumn
FROM SampleData;

Just replace CASE Statements and see what happens

FROM SampleData;

What the hell? Why ISNUMERIC returns "1" for not really numeric strings?

Let's do a research on what SQL Server claims as "numbers":
DECLARE @i INT = 256
WHILE @i > 0
  SET @i -= 1;
    IF @i NOT BETWEEN 48 AND 57 AND    ISNUMERIC(CHAR(@i)) = 1
        PRINT CAST(@i AS VARCHAR) + ' = "' + CHAR(@i) + '"';

Now that makes sense.
At first SQL considered as numbers all values, which might represent currency: "¥","¤","£","¢","$","€"

It also considered as numbers characters, which can be used in numbers: ",",".","-","+"

Still it is not very clear why "\", "TAB", "Carriage return", etc. counted as numbers.
These symbols do not impact numbers only if they are used as the very first symbol

The best way to overcome IsNumeric function problem is to use TRY_CONVERT function:

    , CASE WHEN TRY_CONVERT(FLOAT, TestColumn) IS NULL THEN 'It Is Not a Number' ELSE 'It Is Number' END AS Is_It_Number
    , TRY_CONVERT(FLOAT, TestColumn)
    , ISNULL(TRY_CONVERT(FLOAT, TestColumn),0)
FROM SampleData;

You can make determination if Cell value numeric or not,
You can immediately convert value to a number with having "Null" for not-numeric values
Or you can replace non-numeric values by "0" or any other number using "ISNULL" function.


But not perfect. TRY_CONVERT function is available only starting SQL Server version 2012.
For old SQL Servers developers still have to write special case statements

Also, do not forget, even if you are running latest SQL Server edition that functionality won't work if your current database is still in old compatibility level
Your compatibility level has to be bigger than 100:

SELECT compatibility_level
FROM sys.databases

Monday, August 24, 2015

SSIS conversion from Unicode to Non-Unicode characters data set.

During extraction of data from MySQL and moving data to SQL Server database I hit a problem of having extra Unicode characters within VARCHAR fields.

Usually, project managers and lazy team leads make simple decision of using Unicode (NVARCHAR) for SQL Server fields.
As a result, Accounts, Social Security Numbers, and all other 100% non-unicode character fields take double space on disk and in memory. Their arguments are simple: It is easier/faster/cheaper to have all unicodes, than deal with unicode conversion problems.

As a SQL Server consultant, who makes money on performance troubleshooting, I appreciate their intention to make their databases and SQL Server slower. That might be my place of work in the future to make their system more reliable and faster.

However, as Database architect I have to make sure that all columns have appropriate data type and that data type uses minimum storage. For that purpose I have to do conversion of unicode strings to non-unicode.

In this blog post for simplicity I will use SQL server as a data source, but it can be MySQL, Oracle, Text or Excel file or anything else.

In my assumption, if you are reading this blog you have pretty good SSIS skills and I do not have to explain how to create a package, data flow task, connection and add script component to your task.


At first, in TestDB database (NON-PRODUCTION) we will create source table "tbl_Test_Unicode", target table "tbl_Test_NonUnicode" and insert some bad data into the source:
CREATE TABLE tbl_Test_Unicode(UColumn NVARCHAR(10));
CREATE TABLE tbl_Test_NonUnicode(NUColumn VARCHAR(10));
INSERT INTO tbl_Test_Unicode(UColumn) SELECT N'ABC' + CAST(0xC296 AS NVARCHAR) + N'DEF';
SELECT * FROM tbl_Test_Unicode

As a result of that script we will see following:

Then will create Simple SSIS package with only one data flow task:

Problem 1.

Will create OLE DB Source and OLE DB Destination for tables "tbl_Test_Unicode" and "tbl_Test_NonUnicode":

As you can see we immediately have a problem:
"Validation error. Data Flow Task 1: Data Flow Task 1: Columns "UColumn" and "NUColumn" cannot convert between unicode and non-unicode string data types."

That was expected and we will easily solve that problem:

Problem 2.

We solve problem #1 in three easy steps:
(That method is described in more details by Greg Robidoux in MSSQLTips)
First: Add "Data Conversion" task between Source and Destination:
Second: Inside "Data Conversion" task we check "UColumn" column for data Input.
 Then will change data type to "string [DT_STR]" and correct column's length to 10 characters.
 Leave "Output Alias" as "Copy of UColumn" by default and press OK.

 As a third step we open OLE DB destination and redirect new column:

 As a result we have data transformation task without any validation errors:

However, when we try to execute it we immediately have an error during conversion:
How come? Why SSIS can't convert unicode string into non-unicode?

Look at the error:

"[Data Conversion [39]] Error: Data conversion failed while converting column "UColumn" (34) to column "Copy of UColumn" (47).  The conversion returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page."."

SSIS simply can't translate unicode character to the default dataset. That is understandable, but how to get rid of that annoying extra symbol? Why just simply not ignore it?

Problem 3.

Will try to solve problem #1 differently, by adding "Derived Column" task instead of "Data Conversion":
 Will use formula "(DT_STR, 10, 1252) [UColumn]" and give column name for a new column how it was before "Copy of UColumn".
After little adjustments within OLE DB Destination our task has to look like this:

After we try to execute the package we got an error on the conversion step:

Error is little bit different. This time it does not say anything about mismatch, only about truncation:
"[Derived Column [52]] Error: The "component "Derived Column" (52)" failed because truncation occurred, and the truncation row disposition on "output column "Copy of UColumn" (62)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component."

Why it is saying about truncation???
I've tried following code: "(DT_STR,10,1252)SUBSTRING(UColumn,1,4)", which also failed, while "(DT_STR,10,1252)SUBSTRING(UColumn,1,3)" worked perfectly.

That means during "Derived Column" conversion, SSIS can't even correctly recognize the problem.


At this point somebody would offer to use a query to convert the data on the source.
That totally works with SQL Server, MySQL and Oracle, but if you have Text or Excel file that would be a problem.
Also, somebody can give up and change column's data type in the source to unicode, or just make a trick and use unicode-based staging table for two step conversion.

However, I wouldn't write this blog if I hadn't have a solution.
In my case to solve that problem I've used "Script Component" - one of the the most scariest components in SSIS.

When you place "Script Component" into "Data Flow" task it immediately asked about type of component you want. In our case we need third option: transformation:

Then put new component between source and destination like this:
Click on the "Script Component" and in "Input Columns" section select input column "UColumn"

Then in "Inputs and Outputs" select "Output Columns" and click "Add Column".  

After new column is added rename it to "NUColumn", change it's type to "string "DT_STR" and fix string length by making it "10"

After that you are ready for "Script" section specify "Visual Basic" and click Edit Script.

When Visual Studio window will open in a procedure called "Input0_ProcessInputRow" add a line of code: "Row.NUColumn = Row.UColumn.ToString()" like this:

Then close Visual Studio and click OK in "Script Component".

 After that "OLE DB Destination" will need some adjustments, but when it is fixed your package should work perfectly:

Package finished without any error, but destination table still has some problems:

The unicode character has been replaced by question mark.
You might be OK with that or you can add one more step to replace it by empty space if you are 100% sure that you do not have question marks in your text.

Anyway, that is not easy, but possible way to get rid of unicode characters in SSIS data source.

Wednesday, August 19, 2015

"Altering Computed Column" Bug or Feature?

At some point I wanted to change a formula within calculated column.

At first, I've tried to use ALTER TABLE ... ALTER COLUMN to do so, but got a syntax error.

Then I've tried to look in MSDN and there is mentioning of "<computed_column_definition>" only in "ADD" column section and nothing in the "ALTER COLUMN"

I was tied in time and simply dropped the column and then created a new one instead.

What the problem?

Now I think that is not a SQL Server bug but a feature, which protects persisted data in computed caolumn.

Would say SQL allow me to alter the defining formula for a computed column. It might easily happen that some of calculated values would be created by one rule and some by another, which is completely not acceptable.

However, that is true only for Persisted calculated column, which has it's values physically stored within a table and it has nothing with not-Persisted one, which calculated on the fly.

In my assumption Microsoft developers just simplified their job by not segregating Persisted and not-Persisted calculated columns and not allowing to alter any computed column at all.

Monday, August 17, 2015

Decoding COLUMNS_UPDATED Function's results

From MSDN:
"Returns a varbinary bit pattern that indicates the columns in a table or view that were inserted or updated. COLUMNS_UPDATED is used anywhere inside the body of a Transact-SQL INSERT or UPDATE trigger to test whether the trigger should execute certain actions. "

Everything looks easy and cool until you have very wide table and you want to know exactly which column was changed.

COLUMNS_UPDATED() will return strange binary numbers, which represent updated column IDs.

Here I'll present a script for a trigger to get a list of all updated columns:

CREATE TRIGGER <test_trigger> ON <test_Table>


/* Here you capture Binary coded list of changed columns */

WHILE @k < 128
  WHILE @b > 0
    IF CAST(SUBSTRING(@Changed_Columns,@k,1) as SMALLINT) & @m = @m
      INSERT INTO @t(Column_ID) VALUES (@i);
    SELECT @i += 1, @b -= 1, @m *= 2;
  SELECT @b = 8, @m = 1, @k += 1;

/* Here you extract list of fields from the schema */
SELECT as Modified_Column
FROM sys.triggers as r
INNER JOIN sys.columns as c
  ON c.object_id = r.parent_id
INNER JOIN @t as t ON t.column_id = c.column_id
WHERE r.object_id = @@PROCID;

Note: Because SQL Server supports up to 1024 columns in single table my assumption is that output of COLUMNS_UPDATED() can not be larger than 128 bytes.

Sunday, August 16, 2015

SSMS Ctrl+F1 and Ctrl+R keyboard shortcuts do not work

Had that issue with SQL Server Management Studio several times and now I hit it with SQL server 2016.

Wanted to make a post because I've found an active bug report "SSMS Ctrl+F1 keyboard shortcut doesn't work in 2014"

It seems to me Microsoft does not want to put a lot of effort in SSMS improvements.
There is a long list of problems associated with it and MS put them at lowest priorities.
Here is a short list I've personally had to deal with:
- Inability to Export/Import user defined query shortcuts;
- Ctrl+R button stopped working after applying Service pack and this bug goes with all newer versions and some SSMS updates;
- Use of deprecated functionality;

Now I want to cover "Ctrl+F" problem.
1. Problem description:
- In SSMS go to "Tools" -> "Options" -> "Environment" -> "Keyboard" -> "Query Shortcuts".
There you can develop and store your own 32K long script and run it by pressing of combination of "Ctrl+Number" buttons.

2. After you set your shortcuts press "OK" and open new query tab. Your queries supposed to work except "Ctrl+F1". That is a reserved key for Help Viewer.

3. To fix it simply go this way: "Tools" -> "Options" -> "Environment" -> "Keyboard" -> "Keyboard"

4. Within "Press Shortcut keys:" box press "Ctrl+F1" and you'll get the default assigned item for that combination: "Help.ViewHelp (Ctrl+F1 (Global))".

5. Press "Assign" button. Then current assignment will disappear.

6. Then Press "Remove" button. That will remove assignment for "Ctrl+F1".

7. Then choose "Global" in "Use new shortcut in:" drop down box, within "Press Shortcut keys:" box press "Ctrl+F1", in "Show commands containing" put "Query.CustomSP1" and press "Assign" button.

8. After that press "OK" button and you should open new query tab to try your "Ctrl+F1" query to work.

9. The similar solution I've used for "Ctrl+R".
You have to find "Window.ShowResultsPane" option and assign it like this: