Thursday, December 5, 2019

Handling Forbidden XML characters in SQL Server

That is very known issue that SQL Server's XML does not accept characters "&", "<" and ">".
There are two more forbidden XML characters " ' " and " " " (single and double quotes), but SQL Server mostly accept them.

The common solution is to replace these characters by their codes.
Would say we have a silly sentence: "Anne & Robin collect > "berries" than Jane & Kevin, but < than Ivan & Lucy."

If we try to replace forbidden characters by their codes we get something like this:

It does not look like readable text.

So, here is the solution:

In case you do not care about special character coding and care ONLY about text visual representation you can replace forbidden symbols by their siblings from other Unicode pages:
"&" - "&" (65286)
"<" - "<" (65308)
">" - ">" (65310)
" ' " - " ʹ " (697)
" " " - " ʺ " (698)

Then we can do replacement before converting to XML like this:
DECLARE @MyText VARCHAR(1000) =
'Anne & Robin collect > "berries" than Jane & Kevin, but < than Ivan & Lucy.';
PRINT 'My Text: "' + @MyText + '";';
SET @MyText = REPLACE(REPLACE(REPLACE(REPLACE(@MyText
,'&','&#65286;'),'<','&#65308;'),'>','&#65310;'),'"','&#698;') ;
PRINT 'My Converted XML: "'
+ CAST(CAST('<MyXML>' + @MyText + '</MyXML>' as XML) as VARCHAR(MAX)) + '";';

SELECT CAST('<MyXML>' + @MyText + '</MyXML>' as XML);

The XML results will be like this:

If you try to open that XML in SSMS you'll see it clear:

And if you try to convert it back to VARCHAR you'll get following:

Note: when you convert XML back to VARCHAR, SQL Server will convert forbidden symbols back to the default code page, which might be very convenient.

Caveats:

You might try to replace problematic symbols directly, but in this case you would have to use NVARCHAR data type instead of VARCHAR to preserve Unicode symbols from being auto-converted back to the default code page:
DECLARE @MyText NVARCHAR(1000) =
'Anne & Robin collect > "berries" than Jane & Kevin, but < than Ivan & Lucy.';
SET @MyText =
       REPLACE(REPLACE(REPLACE(REPLACE(@MyText,'&',N''),'<',N''),'>',N''),'"',N'ʺ');
SELECT @MyText FOR XML PATH('MyXML');

In some of the cases you'd be forced to use direct Unicode characters to avoid placement of escape codes. I'd recommend use of NCHAR command to convert your symbols, then you won't loose "Unicode siblings" if you decide to store your SQL Script in a file:
DECLARE @MyText NVARCHAR(1000) =
'Anne & Robin collect > "berries" than Jane & Kevin, but < than Ivan & Lucy.';
SET @MyText = REPLACE(REPLACE(REPLACE(REPLACE(@MyText
,'&',NCHAR(65286)),'<',NCHAR(65308)),'>',NCHAR(65310)),'"',NCHAR(698)) ;
SELECT 1 as Tag, Null as Parent,
       @MyText as [MyXML!1!MyText]
FOR XML EXPLICIT;

Here are the results of that code.


Hope you can use that solution in your work.
Please let me know if you hit any other unexpected issues related to XML symbols, I'd be glad to include more solutions in my blog.

1 comment:

  1. This was very helpful to me to fix the error that my code ran into...thanks again

    ReplyDelete