Thursday, May 26, 2016

Building a Box-and-Whisker Plot in SSMS and Interpreting the results

Recent work with reporting statistical data gave me an idea to use SQL Server Management Studio to build a Box-and-Whisker Plot right in SQL.

If you have no idea what Box-and-Whisker Plot is, please visit following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots

At first, I will show how to do it based on AdventureWorks database in SQL Server 2014.

We will analyze amounts of Individual lines of Sales Orders within each month.

The first step is to create a Data Set to process.  That Data Set will contain a Month, Single Line amount and order number of that record within a month.

Here is how I did it:

```USE AdventureWorks2014
GO
IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL
DROP TABLE #DataSet;
GO
SELECT AxisX = CAST(FLOOR(CAST(
, AxisY = d.LineTotal
, RowNum = Row_number() over (partition by
CAST(
AS DATETIME) order by d.LineTotal)
INTO #DataSet
FROM Sales.SalesOrderDetail as d
ON h.SalesOrderID = d.SalesOrderID
GO
CREATE UNIQUE CLUSTERED INDEX #CUIX_DataSet ON #DataSet([AxisX],[RowNum]);
GO```

Here is TOP 10 of what I have in my Data Set:

Here is the Main script to extract the data into a Box-and-Whisker Plot:
```/* Building a Box-and-Whisker Plot */
DECLARE @Period FLOAT, @Width FLOAT, @Line FLOAT, @g geometry;

SELECT @Period = (MAX(AxisY) - MIN(AxisY)) * 2. / COUNT(DISTINCT AxisX)
, @Width = @Period / 4, @Line = @Width / 40
FROM #DataSet

;WITH LineCounts as (
SELECT AxisX,
SUM(AxisY) as Total,
MIN(AxisY) as MinValue,
COUNT(*) as OrderLines,
COUNT(*) / 2. as OrderMedian,
COUNT(*) / 4. as OrderQuarter
FROM #DataSet GROUP BY AxisX
),
BoxPlot as (
SELECT lc.AxisX,
lc.Total,
COUNT(*) as Items,
MIN(ds.AxisY) as MinValue,
MAX(ds.AxisY) as MaxValue,
-- Finding Median Value for a Set
SUM(CASE
WHEN (lc.OrderLines % 2 = 0 and
(ds.RowNum = lc.OrderMedian or ds.RowNum = lc.OrderMedian + 1) ) OR
(lc.OrderLines % 2 = 1 and ds.RowNum = CEILING(lc.OrderMedian))
THEN ds.AxisY ELSE 0
END) / ( lc.OrderLines % 2 * (-1) + 2) as MedianValue,
-- Finding 1st Quartile Value for a Set
SUM(CASE
WHEN (lc.OrderLines % 4 = 0 and
(ds.RowNum = lc.OrderQuarter or ds.RowNum = lc.OrderQuarter + 1) ) OR
(lc.OrderLines % 4 > 0 and ds.RowNum = CEILING(lc.OrderQuarter))
THEN ds.AxisY ELSE 0
END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as FirstQValue,
-- Finding 3rd Quartile Value for for a Set
SUM(CASE
WHEN (lc.OrderLines % 4 = 0 and
(ds.RowNum = lc.OrderLines - lc.OrderQuarter or
ds.RowNum = lc.OrderLines - lc.OrderQuarter + 1) ) OR
(lc.OrderLines % 4 > 0 and ds.RowNum = lc.OrderLines - FLOOR(lc.OrderQuarter))
THEN ds.AxisY ELSE 0
END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as ThirdQValue,
BoxNum = Row_number() over (order by lc.AxisX)
FROM LineCounts as lc
INNER JOIN #DataSet as ds ON lc.AxisX = ds.AxisX
GROUP BY lc.AxisX, lc.Total, lc.OrderLines, lc.OrderMedian, lc.OrderQuarter
)
SELECT AxisX, Items, Total, MinValue, MaxValue, MedianValue, FirstQValue, ThirdQValue, BoxNum,
CAST('GEOMETRYCOLLECTION(MULTIPOLYGON(((
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ')),
((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(FirstQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(FirstQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MinValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(ThirdQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(ThirdQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MaxValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + '))),

LINESTRING(' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ')
)' as GEOMETRY)
FROM BoxPlot
ORDER BY AxisX;```

Here is the diagram I've got:

Starting Interpretation:
1. We can divide the diagram on three about equal sections by Inter-quartile range: Wide, Medium and Short. All three sections are skewed top.
2. In the first section with Wide Inter-quartile range we see three outliers where data skeweddown.
3. The middle section has pretty stable distribution of first three quartiles
4. In the third, Short Inter-quartile range section, we also have three outliers with very condense data set.
5. All data sets, excluding the very last one, have oulier values at the top.

What else I can say without knowing particular business details just by looking at the data?
Obviously, Box-and-Whisker Plot alone, does not give us anything else.
That means we have to add some other metrics to better understand the business situation.

In that script I include two additional lines: Total amount of sales lines and total number of sale lines:
```/* Building a Box-and-Whisker Plot */
DECLARE @Period FLOAT, @Width FLOAT, @Line FLOAT, @Point FLOAT, @MinPoint FLOAT, @Volume FLOAT, @g geometry;

SELECT @Period = (MAX(AxisY) - MIN(AxisY)) * 2. / COUNT(DISTINCT AxisX)
, @Volume = MAX(AxisY) - MIN(AxisY), @MinPoint = MIN(AxisY)
, @Width = @Period / 4, @Line = @Width / 40, @Point = @Period / 20
FROM #DataSet

;WITH LineCounts as (
SELECT AxisX,
SUM(AxisY) as Total,
SUM(AxisY) / @Volume as VolumeRate,
MIN(AxisY) as MinValue,
COUNT(*) as OrderLines,
COUNT(*) / @Volume as CountRate,
COUNT(*) / 2. as OrderMedian,
COUNT(*) / 4. as OrderQuarter
FROM #DataSet GROUP BY AxisX
),
BoxPlot as (
SELECT lc.AxisX,
lc.Total,
lc.Total / r.VolumeRate + r.MinValue as TotalVolume,
lc.OrderLines / r.CountRate + r.MinValue as Counts,
COUNT(*) as Items,
MIN(ds.AxisY) as MinValue,
MAX(ds.AxisY) as MaxValue,
-- Finding Median Value for a Set
SUM(CASE
WHEN (lc.OrderLines % 2 = 0 and
(ds.RowNum = lc.OrderMedian or ds.RowNum = lc.OrderMedian + 1) ) OR
(lc.OrderLines % 2 = 1 and ds.RowNum = CEILING(lc.OrderMedian))
THEN ds.AxisY ELSE 0
END) / ( lc.OrderLines % 2 * (-1) + 2) as MedianValue,
-- Finding 1st Quartile Value for a Set
SUM(CASE
WHEN (lc.OrderLines % 4 = 0 and
(ds.RowNum = lc.OrderQuarter or ds.RowNum = lc.OrderQuarter + 1) ) OR
(lc.OrderLines % 4 > 0 and ds.RowNum = CEILING(lc.OrderQuarter))
THEN ds.AxisY ELSE 0
END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as FirstQValue,
-- Finding 3rd Quartile Value for for a Set
SUM(CASE
WHEN (lc.OrderLines % 4 = 0 and (ds.RowNum = lc.OrderLines -
lc.OrderQuarter or ds.RowNum = lc.OrderLines - lc.OrderQuarter + 1) ) OR
(lc.OrderLines % 4 > 0 and ds.RowNum = lc.OrderLines - FLOOR(lc.OrderQuarter))
THEN ds.AxisY ELSE 0
END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as ThirdQValue,
BoxNum = Row_number() over (order by lc.AxisX)
FROM LineCounts as lc
INNER JOIN #DataSet as ds ON lc.AxisX = ds.AxisX
INNER JOIN (SELECT MAX(VolumeRate) as VolumeRate, MAX(CountRate) as CountRate, MIN(MinValue) as MinValue FROM LineCounts) as r ON 1 = 1
GROUP BY lc.AxisX, lc.Total, lc.OrderLines, lc.OrderMedian, lc.OrderQuarter, r.VolumeRate, r.MinValue, r.CountRate
),
Volumes as (
SELECT Volumes = CAST((SELECT 'POINT(' + CAST(BoxNum * @Period as VARCHAR) + ' ' + CAST(TotalVolume as VARCHAR) + '),'
FROM BoxPlot FOR XML PATH('')) as VARCHAR(MAX))
),
VolumeLine as (
SELECT VolumeLine = CAST((SELECT '' + CAST(BoxNum * @Period as VARCHAR) + ' ' + CAST(TotalVolume as VARCHAR) + ','
FROM BoxPlot FOR XML PATH('')) as VARCHAR(MAX))
),
CountLine as (
SELECT CountLine = CAST((SELECT '' + CAST(BoxNum * @Period as VARCHAR) + ' ' + CAST(Counts as VARCHAR) + ','
FROM BoxPlot FOR XML PATH('')) as VARCHAR(MAX))
)
SELECT CAST(AxisX as VARCHAR) as Attribute, Items, Total, MinValue, MaxValue, MedianValue, FirstQValue, ThirdQValue, BoxNum, 'Box Plot' as [Dataset Name],
CAST('GEOMETRYCOLLECTION(
MULTIPOLYGON(((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ')),
((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(FirstQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(FirstQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MinValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(ThirdQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(ThirdQValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MaxValue  as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ')),

((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ',
' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + '))),

LINESTRING(' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ',
' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ')
)' as GEOMETRY)
FROM BoxPlot
UNION ALL
SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'Items Count',
CAST('GEOMETRYCOLLECTION(LINESTRING(' + LEFT(CountLine,LEN(CountLine)-1)+ '))' as GEOMETRY)
FROM CountLine
UNION ALL
SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'N/A',
CAST('LINESTRING(' + CAST(@Line as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ',
' + CAST(@Line * 2 as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ')' as GEOMETRY)
FROM Volumes
UNION ALL
SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'N/A',
CAST('LINESTRING(' + CAST(@Line as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ',
' + CAST(@Line * 2 as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ')' as GEOMETRY)
FROM Volumes
UNION ALL
SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'Volume',
CAST('GEOMETRYCOLLECTION(LINESTRING(' + LEFT(VolumeLine,LEN(VolumeLine)-1)+ '))' as GEOMETRY)
FROM VolumeLine
UNION ALL
SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'N/A',
CAST('GEOMETRYCOLLECTION(' + LEFT(Volumes,LEN(Volumes)-1)+ ')' as GEOMETRY).STBuffer(@Point)
FROM Volumes
ORDER BY Attribute;```

Here is what I have as a result:

Red line indicates Total Monthly Amount and Blue line indicates total number of items in the data set.

Those two lines proove stability of sales' data in the middle section and lead us to an explanation of Short Inter-quartile range in the third section. As you can see, at the right side of the diagram drammatically increased number of individual sales and outperformed rise of the total amount. That means that sales amount in the third section were generally very small.

From this analysis I can conclude that Box-and-Whisker Plot can help us in certain situations, but does not represent the whole picture without knowing number of records in the individual data set.

At the end, will draw one more diagram based on Celestial data (For more details see my earlier blog post: http://slavasql.blogspot.com/2015/01/skymap.html)

Here is the data set creation for celestial data.

```USE Stars_DB
GO
IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL
DROP TABLE #DataSet;
GO
SELECT RowNum = Row_number() over (partition by LEFT(spect,1) order by CAST(Mag as DECIMAL(7,3))),
AxisX = LEFT(spect,1),
AxisY = CAST(Mag as DECIMAL(7,3))
INTO #DataSet
FROM tbl_Stellar_List
WHERE Con != '' and LEFT(spect,1) >= 'a'
GO
CREATE UNIQUE CLUSTERED INDEX #CUIX_DataSet ON #DataSet([AxisX],[RowNum]);
GO```
Here I've extracted Magnitude of a star and categorized the by First letter of their spectrum.

Here is a Box-and-Whisker Plot I've built from it:

From this diagram I can conclude that most of the stars in all spectrums (with couuple of exceptions) are tended to be with about same density level and have obout the same magnitude.
Not having outliers at the bottom just an indication of small number of stars in the data set.

That would be interesting to plot trades for most traded stocks on NYSE.