If you have no idea what Box-and-Whisker Plot is, please visit following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
At first, I will show how to do it based on AdventureWorks database in SQL Server 2014.
We will analyze amounts of Individual lines of Sales Orders within each month.
The first step is to create a Data Set to process. That Data Set will contain a Month, Single Line amount and order number of that record within a month.
Here is how I did it:
USE AdventureWorks2014 GO IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL DROP TABLE #DataSet; GO SELECT AxisX = CAST(FLOOR(CAST( dateadd(day,1-day(h.OrderDate),h.OrderDate)AS FLOAT)) AS DATETIME) , AxisY = d.LineTotal , RowNum = Row_number() over (partition by CAST( FLOOR(CAST(dateadd(day,1-day(h.OrderDate),h.OrderDate)AS FLOAT)) AS DATETIME) order by d.LineTotal) INTO #DataSet FROM Sales.SalesOrderDetail as d INNER JOIN Sales.SalesOrderHeader as h ON h.SalesOrderID = d.SalesOrderID GO CREATE UNIQUE CLUSTERED INDEX #CUIX_DataSet ON #DataSet([AxisX],[RowNum]); GO
Here is TOP 10 of what I have in my Data Set:
Here is the Main script to extract the data into a Box-and-Whisker Plot:
/* Building a Box-and-Whisker Plot */ DECLARE @Period FLOAT, @Width FLOAT, @Line FLOAT, @g geometry; SELECT @Period = (MAX(AxisY) - MIN(AxisY)) * 2. / COUNT(DISTINCT AxisX) , @Width = @Period / 4, @Line = @Width / 40 FROM #DataSet ;WITH LineCounts as ( SELECT AxisX, SUM(AxisY) as Total, MIN(AxisY) as MinValue, COUNT(*) as OrderLines, COUNT(*) / 2. as OrderMedian, COUNT(*) / 4. as OrderQuarter FROM #DataSet GROUP BY AxisX ), BoxPlot as ( SELECT lc.AxisX, lc.Total, COUNT(*) as Items, MIN(ds.AxisY) as MinValue, MAX(ds.AxisY) as MaxValue, -- Finding Median Value for a Set SUM(CASE WHEN (lc.OrderLines % 2 = 0 and (ds.RowNum = lc.OrderMedian or ds.RowNum = lc.OrderMedian + 1) ) OR (lc.OrderLines % 2 = 1 and ds.RowNum = CEILING(lc.OrderMedian)) THEN ds.AxisY ELSE 0 END) / ( lc.OrderLines % 2 * (-1) + 2) as MedianValue, -- Finding 1st Quartile Value for a Set SUM(CASE WHEN (lc.OrderLines % 4 = 0 and (ds.RowNum = lc.OrderQuarter or ds.RowNum = lc.OrderQuarter + 1) ) OR (lc.OrderLines % 4 > 0 and ds.RowNum = CEILING(lc.OrderQuarter)) THEN ds.AxisY ELSE 0 END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as FirstQValue, -- Finding 3rd Quartile Value for for a Set SUM(CASE WHEN (lc.OrderLines % 4 = 0 and (ds.RowNum = lc.OrderLines - lc.OrderQuarter or ds.RowNum = lc.OrderLines - lc.OrderQuarter + 1) ) OR (lc.OrderLines % 4 > 0 and ds.RowNum = lc.OrderLines - FLOOR(lc.OrderQuarter)) THEN ds.AxisY ELSE 0 END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as ThirdQValue, BoxNum = Row_number() over (order by lc.AxisX) FROM LineCounts as lc INNER JOIN #DataSet as ds ON lc.AxisX = ds.AxisX GROUP BY lc.AxisX, lc.Total, lc.OrderLines, lc.OrderMedian, lc.OrderQuarter ) SELECT AxisX, Items, Total, MinValue, MaxValue, MedianValue, FirstQValue, ThirdQValue, BoxNum, CAST('GEOMETRYCOLLECTION(MULTIPOLYGON((( ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + '))), LINESTRING(' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ') )' as GEOMETRY) FROM BoxPlot ORDER BY AxisX;
Here is the diagram I've got:
Starting Interpretation:
1. We can divide the diagram on three about equal sections by Inter-quartile range: Wide, Medium and Short. All three sections are skewed top.
2. In the first section with Wide Inter-quartile range we see three outliers where data skeweddown.
3. The middle section has pretty stable distribution of first three quartiles
4. In the third, Short Inter-quartile range section, we also have three outliers with very condense data set.
5. All data sets, excluding the very last one, have oulier values at the top.
What else I can say without knowing particular business details just by looking at the data?
Obviously, Box-and-Whisker Plot alone, does not give us anything else.
That means we have to add some other metrics to better understand the business situation.
In that script I include two additional lines: Total amount of sales lines and total number of sale lines:
/* Building a Box-and-Whisker Plot */ DECLARE @Period FLOAT, @Width FLOAT, @Line FLOAT, @Point FLOAT, @MinPoint FLOAT, @Volume FLOAT, @g geometry; SELECT @Period = (MAX(AxisY) - MIN(AxisY)) * 2. / COUNT(DISTINCT AxisX) , @Volume = MAX(AxisY) - MIN(AxisY), @MinPoint = MIN(AxisY) , @Width = @Period / 4, @Line = @Width / 40, @Point = @Period / 20 FROM #DataSet ;WITH LineCounts as ( SELECT AxisX, SUM(AxisY) as Total, SUM(AxisY) / @Volume as VolumeRate, MIN(AxisY) as MinValue, COUNT(*) as OrderLines, COUNT(*) / @Volume as CountRate, COUNT(*) / 2. as OrderMedian, COUNT(*) / 4. as OrderQuarter FROM #DataSet GROUP BY AxisX ), BoxPlot as ( SELECT lc.AxisX, lc.Total, lc.Total / r.VolumeRate + r.MinValue as TotalVolume, lc.OrderLines / r.CountRate + r.MinValue as Counts, COUNT(*) as Items, MIN(ds.AxisY) as MinValue, MAX(ds.AxisY) as MaxValue, -- Finding Median Value for a Set SUM(CASE WHEN (lc.OrderLines % 2 = 0 and (ds.RowNum = lc.OrderMedian or ds.RowNum = lc.OrderMedian + 1) ) OR (lc.OrderLines % 2 = 1 and ds.RowNum = CEILING(lc.OrderMedian)) THEN ds.AxisY ELSE 0 END) / ( lc.OrderLines % 2 * (-1) + 2) as MedianValue, -- Finding 1st Quartile Value for a Set SUM(CASE WHEN (lc.OrderLines % 4 = 0 and (ds.RowNum = lc.OrderQuarter or ds.RowNum = lc.OrderQuarter + 1) ) OR (lc.OrderLines % 4 > 0 and ds.RowNum = CEILING(lc.OrderQuarter)) THEN ds.AxisY ELSE 0 END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as FirstQValue, -- Finding 3rd Quartile Value for for a Set SUM(CASE WHEN (lc.OrderLines % 4 = 0 and (ds.RowNum = lc.OrderLines - lc.OrderQuarter or ds.RowNum = lc.OrderLines - lc.OrderQuarter + 1) ) OR (lc.OrderLines % 4 > 0 and ds.RowNum = lc.OrderLines - FLOOR(lc.OrderQuarter)) THEN ds.AxisY ELSE 0 END) / ( SIGN(lc.OrderLines % 4) * (-1) + 2) as ThirdQValue, BoxNum = Row_number() over (order by lc.AxisX) FROM LineCounts as lc INNER JOIN #DataSet as ds ON lc.AxisX = ds.AxisX INNER JOIN (SELECT MAX(VolumeRate) as VolumeRate, MAX(CountRate) as CountRate, MIN(MinValue) as MinValue FROM LineCounts) as r ON 1 = 1 GROUP BY lc.AxisX, lc.Total, lc.OrderLines, lc.OrderMedian, lc.OrderQuarter, r.VolumeRate, r.MinValue, r.CountRate ), Volumes as ( SELECT Volumes = CAST((SELECT 'POINT(' + CAST(BoxNum * @Period as VARCHAR) + ' ' + CAST(TotalVolume as VARCHAR) + '),' FROM BoxPlot FOR XML PATH('')) as VARCHAR(MAX)) ), VolumeLine as ( SELECT VolumeLine = CAST((SELECT '' + CAST(BoxNum * @Period as VARCHAR) + ' ' + CAST(TotalVolume as VARCHAR) + ',' FROM BoxPlot FOR XML PATH('')) as VARCHAR(MAX)) ), CountLine as ( SELECT CountLine = CAST((SELECT '' + CAST(BoxNum * @Period as VARCHAR) + ' ' + CAST(Counts as VARCHAR) + ',' FROM BoxPlot FOR XML PATH('')) as VARCHAR(MAX)) ) SELECT CAST(AxisX as VARCHAR) as Attribute, Items, Total, MinValue, MaxValue, MedianValue, FirstQValue, ThirdQValue, BoxNum, 'Box Plot' as [Dataset Name], CAST('GEOMETRYCOLLECTION( MULTIPOLYGON(((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(FirstQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MinValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(ThirdQValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Line as VARCHAR) + ' ' + CAST(MaxValue as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MaxValue - @Line as VARCHAR) + ')), ((' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue + @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + ', ' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MinValue - @Line as VARCHAR) + '))), LINESTRING(' + CAST(BoxNum * @Period - @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ', ' + CAST(BoxNum * @Period + @Width as VARCHAR) + ' ' + CAST(MedianValue as VARCHAR) + ') )' as GEOMETRY) FROM BoxPlot UNION ALL SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'Items Count', CAST('GEOMETRYCOLLECTION(LINESTRING(' + LEFT(CountLine,LEN(CountLine)-1)+ '))' as GEOMETRY) FROM CountLine UNION ALL SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'N/A', CAST('LINESTRING(' + CAST(@Line as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ', ' + CAST(@Line * 2 as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ')' as GEOMETRY) FROM Volumes UNION ALL SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'N/A', CAST('LINESTRING(' + CAST(@Line as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ', ' + CAST(@Line * 2 as VARCHAR) + ' ' + CAST(@MinPoint as VARCHAR) + ')' as GEOMETRY) FROM Volumes UNION ALL SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'Volume', CAST('GEOMETRYCOLLECTION(LINESTRING(' + LEFT(VolumeLine,LEN(VolumeLine)-1)+ '))' as GEOMETRY) FROM VolumeLine UNION ALL SELECT Null, Null, Null, Null, Null, Null, Null, Null, Null, 'N/A', CAST('GEOMETRYCOLLECTION(' + LEFT(Volumes,LEN(Volumes)-1)+ ')' as GEOMETRY).STBuffer(@Point) FROM Volumes ORDER BY Attribute;
Here is what I have as a result:
Red line indicates Total Monthly Amount and Blue line indicates total number of items in the data set.
Those two lines proove stability of sales' data in the middle section and lead us to an explanation of Short Inter-quartile range in the third section. As you can see, at the right side of the diagram drammatically increased number of individual sales and outperformed rise of the total amount. That means that sales amount in the third section were generally very small.
From this analysis I can conclude that Box-and-Whisker Plot can help us in certain situations, but does not represent the whole picture without knowing number of records in the individual data set.
At the end, will draw one more diagram based on Celestial data (For more details see my earlier blog post: http://slavasql.blogspot.com/2015/01/skymap.html)
Here is the data set creation for celestial data.
USE Stars_DB GO IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL DROP TABLE #DataSet; GO SELECT RowNum = Row_number() over (partition by LEFT(spect,1) order by CAST(Mag as DECIMAL(7,3))), AxisX = LEFT(spect,1), AxisY = CAST(Mag as DECIMAL(7,3)) INTO #DataSet FROM tbl_Stellar_List WHERE Con != '' and LEFT(spect,1) >= 'a' GO CREATE UNIQUE CLUSTERED INDEX #CUIX_DataSet ON #DataSet([AxisX],[RowNum]); GO
Here is a Box-and-Whisker Plot I've built from it:
From this diagram I can conclude that most of the stars in all spectrums (with couuple of exceptions) are tended to be with about same density level and have obout the same magnitude.
Not having outliers at the bottom just an indication of small number of stars in the data set.
That would be interesting to plot trades for most traded stocks on NYSE.
No comments:
Post a Comment