Column Store Indexes and Batch Processing Mode
Column Store Index and Batch Mode
Scalability
About me . . .
An independent SQL consultant; a user of SQL Server from version 2000 onwards, with 12+ years' experience.
The scalability challenges we face . . . .
Slides borrowed from Thomas Kejser with his kind permission
CPU Cache, Memory and IO Subsystem Latency
[Diagram: four CPU cores, each with its own L1 and L2 cache, sharing a single L3 cache; approximate latencies run from 1 ns and 10 ns (CPU caches) through 100 ns (memory) out to 10 us, 100 us and 10 ms (IO subsystem).]
The “Cache out” Curve
[Chart: throughput vs. touched data size, stepping down as the working set outgrows the CPU caches, the TLB, NUMA-remote memory and finally spills to storage.]
Every time we drop out of a cache and use the next slower one down, we pay a big throughput penalty.
Sequential Versus Random Page CPU Cache Throughput
[Chart: service time + wait time against size of accessed memory (0-32,000 MB) for single page, sequential page and random page access; throughput falls away sharply once the accessed memory no longer fits in cache.]
“Transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
Spinning disk state of play:
Interfaces have evolved
Areal density has increased
Rotation speed has peaked at 15K RPM
Not much else . . .
Up until NAND flash, disk-based IO subsystems had not kept pace with CPU advancements.
With next generation storage (resistive RAM etc.) CPUs and storage may follow the same curve.
Moore's Law Vs. Advancements In Disk Technology
How Execution Plans Run
How do rows travel between iterators?
[Diagram: control flow passes down the plan tree while data flows back up, one row at a time between iterators.]
What Is Required
Query execution which leverages CPU caches.
Breakthrough levels of compression to bridge the performance gap between IO subsystems and modern processors.
Better query execution scalability as the degree of parallelism increases.
Optimizer Batch Mode
First introduced in SQL Server 2012, greatly enhanced in 2014. A batch is roughly 1,000 rows in size and is designed to fit into the L2/L3 cache of the CPU (remember the slide on latency). Moving batches around is very efficient*:
One test showed that a regular row-mode hash join consumed about 600 instructions per row, while the batch-mode hash join needed about 85 instructions per row, and in the best case (small, dense join domain) was as low as 16 instructions per row.
* From: Enhancements To SQL Server Column Stores, Microsoft Research
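Whether an operator actually ran in batch mode can be confirmed from the actual execution plan; a minimal check, assuming the AdventureWorksDW tables used throughout this deck:

```sql
-- Request the actual plan XML; each operator's RelOp element carries
-- EstimatedExecutionMode / ActualExecutionMode = "Row" or "Batch"
SET STATISTICS XML ON;

SELECT p.EnglishProductName, SUM(f.SalesAmount)
FROM dbo.FactInternetSales AS f
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName;

SET STATISTICS XML OFF;
```

Note that in SQL Server 2012/2014 batch mode is only considered when one of the tables has a column store index and the plan is parallel (DOP 2 or higher).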
Stack Walking The Database Engine
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperf -on base -stackwalk profile
xperf -d stackwalk.etl
xperfview stackwalk.etl
How do we squeeze an entire column store index into a CPU L2/3 cache?
Answer: it's pipelined into the CPU.
[Diagram: segments are loaded from storage into the blob cache; blobs are then broken into batches and pipelined into the CPU cache.]
Conceptual View . . . and what's happening in the call stack
What Difference Does Batch Mode Make?
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesFio] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
[Chart: elapsed time (s), row mode vs. batch mode.]
Row mode hash match aggregate: 445,585 ms*
Batch mode hash match aggregate: 78,400 ms*
A 12x improvement at DOP 2.
* Timings are a statistical estimate
Optimizing Serial Scan Performance
Compressing data going down the column is far superior to compressing data going across the row; also, we only retrieve the column data that is of interest. Run length encoding is used in order to achieve this.
Colour column: Red, Red, Blue, Blue, Green, Green, Green
Dictionary (lookup ID, label): (1, Red), (2, Blue), (3, Green)
Segment (lookup ID, run length): (1, 2), (2, 2), (3, 3)
SQL Server 2012 introduced column store compression; SQL Server 2014 adds more features to this.
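The two releases expose this storage differently, as the feature comparison later in the deck shows; a minimal sketch of both syntaxes (the index names are illustrative):

```sql
-- SQL Server 2012: nonclustered column store index (renders the table read-only)
CREATE NONCLUSTERED COLUMNSTORE INDEX ncsi_FactInternetSales
ON dbo.FactInternetSales (ProductKey, OrderQuantity, UnitPrice, SalesAmount);

-- SQL Server 2014: clustered column store index (updateable, becomes the table's storage)
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_FactInternetSales
ON dbo.FactInternetSales;
```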
SQL Server 2014 Column Store Storage Internals
[Diagram: rows are sliced into row groups; each column (A, B, C) within a row group is encoded and compressed into a segment, and segments are stored as blobs.]
Delta stores:
Inserts of 1,048,576 rows and over are encoded and compressed straight into column store segments.
Inserts of fewer than 1,048,576 rows land in a delta store (a B-tree); an update = an insert into the delta store + an entry in the deletion bitmap.
Column Store Index Split Personality
Compressed column store segments, with their global and local dictionaries and a deletion bitmap, on one side; B-tree delta stores on the other, with the tuple mover compressing full delta stores into the column store.
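This split between delta stores and compressed segments can be observed in SQL Server 2014 through a catalog view; a quick look, assuming the enlarged fact table built below:

```sql
-- One row per row group: COMPRESSED row groups are column store segments,
-- OPEN/CLOSED row groups are delta stores awaiting the tuple mover
SELECT row_group_id,
       state_description,   -- OPEN, CLOSED or COMPRESSED
       total_rows,
       deleted_rows         -- rows flagged in the deletion bitmap
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactInternetSalesBig');
```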
SELECT [ProductKey]
      ,[OrderDateKey]
      ,[DueDateKey]
      ,[ShipDateKey]
      ,[CustomerKey]
      ,[PromotionKey]
      ,[CurrencyKey]
      ..
INTO FactInternetSalesBig
FROM [dbo].[FactInternetSales]
CROSS JOIN master..spt_values AS a
CROSS JOIN master..spt_values AS b
WHERE a.type = 'p'
AND b.type = 'p'
AND a.number <= 80
AND b.number <= 100
What Levels Of Compression Are Achievable? Our ‘Big’ FactInternetSales Table
494,116,038 rows
[Chart: uncompressed size vs. size (MB) after column store compression, showing compression savings of 57%, 74%, 92% and 94% across the test cases.]
Does Column Sort Order Affect Compression?
SELECT a.number
INTO OrderedSequence
FROM master..spt_values AS a
CROSS JOIN master..spt_values AS b
CROSS JOIN master..spt_values AS c
WHERE c.number <= 57
ORDER BY a.number
1.5 billion rows, 39,233.86 MB uncompressed; 17.85 MB after column store compression.

SELECT a.number
INTO RandomSequence
FROM master..spt_values AS a
CROSS JOIN master..spt_values AS b
CROSS JOIN master..spt_values AS c
WHERE c.number <= 57
ORDER BY NEWID()
18.48 MB after column store compression.
SQL Server 2012 / 2014 Column Store Comparison

Feature                                                                          | 2012 | 2014
Column store indexes                                                             | Yes  | Yes
Clustered column store indexes                                                   | No   | Yes
Updateable column store indexes                                                  | No   | Yes
Column store archive compression                                                 | No   | Yes
Columns in a column store index can be dropped                                   | No   | Yes
Support for GUID, binary, datetimeoffset precision > 2, numeric precision > 18   | No   | Yes
Enhanced compression by storing short strings natively (instead of 32 bit IDs)   | No   | Yes
Bookmark support (row_group_id:tuple_id)                                         | No   | Yes
Mixed row / batch mode execution                                                 | No   | Yes
Optimized hash build and join in a single iterator                               | No   | Yes
Hash memory spills handled without reverting to row mode execution               | No   | Yes
Batch mode iterators: scan, filter, project, hash (inner) join, (local) hash aggregate | Yes | Yes
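Archive compression, new in 2014, is applied per index (or partition) at build or rebuild time; a minimal sketch, reusing the illustrative index name from earlier:

```sql
-- Squeeze cold data harder with archive compression...
ALTER INDEX ccsi_FactInternetSales ON dbo.FactInternetSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- ...and revert to standard column store compression when the data turns hot again
ALTER INDEX ccsi_FactInternetSales ON dbo.FactInternetSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE);
```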
Column Store Index and Batch Mode Test Drive
Disclaimer: your own mileage may vary depending on your data, hardware and queries.
Test Set Up
Hardware:
2 x 2.0 GHz 6 core Xeon CPUs, hyper-threading enabled
22 GB memory
Raid 0: 6 x 250 GB SATA III HD 10K RPM
Raid 0: 3 x 80 GB Fusion IO
Software:
Windows Server 2012
SQL Server 2014 CTP 2
AdventureWorksDW DimProduct table
Enlarged FactInternetSales table
Sequential Scan Performance
SELECT SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales]
[Chart: compression type vs. time (ms); scan rates of 2,050 MB/s, 678 MB/s and 256 MB/s at 85%, 98% and 98% CPU respectively.]
No compression: 545,761 ms*
Page compression: 1,340,097 ms*
* All stack trace timings are a statistical estimate
[Chart: elapsed time (ms) by column store compression type: hdd cstore, hdd cstore archive, flash cstore, flash cstore archive; 52 MB/s at 99% CPU and 27 MB/s at 56% CPU.]
Clustered column store index: 60,651 ms
Clustered column store index with archive compression: 61,196 ms
Testing Join Scalability
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
We will look at the best we can do without column store indexes:
A partitioned heap fact table with page compression for spinning disk
A partitioned heap fact table without any compression on our flash storage
Non-partitioned column store indexes on both types of store, with and without archive compression
[Chart: Join Scalability, degree of parallelism (2-24) vs. time (ms): HDD page compressed partitioned fact table vs. flash partitioned fact table.]
[Chart: Join Scalability, degree of parallelism (2-24) vs. time (ms): hdd column store, hdd column store archive, flash column store, flash column store archive.]
Diving Deeper into Batch Mode Scalability
Scaling Multi Threaded Workloads 101
A SQL Server workload should scale up to the limits of the hardware, such that:
All CPU capacity is exhausted, or
All storage IOPS bandwidth is exhausted
As concurrency increases, we need to watch out for “the usual suspects” that can throttle throughput back:
Lock Contention
Latch Contention
Spinlock Contention
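Latch and spinlock pressure of the kind charted next can be sampled from two DMVs; the usual approach is a before/after snapshot around a test run:

```sql
-- Latch waits accumulated since the instance started
SELECT latch_class, waiting_requests_count, wait_time_ms
FROM sys.dm_os_latch_stats
ORDER BY wait_time_ms DESC;

-- Spinlock activity: spins and collisions per spinlock type
SELECT name, collisions, spins, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```

Both views report instance-wide cumulative counters, so the difference between two snapshots isolates the contention generated by the test itself.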
[Chart: Average CPU Utilisation and Elapsed Time (ms) / Degree of Parallelism (2-24).]
[Chart: Average CPU Utilisation and Waiting Latch Request Count / Degree of Parallelism (2-24).]
[Chart: Average CPU Utilisation and Spinlock Spin Count (1000s) / Degree of Parallelism (2-24).]
Takeaways
CPU used for IO consumption + CPU used for decompression < total CPU capacity: compression works for you (what most people tend to have).
CPU used for IO consumption + CPU used for decompression > total CPU capacity: compression works against you.
CPU used for IO consumption + CPU used for decompression = total CPU capacity: nothing to be gained or lost from using compression.
Takeaways
Compression:
No significant difference in performance between column store compression and column store archive compression.
Pre-sorting the data makes little difference to compression ratios.
Batch mode:
Provides a tremendous performance boost with just two schedulers.
Does not provide linear scalability with the hardware available.
Does provide an order of magnitude increase in JOIN performance.
Performs marginally better with column store indexes which do not use archive compression.
Further Reading
“Enhancements To Column Store Indexes (SQL Server 2014)”, Microsoft Research
“SQL Server Clustered Columnstore Tuple Mover”, Remus Rusanu
“SQL Server Columnstore Indexes at TechEd 2013”, Remus Rusanu
“The Effect of CPU Caches and Memory Access Patterns”, Thomas Kejser
Thanks To My Reviewer and Contributor
Thomas Kejser
Former SQL CAT member and CTO of Livedrive
Contact Details
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Questions ?