Column Store Indexes and Batch Processing Mode
Column Store Index and Batch Mode
Scalability
About me . . .
An independent SQL consultant; a user of SQL Server from version 2000 onwards, with 12+ years' experience.
The scalability challenges we face . . . .
Slides borrowed from Thomas Kejser with his kind permission
CPU Cache, Memory and IO Subsystem Latency
[Diagram: four CPU cores, each with its own L1 and L2 cache, sharing a single L3 cache; approximate latencies run from 1 ns and 10 ns (CPU caches) through 100 ns (memory) out to 10 us, 100 us and 10 ms (IO subsystem).]
The “Cache out” Curve
[Chart: throughput vs. touched data size, stepping down as the working set outgrows the CPU caches, the TLB, NUMA-remote memory and finally spills to storage.]
Every time we drop out of a cache and use the next slower one down, we pay a big throughput penalty.
Sequential Versus Random Page CPU Cache Throughput
[Chart: service time + wait time against size of accessed memory (0-32,000 MB) for single page, sequential page and random page access; throughput falls away sharply once the accessed memory no longer fits in cache.]
“Transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
Spinning disk state of play:
Interfaces have evolved
Areal density has increased
Rotation speed has peaked at 15K RPM
Not much else . . .
Up until NAND flash, disk-based IO subsystems had not kept pace with CPU advancements.
With next generation storage (resistive RAM etc.) CPUs and storage may follow the same curve.
Moore's Law Vs. Advancements In Disk Technology
How Execution Plans Run
How do rows travel between iterators?
[Diagram: control flow passes down the plan tree while data flows back up, one row at a time between iterators.]
What Is Required
Query execution which leverages CPU caches.
Breakthrough levels of compression to bridge the performance gap between IO subsystems and modern processors.
Better query execution scalability as the degree of parallelism increases.
Optimizer Batch Mode
First introduced in SQL Server 2012, greatly enhanced in 2014. A batch is roughly 1,000 rows in size and is designed to fit into the L2/L3 cache of the CPU (remember the slide on latency). Moving batches around is very efficient*:
One test showed that a regular row-mode hash join consumed about 600 instructions per row, while the batch-mode hash join needed about 85 instructions per row, and in the best case (small, dense join domain) was as low as 16 instructions per row.
* From: Enhancements To SQL Server Column Stores, Microsoft Research
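Whether an operator actually ran in batch mode can be confirmed from the actual execution plan; a minimal check, assuming the AdventureWorksDW tables used throughout this deck:

```sql
-- Request the actual plan XML; each operator's RelOp element carries
-- EstimatedExecutionMode / ActualExecutionMode = "Row" or "Batch"
SET STATISTICS XML ON;

SELECT p.EnglishProductName, SUM(f.SalesAmount)
FROM dbo.FactInternetSales AS f
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName;

SET STATISTICS XML OFF;
```

Note that in SQL Server 2012/2014 batch mode is only considered when one of the tables has a column store index and the plan is parallel (DOP 2 or higher).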
Stack Walking The Database Engine
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperf -on base -stackwalk profile
xperf -d stackwalk.etl
xperfview stackwalk.etl
How do we squeeze an entire column store index into a CPU L2/3 cache?
Answer: it's pipelined into the CPU.
[Diagram: segments are loaded from storage into the blob cache; blobs are then broken into batches and pipelined into the CPU cache.]
Conceptual View . . . and what's happening in the call stack
What Difference Does Batch Mode Make?
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesFio] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
[Chart: elapsed time (s), row mode vs. batch mode.]
Row mode hash match aggregate: 445,585 ms*
Batch mode hash match aggregate: 78,400 ms*
A 12x improvement at DOP 2.
* Timings are a statistical estimate
Optimizing Serial Scan Performance
Compressing data going down the column is far superior to compressing data going across the row; also, we only retrieve the column data that is of interest. Run length encoding is used in order to achieve this.
Colour column: Red, Red, Blue, Blue, Green, Green, Green
Dictionary (lookup ID, label): (1, Red), (2, Blue), (3, Green)
Segment (lookup ID, run length): (1, 2), (2, 2), (3, 3)
SQL Server 2012 introduced column store compression; SQL Server 2014 adds more features to this.
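The two releases expose this storage differently, as the feature comparison later in the deck shows; a minimal sketch of both syntaxes (the index names are illustrative):

```sql
-- SQL Server 2012: nonclustered column store index (renders the table read-only)
CREATE NONCLUSTERED COLUMNSTORE INDEX ncsi_FactInternetSales
ON dbo.FactInternetSales (ProductKey, OrderQuantity, UnitPrice, SalesAmount);

-- SQL Server 2014: clustered column store index (updateable, becomes the table's storage)
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_FactInternetSales
ON dbo.FactInternetSales;
```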
SQL Server 2014 Column Store Storage Internals
[Diagram: rows are sliced into row groups; each column (A, B, C) within a row group is encoded and compressed into a segment, and segments are stored as blobs.]
Delta stores:
Inserts of 1,048,576 rows and over are encoded and compressed straight into column store segments.
Inserts of fewer than 1,048,576 rows land in a delta store (a B-tree); an update = an insert into the delta store + an entry in the deletion bitmap.
Column Store Index Split Personality
Compressed column store segments, with their global and local dictionaries and a deletion bitmap, on one side; B-tree delta stores on the other, with the tuple mover compressing full delta stores into the column store.
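This split between delta stores and compressed segments can be observed in SQL Server 2014 through a catalog view; a quick look, assuming the enlarged fact table built below:

```sql
-- One row per row group: COMPRESSED row groups are column store segments,
-- OPEN/CLOSED row groups are delta stores awaiting the tuple mover
SELECT row_group_id,
       state_description,   -- OPEN, CLOSED or COMPRESSED
       total_rows,
       deleted_rows         -- rows flagged in the deletion bitmap
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactInternetSalesBig');
```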
SELECT [ProductKey]
      ,[OrderDateKey]
      ,[DueDateKey]
      ,[ShipDateKey]
      ,[CustomerKey]
      ,[PromotionKey]
      ,[CurrencyKey]
      ..
INTO FactInternetSalesBig
FROM [dbo].[FactInternetSales]
CROSS JOIN master..spt_values AS a
CROSS JOIN master..spt_values AS b
WHERE a.type = 'p'
AND b.type = 'p'
AND a.number <= 80
AND b.number <= 100
What Levels Of Compression Are Achievable? Our ‘Big’ FactInternetSales Table
494,116,038 rows
[Chart: uncompressed size vs. size (MB) after column store compression, showing compression savings of 57%, 74%, 92% and 94% across the test cases.]
Does Column Sort Order Affect Compression?
SELECT a.number
INTO OrderedSequence
FROM master..spt_values AS a
CROSS JOIN master..spt_values AS b
CROSS JOIN master..spt_values AS c
WHERE c.number <= 57
ORDER BY a.number
1.5 billion rows, 39,233.86 MB uncompressed; 17.85 MB after column store compression.

SELECT a.number
INTO RandomSequence
FROM master..spt_values AS a
CROSS JOIN master..spt_values AS b
CROSS JOIN master..spt_values AS c
WHERE c.number <= 57
ORDER BY NEWID()
18.48 MB after column store compression.
SQL Server 2012 / 2014 Column Store Comparison

Feature                                                                          | 2012 | 2014
Column store indexes                                                             | Yes  | Yes
Clustered column store indexes                                                   | No   | Yes
Updateable column store indexes                                                  | No   | Yes
Column store archive compression                                                 | No   | Yes
Columns in a column store index can be dropped                                   | No   | Yes
Support for GUID, binary, datetimeoffset precision > 2, numeric precision > 18   | No   | Yes
Enhanced compression by storing short strings natively (instead of 32 bit IDs)   | No   | Yes
Bookmark support (row_group_id:tuple_id)                                         | No   | Yes
Mixed row / batch mode execution                                                 | No   | Yes
Optimized hash build and join in a single iterator                               | No   | Yes
Hash memory spills handled without reverting to row mode execution               | No   | Yes
Batch mode iterators: scan, filter, project, hash (inner) join, (local) hash aggregate | Yes | Yes
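Archive compression, new in 2014, is applied per index (or partition) at build or rebuild time; a minimal sketch, reusing the illustrative index name from earlier:

```sql
-- Squeeze cold data harder with archive compression...
ALTER INDEX ccsi_FactInternetSales ON dbo.FactInternetSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- ...and revert to standard column store compression when the data turns hot again
ALTER INDEX ccsi_FactInternetSales ON dbo.FactInternetSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE);
```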
Column Store Index and Batch Mode Test Drive
Disclaimer: your own mileage may vary depending on your data, hardware and queries.
Test Set Up
Hardware:
2 x 2.0 GHz 6 core Xeon CPUs, hyper-threading enabled
22 GB memory
Raid 0: 6 x 250 GB SATA III HD 10K RPM
Raid 0: 3 x 80 GB Fusion IO
Software:
Windows Server 2012
SQL Server 2014 CTP 2
AdventureWorksDW DimProduct table
Enlarged FactInternetSales table
Sequential Scan Performance
SELECT SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales]
[Chart: compression type vs. time (ms); scan rates of 2,050 MB/s, 678 MB/s and 256 MB/s at 85%, 98% and 98% CPU respectively.]
No compression: 545,761 ms*
Page compression: 1,340,097 ms*
* All stack trace timings are a statistical estimate
[Chart: elapsed time (ms) by column store compression type: hdd cstore, hdd cstore archive, flash cstore, flash cstore archive; 52 MB/s at 99% CPU and 27 MB/s at 56% CPU.]
Clustered column store index: 60,651 ms
Clustered column store index with archive compression: 61,196 ms
Testing Join Scalability
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
We will look at the best we can do without column store indexes:
A partitioned heap fact table with page compression for spinning disk
A partitioned heap fact table without any compression on our flash storage
Non-partitioned column store indexes on both types of store, with and without archive compression
[Chart: Join Scalability, degree of parallelism (2-24) vs. time (ms): HDD page compressed partitioned fact table vs. flash partitioned fact table.]
[Chart: Join Scalability, degree of parallelism (2-24) vs. time (ms): hdd column store, hdd column store archive, flash column store, flash column store archive.]
Diving Deeper into Batch Mode Scalability
Scaling Multi Threaded Workloads 101
A SQL Server workload should scale up to the limits of the hardware, such that:
All CPU capacity is exhausted, or
All storage IOPS bandwidth is exhausted
As concurrency increases, we need to watch out for “the usual suspects” that can throttle throughput back:
Lock Contention
Latch Contention
Spinlock Contention
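Latch and spinlock pressure of the kind charted next can be sampled from two DMVs; the usual approach is a before/after snapshot around a test run:

```sql
-- Latch waits accumulated since the instance started
SELECT latch_class, waiting_requests_count, wait_time_ms
FROM sys.dm_os_latch_stats
ORDER BY wait_time_ms DESC;

-- Spinlock activity: spins and collisions per spinlock type
SELECT name, collisions, spins, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```

Both views report instance-wide cumulative counters, so the difference between two snapshots isolates the contention generated by the test itself.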
[Chart: Average CPU Utilisation and Elapsed Time (ms) / Degree of Parallelism (2-24).]
[Chart: Average CPU Utilisation and Waiting Latch Request Count / Degree of Parallelism (2-24).]
[Chart: Average CPU Utilisation and Spinlock Spin Count (1000s) / Degree of Parallelism (2-24).]
Takeaways
CPU used for IO consumption + CPU used for decompression < total CPU capacity: compression works for you (what most people tend to have).
CPU used for IO consumption + CPU used for decompression > total CPU capacity: compression works against you.
CPU used for IO consumption + CPU used for decompression = total CPU capacity: nothing to be gained or lost from using compression.
Takeaways
Compression:
No significant difference in performance between column store compression and column store archive compression.
Pre-sorting the data makes little difference to compression ratios.
Batch mode:
Provides a tremendous performance boost with just two schedulers.
Does not provide linear scalability with the hardware available.
Does provide an order of magnitude increase in JOIN performance.
Performs marginally better with column store indexes which do not use archive compression.
Further Reading
“Enhancements To Column Store Indexes (SQL Server 2014)”, Microsoft Research
“SQL Server Clustered Columnstore Tuple Mover”, Remus Rusanu
“SQL Server Columnstore Indexes at TechEd 2013”, Remus Rusanu
“The Effect of CPU Caches and Memory Access Patterns”, Thomas Kejser
Thanks To My Reviewer and Contributor
Thomas Kejser
Former SQL CAT member and CTO of Livedrive
Contact Details
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Questions ?