TRANSCRIPT
Agenda
• Fast Track I/O fundamentals
• Putting the data on disk
• Getting the data off disk
• Actual system results
Takeaways
• Understand the interaction between SQL and the storage layer of the database stack for scan-based disk I/O optimization
• Articulate the core Fast Track disk I/O balancing approach for Data Warehousing
Some SQL Data Warehouses Today
• Big SAN
• Big 64-core server
• Connected together!
What's wrong with this picture?
System out of balance
• This server can consume 16 GB/sec of I/O, but the SAN can only deliver 2 GB/sec, even when the SAN is dedicated to the SQL Data Warehouse (which it often isn't)
• Lots of disks for random IOPS, but limited controllers mean limited I/O bandwidth
• The system is typically I/O bound and queries are slow, despite significant investment in both server and storage
The Alternative: A Balanced System
• Design a server + storage configuration that can deliver all the I/O bandwidth the CPUs can consume when executing a SQL relational DW workload
• Avoid sharing storage devices among servers
• Avoid overinvesting in disk drives: focus on scan performance, not IOPS
• Lay out and manage data to maximize range-scan performance and minimize fragmentation
Key Customer Challenges
Data explosion, increased complexity, quality & compliance:
1. Rapidly growing data
2. Increasing demand for real-time data
3. Support data from multiple sources
4. Increased complexity and inflexibility of EDWs
5. Ensure quality & integrity
6. Higher customer expectations
Goals: consolidate & reduce cost; increase usage & trust.
[Diagram: a scalable data warehouse & data marts, loaded by SQL Server Integration Services from custom OLTP sources, with the SQL Server DBMS serving reports and analysis]
What is SQL Server Fast Track Data Warehouse?
A new offering to help customers and partners accelerate their Data Warehouse deployments. It includes (1) seven new SMP DW Reference Architectures and (2) SI-provided solution templates for DW solutions.
Fast Track Data Warehouse provides:
• A method for designing a cost-effective, balanced system for Data Warehouse workloads
• Reference hardware configurations, developed in conjunction with hardware partners using this method
• Best practices for data layout, loading, and management
Fast Track Data Warehouse Components
Software:
• SQL Server 2008 Enterprise
• Windows Server 2008
Hardware:
• Tight specifications for servers, storage, and networking
• 'Per core' building block
Configuration guidelines:
• Physical table structures
• Indexes
• Compression
• SQL Server settings
• Windows Server settings
• Loading
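Compression is one of those configuration levers. As a minimal illustration (the table dbo.FactSales is hypothetical, not part of the reference architecture), page compression in SQL Server 2008 Enterprise trades spare CPU for reduced scan I/O:

-- Rebuild a fact table with page compression; fewer bytes come off disk
-- per scan, at the cost of CPU to decompress (dbo.FactSales is illustrative).
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);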
Published Reference Architectures: Balanced System Examples (HP / Dell / IBM, 8 to 48 cores)

Server | CPU | Cores | SAN | Data Drive Count | Initial Capacity* | Max Capacity**
HP ProLiant DL385 G6 | (2) AMD Opteron Istanbul six-core 2.6 GHz | 12 | (3) HP MSA2312 | (24) 300GB 15k SAS | 6TB | 12TB
HP ProLiant DL385 G6 | (2) AMD Opteron Istanbul six-core 2.6 GHz | 12 | (3) EMC AX4 | (24) 300GB 15k FC | 6TB | 12TB
HP ProLiant DL585 G6 | (4) AMD Opteron Istanbul six-core 2.6 GHz | 24 | (6) HP MSA2312 | (48) 300GB 15k SAS | 12TB | 24TB
HP ProLiant DL585 G6 | (4) AMD Opteron Istanbul six-core 2.6 GHz | 24 | (6) EMC AX4 | (48) 300GB 15k FC | 12TB | 24TB
HP ProLiant DL785 G6 | (8) AMD Opteron Istanbul six-core 2.8 GHz | 48 | (12) HP MSA2312 | (96) 300GB 15k SAS | 24TB | 48TB
HP ProLiant DL785 G6 | (8) AMD Opteron Istanbul six-core 2.8 GHz | 48 | (12) EMC AX4 | (96) 300GB 15k FC | 24TB | 48TB
Dell PowerEdge R710 | (2) Intel Xeon Nehalem quad-core 2.66 GHz | 8 | (2) EMC AX4 | (16) 300GB 15k FC | 4TB | 8TB
Dell PowerEdge R900 | (4) Intel Xeon Dunnington six-core 2.67 GHz | 24 | (6) EMC AX4 | (48) 300GB 15k FC | 12TB | 24TB
IBM x3650 M2 | (2) Intel Xeon Nehalem quad-core 2.67 GHz | 8 | (2) IBM DS3400 | (16) 300GB 15k FC | 4TB | 8TB
IBM x3850 M2 | (4) Intel Xeon Dunnington six-core 2.67 GHz | 24 | (6) IBM DS3400 | (48) 300GB 15k FC | 12TB | 24TB
IBM x3950 M2 | (8) Intel Xeon Dunnington four-core 2.13 GHz | 32 | (8) IBM DS3400 | (32) 300GB 15k SAS | 16TB | 32TB
Benefits of Fast Track Data Warehouse
• Appliance-like time to value: reduces DBA effort; fewer indexes; much higher level of sequential I/O
• Choice of HW platforms: HP, Dell, EMC, IBM, Bull, with more in the future
• Low TCO: commodity hardware and value pricing; lower storage costs
• High scale: new reference architectures scale up to 48 TB (assuming 2.5x compression)
• Reduced risk: tested by Microsoft; better choice of hardware; application of best practices
Scope of Fast Track Data Warehouse in a BI Solution
[Diagram: data path through a BI solution. Supporting systems (Integration Services ETL; data staging and bulk loading; SAN storage array) feed the BI data storage systems (data warehouse; subject area data marts; Analysis Services cubes), which feed the presentation layer systems (Reporting Services, PerformancePoint, Excel Services, SharePoint Services, Microsoft Office SharePoint, web analytic tools). The Reference Architecture scope (dashed) covers the data warehouse and its storage.]
Designing High Performance I/O: Fundamentals of Disk Drives and Caches
The Full Stack
[Diagram: the full stack, from SQL Server and Windows on the server through CPU, PCI bus, I/O controller / HBA, cabling, and array cache down to the spindles holding the DB]
Component | Balance for...
CPU | Maximum consumption rate of cached data for the targeted query mix
Controller (service processor) | Bandwidth needed to feed the CPU cores (based on the targeted query mix)
HBA | Aggregate bandwidth the array controllers will deliver
Switch | Aligned with HBA bandwidth and optimized for sequential I/O
Disks | Aggregate bandwidth of the array controllers / database capacity
Balanced Architecture Components
A hardware system in which every resource can be driven to its maximum without any other component in the system bottlenecking it.
[Diagram: numbered balance points spanning the network, NIC, server memory, HBA, SQL file layout, and disk subsystem]
Numbers to Remember: Spindles
Traditional spindle throughput (random 8K I/O):
• 10K RPM: 100-130 IOPS at 'full stroke'
• 15K RPM: 150-180 IOPS at 'full stroke'
• 2x or more is achievable when 'short stroking' the disks (using less than 20% of the physical spindle's capacity)
Aggregate throughput with sequential access:
• Between 90 MB/sec and 125 MB/sec for a single drive
• If truly sequential, any block size over 8K will deliver these numbers
• Depends on drive form factor; 3.5" drives are slightly faster than 2.5"
Approximate latency: 3-5 ms
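These per-spindle figures drive the balanced-system arithmetic. A minimal sketch, assuming 48 spindles and the conservative ~100 MB/sec sequential rate (both values are illustrative, not a published configuration):

-- Estimated aggregate sequential bandwidth for a drive set.
DECLARE @spindles int = 48;          -- illustrative drive count
DECLARE @mb_per_spindle int = 100;   -- conservative end of the 90-125 MB/sec range
SELECT @spindles * @mb_per_spindle AS est_sequential_mb_per_sec;  -- 4800 MB/sec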
Scaling of Spindle Count: Short vs. Full Stroke
• Each 8 disks expose a single 900GB LUN
• The test data set is fixed at 800GB
• A single 800GB file on a single LUN (8 disks), two 400GB test files across two LUNs, and so on
• IOPS per physical disk drop as more of each disk's capacity is used (longer seek times)
[Chart: short vs. full stroke impact, random 8K reads. Reads per disk rise with spindle count as each disk's share of the fixed 800GB data set shrinks: roughly 233, 270, 311, 327, and 336 I/Os per second per disk at 8, 16, 32, 48, and 64 disks respectively.]
I/O Weaving
Test setup:
• Two partitions on the same LUN
• Hit both partitions with a sequential 64K read pattern
• 7200rpm SATA drive
• Compare with random 64K
Stripe and Block Sizes
RAID10 stripe on 24 spindles:

Stripe Size | Transfer Size | Random | Sequential
128KB | 64KB | 1.2GB/sec | 2.1GB/sec
128KB | 512KB | 1.5GB/sec | 2.1GB/sec
256KB | 64KB | 1.0GB/sec | 2.1GB/sec
256KB | 512KB | 2.0GB/sec | 2.1GB/sec

Sequential throughput is gated by the controller at 2.1GB/sec, which works out to about 87MB/sec/spindle (2.1GB/sec / 24 spindles) against a theoretical 125MB/sec/spindle. Note the effect of stripe size and increased transfer size on random throughput: the best random case, 2.0GB/sec, is about 83MB/sec/spindle.
Path to the drives - SAN
[Diagram: the path to the drives. Server PCI buses → HBAs → switches → fabric → storage array (fibre channel ports, controllers/processors, cache).]
Best practice: make sure you have the tools to monitor the entire path to the drives, and understand the utilization of the individual components.
Data Warehouse Workload Characteristics
Scan Intensive
Hash Joins
Aggregations
SELECT L_RETURNFLAG, L_LINESTATUS,
       SUM(L_QUANTITY) AS SUM_QTY,
       SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS SUM_DISC_PRICE,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)) AS SUM_CHARGE,
       AVG(L_QUANTITY) AS AVG_QTY,
       AVG(L_EXTENDEDPRICE) AS AVG_PRICE,
       AVG(L_DISCOUNT) AS AVG_DISC,
       COUNT(*) AS COUNT_ORDER
FROM LINEITEM
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS
SQL File System
SQL Server databases are composed of one or more files, made up of blocks that in turn hold logical database pages, rows, and tables.
Hierarchy, physical to logical:
• NTFS: block (512 bytes)
• NTFS: allocation unit (64K)
• NTFS: file
• Database: row (variable)
• Database: page (8K)
• Database: extent (8 pages, 64K)
• Database: table
Database table/row ordering does not imply ordering of physical blocks on disk.
Data Fragmentation
Note: we also use the term "fragmentation" generically to describe poor sequential data layout.
SQL Server fragmentation types:
• File fragmentation (physical)
• Extent fragmentation (logical)
• Page fragmentation (logical)
• Non-clustered index fragmentation (logical)
[Diagram: fragmentation at each level. File fragmentation: files 1 and 2 interleaved on disk. Extent fragmentation: extents for tables 1 and 2 interleaved within file 1. Page fragmentation: pages 1:31 through 1:40 out of key order within their extents.]
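One way to measure the logical fragmentation types above is SQL Server's index physical-stats DMV. A minimal sketch (the database and table names are hypothetical):

-- Reports logical fragmentation and page density for each index level.
-- SAMPLED mode keeps the check affordable on DW-sized tables.
SELECT index_id, index_level,
       avg_fragmentation_in_percent,
       avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
       DB_ID('DB1'), OBJECT_ID('dbo.FactSales'), NULL, NULL, 'SAMPLED');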
Example: Page fragmentation
Bulk inserts into a clustered index with a moderate 'BATCHSIZE' parameter: each 'batch' is sorted independently, so overlapping batches lead to page splits.
Note: even without page splits, a clustered index implies no disk-level ordering of data.
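A hedged sketch of the load pattern being described (the table name, file path, and batch size are illustrative):

-- Each 100,000-row batch is sorted and committed independently; key ranges
-- that overlap between batches force page splits in the clustered index.
BULK INSERT dbo.FactSales
FROM 'C:\load\factsales.dat'
WITH (BATCHSIZE = 100000, TABLOCK);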
[Diagram: pages 1:31 through 1:40 laid out on disk out of the key order of the index after overlapping batches.]
Writing Sequential Data
Goals:
• Minimize fragmentation
• For ordered tables, match logical order to physical location as closely as we can
• Use filegroup locations to stripe data across multiple LUNs: the Fast Track data stripe
• Maximize extent allocation to increase the effective request size: the -E startup option is used, which allocates 64 extents (4MB) at a time
Fast Track Data Striping
Fast Track evenly spreads SQL data files across physical RAID-1 disk arrays.
[Diagram: a Fast Track storage enclosure. Four RAID-1 arrays each expose two data LUNs (ARY01D1v01 through ARY04D2v08) holding data files DB1-1.ndf through DB1-8.ndf; a fifth array (ARY05v09) holds the log, DB1.ldf. Each RAID-1 pair is disks 1 & 2 of its array.]
SQL Extent Allocation
• With -E, SQL Server allocates 64 extents at a time as files grow, so the "extent allocation" becomes 4MB
• Data is written round-robin across all file locations in the filegroup
• Each file receives data in 4MB increments
• Each RAID-1 group has two locations (LUNs), resulting in an 8MB stripe of data across all arrays in the filegroup
[Diagram: 4MB allocations landing round-robin on each of the eight data LUNs (ARY01D1v01 through ARY04D2v08) holding DB1-1.ndf through DB1-8.ndf; the log LUN (ARY05v09, DB1.ldf) is unaffected.]
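A minimal pre-allocation sketch of this layout, assuming illustrative drive letters, file names, and sizes (a real Fast Track configuration uses one equally sized file per data LUN, eight in this enclosure's case):

-- Fixed-size, equally sized files let -E's 4MB allocations round-robin
-- evenly across every LUN in the filegroup; space comes from
-- pre-allocation rather than Autogrow.
CREATE DATABASE DB1
ON PRIMARY
    ( NAME = DB1_prim, FILENAME = 'C:\DB1\DB1.mdf',   SIZE = 100MB ),
FILEGROUP FG1
    ( NAME = DB1_1,    FILENAME = 'D:\v01\DB1-1.ndf', SIZE = 250GB, FILEGROWTH = 0 ),
    ( NAME = DB1_2,    FILENAME = 'E:\v02\DB1-2.ndf', SIZE = 250GB, FILEGROWTH = 0 )
    -- ...continue with one file per remaining LUN (v03 through v08)...
LOG ON
    ( NAME = DB1_log,  FILENAME = 'L:\v09\DB1.ldf',   SIZE = 50GB );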
Autogrow and the Fast Track RA
Best practice: pre-allocate all databases and do not use Autogrow.
This is not always practical. When Autogrow is used, trace flag 1117 (TF1117) should be enabled; it enforces even file growth and maintains the 4MB extent allocation (stripe width).
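For example, enabled globally (it can also be added as a -T1117 startup parameter):

-- TF1117 grows all files in a filegroup together when Autogrow fires,
-- preserving the even stripe and 4MB extent allocation described above.
DBCC TRACEON (1117, -1);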
Writing Sequential Data
Fast Track 'sequential' writes:
• Load processes are managed to minimize page and extent fragmentation
• Bulk-load parallelism is managed
• Volatile and non-volatile data are segregated if necessary
• Primary user databases are pre-allocated
Achieving Sequential Scan
Fast Track goals:
• Maintain a sequential data layout
• Create an affinity between LUNs and physical CPU cores; use SQL filegroups to stripe data across these LUNs
• Increase the effective request size beyond the 512k limit by leveraging SQL and the storage hardware: SQL read-ahead, SAN hardware pre-fetch, and RAID-1 mirroring
Storage Pre-Fetch
[Diagram: a disk scan issuing a 512k read over extents 1..16 of table 'X'; an approximately 2MB hardware pre-fetch reaches ahead into extents 17..64 and beyond (65..128, 129...). Pages for table 'Y' lie elsewhere on the disk.]
Pre-fetch extends individual SQL read requests significantly (~2MB, hardware dependent) by reading additional, adjacent data each time the disk head touches the platter. Assuming contiguous extents, pre-fetch puts data in cache that will subsequently be requested. This reduces disk head movement and maximizes SQL read-ahead.
Storage Pre-Fetch & Fragmentation
Example: extent fragmentation. The effective read size is now restricted to the number of contiguous extents available within the pre-fetch range. This significantly reduces the value of pre-fetch to sequential I/O by increasing disk head movement.
[Diagram: the same 512k read and ~2MB pre-fetch, but the extents of table 'X' (17..64, 65..70, 70..89, 90..99) are interleaved on disk with pages for table 'Y'.]
SQL Server Read-Ahead
SQL R-A (read-ahead) goals for Fast Track:
I. Queue pending read requests at the storage enclosure level.
II. Maintain a queue depth that keeps the storage system busy.
• R-A plus storage pre-fetch maintains read efficiency for large scans and under concurrency (coalescing reads)
• R-A and Fast Track data striping deliver multiple drives' bandwidth to a single read thread
• A single read thread leveraging R-A may read up to 40MB; since the FT data stripe is 8MB per RAID-1 array, one thread can touch up to 5 arrays, or 10 LUNs
SQL Read-Ahead (per SQL thread)
[Diagram: per-thread read-ahead. A 512k read into extents 1..64 of table 'X', an approximately 2MB hardware pre-fetch, and an 8MB read-ahead window reaching across extents 65..129.]
• 8MB read-ahead per thread: R-A can request up to 40MB per thread and queues read requests to the storage hardware
• Better chance of coalescing contiguous reads; each cache hit minimizes disk head movement and increases the average scan rate
• Keep the stack busy: maintain enough read queue depth to ensure reads can proceed while SQL Server is processing data
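A quick way to watch read-ahead at work on a scan of your own (dbo.FactSales is a hypothetical table):

-- SET STATISTICS IO reports 'read-ahead reads' next to logical and physical
-- reads; on a healthy sequential layout, most pages of a large scan should
-- arrive via read-ahead.
SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.FactSales;
SET STATISTICS IO OFF;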
Sequential Scan Components
• RAID-1 + data striping + pre-fetch + SQL read-ahead work together to create efficient 'sequential' I/O
• The data stripe width is balanced against the read-ahead "depth"
• This combination can provide up to an ~2MB 'effective' read size despite SQL Server's native 512k limit
• Each element is necessary, or efficiency falls off
[Diagram: 512k reads, ~2MB pre-fetch, and 8MB per-thread read-ahead laid over the Fast Track data stripe: 4MB allocations round-robin across data files DB1-1.ndf through DB1-8.ndf on eight LUNs over four RAID-1 arrays.]
POC Hardware
• HP DL785: 8 sockets, 48 cores
• 12 HP MSA2312 storage arrays
• Maximum throughput: 8GB/s of user data
• Two primary databases, each with a 24-LUN filegroup
The next example was executed on only 'half' of this Fast Track configured system: each database can sustain this workload benchmark at ~3.5GB/s when run concurrently, for a net of ~8GB/s of disk I/O throughput.
Fast Track I/O Performance: Recent POC
• Billion-row heap, no indexes, no partitioning
• 168GB of user data
• Filegroup: 24 LUNs, 24 files
Query:
SELECT DATEPART(mm, update_dt),
       SUM(POS_CURR_REQ_AMT),
       COUNT(*),
       AVG(BASE_CURR_REQ_AMT)
FROM tf_1
WHERE time_dim_ky BETWEEN 234 AND 149999965
GROUP BY DATEPART(mm, update_dt)
Benchmark Results
[Chart: disk throughput of 2000 MB/s at 15% CPU utilization (DOP 6) versus 3800 MB/s at 25% CPU utilization (DOP 16), against the customer's 100-second SLA.]
Summary:
• 1G rows (168GB of user data) scanned in under 1 minute at DOP 16
• Maximum disk I/O throughput at DOP 16: 3.8GB/s
• System CPU utilization at DOP 16: 25%
Customer workload SLA: < 100 seconds execution time
• SLA achieved at DOP 6
• Disk I/O throughput at DOP 6: 2GB/s
• CPU utilization at DOP 6: 15%
Takeaway: Managing Workloads
DOP settings for resource management (see the sketch after this list):
• For I/O-bound queries, a DOP somewhat below the total LUN count will achieve maximum I/O bandwidth
• For CPU-bound queries, a DOP at or above the total LUN count will achieve maximum I/O bandwidth
• The principle is holistic and can be applied based on the nature of the overall workload (CPU-bound or I/O-bound)
Relative predictability for an SMP database:
• Fast Track CPU to I/O balancing
• Data striping across balanced LUNs
• Scan-based workload
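A hedged sketch of that guidance, using the earlier POC query shape (the MAXDOP value is illustrative for an I/O-bound query on a 24-LUN system):

-- Capping DOP below the LUN count held the POC's I/O-bound scan at the
-- 2GB/s needed for the SLA while using only 15% CPU.
SELECT DATEPART(mm, update_dt), COUNT(*)
FROM tf_1
GROUP BY DATEPART(mm, update_dt)
OPTION (MAXDOP 6);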