
Page 1: Storage Performance 2013

#SQLSatRiyadh

Storage Performance 2013
Joe Chang

www.qdpma.com

Page 2: Storage Performance 2013

About Joe

• SQL Server consultant since 1999
• Query Optimizer execution plan cost formulas (2002)
• True cost structure of SQL plan operations (2003?)
• Database with distribution statistics only, no data (2004)
• Decoding statblob/stats_stream
  – writing your own statistics
• Disk IO cost structure
• Tools for system monitoring, execution plan analysis – see ExecStats on www.qdpma.com

Page 3: Storage Performance 2013

Storage Performance Chain

• All elements must be correct
  – No weak links
• Perfect on 6 out of 7 elements and 1 not correct = bad IO performance

[Diagram: storage performance chain – SQL Server Engine, SQL Server Extent, SQL Server File, Pool, Direct Attach/SAN (SAS/FC), RAID Group, SAS, HDD/SSD.]

Page 4: Storage Performance 2013

Storage Performance Overview

• System Architecture
  – PCI-E, SAS, HBA/RAID controllers
• SSD, NAND, Flash Controllers, Standards
  – Form Factors, Endurance, ONFI, Interfaces
• SLC, MLC Performance
• Storage system architecture
  – Direct-attach, SAN
• Database
  – SQL Server Files, FileGroup

Page 5: Storage Performance 2013

Sandy Bridge EN & EP

Xeon E5-2600, Socket R (2011 pins): 2 QPI, 4 DDR3 channels, 40 PCI-E 3.0 lanes at 8GT/s, DMI2
Model, cores, clock, LLC, QPI, (Turbo):
E5-2690  8 core  2.9GHz  20M  8.0GT/s  (3.8)*
E5-2680  8 core  2.7GHz  20M  8.0GT/s  (3.5)
E5-2670  8 core  2.6GHz  20M  8.0GT/s  (3.3)
E5-2667  6 core  2.9GHz  15M  8.0GT/s  (3.5)*
E5-2665  8 core  2.4GHz  20M  8.0GT/s  (3.1)
E5-2660  8 core  2.2GHz  20M  8.0GT/s  (3.0)
E5-2650  8 core  2.0GHz  20M  8.0GT/s  (2.8)
E5-2643  4 core  3.3GHz  10M  8.0GT/s  (3.5)*
E5-2640  6 core  2.5GHz  15M  7.2GT/s  (3.0)

Xeon E5-2400, Socket B2 (1356 pins): 1 QPI at 8GT/s, 3 DDR3 memory channels, 24 PCI-E 3.0 lanes at 8GT/s, DMI2 (x4 @ 5GT/s)

E5-2470  8 core   2.3GHz  20M  8.0GT/s  (3.1)
E5-2440  6 core   2.4GHz  15M  7.2GT/s  (2.9)
E5-2407  4c - 4t  2.2GHz  10M  6.4GT/s  (n/a)

[Block diagrams: two-socket Xeon E5-2600 (EP) and E5-2400 (EN) systems – each socket with cores C0-C7 sharing LLC, memory interface (MI), QPI links between sockets, PCI-E x8 ports, and DMI2/x4 to the PCH.]

Example slot configurations:
Dell T620: 4 x16, 2 x8, 1 x4
Dell R720: 1 x16, 6 x8
HP DL380 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9DRX+F: 10 x8, 1 x4 gen 2

80 PCI-E gen 3 lanes + 8 gen 2 possible
Disable cores in BIOS/UEFI?

Page 6: Storage Performance 2013

Xeon E5-4600, Socket R (2011 pins): 2 QPI, 4 DDR3 channels, 40 PCI-E 3.0 lanes at 8GT/s, DMI2
Model, cores, clock, LLC, QPI, (Turbo):
E5-4650  8 core   2.70GHz  20M  8.0GT/s  (3.3)*
E5-4640  8 core   2.40GHz  20M  8.0GT/s  (2.8)
E5-4620  8 core   2.20GHz  16M  7.2GT/s  (2.6)
E5-4617  6c - 6t  2.90GHz  15M  7.2GT/s  (3.4)
E5-4610  6 core   2.40GHz  15M  7.2GT/s  (2.9)
E5-4607  6 core   2.20GHz  12M  6.4GT/s  (n/a)
E5-4603  4 core   2.00GHz  10M  6.4GT/s  (n/a)

The high-frequency 6-core gives up HT; there is no high-frequency 4-core.

160 PCI-E gen 3 lanes + 16 gen 2 possible

Dell R820: 2 x16, 4 x8, 1 internal
HP DL560 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9QR: 7 x16, 1 x8

[Block diagram: four-socket Xeon E5-4600 system – each socket with cores C0-C7 sharing LLC, memory interfaces, QPI links to the other sockets, PCI-E ports, and DMI2 to the PCH.]

Page 7: Storage Performance 2013

PCI-E, SAS & RAID CONTROLLERS

Page 8: Storage Performance 2013

PCI-E gen 1, 2 & 3

Gen     Raw bit rate  Unencoded  BW per direction  BW x8 per direction  Net bandwidth x8
PCIe 1  2.5GT/s       2Gbps      ~250MB/s          2GB/s                1.6GB/s
PCIe 2  5.0GT/s       4Gbps      ~500MB/s          4GB/s                3.2GB/s
PCIe 3  8.0GT/s       8Gbps      ~1GB/s            8GB/s                6.4GB/s?

• PCIe 1.0 & 2.0 encoding scheme: 8b/10b
• PCIe 3.0 encoding scheme: 128b/130b
• Simultaneous bi-directional transfer
• Protocol overhead
  – Sequence/CRC, header – 22 bytes (20%?)

Adaptec Series 7: 6.6GB/s, 450K IOPS
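A quick back-of-the-envelope check of the bandwidth table above, as a minimal Python sketch (the ~20% protocol overhead is the slide's estimate, not a spec value):

# Approximate PCI-E net payload bandwidth per direction for an x8 link.
def pcie_net_bw_gb_s(gt_per_s, encoding, lanes=8, protocol_overhead=0.20):
    efficiency = {'8b/10b': 8 / 10, '128b/130b': 128 / 130}[encoding]
    raw_gb_s = gt_per_s * efficiency * lanes / 8   # usable bytes per second
    return raw_gb_s * (1 - protocol_overhead)

print(pcie_net_bw_gb_s(2.5, '8b/10b'))      # gen 1 x8 -> ~1.6 GB/s
print(pcie_net_bw_gb_s(5.0, '8b/10b'))      # gen 2 x8 -> ~3.2 GB/s
print(pcie_net_bw_gb_s(8.0, '128b/130b'))   # gen 3 x8 -> ~6.3 GB/s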

Page 9: Storage Performance 2013

PCI-E Packet

Net realizable bandwidth appears to be 20% less (1.6GB/s of 2.0GB/s)

Page 10: Storage Performance 2013

PCIe Gen2 & SAS/SATA 6Gbps

• SATA 6Gbps – single lane, net BW 560MB/s
• SAS 6Gbps, x4 lanes, net BW 2.2GB/s
  – Dual-port, SAS protocol only
    • Not supported by SATA

[Diagram: PCIe gen 2 x8 HBA (3.2GB/s) with two SAS x4 6G ports (2.2GB/s each).]

Some bandwidth mismatch is OK, especially on downstream side


Page 11: Storage Performance 2013

PCIe 3 & SAS

• 12Gbps – coming soon? Slowly?
  – Infrastructure will take more time

PCIe 3.0 x8 HBA: 2 SAS x4 12Gbps ports, or 4 SAS x4 6Gbps ports if the HBA can support 6GB/s.

[Diagrams: PCIe gen 3 x8 HBA with four SAS x4 6G ports; PCIe gen 3 x8 HBA with two SAS x4 12G ports, each fanning out through a SAS expander to SAS x4 6Gb connections.]

Page 12: Storage Performance 2013

PCIe Gen3 & SAS 6Gbps

Page 13: Storage Performance 2013

LSI 12Gbps SAS 3008

Page 14: Storage Performance 2013

PCIe RAID Controllers?

• 2 x4 SAS 6Gbps ports (2.2GB/s per x4 port)
  – 1st generation PCIe 2 – 2.8GB/s?
  – Adaptec: PCIe g3 can do 4GB/s
  – 3 x4 SAS 6Gbps ports would match PCIe 3.0 x8 bandwidth
• 6 x4 SAS 6Gbps – Adaptec Series 7, PMC
  – 1 chip: x8 PCIe g3 and 24 SAS 6Gbps lanes
• Because they could

[Diagram: PCIe gen 3 x8 RAID controller with six SAS x4 6G ports.]

Page 15: Storage Performance 2013

SSD, NAND, FLASH CONTROLLERS

Page 16: Storage Performance 2013

SSD Evolution

• HDD replacement – using existing HDD infrastructure
  – PCI-E card form factor lacks expansion flexibility
• Storage system designed around SSD
  – PCI-E interface with HDD-like form factor?
  – Storage enclosure designed for SSD
• Rethink computer system memory & storage
• Re-do the software stack too!

Page 17: Storage Performance 2013

SFF-8639 & Express Bay

SCSI Express – storage over PCI-E; NVMe

Page 18: Storage Performance 2013

New Form Factors - NGFF

Enterprise 10K/15K HDD: 15mm
An SSD storage enclosure could be 1U with 75 x 5mm devices?

Page 19: Storage Performance 2013

SATA Express Card (NGFF)

mSATA

M2

Crucial

Page 20: Storage Performance 2013

SSD – NAND Flash

• NAND
  – SLC, MLC regular and high-endurance
  – eMLC could mean endurance or embedded – these differ
• Controller interfaces NAND to SATA or PCI-E
• Form Factor
  – SATA/SAS interface in 2.5in HDD or new form factor
  – PCI-E interface and form factor, or HDD-like form factor
  – Complete SSD storage system

Page 21: Storage Performance 2013

NAND Endurance

Intel – High Endurance Technology MLC

Page 22: Storage Performance 2013

NAND Endurance – Write Performance

[Chart: endurance vs. write performance – SLC highest, then MLC-e, then MLC.]

Cost structure: MLC = 1, MLC EE = 1.3, SLC = 3
Process dependent: 34nm, 25nm, 20nm
Write perf?

Page 23: Storage Performance 2013

NAND P/E - Micron

34 or 25nm MLC NAND is probably good; a database can support the cost structure.

Page 24: Storage Performance 2013

NAND P/E - IBM

34 or 25nm MLC NAND is probably good; a database can support the cost structure.

Page 25: Storage Performance 2013

Write Endurance

Vendors commonly cite a single spec for a range of models (120, 240, 480GB).
Should it vary with raw capacity? Does it depend on over-provisioning?

3 year life is OK for MLC cost structure, maybe even 2 year

MLC 20TB lifetime = 10GB/day for 2000 days (5 years+), or 20GB/day for 3 years.
Vendors now cite 72TB write endurance for 120-480GB capacities?
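The arithmetic behind these figures, as a small Python sketch (the numbers are the slide's, not vendor specs):

# Days (and years) of life for a given write-endurance budget and daily write volume.
def life_days(endurance_tb, writes_gb_per_day):
    return endurance_tb * 1000 / writes_gb_per_day

print(life_days(20, 10), life_days(20, 10) / 365)   # 20TB at 10GB/day -> 2000 days, ~5.5 years
print(life_days(20, 20) / 365)                      # 20TB at 20GB/day -> ~2.7 years
print(life_days(72, 20) / 365)                      # 72TB cited spec at 20GB/day -> ~9.9 years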

Page 26: Storage Performance 2013

NAND

• SLC – fast writes, high endurance
• eMLC – slow writes, medium endurance
• MLC – medium writes, low endurance

• MLC cost structure of $1/GB @ 25nm
  – eMLC 1.4X, SLC 2X?

Page 27: Storage Performance 2013

ONFI

Open NAND Flash Interface organization
• 1.0 2006 – 50MB/s
• 2.0 2008 – 133MB/s
• 2.1 2009 – 166 & 200MB/s
• 3.0 2011 – 400MB/s
  – Micron has 200 & 333MHz products

ONFI 1.0 – 6 channels to support 3Gbps SATA (260MB/s)
ONFI 2.0 – 4+ channels to support 6Gbps SATA (560MB/s)
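The channel-count matching can be worked out directly; a minimal Python sketch using the per-generation channel rates above:

import math

# NAND channels needed to keep a host interface busy.
def channels_needed(interface_mb_s, channel_mb_s):
    return math.ceil(interface_mb_s / channel_mb_s)

print(channels_needed(260, 50))    # SATA 3Gbps (~260MB/s) with ONFI 1.0 -> 6 channels
print(channels_needed(560, 133))   # SATA 6Gbps (~560MB/s) with ONFI 2.0 -> 5 channels
print(channels_needed(2200, 400))  # x4 SAS 6Gbps (~2.2GB/s) with ONFI 3.0 -> 6 channels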

Page 28: Storage Performance 2013

NAND write performance

MLC: 85MB/s per 4-die channel (128GB); 340MB/s over 4 channels (512GB)?

Page 29: Storage Performance 2013

Controller Interface PCIe vs. SATA

[Diagram: SSD controller with eight NAND channels.]

Some bandwidth mismatch/overkill is OK.
ONFI 2 – 8 channels at 133MHz to SATA 6Gbps (560MB/s) is a good match.

PCIe or SATA? Multiple lanes?
CPU access efficiency and scaling: Intel & NVM Express.

6-8 channels at 400MB/s to match 2.2GB/s x4 SAS?
16+ channels at 400MB/s to match 6.4GB/s x8 PCIe 3.

But ONFI 3.0 is overwhelming SATA 6Gbps?

Page 30: Storage Performance 2013

Controller Interface PCIe vs. SATA

[Diagrams: SATA SSD controller with DRAM and 8 NAND channels; PCIe SSD controller with DRAM and a larger NAND channel count.]

PCIe NAND Controller Vendors
Vendor     Channels  PCIe Gen
IDT        32        x8 Gen3, NVMe
Micron     32        x8 Gen2
Fusion-IO  3x4?      x8 Gen2?

Page 31: Storage Performance 2013

SATA & PCI-E SSD Capacities

64 Gbit MLC NAND die: 150mm², 25nm
2 x 32 Gbit 34nm; 1 x 64 Gbit 25nm; 1 x 64 Gbit 29nm
1 x 64 Gbit die
8 x 64 Gbit die in 1 package = 64GB

SATA Controller – 8 channels, 8 package x 64GB = 512GB

PCI-E Controller – 32 channels x 64GB = 2TB
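The capacity arithmetic, as a minimal Python sketch:

die_gb = 64 / 8                  # one 64 Gbit die = 8 GB
package_gb = 8 * die_gb          # 8 x 64 Gbit dies per package = 64 GB
print(8 * package_gb)            # SATA controller, 8 channels x 1 package -> 512 GB
print(32 * package_gb / 1024)    # PCI-E controller, 32 channels -> 2 TB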

Page 32: Storage Performance 2013

PCI-E vs. SATA/SAS

• SATA/SAS controllers have 8 NAND channels
  – No economic benefit in fewer channels?
  – 8 channels are a good match for 50MB/s NAND to SATA 3G
    • 3Gbps – approx 280MB/s realizable BW
  – 8 channels are also a good match for 100MB/s NAND to SATA 6G
    • 6Gbps – 560MB/s realizable BW
  – NAND is now at 200 & 333MB/s
• PCI-E – 32 channels practical – 1500 pins
  – 333MHz is a good match to gen 3 x8 – 6.4GB/s BW

Page 33: Storage Performance 2013

Crucial/Micron P400m & P400e

Crucial P400m             100GB     200GB     400GB
Raw                       168GB     336GB     672GB
Seq Read (up to)          380MB/s   380MB/s   380MB/s
Seq Write (up to)         200MB/s   310MB/s   310MB/s
Random Read               52K       54K       60K
Random Write              21K       26K       26K
Endurance (2M-hr MTBF)    1.75PB    3.0PB     7.0PB
Price                     $300?     $600?     $1000?

Crucial P400e             100GB     200GB     400GB
Raw                       128GB     256GB     512GB
Seq Read (up to)          350MB/s   350MB/s   350MB/s
Seq Write (up to)         140MB/s   140MB/s   140MB/s
Random Read               50K       50K       50K
Random Write              7.5K      7.5K      7.5K
Endurance (1.2M-hr MTBF)  175TB     175TB     175TB
Price                     $176      $334      $631

P410m SAS specs are slightly different.
EE MLC: higher endurance; write perf not lower than MLC?
Preliminary – need to update

Page 34: Storage Performance 2013

Crucial m4 & m500

Crucial m4                128GB     256GB     512GB
Raw                       128GB     256GB     512GB
Seq Read (up to)          415MB/s   415MB/s   415MB/s
Seq Write (up to)         175MB/s   260MB/s   260MB/s
Random Read               40K       40K       40K
Random Write              35K       50K       50K
Endurance                 72TB      72TB      72TB
Price                     $112      $212      $400

Crucial m500              120GB     240GB     480GB     960GB
Raw                       128GB     256GB     512GB     1024GB
Seq Read (up to)          500MB/s   500MB/s   500MB/s
Seq Write (up to)         130MB/s   250MB/s   400MB/s
Random Read               62K       72K       80K
Random Write              35K       60K       80K
Endurance (1.2M-hr MTBF)  72TB      72TB      72TB
Price                     $130      $220      $400      $600

Preliminary – need to update

Page 35: Storage Performance 2013

Micron & Intel SSD Pricing (2013-02)

[Chart: SSD price ($0-$1,000) vs. capacity class (100/128, 200/256, 400/512GB) for m400, P400e, P400m, and S3700.]

P400m raw capacities are 168, 336 and 672GB (pricing retracted).
Intel SSD DC S3700 pricing: $235, 470, 940 and 1880 (800GB) respectively.

Need corrected P400m pricing

Page 36: Storage Performance 2013

4K Write K IOPS

[Chart: 4K write IOPS (0-60K) vs. capacity class (100/128, 200/256, 400/512GB) for m400, P400e, P400m, and S3700.]

P400m raw capacities are 168, 336 and 672GB (pricing retracted).
Intel SSD DC S3700 pricing: $235, 470, 940 and 1880 (800GB) respectively.

Need corrected P400m pricing

Page 37: Storage Performance 2013

SSD Summary

• MLC is possible with a careful write strategy
  – Partitioning to minimize index rebuilds
  – Avoid full database restore to SSD
• Endurance (HET) MLC – write perf?
  – Standard DB practices work
  – But avoid frequent index defrags?
• SLC – only for extreme write-intensive workloads?
  – Lower volume product – higher cost

Page 38: Storage Performance 2013

DIRECT ATTACH STORAGE

Page 39: Storage Performance 2013

Full IO Bandwidth

Misc devices on 2 x4 PCIe g2, Internal boot disks, 1GbE or 10GbE, graphics


• 10 PCIe g3 x8 slots possible – Supermicro only
  – HP and Dell systems have 5-7 x8+ slots + 1 x4?
• 4GB/s per slot with 2 x4 SAS; 6GB/s with 4 x4
• Mixed SSD + HDD – reduce wear on MLC

[Diagram: two-socket system (192GB memory per socket) with multiple PCIe x8 slots, each holding a RAID controller attached to mixed SSD + HDD, plus x4 slots for 10GbE/InfiniBand and miscellaneous devices.]

Page 40: Storage Performance 2013

System Storage Strategy

Dell & HP only have 5-7 slots. 4 controllers at 4GB/s each is probably good enough?
Few practical products can use PCIe G3 x16 slots.

• Capable of 16GB/s with initial capacity
  – 4 HBAs at 4-6GB/s each
• With allowance for capacity growth
  – And mixed SSD + HDD

[Diagram: two-socket system (192GB per socket), four RAID controllers in PCIe x8 slots attached to SSD + HDD, plus 10GbE and InfiniBand.]

Page 41: Storage Performance 2013

Clustered SAS Storage

Dell MD3220 supports clustering: up to 4 nodes without an external switch (extra nodes not shown).

[Diagram: two server nodes (2 sockets, 192GB each, 4 HBAs each) attached to four MD3220 enclosures; each enclosure has dual controllers, each with 4 SAS host ports, an IOC, a PCIe switch, 2GB cache, and a SAS expander, serving SSDs and HDDs.]

Page 42: Storage Performance 2013

Alternate SSD/HDD Strategy


• Primary system
  – All SSD for data & temp; logs may be on HDD
• Secondary system
  – HDD for backup and restore testing

[Diagram: primary system with HBAs to all-SSD volumes plus HDD; backup system with RAID controllers to HDD arrays; 10GbE and InfiniBand links between them.]

Page 43: Storage Performance 2013

System Storage: Mixed SSD + HDD

Each RAID group/volume should not exceed the 2GB/s bandwidth of x4 SAS; 2-4 volumes per x8 PCIe G3 slot.
SATA SSD: read 350-500MB/s, write 140MB/s+. 8 per volume allows for some overkill; 16 SSDs per RAID controller.
64 SATA/SAS SSDs to deliver 16-24GB/s (worked out in the sketch below).
The 4-HDD-per-volume rule does not apply.
HDD for local database backup, restore tests, and DW flat files.
SSD & HDD on a shared channel – simultaneous bi-directional IO.
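The sizing arithmetic behind the 64-SSD figure, as a Python sketch (per-device rates are the slide's SATA SSD numbers):

ssds_per_volume = 8          # vs. the 2.2GB/s x4 SAS port limit - some overkill
volumes_per_controller = 2   # one volume per x4 port
controllers = 4

total_ssds = ssds_per_volume * volumes_per_controller * controllers
print(total_ssds)                                  # 64
print(total_ssds * 0.350, total_ssds * 0.500)      # raw SSD read: ~22-32 GB/s
print(controllers * 4, controllers * 6)            # 4 controllers at 4-6 GB/s each -> 16-24 GB/s delivered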

[Diagram: four RAID controllers/HBAs in x8 slots (plus x4 InfiniBand and 10GbE), each controller attached to RAID groups of 16 SSDs and to HDD groups.]

Page 44: Storage Performance 2013

SSD/HDD System Strategy

• MLC is possible with a careful write strategy
  – Partitioning to minimize index rebuilds
  – Avoid full database restore to SSD
• Hybrid SSD + HDD system, full-duplex signalling
• Endurance (HET) MLC – write perf?
  – Standard DB practices work; avoid index defrags
• SLC – only for extreme write-intensive workloads?
  – Lower volume product – higher cost
• HDD – for restore testing

Page 45: Storage Performance 2013

SAS Expander

Disk enclosure (expansion ports not shown):
2 x4 to hosts
1 x4 for expansion
24 x1 for disks

Page 46: Storage Performance 2013

Storage Infrastructure – designed for HDD (15mm)

• 2 SAS expanders for dual-port support
  – 1 x4 upstream (to host), 1 x4 downstream (expansion)
  – 24 x1 for bays

2U

Page 47: Storage Performance 2013

Mixed HDD + SSD Enclosure (15mm)

• Current: 24 x 15mm = 360mm + spacing
• Proposed: 16 x 15mm = 240mm + 16 x 7mm = 120

2U

Page 48: Storage Performance 2013

Enclosure 24x15mm and proposed

Current 2U enclosure: 24 x 15mm bays – HDD or SSD
2 SAS expanders – 32 lanes each
4 lanes upstream to host
4 lanes downstream for expansion
24 lanes for bays

[Diagrams: host (PCIe x8 HBA, 384GB memory) connected by SAS x4 6Gbps (2.2GB/s) to the current enclosure's SAS expanders, and by SAS x4 12Gbps (4GB/s) to the proposed enclosure.]

New SAS 12Gbps: 16 x 15mm + 16 x 7mm bays
2 SAS expanders – 40 lanes each
4 lanes upstream to host
4 lanes downstream for expansion
32 lanes for bays

2 RAID groups for SSD, 2 for HDD
1 SSD volume on path A, 1 SSD volume on path B


Page 50: Storage Performance 2013

Alternative Expansion

Each SAS expander: 40 lanes – 8 lanes upstream to host with no expansion, or 4 lanes upstream and 4 lanes downstream for expansion; 32 lanes for bays.

[Diagram: host with a PCIe x8 HBA and four SAS x4 ports, each port going directly to the SAS expander of a separate enclosure (Enclosures 1-4).]

Page 51: Storage Performance 2013

PCI-E with Expansion

[Diagrams: host (PCIe x8 HBA, 384GB) connected by SAS x4 6Gbps (2.2GB/s) to a SAS expander enclosure; host with a PCI-E switch fanning out x8 ports to PCI-E SSDs.]

• PCI-E slot SSD suitable for known capacity
• 48 & 64 lane PCI-E switches available
  – x8 or x4 ports
• Express bay form factor?
• Few x8 ports, or many x4 ports?

Page 52: Storage Performance 2013

Enclosure for SSD (+ HDD?)

• 2 x4 on each expander upstream – 4GB/s
  – No downstream ports for expansion?
• 32 ports for device bays
  – 16 SSD (7mm) + 16 HDD (15mm)
• 40 lanes total with no expansion
  – 48 lanes with expansion

Page 53: Storage Performance 2013

Large SSD Array

• Large number of devices, large capacity
  – Downstream from the CPU has excess bandwidth
• Do not need SSD firmware peak performance
  – 1) no stoppages, 2) consistency is nice
• Mostly static data – some write intensive
  – Careful use of partitioning to avoid index rebuilds and defragmentation
  – If 70% is static, 10% is write intensive
• Does wear leveling work?

Page 54: Storage Performance 2013

DATABASE – SQL SERVER

Page 55: Storage Performance 2013

Database Environment

• OLTP + DW databases are very high value
  – Software license + development cost is huge
  – 1 or more full-time DBAs, several application developers, and help desk personnel
  – Can justify any reasonable expense
  – Full knowledge of the data (where the writes are)
  – Full control of the data (where the writes are)
  – Can adjust practices to avoid writes to SSD

Page 56: Storage Performance 2013

Database – Storage Growth

• 10GB per day data growth
  – 10M items at 1KB per row (or 4 x 250 byte rows)
  – 18TB for 5 years (1831 days) – see the sketch below
  – Database log can stay on HDD
• Heavy system
  – 64-128 x 256/512GB (raw) SSD
  – Each SSD can support 20GB/day (36TB lifetime?)
• With partitioning – few full index rebuilds
• Can replace MLC SSD every 2 years if required

Big company
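The growth and write-budget arithmetic, as a Python sketch (the 36TB per-SSD budget is the slide's tentative figure):

growth_gb_per_day = 10
print(growth_gb_per_day * 1831 / 1000)      # ~18.3 TB of data after 5 years (1831 days)

ssd_count = 64                              # low end of the heavy-system range
print(growth_gb_per_day / ssd_count)        # ~0.16 GB/day of new data landing on each SSD
print(36 * 1000 / 20 / 365)                 # even 20GB/day against a 36TB budget -> ~4.9 years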

Page 57: Storage Performance 2013

Extra Capacity - Maintenance

• Storage capacity will be 2-3X the database size
  – It would be really stupid if you cannot update the application for lack of space to modify a large table
• SAN environment
  – Only the required storage capacity is allocated
  – May not be able to perform maintenance ops if the SAN admin does not allocate extra space

Page 58: Storage Performance 2013

SSD/HDD Component Pricing 2013

• MLC consumer: <$1.0K/TB
• MLC Micron P400e: <$1.2K/TB
• MLC endurance: <$2.0K/TB
• SLC: $4K???
• HDD 600GB 10K: $400

Page 59: Storage Performance 2013

Database Storage Cost

• 8 x 256GB (raw) SSD per x4 SAS channel = 2TB
• 2 x4 ports per RAID controller = 4TB per controller
• 4 RAID controllers per 2-socket system = 16TB
• 32TB with 512GB SSDs, 64TB with 1TB SSDs
  – 64 SSDs per system at $250 (MLC): $16K
  – 64 HDDs (10K, 600GB) at $400: $26K
  – Server: 2 x E5, 24 x 16GB, qty 2 at $12K each
  – SQL Server 2012 EE: $6K x 16 cores = $96K

HET MLC and even SLC premiums are OK.
Server/enterprise premium: high validation effort, low volume, high support expectations.
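The cost roll-up, as a Python sketch of the slide's 2013 figures:

ssd = 64 * 250         # 64 MLC SSDs at $250            -> $16,000
hdd = 64 * 400         # 64 x 10K 600GB HDDs at $400    -> $25,600 (~$26K)
srv = 2 * 12000        # two 2-socket E5 servers, 24 x 16GB each, $12K each
sql = 16 * 6000        # SQL Server 2012 EE per-core licensing, 16 cores -> $96,000
print(ssd, hdd, srv, sql, ssd + hdd + srv + sql)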

Page 60: Storage Performance 2013

OLTP & DW

• OLTP – backup to local HDD
  – Superfast backup: read 10GB/s, write 3GB/s (R5)
  – Writes to data are blocked during backup
  – Recovery requires log replay
• DW – example: 10TB data, 16TB SSD
  – Flat files on HDD
  – Tempdb will generate intensive writes (1TB)
• Database (real) restore testing
  – Force tx roll forward/back, i.e., need an HDD array

Page 61: Storage Performance 2013

SQL Server Storage Configuration

• The IO system must have massive IO bandwidth
  – IO spread over several channels
• The database must be able to use all channels simultaneously
  – Multiple files per filegroup (see the sketch below)
• Volumes / RAID groups on each channel
  – Each volume comprised of several devices
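A minimal sketch of what this looks like in practice: a Python script that emits a CREATE DATABASE statement with one file per filegroup on each data volume. The database name, file sizes, and the E:-H: data volumes and L: log volume are illustrative assumptions, not from the slides.

# Generate T-SQL for a database striped across all data volumes (sketch only;
# names, paths and sizes are hypothetical placeholders).
volumes = ['E:\\data', 'F:\\data', 'G:\\data', 'H:\\data']   # one per RAID group/port
filegroups = ['FG_A', 'FG_B']

parts = ["CREATE DATABASE MyDB",
         "ON PRIMARY (NAME = MyDB_sys, FILENAME = 'E:\\data\\MyDB_sys.mdf', SIZE = 256MB)"]
for fg in filegroups:
    parts.append(", FILEGROUP " + fg)
    files = ["    (NAME = {0}_f{1}, FILENAME = '{2}\\{0}_f{1}.ndf', SIZE = 64GB)".format(fg, i + 1, vol)
             for i, vol in enumerate(volumes)]
    parts.append(",\n".join(files))
parts.append("LOG ON (NAME = MyDB_log, FILENAME = 'L:\\log\\MyDB_log.ldf', SIZE = 32GB);")
print("\n".join(parts))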

Page 62: Storage Performance 2013

HDD, RAID versus SQL Server

• HDD – pure sequential IO is not practical, impossible to maintain
  – Large block (256K) is good enough
    • 64K is OK
• RAID controller – 64K to 256K stripe size
• SQL Server
  – Default extent allocation: 64K per file
  – With -E, 4 consecutive extents – why not 16???

Page 63: Storage Performance 2013

File Layout Physical View

Each filegroup and tempdb has 1 data file on every data volume.

IO to any object is distributed over all paths and all disks

[Diagram: two-socket server (192GB per socket) with four x8 HBAs to the data volumes, plus x4 10GbE.]

Page 64: Storage Performance 2013

Filegroup & File Layout

Each filegroup has 1 file on each data volume.
Each object is distributed across all data "disks".
Tempdb data files share the same volumes.

As shown, 2 RAID groups per controller, 1 per port. Can be 4 RAID groups/volumes per controller.
OS and log disks not shown.

Controller 1 Port 0, Disk 2 (Basic): FileGroup A File 1, FileGroup B File 1, Tempdb File 1
Controller 1 Port 1, Disk 3 (Basic): FileGroup A File 2, FileGroup B File 2, Tempdb File 2
Controller 2 Port 0, Disk 4 (Basic): FileGroup A File 3, FileGroup B File 3, Tempdb File 3
Controller 2 Port 1, Disk 5 (Basic): FileGroup A File 4, FileGroup B File 4, Tempdb File 4
Controller 3 Port 0, Disk 6 (Basic): FileGroup A File 5, FileGroup B File 5, Tempdb File 5
Controller 3 Port 1, Disk 7 (Basic): FileGroup A File 6, FileGroup B File 6, Tempdb File 6
Controller 4 Port 0, Disk 8 (Basic): FileGroup A File 7, FileGroup B File 7, Tempdb File 7
Controller 4 Port 1, Disk 9 (Basic): FileGroup A File 8, FileGroup B File 8, Tempdb File 8

Page 65: Storage Performance 2013

RAID versus SQL Server Extents

Default: allocate extent 1 from file 1, extent 2 from file 2, and so on. Disk IO is 64K.
Only 1 disk in each RAID group is active at a time.

Controller 1 Port 0, Disk 2 (Basic, 1112GB, Online): Extents 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45
Controller 1 Port 1, Disk 3 (Basic, 1112GB, Online): Extents 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46
Controller 2 Port 0, Disk 4 (Basic, 1112GB, Online): Extents 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47
Controller 2 Port 1, Disk 5 (Basic, 1112GB, Online): Extents 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48

Page 66: Storage Performance 2013

Consecutive Extents with -E

Controller 1 Port 0, Disk 2 (Basic, 1112GB, Online): Extents 1-4, 17-20, 33-36
Controller 1 Port 1, Disk 3 (Basic, 1112GB, Online): Extents 5-8, 21-24, 37-40
Controller 2 Port 0, Disk 4 (Basic, 1112GB, Online): Extents 9-12, 25-28, 41-44
Controller 2 Port 1, Disk 5 (Basic, 1112GB, Online): Extents 13-16, 29-32, 45-48

Allocate 4 consecutive extents from each file, so the OS issues 256K disk IO.
Each HDD in the RAID group sees a 64K IO; up to 4 disks in the RAID group get IO.
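A small Python sketch of the two allocation patterns on this and the previous slide, and why -E matters for IO size:

# Extent placement across data files: default round-robin (1 extent per file)
# vs. startup option -E (4 consecutive extents per file).
def allocate(total_extents, files, run_length):
    placement = {f: [] for f in range(1, files + 1)}
    f = 1
    for start in range(1, total_extents + 1, run_length):
        placement[f].extend(range(start, start + run_length))
        f = f % files + 1
    return placement

default = allocate(48, files=4, run_length=1)
with_e  = allocate(48, files=4, run_length=4)
print(default[1][:6])   # file 1: [1, 5, 9, 13, 17, 21] -> only 64K contiguous per file
print(with_e[1][:6])    # file 1: [1, 2, 3, 4, 17, 18]  -> 256K contiguous, striped as 64K over up to 4 disks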

Page 67: Storage Performance 2013

Storage Summary

• OLTP – endurance MLC or consumer MLC?
• DW – MLC with higher over-provisioning
• QA – consumer MLC or endurance MLC?
• Tempdb – possibly SLC
• Single log – HDD; multiple logs – SSD?
• Backups / test restores / flat files – HDD
• No caching, no auto-tiering

Page 68: Storage Performance 2013

SAN

Page 69: Storage Performance 2013

Software Cache + Tier

Page 70: Storage Performance 2013

Cache + Auto-Tier

A good idea if: 1) no knowledge, 2) no control.

In a database we have: 1) full knowledge, 2) full control.

Virtual file stats, filegroups, partitioning.
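"Full knowledge of where the writes are" comes from SQL Server's virtual file stats. A minimal Python/pyodbc sketch (the connection string is a placeholder; adjust driver, server and authentication to your environment):

import pyodbc

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=localhost;Trusted_Connection=yes;")
sql = """
SELECT DB_NAME(vfs.database_id) AS db, mf.name AS file_name,
       vfs.num_of_bytes_read, vfs.num_of_bytes_written, vfs.io_stall
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY vfs.num_of_bytes_written DESC;
"""
# Rank files by cumulative write volume since instance startup.
for row in conn.cursor().execute(sql):
    print(row.db, row.file_name, row.num_of_bytes_read, row.num_of_bytes_written, row.io_stall)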

Page 71: Storage Performance 2013

Common SAN Vendor Configuration

[Diagram: two server nodes through dual switches (8Gbps FC or 10Gbps FCoE) to SP A and SP B (24GB cache each, 768GB per node); x4 SAS (2GB/s) to shelves of SSD, 10K and 7.2K HDD plus hot spares; one main data volume and a log volume.]

Path and component fault-tolerance, poor IO performance

Multi-path IO: preferred port, alternate port.

Single large volume for data, additional volumes for log, tempdb, etc

All data IO on a single FC port: 700MB/s IO bandwidth.

Page 72: Storage Performance 2013

Multiple Paths & Volumes 3

Multiple quad-port FC HBAs

Optional SSD volumes

Data files must also be evenly distributed

Many SAS ports

Multiple local SSD for tempdb

[Diagram: two server nodes with multiple quad-port FC HBAs through dual switches to SP A/SP B (768GB per node, 24GB cache, x4 SAS 2GB/s); 16 data volumes (Data 1-16), 4 SSD volumes, 4 log volumes; multiple local SSDs for tempdb on the servers.]


Page 74: Storage Performance 2013

8Gbps FC rules

• 4-5 HDD RAID groups/volumes
  – SQL Server with -E only allocates 4 consecutive extents
• 2+ volumes per FC port
  – Target 700MB/s per 8Gbps FC port
• SSD volumes
  – Limited by 700-800MB/s per 8Gbps FC port
  – Too many ports required for serious bandwidth (see the sketch below)
  – Management headache from too many volumes
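Port-count arithmetic at ~700MB/s per 8Gbps FC port, as a Python sketch:

import math

def fc_ports_needed(target_gb_s, per_port_mb_s=700):
    return math.ceil(target_gb_s * 1000 / per_port_mb_s)

print(fc_ports_needed(2.8))   # one RAID controller's worth of bandwidth -> 4 ports
print(fc_ports_needed(10))    # a 10GB/s scan target -> 15 ports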

Page 75: Storage Performance 2013

SQL Server

• SQL Server table scan
  – To a heap: generates 512K IO, easy to hit 100MB/s per disk
  – To a (clustered) index: 64K IO, 30-50MB/s per disk likely
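The disk-count implication, as a Python sketch (per-disk rates are the slide's estimates):

import math
target_mb_s = 10000                        # e.g., a 10GB/s scan target
print(math.ceil(target_mb_s / 100))        # heap scan, 512K IO, ~100MB/s per disk -> 100 disks
print(math.ceil(target_mb_s / 40))         # index scan, 64K IO, 30-50MB/s per disk -> ~250 disks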

Page 76: Storage Performance 2013

EMC VNX 5300 FT DW Ref Arch

Page 77: Storage Performance 2013

iSCSI & File structure

[Diagrams: iSCSI storage units, each with dual controllers and x4 10GbE ports (RJ45 and SFP+). Option 1: one unit holds the DB1 files and another holds the DB2 files. Option 2: each unit holds one file from each database (DB1 file 1 + DB2 file 1 on one unit, DB1 file 2 + DB2 file 2 on the other).]

Page 78: Storage Performance 2013
Page 79: Storage Performance 2013

EMC VMAX

Page 80: Storage Performance 2013

EMC VMAX orig and 2nd gen

2nd gen:
· 2.8 GHz Xeon w/ turbo (Westmere)
· 24 CPU cores
· 256 GB cache memory (maximum)
· Quad Virtual Matrix
· PCIe Gen2

Original:
· 2.3 GHz Xeon (Harpertown)
· 16 CPU cores
· 128 GB cache memory (maximum)
· Dual Virtual Matrix
· PCIe Gen1

[Diagram: VMAX engine – CPU complex with front-end/back-end ports and global memory, CMI-II between directors.]

Page 81: Storage Performance 2013

EMC VMAX 10K

Page 82: Storage Performance 2013

EMC VMAX Virtual Matrix

VirtualMatrix

Page 83: Storage Performance 2013

VMAX Director

Page 84: Storage Performance 2013

EMC VMAX Director

[Diagram: VMAX engine – two directors, each with two IOHs, FC HBAs (front end), SAS (back end), and VMI links into the Virtual Matrix.]

VMAX 10K (new): up to 4 engines, 1 x 6-core 2.8GHz per director, 50GB/s VM BW?, 16 x 8Gbps FC per engine

VMAX 20K engine: 4 x QC 2.33GHz, 128GB, Virtual Matrix BW 24GB/s
System: 8 engines, 1TB, VM BW 192GB/s, 128 FE ports

VMAX 40K engine: 4 x SC 2.8GHz, 256GB, Virtual Matrix BW 50GB/s
System: 8 engines, 2TB, VM BW 400GB/s, 128 FE ports

RapidIO IPC: 3.125GHz, 2.5Gb/s (8b/10b), 4 lanes per connection; 10Gb/s = 1.25GB/s, 2.5GB/s full duplex; 4 connections per engine – 10GB/s

36 PCI-E per IOH, 72 combined; 8 FE, 8 BE; 16 VMI 1, 32 VMI 2
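The RapidIO arithmetic, as a Python sketch:

signal_ghz = 3.125
per_lane_gb_s = signal_ghz * (8 / 10) / 8      # 8b/10b encoding -> 2.5Gb/s -> 0.3125 GB/s per lane
link_gb_s = per_lane_gb_s * 4                  # 4 lanes -> 1.25 GB/s per direction
print(link_gb_s, link_gb_s * 2)                # 1.25 GB/s, 2.5 GB/s full duplex
print(link_gb_s * 2 * 4)                       # 4 connections per engine -> 10 GB/s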

Page 85: Storage Performance 2013
Page 86: Storage Performance 2013

SQL Server Default Extent Allocation

Data file 1: Extents 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45
Data file 2: Extents 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46
Data file 3: Extents 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47
Data file 4: Extents 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48

Allocate 1 extent per file in round-robin, with proportional fill. An EE/SE table scan tries to stay 1024 pages ahead?

SQL Server can read 64 contiguous pages from 1 file. The storage engine reads index pages serially in key order.
Partitioned table support for heap organization desired?

Page 87: Storage Performance 2013

SAN

[Diagram: SAN with 16 data volumes, a log volume, and 8 SSDs behind SP A/SP B (768GB, 24GB cache, 8Gb FC, x4 SAS 2GB/s), serving two server nodes (192GB per socket, 4 HBAs each).]

Page 88: Storage Performance 2013

Clustered SAS

[Diagram: clustered SAS – server nodes attached to dual-controller SAS enclosures (4 SAS host ports, IOC, PCIe switch, 2GB cache, SAS expander, SAS in/out for expansion) holding SSDs and HDDs; plus RAID controllers, 10GbE and InfiniBand.]

Page 89: Storage Performance 2013

Fusion-IO ioScale

Page 90: Storage Performance 2013
Page 91: Storage Performance 2013
Page 92: Storage Performance 2013