TRANSCRIPT
#SQLSatRiyadh
Storage Performance 2013
Joe Chang
www.qdpma.com
About Joe
• SQL Server consultant since 1999
• Query Optimizer execution plan cost formulas (2002)
• True cost structure of SQL plan operations (2003?)
• Database with distribution statistics only, no data (2004)
• Decoding statblob/stats_stream
  – writing your own statistics
• Disk IO cost structure
• Tools for system monitoring, execution plan analysis
  – See ExecStats on www.qdpma.com
Storage Performance Chain
• All elements must be correct
  – No weak links
• Perfect on 6 out of 7 elements and 1 not correct = bad IO performance
[Diagram: the storage performance chain: SQL Server Engine, SQL Server Extent, SQL Server File, Pool, RAID Group, Direct-Attach or SAN (SAS/FC), SAS, HDD/SSD.]
Storage Performance Overview
• System Architecture
  – PCI-E, SAS, HBA/RAID controllers
• SSD, NAND, Flash Controllers, Standards
  – Form Factors, Endurance, ONFI, Interfaces
• SLC, MLC Performance
• Storage system architecture
  – Direct-attach, SAN
• Database
  – SQL Server Files, FileGroup
Sandy Bridge EN & EP
Xeon E5-2600, Socket R (2011 pins), 2 QPI, 4 DDR3, 40 PCI-E 3.0 8GT/s, DMI2
Model, cores, clock, LLC, QPI, (Turbo):
E5-2690 8 core 2.9GHz 20M 8.0GT/s (3.8)*
E5-2680 8 core 2.7GHz 20M 8.0GT/s (3.5)
E5-2670 8 core 2.6GHz 20M 8.0GT/s (3.3)
E5-2667 6 core 2.9GHz 15M 8.0GT/s (3.5)*
E5-2665 8 core 2.4GHz 20M 8.0GT/s (3.1)
E5-2660 8 core 2.2GHz 20M 8.0GT/s (3.0)
E5-2650 8 core 2.0GHz 20M 8.0GT/s (2.8)
E5-2643 4 core 3.3GHz 10M 8.0GT/s (3.5)*
E5-2640 6 core 2.5GHz 15M 7.2GT/s (3.0)
Xeon E5-2400, Socket B2 (1356 pins), 1 QPI 8GT/s, 3 DDR3 memory channels, 24 PCI-E 3.0 8GT/s, DMI2 (x4 @ 5GT/s)
E5-2470 8 core 2.3GHz 20M 8.0GT/s (3.1)
E5-2440 6 core 2.4GHz 15M 7.2GT/s (2.9)
E5-2407 4c – 4t 2.2GHz 10M 6.4GT/s (n/a)
[Diagram: Xeon E5-2600 (EP) and E5-2400 (EN) two-socket block diagrams: per socket, cores C0-C7 share the LLC, with memory interfaces, QPI links between the sockets, PCI-E x8 lanes off each socket, and DMI2 to the PCH.]
Dell T620: 4 x16, 2 x8, 1 x4
Dell R720: 1 x16, 6 x8
HP DL380 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9DRX+F: 10 x8, 1 x4 g2
80 PCI-E gen 3 lanes + 8 gen 2 possible
Disable cores in BIOS/UEFI?
Xeon E5-4600, Socket R (2011 pins), 2 QPI, 4 DDR3, 40 PCI-E 3.0 8GT/s, DMI2
Model, cores, clock, LLC, QPI, (Turbo):
E5-4650 8 core 2.70GHz 20M 8.0GT/s (3.3)*
E5-4640 8 core 2.40GHz 20M 8.0GT/s (2.8)
E5-4620 8 core 2.20GHz 16M 7.2GT/s (2.6)
E5-4617 6c - 6t 2.90GHz 15M 7.2GT/s (3.4)
E5-4610 6 core 2.40GHz 15M 7.2GT/s (2.9)
E5-4607 6 core 2.20GHz 12M 6.4GT/s (n/a)
E5-4603 4 core 2.00GHz 10M 6.4GT/s (n/a)
The high-frequency 6-core gives up HT; there is no high-frequency 4-core.
160 PCI-E gen 3 lanes + 16 gen 2 possible
Dell R820: 2 x16, 4 x8, 1 int
HP DL560 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9QR: 7 x16, 1 x8
[Diagram: four-socket Xeon E5-4600 block diagram: each socket with cores C0-C7 sharing the LLC, its own memory interface, QPI links to the other sockets, and PCI-E lanes; DMI2 to the PCH.]
PCI-E, SAS & RAID CONTROLLERS
PCI-E gen 1, 2 & 3
Gen     Raw bit rate  Unencoded  Bandwidth per direction  BW x8 per direction  Net bandwidth x8
PCIe 1  2.5GT/s       2Gbps      ~250MB/s                 2GB/s                1.6GB/s
PCIe 2  5.0GT/s       4Gbps      ~500MB/s                 4GB/s                3.2GB/s
PCIe 3  8.0GT/s       8Gbps      ~1GB/s                   8GB/s                6.4GB/s?
• PCIe 1.0 & 2.0 encoding scheme: 8b/10b
• PCIe 3.0 encoding scheme: 128b/130b
• Simultaneous bi-directional transfer
• Protocol overhead
  – Sequence/CRC, header
  – 22 bytes (20%?)
Adaptec Series 7: 6.6GB/s, 450K IOPS
PCI-E Packet
Net realizable bandwidth appears to be about 20% less (1.6GB/s out of 2.0GB/s).
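A worked check of the table above, assuming the roughly 20% protocol overhead cited:

$$5.0\,\mathrm{GT/s} \times \tfrac{8}{10} = 4\,\mathrm{Gbps} = 0.5\,\mathrm{GB/s\ per\ lane};\quad 0.5 \times 8 = 4\,\mathrm{GB/s};\quad 4 \times 0.8 \approx 3.2\,\mathrm{GB/s\ net\ (PCIe\ 2.0\ x8)}$$
$$8.0\,\mathrm{GT/s} \times \tfrac{128}{130} \approx 7.9\,\mathrm{Gbps} \approx 1\,\mathrm{GB/s\ per\ lane};\quad 1 \times 8 = 8\,\mathrm{GB/s};\quad 8 \times 0.8 \approx 6.4\,\mathrm{GB/s\ net\ (PCIe\ 3.0\ x8)}$$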
PCIe Gen2 & SAS/SATA 6Gbps
• SATA 6Gbps – single lane, net BW 560MB/s
• SAS 6Gbps – x4 lanes, net BW 2.2GB/s
  – Dual-port: SAS protocol only, not supported by SATA
[Diagram: PCIe g2 x8 HBA (3.2GB/s) feeding two SAS x4 6G ports (2.2GB/s each). Some bandwidth mismatch is OK, especially on the downstream side.]
PCIe 3 & SAS
• 12Gbps – coming soon? Slowly?
  – Infrastructure will take more time
PCIe 3.0 x8 HBA: 2 SAS x4 12Gbps ports, or 4 SAS x4 6Gbps ports if the HBA can sustain 6GB/s.
[Diagram: PCIe g3 x8 HBA with four SAS x4 6G ports; alternatively, two SAS x4 12G ports into SAS expanders, each expander fanning out to SAS x4 6Gb links.]
PCIe Gen3 & SAS 6Gbps
LSI 12Gbps SAS 3008
PCIe RAID Controllers?
• 2 x4 SAS 6Gbps ports (2.2GB/s per x4 port)
  – 1st generation PCIe 2 – 2.8GB/s?
  – Adaptec: PCIe g3 can do 4GB/s
  – 3 x4 SAS 6Gbps would bandwidth-match PCIe 3.0 x8
• 6 x4 SAS 6Gbps – Adaptec Series 7, PMC
  – 1 chip: x8 PCIe g3 and 24 SAS 6Gbps lanes
• Because they could
[Diagram: PCIe g3 x8 HBA with six SAS x4 6G ports.]
SSD, NAND, FLASH CONTROLLERS
SSD Evolution
• HDD replacement – using existing HDD infrastructure
  – PCI-E card form factor lacks expansion flexibility
• Storage system designed around SSD
  – PCI-E interface with HDD-like form factor?
  – Storage enclosure designed for SSD
• Rethink computer system memory & storage
• Re-do the software stack too!
SFF-8639 & Express Bay
SCSI Express – storage over PCI-E, NVMe
New Form Factors - NGFF
Enterprise 10K/15K HDD - 15mm
15mm. An SSD storage enclosure could be 1U with 75 x 5mm devices?
[Photos: SATA Express Card (NGFF), mSATA, and M.2 form factors; Crucial devices shown.]
SSD – NAND Flash
• NAND
  – SLC, MLC regular and high-endurance
  – eMLC can mean endurance MLC or embedded MLC – these differ
• Controller interfaces NAND to SATA or PCI-E
• Form Factor
  – SATA/SAS interface in 2.5in HDD or new form factor
  – PCI-E interface and form factor, or HDD-like form factor
  – Complete SSD storage system
NAND Endurance
Intel – High Endurance Technology MLC
NAND Endurance – Write Performance
[Chart: endurance vs. write performance for SLC, MLC-e, and MLC; both are process-dependent (34nm, 25nm, 20nm).]
Cost structure: MLC = 1, MLC EE = 1.3, SLC = 3.
NAND P/E - Micron
34nm or 25nm MLC NAND is probably good; a database can support the cost structure.
NAND P/E - IBM
34nm or 25nm MLC NAND is probably good; a database can support the cost structure.
Write Endurance: vendors commonly cite a single spec for a range of models (120, 240, 480GB). Shouldn't it vary with raw capacity? Depends on over-provisioning?
3 year life is OK for MLC cost structure, maybe even 2 year
MLC 20TB lifetime = 10GB/day for 2000 days (5 years+), or 20GB/day for about 3 years. Vendors now cite 72TB write endurance for 120-480GB capacities?
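The lifetime arithmetic behind those figures:

$$\frac{20\,\mathrm{TB}}{10\,\mathrm{GB/day}} = 2000\ \text{days} \approx 5.5\ \text{years};\qquad \frac{20\,\mathrm{TB}}{20\,\mathrm{GB/day}} = 1000\ \text{days} \approx 2.7\ \text{years}$$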
NAND
• SLC – fast writes, high endurance
• eMLC – slow writes, medium endurance
• MLC – medium writes, low endurance
• MLC cost structure of $1/GB @ 25nm
  – eMLC 1.4X, SLC 2X?
ONFI
Open NAND Flash Interface organization
• 1.0 (2006) – 50MB/s
• 2.0 (2008) – 133MB/s
• 2.1 (2009) – 166 & 200MB/s
• 3.0 (2011) – 400MB/s
– Micron has 200 & 333MHz products
ONFI 1.0 – 6 channels to support 3Gbps SATA (260MB/s)
ONFI 2.0 – 4+ channels to support 6Gbps SATA (560MB/s)
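The channel counts above follow from simple multiplication against the net SATA bandwidths:

$$6 \times 50\,\mathrm{MB/s} = 300\,\mathrm{MB/s} \gtrsim 260\,\mathrm{MB/s}\ \text{(SATA 3Gbps)};\qquad 4\text{-}5 \times 133\,\mathrm{MB/s} \approx 530\text{-}665\,\mathrm{MB/s}\ \text{vs.}\ 560\,\mathrm{MB/s}\ \text{(SATA 6Gbps)}$$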
NAND write performance
MLC: 85MB/s per 4-die channel (128GB); 340MB/s over 4 channels (512GB)?
Controller Interface PCIe vs. SATA
[Diagram: NAND controller with 8 NAND channels.]
Some bandwidth mismatch/overkill is OK. ONFI 2 – 8 channels at 133MHz to SATA 6Gbps (560MB/s) is a good match.
PCIe or SATA? Multiple lanes?
CPU access efficiency and scaling: Intel & NVM Express
6-8 channels at 400MB/s to match 2.2GB/s x4 SAS?
16+ channels at 400MB/s to match 6.4GB/s x8 PCIe 3
But ONFI 3.0 is overwhelming SATA 6Gbps?
Controller Interface PCIe vs. SATA
[Diagram: SATA controller with DRAM and 8 NAND channels, versus PCIe controller with DRAM and a wider array of NAND channels.]
PCIe NAND Controller Vendors
Vendor     Channels  PCIe Gen
IDT        32        x8 Gen3, NVMe
Micron     32        x8 Gen2
Fusion-IO  3 x4?     x8 Gen2?
SATA & PCI-E SSD Capacities
64 Gbit MLC NAND die, 150mm², 25nm
2 x 32 Gbit (34nm), 1 x 64 Gbit (25nm), 1 x 64 Gbit (29nm)
1 x 64 Gbit die
8 x 64 Gbit die in 1 package = 64GB
SATA Controller – 8 channels, 8 packages x 64GB = 512GB
PCI-E Controller – 32 channels x 64GB = 2TB
PCI-E vs. SATA/SAS
• SATA/SAS controllers have 8 NAND channels
  – No economic benefit in fewer channels?
  – 8 ch. is a good match for 50MB/s NAND to SATA 3G
    • 3Gbps – approx. 280MB/s realizable BW
  – 8 ch. is also a good match for 100MB/s to SATA 6G
    • 6Gbps – 560MB/s realizable BW
  – NAND is now at 200 & 333MB/s
• PCI-E – 32 channels practical – 1500 pins
  – 333MHz is a good match to gen 3 x8 – 6.4GB/s BW
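Roughly, the channel arithmetic behind that last match:

$$32\ \text{channels} \times 200\,\mathrm{MB/s} = 6.4\,\mathrm{GB/s} \approx \text{net PCIe 3.0 x8 bandwidth}$$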
Crucial/Micron P400m & P400e
Crucial P400m            100GB     200GB     400GB
Raw                      168GB     336GB     672GB
Seq Read (up to)         380MB/s   380MB/s   380MB/s
Seq Write (up to)        200MB/s   310MB/s   310MB/s
Random Read              52K       54K       60K
Random Write             21K       26K       26K
Endurance (2M-hr MTBF)   1.75PB    3.0PB     7.0PB
Price                    $300?     $600?     $1000?
Crucial P400e            100GB     200GB     400GB
Raw                      128GB     256GB     512GB
Seq Read (up to)         350MB/s   350MB/s   350MB/s
Seq Write (up to)        140MB/s   140MB/s   140MB/s
Random Read              50K       50K       50K
Random Write             7.5K      7.5K      7.5K
Endurance (1.2M-hr MTBF) 175TB     175TB     175TB
Price                    $176      $334      $631
P410m SAS specs slightly different
EE MLC: higher endurance; write performance not lower than MLC?
Preliminary – need to update
Crucial m4 & m500
Crucial m4               128GB     256GB     512GB
Raw                      128GB     256GB     512GB
Seq Read (up to)         415MB/s   415MB/s   415MB/s
Seq Write (up to)        175MB/s   260MB/s   260MB/s
Random Read              40K       40K       40K
Random Write             35K       50K       50K
Endurance                72TB      72TB      72TB
Price                    $112      $212      $400
Crucial m500             120GB     240GB     480GB     960GB
Raw                      128GB     256GB     512GB     1024GB
Seq Read (up to)         500MB/s   500MB/s   500MB/s
Seq Write (up to)        130MB/s   250MB/s   400MB/s
Random Read              62K       72K       80K
Random Write             35K       60K       80K
Endurance (1.2M-hr MTBF) 72TB      72TB      72TB
Price                    $130      $220      $400      $600
Preliminary – need to update
Micron & Intel SSD Pricing (2013-02)
[Chart: price ($0-$1,000) vs. capacity class (100/128, 200/256, 400/512 GB) for m400, P400e, P400m, and S3700.]
P400m raw capacities are 168, 336 and 672GB (pricing retracted). Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800GB) respectively.
Need corrected P400m pricing
4K Write K IOPS
[Chart: 4K write IOPS (0-60K) vs. capacity class (100/128, 200/256, 400/512 GB) for m400, P400e, P400m, and S3700.]
P400m raw capacities are 168, 336 and 672GB (pricing retracted). Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800GB) respectively.
Need corrected P400m pricing
SSD Summary
• MLC is possible with a careful write strategy
  – Partitioning to minimize index rebuilds
  – Avoid full database restore to SSD
• Endurance (HET) MLC – write perf?
  – Standard DB practices work
  – But avoid frequent index defrags?
• SLC – only for extreme write-intensive?
  – Lower volume product – higher cost
DIRECT ATTACH STORAGE
Full IO Bandwidth
Misc devices on 2 x4 PCIe g2, Internal boot disks, 1GbE or 10GbE, graphics
• 10 PCIe g3 x8 slots possible – Supermicro only
  – HP, Dell systems have 5-7 x8+ slots + 1 x4?
• 4GB/s per slot with 2 x4 SAS, 6GB/s with 4 x4
• Mixed SSD + HDD – reduce wear on MLC
[Diagram: two-socket server, 192GB per socket, with four RAID controllers on PCIe x8 slots, each attached to mixed HDD + SSD; InfiniBand on x8 and 10GbE/misc on x4.]
System Storage Strategy
Dell & HP only have 5-7 slots. 4 controllers at 4GB/s each is probably good enough? Few practical products can use PCIe G3 x16 slots.
• Capable of 16GB/s with initial capacity
  – 4 HBAs, 4-6GB/s each
• With allowance for capacity growth
  – And mixed SSD + HDD
[Diagram: two-socket server (192GB per socket) with four RAID controllers on PCIe x8 slots, each with HDD + SSD, plus InfiniBand and 10GbE.]
Clustered SAS Storage
Dell MD3220 supports clustering, up to 4 nodes without an external switch (extra nodes not shown).
[Diagram: two server nodes (each two-socket, 192GB per socket, 4 HBAs) attached to four MD3220 enclosures; each enclosure has dual controllers, each with four SAS host ports, an IOC, a PCIe switch, 2GB cache, and a SAS expander to SSD and HDD bays.]
Alternate SSD/HDD Strategy
[Diagram: primary system with four HBAs and InfiniBand on PCIe x8 slots, 10GbE on x4.]
• Primary System
  – All SSD for data & temp; logs may be HDD
• Secondary System
  – HDD for backup and restore testing
[Diagram: the primary system's HBAs attach all-SSD volumes; a secondary backup system (two-socket, 192GB per socket) with four RAID controllers attaches HDD volumes; the systems are linked by InfiniBand and 10GbE.]
System Storage: Mixed SSD + HDD
Each RAID group/volume should not exceed the 2GB/s BW of x4 SAS; 2-4 volumes per x8 PCIe G3 slot.
SATA SSD: read 350-500MB/s, write 140MB/s+. 8 per volume allows for some overkill; 16 SSDs per RAID controller.
64 SATA/SAS SSDs to deliver 16-24GB/s.
4 HDD per volume rule does not apply
HDD for local database backup, restore tests, and DW flat files. SSD & HDD on a shared channel – simultaneous bi-directional IO.
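One way to read the 16-24GB/s figure above: it assumes roughly 250-375MB/s of realized read bandwidth per SSD:

$$64 \times 250\,\mathrm{MB/s} = 16\,\mathrm{GB/s};\qquad 64 \times 375\,\mathrm{MB/s} = 24\,\mathrm{GB/s}$$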
[Diagram: two-socket server (192GB per socket) with four HBAs on x8 slots plus InfiniBand (x8) and 10GbE (x4); the HBAs attach 64 SSDs and 48 HDDs in total across their SAS channels.]
SSD/HDD System Strategy
• MLC is possible with a careful write strategy
  – Partitioning to minimize index rebuilds
  – Avoid full database restore to SSD
• Hybrid SSD + HDD system, full-duplex signalling
• Endurance (HET) MLC – write perf?
  – Standard DB practices work; avoid index defrags
• SLC – only for extreme write-intensive?
  – Lower volume product – higher cost
• HDD – for restore testing
SAS Expander
Disk enclosure (expansion ports not shown): 2 x4 to hosts, 1 x4 for expansion, 24 x1 for disks.
Storage Infrastructure – designed for HDD (15mm)
• 2 SAS expanders for dual-port support
  – 1 x4 upstream (to host), 1 x4 downstream (expansion)
  – 24 x1 for bays
2U
Mixed HDD + SSD Enclosure (15mm)
• Current: 24 x 15mm = 360mm + spacing
• Proposed: 16 x 15mm = 240mm + 16 x 7mm = 120
2U
Enclosure 24x15mm and proposed
Current 2U enclosure: 24 x 15mm bays – HDD or SSD; 2 SAS expanders, 32 lanes each: 4 lanes upstream to host, 4 lanes downstream for expansion, 24 lanes for bays.
[Diagram: host (two PCIe x8 links, 384GB) with an HBA connected by SAS x4 6Gbps (2.2GB/s) to the current enclosure's dual SAS expanders, and by SAS x4 12Gbps (4GB/s) to the proposed enclosure.]
New SAS 12Gbps: 16 x 15mm + 16 x 7mm bays; 2 SAS expanders, 40 lanes each: 4 lanes upstream to host, 4 lanes downstream for expansion, 32 lanes for bays.
2 RAID groups for SSD, 2 for HDD; 1 SSD volume on path A, 1 SSD volume on path B.
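The lane accounting for the two enclosures above:

$$\text{Current: } 4 + 4 + 24 = 32\ \text{lanes per expander};\qquad \text{Proposed: } 4 + 4 + 32 = 40\ \text{lanes per expander}$$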
Alternative Expansion
[Diagram: PCIe x8 HBA with four SAS x4 ports, one to each of Enclosures 1-4, each via its own SAS expander.]
Each SAS expander: 40 lanes; 8 lanes upstream to host with no expansion, or 4 lanes upstream and 5 lanes downstream for expansion; 32 lanes for bays.
PCI-E with Expansion
[Diagrams: host (two PCIe x8, 384GB) with an HBA and SAS x4 6Gbps (2.2GB/s) into a dual-expander enclosure, compared with a host PCIe x8 link into a PCI-E switch fanning out to multiple x8 ports.]
• PCI-E slot SSD suitable for known capacity
• 48 & 64 lane PCI-E switches available
– x8 or x4 ports
Express bay form factor?
Few x8 ports, or many x4 ports?
Enclosure for SSD (+ HDD?)
• 2 x4 upstream on each expander – 4GB/s
  – No downstream ports for expansion?
• 32 ports for device bays
  – 16 SSD (7mm) + 16 HDD (15mm)
• 40 lanes total with no expansion
  – 48 lanes with expansion
Large SSD Array
• Large number of devices, large capacity
  – Downstream from the CPU has excess bandwidth
• Do not need SSD firmware peak performance
  – 1) no stoppages, 2) consistency is nice
• Mostly static data – some write intensive
  – Careful use of partitioning to avoid index rebuilds and defragmentation (see the sketch below)
  – If 70% is static, 10% is write intensive
• Does wear leveling work?
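A minimal sketch of that partitioning idea, assuming a hypothetical Orders table partitioned by month so that only the current partition ever needs index maintenance (all names and boundary dates are illustrative, not from the presentation):

```sql
-- Hypothetical example: range-partition by date so older, static partitions
-- are never rebuilt or defragmented; only the active partition is touched.
CREATE PARTITION FUNCTION pfOrderDate (datetime)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);  -- one filegroup here for simplicity

CREATE TABLE dbo.Orders (
    OrderID   bigint        NOT NULL,
    OrderDate datetime      NOT NULL,
    Detail    varchar(1000) NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderDate, OrderID)
) ON psOrderDate (OrderDate);

-- Maintain only the partition that takes writes, not the whole index.
ALTER INDEX PK_Orders ON dbo.Orders REBUILD PARTITION = 4;  -- rightmost (current) partition
```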
DATABASE – SQL SERVER
Database Environment
• OLTP + DW databases are very high value
  – Software license + development is huge
  – 1 or more full-time DBAs, several application developers, and help desk personnel
  – Can justify any reasonable expense
  – Full knowledge of data (where the writes are)
  – Full control of data (where the writes are)
  – Can adjust practices to avoid writes to SSD
Database – Storage Growth
• 10GB per day data growth
  – 10M items at 1KB per row (or 4 x 250 byte rows)
  – 18TB over 5 years (1831 days)
  – Database log can stay on HDD
• Heavy system
  – 64-128 x 256/512GB (raw) SSD
  – Each SSD can support 20GB/day (36TB lifetime?)
• With partitioning – few full index rebuilds
• Can replace MLC SSD every 2 years if required
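The growth and endurance arithmetic behind those bullets:

$$10\,\mathrm{GB/day} \times 1831\ \text{days} \approx 18.3\,\mathrm{TB};\qquad \frac{36\,\mathrm{TB}}{20\,\mathrm{GB/day}} = 1800\ \text{days} \approx 5\ \text{years}$$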
Big company
Extra Capacity - Maintenance
• Storage capacity will be 2-3X database size
  – It would be really stupid if you could not update the application for lack of space to modify a large table
• SAN environment
  – Only the required storage capacity is allocated
  – May not be able to perform maintenance ops if the SAN admin does not allocate extra space
SSD/HDD Component Pricing 2013
• MLC consumer: <$1.0K/TB
• MLC Micron P400e: <$1.2K/TB
• MLC endurance: <$2.0K/TB
• SLC: $4K???
• HDD 600GB 10K: $400
Database Storage Cost
• 8 x 256GB (raw) SSD per x4 SAS channel = 2TB
• 2 x4 ports per RAID controller = 4TB per controller
• 4 RAID controllers per 2-socket system = 16TB
• 32TB with 512GB SSDs, 64TB with 1TB
  – 64 SSD per system at $250 (MLC): $16K
  – 64 HDD 10K 600GB at $400: $26K
  – Server 2 x E5, 24 x 16GB, qty 2: $12K each
  – SQL Server 2012 EE: $6K x 16 cores = $96K
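Roughly, the capacity and hardware-cost totals above:

$$8 \times 256\,\mathrm{GB} = 2\,\mathrm{TB/port};\ \ 2 \times 2\,\mathrm{TB} = 4\,\mathrm{TB/controller};\ \ 4 \times 4\,\mathrm{TB} = 16\,\mathrm{TB/system}$$
$$64 \times \$250 = \$16\mathrm{K}\ \text{(SSD)};\qquad 64 \times \$400 \approx \$26\mathrm{K}\ \text{(HDD)}$$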
HET MLC and even SLC premium is OK. Server/enterprise premium – high validation effort, low volume, high support expectations.
OLTP & DW
• OLTP – backup to local HDD (sketched below)
  – Superfast backup: read 10GB/s, write 3GB/s (RAID 5)
  – Writes to data are blocked during backup
  – Recovery requires log replay
• DW – example: 10TB data, 16TB SSD
  – Flat files on HDD
  – Tempdb will generate intensive writes (1TB)
• Database (real) restore testing
  – Force tx roll forward/back, i.e., need an HDD array
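A minimal sketch of that local-HDD backup, striping across several HDD volumes so sequential write bandwidth adds up (database name and drive paths are hypothetical):

```sql
-- Stripe the backup over multiple files on separate HDD volumes;
-- each file gets a share of the write stream.
BACKUP DATABASE BigDB
TO  DISK = 'H:\Backup\BigDB_1.bak',
    DISK = 'I:\Backup\BigDB_2.bak',
    DISK = 'J:\Backup\BigDB_3.bak',
    DISK = 'K:\Backup\BigDB_4.bak'
WITH COMPRESSION, STATS = 10;
```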
SQL Server Storage Configuration
• IO system must have massive IO bandwidth
  – IO spread over several channels
• Database must be able to use all channels simultaneously
  – Multiple files per filegroup (see the sketch after this list)
• Volumes / RAID groups on each channel
  – Each volume comprised of several devices
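A minimal sketch of that layout, assuming four hypothetical data volumes mounted as D: through G: (file names, paths, and sizes are illustrative only):

```sql
-- One filegroup with one data file per volume, so large scans and loads
-- spread IO across all four controller ports/channels.
CREATE DATABASE BigDB
ON PRIMARY
    (NAME = BigDB_sys, FILENAME = 'D:\SQLData1\BigDB_sys.mdf', SIZE = 1GB),
FILEGROUP FG_A
    (NAME = BigDB_A1, FILENAME = 'D:\SQLData1\BigDB_A1.ndf', SIZE = 100GB),
    (NAME = BigDB_A2, FILENAME = 'E:\SQLData2\BigDB_A2.ndf', SIZE = 100GB),
    (NAME = BigDB_A3, FILENAME = 'F:\SQLData3\BigDB_A3.ndf', SIZE = 100GB),
    (NAME = BigDB_A4, FILENAME = 'G:\SQLData4\BigDB_A4.ndf', SIZE = 100GB)
LOG ON
    (NAME = BigDB_log, FILENAME = 'L:\SQLLog\BigDB_log.ldf', SIZE = 50GB);
```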
HDD, RAID versus SQL Server
• HDD – pure sequential – not practical, impossible to maintain
  – Large block 256K is good enough
    • 64K OK
• RAID controller – 64K to 256K stripe size
• SQL Server
  – Default extent allocation: 64K per file
  – With –E, 4 consecutive extents – why not 16???
File Layout Physical View
Each filegroup and tempdb has 1 data file on every data volume.
IO to any object is distributed over all paths and all disks
[Diagram: two-socket server, 192GB per socket, with four x8 HBAs plus x4 10GbE; each HBA port hosts a data volume.]
Filegroup & File Layout
Each filegroup has 1 file on each data volume. Each object is distributed across all data "disks". Tempdb data files share the same volumes.
As shown, 2 RAID groups per controller, 1 per port. Can be 4 RAID groups/volumes per controller.
OS and Log disks not shown
Controller 1 Port 0, Disk 2 (Basic): FileGroup A File 1, FileGroup B File 1, Tempdb File 1
Controller 1 Port 1, Disk 3 (Basic): FileGroup A File 2, FileGroup B File 2, Tempdb File 2
Controller 2 Port 0, Disk 4 (Basic): FileGroup A File 3, FileGroup B File 3, Tempdb File 3
Controller 2 Port 1, Disk 5 (Basic): FileGroup A File 4, FileGroup B File 4, Tempdb File 4
Controller 3 Port 0, Disk 6 (Basic): FileGroup A File 5, FileGroup B File 5, Tempdb File 5
Controller 3 Port 1, Disk 7 (Basic): FileGroup A File 6, FileGroup B File 6, Tempdb File 6
Controller 4 Port 0, Disk 8 (Basic): FileGroup A File 7, FileGroup B File 7, Tempdb File 7
Controller 4 Port 1, Disk 9 (Basic): FileGroup A File 8, FileGroup B File 8, Tempdb File 8
RAID versus SQL Server Extents
Default: allocate extent 1 from file 1, extent 2 from file 2, and so on; disk IO is 64K. Only 1 disk in each RAID group is active.
Controller 1 Port 0, Disk 2 (Basic, 1112GB, Online): Extents 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45
Controller 1 Port 1, Disk 3 (Basic, 1112GB, Online): Extents 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46
Controller 2 Port 0, Disk 4 (Basic, 1112GB, Online): Extents 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47
Controller 2 Port 1, Disk 5 (Basic, 1112GB, Online): Extents 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48
Consecutive Extents (-E)
Controller 1 Port 0, Disk 2 (Basic, 1112GB, Online): Extents 1-4, 17-20, 33-36
Controller 1 Port 1, Disk 3 (Basic, 1112GB, Online): Extents 5-8, 21-24, 37-40
Controller 2 Port 0, Disk 4 (Basic, 1112GB, Online): Extents 9-12, 25-28, 41-44
Controller 2 Port 1, Disk 5 (Basic, 1112GB, Online): Extents 13-16, 29-32, 45-48
Allocate 4 consecutive extents from each file; the OS issues 256K disk IO.
Each HDD in the RAID group sees 64K IO; up to 4 disks in the RAID group get IO.
Storage Summary
• OLTP – endurance MLC or consumer MLC?
• DW – MLC with higher over-provisioning
• QA – consumer MLC or endurance MLC?
• Tempdb – possibly SLC
• Single log – HDD; multiple logs – SSD?
• Backups / test restore / flat files – HDD
• No caching, no auto-tiers
SAN
Software Cache + Tier
Cache + Auto-Tier is a good idea if you have: 1) no knowledge, 2) no control.
In the database we have: 1) full knowledge, 2) full control.
Virtual file stats, filegroups, partitioning.
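A minimal sketch of the "full knowledge" point: the virtual file stats DMV reports exactly where reads and writes land, per database file:

```sql
-- Per-file IO since instance start: shows which files/volumes take the reads and writes.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name                  AS logical_file,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_bytes_read,
       vfs.num_of_writes,
       vfs.num_of_bytes_written,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY vfs.num_of_bytes_written DESC;
```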
Common SAN Vendor Configuration
[Diagram: two nodes (768GB each) through dual switches (8Gbps FC or 10Gbps FCOE) to service processors SP A and SP B (24GB cache each), then x4 SAS (2GB/s) to shelves of SSD, 10K, and 7.2K drives plus hot spares; one main volume for data and a separate log volume.]
Path and component fault-tolerance, poor IO performance
Multi-path IO: preferred port, alternate port.
Single large volume for data; additional volumes for log, tempdb, etc.
All data IO goes over a single FC port: 700MB/s IO bandwidth.
Multiple Paths & Volumes 3
Multiple quad-port FC HBAs
Optional SSD volumes
Data files must also be evenly distributed
Many SAS ports
Multiple local SSD for tempdb
[Diagram: two nodes (768GB each) with multiple quad-port FC HBAs and local x8 SSD, dual switches, SP A/SP B (24GB each), x4 SAS (2GB/s); 16 data volumes (Data 1-16), 4 SSD volumes, and 4 log volumes.]
Multiple Paths & Volumes 2
[Diagram: same two-node layout with multiple quad-port FC HBAs, many SAS ports, optional SSD volumes, 16 data volumes, 4 SSD volumes, 4 log volumes, and multiple local SSD for tempdb; data files evenly distributed.]
8Gbps FC rules
• 4-5 HDD RAID groups/volumes
  – SQL Server with –E only allocates 4 consecutive extents
• 2+ volumes per FC port
  – Target 700MB/s per 8Gbps FC port
• SSD volumes
  – Limited by 700-800MB/s per 8Gbps FC port
  – Too many ports required for serious BW
  – Management headache from too many volumes
SQL Server
• SQL Server table scan
  – heap: generates 512K IO, easy to hit 100MB/s per disk
  – (clustered) index: 64K IO, 30-50MB/s per disk likely
EMC VNX 5300 FT DW Ref Arch
iSCSI & File structure
[Diagrams: dual-controller iSCSI units (x4 10GbE, RJ45 and SFP+); in one layout each controller holds one database's files, in another each database has one file on each controller.]
EMC VMAX
EMC VMAX orig and 2nd gen
2nd gen: 2.8GHz Xeon w/turbo (Westmere), 24 CPU cores, 256GB cache memory (maximum), Quad Virtual Matrix, PCIe Gen2
Original: 2.3GHz Xeon (Harpertown), 16 CPU cores, 128GB cache memory (maximum), Dual Virtual Matrix, PCIe Gen1
[Diagram: two CPU complexes, each with front-end/back-end ports, sharing global memory over CMI-II.]
EMC VMAX 10K
EMC VMAX Virtual Matrix
[Diagram: VMAX engine with two directors; each director has IOHs, FC HBAs for front end, SAS for back end, and VMI links into the Virtual Matrix.]
VMAX 10K (new): up to 4 engines, 1 x 6-core 2.8GHz per director, 50GB/s Virtual Matrix BW?, 16 x 8Gbps FC per engine.
VMAX 20K: engine – 4 QC 2.33GHz, 128GB, Virtual Matrix BW 24GB/s; system – 8 engines, 1TB, VM BW 192GB/s, 128 FE ports.
VMAX 40K: engine – 4 SC 2.8GHz, 256GB, Virtual Matrix BW 50GB/s; system – 8 engines, 2TB, VM BW 400GB/s, 128 FE ports.
RapidIO IPC: 3.125GHz, 2.5Gb/s (8/10 encoding), 4 lanes per connection; 10Gb/s = 1.25GB/s, 2.5GB/s full duplex; 4 connections per engine – 10GB/s.
36 PCI-E lanes per IOH, 72 combined; 8 FE, 8 BE; 16 VMI 1, 32 VMI 2.
SQL Server Default Extent Allocation
Data file 1: Extents 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45
Data file 2: Extents 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46
Data file 3: Extents 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47
Data file 4: Extents 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48
Allocate 1 extent per file in round robin (proportional fill). EE/SE table scan tries to stay 1024 pages ahead?
SQL can read 64 contiguous pages from 1 file. The storage engine reads index pages serially in key order. Partitioned table support for heap organization desired?
SAN
[Diagram: two nodes (768GB each), dual switches, 8Gb FC to SP A/SP B (24GB each), x4 SAS (2GB/s) to SSD and 10K shelves; 16 data volumes, log volumes, and SSD 1-8.]
Clustered SAS
[Diagram: two two-socket nodes (192GB per socket, 4 HBAs each, 768GB total per node) cross-connected to dual-controller SAS enclosures; each controller has four SAS host ports, an IOC, a PCIe switch, 2GB cache, and a SAS expander with SAS in/out for expansion.]
Fusion-IO ioScale