Designing for 20TB Disk Drives and "Enterprise Storage", Jim Gray, Microsoft Research
TRANSCRIPT
1
Designing for 20TB Disk Drives and "enterprise storage"
Jim Gray, Microsoft Research
2
Disk Evolution
- Capacity: 100x in 10 years
  - 1 TB 3.5" drive in 2005; 20 TB? in 2012?!
- System on a chip
- High-speed SAN
- Disk replacing tape
- Disk is a super computer!
[Figure: capacity scale: kilo, mega, giga, tera, peta, exa, zetta, yotta]
3
Disks are becoming computers
- Smart drives: camera with micro-drive, Replay / TiVo / UltimateTV, phone with micro-drive, MP3 players, tablet, Xbox, many more…
- Disk controller + 1 GHz cpu + 1 GB RAM
- Comm: Infiniband, Ethernet, radio…
- Applications: web, DBMS, files
- OS
4
Intermediate Step: Shared Logic
- Brick with 8-12 disk drives, 200 mips/arm (or more)
- 2 x Gbps Ethernet, general-purpose OS
- 10k$/TB to 100k$/TB
- Shared: sheet metal, power, support/config, security, network ports
- These bricks could run applications (e.g. SQL or Mail or …)
- Examples: Snap ~1 TB (12 x 80 GB) NAS; NetApp ~0.5 TB (8 x 70 GB) NAS; Maxtor ~2 TB (12 x 160 GB) NAS; IBM TotalStorage ~360 GB (10 x 36 GB) NAS
5
Hardware
- Homogeneous machines lead to quick response through reallocation
- HP desktop machines: 320 MB RAM, 3U high, 4 x 100 GB IDE drives
- $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB
- 3 weeks from ordering to operational
Slide courtesy of Brewster Kahle, @ Archive.org
6
Disk as Tape
- Tape is unreliable, specialized, slow, low density, not improving fast, and expensive.
- Using removable hard drives to replace tape's function has been successful.
- When a "tape" is needed, the drive is put in a machine and it is online; no need to copy from tape before it is used.
- Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, @ Archive.org
7
Disk As Tape: What format?
- Today I send NTFS/SQL disks, but that is not a good format for Linux.
- Solution: ship NFS/CIFS/ODBC servers (not disks).
- Plug the "disk" into the LAN: DHCP, then it is a file or DB server via standard interfaces. Web Services in the long term.
8
State is Expensive
- Stateless clones are easy to manage (app servers are the middle tier). Cost goes to zero with Moore's law; one admin per 1,000 clones. Good story about scaleout.
- Stateful servers are expensive to manage: 1 TB to 100 TB per admin. Storage cost is going to zero (2k$ to 200k$), so the cost of storage is the management cost.
9
Databases (== SQL)
- VLDB survey (Winter Corp): 10 TB to 100 TB DBs. Size doubling yearly, riding disk Moore's law. 10,000 disks at 18 GB is 100 TB cooked.
- Mostly DSS and data warehouses; some media managers.
10
Interesting facts
- No DBMSs beyond 100 TB.
- Most bytes are in files: the web is file centric, eMail is file centric, science (and batch) is file centric.
- But… SQL performance is better than CIFS/NFS (CISC vs RISC).
11
BaBar: the biggest DB
- 500 TB
- Uses Objectivity™
- SLAC events
- Linux cluster scans the DB looking for patterns
12
300 TB (cooked): Hotmail / Yahoo
- Clone front ends ~10,000; servers ~100 @ hotmail
- Get mail box, get/put mail; disk bound
- ~30,000 disks
- ~20 admins
13
AOL (MSN) (1 PB?)
- 10 B transactions per day (MSN is ~10% of that)
- Huge storage, huge traffic, lots of eye candy
- DB used for security/accounting
- GUESS: AOL is a petabyte (40M users x 10 MB = 400 x 10^12 bytes)
14
Google
- 1.5 PB as of last spring
- 8,000 no-name PCs, each 1/3U, 2 x 80 GB disk, 2 cpu, 256 MB RAM
- 1.4 PB online, 2 TB RAM online, 8 TeraOps
- Slice price is 1k$, so 8M$
- 15 admins (!) (== 1 per 100 TB)
15
Astronomy
- I've been trying to apply DB to astronomy
- Today they are at 10 TB per data set, heading for petabytes
- Using Objectivity; trying SQL (talk to me offline)
16
Scale Out: Buy Computing by the Slice
- 709,202 tpmC! == 1 billion transactions/day
- Slice: 8 cpu, 8 GB RAM, 100 disks (= 1.8 TB); 20 ktpmC per slice, ~300k$/slice
- (Clients and 4 DTC nodes not shown)
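As a sanity check on the headline number, here is a minimal sketch that converts tpmC to transactions per day and estimates the slice count, using only the per-slice figures quoted on this slide (the slice price and tpmC rating are taken from the slide, not independently measured):

```python
# Back-of-envelope check of the scale-out numbers above.
TPMC_TOTAL = 709_202        # reported tpmC (transactions per minute, type C)
TPMC_PER_SLICE = 20_000     # ~20 ktpmC per slice
PRICE_PER_SLICE = 300_000   # ~300 k$ per slice

tx_per_day = TPMC_TOTAL * 60 * 24          # ~1.02 billion transactions/day
slices = TPMC_TOTAL / TPMC_PER_SLICE       # ~35 slices
slice_price = slices * PRICE_PER_SLICE     # ~10.6 M$ for the slices alone

print(f"{tx_per_day:,.0f} tx/day, ~{slices:.0f} slices, ~{slice_price/1e6:.1f} M$")
```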
17
ScaleUp: A Very Big System!
- UNISYS Windows 2000 Data Center Limited Edition
- 32 cpus, 32 GB of RAM, and 1,061 disks (15.5 TB) on 24 fiber channel
- Will be helped by 64-bit addressing
18
Hardware
[Diagram: three racks of "2200" storage enclosures with drive-letter volumes E through S, attached to SQL\Inst1, SQL\Inst2, SQL\Inst3, plus a spare]
- One SQL database per rack; each rack contains 4.5 TB
- 261 total drives / 13.7 TB total
- Meta data stored on 101 GB of "fast, small disks" (18 x 18.2 GB)
- Imagery data stored on 4,339 GB of "slow, big disks" (15 x 73.8 GB)
- To add 90 x 72.8 GB disks in Feb 2001 to create an 18 TB SAN
- 8 Compaq DL360 "Photon" web servers
- Fiber SAN switches
- 4 Compaq ProLiant 8500 DB servers
19
Amdahl’s Balance Laws
- Parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S.
- Balanced system law: a system needs a bit of IO per second per instruction per second; about 8 MIPS per MBps.
- Memory law: alpha = 1; the MB/MIPS ratio (called alpha) in a balanced system is 1.
- IO law: programs do one IO per 50,000 instructions.
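A minimal sketch of the laws as stated above; the example machine (1 Gips, 1 GB RAM, 125 MB/s of IO) is hypothetical, chosen only to show a system that satisfies all three balance ratios:

```python
# Amdahl's balance laws, as ratios against his rules of thumb.
def max_speedup(serial: float, parallel: float) -> float:
    """Parallelism law: best possible speedup is (S + P) / S."""
    return (serial + parallel) / serial

def check_balance(mips: float, mbps_io: float, mb_ram: float) -> dict:
    """Balanced-system, memory, and IO laws for a hypothetical machine."""
    return {
        "mips_per_MBps": mips / mbps_io,         # balanced system law: ~8
        "alpha_MB_per_MIPS": mb_ram / mips,      # memory law: ~1
        "IO_per_sec_needed": mips * 1e6 / 50_000,  # IO law: 1 IO / 50,000 ins
    }

print(max_speedup(serial=1, parallel=9))         # 10x at best, even with infinite cpus
print(check_balance(mips=1_000, mbps_io=125, mb_ram=1_000))
```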
20
Amdahl’s Laws Valid 35 Years Later?
- Parallelism law is algebra, so SURE!
- Balanced system laws? Look at TPC results (TPC-C, TPC-H) at http://www.tpc.org/
- Some imagination needed:
  - What's an instruction? (CPI varies from 1-3): RISC, CISC, VLIW, … clocks per instruction, …
  - What's an I/O?
21
TPC systems
- Normalize for CPI (clocks per instruction): TPC-C has about 7 ins/byte of IO; TPC-H has 3 ins/byte of IO.
- TPC-H needs 1/2 as many disks: sequential vs random access.
- Both use 9 GB 10 krpm disks (need arms, not bytes).

                        MHz/cpu   CPI   mips   KB/IO   IO/s/disk   Disks   Disks/cpu   MB/s/cpu   Ins/IO byte
  Amdahl                   1       1      1      6         -          -        -           -            8
  TPC-C  (random)         550     2.1    262     8        100        397       50          40           7
  TPC-H  (sequential)     550     1.2    458    64        100        176       22         141           3
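The derived columns hang together; a minimal sketch (assuming 8 cpus per system, which is consistent with the Disks vs Disks/cpu columns) reproduces the MB/s/cpu and Ins/IO-byte figures from the raw columns:

```python
# Rough consistency check of the TPC rows in the table above.
def mbps_per_cpu(disks, io_per_sec_per_disk, kb_per_io, cpus=8):
    """Aggregate disk bandwidth divided across the cpus."""
    return disks * io_per_sec_per_disk * kb_per_io / 1024 / cpus

def ins_per_io_byte(mips, mbps):
    """Instructions executed per byte of IO moved."""
    return mips / mbps

tpcc = mbps_per_cpu(397, 100, 8)      # ~39 MB/s/cpu (table: 40)
tpch = mbps_per_cpu(176, 100, 64)     # ~138 MB/s/cpu (table: 141)
print(tpcc, ins_per_io_byte(262, tpcc))   # ~6.8 ins/IO byte (table: 7)
print(tpch, ins_per_io_byte(458, tpch))   # ~3.3 ins/IO byte (table: 3)
```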
22
TPC systems: What's alpha (= MB/MIPS)?
- Hard to say:
  - Intel: 32-bit addressing (= 4 GB limit), known CPI.
  - IBM, HP, Sun have a 64 GB limit, unknown CPI.
  - Look at both; guess CPI for IBM, HP, Sun.
- Alpha is between 1 and 6:

                         Mips                 Memory   Alpha
  Amdahl                 1                    1        1
  tpcC Intel             8 x 262 = 2 Gips     4 GB     2
  tpcH Intel             8 x 458 = 4 Gips     4 GB     1
  tpcC IBM (24 cpus)     ?= 12 Gips           64 GB    6
  tpcH HP (32 cpus)      ?= 16 Gips           32 GB    2
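The alpha column is just memory divided by instruction rate; a quick sketch reproduces it approximately (the Gips figures for the IBM and HP rows are the slide's guesses, not measurements):

```python
# Alpha (MB per MIPS) for the rows above.
def alpha(memory_gb, gips):
    return (memory_gb * 1024) / (gips * 1000)   # MB / MIPS

for name, mem_gb, gips in [("tpcC Intel", 4, 2.1), ("tpcH Intel", 4, 3.7),
                           ("tpcC IBM", 64, 12), ("tpcH HP", 32, 16)]:
    print(name, round(alpha(mem_gb, gips), 1))  # ~2, ~1, ~5.5, ~2
```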
23
Performance (on current SDSS data)
[Chart: cpu and elapsed time in seconds (log scale, 1 to 1,000) for queries Q01 through Q20]
- Run times on a 15k$ COMPAQ server (2 cpu, 1 GB RAM, 8 disks): some take 10 minutes, some take 1 minute; median ~22 sec.
- GHz processors are fast! (10 mips/IO, 200 ins/byte); 2.5 M rec/s/cpu
[Chart: IO count vs cpu seconds; ~1,000 IOs/cpu sec ~ 64 MB IO/cpu sec]
24
How much storage do we need?
- Soon everything can be recorded and indexed.
- Most bytes will never be seen by humans.
- Data summarization, trend detection, anomaly detection are key technologies.
- See Mike Lesk: How much information is there? http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian: How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: storage scale from kilo to yotta, annotated with a photo, a book, a movie, all LoC books (words), all books multimedia, everything recorded]
(Small prefixes for reference: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli)
25
Standard Storage Metrics
- Capacity:
  - RAM: MB and $/MB: today at 512 MB and 200$/GB
  - Disk: GB and $/GB: today at 80 GB and 70k$/TB
  - Tape: TB and $/TB: today at 40 GB and 10k$/TB (nearline)
- Access time (latency):
  - RAM: 100 ns
  - Disk: 15 ms
  - Tape: 30 second pick, 30 second position
- Transfer rate:
  - RAM: 1-10 GB/s
  - Disk: 10-50 MB/s (arrays can go to 10 GB/s)
  - Tape: 5-15 MB/s (arrays can go to 1 GB/s)
26
New Storage Metrics: Kaps, Maps, SCAN
- Kaps: how many kilobyte objects served per second. The file server / transaction processing metric; this is the OLD metric.
- Maps: how many megabyte objects served per second. The multimedia metric.
- SCAN: how long to scan all the data. The data mining and utility metric.
- And Kaps/$, Maps/$, TBscan/$.
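A minimal sketch of the three metrics for a hypothetical current drive (80 GB, ~15 ms access, ~30 MB/s transfer, figures taken from the Standard Storage Metrics slide rather than a specific product):

```python
# Kaps / Maps / SCAN for one hypothetical disk.
ACCESS_S = 0.015          # seek + rotate, seconds
TRANSFER_MBPS = 30.0      # sustained transfer rate, MB/s
CAPACITY_GB = 80.0

def objects_per_second(object_mb: float) -> float:
    """One random access plus the transfer time per object."""
    return 1.0 / (ACCESS_S + object_mb / TRANSFER_MBPS)

kaps = objects_per_second(1 / 1024)          # ~65 KB-sized objects per second
maps = objects_per_second(1.0)               # ~20 MB-sized objects per second
scan_minutes = CAPACITY_GB * 1024 / TRANSFER_MBPS / 60   # ~45 minutes to scan it all

print(f"Kaps ~{kaps:.0f}/s, Maps ~{maps:.1f}/s, SCAN ~{scan_minutes:.0f} min")
```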
27
Kaps over time
[Chart: Kaps per disk and Kaps per $, 1970-2000, log scale]
- More Kaps and Kaps/$, but…
- Disk accesses got much less expensive: better disks, cheaper disks!
- But disk arms are expensive; the arm is the scarce resource.
- 1 hour scan (100 GB at 30 MB/s) vs 5 minutes in 1990.
28
Data on Disk Can Move to RAM in 10 years
[Chart: storage price vs time, megabytes per kilo-dollar, 1980-2000, log scale]
- The RAM and disk price curves track each other at roughly a 100:1 ratio, about a 10-year lag, so data that is affordable on disk today is affordable in RAM about 10 years later.
29
The “Absurd” 10x (=4 year) Disk
- 1 TB capacity, 100 MB/s transfer, 200 Kaps
- 2.5 hr scan time (poor sequential access)
- 1 access per second per 5 GB (VERY cold data)
- It's a tape!
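A quick sketch of the arithmetic behind the bullets above (reading "200 Kaps" as roughly 200 random accesses per second for the whole drive; the slide rounds the scan time to 2.5 hours):

```python
# Sanity check of the "absurd disk" numbers.
CAPACITY_GB = 1000
TRANSFER_MBPS = 100
ACCESSES_PER_S = 200

scan_hours = CAPACITY_GB * 1024 / TRANSFER_MBPS / 3600    # ~2.8 hours end-to-end scan
gb_per_access_per_s = CAPACITY_GB / ACCESSES_PER_S         # 1 access/s per 5 GB

print(f"scan ~{scan_hours:.1f} h, 1 access/s per {gb_per_access_per_s:.0f} GB")
```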
30
It's Hard to Archive a Petabyte
- It takes a LONG time to restore it: at 1 GBps it takes 12 days!
- Store it in two (or more) places online (on disk?): a geo-plex.
- Scrub it continuously (look for errors).
- On failure, use the other copy until the failure is repaired, then refresh the lost copy from the safe copy.
- Can organize the two copies differently (e.g., one by time, one by space).
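For the restore-time figure, treating a petabyte as 10^15 bytes, the arithmetic is:

```latex
% Restore time for one petabyte at 1 GB/s (the slide's "12 days"):
\frac{10^{15}\ \text{bytes}}{10^{9}\ \text{bytes/s}} = 10^{6}\ \text{s} \approx 11.6\ \text{days}
```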
31
Auto Manage Storage
- 1980 rule of thumb: a DataAdmin per 10 GB, a SysAdmin per mips.
- 2000 rule of thumb: a DataAdmin per 5 TB, a SysAdmin per 100 clones (varies with app).
- Problem: 5 TB is 50k$ today, 5k$ in a few years.
- Admin cost >> storage cost!
- Challenge: automate ALL storage admin tasks.
32
How to cool disk data:
- Cache data in main memory (see the 5-minute rule later in the presentation).
- Fewer, larger transfers: larger pages (512 B -> 8 KB -> 256 KB).
- Sequential rather than random access: random 8 KB IO is 1.5 MBps, sequential IO is 30 MBps (the 20:1 ratio is growing).
- RAID1 (mirroring) rather than RAID5 (parity).
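A sketch of why random 8 KB IOs deliver only ~1.5 MB/s while sequential IO runs at the media rate; the ~5 ms average seek-plus-rotate per random access is an assumption for illustration, not a figure from the slide:

```python
# Random vs sequential disk throughput for 8 KB pages.
PAGE_KB = 8
RANDOM_ACCESS_S = 0.005      # assumed average seek + rotational latency
SEQUENTIAL_MBPS = 30.0       # media transfer rate from the slide

page_mb = PAGE_KB / 1024
random_mbps = page_mb / (RANDOM_ACCESS_S + page_mb / SEQUENTIAL_MBPS)
print(f"random 8 KB: ~{random_mbps:.1f} MB/s vs sequential: {SEQUENTIAL_MBPS} MB/s "
      f"(~{SEQUENTIAL_MBPS / random_mbps:.0f}:1)")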
33
Data delivery costs 1$/GB today
- Rent for "big" customers: 300$ per megabit per second per month; improved 3x in the last 6 years (!).
- That translates to 1$/GB at each end.
- You can mail a 160 GB disk for 20$. That's 16x cheaper, and if it arrives overnight it's 4 MBps.
- 3 x 160 GB ~ 1/2 TB
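A sketch of the network-vs-mail comparison above; the 30-day month is an assumption for the arithmetic, and the prices are the slide's figures:

```python
# Network rent vs mailing a disk, in $/GB.
RENT_PER_MBPS_MONTH = 300.0          # $/Mbps/month for "big" customers
SECONDS_PER_MONTH = 30 * 24 * 3600
DISK_GB, DISK_SHIP_COST = 160, 20.0

gb_per_month = 1e6 / 8 / 1e9 * SECONDS_PER_MONTH          # ~324 GB moved by 1 Mbps in a month
net_cost_per_gb = RENT_PER_MBPS_MONTH / gb_per_month      # ~0.9 $/GB at each end
mail_cost_per_gb = DISK_SHIP_COST / DISK_GB               # ~0.13 $/GB

print(f"network ~{net_cost_per_gb:.2f} $/GB per end, mail ~{mail_cost_per_gb:.2f} $/GB, "
      f"ratio ~{2 * net_cost_per_gb / mail_cost_per_gb:.0f}x counting both ends")
```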