
Page 1

ECE7130: Advanced Computer Architecture: Storage Systems

Dr. Xubin He

http://iweb.tntech.edu/hexb

Email: [email protected]

Tel: 931-3723462, Brown Hall 319

Page 2

Outline

• Quick overview: storage systems

• RAID

• Advanced Dependability/Reliability/Availability

• I/O Benchmarks, Performance and Dependability

Page 3

Storage Architectures

Page 4

Disk Figure of Merit: Areal Density

• Bits recorded along a track
  – Metric is Bits Per Inch (BPI)
• Number of tracks per surface
  – Metric is Tracks Per Inch (TPI)
• Disk designs brag about bit density per unit area
  – Metric is Bits Per Square Inch: Areal Density = BPI x TPI

Year   Areal Density (Mbits per square inch)
1973   2
1979   8
1989   63
1997   3,090
2000   17,100
2006   130,000

[Chart: areal density vs. year, 1970-2010, log scale from 1 to 1,000,000]
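To make the metric concrete, here is a minimal sketch in Python (the BPI/TPI figures are hypothetical round numbers, not values from the slide):

```python
# Areal density is the product of the two linear recording densities.
def areal_density(bpi: float, tpi: float) -> float:
    """Bits per square inch = bits/inch along a track x tracks/inch."""
    return bpi * tpi

# Hypothetical drive: 900,000 BPI x 150,000 TPI
# -> 1.35e11 bits per square inch, i.e. about 135 Gbits per square inch.
print(areal_density(900_000, 150_000))
```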

Page 5

Historical Perspective

• 1956 IBM RAMAC to early-1970s Winchester
  – Developed for mainframe computers, proprietary interfaces
  – Steady shrink in form factor: 27 in. to 14 in.
• Form factor and capacity drive the market more than performance
• 1970s developments
  – 5.25 inch floppy disk form factor (microcode into mainframe)
  – Emergence of industry-standard disk interfaces
• Early 1980s: PCs and first-generation workstations
• Mid 1980s: client/server computing
  – Centralized storage on file server
    » accelerates disk downsizing: 8 inch to 5.25 inch
  – Mass-market disk drives become a reality
    » industry standards: SCSI, IPI, IDE
    » 5.25 inch to 3.5 inch drives for PCs; end of proprietary interfaces
• 1990s: laptops => 2.5 inch drives
• 2000s: what new devices will lead to new drives?

Page 6

Storage Media

• Magnetic:
  – Hard drive
• Optical:
  – CD
• Solid State Semiconductor:
  – Flash SSD
  – RAM SSD
• MEMS-based:
  – Significantly cheaper than DRAM but much faster than traditional HDD, high density
• Tape: sequentially accessed

Page 7

Driving forces

Source: Anderson & Whittington, Seagate, FAST’07

Page 8

HDD Market

Source: Anderson & Whittington, Seagate, FAST’07

Page 9

HDD Technology Trend

Source: Anderson & Whittington, Seagate, FAST’07

Page 10

Errors

• Errors happen:
  – Media causes: bit flips, noise…
• Handled by data detection and decoding logic, and ECC

Page 11

The right HDD for the right application

Source: Anderson & Whittington, Seagate, FAST’07

Page 12

Interfaces

• Internal
  – SCSI: Wide SCSI, Ultra Wide SCSI…
  – IDE/ATA: Parallel ATA (PATA): PATA 66/100/133
  – Serial ATA (SATA): SATA 150, SATA II 300
  – Serial Attached SCSI (SAS)
• External
  – USB (1.0/2.0)
  – FireWire (400/800)
  – eSATA: external SATA

Page 13

SATA/SAS compatibility

Source: Anderson & Whittington, Seagate, FAST’07

Page 14

Top HDD manufacturers

• Western Digital

• Seagate (also owns Maxtor)

• Fujitsu

• Samsung

• Hitachi

• IBM used to make hard disk drives but sold that division to Hitachi.


Page 15

Top companies providing storage solutions

• Adaptec

• EMC

• Qlogic

• IBM

• Hitachi

• Brocade

• Cisco

• HP

• Network Appliance

• Emulex


Page 16

Fastest-growing storage companies (07/2008). Source: storagesearch.com

Company Yearly growth Main product

Coraid 370% NAS

ExaGrid Systems 337% Disk to disk backup

RELDATA 300% iSCSI

Voltaire 194% InfiniBand

Transcend Information 166% SSD

OnStor 157% NAS

Bluepoint Data Storage 140% Online Backup and Storage

Compellent 124% NAS

Alacritech 100% iSCSI

Intransa 100% iSCSI


Page 17

Solid State Drive

• A data storage device that uses solid-state memory to store persistent data.

• An SSD uses either Flash non-volatile memory (Flash SSD) or DRAM volatile memory (RAM SSD)


Page 18

Advantages of SSD

• Faster startup: no spin-up

• Fast random access for read

• Extremely low read/write latency: much smaller seek time

• Quiet: no moving parts

• High mechanical reliability: endure shock, vibration

• Balanced performance across entire storage device


Page 19

Disadvantages of SSD

• Price: the unit price of an SSD is ~20x that of an HDD
• Limited write cycles (Flash SSD): 30–50 million
• Slower write speed (Flash SSD): erase blocks
• Lower storage density
• Vulnerable to some effects: abrupt power loss (RAM SSD), magnetic fields, and electric/static charges


Page 20

Top SSD manufacturers (as of 1Q 2008, source: storagesearch.com)

Rank Manufacturer SSD Technology

1 BitMICRO Networks Flash SSD

2 STEC Flash SSD

3 Mtron Flash SSD

4 Memoright Flash SSD

5 SanDisk Flash SSD

6 Samsung Flash SSD

7 Adtron Flash SSD

8 Texas Memory Systems RAM/Flash SSD

9 Toshiba Flash SSD

10 Violin Memory RAM SSD


Page 21

Networked Storage

• Network Attached Storage (NAS)
  – Storage accessed over TCP/IP, using industry-standard file sharing protocols like NFS, HTTP, Windows Networking
  – Provides file system functionality
  – Takes LAN bandwidth away from servers
• Storage Area Network (SAN)
  – Storage accessed over a Fibre Channel switching fabric, using encapsulated SCSI
  – Block-level storage system
  – Fibre Channel SAN
  – IP SAN
    » Implements a SAN over well-known TCP/IP
    » iSCSI: cost-effective, SCSI over TCP/IP


Page 22

Advantages

• Consolidation

• Centralized Data Management

• Scalability

• Fault Resiliency


Page 23

NAS and SAN shortcomings

• SAN shortcomings
  – Data to desktop
  – Sharing between NT and UNIX
  – Lack of standards for file access and locking
• NAS shortcomings
  – Shared tape resources
  – Number of drives
  – Distance to tapes/disks
• NAS focuses on applications, users, and the files and data that they share
• SAN focuses on disks, tapes, and a scalable, reliable infrastructure to connect them
• NAS plus SAN: the complete solution, from desktop to data center to storage device

Page 24

Organizations

• IEEE Computer Society Mass Storage Systems Technical Committee (MSSTC or TCMS): http://www.msstc.org
• IETF: www.ietf.org
  – IPS, IMSS
• International Committee for Information Technology Standards: www.incits.org
  – Storage: B11, T10, T11, T13
• Storage mailing list: [email protected]
• INSIC: Information Storage Industry Consortium: www.insic.org
• SNIA: Storage Networking Industry Association: http://www.snia.org
  – Technical work groups


Page 25

Conferences

• FAST: USENIX: http://www.usenix.org/event/fast09/
• SC: IEEE/ACM: http://sc08.supercomputing.org/
• MSST: IEEE: http://storageconference.org/
  – SNAPI, SISW, CMPD, DAPS
• NAS: IEEE: http://www.eece.maine.edu/iwnas/
• SNW: SNIA: Storage Networking World
• SNIA Storage Developer Conference: http://www.snia.org/events/storage-developer2008/
• Other conferences with storage components:
  – IPDPS, ICDCS, ICPP, AReS, PDCS, MASCOTS, HPDC, IPCCC, ccGrid…


Page 26

Awards in Storage Systems

• IEEE Reynold B. Johnson Information Storage Systems Award
  – Sponsored by: IBM Almaden Research Center
• 2008 Recipient:
  – Alan J. Smith, Professor, University of California at Berkeley, Berkeley, CA, USA
• 2007 Co-Recipients:
  – Dave Hitz, Executive Vice President and Co-Founder, Network Appliance, Sunnyvale, CA
  – James Lau, Chief Strategy Officer, Executive Vice President and Co-Founder, Network Appliance, Sunnyvale, CA
• 2006 Recipient:
  – Jaishankar M. Menon, Director of Storage Systems Architecture and Design, IBM, San Jose, CA
  – "For pioneering work in the theory and application of RAID storage systems."
• More: http://www.ieee.org/portal/pages/about/awards/sums/johnson.html


Page 27

HPC storage challenge: SC

• HPC systems comprise three major subsystems: processing, networking, and storage. In different applications, any one of these subsystems can limit overall system performance. The HPC Storage Challenge is a competition showcasing effective approaches that use the storage subsystem, which is often the limiting subsystem, with actual applications.

• Participants must describe their implementations and present measurements of performance, scalability, and storage subsystem utilization. Judging will be based on these measurements as well as innovation and effectiveness; maximum size and peak performance are not the sole criteria.

• Finalists are chosen on the basis of submissions, which take the form of a proposal; submissions are encouraged to include reports of work in progress. Participants with access to either large or small HPC systems are encouraged to enter this challenge.


Page 28

Research hotspots

• Energy Efficient Storage: CMU, UIUC

• Scalable Storage Meets Petaflops: IBM, UCSC

• High availability and Reliable Storage Systems: UC Berkeley, CMU, TTU, UCSC

• Object Storage (OSD): CMU, HUST, UCSC

• Storage virtualization: CMU

• Intelligent Storage Systems: UC Berkeley, CMU


Page 29

Energy Efficient Storage

• Energy aware:
  – Disk level: SSD
  – I/O and file system level: efficient memory, cache, networked storage
  – Application level: data layout
• Green Storage Initiative (GSI): SNIA
  – Energy-efficient storage networking solutions
  – Storage administration and infrastructure


Page 30

Peta-scale scalable storage

• Challenge: building storage systems that can keep up with ever-increasing processor speeds, multi-core designs, and petaflop computing capabilities.

• Disk latencies are not keeping up with peta-scale computing.

• Innovative approaches are needed.


Page 31

My research in high performance and reliable storage systems

• Active/Active Service for High Availability Computing
  – Active/Active Metadata Service
• Networked Storage Systems
  – STICS
  – iRAID
• A Unified Multiple-Level Cache for High Performance Storage Systems
  – iPVFS
  – iCache
• Performance-Adaptive UDP for Bulk Data Transfer over Dedicated Network Links: PA-UDP

Page 32

Improving disk performance

• Use large sectors to improve bandwidth
• Use track caches and read-ahead
  – Read the entire track into an on-controller cache
  – Exploit locality (improves both latency and bandwidth)
• Design file systems to maximize locality
  – Allocate files sequentially on disk (exploits the track cache)
  – Locate similar files in the same cylinder (reduces seeks)
  – Locate similar files in nearby cylinders (reduces seek distance)
• Pack bits closer together to improve transfer rate and density
• Use a collection of small disks to form a large, high-performance one ---> disk array

Striping data across multiple disks allows parallel I/O, thus improving performance.

Page 33

Use Arrays of Small Disks?

[Figure: the conventional approach uses four disk designs (14", 10", 5.25", 3.5") spanning low end to high end; a disk array uses a single 3.5" disk design.]

• Katz and Patterson asked in 1987: can smaller disks be used to close the gap in performance between disks and CPUs?

Page 34

Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

            IBM 3390K     IBM 3.5" 0061   x70 (array)
Capacity    20 GBytes     320 MBytes      23 GBytes
Volume      97 cu. ft.    0.1 cu. ft.     11 cu. ft.    (9X smaller)
Power       3 KW          11 W            1 KW          (3X lower)
Data Rate   15 MB/s       1.5 MB/s        110 MB/s      (8X higher)
I/O Rate    600 I/Os/s    55 I/Os/s       3900 I/Os/s   (6X higher)
MTTF        250 KHrs      50 KHrs         ??? Hrs
Cost        $250K         $2K             $150K

Disk arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?

Page 35

Array Reliability

• MTTF: Mean Time To Failure: the average time that a non-repairable component will operate before experiencing failure.

• Reliability of N disks = Reliability of 1 disk ÷ N

  50,000 hours ÷ 70 disks ≈ 700 hours: the disk system MTTF drops from 6 years to about 1 month!

• Arrays without redundancy are too unreliable to be useful!

Solution: redundancy.
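As a quick sanity check of the arithmetic above, a minimal Python sketch (assuming independent disk failures, the standard simplification behind the formula on this slide):

```python
# An array of N disks (without redundancy) fails when ANY disk fails,
# so under the independence assumption its MTTF is the per-disk MTTF / N.
def array_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    return disk_mttf_hours / n_disks

mttf = array_mttf(50_000, 70)
print(f"{mttf:.0f} hours, about {mttf / (24 * 30):.1f} months")  # ~714 hours
```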

Page 36

Redundant Arrays of (Inexpensive) Disks

• Replicate data over several disks so that no data will be lost if one disk fails.

• Redundancy yields high data availability
  – Availability: service is still provided to the user, even if some components have failed
• Disks will still fail
• Contents are reconstructed from data redundantly stored in the array
  – Capacity penalty to store redundant info
  – Bandwidth penalty to update redundant info

Page 37

Page 38

Levels of RAID

• The original RAID paper described five categories (RAID levels 1-5). (Patterson et al., "A case for redundant arrays of inexpensive disks (RAID)", ACM SIGMOD, 1988)

• Disk striping with no redundancy is now called RAID 0 or JBOD (Just a Bunch Of Disks).

• Other kinds have been proposed in the literature: Level 6 (P+Q redundancy), Level 10, RAID 53, etc.

• Except for RAID 0, all the RAID levels trade disk capacity for reliability, and the extra reliability makes parallelism a practical way to improve performance.

Page 39

RAID 0: Nonredundant (JBOD)

[Figure: file data blocks 0-3 striped across Disks 0-3]

• High I/O performance
• Data is not saved redundantly; a single copy of data is striped across multiple disks
• Low cost
• Lack of redundancy
• Least reliable: a single disk failure leads to data loss
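To illustrate the striping idea, a minimal sketch (the round-robin block mapping is the generic RAID 0 layout; the function name is my own):

```python
# RAID 0: logical block b lands on disk (b mod N), at stripe row (b div N),
# so N consecutive blocks can be accessed in parallel.
def raid0_map(block: int, n_disks: int) -> tuple[int, int]:
    return block % n_disks, block // n_disks

for b in range(8):  # 8 logical blocks over 4 disks
    disk, row = raid0_map(b, 4)
    print(f"logical block {b} -> disk {disk}, stripe row {row}")
```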

Page 40

Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its "mirror": very high availability can be achieved
• Bandwidth sacrifice on write: logical write = two physical writes
• Reads may be optimized: minimize queueing and disk seek time
• Most expensive solution: 100% capacity overhead

[Figure: mirrored disk pairs forming a recovery group]

Targeted for high I/O rate, high availability environments

Page 41

RAID 2: Memory-Style ECC

[Figure: data disks b0, b1, b2, b3; ECC disks f0(b), f1(b); and a parity disk P(b)]

• Multiple disks record the ECC information to determine which disk is at fault

• A parity disk is then used to reconstruct corrupted or lost data

• Needs log2(number of disks) redundancy disks
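A quick check of that count (a sketch; the ceiling for non-power-of-two arrays is my addition, and real Hamming codes need slightly more check disks than a bare log2):

```python
import math

# RAID 2 uses a Hamming-style code across disks, so the number of
# check disks grows roughly as log2 of the number of data disks.
def raid2_check_disks(data_disks: int) -> int:
    return max(1, math.ceil(math.log2(data_disks)))

for n in (4, 8, 16, 32):
    print(f"{n} data disks -> ~{raid2_check_disks(n)} check disks")
```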

Page 42

RAID 3: Bit (Byte) Interleaved Parity

• Only need one parity disk
• Writes/reads access all disks
• Only one request can be serviced at a time
• Easy to implement
• Provides high bandwidth but not high I/O rates

Targeted for high bandwidth applications: multimedia, image processing

[Figure: a logical record bit/byte-interleaved into striped physical records across the data disks, plus a parity record on disk P]
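To show how the parity record is formed and used, a minimal sketch (the byte values are made up, and `parity` is my own helper name):

```python
from functools import reduce

# RAID 3 parity: each parity byte is the XOR of the corresponding bytes
# on the data disks, so any single lost strip can be rebuilt by XOR-ing
# the surviving strips with the parity strip.
def parity(strips: list[bytes]) -> bytes:
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [b"\x91\x4c", b"\xd9\x93", b"\x24\x7e"]  # three data strips
p = parity(data)

# Disk 1 fails: rebuild its strip from the survivors plus parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```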

Page 43

RAID 3

• Sum computed across the recovery group to protect against hard disk failures, stored on the P disk
• Logically a single high-capacity, high-transfer-rate disk: good for large transfers
• Wider arrays reduce capacity costs but decrease availability
• 12.5% capacity cost for parity in this configuration

Inspiration for RAID 4:
• RAID 3 relies on the parity disk to discover errors on a read
• But every sector has an error detection field
• Rely on the error detection field to catch errors on read, not on the parity disk
• This allows independent reads to different disks simultaneously

Page 44

RAID 4: Block Interleaved Parity

Disk 0     Disk 1     Disk 2     Disk 3     Parity disk
block 0    block 1    block 2    block 3    P(0-3)
block 4    block 5    block 6    block 7    P(4-7)
block 8    block 9    block 10   block 11   P(8-11)
block 12   block 13   block 14   block 15   P(12-15)

• Blocks are the striping units
• Allows parallel access by multiple I/O requests: high I/O rates
• Doing multiple small reads is now faster than before (small read requests can be restricted to a single disk)
• Large writes (full stripe) update the parity directly:
  P' = d0' XOR d1' XOR d2' XOR d3'
• Small writes (e.g., a write to d0) update the parity incrementally:
  P = d0 XOR d1 XOR d2 XOR d3
  P' = d0' XOR d1 XOR d2 XOR d3 = P XOR d0 XOR d0'
• However, writes are still very slow, since the parity disk is the bottleneck.

Page 45

Problems of Disk Arrays: Small Writes (read-modify-write procedure)

[Figure: small write to D0 in a stripe D0 D1 D2 D3 P. The new data D0' is XORed with the old data D0 and the old parity P to produce the new parity P'.]

RAID-5 Small Write Algorithm:
(1. Read) old data D0    (2. Read) old parity P
(3. Write) new data D0'  (4. Write) new parity P' = D0 XOR D0' XOR P

1 Logical Write = 2 Physical Reads + 2 Physical Writes
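A minimal sketch of this read-modify-write procedure in Python (the in-memory stripe list stands in for physical disks; the block values are made up):

```python
# RAID 4/5 small write: update one data block and the stripe parity in
# 2 reads + 2 writes, using P' = P xor D_old xor D_new.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(stripe: list[bytes], parity_idx: int, data_idx: int,
                new_data: bytes) -> None:
    old_data = stripe[data_idx]        # 1. read old data
    old_parity = stripe[parity_idx]    # 2. read old parity
    stripe[data_idx] = new_data        # 3. write new data
    stripe[parity_idx] = xor(old_parity, xor(old_data, new_data))  # 4. write new parity

stripe = [b"\x01", b"\x02", b"\x03", b"\x04", b"\x04"]  # parity = XOR of data
small_write(stripe, parity_idx=4, data_idx=0, new_data=b"\x09")
assert stripe[4] == b"\x0c"  # 0x09 ^ 0x02 ^ 0x03 ^ 0x04
```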

Page 46

Inspiration for RAID 5

• RAID 4 works well for small reads
• Small writes (write to one disk):
  – Option 1: read the other data disks, create the new sum, and write it to the parity disk
  – Option 2: since P has the old sum, compare old data to new data and add the difference to P
• Small writes are limited by the parity disk: writes to D0 and D5 both also write to the P disk. The parity disk must be updated on every write operation!

[Figure: RAID 4 layout. Stripe 0: D0 D1 D2 D3 P; stripe 1: D4 D5 D6 D7 P]

Page 47

Redundant Arrays of Inexpensive Disks
RAID 5: High I/O Rate Interleaved Parity

Independent writes are possible because of interleaved parity:

Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
D0       D1       D2       D3       P
D4       D5       D6       P        D7
D8       D9       P        D10      D11
D12      P        D13      D14      D15
P        D16      D17      D18      D19
D20      D21      D22      D23      P
...      ...      ...      ...      ...

(disk columns; logical disk addresses increase down the array)

Example: a write to D0 and D5 uses disks 0, 1, 3, and 4
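The rotating parity placement above is easy to express in code; a minimal sketch (this matches the left-symmetric layout in the table; the function name is mine):

```python
# RAID 5: parity rotates across the disks, stripe by stripe, so parity
# update traffic is spread out instead of hitting one dedicated disk.
def parity_disk(stripe: int, n_disks: int) -> int:
    return (n_disks - 1) - (stripe % n_disks)

for s in range(6):
    print(f"stripe {s}: parity on disk {parity_disk(s, 5)}")
# -> disks 4, 3, 2, 1, 0, 4 ..., matching the layout table above
```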

Page 48

Comparison of RAID Levels (N disks, each with capacity C)

Level  Technique                Capacity   Advantage                                      Disadvantage
0      Striping                 N x C      Maximum data transfer rate and sizes          No redundancy
1      Mirroring                (N/2) x C  High performance, fastest write               Cost
3      Bit (byte)-level parity  (N-1) x C  Easy to implement, high error recoverability  Low performance
4      Block-level parity       (N-1) x C  High redundancy and better performance        Write-related bottleneck
5      Interleaved parity       (N-1) x C  High performance, reliability                 Small-write problem
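The capacity column reduces to a few lines of Python (a sketch; the function name and the level set covered are mine):

```python
# Usable capacity for the RAID levels in the table above,
# for N disks of capacity C each.
def usable_capacity(level: int, n: int, c: float) -> float:
    if level == 0:
        return n * c        # striping, no redundancy
    if level == 1:
        return n * c / 2    # mirrored pairs
    if level in (3, 4, 5):
        return (n - 1) * c  # one disk's worth of parity
    raise ValueError(f"level {level} not in the table")

print(usable_capacity(5, 8, 2.0))  # e.g. 8 x 2 TB drives in RAID 5 -> 14.0
```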

Page 49

RAID 6: Recovering from 2 failures

• Why recover from more than one failure?
  – An operator may accidentally replace the wrong disk during a failure
  – Since disk bandwidth is growing more slowly than disk capacity, the mean time to repair (MTTR) a disk in a RAID system is increasing; a longer repair raises the chance of a second failure during the repair window
  – Reading much more data during reconstruction means a greater chance of an uncorrectable media failure, which would result in data loss

Page 50

RAID 6: Recovering from 2 failures

• Network Appliance's row-diagonal parity, or RAID-DP
• Like the standard RAID schemes, it uses redundant space based on a per-stripe parity calculation
• Since it protects against a double failure, it adds two check blocks per stripe of data
  – If there are p+1 disks total, p-1 disks hold data; assume p = 5
• The row parity disk is just like in RAID 4
  – Even parity across the other 4 data blocks in its stripe
• Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal

Page 51

Example: p = 5

• Row-diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
  – Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are only missing 1 block
• Once the data for those blocks is recovered, the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
• The process continues until both failed disks are restored

Diagonal number of each block:

Data Disk 0  Data Disk 1  Data Disk 2  Data Disk 3  Row Parity  Diagonal Parity
0            1            2            3            4           0
1            2            3            4            0           1
2            3            4            0            1           2
3            4            0            1            2           3
4            0            1            2            3           4
0            1            2            3            4           0
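A small sketch of the diagonal numbering (my own reconstruction of the pattern in the table, using diagonal(row, disk) = (row + disk) mod p over the p disks covered by diagonal parity; it also checks the property the recovery argument relies on):

```python
p = 5  # p - 1 data disks, plus a row parity disk and a diagonal parity disk

# Diagonal d collects the blocks where (row + disk) mod p == d, taken over
# the p - 1 rows of a stripe and the p disks covered by diagonal parity
# (the data disks plus the row parity disk).
def diagonal(row: int, disk: int) -> int:
    return (row + disk) % p

for d in range(p):
    hit = {disk for row in range(p - 1) for disk in range(p)
           if diagonal(row, disk) == d}
    print(f"diagonal {d} misses disk(s) {sorted(set(range(p)) - hit)}")
# Every diagonal misses exactly one, distinct, disk; that is why two
# diagonals always survive a double failure and recovery can begin.
```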

Page 52

Summary: RAID techniques: the goal was performance; popularity is due to the reliability of storage

• Disk Mirroring/Shadowing (RAID 1)
  – Each disk is fully duplicated onto its "shadow"
  – Logical write = two physical writes
  – 100% capacity overhead

• Parity Data Bandwidth Array (RAID 3)
  – Parity computed horizontally
  – Logically a single high-data-bandwidth disk

• High I/O Rate Parity Array (RAID 5)
  – Interleaved parity blocks
  – Independent reads and writes
  – Logical write = 2 reads + 2 writes