TRANSCRIPT
© 2004 IBM Corporation
Storage Challenges for Petascale Systems
Dilip D. Kandlur
Director, Storage Systems Research
IBM Research Division
Outline
Storage technology trends
Implications for high performance computing
Achieving petascale storage performance
Manageability of petascale systems
Organizing and finding information
Extreme Scaling
There have been recent inflection points in the CAGR of processing and storage – in the wrong direction!
Programs like HPCS are aimed at maintaining throughput at or above the CAGR of Moore’s Law in spite of these technology trends
[Chart: Disk areal density trend 2000-2010 (Gb/sq. in., log scale) – CAGR has slowed from ~100% to 25-35%]
[Chart: CPU frequency (GHz) vs. initial ship date, 2000-2007, showing Pentium 4 (180 nm), Pentium 4 (130 nm), and Prescott (90 nm) – the 2002 roadmap projected ~35% yr/yr; by 2003 this had fallen to 10-15% yr/yr]
[Chart: Maximum internal bandwidth (MB/s, log scale), 1998-2010]
Peta-scale systems: DARPA HPCS, NSF Track 1
HPCS goal: “Double value every 18 months” in the face of flattening technology curves
NSF Track 1 goal: at least a sustained petaflop for actual science applications
New technologies like multi-core will keep processing power on the rise but will make storage relatively more expensive
Maintaining “balanced system” scaling constants for storage will be expensive
– Storage bandwidth: 0.001 byte/second/flop; capacity: 20 bytes/flop
– Cost per drive will be the same order of magnitude, so proportionally the same amount of storage will be a higher fraction of total system cost
How to make reliable a system with 10x today’s number of moving parts?
System                   Year   Nodes    Cores     TF     GB/s   Storage     Disks
Blue P                   1998   1464     5856      –      –      43 TB       5040
White                    2000   512      8192      12.3   9.3    147 TB      8064
Purple/C                 2005   1536     12288     100    122    2000 TB     11000
NSF Track 1 (possible)   2011   10000    300000    –      –      40000 TB    50000
HPCS Storage
300,000 processors; 150,000 disk drives
Robust
– Fix 3 or more concurrent errors
– Detect “undetected” errors
– Only minor slowing during disk rebuild
– Detect and manage slow disks
Manageable
– Unified manager for files, storage
– End-to-end discovery, metrics, events
– Managing system changes, problem fixes
– GUI scaled to large clusters
Fast
– 5 TB/sec sequential bandwidth
– 30,000 file creates/sec on one node
– Capable of running fsck on 1 trillion files
[Chart: CPU performance, file system capacity, number of disk drives, and file system throughput, 1995-2015 (log scale). Marked points: 5,000 drives / 4 TF / 3.6 GB/s; 11,000 drives / 100 TF / 120 GB/s; 165,000 drives / 6 PF / 6 TB/s]
GPFS Parallel File System
Cluster: thousands of nodes, fast reliable communication, common admin domain.
Shared disk: all data and metadata on disk accessible from any node, coordinated by distributed lock service.
Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks.
[Diagram: two GPFS configurations – (1) GPFS file system nodes with a control IP network and a disk FC network; (2) GPFS file system nodes connected over a data/control IP network to GPFS disk server nodes (VSD on AIX, NSD on Linux – RPC interface to raw disks)]
Scaling GPFS
HPCS file system performance and scaling targets
– “Balanced system” DOE metrics (0.001 B/s/F, 20 B/F)
• This means 2-6 TB/s throughput, 40-120 PB storage!! (worked out below)
– Other performance goals
• 30 GB/s single node to single file for data ingest
• 30K file opens per second on a single node
• 1 trillion files in a single file system
• Scaling to 32K nodes (OS images)
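As a quick check on those numbers, applying the DOE balance ratios to a 2-6 petaflop machine (the petaflop range is inferred from the stated targets rather than quoted from the slide) gives exactly the figures above:

\[
\begin{aligned}
\text{Bandwidth} &= 0.001\ \tfrac{\text{B/s}}{\text{flop/s}} \times (2\ \text{to}\ 6)\times 10^{15}\ \text{flop/s} = 2\ \text{to}\ 6\ \text{TB/s},\\
\text{Capacity} &= 20\ \tfrac{\text{B}}{\text{flop/s}} \times (2\ \text{to}\ 6)\times 10^{15}\ \text{flop/s} = 40\ \text{to}\ 120\ \text{PB}.
\end{aligned}
\]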
Extreme Scaling: Metadata
Metadata: the on-disk data structures that represent hierarchical directories, storage allocation maps, …
Why is it a problem? Structural integrity requires proper synchronization. Performance is sensitive to the latency of these (small) I/O’s.
Techniques for scaling metadata
– Scaling synchronization (distributing the lock manager)
– Segregating metadata from data to reduce queuing delays
• Separate disks
• Separate fabric ports
– Different RAID levels for metadata to reduce latency, or solid-state memory
– Adaptive metadata management (centralized vs. distributed)
– GPFS provides for all these to some degree; work always ongoing
– Sensible application design can make a big difference!
Data Loss in Petascale Systems
Petaflop systems require tens to hundreds of petabytes of storage
Evidence exists that manufacturer MTBF specs may be optimistic (Schroeder & Gibson)
Evidence exists that failure statistics may not be as favorable as simple exponential distribution
Hard error rate of 1 in 10^15 means one rebuild in 30 will get an error
– Rebuild of an 8+P array of 500 GB drives reads 4 TB, or 3.2×10^13 bits (see the arithmetic below)
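The one-rebuild-in-30 figure follows directly from that bit count (a back-of-the-envelope sketch assuming independent, uniformly distributed hard errors):

\[
8 \times 500\ \text{GB} = 4\ \text{TB} \approx 3.2\times 10^{13}\ \text{bits},
\qquad
P(\text{hard error during rebuild}) \approx \frac{3.2\times 10^{13}}{10^{15}} = 3.2\% \approx \tfrac{1}{30}.
\]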
RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
– Simulations of file system size, drive MTBF, and failure probability distribution show a 4%-28% chance of data loss over a five-year lifetime for an 8+2P code
– Stronger RAID (8+3P) increases MTTDL by 3-4 orders of magnitude for an extra 10% overhead (see the approximation below)
– Stronger RAID is sufficiently reliable even for unreliable (commodity) disk drives
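One way to see where those 3-4 orders of magnitude come from is the standard Markov-model approximation for the MTTDL of a group of G drives with mean time to failure MTTF and rebuild time MTTR. This is only a rough sketch: it ignores hard errors and correlated failures, and the 10-hour rebuild time is an assumed illustrative value, not a figure from the slide. Adding a third parity strip multiplies MTTDL by roughly MTTF/((G-3) MTTR):

\[
\frac{\text{MTTDL}_{8+3P}}{\text{MTTDL}_{8+2P}}
\approx
\frac{\text{MTTF}^{4} / \big(G(G-1)(G-2)(G-3)\,\text{MTTR}^{3}\big)}
     {\text{MTTF}^{3} / \big(G(G-1)(G-2)\,\text{MTTR}^{2}\big)}
= \frac{\text{MTTF}}{(G-3)\,\text{MTTR}}
\approx \frac{600{,}000\ \text{h}}{8 \times 10\ \text{h}}
\approx 7.5\times 10^{3}
\]

with G ≈ 11 for an 8+3P group and a 600K-hour drive MTTF.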
[Chart: MTTDL in years (log scale, 1 to 10,000,000) for a 20 PB system, by configuration – 8+2P and 8+3P codes, 300K-hour and 600K-hour drive MTBF, exponential and Weibull failure distributions. The 8+2P configurations correspond to roughly 4%, 16%, and 28% chances of data loss.]
GPFS Software RAID
Implement software RAID in the GPFS NSD server
Motivations
– Better fault tolerance
– Reduce the performance impact of rebuilds and slow disks
– Eliminate costly external RAID controllers and storage fabric
– Use the processing cycles now being wasted in the storage node
– Improve performance by file-system-aware caching
Approach
– Storage node (NSD server) manages disks as JBOD
– Use stronger RAID codes as appropriate (e.g. triple parity for data and multi-way mirroring for metadata)
– Always check parity on read
• Increases reliability and prevents performance degradation from slow drives
– Checksum everything!
– Declustered RAID for better load balancing and non-disruptive rebuild
Declustered RAID
[Diagram: 16 logical tracks mapped onto 20 physical disks – partitioned RAID (each track confined to one small group of disks) vs. declustered RAID (each track’s strips spread across all 20 disks)]
Rebuild Work Distribution
[Diagram: relative read and write throughput per disk during rebuild of a failed disk]
Rebuild (2)
Upon the first failure, begin rebuilding the tracks that are affected by the failure (large arrows).
Many disks are involved in performing the rebuild, so the work is balanced, avoiding hot spots (see the sketch below).
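A small simulation makes the load-balancing point concrete. This is an illustrative sketch only, not GPFS code: strip placement is random rather than GPFS’s actual declustering scheme, and the disk, strip, and track counts are made-up values.

import random
from collections import Counter

N_DISKS = 20     # physical disks (as in the earlier picture)
STRIPS = 4       # strips per logical track, e.g. 3 data + 1 parity
N_TRACKS = 4000  # logical tracks to place

def partitioned_layout():
    """Partitioned RAID: every track lives entirely inside one fixed 4-disk group."""
    groups = [list(range(g, g + STRIPS)) for g in range(0, N_DISKS, STRIPS)]
    return [random.choice(groups) for _ in range(N_TRACKS)]

def declustered_layout():
    """Declustered RAID: each track's strips land on a random 4 of the 20 disks."""
    return [random.sample(range(N_DISKS), STRIPS) for _ in range(N_TRACKS)]

def rebuild_reads(layout, failed_disk=0):
    """Count the strips each surviving disk must read to rebuild the failed disk."""
    reads = Counter()
    for disks in layout:
        if failed_disk in disks:
            for d in disks:
                if d != failed_disk:
                    reads[d] += 1
    return reads

for name, layout in (("partitioned", partitioned_layout()),
                     ("declustered", declustered_layout())):
    reads = rebuild_reads(layout)
    print(f"{name}: {len(reads)} disks share the rebuild, "
          f"busiest disk reads {max(reads.values())} strips")

With the partitioned layout only the three other members of the failed disk’s group do rebuild reads, so each of them is saturated; with the declustered layout all 19 survivors share the work and each does only a small fraction of what a partitioned group member would, which is why foreground I/O sees only minor slowing.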
Declustered vs. Partitioned RAID
[Chart: simulation results – data losses per year per 100 PB (log scale, 1E-3 to 1E+5) vs. failure tolerance (1, 2, or 3 concurrent failures), partitioned vs. distributed (declustered) RAID]
IBM TotalStorage Productivity Center Standard Edition
Autonomic Storage Management – Making Complex Tasks Simple
A single application with modular components: Fabric, Disk, Data
Ease of Use
– Streamlined Installation and Packaging
– Single User Interface
– Single Database
– Single Set of Services for consistent administration and operations
Policy-Based Storage Management / SAN Best Practices
– SAN Configuration Validation
– Storage Subsystem Planning
– Fabric Security Planning
– Host Planning (Multi-path)
Console Enhancements
– End-to-End Data Path Explorer
– Integrated Storage Planner
– Configuration Change Rover
– Configuration Checker
– Personalization
– TSM Integration
Business Resiliency
– Integrated Replication Manager
– Metro Disaster Recovery
– Global Disaster Recovery
– Cascaded Disaster Recovery
– Application Disaster Recovery
Integrated Management
[Diagram: integrated management stack – discovery, monitoring, reporting, and configuration over a systems knowledge DB, spanning servers (hardware, operating systems, virtualization software, middleware, applications), storage, file system, and network, with best practices, orchestration, deployment, analytics, and an integrated Web 2.0 GUI]
Seamlessly integrate systems management across servers, storage and network & provide end-to-end problem determination and analytics capabilities
PERCS Management
[Diagram: management architecture – the management server and PERCS GUI use a CIM client to reach a CIMOM; the CIMOM uses a CIM model and CIM repository and, through a CIM provider, retrieves data from the GPFS file system, PERCS storage, or a simulator; systems information is kept in a systems DB]
• A unified and standards-based management for GPFS and PERCS storage
• A GUI designed for large-scale clusters, supporting PERCS scale and GPFS
• The PERCS UI will support:
• Information collection: asset tracking, end-to-end discovery, metrics, events
• Management: system changes, problem fixes, configuration changes
• Rich visualizations to help administrators maintain situational awareness of system status
• Essential for large systems; also enables GPFS to satisfy commercial customers requiring ease of use
Analytics
Problem Determination and Impact Analysis
– Root cause analysis: discover the finest-grain events that indicate the root cause of the problem
– Symptom suppression: correlate alarms/symptoms caused by a common cause across the integrated infrastructure
Bottleneck Analysis
– Post-mortem, live, and predictive analysis
Workload and Virtualization Management
– Automatically monitor multi-tiered, distributed, heterogeneous or homogeneous workloads
– Migrate virtual machines to satisfy performance goals
Integrated Server, Storage and Network Allocation and Migration
– Integrated allocation accounting for connectivity, affinity, flows, and ports based on performance workloads
Disaster Management
– Provides integrated server/storage disaster recovery support
Visualization
Integrated management is centered around Topology Viewer capabilities based on Web 2.0 technologies
– Data Path Viewer for Applications, Servers, Networks and Storage
– Progressive Information Disclosure
– Semantic Zooming
– Information Overlays
– Mixed Graphical and Tabular Views
– Integrated Historical and Real Time Reporting
The Changing Nature of Archive
Current archive: data landfill
– Store and forget
– Not easily accessible; typically offline and offsite, with access time measured in days
– Not organized for usage; retained just in case needed
Emerging archive: leverage information for business advantage
– Readily accessible, access time measured in seconds
– Indexed for effective discovery
– Mined for business value
Building Storage Systems Targeted at Archive
Scalability
– Scale to huge capacity
– Exploit tiered storage with disk and tape
– Leverage commodity disk storage
– Handle extremely large numbers of objects
– Support high ingest rates
– Effect data management actions in a scalable fashion
Reliability
– Ensure data integrity and protection
– Provide media management and rejuvenation
– Support long-term retention
Functionality
– Consistently handle multiple kinds of objects
– Manage and retrieve based on data semantics, e.g. logical groupings of objects
– Support effective search and discovery
– Provide for compliance with regulations
GPFS Information Lifecycle Management (ILM)
GPFS ILM abstractions
– Storage pool – group of LUNs
– Fileset – subtree of a file system namespace
– Policy – rule for file placement, retention, or movement among pools
ILM scenarios
– Tiered storage – fast storage for frequently used files, slower storage for infrequently used files
– Project storage – separate pools for each project, each with separate policies, quotas, etc.
– Differentiated storage – e.g. place media files on media-friendly storage (QoS)
[Diagram: GPFS clients run applications (Posix) and the GPFS placement policy, speaking the GPFS RPC protocol over a storage network to a GPFS file system (volume group) with a system pool and data pools (gold, silver, pewter); a GPFS manager node runs the cluster, lock, quota, allocation, and policy managers]
GPFS 3.1 ILM Policies
Placement policies, evaluated at file creation
Migration policies, evaluated periodically
Deletion policies, evaluated periodically
(Illustrative examples of all three appear below.)
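For illustration, rules of each kind might look like the following in GPFS’s SQL-like policy language (a sketch only – the pool names, fileset name, and thresholds are invented, and exact syntax should be checked against the GPFS 3.1 documentation):

/* Placement, evaluated at file creation: proj1 files go to the gold pool, everything else to silver */
RULE 'proj1' SET POOL 'gold' FOR FILESET ('proj1')
RULE 'default' SET POOL 'silver'

/* Migration, evaluated periodically: when gold passes 90% full, move the least recently accessed files to silver until it drops to 70% */
RULE 'cooldown' MIGRATE FROM POOL 'gold' THRESHOLD (90,70) WEIGHT (CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'silver'

/* Deletion, evaluated periodically: remove pewter-pool files untouched for a year */
RULE 'expire' DELETE FROM POOL 'pewter' WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 365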
GPFS Policy Engine
Migrate and delete rules scan the file system to identify candidate files
– Conventional backup and HSM systems also do this
• Usually implemented with readdir() and stat()
• This is slow – random small-record reads, distributed locking
• Can take hours or days for a large file system
GPFS Policy Engine uses an efficient sort-merge rather than slow readdir()/stat() (sketched below)
– Directory walk builds a list of path names (readdir() but no stat()!)
– List sorted by inode number, merged with the inode file, then evaluated
– Both list building and policy evaluation done in parallel on all nodes
– … > 10^5 files/sec per node!
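A toy sketch of the idea in Python, not GPFS code: GPFS merges the sorted list against its on-disk inode file in parallel across all nodes, whereas here an inode-ordered os.stat() merely stands in for that sequential attribute sweep, and the path and rule in the usage lines are invented.

import os
import time

def collect_entries(root):
    """Directory walk: record (inode, path) straight from the dirents -- no per-file stat()."""
    entries = []
    for dirpath, _dirs, _files in os.walk(root):
        with os.scandir(dirpath) as it:
            for entry in it:
                if entry.is_file(follow_symlinks=False):
                    entries.append((entry.inode(), entry.path))
    return entries

def candidate_files(root, rule):
    entries = collect_entries(root)
    # Sorting by inode number turns the attribute lookups below into a mostly
    # sequential sweep of the inode table instead of random small reads.
    entries.sort()
    for _ino, path in entries:
        attrs = os.stat(path, follow_symlinks=False)  # stand-in for the inode-file merge
        if rule(attrs):
            yield path

# Example: list files not accessed for more than a year
year = 365 * 24 * 3600
old_files = list(candidate_files("/gpfs/scratch",
                                 lambda a: time.time() - a.st_atime > year))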
Storage Hierarchies – the old way
Normally implemented one of two ways:
Explicit control
– archive command (IBM TSM, Unitree)
– copy into special “archive” file system (IBM HPSS)
– copy to archive server (HPSS, Unitree)
– … all of which are troublesome and error-prone for the user
Implicit control through an interface like DMAPI
– File system sends “events” to the HSM system (create/delete, low space)
– Archive system moves data and punches “holes” in files to manage space
– Access miss generates an event; the HSM system transparently brings the file back
[Diagram: HPSS 6.2 API architecture – client cluster computers in the client domain use the HPSS API over an IP network to reach the HPSS cluster computers (core server and movers), which manage the HPSS data store (SAN disk, metadata disks, tape libraries) over the HPSS FC SAN. 1. The client issues an HPSS Write or Put to the HPSS Core Server. 2. The client transfers the file to HPSS disk or tape over a TCP/IP LAN or WAN using an HPSS Mover.]
[Diagram: GPFS 3.1 and HPSS 6.2 DMAPI architecture – a GPFS cluster (GPFS I/O nodes with an HPSS interface, a GPFS session node running HPSS HSM processes and DB2, GPFS disk arrays) and an HPSS cluster (HPSS movers, HPSS core server with DB2, HPSS disk arrays, tape libraries). HSM control information flows over the IP LAN; data transfers use the mover-less SAN or the IP LAN; tape-disk transfers happen inside HPSS.]
DMAPI Problems
Namespace events (create, delete, rename)
– Synchronous and recoverable
– Each is multiple database transactions
– Slow down the file system
Directory scans
– DMAPI low-space events trigger directory scans to determine what to archive
• can take hours or days on a large FS
– Scans have little information upon which to make archiving decisions (what you get from “ls -l”)
• As a result, data movement policies are usually hard-coded and primitive
Read/write managed region
– Blocks the user program while data is brought back from the HSM system
– Parallel data movement isn’t in the spec, but everyone implements it anyway
– Data movement is actually the one thing about DMAPI worth saving
GPFS Approach: “External Pools”
External pools are really interfaces to external storage managers, e.g. HPSS or TSM
– An external pool “rule” defines the script to call to migrate/recall/etc. files
RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [ OPTS 'options' ]
GPFS policy engine builds candidate lists and passes them to external pool scripts
External storage manager actually moves the data
– Using DMAPI managed regions (read/write invisible, punch hole)
– Or using conventional Posix APIs (see the illustrative rules below)
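For illustration, the external-pool hook might be combined with a migration rule like this (a sketch: the script path, server option, pool names, and thresholds are invented; only the RULE EXTERNAL POOL form above is quoted from the slide):

RULE EXTERNAL POOL 'hsm' EXEC '/var/mmfs/etc/hsmControl' OPTS '-server hpss01'

/* Move cold files out to the external (HPSS/TSM) pool when silver passes 85% full */
RULE 'archive-cold' MIGRATE FROM POOL 'silver' THRESHOLD (85,70) WEIGHT (CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'hsm' WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 90

When such a rule fires, the GPFS policy engine builds the candidate file list and invokes the interface script, which drives the external storage manager to do the actual data movement.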
GPFS ILM Demonstration
SC’06, Tampa, FL: GPFS with 1M active files on FC and SATA disks
NERSC, Oakland, CA: HPSS archive on tapes with disk buffering, connected via a 10 Gb link
High-bandwidth, parallel data movement across all devices and networks
Nearline Information – conceptual view
[Diagram: a scale-out archiving engine (GPFS cluster) fronted by NFS/CIFS clients and the TSM archive client/API, with a global index and search capability and an admin/search interface; migration to TSM deep storage goes via DMAPI and the TSM archive client]
• Provides capability to handle extended meta-data
• Meta-data may be derived from data content
• Extended attributes (integrity code, retention period, retention hold status, and any application meta-data)
• Global index on content and EA meta-data
• Allows for application-specific parsers (e.g., DICOM)
Summary
Storage environments are moving from petabytes to exabytes
– Traditional HPC
– New archive environments
Significant challenges for reliability, resiliency, and manageability
Meta-data becomes key for information organization and discovery