TRANSCRIPT
© 2004 IBM Corporation
Storage Challenges for Petascale Systems
Dilip D. Kandlur
Director, Storage Systems Research
IBM Research Division
Outline
Storage technology trends
Implications for high performance computing
Achieving petascale storage performance
Manageability of petascale systems
Organizing and finding information
Extreme Scaling
There have been recent inflection points in the CAGR of processing and storage – in the wrong direction!
Programs like HPCS are aimed at maintaining throughput at or above the CAGR of Moore’s Law in spite of these technology trends
[Chart: Disk areal density trend 2000-2010 (Gb/sq. in., log scale) – CAGR has slowed from ~100% to 25-35%]
[Chart: CPU frequency (GHz) vs. initial ship date, 2000-2007, showing Pentium 4 (180 nm), Pentium 4 (130 nm), and Prescott (90 nm) – the 2002 roadmap projected ~35% yr/yr; by 2003 this had fallen to 10-15% yr/yr]
[Chart: Maximum internal bandwidth (MB/s, log scale), 1998-2010]
Peta-scale systems: DARPA HPCS, NSF Track 1
HPCS goal: “Double value every 18 months” in the face of flattening technology curves
NSF Track 1 goal: at least a sustained petaflop for actual science applications
New technologies like multi-core will keep processing power on the rise but will make storage relatively more expensive
Maintaining “balanced system” scaling constants for storage will be expensive
– Storage bandwidth: 0.001 byte/second/flop; capacity: 20 bytes/flop
– Cost per drive will be the same order of magnitude, so proportionally the same amount of storage will be a higher fraction of total system cost
How to make reliable a system with 10x today’s number of moving parts?
System                   Year   Nodes    Cores     TF     GB/s   Storage     Disks
Blue P                   1998   1464     5856      –      –      43 TB       5040
White                    2000   512      8192      12.3   9.3    147 TB      8064
Purple/C                 2005   1536     12288     100    122    2000 TB     11000
NSF Track 1 (possible)   2011   10000    300000    –      –      40000 TB    50000
HPCS Storage
300,000 processors; 150,000 disk drives
Robust
– Fix 3 or more concurrent errors
– Detect “undetected” errors
– Only minor slowing during disk rebuild
– Detect and manage slow disks
Manageable
– Unified manager for files, storage
– End-to-end discovery, metrics, events
– Managing system changes, problem fixes
– GUI scaled to large clusters
Fast
– 5 TB/sec sequential bandwidth
– 30,000 file creates/sec on one node
– Capable of running fsck on 1 trillion files
[Chart: CPU performance, file system capacity, number of disk drives, and file system throughput, 1995-2015 (log scale). Marked points: 5,000 drives / 4 TF / 3.6 GB/s; 11,000 drives / 100 TF / 120 GB/s; 165,000 drives / 6 PF / 6 TB/s]
GPFS Parallel File System
Cluster: thousands of nodes, fast reliable communication, common admin domain.
Shared disk: all data and metadata on disk accessible from any node, coordinated by distributed lock service.
Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks.
[Diagram: two GPFS configurations – (1) GPFS file system nodes with a control IP network and a disk FC network; (2) GPFS file system nodes connected over a data/control IP network to GPFS disk server nodes (VSD on AIX, NSD on Linux – RPC interface to raw disks)]
Scaling GPFS
HPCS file system performance and scaling targets
– “Balanced system” DOE metrics (0.001 B/s/F, 20 B/F)
• This means 2-6 TB/s throughput, 40-120 PB storage!! (worked out below)
– Other performance goals
• 30 GB/s single node to single file for data ingest
• 30K file opens per second on a single node
• 1 trillion files in a single file system
• Scaling to 32K nodes (OS images)
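As a quick check on those numbers, applying the DOE balance ratios to a 2-6 petaflop machine (the petaflop range is inferred from the stated targets rather than quoted from the slide) gives exactly the figures above:

\[
\begin{aligned}
\text{Bandwidth} &= 0.001\ \tfrac{\text{B/s}}{\text{flop/s}} \times (2\ \text{to}\ 6)\times 10^{15}\ \text{flop/s} = 2\ \text{to}\ 6\ \text{TB/s},\\
\text{Capacity} &= 20\ \tfrac{\text{B}}{\text{flop/s}} \times (2\ \text{to}\ 6)\times 10^{15}\ \text{flop/s} = 40\ \text{to}\ 120\ \text{PB}.
\end{aligned}
\]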
Extreme Scaling: Metadata
Metadata: the on-disk data structures that represent hierarchical directories, storage allocation maps, …
Why is it a problem? Structural integrity requires proper synchronization. Performance is sensitive to the latency of these (small) I/O’s.
Techniques for scaling metadata
– Scaling synchronization (distributing the lock manager)
– Segregating metadata from data to reduce queuing delays
• Separate disks
• Separate fabric ports
– Different RAID levels for metadata to reduce latency, or solid-state memory
– Adaptive metadata management (centralized vs. distributed)
– GPFS provides for all these to some degree; work always ongoing
– Sensible application design can make a big difference!
Data Loss in Petascale Systems
Petaflop systems require tens to hundreds of petabytes of storage
Evidence exists that manufacturer MTBF specs may be optimistic (Schroeder & Gibson)
Evidence exists that failure statistics may not be as favorable as simple exponential distribution
Hard error rate of 1 in 10^15 means one rebuild in 30 will get an error
– Rebuild of an 8+P array of 500 GB drives reads 4 TB, or 3.2×10^13 bits (see the arithmetic below)
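The one-rebuild-in-30 figure follows directly from that bit count (a back-of-the-envelope sketch assuming independent, uniformly distributed hard errors):

\[
8 \times 500\ \text{GB} = 4\ \text{TB} \approx 3.2\times 10^{13}\ \text{bits},
\qquad
P(\text{hard error during rebuild}) \approx \frac{3.2\times 10^{13}}{10^{15}} = 3.2\% \approx \tfrac{1}{30}.
\]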
RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
– Simulations of file system size, drive MTBF, and failure probability distribution show a 4%-28% chance of data loss over a five-year lifetime for an 8+2P code
– Stronger RAID (8+3P) increases MTTDL by 3-4 orders of magnitude for an extra 10% overhead (see the approximation below)
– Stronger RAID is sufficiently reliable even for unreliable (commodity) disk drives
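One way to see where those 3-4 orders of magnitude come from is the standard Markov-model approximation for the MTTDL of a group of G drives with mean time to failure MTTF and rebuild time MTTR. This is only a rough sketch: it ignores hard errors and correlated failures, and the 10-hour rebuild time is an assumed illustrative value, not a figure from the slide. Adding a third parity strip multiplies MTTDL by roughly MTTF/((G-3) MTTR):

\[
\frac{\text{MTTDL}_{8+3P}}{\text{MTTDL}_{8+2P}}
\approx
\frac{\text{MTTF}^{4} / \big(G(G-1)(G-2)(G-3)\,\text{MTTR}^{3}\big)}
     {\text{MTTF}^{3} / \big(G(G-1)(G-2)\,\text{MTTR}^{2}\big)}
= \frac{\text{MTTF}}{(G-3)\,\text{MTTR}}
\approx \frac{600{,}000\ \text{h}}{8 \times 10\ \text{h}}
\approx 7.5\times 10^{3}
\]

with G ≈ 11 for an 8+3P group and a 600K-hour drive MTTF.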
[Chart: MTTDL in years (log scale, 1 to 10,000,000) for a 20 PB system, by configuration – 8+2P and 8+3P codes, 300K-hour and 600K-hour drive MTBF, exponential and Weibull failure distributions. The 8+2P configurations correspond to roughly 4%, 16%, and 28% chances of data loss.]
GPFS Software RAID
Implement software RAID in the GPFS NSD server
Motivations
– Better fault tolerance
– Reduce the performance impact of rebuilds and slow disks
– Eliminate costly external RAID controllers and storage fabric
– Use the processing cycles now being wasted in the storage node
– Improve performance by file-system-aware caching
Approach
– Storage node (NSD server) manages disks as JBOD
– Use stronger RAID codes as appropriate (e.g. triple parity for data and multi-way mirroring for metadata)
– Always check parity on read
• Increases reliability and prevents performance degradation from slow drives
– Checksum everything!
– Declustered RAID for better load balancing and non-disruptive rebuild
Declustered RAID
[Diagram: 16 logical tracks mapped onto 20 physical disks – partitioned RAID (each track confined to one small group of disks) vs. declustered RAID (each track’s strips spread across all 20 disks)]
Rebuild Work Distribution
[Diagram: relative read and write throughput per disk during rebuild of a failed disk]
Rebuild (2)
Upon the first failure, begin rebuilding the tracks that are affected by the failure (large arrows).
Many disks are involved in performing the rebuild, so the work is balanced, avoiding hot spots (see the sketch below).
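A small simulation makes the load-balancing point concrete. This is an illustrative sketch only, not GPFS code: strip placement is random rather than GPFS’s actual declustering scheme, and the disk, strip, and track counts are made-up values.

import random
from collections import Counter

N_DISKS = 20     # physical disks (as in the earlier picture)
STRIPS = 4       # strips per logical track, e.g. 3 data + 1 parity
N_TRACKS = 4000  # logical tracks to place

def partitioned_layout():
    """Partitioned RAID: every track lives entirely inside one fixed 4-disk group."""
    groups = [list(range(g, g + STRIPS)) for g in range(0, N_DISKS, STRIPS)]
    return [random.choice(groups) for _ in range(N_TRACKS)]

def declustered_layout():
    """Declustered RAID: each track's strips land on a random 4 of the 20 disks."""
    return [random.sample(range(N_DISKS), STRIPS) for _ in range(N_TRACKS)]

def rebuild_reads(layout, failed_disk=0):
    """Count the strips each surviving disk must read to rebuild the failed disk."""
    reads = Counter()
    for disks in layout:
        if failed_disk in disks:
            for d in disks:
                if d != failed_disk:
                    reads[d] += 1
    return reads

for name, layout in (("partitioned", partitioned_layout()),
                     ("declustered", declustered_layout())):
    reads = rebuild_reads(layout)
    print(f"{name}: {len(reads)} disks share the rebuild, "
          f"busiest disk reads {max(reads.values())} strips")

With the partitioned layout only the three other members of the failed disk’s group do rebuild reads, so each of them is saturated; with the declustered layout all 19 survivors share the work and each does only a small fraction of what a partitioned group member would, which is why foreground I/O sees only minor slowing.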
Declustered vs. Partitioned RAID
[Chart: simulation results – data losses per year per 100 PB (log scale, 1E-3 to 1E+5) vs. failure tolerance (1, 2, or 3 concurrent failures), partitioned vs. distributed (declustered) RAID]
IBM TotalStorage Productivity Center Standard Edition
Autonomic Storage Management – Making Complex Tasks Simple
A single application with modular components: Fabric, Disk, Data
Ease of Use
– Streamlined Installation and Packaging
– Single User Interface
– Single Database
– Single Set of Services for consistent administration and operations
Policy-Based Storage Management / SAN Best Practices
– SAN Configuration Validation
– Storage Subsystem Planning
– Fabric Security Planning
– Host Planning (Multi-path)
Console Enhancements
– End-to-End Data Path Explorer
– Integrated Storage Planner
– Configuration Change Rover
– Configuration Checker
– Personalization
– TSM Integration
Business Resiliency
– Integrated Replication Manager
– Metro Disaster Recovery
– Global Disaster Recovery
– Cascaded Disaster Recovery
– Application Disaster Recovery
Integrated Management
[Diagram: integrated management stack – discovery, monitoring, reporting, and configuration over a systems knowledge DB, spanning servers (hardware, operating systems, virtualization software, middleware, applications), storage, file system, and network, with best practices, orchestration, deployment, analytics, and an integrated Web 2.0 GUI]
Seamlessly integrate systems management across servers, storage and network & provide end-to-end problem determination and analytics capabilities
PERCS Management
[Diagram: management architecture – the management server and PERCS GUI use a CIM client to reach a CIMOM; the CIMOM uses a CIM model and CIM repository and, through a CIM provider, retrieves data from the GPFS file system, PERCS storage, or a simulator; systems information is kept in a systems DB]
• A unified and standards-based management for GPFS and PERCS storage
• A GUI designed for large-scale clusters, supporting PERCS scale and GPFS
• The PERCS UI will support:
• Information collection: asset tracking, end-to-end discovery, metrics, events
• Management: system changes, problem fixes, configuration changes
• Rich visualizations to help administrators maintain situational awareness of system status
• Essential for large systems; also enables GPFS to satisfy commercial customers requiring ease of use
Analytics
Problem Determination and Impact Analysis
– Root cause analysis: discover the finest-grain events that indicate the root cause of the problem
– Symptom suppression: correlate alarms/symptoms caused by a common cause across the integrated infrastructure
Bottleneck Analysis
– Post-mortem, live, and predictive analysis
Workload and Virtualization Management
– Automatically monitor multi-tiered, distributed, heterogeneous or homogeneous workloads
– Migrate virtual machines to satisfy performance goals
Integrated Server, Storage and Network Allocation and Migration
– Integrated allocation accounting for connectivity, affinity, flows, and ports based on performance workloads
Disaster Management
– Provides integrated server/storage disaster recovery support
Visualization
Integrated management is centered around Topology Viewer capabilities based on Web 2.0 technologies
– Data Path Viewer for Applications, Servers, Networks and Storage
– Progressive Information Disclosure
– Semantic Zooming
– Information Overlays
– Mixed Graphical and Tabular Views
– Integrated Historical and Real Time Reporting
The Changing Nature of Archive
Current archive: data landfill
– Store and forget
– Not easily accessible; typically offline and offsite, with access time measured in days
– Not organized for usage; retained just in case needed
Emerging archive: leverage information for business advantage
– Readily accessible, access time measured in seconds
– Indexed for effective discovery
– Mined for business value
Building Storage Systems Targeted at Archive
Scalability
– Scale to huge capacity
– Exploit tiered storage with disk and tape
– Leverage commodity disk storage
– Handle extremely large numbers of objects
– Support high ingest rates
– Effect data management actions in a scalable fashion
Reliability
– Ensure data integrity and protection
– Provide media management and rejuvenation
– Support long-term retention
Functionality
– Consistently handle multiple kinds of objects
– Manage and retrieve based on data semantics, e.g. logical groupings of objects
– Support effective search and discovery
– Provide for compliance with regulations
GPFS Information Lifecycle Management (ILM)
GPFS ILM abstractions
– Storage pool – group of LUNs
– Fileset – subtree of a file system namespace
– Policy – rule for file placement, retention, or movement among pools
ILM scenarios
– Tiered storage – fast storage for frequently used files, slower storage for infrequently used files
– Project storage – separate pools for each project, each with separate policies, quotas, etc.
– Differentiated storage – e.g. place media files on media-friendly storage (QoS)
[Diagram: GPFS clients run applications (Posix) and the GPFS placement policy, speaking the GPFS RPC protocol over a storage network to a GPFS file system (volume group) with a system pool and data pools (gold, silver, pewter); a GPFS manager node runs the cluster, lock, quota, allocation, and policy managers]
GPFS 3.1 ILM Policies
Placement policies, evaluated at file creation
Migration policies, evaluated periodically
Deletion policies, evaluated periodically
(Illustrative examples of all three appear below.)
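For illustration, rules of each kind might look like the following in GPFS’s SQL-like policy language (a sketch only – the pool names, fileset name, and thresholds are invented, and exact syntax should be checked against the GPFS 3.1 documentation):

/* Placement, evaluated at file creation: proj1 files go to the gold pool, everything else to silver */
RULE 'proj1' SET POOL 'gold' FOR FILESET ('proj1')
RULE 'default' SET POOL 'silver'

/* Migration, evaluated periodically: when gold passes 90% full, move the least recently accessed files to silver until it drops to 70% */
RULE 'cooldown' MIGRATE FROM POOL 'gold' THRESHOLD (90,70) WEIGHT (CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'silver'

/* Deletion, evaluated periodically: remove pewter-pool files untouched for a year */
RULE 'expire' DELETE FROM POOL 'pewter' WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 365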
GPFS Policy Engine
Migrate and delete rules scan the file system to identify candidate files
– Conventional backup and HSM systems also do this
• Usually implemented with readdir() and stat()
• This is slow – random small-record reads, distributed locking
• Can take hours or days for a large file system
GPFS Policy Engine uses an efficient sort-merge rather than slow readdir()/stat() (sketched below)
– Directory walk builds a list of path names (readdir() but no stat()!)
– List sorted by inode number, merged with the inode file, then evaluated
– Both list building and policy evaluation done in parallel on all nodes
– … > 10^5 files/sec per node!
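A toy sketch of the idea in Python, not GPFS code: GPFS merges the sorted list against its on-disk inode file in parallel across all nodes, whereas here an inode-ordered os.stat() merely stands in for that sequential attribute sweep, and the path and rule in the usage lines are invented.

import os
import time

def collect_entries(root):
    """Directory walk: record (inode, path) straight from the dirents -- no per-file stat()."""
    entries = []
    for dirpath, _dirs, _files in os.walk(root):
        with os.scandir(dirpath) as it:
            for entry in it:
                if entry.is_file(follow_symlinks=False):
                    entries.append((entry.inode(), entry.path))
    return entries

def candidate_files(root, rule):
    entries = collect_entries(root)
    # Sorting by inode number turns the attribute lookups below into a mostly
    # sequential sweep of the inode table instead of random small reads.
    entries.sort()
    for _ino, path in entries:
        attrs = os.stat(path, follow_symlinks=False)  # stand-in for the inode-file merge
        if rule(attrs):
            yield path

# Example: list files not accessed for more than a year
year = 365 * 24 * 3600
old_files = list(candidate_files("/gpfs/scratch",
                                 lambda a: time.time() - a.st_atime > year))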
Storage Hierarchies – the old way
Normally implemented one of two ways:
Explicit control
– archive command (IBM TSM, Unitree)
– copy into special “archive” file system (IBM HPSS)
– copy to archive server (HPSS, Unitree)
– … all of which are troublesome and error-prone for the user
Implicit control through an interface like DMAPI
– File system sends “events” to the HSM system (create/delete, low space)
– Archive system moves data and punches “holes” in files to manage space
– Access miss generates an event; the HSM system transparently brings the file back
[Diagram: HPSS 6.2 API architecture – client cluster computers in the client domain use the HPSS API over an IP network to reach the HPSS cluster computers (core server and movers), which manage the HPSS data store (SAN disk, metadata disks, tape libraries) over the HPSS FC SAN. 1. The client issues an HPSS Write or Put to the HPSS Core Server. 2. The client transfers the file to HPSS disk or tape over a TCP/IP LAN or WAN using an HPSS Mover.]
[Diagram: GPFS 3.1 and HPSS 6.2 DMAPI architecture – a GPFS cluster (GPFS I/O nodes with an HPSS interface, a GPFS session node running HPSS HSM processes and DB2, GPFS disk arrays) and an HPSS cluster (HPSS movers, HPSS core server with DB2, HPSS disk arrays, tape libraries). HSM control information flows over the IP LAN; data transfers use the mover-less SAN or the IP LAN; tape-disk transfers happen inside HPSS.]
DMAPI Problems
Namespace events (create, delete, rename)
– Synchronous and recoverable
– Each is multiple database transactions
– Slow down the file system
Directory scans
– DMAPI low-space events trigger directory scans to determine what to archive
• can take hours or days on a large FS
– Scans have little information upon which to make archiving decisions (what you get from “ls -l”)
• As a result, data movement policies are usually hard-coded and primitive
Read/write managed region
– Blocks the user program while data is brought back from the HSM system
– Parallel data movement isn’t in the spec, but everyone implements it anyway
– Data movement is actually the one thing about DMAPI worth saving
GPFS Approach: “External Pools”
External pools are really interfaces to external storage managers, e.g. HPSS or TSM
– An external pool “rule” defines the script to call to migrate/recall/etc. files
RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [ OPTS 'options' ]
GPFS policy engine builds candidate lists and passes them to external pool scripts
External storage manager actually moves the data
– Using DMAPI managed regions (read/write invisible, punch hole)
– Or using conventional Posix APIs (see the illustrative rules below)
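For illustration, the external-pool hook might be combined with a migration rule like this (a sketch: the script path, server option, pool names, and thresholds are invented; only the RULE EXTERNAL POOL form above is quoted from the slide):

RULE EXTERNAL POOL 'hsm' EXEC '/var/mmfs/etc/hsmControl' OPTS '-server hpss01'

/* Move cold files out to the external (HPSS/TSM) pool when silver passes 85% full */
RULE 'archive-cold' MIGRATE FROM POOL 'silver' THRESHOLD (85,70) WEIGHT (CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'hsm' WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 90

When such a rule fires, the GPFS policy engine builds the candidate file list and invokes the interface script, which drives the external storage manager to do the actual data movement.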
GPFS ILM Demonstration
SC’06, Tampa, FL: GPFS with 1M active files on FC and SATA disks
NERSC, Oakland, CA: HPSS archive on tapes with disk buffering, connected via a 10 Gb link
High-bandwidth, parallel data movement across all devices and networks
Nearline Information – conceptual view
[Diagram: a scale-out archiving engine (GPFS cluster) fronted by NFS/CIFS clients and the TSM archive client/API, with a global index and search capability and an admin/search interface; migration to TSM deep storage goes via DMAPI and the TSM archive client]
• Provides capability to handle extended meta-data
• Meta-data may be derived from data content
• Extended attributes (integrity code, retention period, retention hold status, and any application meta-data)
• Global index on content and EA meta-data
• Allows for application-specific parsers (e.g., DICOM)
Summary
Storage environments are moving from petabytes to exabytes
– Traditional HPC
– New archive environments
Significant challenges for reliability, resiliency, and manageability
Meta-data becomes key for information organization and discovery