Corollary
System Overview
Second Key Idea: Specialization
• Think GoogleFS
http://netsyslab.ece.ubc.ca
Third idea: Enable cross-layer optimizations
Layered Architectures: High benefits, but …
• TCP/IP
• File System
• Benefits, but…
  – … limits information flow across layers.
API
Cross-Layer Optimizations
• Examples
  – IP
  – Storage systems
  – ….
• Applications --> Storage System
  – Performance
  – QoS requirements
  – Consistency requirements
• Applications <-- Storage System
  – Provide storage-level information to applications
Data Intensive Schedulers:
Notification about data movements
Data Intensive Applications:
Co-usage of files
What’s missing? A vehicle to pass information across layers
Traditional Use of Custom Metadata
Application Layer
File System Layer
Storage System Layer
Metadata Manager
File Organization Module
Basic File System
Author=Smith
input.dat
File Browser
POSIX API
HPDC'08
Cross-Layer Communication
Application Layer
File System Layer
Storage System Layer
Metadata Manager
File Organization Module
Basic File System
Replicate input.dat 3x
input.dat moved from node1 to node3
OK. Schedule Task on node3
POSIX API
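The exchange in the figure above can be sketched as follows. This is a minimal in-memory illustration, assuming custom metadata is a per-file key-value map kept by the metadata manager. The class, method, and tag names (`MetadataManager`, `set_tag`, `replication`, `location`) are hypothetical and are not taken from the system itself.

```python
class MetadataManager:
    """Hypothetical in-memory stand-in for the storage system's metadata manager."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.tags = {}      # file name -> {tag: value}
        self.replicas = {}  # file name -> list of nodes holding a copy

    def create(self, name, node):
        self.tags[name] = {}
        self.replicas[name] = [node]
        self._publish_location(name)

    def set_tag(self, name, key, value):
        """Top-down hint: the application annotates a file."""
        self.tags[name][key] = value
        if key == "replication":
            self._replicate(name, int(value))

    def get_tag(self, name, key):
        """Bottom-up information: the application reads what storage exposes."""
        return self.tags[name].get(key)

    def move(self, name, src, dst):
        """Storage-internal data movement, made visible through the location tag."""
        holders = self.replicas[name]
        holders[holders.index(src)] = dst
        self._publish_location(name)

    def _replicate(self, name, copies):
        holders = self.replicas[name]
        for node in self.nodes:
            if len(holders) >= copies:
                break
            if node not in holders:
                holders.append(node)
        self._publish_location(name)

    def _publish_location(self, name):
        self.tags[name]["location"] = ",".join(self.replicas[name])


def schedule(manager, input_file):
    """Scheduler places the task on a node that already holds the input."""
    return manager.get_tag(input_file, "location").split(",")[0]


mgr = MetadataManager(["node1", "node2", "node3"])
mgr.create("input.dat", "node1")
mgr.move("input.dat", "node1", "node3")
print(schedule(mgr, "input.dat"))             # node3
mgr.set_tag("input.dat", "replication", "3")  # "Replicate input.dat 3x"
print(sorted(mgr.replicas["input.dat"]))      # ['node1', 'node2', 'node3']
```

Both directions use the same key-value channel, so the POSIX API itself does not need to change.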
Recap
• Object-based storage
• Enable specialization --> performance
• Enable cross-layer optimization --> generality
One intended use: A Workflow-Aware Storage System
Workflow Example - ModFTDock
• Protein docking application
  – Simulates the creation of a complex protein from two known proteins
• Applications
  – Drug design
  – Protein interaction prediction
Platform Example – Argonne BlueGene/P
160K cores
2.5K IO nodes
Torus Network (3D Torus): 2.5 GBps per node
Tree: 850 MBps per 64 nodes
10 Gb/s Switch Complex
GPFS, 24 I/O servers
IO rate: 8 GBps = 51 KBps / core !!
The central storage is a potential bottleneck
Underused resources
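The per-core figure can be checked with a short calculation. The sketch below assumes the slide's "160K" and "GBps" use binary prefixes, which reproduces the quoted 51 KBps. With decimal prefixes the result is 50 KBps, so the conclusion is the same either way.

```python
# Assumption: binary prefixes (160K = 160 * 1024 cores, 8 GBps = 8 * 1024**3 bytes/s).
cores = 160 * 1024
aggregate_bytes_per_s = 8 * 1024**3  # aggregate IO rate to the central storage

per_core_kbps = aggregate_bytes_per_s / cores / 1024
print(f"{per_core_kbps:.1f} KBps per core")  # 51.2 KBps per core

# Same calculation with decimal prefixes, for comparison.
per_core_kbps_decimal = 8e9 / 160e3 / 1e3
print(f"{per_core_kbps_decimal:.1f} KBps per core")  # 50.0 KBps per core
```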
Background – ModFTDock in Argonne BG/P
Backend file system (e.g., GPFS, NFS)
Scale: 40960 Compute nodes
File based communication
Large IO volume
Workflow Runtime Engine
1.2 M Docking Tasks
IO rate: 8 GBps = 51 KBps / core
App. task, Local storage (repeated for each compute node shown)
Intermediate Storage Approach
Backend file system (e.g., GPFS, NFS)
Workflow Runtime Engine
Scale: 40960 Compute nodes
App. task, Local storage (repeated for each compute node shown)
Intermediate Storage
POSIX API
Stage In
Stage Out
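The stage-in / compute / stage-out pattern above can be sketched with two local directories standing in for the backend file system and the intermediate storage. The function names and the placeholder task are hypothetical. The point is only that intermediate files never touch the backend.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(backend: Path, intermediate: Path, names):
    """Copy workflow inputs from the backend into intermediate storage."""
    for name in names:
        shutil.copy(backend / name, intermediate / name)


def stage_out(intermediate: Path, backend: Path, names):
    """Copy only the final results back to the backend."""
    for name in names:
        shutil.copy(intermediate / name, backend / name)


def run_task(intermediate: Path, src: str, dst: str):
    """Placeholder task: file-based communication through intermediate storage."""
    data = (intermediate / src).read_text()
    (intermediate / dst).write_text(data + "|" + dst)


with tempfile.TemporaryDirectory() as root:
    backend = Path(root, "backend")
    intermediate = Path(root, "intermediate")
    backend.mkdir()
    intermediate.mkdir()
    (backend / "input.dat").write_text("input")

    stage_in(backend, intermediate, ["input.dat"])
    run_task(intermediate, "input.dat", "stage1.out")   # intermediate file
    run_task(intermediate, "stage1.out", "final.out")   # final result
    stage_out(intermediate, backend, ["final.out"])

    backend_files = sorted(p.name for p in backend.iterdir())
    final_text = (backend / "final.out").read_text()

print(backend_files)  # ['final.out', 'input.dat']
print(final_text)     # input|stage1.out|final.out
```

`stage1.out` stays on the intermediate storage, which is how this approach takes IO load off the backend.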
Usage scenario II:
• Support for deduplication
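One common way to support deduplication in a chunk-based store is content addressing: identical chunks hash to the same key and are stored once. The sketch below assumes fixed-size chunks and SHA-256 digests. Both are illustrative choices, not details stated on the slide, and `ChunkStore` is a hypothetical name.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, for illustration only


class ChunkStore:
    """Hypothetical content-addressed store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name, data: bytes):
        digests = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # stored only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name) -> bytes:
        return b"".join(self.chunks[d] for d in self.files[name])


store = ChunkStore()
store.write("v1", b"AAAABBBBCCCC")
store.write("v2", b"AAAABBBBDDDD")  # shares its first two chunks with v1
print(len(store.chunks))            # 4 unique chunks instead of 6
```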
Stakeholders
• The final clients
  – Financing agencies ($)
    • DoE
    • NSERC
  – Science teams
• Development team
  – Graduate students (6+)
  – Undergraduate students, visitors (10+)
• Me
Requirements
1. Easy to deploy
2. Easy to integrate with applications
3. Versatility and ability to configure
4. Efficiency / high-performance / scalability
5. Ability to support versioning and partially similar data.
All have big architectural implications
Early architectural decisions
1.) Object-based storage - system structure
2.) Network/protocol stack: uniform
- Stateless to the degree possible
Application
Chunk_4 info
Chunk_3 info
Chunk_2 info
Chunk_1 info
System Access Interface - 1
Donor node - 1
Ext-3 file system
Donor node - 1
Ext-3 file system
Manager
Root
/project/file_1
Control messages
Data messages
Metadata messages
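The components in the figure above can be sketched as three small classes: a manager that holds only metadata (the chunk map per file), donor nodes that hold chunk data, and a system access interface that the application talks to. This is a minimal in-memory illustration. The method names, the chunk identifiers, and the round-robin placement are assumptions made for the example, not the system's actual protocol.

```python
class DonorNode:
    """Stores chunk data (in the figure, on top of a local Ext-3 file system)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk id -> bytes

    def put(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def get(self, chunk_id):
        return self.chunks[chunk_id]


class Manager:
    """Holds metadata only: the chunk map of each file."""

    def __init__(self):
        self.chunk_maps = {}  # path -> list of (chunk id, donor name)

    def commit(self, path, chunk_map):
        self.chunk_maps[path] = chunk_map

    def lookup(self, path):
        return self.chunk_maps[path]


class SystemAccessInterface:
    """Client side: splits files into chunks and talks to manager and donors."""

    def __init__(self, manager, donors, chunk_size=4):
        self.manager = manager
        self.donors = {d.name: d for d in donors}
        self.order = [d.name for d in donors]
        self.chunk_size = chunk_size

    def write(self, path, data: bytes):
        chunk_map = []
        for index, offset in enumerate(range(0, len(data), self.chunk_size)):
            donor_name = self.order[index % len(self.order)]  # assumed round-robin
            chunk_id = f"{path}#{index}"
            # Data message: chunk bytes go directly to a donor node.
            self.donors[donor_name].put(chunk_id, data[offset:offset + self.chunk_size])
            chunk_map.append((chunk_id, donor_name))
        # Metadata message: only the chunk map goes to the manager.
        self.manager.commit(path, chunk_map)

    def read(self, path) -> bytes:
        chunk_map = self.manager.lookup(path)
        return b"".join(self.donors[name].get(cid) for cid, name in chunk_map)


manager = Manager()
donors = [DonorNode("donor-1"), DonorNode("donor-2")]
client = SystemAccessInterface(manager, donors)

payload = b"0123456789abcdef"
client.write("/project/file_1", payload)
print(len(manager.lookup("/project/file_1")))      # 4 chunks
print(client.read("/project/file_1") == payload)   # True
```

Keeping file data off the manager is what lets data messages flow between clients and donor nodes without passing through a central point.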
Early architectural decisions
3.) FUSE-based implementation - Impact: structure, deployability
4.) Policy to manage tension between code maturity and need to experiment
Mid-way architectural decisions
5.) GeneralIO hack
6.) Test-driven design
- integrate 3-month projects
Implicit architectural policies
7.) Personnel management:
- prioritize ‘fun’
- Flat Team structure
- Bottom-up decision making / prioritization:
  - ‘campaigns’
8.) Align ‘values’
Key architectural decisions
1.) Object-based storage
2.) Uniform protocol stack
3.) POSIX, FUSE-based implementation
4.) Policy to manage tension between code maturity and need to experiment
5.) GeneralIO hack
6.) Test-driven design
7.) Personnel management: prioritize ‘fun’
8.) Align values