functional partitioning to optimize end-to-end performance ... · functional partitioning to...
TRANSCRIPT
Functional Partitioning to Optimize End-to-End Performance
on Many-core Architectures
Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim,Christian Engelmann, and Galen Shipman
Virginia Tech, Oak Ridge National Laboratory, North Carolina State University
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Many-cores are driving HPC
2
There is a need for redesigning the HPC software stack to benefit from increasing number of cores
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Can’t apps simply use more cores?
3
1
1.2
1.4
1.6
1.8
2
2.2
2.4
0 5 10 15 20 25 30
Spee
dup
Number of Cores
mpiBLAST FLASH
Simply assigning more cores to applications does not scale.
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Growing computation-I/O gap degrades performance
4
2000 2002 2004 2006 2008 2010
Perf
orm
ance
Disk-based Storage Systems
Storage Wall !
Server/CPUs
Source: storagetopic.com
25X
2X
Performance Growing Trend
Research question: Can the underutilized cores be leveraged
to bridge the Compute-I/O gap?
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Observation: All workflow activities (not just compute) affect overall performance
5
Computation Checkpointing Preprocessing
Script
Timeline
Computation I/O Computation I/O End-to-end Performance
Computation I/O Computation I/O Performance gain!
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Our Contribution: Functional Partitioning (FP) of Cores
• Idea: Partition the app core allocation • Dedicate partitions to different app activities
• Compute, checkpoint, format transformation, etc.
• Bring app support services into the compute node • Transforms the compute node into a scalable unit for service composition
• A generalized FP-based I/O runtime environment • SSD-based checkpointing • Adaptive checkpoint draining • Deduplication • Format transformation
• A thorough evaluation of FP using a160-core testbed
6
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Functional Partitioning (FP)
7
Checkpointing core Deduplication core
Parallel File System
… Aggregate Buffer
FP Configuration FP(2,8)
Launch Application
Data
Data
Script
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Agenda
• Motivation • Functional Partitioning (FP) • FP Case Studies • Evaluation • Conclusion
8
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Challenges in FP design
• How to co-execute the support services with the app? • How to assign cores for the support activities? • How to share data between compute and support activities?
• How to make the FP runtime transparent?
• How to have a flexible API for different support activities?
• How to do adapt support partitions based on progress?
• How to minimize the overhead of FP runtime?
9
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
FP runtime design
• Uses app-specific instances setup as part of job startup
• Uses interpositioning strategy for data management: • Initiates after core allocation by the scheduler and before
application startup (mpirun) • Pins the admin software to a core • Sets up a fuse-based mount point for data sharing between
compute and support services
• Initiates the support services and the application’s main compute to use the shared mount space
10
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Aux-apps: Capturing support activities • Provide an API for writing code for support activities • Describe actions to take when data is accessed
Adavantages: • Decouple application design from support activity design • Provide a flexible, reusable interface • Support recycling of common activities across apps
• Reduce application development time
11
int dedup_write (void * output_buffer, int size){ int result=SUCCESS; //process output in chunks while((chunk=get_chunk(&out_buffer,size))!=null){ // compute hash on output_buffer chunks char* hash=sha1(chunk);
//write the new chunk if(!hashtable_get(hash)) result=data_write(chunk);
// update de-dup hash-table hashtable_update(&result,chunk,hash); } return result; }
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Assigning cores to aux-apps
• Per-activity partition: dedicate a core to each aux-app • Intra-Node: Dedicated cores are co-located with the main app • Inter-Node: Dedicated cores are on specialized nodes
• Shared partition: multiple cores for multiple aux-apps • One service runs on multiple cores • One core runs multiple services
12
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Key FP runtime components for managing aux-apps • Benefactor: Software that runs on each node
• Manages a node’s contributions, SSD, memory, core • Serves as a basic management unit in the system • Provides services and communication layer between nodes • Uses FUSE to provide a special transparent mount point
• Manager: Software that runs on a dedicated node • Manages and coordinates benefactors • Schedules aux-apps and orchestrates data transfers
• Manager and benefactors are application specific and utilize cores from the application’s allotment
13
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Minimizing FP overhead
• Minimize main memory consumption • Use non-volatile memory, e.g. SSD, instead of DRAM
• Minimize cache contention • Schedule aux-apps based on socket boundaries
• Minimize interconnection bandwidth consumption • Coordinate the application and FP aux-apps • Extend the ioclt call to the runtime to define blackout periods
14
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Agenda
• Motivation • Functional Partitioning(FP) • FP Case Studies • Evaluation • Conclusion
15
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
SSD-based checkpointing
• FP can help compose a scalable service out of node-local checkpointing
• Why SSD checkpointing: More efficient than memory-checkpointing • Does not compete with app main-memory demands • Provide fault tolerance • Cost less
• How: Aggregate SSD on multiple nodes as an aggregate buffer • Provide faster transfer of checkpoint data to Parallel FS • Utilize dedicated core memory for I/O speed matching
16
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
SSD-based checkpointing aux-app
17
Application
Fuse:/
M
Application
Fuse:/
M
Application
Fuse:/
M
Manager
MetaInfo
M
…
…
Parallel File system
Compute Nodes/Benefactors Compute Nodes Checkpointing core
Aggregate SSD Store
Configuration FP(1,8)
Launch Application
Script
SSDs
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Deduplication of checkpoint data
• FP cores can be used to perform compute-intensive de-duplication, in-situ, on the node
• Why: Reduce the data written and improve I/O throughput
• How: Identify similar data across checkpoints • If data is duplicate, update only the metadata • Co-located with ssd-checkpointing app on the same core
18
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Deduplication aux-app
19
Application
Fuse:/
M
Application
Fuse:/
M
Application
Fuse:/
M
Manager
MetaInfo
M
…
…
Parallel File system
Compute Nodes/Benefactors
Aggregate SSD Store
Configuration FP(1,8)
Launch Application
Script
Checkpointing Deduplication core
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Agenda
• Motivation • Functional Partitioning(FP) • FP Case Studies • Evaluation • Conclusion
20
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Evaluation objectives
• How effective is functional partitioning?
• How efficient is the SSD-checkpointing aux-app? • Real world workload • Synthetic workload
• How efficient is the deduplication aux-app? • Synthetic workload
21
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Experimentation methodology
• Testbed: • 20 nodes, 160 cores, 8G memory/node, Linux 2.6.27.10 • HDD model: WD3200AAJS SATAII 320GB, 85MB/s • SSD model: Intel X25-E Extreme, sequential read 250MB/s,
sequential write 175MB/s, capacity 32G
• Workloads: • FLASH: Real-world astrophysics simulation application • Synthetic benchmark: A checkpoint application that generates
250MB/process every barrier step
• FP(x,y) -> dedicate x out of y cores on each node to aux-apps
22
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Impact of SSD-checkpointing using real-world workload
23
0
50
100
150
200
80 160
Tim
e(s)
Number of Total Cores
Local Disk non-FP(0,8) Local Disk non-FP(0,7) Aggregate Disk FP(1,8) Aggregate SSD FP(1,8)
15%
41%
27%
44% FP effectively improves application end-to-end performance
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
SSD-checkpointing I/O throughput using synthetic workload
24
0
1000
2000
3000
4000
5000
6000
7000
0 20 40 60 80 100 120 140 160
I/O T
hrou
ghpu
t (M
B/s
)
Number of Compute Cores
In Memory(20) FP(1,8) SSD(20) FP(1,8)
SSD checkpointing provides sustained high throughput vs. an in-memory approach without memory overhead
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Varying the number of benefactors
25
0
1000
2000
3000
4000
5000
6000
7000
5 10 16 20 25 30 36 40
I/O T
hrou
ghpu
t(MB
/s)
Number of Benefactors
In Memory FP(0-2,8)
A small number of benefactors can significantly improve performance
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Varying the number of compute cores
26
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
10 30 50 70 90 110 130 150
I/O T
hrou
ghpu
t (M
B/s
)
Number of Compute Cores
N=1 N=2 N=5 N=10 N=20
SSD FP(0-1,8), N=benefactors
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Efficiency of De-duplication Aux-app Using FP(2,8)
0
0.5
1
1.5
2
10 30 50 70 90 110 130
Nor
mal
ized
I/O
Thr
ough
put
Number of Compute Cores
Dedup(0.25) Dedup(0.50) Dedup(0.75) Dedup(0.90) Non-dedup
27
60%
For compute intensive tasks such as deduplication assigning more cores to the aux-app improve end-to-end performance
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Impact on end-to-end performance
28
0
200
400
600
800
1000
1200
1400
20 40 60 80 100 120 140 160
Tim
e (s
)
Number of Compute Cores
Execution Time Checkpointing Time Total Time
FP effectively improves application end-to-end performance
In Memory FP(1,8)
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Agenda
• Motivation • Functional Partition(FP) • Sample core services • Evaluation • Conclusion
29
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Conclusion • Created a novel FP run-time for many-cores systems
• Transparent to applications • Easy-to-use, flexible, and support recyclable aux-apps
• Implemented several sample FP support services • SSD-based checkpointing • Deduplication • Format transformation • Adaptive checkpoint data draining
• Showed that FP can improve end-to-end application performance by reducing support activity time with minimal overhead to compute
30
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Future work
• Explore dynamic functional partitioning • Implement more FP-based services • Utilize FP for non-I/O-based activities
• Contact information Viginia Tech: Min Li, Ali R. Butt -- [email protected], [email protected] ORNL: Sudharshan S. Vazhkudai – [email protected] NCSU: Xiaosong Ma – [email protected]
31
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Backup slides
32
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Adaptive checkpoint data draining
• Why: Data cannot be stored in the SSD buffer forever
• How: Lazily draining the data to PFS every k checkpoints • Periodically update the manager with free space status • The manager uses this info to determine when to drain • Dedicated cores can be used to facilitate the draining and
support tasks
33
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Adaptive checkpoint data draining
34
Application
Fuse:/
M
Application
Fuse:/
M
Application
Fuse:/
M
Manager
MetaInfo
M
…
…
Parallel File system
Compute Nodes/Benefactors Checkpointing core
Aggregate SSD Store
Launch Application
Draining core
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Removed
35
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Deduplication aux-app
36
Application
Fuse:/
M
Application
Fuse:/
M
Application
Fuse:/
M
Manager
MetaInfo
M
…
…
Parallel File system
Compute Nodes/Benefactors
Checkpointing Deduplication core
Aggregate SSD Store
Launch Application
Draining core
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University
Efficiency of de-duplication aux-app
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
10 30 50 70 90 110 130 150
I/O T
hrou
ghpu
t (M
B/s
)
Number Of Compute Cores
Non-dedup Dedup(0.90) Dedup(0.75)
Dedup(0.50) Dedup(0.25) Dedup(0.10)
37
Using a core to support a deduplciation aux-app improves I/O throughput and in turn improve end-to-end application performance