TRANSCRIPT
Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework

Vignesh Ravi (The Ohio State University)
Michela Becchi (University of Missouri)
Gagan Agrawal (The Ohio State University)
Srimat Chakradhar (NEC Laboratories America)
Two Interesting Trends
• GPU, "Big player" in High Performance Computing
– Excellent "price-performance" and "performance-per-watt" ratios
– Heterogeneous architectures: AMD Fusion APU, Intel Sandy Bridge, NVIDIA Denver Project
– 3 out of the top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame)
• Emergence of Cloud: "pay-as-you-go" model
– Cluster instances, high-speed interconnects for HPC users
– Amazon, Nimbix GPU instances
BIG FIRST STEP! But still at the initial stages.
Motivation
• Sharing is the basis of the cloud; GPUs are no exception
– Multiple virtual machines may share a physical node
• Modern GPUs are more expensive than multi-core CPUs
– Fermi cards with 6 GB memory cost about $4,000
– Sharing enables better resource utilization
• Modern GPUs expose a high degree of parallelism
– Applications may not utilize the full potential
Related Work
• vCUDA (Shi et al.)
• GViM (Gupta et al.)
• gVirtuS (Giunta et al.)
• rCUDA (Duato et al.)
Enable GPU visibility from virtual machines
Limitation: only from a single process context
How to share GPUs from virtual machines?
CUDA compute capability 2.0+ supports task parallelism (see the sketch below)
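As context for the limitation above, here is a minimal sketch (illustrative kernel names and sizes) of the task parallelism that compute capability 2.0+ enables: kernels launched into separate CUDA streams may overlap on one GPU, but only when all launches originate from a single process context, which is precisely what independent VMs lack.

```cuda
// Minimal sketch: with compute capability 2.0+, kernels launched into
// different streams may run concurrently on one GPU; but only when
// both launches come from the same process context.
#include <cstdio>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

int main() {
    float *dA, *dB;
    cudaMalloc(&dA, 256 * sizeof(float));
    cudaMalloc(&dB, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent streams: the hardware is free to overlap the kernels.
    kernelA<<<1, 256, 0, s1>>>(dA);
    kernelB<<<1, 256, 0, s2>>>(dB);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    std::printf("both kernels done\n");
    return 0;
}
```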
Contributions
• A framework for transparent GPU sharing in the cloud
– No source code changes required; feasible in the cloud
– Proposes sharing through consolidation
• A solution to the conceptual consolidation problem
– A new method for computing consolidation affinity scores
– Two new molding methods
– An overall runtime consolidation algorithm
• Extensive evaluation with 8 benchmarks on 2 GPUs
– At high contention, 50% improved throughput
– Framework overheads are small
Outline
• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions
BACKGROUND
• GPU Architecture
• CUDA Mapping and Scheduling
Background
[Figure: GPU architecture: a set of SMs, each with its own shared memory (SH MEM), on top of the GPU device memory]
• Resource requirements < max available: interleaved execution
• Resource requirements > max available: serialized execution
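To make the execution-configuration vocabulary used throughout this talk concrete, below is a minimal sketch (the kernel and sizes are made up): a launch of the form kernel<<<blocks, threads, shmem>>> fixes the per-block resource requirements, and the scheduler interleaves or serializes blocks depending on whether those requirements fit within each SM.

```cuda
// The execution configuration <<<blocks, threads, shmem>>> fixes the
// kernel's per-block resource requirements (threads, shared memory).
__global__ void stencil(const float *in, float *out) {
    extern __shared__ float tile[];  // dynamic shared memory per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

int main() {
    const int blocks = 14, threads = 256;   // a "14*256" configuration
    float *d_in, *d_out;
    cudaMalloc(&d_in,  blocks * threads * sizeof(float));
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    // If each SM can host this block's requirements alongside others,
    // blocks run interleaved across SMs; if requirements exceed what an
    // SM has free, blocks are serialized into successive waves.
    stencil<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```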
UNDERSTANDING CONSOLIDATION ON GPU
• Demonstrate Potential of Consolidation
• Relation between Utilization and Performance
• Preliminary experiments with consolidation
GPU Utilization vs Performance
[Figure: Scalability of Applications. Y-axis: scalability over the 1*256 configuration (0 to 14); X-axis: execution configuration (2*256, 4*256, 8*256, 16*256, 32*256, 64*256); series: Black Scholes, Binomial Options, PDE Solver, Image Processing. Annotations mark linear scaling (good improvement) and sub-linear scaling (no significant improvement).]
Consolidation with Space and Time Sharing
[Figure: Consolidation of App 1 and App 2 across four SMs (each with SH MEM), illustrating space and time sharing]
• A single application cannot utilize all SMs effectively
• Better performance at a large number of blocks
FRAMEWORK DESIGN
• Challenges
• gVirtuS Current Design
• Consolidation Framework & its Components
Design Challenges
• Enabling GPU sharing: need a virtual process context
• When & what to consolidate: need policies and algorithms to decide
• Overheads: need a light-weight design
gVirtuS Current Design
[Figure: gVirtuS architecture. Guest side: each VM (VM1, VM2) runs a CUDA application (CUDA App1, CUDA App2) linked against a frontend library. Host side: the gVirtuS backend sits on top of the CUDA runtime, the CUDA driver, and the GPUs (GPU1 ... GPUn), running over Linux / VMM; guests and host communicate through a guest-host communication channel.]
• The backend forks one process per application (Backend Process 1, Backend Process 2)
• No communication between backend processes
Runtime Consolidation Framework
[Figure: Runtime consolidation framework, host side. Workloads arrive from the frontend at the backend server, which queues them to the dispatcher. The dispatcher, acting as the consolidation decision maker driven by policies and heuristics, queues each workload to the ready queue of a virtual context; each virtual context has a workload consolidator thread and runs on its own GPU.]
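The slide names the components but not their code; the following host-side sketch shows one way they could fit together. All type and member names (Workload, VirtualContext, BackendServer, pickByAffinity) are hypothetical illustrations, not the actual implementation.

```cuda
// Host-side sketch of the components named on the slide. All names
// here are hypothetical illustrations of the described design.
#include <queue>
#include <string>
#include <vector>

struct Workload {                        // a kernel arriving from a VM frontend
    std::string name;
    int blocks, threadsPerBlock, shmemBytes;  // 3-tuple execution configuration
};

struct VirtualContext {                  // one per GPU
    int gpuId;
    std::queue<Workload> readyQueue;     // filled by the dispatcher; a
                                         // workload-consolidator thread
                                         // drains it and launches the
                                         // consolidated kernels on gpuId
};

struct BackendServer {
    std::vector<VirtualContext> contexts;

    // Dispatcher: consults the consolidation decision maker (policies
    // and heuristics) and queues the workload to a virtual context.
    void dispatch(const Workload &w) {
        pickByAffinity(w).readyQueue.push(w);
    }

    VirtualContext &pickByAffinity(const Workload &) {
        return contexts.front();         // placeholder for the real policy
    }
};

int main() {
    BackendServer server;
    server.contexts.push_back({0, {}});
    server.contexts.push_back({1, {}});
    server.dispatch({"BlackScholes", 14, 256, 0});
    return 0;
}
```

The design point is the separation of concerns: the dispatcher decides once per workload, and each virtual context's consolidator thread simply drains its own ready queue onto its GPU.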
CONSOLIDATION DECISION MAKING LAYER
• GPU Sharing Mechanisms & Resource Contention
• Two Molding Policies
• Consolidation Runtime Scheduling Algorithm
Sharing Mechanisms & Resource Contention
Sharing mechanisms:
• Consolidation by Space Sharing
• Consolidation by Time Sharing

Resource contention (the basis of the affinity score):
• Large number of threads within a block
• Pressure on shared memory
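The slide identifies the basis of the affinity score but not the formula, so the sketch below is only one plausible formulation under that assumption: a pair scores 1.0 when its combined per-SM thread and shared memory requirements fit within the Tesla C2050 limits from the evaluation setup, and proportionally less when they do not.

```cuda
// Illustrative pairwise affinity score. The slide gives only the basis
// (threads per block, shared memory pressure); this exact formula is an
// assumption. Limits are per SM on a Tesla C2050: 1536 resident
// threads, 48 KB shared memory.
#include <cstdio>

struct KernelConfig { int blocks, threadsPerBlock, shmemPerBlock; };

double pairwiseAffinity(const KernelConfig &a, const KernelConfig &b) {
    const int kMaxThreadsPerSM = 1536;
    const int kMaxShmemPerSM   = 48 * 1024;

    int threads = a.threadsPerBlock + b.threadsPerBlock;
    int shmem   = a.shmemPerBlock   + b.shmemPerBlock;

    // 1.0 means the pair co-resides on an SM without contention;
    // lower values mean contention.
    double threadScore = threads <= kMaxThreadsPerSM
                             ? 1.0 : (double)kMaxThreadsPerSM / threads;
    double shmemScore  = shmem <= kMaxShmemPerSM
                             ? 1.0 : (double)kMaxShmemPerSM / shmem;
    return threadScore < shmemScore ? threadScore : shmemScore;
}

int main() {
    KernelConfig a{14, 512, 16 * 1024};   // made-up configurations
    KernelConfig b{14, 256, 40 * 1024};
    std::printf("affinity(a, b) = %.2f\n", pairwiseAffinity(a, b));
    return 0;
}
```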
Molding Kernel Configuration
• Perform molding dynamically
• Leverage gVirtuS to intercept kernel launch
• Flexible for configuration modification
• Mold the configuration to reduce contention
• Potential increase in application latency
• However, may still improve global throughput
Two Molding Policies
Molding policies:
• Forced Space Sharing: 14*256 molded to 7*256; may resolve shared memory contention
• Time Sharing with Reduced Threads: 14*512 molded to 14*128; may reduce register pressure in the SM
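A minimal sketch of the two policies as configuration rewrites, using the 14*256 to 7*256 and 14*512 to 14*128 examples from the slide; the function names are illustrative, and in the real framework the molded configuration would replace the one intercepted at kernel launch.

```cuda
// Sketch of the two molding policies as configuration rewrites; the
// function names are illustrative.
#include <cstdio>

struct LaunchConfig { int blocks, threadsPerBlock; };

// Forced space sharing: halve the blocks so the kernel occupies about
// half the SMs, e.g. 14*256 -> 7*256. May resolve shared memory
// contention, since co-scheduled kernels land on disjoint SMs.
LaunchConfig forcedSpaceSharing(LaunchConfig c) {
    c.blocks = (c.blocks + 1) / 2;
    return c;
}

// Time sharing with reduced threads: shrink the block size, e.g.
// 14*512 -> 14*128. May reduce register pressure in the SM.
LaunchConfig timeSharingReducedThreads(LaunchConfig c) {
    c.threadsPerBlock /= 4;              // the slide's example: 512 -> 128
    return c;
}

int main() {
    LaunchConfig s = forcedSpaceSharing({14, 256});
    LaunchConfig t = timeSharingReducedThreads({14, 512});
    std::printf("space sharing: 14*256 -> %d*%d\n", s.blocks, s.threadsPerBlock);
    std::printf("reduced threads: 14*512 -> %d*%d\n", t.blocks, t.threadsPerBlock);
    return 0;
}
```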
Consolidation Scheduling Algorithm
• Greedy-based Scheduling Algorithm
• Schedule "N" kernels on 2 GPUs
• Input: 3-tuple execution configuration list of all kernels
• Data Structure: Work Queue for each Virtual Context
Overall Algorithm
• Generate Pair-wise Affinity
• Generate Affinity for List
• Get Affinity By Molding
Consolidation Scheduling Algorithm
1. Input: the configuration list of all kernels.
2. Create work queues for the virtual contexts.
3. Generate pair-wise affinity.
4. Find the pair with minimum affinity; split the pair into different queues.
5. For each remaining kernel, with each work queue: (a1, a2) = Generate Affinity For List.
6. For each remaining kernel, with each work queue: (a3, a4) = Get Affinity By Molding.
7. Find max(a1, a2, a3, a4); push the kernel into that queue.
8. Dispatch the queues into the virtual contexts.
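A compact sketch of this greedy flow for N kernels on two GPUs, reusing the illustrative (assumed) affinity scoring from the earlier sketch in abbreviated form so that it compiles standalone; actually applying the chosen molded configuration to the dispatched kernel is omitted for brevity.

```cuda
// Sketch of the greedy loop drawn above, for N kernels on 2 GPUs with
// one work queue per virtual context.
#include <algorithm>
#include <vector>

struct KernelConfig { int blocks, threadsPerBlock, shmemPerBlock; };

// Same illustrative scoring as the earlier sketch (Tesla C2050 limits).
double pairwiseAffinity(const KernelConfig &a, const KernelConfig &b) {
    int t = a.threadsPerBlock + b.threadsPerBlock;
    int s = a.shmemPerBlock + b.shmemPerBlock;
    double ts = t <= 1536 ? 1.0 : 1536.0 / t;
    double ss = s <= 48 * 1024 ? 1.0 : 48.0 * 1024 / s;
    return std::min(ts, ss);
}

// "Generate Affinity For List": worst pairwise score against the queue.
double affinityForList(const KernelConfig &k,
                       const std::vector<KernelConfig> &q) {
    double m = 1.0;
    for (const KernelConfig &o : q) m = std::min(m, pairwiseAffinity(k, o));
    return m;
}

// "Get Affinity By Molding": try both molding policies, keep the better.
double affinityByMolding(const KernelConfig &k,
                         const std::vector<KernelConfig> &q) {
    KernelConfig space = k; space.blocks = (space.blocks + 1) / 2;
    KernelConfig fewer = k; fewer.threadsPerBlock /= 2;
    return std::max(affinityForList(space, q), affinityForList(fewer, q));
}

void scheduleKernels(std::vector<KernelConfig> kernels,
                     std::vector<KernelConfig> queue[2]) {
    if (kernels.size() < 2) return;      // sketch: need at least one pair

    // Find the pair with minimum affinity and split it across the queues.
    int wa = 0, wb = 1;
    double minAff = 2.0;
    for (int i = 0; i < (int)kernels.size(); ++i)
        for (int j = i + 1; j < (int)kernels.size(); ++j) {
            double a = pairwiseAffinity(kernels[i], kernels[j]);
            if (a < minAff) { minAff = a; wa = i; wb = j; }
        }
    queue[0].push_back(kernels[wa]);
    queue[1].push_back(kernels[wb]);
    kernels.erase(kernels.begin() + wb); // erase the higher index first
    kernels.erase(kernels.begin() + wa);

    // For each remaining kernel: (a1, a2) as-is, (a3, a4) by molding;
    // push the kernel into the queue with the maximum of the four.
    for (const KernelConfig &k : kernels) {
        double a1 = affinityForList(k, queue[0]);
        double a2 = affinityForList(k, queue[1]);
        double a3 = affinityByMolding(k, queue[0]);
        double a4 = affinityByMolding(k, queue[1]);
        double best = std::max({a1, a2, a3, a4});
        queue[(best == a1 || best == a3) ? 0 : 1].push_back(k);
    }
    // The two queues are then dispatched into the virtual contexts.
}

int main() {
    std::vector<KernelConfig> kernels = {
        {14, 512, 0}, {14, 512, 0}, {7, 512, 24 * 1024}, {14, 256, 40 * 1024}};
    std::vector<KernelConfig> queues[2];
    scheduleKernels(kernels, queues);
    return 0;
}
```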
EXPERIMENTAL RESULTS
• Setup, Metric & Baselines
• Benchmarks
• Results
Setup, Metric & Baselines
• Setup
– A machine with two quad-core Intel Xeon E5520 CPUs
– Two NVIDIA Tesla C2050 GPU cards
• 14 Streaming Multiprocessors (SMs), each containing 32 cores
• 3 GB device memory
• 48 KB shared memory per SM
– Virtualized with gVirtuS 2.0
• Evaluation Metric
– Global throughput benefit obtained after consolidation of kernels
• Baselines
– Serialized execution, based on CUDA runtime scheduling
– Blind round-robin consolidation (unaware of execution configuration)
Benchmarks & Goals
Benchmarks and their characteristics:

Benchmark                 | Memory characteristics     | Data set description
Image Processing (IP)     | No ShMem                   | 2*3584*3584 points
PDE Solver (PDE)          | No ShMem                   | 2*3584*3584 points
BlackScholes (BS)         | No ShMem                   | 1,000,000 options
Binomial Options (BO)     | Low ShMem (up to 3 KB)     | 256 options, 2048 steps
K-Means Clustering (KM)   | Medium ShMem (up to 16 KB) | 4,194,304 points
K-Nearest Neighbour (KNN) | Medium ShMem (up to 16 KB) | 4,194,304 points
Euler (EU)                | Heavy ShMem (up to 48 KB)  | 10,000 nodes, 60,000 edges
Molecular Dynamics (MD)   | Heavy ShMem (up to 48 KB)  | 130,000 nodes, 16,200,000 edges
Benefits of Space and Time Sharing Mechanisms
[Figure panels: Space Sharing; Time Sharing]
• No resource contention
• Consolidation through the Blind Round-Robin algorithm
• Compared against serialized execution of kernels
Drawbacks of Blind Scheduling
• In the presence of resource contention, there is no benefit from consolidation
[Figure panels: Large Number of Threads; Shared Memory Contention]
Effect of Molding
[Figure panels: Contention – Large Threads, molded by Time Sharing with Reduced Threads; Contention – Shared Memory, molded by Forced Space Sharing]
Effect of Affinity Scores
Kernel configurations:
• 2 kernels with 7*512
• 2 kernels with 14*256

• No affinity: unbalanced threads per SM
• With affinity: better thread balancing per SM
Benefits at High Contention Scenario
• 8 kernels on 2 GPUs
• 6 out of 8 kernels molded
• 31.5% improvement over Blind Scheduling
• 50% improvement over serialized execution
Framework Overheads
[Figure panels: No Consolidation; With Consolidation]
• Compared to plain gVirtuS execution: overhead always less than 1%
• Compared with manually consolidated execution: overhead always less than 4%
Conclusions
• A Framework for transparent sharing of GPUs
• Use Consolidation as a mechanism for sharing GPUs
• No source code level changes
• New Affinity and Molding methods
• Runtime Consolidation Scheduling Algorithm
• At high contention, significant throughput benefits
• The overheads of the framework are small
Thank You for your attention!
Questions?
Authors' Contact Information:
• [email protected]
• [email protected]
• [email protected]
• [email protected]
Impact of Large Number of Threads
Per-Application Slowdown / Choice of Molding
[Figure panels: Application Slowdown; Choice of Molding Type]