1. Scheduling of Memory Accesses in the Memory Interface of GPUs - Luis Garrido
-
IEE5008 Memory Systems, Autumn 2012
Survey on the Off-Chip Scheduling of Memory Accesses in the Memory Interface of GPUs
Garrido Platero, Luis Angel
EECS Graduate Program
National Chiao Tung University
luis.garrido.platero@gmail.com
-
Outline
Introduction
Overview of GPU Architectures
  The SIMD Execution Model
  Memory Requests in GPUs
  Differences between GDDRx and DDRx
State-of-the-art Memory Scheduling Techniques
  Effect of Instruction Fetch and Memory Scheduling in GPU Performance
  An Alternative Memory Access Scheduling in Manycore Accelerators
-
Outline (cont.)
  DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
  Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Conclusion
References
-
Introduction
Memory controllers are a critical performance bottleneck
Scheduling policies must be compliant with the SIMD execution model
Characteristics of the memory requests in GPU architectures
Integration of heterogeneous CPU+GPU systems
-
8/13/2019 1. Scheduling of Memory Accesses in the Memory Interface of GPUs - Luis
5/36
NCTU IEE5008 Memory Systems 2012Luis Garrido
Overview: SIMD execution model
[Figure: GPU architecture overview. Each streaming multiprocessor contains cores, double-precision units, and LD/ST units, a 65536 x 32-bit register file, warp schedulers, an instruction cache, 64 KB shared memory / L1 cache, a 48 KB read-only data cache, and texture units. The multiprocessors share an interconnect network, an L2 unified cache, several memory controllers, and a GigaThread engine. The programming model maps a grid of thread blocks (Block (0,0) through Block (2,2)), each block itself a 2-D array of threads, onto this hardware, with LD/ST units reaching memory through the interconnect network.]
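The lockstep behavior implied by the figure can be illustrated with a toy model (a deliberate simplification; `warp_execute` and `WARP_SIZE` are illustrative names, not a real GPU API):

```python
# Toy SIMT model (illustrative only): a warp of 32 threads executes the
# same instruction in lockstep, each thread operating on its own data lane.
WARP_SIZE = 32

def warp_execute(op, lanes):
    """Apply one scalar operation across every lane of a warp in lockstep."""
    assert len(lanes) == WARP_SIZE
    return [op(x) for x in lanes]

# Each thread holds its own register value; one instruction updates all 32.
lanes = list(range(WARP_SIZE))
result = warp_execute(lambda x: 2 * x, lanes)
print(result[:4])  # -> [0, 2, 4, 6]
```

One instruction fetch thus drives 32 data operations, which is why a single load or store instruction can emit many simultaneous memory accesses.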
-
Overview: Memory requests in GPUs
A varying number of accesses, with different characteristics, is generated simultaneously.
[Table: addresses requested by a warp at consecutive instructions; each row lists the accesses generated simultaneously, with '-' marking an inactive lane]
i   : A A B B
i+1 : C C D D
i+2 : E E B C
i+3 : A F A -
i+4 : F F B F
i+5 : A E E E
i+6 : B B B B
Load/Store units handle the data fetch.
Concept of memory coalescing: accesses can be merged
  Intra-core merging
  Inter-core merging
  Merging reduces the number of transactions
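A minimal sketch of the transaction-reduction idea, assuming accesses that fall in the same aligned 128-byte segment coalesce into one transaction (the segment size and helper name are assumptions for illustration, not vendor specifics):

```python
# Simple coalescing model (assumed): requests that hit the same aligned
# 128-byte segment are merged into a single memory transaction.
SEGMENT = 128  # assumed transaction/segment size in bytes

def coalesced_transactions(addresses):
    """Count the distinct aligned segments touched by a warp's addresses."""
    return len({addr // SEGMENT for addr in addresses})

# 32 threads reading consecutive 4-byte words fit in one segment.
print(coalesced_transactions([4 * t for t in range(32)]))       # -> 1
# The same stride shifted by 64 bytes straddles two segments.
print(coalesced_transactions([4 * t + 64 for t in range(32)]))  # -> 2
```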
-
Overview: Memory requests in GPUs
The number of transactions depends on:
  Parameters of the memory subsystem: number of cache levels, cache line size, etc.
  Application behavior
  Scheduling policy and GDDRx controller capabilities
[Figure: access patterns of a warp over a 0-384 byte address range and the transactions they generate.
a) Coalesced accesses = 1 transaction
b) Same-word accesses = 1 transaction
c) Scattered accesses = k transactions
d) Misaligned accesses = 2 transactions
e) Permuted accesses = 1 transaction
A second panel shows the same patterns on hardware with stricter coalescing rules: coalesced accesses = 4 transactions, permuted accesses = 4 transactions, same-word access = 1 transaction, scattered accesses = k transactions, misaligned access = ]
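The patterns above can be reproduced with a simple segment model (a sketch assuming one transaction per aligned 128-byte segment touched; all names and address layouts below are illustrative):

```python
# Evaluate the slide's access patterns under a simple model where each
# aligned 128-byte segment touched costs one transaction (an assumption).
SEGMENT = 128

def transactions(addresses):
    """Count the distinct aligned segments touched by the addresses."""
    return len({a // SEGMENT for a in addresses})

aligned    = [4 * t for t in range(32)]        # consecutive 4-byte words
permuted   = list(reversed(aligned))           # same segment, reordered lanes
same_word  = [0] * 32                          # every thread reads one word
scattered  = [SEGMENT * t for t in range(32)]  # one segment per thread
misaligned = [4 * t + 64 for t in range(32)]   # straddles a segment boundary

print(transactions(aligned), transactions(permuted), transactions(same_word),
      transactions(scattered), transactions(misaligned))  # -> 1 1 1 32 2
```

The scattered pattern is the k-transaction worst case: with one segment per thread, k equals the warp width.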