TRANSCRIPT
MacSim Tutorial (In ISCA-39, 2012)
Architecture Studies Using MacSim
• Thread fetch policies
• Branch predictor
• Software and hardware prefetchers
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
• Power model
(Topic groups: Front-end, Memory System, Misc.)
Prefetcher Study
[Diagram] Trace generator (PIN, GPUOcelot) feeds MacSim; software prefetch instructions reach the frontend, hardware prefetch requests reach the memory system.
• Software prefetch instructions: PTX prefetch, prefetchu; x86 prefetcht0, prefetcht1, prefetchnta
• Hardware prefetcher: stream, stride, GHB, …
• Many-Thread Aware Prefetching Mechanism [Lee et al., MICRO-43, 2010]
• When Prefetching Works, When It Doesn’t, and Why [Lee et al., ACM TACO, 2012]
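As a concrete illustration of the hardware prefetchers named above, here is a minimal per-PC stride prefetcher sketch (the table layout, confidence threshold, and prefetch degree are illustrative assumptions, not MacSim's actual implementation):

```python
# Minimal per-PC stride prefetcher sketch (illustrative; not MacSim's
# actual implementation). Each load PC gets a table entry tracking its
# last address and last observed stride; a repeated stride triggers
# prefetches for the next `degree` addresses along that stride.

class StridePrefetcher:
    def __init__(self, degree=2):
        self.table = {}          # pc -> (last_addr, stride, confidence)
        self.degree = degree     # prefetches issued per trigger

    def access(self, pc, addr):
        """Record a demand access; return the prefetch addresses to issue."""
        last_addr, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)     # stride repeated: grow confidence
        else:
            conf = 0                    # stride changed: reset
        self.table[pc] = (addr, new_stride, conf)
        if conf >= 1:                   # same stride seen twice in a row
            return [addr + new_stride * (i + 1) for i in range(self.degree)]
        return []
```

For a load at PC 0x40 streaming through addresses 100, 164, 228, …, the third access confirms the 64-byte stride and triggers prefetches for 292 and 356.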
Cache and NoC Studies
• Cache studies: sharing, inclusion property
• On-chip interconnection studies
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
[Diagram: two organizations — per-core private caches vs. a shared cache, each attached to the interconnection network]
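To make the inclusion property concrete, here is a toy two-level sketch (a hypothetical structure, not MacSim code): with an inclusive shared cache, evicting a block from the shared level must also back-invalidate it from the private cache.

```python
# Toy two-level inclusive cache sketch (hypothetical; not MacSim code).
# OrderedDict insertion order serves as LRU order. When the shared L2
# evicts a block, it back-invalidates the block from the private L1 so
# that L1's contents remain a subset of L2's (the inclusion property).

from collections import OrderedDict

class InclusiveHierarchy:
    def __init__(self, l1_size=2, l2_size=4):
        self.l1 = OrderedDict()          # private cache: addr -> True
        self.l2 = OrderedDict()          # shared inclusive cache
        self.l1_size, self.l2_size = l1_size, l2_size

    def access(self, addr):
        # Fill both levels; an L2 eviction back-invalidates L1.
        for cache, size in ((self.l1, self.l1_size), (self.l2, self.l2_size)):
            if addr in cache:
                cache.move_to_end(addr)          # refresh LRU position
            else:
                cache[addr] = True
                if len(cache) > size:
                    victim, _ = cache.popitem(last=False)   # evict LRU
                    if cache is self.l2:
                        self.l1.pop(victim, None)           # back-invalidation
```

After streaming five distinct blocks through a 2-entry L1 and 4-entry inclusive L2, the oldest block is gone from both levels and L1 remains a subset of L2.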
Heterogeneity Aware NoC
• Heterogeneous link configuration
[Diagram: ring network connecting CPU cores (C), GPU cores (G), L3 slices, and memory controllers (M), shown with different topologies and node placements]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. under review]
Instruction Fetch and DRAM Scheduling
[Diagram] Trace generator (GPUOcelot) → frontend → execution → DRAM
• Fetch policies: RR, ICOUNT, FAIR, LRF, …
• DRAM scheduling policies: FCFS, FRFCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
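Two of the fetch policies listed above can be sketched as simple selection functions (a minimal illustration; the policies in MacSim carry more per-thread state than this):

```python
# Sketches of two thread-fetch policies named on the slide (illustrative).
# RR (round-robin) cycles through the ready threads; ICOUNT picks the
# thread with the fewest in-flight instructions, favoring threads that
# are draining the pipeline quickly.

def rr_select(threads, last):
    """threads: list of ready thread ids; last: id fetched previously."""
    i = threads.index(last)
    return threads[(i + 1) % len(threads)]

def icount_select(inflight):
    """inflight: dict mapping thread id -> in-flight instruction count.
    Ties break toward the lower thread id."""
    return min(inflight, key=lambda t: (inflight[t], t))
```

For example, with in-flight counts {0: 7, 1: 3, 2: 5}, ICOUNT fetches from thread 1 next, while round-robin after thread 2 wraps back to thread 0.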
DRAM Scheduling in GPGPUs
[Diagram: a DRAM controller serving per-bank queues W0–W3 for Core-0 and Core-1; queue entries are row hits (RH) and row misses (RM). Core-0's queues hold 5, 4, and 3 requests, Core-1's hold 4 and 1, each headed by a row hit. Tolerance(Core-0) < Tolerance(Core-1).]
Potential of requests from Core-0 = |W0|^α + |W1|^α + |W2|^α + |W3|^α = 4^α + 3^α + 5^α (α < 1)
Reduction in potential if:
• a row hit from a queue of length L is serviced next: L^α − (L − 1)^α
• a row miss from a queue of length L is serviced next: L^α − (L − 1/m)^α
where m = (cost of servicing a row miss) / (cost of servicing a row hit)
Since Tolerance(Core-0) < Tolerance(Core-1), select Core-0. Servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next.
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
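The potential-function selection described above can be sketched as follows (α, m, and the queue encoding are illustrative choices; the per-core tolerance-based selection step is omitted):

```python
# Sketch of the potential-function scheduling idea (illustrative).
# A core's potential is sum(|W_i|**alpha) over its non-empty queues,
# with alpha < 1. Servicing a row hit from a queue of length L reduces
# the potential by L**a - (L - 1)**a; a row miss reduces it by
# L**a - (L - 1/m)**a, where m is the row-miss/row-hit cost ratio.

def potential(queue_lengths, alpha=0.5):
    """Potential of a set of bank queues with the given lengths."""
    return sum(l ** alpha for l in queue_lengths if l > 0)

def best_queue(queues, alpha=0.5, m=3.0):
    """queues: list of request lists, each entry 'RH' (row hit) or 'RM'
    (row miss), head first. Return the index of the queue whose next
    service yields the greatest reduction in potential."""
    best, best_drop = None, -1.0
    for i, q in enumerate(queues):
        if not q:
            continue
        L = len(q)
        step = 1.0 if q[0] == "RH" else 1.0 / m   # misses advance 1/m per hit-cost
        drop = L ** alpha - (L - step) ** alpha
        if drop > best_drop:
            best, best_drop = i, drop
    return best
```

Because x^α is concave for α < 1, removing a request from a short queue reduces the potential more than removing one from a long queue, so among queues headed by row hits the scheduler favors the shortest.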
Power Research & Validation
• Validating the simulator against a GTX 580
• Modeling x86 CPU power
• Modeling GPU power
Still ongoing research.
[Pie chart: GPU power breakdown — Fetch 3%, Decode 1%, Schedule 3%, RF 4%, EX_alu 6%, EX_fpu 48%, EX_SFU 1%, EX_LD/ST 3%, Execution 0%, MMU 0%, L1 26%, SharedMem 1%, ConstCache 1%, TextureCache 1%]