TRANSCRIPT
MacSim Tutorial (In ISCA-39, 2012)
Architecture Studies Using MacSim
• Thread fetch policies
• Branch predictor
• Software and hardware prefetchers
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
• Power model
(Topic groups: Front-end, Memory System, Misc.)
Prefetcher Study
[Diagram] Trace generator (PIN, GPUOcelot) feeds MacSim; software prefetch instructions reach the frontend, hardware prefetch requests reach the memory system.
• Software prefetch instructions: PTX prefetch, prefetchu; x86 prefetcht0, prefetcht1, prefetchnta
• Hardware prefetcher: stream, stride, GHB, …
• Many-Thread Aware Prefetching Mechanism [Lee et al., MICRO-43, 2010]
• When Prefetching Works, When It Doesn’t, and Why [Lee et al., ACM TACO, 2012]
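As a concrete illustration of the hardware prefetchers named above, here is a minimal per-PC stride prefetcher sketch (the table layout, confidence threshold, and prefetch degree are illustrative assumptions, not MacSim's actual implementation):

```python
# Minimal per-PC stride prefetcher sketch (illustrative; not MacSim's
# actual implementation). Each load PC gets a table entry tracking its
# last address and last observed stride; a repeated stride triggers
# prefetches for the next `degree` addresses along that stride.

class StridePrefetcher:
    def __init__(self, degree=2):
        self.table = {}          # pc -> (last_addr, stride, confidence)
        self.degree = degree     # prefetches issued per trigger

    def access(self, pc, addr):
        """Record a demand access; return the prefetch addresses to issue."""
        last_addr, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)     # stride repeated: grow confidence
        else:
            conf = 0                    # stride changed: reset
        self.table[pc] = (addr, new_stride, conf)
        if conf >= 1:                   # same stride seen twice in a row
            return [addr + new_stride * (i + 1) for i in range(self.degree)]
        return []
```

For a load at PC 0x40 streaming through addresses 100, 164, 228, …, the third access confirms the 64-byte stride and triggers prefetches for 292 and 356.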
Cache and NoC Studies
• Cache studies: sharing, inclusion property
• On-chip interconnection studies
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
[Diagram: two organizations — per-core private caches vs. a shared cache, each attached to the interconnection network]
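To make the inclusion property concrete, here is a toy two-level sketch (a hypothetical structure, not MacSim code): with an inclusive shared cache, evicting a block from the shared level must also back-invalidate it from the private cache.

```python
# Toy two-level inclusive cache sketch (hypothetical; not MacSim code).
# OrderedDict insertion order serves as LRU order. When the shared L2
# evicts a block, it back-invalidates the block from the private L1 so
# that L1's contents remain a subset of L2's (the inclusion property).

from collections import OrderedDict

class InclusiveHierarchy:
    def __init__(self, l1_size=2, l2_size=4):
        self.l1 = OrderedDict()          # private cache: addr -> True
        self.l2 = OrderedDict()          # shared inclusive cache
        self.l1_size, self.l2_size = l1_size, l2_size

    def access(self, addr):
        # Fill both levels; an L2 eviction back-invalidates L1.
        for cache, size in ((self.l1, self.l1_size), (self.l2, self.l2_size)):
            if addr in cache:
                cache.move_to_end(addr)          # refresh LRU position
            else:
                cache[addr] = True
                if len(cache) > size:
                    victim, _ = cache.popitem(last=False)   # evict LRU
                    if cache is self.l2:
                        self.l1.pop(victim, None)           # back-invalidation
```

After streaming five distinct blocks through a 2-entry L1 and 4-entry inclusive L2, the oldest block is gone from both levels and L1 remains a subset of L2.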
Heterogeneity Aware NoC
• Heterogeneous link configuration
[Diagram: ring network connecting CPU cores (C), GPU cores (G), L3 slices, and memory controllers (M), shown with different topologies and node placements]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. under review]
Instruction Fetch and DRAM Scheduling
[Diagram] Trace generator (GPUOcelot) → frontend → execution → DRAM
• Fetch policies: RR, ICOUNT, FAIR, LRF, …
• DRAM scheduling policies: FCFS, FRFCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
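Two of the fetch policies listed above can be sketched as simple selection functions (a minimal illustration; the policies in MacSim carry more per-thread state than this):

```python
# Sketches of two thread-fetch policies named on the slide (illustrative).
# RR (round-robin) cycles through the ready threads; ICOUNT picks the
# thread with the fewest in-flight instructions, favoring threads that
# are draining the pipeline quickly.

def rr_select(threads, last):
    """threads: list of ready thread ids; last: id fetched previously."""
    i = threads.index(last)
    return threads[(i + 1) % len(threads)]

def icount_select(inflight):
    """inflight: dict mapping thread id -> in-flight instruction count.
    Ties break toward the lower thread id."""
    return min(inflight, key=lambda t: (inflight[t], t))
```

For example, with in-flight counts {0: 7, 1: 3, 2: 5}, ICOUNT fetches from thread 1 next, while round-robin after thread 2 wraps back to thread 0.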
DRAM Scheduling in GPGPUs
[Diagram: a DRAM controller serving per-bank queues W0–W3 for Core-0 and Core-1; queue entries are row hits (RH) and row misses (RM). Core-0's queues hold 5, 4, and 3 requests, Core-1's hold 4 and 1, each headed by a row hit. Tolerance(Core-0) < Tolerance(Core-1).]
Potential of requests from Core-0 = |W0|^α + |W1|^α + |W2|^α + |W3|^α = 4^α + 3^α + 5^α (α < 1)
Reduction in potential if:
• a row hit from a queue of length L is serviced next: L^α − (L − 1)^α
• a row miss from a queue of length L is serviced next: L^α − (L − 1/m)^α
where m = (cost of servicing a row miss) / (cost of servicing a row hit)
Since Tolerance(Core-0) < Tolerance(Core-1), select Core-0. Servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next.
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
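The potential-function selection described above can be sketched as follows (α, m, and the queue encoding are illustrative choices; the per-core tolerance-based selection step is omitted):

```python
# Sketch of the potential-function scheduling idea (illustrative).
# A core's potential is sum(|W_i|**alpha) over its non-empty queues,
# with alpha < 1. Servicing a row hit from a queue of length L reduces
# the potential by L**a - (L - 1)**a; a row miss reduces it by
# L**a - (L - 1/m)**a, where m is the row-miss/row-hit cost ratio.

def potential(queue_lengths, alpha=0.5):
    """Potential of a set of bank queues with the given lengths."""
    return sum(l ** alpha for l in queue_lengths if l > 0)

def best_queue(queues, alpha=0.5, m=3.0):
    """queues: list of request lists, each entry 'RH' (row hit) or 'RM'
    (row miss), head first. Return the index of the queue whose next
    service yields the greatest reduction in potential."""
    best, best_drop = None, -1.0
    for i, q in enumerate(queues):
        if not q:
            continue
        L = len(q)
        step = 1.0 if q[0] == "RH" else 1.0 / m   # misses advance 1/m per hit-cost
        drop = L ** alpha - (L - step) ** alpha
        if drop > best_drop:
            best, best_drop = i, drop
    return best
```

Because x^α is concave for α < 1, removing a request from a short queue reduces the potential more than removing one from a long queue, so among queues headed by row hits the scheduler favors the shortest.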
Power Research & Validation
• Validating the simulator against a GTX 580
• Modeling x86 CPU power
• Modeling GPU power
Still ongoing research.
[Pie chart: GPU power breakdown — Fetch 3%, Decode 1%, Schedule 3%, RF 4%, EX_alu 6%, EX_fpu 48%, EX_SFU 1%, EX_LD/ST 3%, Execution 0%, MMU 0%, L1 26%, SharedMem 1%, ConstCache 1%, TextureCache 1%]