
  • EECS 583 – Class 20

    Research Topic 2: Stream Compilation, GPU Compilation

    University of Michigan

    December 3, 2012

    Guest Speakers Today: Daya Khudia and Mehrzad Samadi

  • Announcements & Reading Material

    Exams graded and will be returned in Wednesday's class

    This class
    » "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

    Next class – Research Topic 3: Security
    » "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," James Newsome and Dawn Song, Proc. Network and Distributed System Security Symposium, Feb. 2005.

  • Stream Graph Modulo Scheduling

  • Stream Graph Modulo Scheduling (SGMS)

    Coarse grain software pipelining
    » Equal work distribution
    » Communication/computation overlap
    » Synchronization costs

    Target: Cell processor
    » Cores with disjoint address spaces
    » Explicit copy to access remote data
    » DMA engine independent of PEs

    Filters = operations, cores = function units

    [Figure: Cell block diagram – SPE0 … SPE7, each an SPU with a 256 KB local store and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM]

  • Preliminaries

    Synchronous Data Flow (SDF) [Lee '87]
    StreamIt [Thies '02]
    » pop/push/peek rates are declared statically, which is what lets the compiler schedule the stream graph at compile time

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];  /* weighted sum over a window of N inputs */
        pop();
        push(sum);
      }
    }

  • SGMS Overview

    [Figure: SGMS schedule on PE0–PE3 – the single-PE schedule (time T1) is split into stages whose computation and DMA transfers overlap across the four PEs; a prologue fills the pipeline, the steady state repeats with period T4, and an epilogue drains it; T1/T4 ≈ 4]

  • SGMS Phases

    1. Fission + processor assignment – load balance
    2. Stage assignment – causality, DMA overlap
    3. Code generation

  • Processor Assignment: Maximizing Throughput

    Assigns each filter to a processor. With w(i) the workload of filter i and a_ij = 1 iff filter i runs on PE j:

    Minimize II subject to
    » Σ_j a_ij = 1   for all filters i = 1, …, N
    » Σ_i w(i)·a_ij ≤ II   for all PEs j = 1, …, P

    [Figure: stream graph A–F (workloads 20, 20, 20, 30, 50, 30; single-PE time T1 = 170) mapped onto four processing elements so that the maximum per-PE load equals the minimum II of 50 – balanced workload, maximum throughput: T2 = 50, T1/T2 = 3.4]
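
    To make the objective concrete, here is a small brute-force sketch (ours, not the paper's – the paper hands these constraints to an ILP solver): it enumerates every filter→PE assignment for the example workloads above and reports the smallest achievable II, i.e. the minimum over assignments of the maximum per-PE load.

    #include <stdio.h>

    #define N 6  /* filters A..F */
    #define P 4  /* processing elements */

    static const int w[N] = {20, 20, 20, 30, 50, 30};  /* workloads from the figure */

    int main(void) {
        long total = 1;
        for (int i = 0; i < N; i++) total *= P;  /* P^N candidate assignments */

        int best_II = 1 << 30;
        for (long code = 0; code < total; code++) {
            int load[P] = {0};
            long c = code;
            for (int i = 0; i < N; i++) {  /* decode filter -> PE mapping */
                load[c % P] += w[i];
                c /= P;
            }
            int II = 0;  /* II = heaviest PE under this assignment */
            for (int j = 0; j < P; j++)
                if (load[j] > II) II = load[j];
            if (II < best_II) best_II = II;
        }
        printf("minimum II = %d\n", best_II);  /* prints 50 for this graph */
        return 0;
    }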

  • Need More Than Just Processor Assignment

    Assign filters to processors
    » Goal: equal work distribution
    Graph partitioning?
    Bin packing?

    [Figure: original stream program A→B→C→D (workloads 5, 40, 5, 10) split across PE0/PE1 – speedup = 60/40 = 1.5; the modified program fisses B into B1/B2 with split (S) and join (J) filters, rebalancing the load – speedup = 60/32 ≈ 2]

  • Filter Fission Choices

    [Figure: alternative ways of fissing filters across PE0–PE3]

    Speedup ~ 4?

  • Integrated Fission + PE Assign

    Exact solution based on Integer Linear Programming (ILP)
    » Split/join overhead factored in

    Objective function
    » Minimize the maximal load on any PE

    Result
    » Number of times to "split" each filter
    » Filter → processor mapping

  • Step 2: Forming the Software Pipeline

    To achieve speedup
    » All chunks should execute concurrently
    » Communication should be overlapped

    Processor assignment alone is insufficient information

    [Figure: A feeds B across PE0/PE1 – running A and B in lockstep stalls B; unrolling iterations A1, A2, A3, … and inserting A→B DMA transfers lets Ai+2 on PE0 overlap with Bi on PE1]

  • Stage Assignment

    Preserve causality (producer-consumer dependence)
    » If filter i feeds filter j on the same PE: Sj ≥ Si

    Communication-computation overlap
    » If i and j are on different PEs, the DMA gets its own stage: SDMA > Si and Sj = SDMA + 1

    Data flow traversal of the stream graph
    » Assign stages using the above two rules (see the sketch below)
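
    A minimal sketch (our encoding, not the paper's) of the two rules applied during a dataflow traversal of a straight-line graph: a consumer on the same PE may share its producer's stage, while a cross-PE edge spends one stage on the DMA and places the consumer one stage after that.

    #include <stdio.h>

    #define NF 4  /* toy pipeline A -> B -> C -> D */

    int pe[NF] = {0, 0, 1, 1};  /* processor assignment from step 1 */
    int stage[NF];

    int main(void) {
        stage[0] = 0;  /* source filter opens stage 0 */
        for (int j = 1; j < NF; j++) {  /* topological order: j-1 feeds j */
            int i = j - 1;
            if (pe[i] == pe[j])
                stage[j] = stage[i];      /* causality: Sj >= Si */
            else
                stage[j] = stage[i] + 2;  /* SDMA = Si + 1, Sj = SDMA + 1 */
        }
        for (int j = 0; j < NF; j++)
            printf("filter %c: PE %d, stage %d\n", 'A' + j, pe[j], stage[j]);
        return 0;
    }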

  • Stage Assignment Example

    [Figure: the fissed graph from before, spread over PE 0 and PE 1 – Stage 0: A, S, B1; Stage 1: DMA transfers; Stage 2: C, B2, J; Stage 3: DMA; Stage 4: D]

  • Step 3: Code Generation for Cell

    Target the Synergistic Processing Elements (SPEs)
    » PS3 – up to 6 SPEs
    » QS20 – up to 16 SPEs

    One thread / SPE

    Challenge
    » Making a collection of independent threads implement a software pipeline
    » Adapt kernel-only code schema of a modulo schedule

  • Complete Example

    void spe1_work()
    {
      char stage[5] = {0};
      stage[0] = 1;
      for (i = 0; i < iters + 4; i++) {  /* +4 iterations drain the 5-stage pipe */
        if (stage[0]) { /* run this SPE's stage-0 filters, start their DMAs */ }
        /* ... stage[1] through stage[4] guarded the same way ... */
        barrier();  /* all SPEs advance in lockstep */
        for (j = 4; j > 0; j--) stage[j] = stage[j - 1];
        stage[0] = (i + 1 < iters);  /* predicates fill the pipe, then drain it */
      }
    }

  • SGMS (ILP) vs. Greedy

    [Chart: relative speedup (0–9) on benchmarks bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar – series: ILP partitioning, greedy partitioning (MIT method, ASPLOS '06), exposed DMA]

    • Solver time < 30 seconds for 16 processors

  • SGMS Conclusions

    Streamroller
    » Efficient mapping of stream programs to multicore
    » Coarse grain software pipelining

    Performance summary
    » 14.7x speedup on 16 cores
    » Up to 35% better than greedy solution (11% on average)

    Scheduling framework
    » Tradeoff memory space vs. load balance
    » Memory constrained (embedded) systems vs. cache based systems

  • Discussion Points

    Is it possible to convert stateful filters into stateless ones?

    What if the application does not behave as you expect?
    » Filters change execution time?
    » Memory faster/slower than expected?

    Could this be adapted for a more conventional multiprocessor with caches?

    Can C code be automatically streamized?

    Now you have seen 3 forms of software pipelining:
    » 1) Instruction-level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
    » Where else can it be used?

  • Compilation for GPUs

  • Why GPUs?

    [Chart: theoretical GFLOP/s, 2002–2011 (0–1500 scale) – the NVIDIA GPU curve (GeForce FX 5800, 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) climbs far faster than the Intel CPU curve]

  • Efficiency of GPUs

    High memory bandwidth
    » GTX 285: 159 GB/s vs. i7: 32 GB/s

    High flop rate
    » GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS

    High DP flop rate
    » GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS

    High flop per watt
    » GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W

    High flop per dollar
    » GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$

  • GPU Architecture

    [Figure: SM 0 … SM 29, each with shared memory, registers, and scalar cores 0–7, connected through an interconnection network to global (device) memory; a PCIe bridge links the device to the host CPU and host memory]

  • CUDA

    "Compute Unified Device Architecture"

    General purpose programming model
    » User kicks off batches of threads on the GPU

    Advantages of CUDA
    » Interface designed for compute – graphics-free API
    » Orchestration of on-chip cores
    » Explicit GPU memory management
    » Full support for integer and bitwise operations

  • Programming Model

    [Figure: the host launches Kernel 1, which executes on the device as Grid 1; a later Kernel 2 executes as Grid 2 – time flows downward; see the sketch below]
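
    A minimal CUDA sketch (ours, not the slides') of this model: the host launches a batch of threads – a grid of blocks all running the same kernel – then waits for the device. The kernel name and sizes are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
        if (i < n) x[i] *= a;                           // one element per thread
    }

    int main() {
        const int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));   // explicit GPU memory management
        cudaMemset(d_x, 0, n * sizeof(float)); // (real code would cudaMemcpy inputs in)
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);  // grid of 4096 blocks x 256 threads
        cudaDeviceSynchronize();               // host waits for the grid to finish
        cudaFree(d_x);
        return 0;
    }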

  • GPU Scheduling

    [Figure: the thread blocks of Grid 1 are distributed across SM 0 … SM 30, each SM with its own shared memory, registers, and cores 0–7]

  • Warp Generation

    [Figure: blocks 0–3 assigned to SM0 (shared memory, registers, cores 0–7); within a block, threads with ThreadId 0–31 form Warp 0 and threads 32–63 form Warp 1 – see the sketch below]
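
    A small illustration (ours): consecutive threads of a block are grouped into 32-thread warps, so each thread can derive its warp and lane from its linear thread index.

    __global__ void warp_info(int *warp_of, int *lane_of) {
        int tid = threadIdx.x;
        warp_of[tid] = tid / 32;  // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
        lane_of[tid] = tid % 32;  // position within the warp
    }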

  • Memory Hierarchy

    Per-thread register
    » int RegisterVar

    Per-thread local memory
    » int LocalVarArray[10]

    Per-block shared memory
    » __shared__ int SharedVar

    Per-app global memory
    » __device__ int GlobalVar

    Per-app constant memory
    » __constant__ int ConstVar

    Per-app texture memory
    » texture TextureVar

    [Figure: Thread 0 with its registers and local memory, inside Block 0 with its shared memory, inside Grid 0; global, constant, and texture memory are per-application and accessible from the host]
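
    A compilable sketch (variable names from the slide) combining these qualifiers. Note that a global-memory variable is declared __device__, since __global__ is reserved for kernels; the legacy texture declaration is omitted.

    #include <cuda_runtime.h>

    __device__   int GlobalVar;  // per-app, global (device) memory
    __constant__ int ConstVar;   // per-app, cached constant memory

    __global__ void kernel(int *out) {
        __shared__ int SharedVar;       // per-block, on-chip shared memory
        int RegisterVar = threadIdx.x;  // per-thread, typically in a register
        int LocalVarArray[10];          // per-thread, may spill to local memory
        LocalVarArray[RegisterVar % 10] = GlobalVar + ConstVar;
        if (threadIdx.x == 0) SharedVar = 1;
        __syncthreads();                // make the shared write visible
        out[threadIdx.x] = LocalVarArray[RegisterVar % 10] + SharedVar;
    }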