IPDPS 2011 presentation slides — wongwf/papers/ipdps11_ppt.pdf (NUS Computing)

TRANSCRIPT

  • 2

    (Overview diagram: the application, the programming model — kernels, blocks, warps — and the GPU, connected by "Tune" and "Transform" steps.)

  • 3

    GPU architecture
    ◦ SIMD constraints
    ◦ Memory hierarchy

    Automated mapping
    ◦ GPU model
    ◦ Program transformations

    Our approach

    (Overview diagram repeated from slide 2: application, programming model — kernels, blocks, warps — and GPU.)

  • 4

    Stream graph: parallel instances of the entire graph.

    (Diagram: the graph instances read from memory and write back to memory.)

  • 5

    Stream graph: parallel instances of the entire graph; novel memory access scheme.

    (Diagram: the graph instances read from memory and write back to memory.)

  • 6

    Stream graph: parallel instances of the entire graph; novel memory access scheme; utilize fine-grained parallelism (GPU threads).

  • 7

    Hierarchical stream graph
    ◦ Well-defined rates
    ◦ Pipelines
    ◦ Splitters / joiners
    ◦ Mostly stateless filters

    StreamIt compiler
    ◦ Schedules
    ◦ Flattens
    ◦ Analyzes

    Peeking
    ◦ Alternative to filters with state
    ◦ Allows access to input consumed by a future iteration

    (Diagram: a stream graph with a pipeline, a splitter and a joiner.)

  • 8

    Udupa et al. (CGO 2009), "Software Pipelined Execution of Stream Programs on GPUs"
    ◦ No memory prefetching
    ◦ Pipelines the computation

    Hormati et al. (ASPLOS 2011), "Sponge: Portable Stream Programming on Graphics Engines"
    ◦ Memory access scheme
    ◦ Memory traffic not fully optimized (filters fused only partially)
    ◦ No compression on multiple threads

  • 9

    Global (GM) memory
    ◦ Large, slow, visible to threads on all SM processors

    Shared (SM) memory
    ◦ Small, fast, visible to threads on one SM processor

    Other memories, subsets of GM
    ◦ Local memory – for register spilling
    ◦ Texture memory – different access pattern

    (Diagram: CPU input and output staged through GM memory; each SM processor's compute kernel works out of its SM memory.)
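
    A minimal CUDA sketch of the GM / SM distinction above: a block stages a tile of its input from global memory into fast __shared__ (per-SM) memory, synchronizes, and computes out of the shared copy. The kernel and the trivial computation are illustrative only, and it assumes a launch with TILE threads per block.

    #define TILE 256

    __global__ void tile_kernel(const float *in, float *out, int n)
    {
        __shared__ float ws[TILE];              // data held in SM (shared) memory

        int idx = blockIdx.x * TILE + threadIdx.x;

        if (idx < n)
            ws[threadIdx.x] = in[idx];          // GM -> SM (coalesced load)
        __syncthreads();                        // whole block sees the staged tile

        if (idx < n)
            out[idx] = 2.0f * ws[threadIdx.x];  // compute from SM memory, store to GM
    }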

  • 10

    Array of SIMD processors (SM processors)
    ◦ Each handles a large pool of threads, grouped in warps

    Interleaved execution for high throughput
    ◦ Operation latency (22 cycles)
    ◦ Memory latency (~400 cycles)

    (Diagram: Core 1 … Core N of an SM processor, each interleaving warps 0, 1, 2.)
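
    (A rough back-of-envelope from the figures on this slide, not a claim made in the talk: to hide a ~400-cycle memory access behind ~22-cycle operations, an SM processor needs on the order of 400 / 22 ≈ 18 other warps with work ready to issue; with fewer resident warps the memory latency shows up as stalls.)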

  • 11

    (Diagram: the vicious cycle — influenced by the ratio of memory access to computation; bandwidth limitation; 400-cycle memory latency.)

  • 12

  • 13

    Stream graph: parallel instances of the entire graph; novel memory access scheme; utilize fine-grained parallelism.

  • 14

    Misconception: threads should not diverge
    ◦ True only for threads belonging to the same warp

    Warp:
    ◦ SIMD execution model
    ◦ Static thread allocation (based on thread ID)

    No penalty for this CUDA code:

    if (threadIdx.x < warpSize) {
        compute_action();
    } else if (threadIdx.x >= warpSize && threadIdx.x < 2 * warpSize) {
        memory_access_action();
    } else if …

  • 15

    Insufficient memory-access warps limit performance.

    (Plot — bitonic sort on the G8800 and S1070 GPUs: insufficient to sustain the required bandwidth; additional warps for memory access.)

  • 16

    Software pipelining a group of iterations in each SM processor.

    I/O streams in GM memory.

    (Diagram: input streams IN(i-1), IN(i), IN(i+1) and output streams OUT(i-1), OUT(i), OUT(i+1) in GM memory; the SM processor with its SM memory.)

  • 17

    Software pipelining a group of iterations in each SM processor.

    I/O streams in GM memory.

    Double buffer (DB) for I/O exchange.

    (Diagram: as on slide 16, with a double buffer DB in SM memory between the GM streams and the SM processor.)

  • 18

    Software pipelining a group of iterations in each SM processor.

    I/O streams in GM memory.

    Double buffer (DB) for I/O exchange.

    Workset (WS) must fit in SM memory
    ◦ I/O data
    ◦ All intermediate stream data

    Limited number of iterations.

    (Diagram: GM memory holds IN(i-1..i+1) and OUT(i-1..i+1); SM memory holds the double buffer DB and the workset WS with IN(i) and OUT(i).)
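
    A minimal, generic CUDA sketch of the DB / WS organization on slides 16–18: each block walks through its group of stream iterations, exchanging data with GM memory through a double buffer (db) and computing entirely out of its shared-memory workset (ws). All names and sizes are illustrative, this is not the paper's generated code, and it assumes a launch with WS_ELEMS threads per block.

    #define WS_ELEMS 256   // illustrative: workset I/O elements per iteration

    // Hypothetical stand-in for the fused filter code: runs in place on the workset.
    __device__ void run_filters(float *ws)
    {
        ws[threadIdx.x] += 1.0f;
    }

    __global__ void stream_block(const float *in, float *out, int iterations)
    {
        __shared__ float db[WS_ELEMS];   // double buffer: GM <-> SM exchange area
        __shared__ float ws[WS_ELEMS];   // workset: I/O plus intermediate stream data

        int t = threadIdx.x;

        for (int i = 0; i < iterations; ++i) {
            size_t off = ((size_t)blockIdx.x * iterations + i) * WS_ELEMS + t;

            db[t] = in[off];             // LOAD: GM -> DB (coalesced)
            __syncthreads();
            ws[t] = db[t];               // stage the input into the workset
            __syncthreads();

            run_filters(ws);             // COMPUTE: filters work only in SM memory
            __syncthreads();

            db[t] = ws[t];               // move the results to the exchange buffer
            __syncthreads();
            out[off] = db[t];            // STORE: DB -> GM
        }
    }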

  • 19

    Prefetching
    ◦ 3x COMPUTE warps
    ◦ 3x WS memory

    (Timeline diagram: warps 0, 1 and 2 each run their own LOAD → COMPUTE → STORE sequence, staggered in time.)

  • 20

    Prefetching
    ◦ 3x COMPUTE warps
    ◦ 3x WS memory

    Specialization
    ◦ 1x COMPUTE warp
    ◦ 1x (WS + DB) memory

    (Timeline diagrams: with prefetching, warps 0–2 each run staggered LOAD → COMPUTE → STORE sequences; with specialization, one warp runs back-to-back COMPUTEs while the other warps handle the LOADs and STOREs.)
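
    A sketch of the warp specialization shown above, with a double-buffered hand-over between a COMPUTE warp, a LOAD warp and a STORE warp. As on slide 14, the branch is at warp granularity, so there is no divergence penalty. Everything here (tile size, helper functions, the plain __syncthreads() hand-over) is an illustration of the idea, not the paper's generated kernel or its custom barrier; it assumes iterations >= 1 and a launch with at least three warps per block.

    #define TILE 256   // illustrative elements per iteration

    __device__ void load_tile(float *sm_dst, const float *gm_src, int iters, int i)
    {
        int lane = threadIdx.x % warpSize;               // one warp cooperates on the copy
        for (int k = lane; k < TILE; k += warpSize)
            sm_dst[k] = gm_src[((size_t)blockIdx.x * iters + i) * TILE + k];
    }

    __device__ void store_tile(float *gm_dst, const float *sm_src, int iters, int i)
    {
        int lane = threadIdx.x % warpSize;
        for (int k = lane; k < TILE; k += warpSize)
            gm_dst[((size_t)blockIdx.x * iters + i) * TILE + k] = sm_src[k];
    }

    __device__ void compute_tile(const float *in_ws, float *out_ws)
    {
        int lane = threadIdx.x % warpSize;
        for (int k = lane; k < TILE; k += warpSize)
            out_ws[k] = 2.0f * in_ws[k];                 // stand-in for the filter work
    }

    __global__ void specialized_kernel(const float *in, float *out, int iterations)
    {
        __shared__ float in_buf [2][TILE];               // double-buffered inputs
        __shared__ float out_buf[2][TILE];               // double-buffered outputs

        int warp = threadIdx.x / warpSize;               // static role assignment by warp ID

        if (warp == 1) load_tile(in_buf[0], in, iterations, 0);   // prologue: fetch iteration 0
        __syncthreads();

        for (int i = 0; i < iterations; ++i) {
            if (warp == 0) {                             // COMPUTE warp
                compute_tile(in_buf[i & 1], out_buf[i & 1]);
            } else if (warp == 1) {                      // LOAD warp: prefetch the next input
                if (i + 1 < iterations)
                    load_tile(in_buf[(i + 1) & 1], in, iterations, i + 1);
            } else if (warp == 2) {                      // STORE warp: drain the previous output
                if (i > 0)
                    store_tile(out, out_buf[(i - 1) & 1], iterations, i - 1);
            }
            __syncthreads();                             // hand the buffers over between roles
        }

        if (warp == 2)                                   // epilogue: last iteration's output
            store_tile(out, out_buf[(iterations - 1) & 1], iterations, iterations - 1);
    }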

  • 21

    Stream graph: parallel instances of the entire graph; novel memory access scheme; utilize fine-grained parallelism (GPU threads).

  • 22

    Multiple threads per stream iteration
    ◦ Distributed schedule

    Synchronization
    ◦ Lock-step execution of warp threads

    (Diagram: the work items 0–9 of one iteration distributed across a group of threads.)
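
    A small illustration (mine, with a hypothetical split of 2 threads per iteration) of distributing one stream iteration across a group of GPU threads: consecutive groups of threads each execute one iteration, and the filter's work items are divided round-robin within the group.

    #define THREADS_PER_ITER 2   // hypothetical: 2 GPU threads cooperate on one iteration
    #define ITEMS 10             // work items produced by one firing of the filter

    __global__ void fire_filter(const float *in, float *out, int iterations)
    {
        int iter = (blockIdx.x * blockDim.x + threadIdx.x) / THREADS_PER_ITER;
        int slot = threadIdx.x % THREADS_PER_ITER;    // this thread's position in its group
        if (iter >= iterations) return;

        const float *src = in  + (size_t)iter * ITEMS;
        float       *dst = out + (size_t)iter * ITEMS;
        for (int k = slot; k < ITEMS; k += THREADS_PER_ITER)
            dst[k] = 0.5f * src[k];                   // stand-in for the filter's work function

        // On the GPUs in the talk, warp threads run in lock-step, so the group needs no
        // explicit barrier before the next filter; on current GPUs __syncwarp() would go here.
    }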

  • 23

    (Plot — FilterBank on the G8800 and S1070 GPUs: performance with 1, 2 and 4 threads per iteration.)

  • 24

    (Flow diagram of the mapping heuristic, relating: workset size, SM memory and the number of iterations; I/O size, GM latency and the memory-access threads; the filter execution schedule, GPU SIMD constraints and the threads per iteration; the compute warps and memory-access warps, each adjusted up to a full warp.)
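
    A rough host-side sketch of how a configuration heuristic of this shape could look. Every formula and name below is an illustrative guess, not the heuristic from the paper; it only shows the shape of the computation — bounding the iterations by SM memory and rounding both thread pools up to full warps.

    #include <algorithm>
    #include <cstdio>

    struct Mapping {
        int iterations;      // stream iterations pipelined per SM processor
        int compute_warps;   // warps running filter code
        int mem_warps;       // warps dedicated to GM <-> SM transfers
    };

    Mapping choose_mapping(int sm_mem_bytes, int workset_bytes, int io_bytes_per_iter,
                           int threads_per_iter, int warp_size)
    {
        Mapping m;
        // Iterations limited by how many worksets plus exchange buffers fit in SM memory.
        m.iterations = std::max(1, sm_mem_bytes / (workset_bytes + io_bytes_per_iter));

        // Compute warps: threads per iteration times iterations, rounded up to full warps.
        int compute_threads = threads_per_iter * m.iterations;
        m.compute_warps = (compute_threads + warp_size - 1) / warp_size;

        // Memory-access warps: enough threads to keep the per-iteration I/O in flight,
        // again rounded up to a full warp (one warp minimum).
        int mem_threads = io_bytes_per_iter / 4;       // e.g. one 4-byte word per thread
        m.mem_warps = std::max(1, (mem_threads + warp_size - 1) / warp_size);
        return m;
    }

    int main()
    {
        Mapping m = choose_mapping(16 * 1024, 2048, 1024, 2, 32);
        std::printf("iterations=%d compute_warps=%d mem_warps=%d\n",
                    m.iterations, m.compute_warps, m.mem_warps);
        return 0;
    }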

  • 25

    (Toolchain diagram: the stream program is compiled by the StreamIt compiler into a schedule and operators; a GPU kernel mapping stage produces the workset layout and, using the target GPU specifications and the mapping configuration, injects code for CUDA execution; the resulting operator code is built by the GPU compiler and launched through a kernel loader.)

  • 26

    (The same toolchain diagram as slide 25.)

  • 27

    Fragmentation of the workset allocation
    ◦ Small buffers are required between filters

    Liveness analysis: estimation of the workset size.

    Fragmentation: the actual allocation may lead to a slight increase.

    Coalesced memory access.
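
    A toy sketch (mine, not the paper's allocator) of laying out the inter-filter buffers inside the workset by greedy placement over their live ranges. It shows why fragmentation can push the actual allocation slightly above the liveness estimate: a freed hole may be too small for the next buffer.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Each buffer is live from its producer's firing to its consumer's firing;
    // buffers whose live ranges do not overlap may share the same workset offset.
    struct Buf { int size, first_use, last_use, offset; };

    int layout_workset(std::vector<Buf>& bufs)
    {
        // Greedy placement in firing order; the gaps left behind are the fragmentation.
        std::sort(bufs.begin(), bufs.end(),
                  [](const Buf& a, const Buf& b) { return a.first_use < b.first_use; });
        std::vector<const Buf*> placed;
        int workset_size = 0;
        for (Buf& b : bufs) {
            int off = 0;
            for (bool moved = true; moved; ) {            // slide past every live, overlapping buffer
                moved = false;
                for (const Buf* p : placed)
                    if (p->last_use >= b.first_use &&     // still live when b becomes live
                        off < p->offset + p->size && p->offset < off + b.size) {
                        off = p->offset + p->size;        // bump past the conflict and rescan
                        moved = true;
                    }
            }
            b.offset = off;
            placed.push_back(&b);
            workset_size = std::max(workset_size, off + b.size);
        }
        return workset_size;                              // may exceed the liveness lower bound
    }

    int main()
    {
        // Simultaneously live data never exceeds 128 bytes, but the layout needs 192:
        // the 64-byte hole left by the first buffer is too small for the 96-byte one.
        std::vector<Buf> bufs = { {64, 0, 2, 0}, {32, 1, 3, 0}, {96, 3, 4, 0} };
        std::printf("workset bytes = %d\n", layout_workset(bufs));
        return 0;
    }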

  • 28

    Inter-iteration dependencies

    Overlap input data to reconstruct the initial elements
    ◦ For each SM processor
    ◦ For each parallel thread group

    Intuition:
    ◦ Warm-up intermediate buffers
    ◦ Threads access previous iterations

    Custom synchronization
    ◦ Only between compute threads
    ◦ Implemented custom barrier
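
    The custom barrier itself is not shown in the slides. Below is a generic block-local barrier over only the compute warps (a shared counter plus a sense flag, with one lane per warp taking part in the protocol), given purely as an illustration of the idea, not as the authors' implementation; the fencing is deliberately simplified. count and sense live in __shared__ memory and must be initialized to 0, and each warp keeps its own warp_sense, also starting at 0.

    __device__ void compute_barrier(volatile int *count, volatile int *sense,
                                    int *warp_sense, int num_compute_warps)
    {
        __threadfence_block();                         // publish prior shared-memory writes
        int lane = threadIdx.x % warpSize;
        if (lane == 0) {
            int s = 1 - *warp_sense;                   // flip this warp's expected sense
            *warp_sense = s;
            if (atomicAdd((int *)count, 1) == num_compute_warps - 1) {
                *count = 0;                            // last compute warp resets and releases
                *sense = s;
            } else {
                while (*sense != s) { }                // spin until released
            }
        }
        __syncwarp();                                  // bring the warp's other lanes along
    }

    A compute warp would call this between dependent filter firings; the memory-access warps never enter it, which is exactly the point of synchronizing only the compute threads.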

  • 29

    (Results plot: comparison versus CGO 2009.)

  • 30

    (Results plot: different GPU architectures, versus CGO 2009.)

  • 31

    Novel scheme to execute stream graphs on GPU.

    Automatic heuristic for selecting efficient design points.

    Novel memory access scheme through software pipelining.

  • 32

    Questions?