Overview of Ocelot: Architecture

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OVERVIEW OF OCELOT: ARCHITECTURE


Page 1: Overview of Ocelot: architecture


OVERVIEW OF OCELOT: ARCHITECTURE

Page 2: Overview of Ocelot: architecture

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Page 3: Overview of Ocelot: architecture

Ocelot: Multiplatform Dynamic Compilation

• Just-in-time code generation and optimization for data-intensive applications
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools

[Figure labels: Language Front-End; Data Parallel IR. Credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]

Page 4: Overview of Ocelot: architecture

NVIDIA’s Compute Unified Device Architecture (CUDA)

• Integrate the concept of a compute kernel called from standard languages
  • Multithreaded host programs
  • The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer

Page 5: Overview of Ocelot: architecture

NVIDIA’s Compute Unified Device Architecture (CUDA)

[Figure labels: Host; GPU]

For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

Page 6: Overview of Ocelot: architecture

Structure of a Compute Kernel

• Arrays of (data-parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to single instruction stream, multiple data stream (SIMD) processor
• Parallel Thread Execution (PTX) instruction set architecture
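To make the CTA structure concrete, the following is a minimal serial sketch in plain C++ of how a grid of CTAs covers a data-parallel index space. It is not Ocelot or CUDA code: `LaunchConfig` and `runSequenceKernel` are hypothetical names, and the kernel body (writing `i + 1` to element `i`) is only an illustrative assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical launch configuration: a 1-D grid of 1-D CTAs.
struct LaunchConfig {
    std::size_t ctaCount;        // number of CTAs in the grid
    std::size_t threadsPerCta;   // number of threads in each CTA
};

// Serial emulation of a data-parallel kernel. Each (CTA, thread) pair
// handles at most one element and is guarded against running past n.
std::vector<int> runSequenceKernel(const LaunchConfig& config, std::size_t n) {
    assert(config.threadsPerCta > 0);
    std::vector<int> out(n, 0);
    for (std::size_t cta = 0; cta < config.ctaCount; ++cta) {
        for (std::size_t thread = 0; thread < config.threadsPerCta; ++thread) {
            // Global thread index, as a kernel would derive it from
            // its CTA id and its thread id within the CTA.
            std::size_t i = cta * config.threadsPerCta + thread;
            if (i < n) out[i] = static_cast<int>(i) + 1;
        }
    }
    return out;
}
```

Calling `runSequenceKernel(LaunchConfig{3, 4}, 10)` launches 12 logical threads for 10 elements, so the bounds guard leaves the last two threads idle. On real hardware the two loops run in parallel rather than in sequence.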

Page 7: Overview of Ocelot: architecture

NVIDIA Fermi GF100

• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each SM has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own Shared Memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit Memory Controllers (384 bits wide in total)

[Figure labels: ALU; Streaming multiprocessor (SM)]
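The totals quoted on this slide follow from the per-unit figures by plain arithmetic. The short C++ sketch below checks them at compile time; the constant and function names are made up for illustration, and only the numbers come from the slide.

```cpp
#include <cassert>
#include <cstddef>

// Per-unit figures as stated on the slide.
constexpr std::size_t kGpcs = 4;
constexpr std::size_t kSmsPerGpc = 4;
constexpr std::size_t kAlusPerSm = 32;
constexpr std::size_t kRegistersPerAlu = 1024;
constexpr std::size_t kBytesPerRegister = 4;   // 32-bit registers
constexpr std::size_t kMemoryControllers = 6;
constexpr std::size_t kBitsPerController = 64;

// Derived totals.
constexpr std::size_t totalSms() { return kGpcs * kSmsPerGpc; }
constexpr std::size_t totalAlus() { return totalSms() * kAlusPerSm; }
constexpr std::size_t registerFileBytesPerSm() {
    return kAlusPerSm * kRegistersPerAlu * kBytesPerRegister;
}
constexpr std::size_t memoryBusBits() {
    return kMemoryControllers * kBitsPerController;
}

static_assert(registerFileBytesPerSm() == 128 * 1024, "128 kB of registers per SM");
static_assert(memoryBusBits() == 384, "384-bit memory interface");
```

The same figures give 16 SMs and 512 ALUs for the whole chip.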

Page 8: Overview of Ocelot: architecture

Ocelot Structure [1]

• Ocelot is built with nvcc and the LLVM backend
• Structured around a PTX IR to LLVM IR translator
• Compile stock CUDA applications without modification

[Figure labels: CUDA Application; nvcc; PTX Kernel]

[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.

Page 9: Overview of Ocelot: architecture

CUDA to PTX

• PTX modules stored as string literals in fat binary
• We ignore accompanying binary image (GPU native binary)

Page 10: Overview of Ocelot: architecture

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Page 11: Overview of Ocelot: architecture

Dependencies

Software
• C++ Compiler (GCC 4.5.x)
• Lex Lexer Generator (Flex 2.5.35)
• YACC Parser Generator (Bison 2.4.1)
• Scons (Python 2.7)
• LLVM (3.1)

Libraries
• boost_system (1.46)
• boost_filesystem (1.46)
• boost_serialization (1.46)
• GLEW (1.5) (optional, for GL interop)
• GL (for NVIDIA GPU Devices)

Library headers
• Boost (1.46)

http://code.google.com/p/gpuocelot/wiki/Installation

Page 12: Overview of Ocelot: architecture

Ocelot Source Code

• Freely available via Google Code project site (New BSD License)
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

http://code.google.com/p/gpuocelot/

    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only

Page 13: Overview of Ocelot: architecture

Building GPU Ocelot

• Obtain source code
    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with Scons
    sudo ./build.py --install
• Build and execute unit tests
    sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directory:
  • /usr/local/include/ocelot
  • /usr/local/lib

http://code.google.com/p/gpuocelot/wiki/Installation

Page 14: Overview of Ocelot: architecture

Configuring Ocelot

configure.ocelot
• Controls Ocelot’s initial state
• Located in application’s startup directory
• trace specifies which trace generators are initially attached
• executive controls device properties

trace:
• memoryChecker - ensures
• raceDetector - enforces synchronized access to .shared
• debugger - interactive debugger

executive:
• devices: list of Ocelot backend devices that are enabled
  • nvidia - NVIDIA GPU backend
  • emulated - Ocelot PTX emulator (trace generators)
  • llvm - efficient execution of PTX on multicore CPU
  • amd - translation to AMD IL for PTX on AMD RADEON GPU

Example configure.ocelot:

    {
        trace: {
            memoryChecker: {
                enabled: true,
                checkInitialization: false
            },
            raceDetector: {
                enabled: false,
                ignoreIrrelevantWrites: true
            },
            debugger: {
                enabled: false,
                kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
                alwaysAttach: true
            },
        },
        executive: {
            devices: [ "emulated" ],
        }
    }

Page 15: Overview of Ocelot: architecture

Building and Executing CUDA Programs

    nvcc -c example.cu -arch sm_23
    g++ -o example example.o `OcelotConfig -l`

• `OcelotConfig -l` expands to ‘-locelot’
• libocelot.so replaces libcudart.so

Page 16: Overview of Ocelot: architecture

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Page 17: Overview of Ocelot: architecture

CUDA Runtime API

• Ocelot implements CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • override methods of cuda::CudaDeviceInterface
  • Maps CUDA RT onto Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through custom Ocelot API
  • e.g. ocelot::registerPTXModule( );

Page 18: Overview of Ocelot: architecture

Ocelot CUDA Runtime Overview

• Kernels execute anywhere
  • Key to portability!
• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
• Link against libocelot.so instead of libcudart

[Figure credit: R. Domingo & D. Kaeli (NEU)]

Page 19: Overview of Ocelot: architecture

Ocelot CUDA Runtime

• Clean device abstraction
  • All back-ends implement same interface
• Ocelot API Extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching

Page 20: Overview of Ocelot: architecture

Ocelot Source Code: CUDA Runtime API

• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

Page 21: Overview of Ocelot: architecture

Ocelot CUDA Runtime API Implementation

• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()

Page 22: Overview of Ocelot: architecture

Ocelot Source Code: Device Interface

• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h

Page 23: Overview of Ocelot: architecture

Ocelot Device Interface

• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and Events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • 57 functions (versus CUDA Runtime’s 120+)
• Capture device state:
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface
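The idea of one device interface shared by all back ends and targeted by every front end can be sketched in a few lines of C++. This is a deliberately tiny illustration, not the real executive::Device: the class names, the three methods, and the returned strings are all hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical, much-reduced device interface (the real one is far larger).
class Device {
public:
    virtual ~Device() {}
    virtual std::string name() const = 0;
    virtual void load(const std::string& ptxModule) = 0;          // module registration
    virtual std::string launch(const std::string& kernel) = 0;    // kernel launch
};

// Back end 1: pretends to interpret PTX directly.
class EmulatedDevice : public Device {
public:
    std::string name() const override { return "emulated"; }
    void load(const std::string& ptxModule) override { _module = ptxModule; }
    std::string launch(const std::string& kernel) override {
        return "interpreting " + kernel + " from " + _module;
    }
private:
    std::string _module;
};

// Back end 2: pretends to translate PTX to native CPU code first.
class MulticoreDevice : public Device {
public:
    std::string name() const override { return "llvm"; }
    void load(const std::string& ptxModule) override { _module = ptxModule; }
    std::string launch(const std::string& kernel) override {
        return "translating " + kernel + " from " + _module;
    }
private:
    std::string _module;
};

// A front end only depends on Device, so it works with any back end.
std::string runOn(Device& device, const std::string& module, const std::string& kernel) {
    assert(!kernel.empty());
    device.load(module);
    return device.launch(kernel);
}
```

Because `runOn` never names a concrete back end, adding a new execution target in this sketch means writing one more subclass and nothing else.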

Page 24: Overview of Ocelot: architecture

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Page 25: Overview of Ocelot: architecture

Ocelot PTX Intermediate Representation (IR)

• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
  • LLVM (x86/PowerPC/ARM)
  • CAL (AMD GPUs)

[Figure label: PTX Kernel]

Page 26: Overview of Ocelot: architecture

Ocelot Source Code: Intermediate Representation

• ocelot/
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- parser (to PTX)
    • interface/PTXParser.h

Page 27: Overview of Ocelot: architecture

Ocelot PTX Internal Representation

• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX Parser target, Emitter source
  • ir::PTXInstruction::valid( )
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator

Page 28: Overview of Ocelot: architecture

Ocelot PTX IR: Kernels

    .global .f32 globalVariable;

    .entry sequence (
        .param .u64 __cudaparm_sequence_A,
        .param .s32 __cudaparm_sequence_N)
    {
        .reg .u32 %r<11>;
        .reg .u64 %rd<6>;
        .local .u32 %rp0;

        . . . . . .

    $LDWbegin_sequence:
        ld.param.s32 %r6, [__cudaparm_sequence_N];
        setp.le.s32 %p1, %r6, %r5;
        @%p1 bra $Lt_0_1026;
        . . . . . .
    $Lt_0_1026:
        exit;
    $LDWend_sequence:
    } // sequence

[Figure labels: ir::Module; ir::Kernel; ir::BasicBlock; ir::Local; ir::Parameter; ir::Global]

Page 29: Overview of Ocelot: architecture

Ocelot PTX IR: Instructions

    add.s32 %r7, %r5, 1;
    ld.param.u64 %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32 %rd2, %r5;
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st.global.s32 [%rd4+0], %r7;
    @%p1 bra $Lt_0_6146;

[Figure labels: ir::BasicBlock; ir::PTXInstruction (opcode, addressSpace, dataType, d, a); ir::PTXOperand (addressMode: address, register, immediate, indirect, label); guard predicate]
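The relationship between an instruction, its operands, and their address modes can be illustrated with a small stand-alone C++ sketch. The `Operand` and `Instruction` structs below are hypothetical simplifications that only borrow field names from the slide (opcode, d, a, address mode, guard predicate). They are not the actual ir::PTXInstruction or ir::PTXOperand classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified operand: how it prints depends on its address mode.
struct Operand {
    enum class Mode { Register, Immediate, Address, Indirect, Label };
    Mode mode;
    std::string name;   // register, symbol, or label name
    long value;         // immediate value, or byte offset for Indirect

    std::string toString() const {
        switch (mode) {
        case Mode::Register:  return "%" + name;
        case Mode::Immediate: return std::to_string(value);
        case Mode::Address:   return "[" + name + "]";
        case Mode::Indirect:  return "[%" + name + "+" + std::to_string(value) + "]";
        case Mode::Label:     return name;
        }
        return "";
    }
};

// Hypothetical, simplified two-operand instruction.
struct Instruction {
    std::string opcode;      // e.g. "ld", "st"
    std::string modifiers;   // address space and data type, e.g. ".param.u64"
    Operand d;               // destination operand
    Operand a;               // source operand
    std::string guard;       // guard predicate register, empty if unguarded

    std::string toString() const {
        std::string text;
        if (!guard.empty()) text += "@%" + guard + " ";
        text += opcode + modifiers + " " + d.toString() + ", " + a.toString() + ";";
        return text;
    }
};

// Builds the parameter load shown on the slide.
Instruction exampleLoad() {
    return Instruction{"ld", ".param.u64",
        Operand{Operand::Mode::Register, "rd1", 0},
        Operand{Operand::Mode::Address, "__cudaparm_sequence_A", 0},
        ""};
}
```

Keeping the address mode on the operand rather than on the instruction is what lets a single emitter print loads, stores, and arithmetic in this sketch without per-opcode formatting code.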

Page 30: Overview of Ocelot: architecture

Control and Data-Flow Graphs

• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
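As a rough illustration of basic blocks with fall-through and branch edges and of a post-order traversal, here is a self-contained C++ sketch. The `Block` and `Cfg` types and the three-block example graph are hypothetical and far simpler than ir::ControlFlowGraph.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical basic block: label, instruction vector, and two outgoing edges.
struct Block {
    std::string label;
    std::vector<std::string> instructions;
    int fallThrough;   // index of the fall-through successor, -1 if none
    int branch;        // index of the branch target, -1 if none
};

struct Cfg {
    std::vector<Block> blocks;   // blocks[0] is the entry block

    // Depth-first post-order: a block is emitted after all of its successors.
    std::vector<std::string> postOrder() const {
        std::vector<std::string> order;
        std::vector<bool> visited(blocks.size(), false);
        if (!blocks.empty()) visit(0, visited, order);
        return order;
    }

    void visit(int index, std::vector<bool>& visited,
               std::vector<std::string>& order) const {
        if (index < 0) return;   // no edge
        assert(static_cast<std::size_t>(index) < blocks.size());
        if (visited[index]) return;
        visited[index] = true;
        visit(blocks[index].fallThrough, visited, order);
        visit(blocks[index].branch, visited, order);
        order.push_back(blocks[index].label);
    }
};

// Entry either falls through to "body" or branches straight to "exit".
Cfg exampleCfg() {
    Cfg cfg;
    cfg.blocks.push_back(Block{"entry", {"setp", "bra"}, 1, 2});
    cfg.blocks.push_back(Block{"body", {"add", "st"}, 2, -1});
    cfg.blocks.push_back(Block{"exit", {"exit"}, -1, -1});
    return cfg;
}
```

For the example graph, `exampleCfg().postOrder()` yields the labels in the order exit, body, entry. Reversing that sequence gives a topological order for an acyclic graph like this one.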

Page 31: Overview of Ocelot: architecture

Example: Control-Flow Graphs

    // example: splits basic blocks containing barriers
    //
    for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
        bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks

        unsigned int n = 0;
        ir::BasicBlock::InstructionList::iterator inst_it;

        for (inst_it = (bb_it)->instructions.begin();
            inst_it != (bb_it)->instructions.end();
            ++inst_it, n++) { // iterate over instructions in *bb_it

            const ir::PTXInstruction *inst =
                static_cast< const ir::PTXInstruction *>(*inst_it);

            if (inst->opcode == ir::PTXInstruction::Bar) {
                if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                    std::string label = (bb_it)->label + "_bar";

                    // split block containing bar.sync so that it's always
                    // the last instruction in a block
                    kernel->cfg()->split_block(bb_it, n+1,
                        ir::BasicBlock::Edge::FallThrough, label);
                }
                break;
            }
        } // end for (inst_it)
    } // end for (bb_it)
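The same barrier-splitting idea can be tried without Ocelot using a simplified block type. The sketch below is an assumption-laden stand-in: `SimpleBlock` and `splitAtBarriers` are hypothetical, opcodes are plain strings, and edges are implied by block order instead of being stored. Unlike the loop above, it handles every barrier in a block in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical block: a label plus a list of opcode names.
struct SimpleBlock {
    std::string label;
    std::vector<std::string> opcodes;
};

// Splits blocks so that every "bar" ends up as the last instruction of
// its block. Each new block is named after its predecessor plus "_bar".
std::vector<SimpleBlock> splitAtBarriers(const std::vector<SimpleBlock>& input) {
    std::vector<SimpleBlock> output;
    for (const SimpleBlock& block : input) {
        SimpleBlock current{block.label, {}};
        for (std::size_t n = 0; n < block.opcodes.size(); ++n) {
            current.opcodes.push_back(block.opcodes[n]);
            bool isBarrier = block.opcodes[n] == "bar";
            bool hasMore = n + 1 < block.opcodes.size();
            if (isBarrier && hasMore) {
                // Close the current block after the barrier and start a new one.
                output.push_back(current);
                current = SimpleBlock{current.label + "_bar", {}};
            }
        }
        output.push_back(current);
    }
    assert(output.size() >= input.size());
    return output;
}
```

A block that already ends in a barrier is left untouched, which mirrors the `n + 1 < size` check in the original loop.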

Page 32: Overview of Ocelot: architecture

Example: Spilling Live Values

    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive )
    {
        unsigned int bytes = 0;

        ir::PTXInstruction move ( ir::PTXInstruction::Mov );

        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;

        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );

        ...

Page 33: Overview of Ocelot: architecture

Example: Spilling Live Values

        ...

        for( DataflowGraph::Block::RegisterSet::const_iterator
            reg = alive.begin(); reg != alive.end(); ++reg ) {

            ir::PTXInstruction save( ir::PTXInstruction::St );

            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;

            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;

            bytes += ir::PTXOperand::bytes( save.type );

            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;

            _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
        }

        _spillBytes = std::max( bytes, _spillBytes );
    }

Page 34: Overview of Ocelot: architecture

IR for AMD and LLVM

LLVM IR
• Implements all of the LLVM instruction set
• Decouples translator from LLVM project
• Easier to construct than LLVM’s actual IR

AMD IL
• Supports translation from PTX to AMD interface

Emitters construct parseable string representations of modules

AMD Backend: R. Domingo & D. Kaeli (NEU)

Page 35: Overview of Ocelot: architecture

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Page 36: Overview of Ocelot: architecture

PTX PassManager

• Orchestrates analysis and transformation passes
• Derived from LLVM model
• Analysis Passes generate meta-data
• Meta-data consumed by transformations
• Transformation Passes modify the IR

Page 37: Overview of Ocelot: architecture

Using the Pass Manager

• Passes added to a manager
  • Schedules execution
  • Manages analysis meta-data
• Ensures meta-data available
  • Up to date; not redundantly computed
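One way to picture "up to date; not redundantly computed" is a manager that caches analysis results and drops the cache whenever a transformation reports a change. The C++ sketch below is a hypothetical miniature, not Ocelot's PassManager: `MiniPassManager`, its method names, and the use of a single `int` as meta-data are all assumptions made for brevity.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical kernel: just a list of instruction names.
struct Kernel { std::vector<std::string> instructions; };

class MiniPassManager {
public:
    // An analysis produces meta-data (an int here, for brevity).
    using Analysis = std::function<int(const Kernel&)>;
    // A transformation consumes meta-data and returns true if it changed the IR.
    using Transform = std::function<bool(Kernel&, int)>;

    void registerAnalysis(const std::string& name, Analysis analysis) {
        _analyses[name] = analysis;
    }
    void addTransform(const std::string& requiredAnalysis, Transform transform) {
        _transforms.push_back({requiredAnalysis, transform});
    }

    // Runs transformations in the order they were added.
    void run(Kernel& kernel) {
        for (auto& entry : _transforms) {
            int metaData = getAnalysis(entry.first, kernel);
            if (entry.second(kernel, metaData)) _cache.clear();   // IR changed: invalidate
        }
    }

    int analysisRuns = 0;   // how many times an analysis was actually computed

private:
    int getAnalysis(const std::string& name, const Kernel& kernel) {
        auto cached = _cache.find(name);
        if (cached != _cache.end()) return cached->second;   // still up to date
        assert(_analyses.count(name) == 1);
        ++analysisRuns;
        int result = _analyses.at(name)(kernel);
        _cache[name] = result;
        return result;
    }

    std::map<std::string, Analysis> _analyses;
    std::vector<std::pair<std::string, Transform>> _transforms;
    std::map<std::string, int> _cache;
};
```

With three transformations that all need the same analysis, where only the second one modifies the kernel, the analysis is computed twice: once at the start and once after the invalidation.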

Page 38: Overview of Ocelot: architecture

Analysis Passes

• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and Post-dominator trees
  • Thread frontiers

Page 39: Overview of Ocelot: architecture

Analysis Passes - Supported Analysis Structures

• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h

Page 40: Overview of Ocelot: architecture

Transformation Passes

• Modify the PTX IR
• Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation

Page 41: Overview of Ocelot: architecture


Example: Dead Code Elimination Transformation Pass

Page 42: Overview of Ocelot: architecture

Dead Code Elimination

• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp

Page 43: Overview of Ocelot: architecture

Dead Code Elimination (1 of 5)

Set up pass dependencies

    DeadCodeEliminationPass::DeadCodeEliminationPass()
        : KernelPass(Analysis::DataflowGraphAnalysis
            | Analysis::StaticSingleAssignment, "DeadCodeEliminationPass")
    {
    }

Page 44: Overview of Ocelot: architecture

Dead Code Elimination (2 of 5)

Run pass: get analysis metadata

    void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
    {
        Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
        assert(dfgAnalysis != 0);

        // cast up
        analysis::DataflowGraph& dfg =
            *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
        assert(dfg.ssa());

Page 45: Overview of Ocelot: architecture

Dead Code Elimination (3 of 5)

Loop until change

    BlockSet blocks;
    for (iterator block = dfg.begin(); block != dfg.end(); ++block)
    {
        report(" Queueing up BB_" << block->id());
        blocks.insert(block);
    }

    while (!blocks.empty())
    {
        iterator block = *blocks.begin();
        blocks.erase(blocks.begin());
        eliminateDeadInstructions(dfg, blocks, block);
    }

Page 46: Overview of Ocelot: architecture

Dead Code Elimination (4 of 5)

Remove unused live-out values

    AliveKillList aliveOutKillList;
    for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
        aliveOut != block->aliveOut().end(); ++aliveOut)
    {
        if (canRemoveAliveOut(dfg, block, *aliveOut))
        {
            report(" removed " << aliveOut->id);
            aliveOutKillList.push_back(aliveOut);
        }
    }
    for (AliveKillList::iterator killed = aliveOutKillList.begin();
        killed != aliveOutKillList.end(); ++killed)
    {
        block->aliveOut().erase(*killed);
    }

Page 47: Overview of Ocelot: architecture

Dead Code Elimination (5 of 5)

Check if an instruction can be removed

    if (ptx.hasSideEffects()) return false;

    for (RegisterPointerVector::iterator reg = instruction->d.begin();
        reg != instruction->d.end(); ++reg)
    {
        // the reg is alive outside the block
        if (block->aliveOut().count(*reg) != 0) return false;

        InstructionVector::iterator next = instruction;
        for (++next; next != block->instructions().end(); ++next)
        {
            for (RegisterPointerVector::iterator source = next->s.begin();
                source != next->s.end(); ++source)
            {
                // found a user in the block
                if (*source->pointer == *reg->pointer) return false;
            }
        }
    }

Page 48: Overview of Ocelot: architecture

Dead Code Elimination

• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid
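The overall shape of dead-code elimination (delete instructions whose results have no users, then repeat because each deletion can expose more dead code) can be shown on a toy IR. The C++ sketch below is not the Ocelot pass: `Inst` and `eliminateDeadCode` are hypothetical, it works on a single straight-line block, and it assumes SSA form so that "used anywhere" is equivalent to "used later".

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical instruction in a single straight-line block, SSA form assumed.
struct Inst {
    std::string dest;                   // value defined, empty if none
    std::vector<std::string> sources;   // values read
    bool hasSideEffects;                // e.g. stores and barriers
};

// Removes instructions without side effects whose result is never used,
// repeating until a full scan removes nothing. `liveOut` names the values
// that are still needed after the block.
std::vector<Inst> eliminateDeadCode(std::vector<Inst> code,
                                    const std::set<std::string>& liveOut) {
    const std::size_t originalSize = code.size();
    bool changed = true;
    while (changed) {
        changed = false;
        // Collect every value that still has a user.
        std::set<std::string> used(liveOut);
        for (const Inst& inst : code)
            used.insert(inst.sources.begin(), inst.sources.end());
        for (std::size_t i = 0; i < code.size(); ++i) {
            if (!code[i].hasSideEffects && used.count(code[i].dest) == 0) {
                code.erase(code.begin() + i);
                changed = true;
                break;   // rescan: the removed instruction's sources may be dead now
            }
        }
    }
    assert(code.size() <= originalSize);
    return code;
}
```

For a chain such as `r1 = ld`, `r2 = add r1`, `r3 = mul r2`, `st r1` with nothing live-out, the first scan removes `r3`, the second removes `r2`, and `r1` survives because the store still reads it.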

Page 49: Overview of Ocelot: architecture

Running Passes on PTX

• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp

Page 50: Overview of Ocelot: architecture

Questions

GPU Ocelot
• Google Code site: http://code.google.com/p/gpuocelot
• Research Project site: http://gpuocelot.gatech.edu
• Mailing list: [email protected]

Contributors
• Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili

Sponsors
• AMD, IBM, Intel, LogicBlox, NSF, NVIDIA