overview of ocelot: architecture - gt comparch


Post on 03-Feb-2022


SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

OVERVIEW OF OCELOT: ARCHITECTURE

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Ocelot: Multiplatform Dynamic Compilation

Just-in-time code generation and optimization for data-intensive applications

[Figure labels: Language Front-End, Data Parallel IR. Credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]

• Environment for (i) compiler research, (ii) architecture research, and (iii) productivity tools

NVIDIA’s Compute Unified Device Architecture (CUDA)

• Integrates the concept of a compute kernel called from standard languages
  • Multithreaded host programs
• The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer

NVIDIA’s Compute Unified Device Architecture (CUDA)

[Figure labels: Host, GPU]

• For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

Structure of a Compute Kernel

• Arrays of (data-parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to single instruction stream, multiple data stream (SIMD) processors

[Figure label: Parallel Thread Execution (PTX) instruction set architecture]
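The CTA and barrier structure above can be pictured with a small host-side emulation. This is only a sketch in plain C++, not CUDA and not how Ocelot or a GPU actually schedules threads: the names `runCta` and `launchGrid` are hypothetical, threads run one after another, and the barrier is modelled by completing phase 1 for every thread of a CTA before phase 2 begins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of one CTA: every thread finishes phase 1 before
// any thread starts phase 2, which is the guarantee a barrier provides.
void runCta(std::size_t ctaId, std::size_t ctaSize,
            const std::vector<int>& in, std::vector<int>& out) {
    std::vector<int> shared(ctaSize);          // per-CTA shared memory
    const std::size_t base = ctaId * ctaSize;  // first element owned by this CTA

    // Phase 1: each thread stages one element into shared memory.
    for (std::size_t tid = 0; tid < ctaSize; ++tid)
        shared[tid] = in[base + tid];

    // Barrier point: all writes to shared[] are complete here.

    // Phase 2: each thread reads an element written by a different thread.
    for (std::size_t tid = 0; tid < ctaSize; ++tid)
        out[base + tid] = shared[ctaSize - 1 - tid];
}

// Hypothetical launcher: CTAs are independent, so any execution order works.
std::vector<int> launchGrid(std::size_t ctaCount, std::size_t ctaSize,
                            const std::vector<int>& in) {
    assert(in.size() == ctaCount * ctaSize);
    std::vector<int> out(in.size());
    for (std::size_t cta = 0; cta < ctaCount; ++cta)
        runCta(cta, ctaSize, in, out);
    return out;
}
```

Calling `launchGrid(2, 3, {1, 2, 3, 4, 5, 6})` reverses each CTA's slice independently. Without the barrier, phase 2 could read a shared-memory slot that its owner has not yet written.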

NVIDIA Fermi GF100

• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each SM has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own Shared Memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit memory controllers (384 bits wide in total)

[Figure labels: ALU, streaming multiprocessor (SM)]
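The totals quoted on this slide follow from the per-unit figures. The sketch below redoes the arithmetic; the constant and function names are hypothetical, and the only inputs are the numbers stated on the slide.

```cpp
#include <cassert>
#include <cstddef>

// Per-unit figures as stated on the slide.
constexpr std::size_t kGpcs = 4;
constexpr std::size_t kSmsPerGpc = 4;
constexpr std::size_t kAlusPerSm = 32;
constexpr std::size_t kRegistersPerAlu = 1024;
constexpr std::size_t kBytesPerRegister = 4;  // 32-bit registers
constexpr std::size_t kMemoryControllers = 6;
constexpr std::size_t kBitsPerController = 64;

constexpr std::size_t totalSms() { return kGpcs * kSmsPerGpc; }
constexpr std::size_t totalAlus() { return totalSms() * kAlusPerSm; }
constexpr std::size_t registerFileBytesPerSm() {
    return kAlusPerSm * kRegistersPerAlu * kBytesPerRegister;
}
constexpr std::size_t memoryBusBits() {
    return kMemoryControllers * kBitsPerController;
}

// Converts a byte count that is a whole number of kilobytes (1024 bytes).
inline std::size_t kilobytes(std::size_t bytes) {
    assert(bytes % 1024 == 0);
    return bytes / 1024;
}

// Compile-time checks against the totals quoted on the slide.
static_assert(registerFileBytesPerSm() == 128 * 1024, "128 kB per SM");
static_assert(memoryBusBits() == 384, "384-bit memory interface");
```

The same figures give 16 SMs and 512 ALUs for the whole chip.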

Ocelot Structure [1]

[Figure labels: CUDA Application, nvcc, PTX Kernel]

• Ocelot is built with nvcc and the LLVM backend
  • Structured around a PTX IR
  • LLVM IR Translator
• Compile stock CUDA applications without modification

[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.

CUDA to PTX

• PTX modules are stored as string literals in the fat binary
• Ocelot ignores the accompanying binary image (the GPU-native binary)


Dependencies

• Software
  • C++ compiler (GCC 4.5.x)
  • Lex lexer generator (Flex 2.5.35)
  • YACC parser generator (Bison 2.4.1)
  • SCons (Python 2.7)
  • LLVM (3.1)
• Libraries
  • boost_system (1.46)
  • boost_filesystem (1.46)
  • boost_serialization (1.46)
  • GLEW (1.5; optional, for GL interop)
  • GL (for NVIDIA GPU devices)
• Library headers
  • Boost (1.46)

http://code.google.com/p/gpuocelot/wiki/Installation

Ocelot Source Code

• Freely available via Google Code project site (New BSD License)
  • http://code.google.com/p/gpuocelot/
  • svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

Building GPU Ocelot

• Obtain source code
  • svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with SCons
  • sudo ./build.py --install
• Build and execute unit tests
  • sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directory:
  • /usr/local/include/ocelot
  • /usr/local/lib

http://code.google.com/p/gpuocelot/wiki/Installation

Configuring Ocelot

• configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
  • trace specifies which trace generators are initially attached
  • executive controls device properties
• trace:
  • memoryChecker - ensures
  • raceDetector - enforces synchronized access to .shared
  • debugger - interactive debugger
• executive:
  • devices:
    • List of Ocelot backend devices that are enabled
    • nvidia - NVIDIA GPU backend
    • emulated - Ocelot PTX emulator (trace generators)
    • llvm - efficient execution of PTX on multicore CPU
    • amd - translation to AMD IL for PTX on AMD RADEON GPU

{
    trace: {
        memoryChecker: {
            enabled: true,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: true
        },
        debugger: {
            enabled: false,
            kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
            alwaysAttach: true
        },
    },
    executive: {
        devices: [ "emulated" ],
    }
}

Building and Executing CUDA Programs

• nvcc -c example.cu -arch sm_20
• g++ -o example example.o `OcelotConfig -l`
  • `OcelotConfig -l` expands to ‘-locelot’
  • libocelot.so replaces libcudart.so


CUDA Runtime API

• Ocelot implements the CUDA Runtime API
  • Transparent hooks into existing CUDA applications
    • override methods of cuda::CudaDeviceInterface
  • Maps CUDA RT onto Ocelot device interface abstraction
    • cuda::CudaRuntime
  • Extended through custom Ocelot API
    • e.g. ocelot::registerPTXModule( );

Ocelot CUDA Runtime Overview

Kernels execute anywhere: key to portability!

• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
  • Link against libocelot.so instead of libcudart

R. Domingo & D. Kaeli (NEU)

Ocelot CUDA Runtime

• Clean device abstraction
  • All back-ends implement the same interface
• Ocelot API extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching

Ocelot Source Code: CUDA Runtime API

• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

Ocelot CUDA Runtime API Implementation

• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()

Ocelot Source Code: Device Interface

• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h

Ocelot Device Interface

• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and Events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • 57 functions (versus CUDA Runtime’s 120+)
• Capture device state:
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface
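The idea that every backend implements one interface, and that front ends target only that interface, can be sketched as follows. This is an illustration under stated assumptions, not the real `executive::Device` class: the three-method interface, `RecordingDevice`, and `launchOnAll` are hypothetical and far smaller than the 57-function interface described above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, heavily reduced device interface (not executive::Device).
class Device {
public:
    virtual ~Device() = default;
    virtual std::string name() const = 0;
    virtual std::size_t allocate(std::size_t bytes) = 0;  // memory management
    virtual void launch(const std::string& kernel) = 0;   // kernel launching
    virtual std::size_t launchCount() const = 0;
};

// Shared bookkeeping so each backend only supplies what differs.
class RecordingDevice : public Device {
public:
    std::size_t allocate(std::size_t bytes) override {
        assert(bytes > 0);
        allocated_ += bytes;
        return allocated_;  // running total stands in for a device address
    }
    void launch(const std::string& kernel) override {
        launched_.push_back(kernel);
    }
    std::size_t launchCount() const override { return launched_.size(); }

private:
    std::size_t allocated_ = 0;
    std::vector<std::string> launched_;
};

class EmulatedDevice : public RecordingDevice {
public:
    std::string name() const override { return "emulated"; }
};

class MulticoreDevice : public RecordingDevice {
public:
    std::string name() const override { return "llvm"; }
};

// Front-end code is written once against Device and never names a backend.
std::size_t launchOnAll(std::vector<std::unique_ptr<Device>>& devices,
                        const std::string& kernel) {
    std::size_t launches = 0;
    for (auto& device : devices) {
        device->allocate(256);
        device->launch(kernel);
        launches += device->launchCount();
    }
    return launches;
}
```

Adding another execution target in this sketch means adding one more subclass; `launchOnAll` stays unchanged, which is the property the slide attributes to the Device abstraction.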


Ocelot PTX Intermediate Representation (IR)

• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
    • LLVM (x86/PowerPC/ARM)
    • CAL (AMD GPUs)

[Figure label: PTX Kernel]

Ocelot Source Code: Intermediate Representation

• ocelot/
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- parser (to PTX)
    • interface/PTXParser.h

Ocelot PTX Internal Representation

• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX Parser target, Emitter source
  • ir::PTXInstruction::valid( )
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator

Ocelot PTX IR: Kernels

.global .f32 globalVariable;

.entry sequence (
    .param .u64 __cudaparm_sequence_A,
    .param .s32 __cudaparm_sequence_N)
{
    .reg .u32 %r<11>;
    .reg .u64 %rd<6>;
    .local .u32 %rp0;
    . . .
    . . .
$LDWbegin_sequence:
    ld.param.s32 %r6, [__cudaparm_sequence_N];
    setp.le.s32 %p1, %r6, %r5;
    @%p1 bra $Lt_0_1026;
    . . .
    . . .
$Lt_0_1026:
    exit;
$LDWend_sequence:
} // sequence

[Figure callouts: ir::Module, ir::Kernel, ir::BasicBlock, ir::Local, ir::Parameter, ir::Global]

Ocelot PTX IR: Instructions

add.s32 %r7, %r5, 1;
ld.param.u64 %rd1, [__cudaparm_sequence_A];
cvt.s64.s32 %rd2, %r5;
mul.wide.s32 %rd3, %r5, 4;
add.u64 %rd4, %rd1, %rd3;
st.global.s32 [%rd4 + 0], %r7;
@%p1 bra $Lt_0_6146;

[Figure callouts: the listing is an ir::BasicBlock and each line is an ir::PTXInstruction with the fields opcode, addressSpace, dataType, d, a, and a guard predicate. Each ir::PTXOperand carries an addressMode: address, register, immediate, indirect, or label.]
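The instruction and operand fields called out on this slide can be modelled with two small structs and an emitter that prints PTX text. This is a simplified sketch, not the actual `ir::PTXInstruction` or `ir::PTXOperand` classes: the string-typed fields, the `b` operand, and the helper functions `reg`, `imm`, `indirect`, and `label` are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified mirror of an operand with an address mode.
struct Operand {
    enum AddressMode { Register, Immediate, Indirect, Address, Label, Invalid };
    AddressMode addressMode = Invalid;
    std::string identifier;  // register, symbol, or label name
    long immediate = 0;
    int offset = 0;

    std::string toString() const {
        switch (addressMode) {
        case Register:  return identifier;
        case Immediate: return std::to_string(immediate);
        case Indirect:  return "[" + identifier + " + " + std::to_string(offset) + "]";
        case Address:   return "[" + identifier + "]";
        case Label:     return identifier;
        case Invalid:   break;
        }
        return "";
    }
};

// Hypothetical, simplified mirror of an instruction.
struct Instruction {
    std::string opcode;        // e.g. "add", "st", "bra"
    std::string addressSpace;  // e.g. "param", "global"; empty if none
    std::string dataType;      // e.g. "s32", "u64"; empty if none
    Operand d, a, b;           // destination and up to two sources
    std::string guard;         // guard predicate register; empty if none

    std::string toString() const {
        assert(!opcode.empty());
        std::string text;
        if (!guard.empty()) text += "@" + guard + " ";
        text += opcode;
        if (!addressSpace.empty()) text += "." + addressSpace;
        if (!dataType.empty()) text += "." + dataType;
        text += " " + d.toString();
        if (a.addressMode != Operand::Invalid) text += ", " + a.toString();
        if (b.addressMode != Operand::Invalid) text += ", " + b.toString();
        return text + ";";
    }
};

// Small constructors for the address modes used in the listing above.
Operand reg(const std::string& name) {
    Operand o; o.addressMode = Operand::Register; o.identifier = name; return o;
}
Operand imm(long value) {
    Operand o; o.addressMode = Operand::Immediate; o.immediate = value; return o;
}
Operand indirect(const std::string& base, int offset) {
    Operand o; o.addressMode = Operand::Indirect; o.identifier = base;
    o.offset = offset; return o;
}
Operand label(const std::string& name) {
    Operand o; o.addressMode = Operand::Label; o.identifier = name; return o;
}
```

Filling in `opcode = "st"`, `addressSpace = "global"`, `dataType = "s32"`, `d = indirect("%rd4", 0)`, and `a = reg("%r7")` reproduces the store line from the listing. A store uses the `d` slot for its address operand, which is why the indirect operand comes first.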

Control and Data-Flow Graphs

• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
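The traversal orders listed above can be illustrated on a toy control-flow graph. This is a minimal sketch, not Ocelot's `ir::ControlFlowGraph`: the `Block` struct, the index-based edges, and the function names are hypothetical. It computes a depth-first post-order and its reverse, which is a topological order when the graph has no loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal CFG: blocks are indices, edges are successor lists.
struct Block {
    std::string label;
    std::vector<std::size_t> successors;  // branch and fall-through targets
};

using Cfg = std::vector<Block>;

// Depth-first visit that records a block after all of its successors.
void visit(const Cfg& cfg, std::size_t block,
           std::vector<bool>& seen, std::vector<std::string>& order) {
    seen[block] = true;
    for (std::size_t next : cfg[block].successors)
        if (!seen[next]) visit(cfg, next, seen, order);
    order.push_back(cfg[block].label);
}

std::vector<std::string> postOrder(const Cfg& cfg, std::size_t entry) {
    assert(entry < cfg.size());
    std::vector<bool> seen(cfg.size(), false);
    std::vector<std::string> order;
    visit(cfg, entry, seen, order);
    return order;
}

// Reverse post-order; a topological order for acyclic graphs.
std::vector<std::string> reversePostOrder(const Cfg& cfg, std::size_t entry) {
    std::vector<std::string> order = postOrder(cfg, entry);
    std::reverse(order.begin(), order.end());
    return order;
}

// A diamond: entry branches to "then" or "else", both reach "exit".
Cfg diamond() {
    return Cfg{
        Block{"entry", {1, 2}},
        Block{"then", {3}},
        Block{"else", {3}},
        Block{"exit", {}},
    };
}
```

On the diamond, post-order visits `exit` first and `entry` last, so the reverse order starts at `entry` and reaches `exit` only after both branches.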

Example: Control-Flow Graphs

// example: splits basic blocks containing barriers
//
for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
     bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks

    unsigned int n = 0;
    ir::BasicBlock::InstructionList::iterator inst_it;

    for (inst_it = (bb_it)->instructions.begin();
         inst_it != (bb_it)->instructions.end();
         ++inst_it, n++) { // iterate over instructions in *bb_it

        const ir::PTXInstruction *inst =
            static_cast<const ir::PTXInstruction *>(*inst_it);

        if (inst->opcode == ir::PTXInstruction::Bar) {
            if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                std::string label = (bb_it)->label + "_bar";
                // split block containing bar.sync so that it's always
                // the last instruction in a block
                kernel->cfg()->split_block(bb_it, n+1,
                    ir::BasicBlock::Edge::FallThrough, label);
            }
            break;
        }
    } // end for (inst_it)
} // end for (bb_it)

Example: Spilling Live Values

// ocelot/analysis/implementation/RemoveBarrierPass.cpp
//
void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
    const DataflowGraph::Block::RegisterSet& alive )
{
    unsigned int bytes = 0;

    // load the address of the spill stack into a fresh register
    ir::PTXInstruction move ( ir::PTXInstruction::Mov );
    move.type = ir::PTXOperand::u64;
    move.a.identifier = "__ocelot_remove_barrier_pass_stack";
    move.a.addressMode = ir::PTXOperand::Address;
    move.a.type = ir::PTXOperand::u64;
    move.d.reg = _kernel->dfg()->newRegister();
    move.d.addressMode = ir::PTXOperand::Register;
    move.d.type = ir::PTXOperand::u64;

    _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );

    ...

    // store each live register at its own offset from that address
    for( DataflowGraph::Block::RegisterSet::const_iterator
        reg = alive.begin(); reg != alive.end(); ++reg ) {
        ir::PTXInstruction save( ir::PTXInstruction::St );
        save.type = reg->type;
        save.addressSpace = ir::PTXInstruction::Local;
        save.d.addressMode = ir::PTXOperand::Indirect;
        save.d.reg = move.d.reg;
        save.d.type = ir::PTXOperand::u64;
        save.d.offset = bytes;

        bytes += ir::PTXOperand::bytes( save.type );

        save.a.addressMode = ir::PTXOperand::Register;
        save.a.type = reg->type;
        save.a.reg = reg->id;

        _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
    }

    _spillBytes = std::max( bytes, _spillBytes );
}

IR for AMD and LLVM

• LLVM IR
  • Implements all of the LLVM instruction set
  • Decouples the translator from the LLVM project
  • Easier to construct than LLVM’s actual IR
• AMD IL
  • Supports translation from PTX to AMD interface
• Emitters construct parseable string representations of modules

AMD Backend: R. Domingo & D. Kaeli (NEU)


PTX PassManager

• Orchestrates analysis and transformation passes
  • Derived from LLVM model
• Analysis Passes generate meta-data
  • Meta-data consumed by transformations
• Transformation Passes modify the IR

Using the Pass Manager

• Passes added to a manager
  • Schedules execution
• Manages analysis meta-data
  • Ensures meta-data available
  • Up to date; not redundantly computed
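The bookkeeping described above (meta-data available, up to date, and not redundantly computed) can be sketched with a toy manager. Everything here is hypothetical and much simpler than Ocelot's pass manager: analyses are identified by name, "computing" one only increments a counter, and any pass that modifies the IR conservatively invalidates all cached analyses.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel's IR.
struct Kernel {
    std::vector<std::string> instructions;
};

// Hypothetical pass description.
struct Pass {
    std::string name;
    std::vector<std::string> required;  // analyses that must be valid first
    bool modifiesIr;                    // true for transformation passes
    std::function<void(Kernel&)> run;
};

class PassManager {
public:
    void addPass(Pass pass) { passes_.push_back(std::move(pass)); }

    // Runs passes in the order added, computing analyses only when stale.
    void runOnKernel(Kernel& kernel) {
        for (const Pass& pass : passes_) {
            assert(static_cast<bool>(pass.run));
            for (const std::string& analysis : pass.required) {
                if (!valid_[analysis]) {
                    ++computed_[analysis];  // stands in for the real analysis
                    valid_[analysis] = true;
                }
            }
            pass.run(kernel);
            if (pass.modifiesIr) valid_.clear();  // conservative invalidation
        }
    }

    int timesComputed(const std::string& analysis) const {
        auto it = computed_.find(analysis);
        return it == computed_.end() ? 0 : it->second;
    }

private:
    std::vector<Pass> passes_;
    std::map<std::string, bool> valid_;
    std::map<std::string, int> computed_;
};

// Three passes need "dfg"; only the middle one changes the IR.
int demoRecomputations() {
    PassManager manager;
    manager.addPass(Pass{"count", {"dfg"}, false, [](Kernel&) {}});
    manager.addPass(Pass{"append-exit", {"dfg"}, true,
                         [](Kernel& k) { k.instructions.push_back("exit;"); }});
    manager.addPass(Pass{"count-again", {"dfg"}, false, [](Kernel&) {}});
    Kernel kernel;
    manager.runOnKernel(kernel);
    return manager.timesComputed("dfg");
}
```

In `demoRecomputations` the analysis is computed twice rather than three times: the second pass reuses the cached result, and the third recomputes it only because the second pass modified the IR.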

Analysis Passes

• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and Post-dominator trees
  • Thread frontiers

Analysis Passes – Supported Analysis Structures

• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h

Transformation Passes

• Modify the PTX IR
  • Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation

Example: Dead Code Elimination Transformation Pass

Dead Code Elimination

• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp

Dead Code Elimination (1 of 5)

• Set up pass dependencies

DeadCodeEliminationPass::DeadCodeEliminationPass()
    : KernelPass(Analysis::DataflowGraphAnalysis
        | Analysis::StaticSingleAssignment, "DeadCodeEliminationPass")
{
}

Dead Code Elimination (2 of 5)

• Run pass

void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
{

• Get analysis metadata

    Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
    assert(dfgAnalysis != 0);

    // cast up
    analysis::DataflowGraph& dfg =
        *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
    assert(dfg.ssa());

Dead Code Elimination (3 of 5)

• Loop until no blocks remain in the worklist

    BlockSet blocks;
    for (iterator block = dfg.begin(); block != dfg.end(); ++block)
    {
        report(" Queueing up BB_" << block->id());
        blocks.insert(block);
    }

    while (!blocks.empty())
    {
        iterator block = *blocks.begin();
        blocks.erase(blocks.begin());
        eliminateDeadInstructions(dfg, blocks, block);
    }

Dead Code Elimination (4 of 5)

• Remove unused live-out values

    AliveKillList aliveOutKillList;
    for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
        aliveOut != block->aliveOut().end(); ++aliveOut)
    {
        if (canRemoveAliveOut(dfg, block, *aliveOut))
        {
            report(" removed " << aliveOut->id);
            aliveOutKillList.push_back(aliveOut);
        }
    }
    for (AliveKillList::iterator killed = aliveOutKillList.begin();
        killed != aliveOutKillList.end(); ++killed)
    {
        block->aliveOut().erase(*killed);
    }

Dead Code Elimination (5 of 5)

• Check if an instruction can be removed

    if (ptx.hasSideEffects()) return false;

    for (RegisterPointerVector::iterator reg = instruction->d.begin();
        reg != instruction->d.end(); ++reg)
    {
        // the reg is alive outside the block
        if (block->aliveOut().count(*reg) != 0) return false;

        InstructionVector::iterator next = instruction;
        for (++next; next != block->instructions().end(); ++next)
        {
            for (RegisterPointerVector::iterator source = next->s.begin();
                source != next->s.end(); ++source)
            {
                // found a user in the block
                if (*source->pointer == *reg->pointer) return false;
            }
        }
    }

Dead Code Elimination

• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid
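The approach walked through in the preceding slides (keep instructions with side effects, keep values that are live out of the block or used later, delete the rest, and repeat until nothing changes) can be condensed into a self-contained sketch. It is not the Ocelot implementation: it handles a single basic block, has no phi instructions or alive-in sets, and the `Inst` struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical single-assignment instruction within one basic block.
struct Inst {
    std::string dest;                  // empty if no value is produced
    std::vector<std::string> sources;  // values read by the instruction
    bool sideEffects;                  // e.g. stores and barriers
};

// True if the value defined at `index` is live out of the block or is read
// by a later instruction in the same block.
bool isUsed(const std::vector<Inst>& block, std::size_t index,
            const std::set<std::string>& aliveOut) {
    assert(index < block.size());
    const std::string& value = block[index].dest;
    if (aliveOut.count(value) != 0) return true;
    for (std::size_t i = index + 1; i < block.size(); ++i) {
        const std::vector<std::string>& s = block[i].sources;
        if (std::find(s.begin(), s.end(), value) != s.end()) return true;
    }
    return false;
}

// Deletes dead instructions until a fixed point; returns how many were removed.
std::size_t eliminateDeadCode(std::vector<Inst>& block,
                              const std::set<std::string>& aliveOut) {
    std::size_t removed = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        // Scan backwards so removing a user can expose its operands as dead.
        for (std::size_t i = block.size(); i-- > 0;) {
            if (block[i].sideEffects || block[i].dest.empty()) continue;
            if (isUsed(block, i, aliveOut)) continue;
            block.erase(block.begin() + static_cast<std::ptrdiff_t>(i));
            ++removed;
            changed = true;
        }
    }
    return removed;
}
```

For a block that defines `%r1`, `%r2` (from `%r1`), `%r3` (from `%r2`), and `%r4` (from `%r3`), and then stores `%r2`, the pass removes `%r4` and then `%r3`, and keeps everything the store depends on.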

Running Passes on PTX

• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp