SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
OVERVIEW OF OCELOT: ARCHITECTURE
Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager
Ocelot: Multiplatform Dynamic Compilation
Just-in-time code generation and optimization for data-intensive applications
[Figure: language front-end feeding a data-parallel IR. Credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]
• Environment for (i) compiler research, (ii) architecture research, and (iii) productivity tools
NVIDIA’s Compute Unified Device Architecture (CUDA)
• Integrates the concept of a compute kernel called from standard languages
  • Multithreaded host programs
• The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer
NVIDIA’s Compute Unified Device Architecture (CUDA)
[Figure: host and GPU]
• For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training
Structure of a Compute Kernel
• Arrays of (data-parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to a single-instruction-stream, multiple-data-stream (SIMD) processor
[Figure label: Parallel Thread Execution (PTX) instruction set architecture]
NVIDIA Fermi GF100
• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each streaming multiprocessor (SM) has 32 ALUs, 4 SFUs, and 16 load/store (LS) units
• Each ALU has access to 1024 32-bit registers (128 kB in total per SM)
• Each SM has its own shared memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit memory controllers (384 bits wide in total)
[Figure: ALU and streaming multiprocessor (SM)]
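The totals quoted for GF100 follow from the per-unit figures. A minimal sketch of that arithmetic, using only the numbers on the slide; the constant and function names are illustrative and do not come from any NVIDIA or Ocelot header:

```cpp
#include <cstddef>

// Per-unit figures as quoted on the slide for NVIDIA Fermi GF100.
constexpr std::size_t kGpcs = 4;               // processing clusters
constexpr std::size_t kSmsPerGpc = 4;          // SMs per cluster
constexpr std::size_t kAlusPerSm = 32;         // ALUs per SM
constexpr std::size_t kRegistersPerAlu = 1024; // 32-bit registers per ALU
constexpr std::size_t kBytesPerRegister = 4;   // 32 bits
constexpr std::size_t kMemoryControllers = 6;
constexpr std::size_t kBitsPerController = 64;

constexpr std::size_t totalSms() { return kGpcs * kSmsPerGpc; }

constexpr std::size_t registerFileBytesPerSm() {
    return kAlusPerSm * kRegistersPerAlu * kBytesPerRegister;
}

constexpr std::size_t memoryBusBits() {
    return kMemoryControllers * kBitsPerController;
}

// The derived totals match the slide: 16 SMs, 128 kB, 384 bits.
static_assert(totalSms() == 16, "4 GPCs x 4 SMs");
static_assert(registerFileBytesPerSm() == 128 * 1024, "128 kB per SM");
static_assert(memoryBusBits() == 384, "6 x 64-bit controllers");
```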
Ocelot Structure [1]
[Figure: a CUDA application is compiled by nvcc into a PTX kernel]
• Ocelot is built with nvcc and the LLVM backend
  • Structured around a PTX IR
  • LLVM IR translator
• Compile stock CUDA applications without modification
[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.
CUDA to PTX
• PTX modules are stored as string literals in the fat binary
• The accompanying binary image (GPU native binary) is ignored
Dependencies
• Software
  • C++ compiler (GCC 4.5.x)
  • Lex lexer generator (Flex 2.5.35)
  • YACC parser generator (Bison 2.4.1)
  • SCons (Python 2.7)
  • LLVM (3.1)
• Libraries
  • boost_system (1.46)
  • boost_filesystem (1.46)
  • boost_serialization (1.46)
  • GLEW (1.5; optional, for GL interop)
  • GL (for NVIDIA GPU devices)
• Library headers
  • Boost (1.46)
http://code.google.com/p/gpuocelot/wiki/Installation
Ocelot Source Code
• Freely available via Google Code project site (New BSD License)
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations
http://code.google.com/p/gpuocelot/
svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
Building GPU Ocelot
• Obtain source code
  • svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with SCons
  • sudo ./build.py --install
• Build and execute unit tests
  • sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directory:
  • /usr/local/include/ocelot
  • /usr/local/lib
http://code.google.com/p/gpuocelot/wiki/Installation
Configuring Ocelot
• configure.ocelot
  • Controls Ocelot’s initial state
  • Located in the application’s startup directory
  • trace specifies which trace generators are initially attached
  • executive controls device properties
• trace:
  • memoryChecker - ensures
  • raceDetector - enforces synchronized access to .shared
  • debugger - interactive debugger
• executive:
  • devices:
    • List of Ocelot backend devices that are enabled
    • nvidia - NVIDIA GPU backend
    • emulated - Ocelot PTX emulator (trace generators)
    • llvm - efficient execution of PTX on multicore CPU
    • amd - translation of PTX to AMD IL for AMD Radeon GPUs
{
    trace: {
        memoryChecker: {
            enabled: true,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: true
        },
        debugger: {
            enabled: false,
            kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
            alwaysAttach: true
        },
    },
    executive: {
        devices: [ "emulated" ],
    }
}
Building and Executing CUDA Programs
• nvcc -c example.cu -arch sm_23
• g++ -o example example.o `OcelotConfig -l`
  • `OcelotConfig -l` expands to ‘-locelot’
  • libocelot.so replaces libcudart.so
CUDA Runtime API
• Ocelot implements the CUDA Runtime API
• Transparent hooks into existing CUDA applications
  • Override methods of cuda::CudaDeviceInterface
• Maps the CUDA Runtime onto the Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through a custom Ocelot API
  • e.g. ocelot::registerPTXModule( );
Ocelot CUDA Runtime Overview
• Kernels execute anywhere: key to portability!
• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
  • Link against libocelot.so instead of libcudart
R. Domingo & D. Kaeli (NEU)
Ocelot CUDA Runtime
• Clean device abstraction
  • All back-ends implement the same interface
• Ocelot API extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching
Ocelot Source Code: CUDA Runtime API
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations
Ocelot CUDA Runtime API Implementation
• Implements the interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()
Ocelot Source Code: Device Interface
• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h
Ocelot Device Interface
• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and events
  • Trace generators
• Minimal set of APIs for a device-oriented programming model
  • 57 functions (versus the CUDA Runtime’s 120+)
• Capture device state:
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement the Device interface
• Enable multiple API front ends
  • Implement front ends targeting the Device interface
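The idea of one abstract device interface with interchangeable backends can be sketched as follows. This is an illustration under stated assumptions, not Ocelot's code: the `Device` class below is a hypothetical, heavily reduced stand-in for executive::Device, and every member name and signature is invented for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, reduced device interface (not the real executive::Device).
class Device {
public:
    virtual ~Device() {}
    virtual std::string name() const = 0;
    // Module registration: associate a module name with its PTX text.
    virtual void load(const std::string& module, const std::string& ptx) = 0;
    // Memory management: allocate device memory and return a handle.
    virtual std::size_t allocate(std::size_t bytes) = 0;
    virtual void write(std::size_t handle, const void* src, std::size_t bytes) = 0;
    virtual void read(std::size_t handle, void* dst, std::size_t bytes) const = 0;
};

// A toy backend that keeps "device" memory in host vectors.
class HostMemoryDevice : public Device {
public:
    std::string name() const override { return "host-memory"; }

    void load(const std::string& module, const std::string& ptx) override {
        _modules[module] = ptx;
    }

    std::size_t allocate(std::size_t bytes) override {
        _allocations.push_back(std::vector<char>(bytes));
        return _allocations.size() - 1;
    }

    void write(std::size_t handle, const void* src, std::size_t bytes) override {
        std::vector<char>& memory = _allocations.at(handle);
        if (bytes > memory.size()) throw std::out_of_range("write exceeds allocation");
        std::memcpy(memory.data(), src, bytes);
    }

    void read(std::size_t handle, void* dst, std::size_t bytes) const override {
        const std::vector<char>& memory = _allocations.at(handle);
        if (bytes > memory.size()) throw std::out_of_range("read exceeds allocation");
        std::memcpy(dst, memory.data(), bytes);
    }

    std::size_t moduleCount() const { return _modules.size(); }

private:
    std::map<std::string, std::string> _modules;
    std::vector<std::vector<char> > _allocations;
};

// A front end written only against Device; it works with any backend.
inline int roundTrip(Device& device, int value) {
    std::size_t handle = device.allocate(sizeof(int));
    device.write(handle, &value, sizeof(int));
    int result = 0;
    device.read(handle, &result, sizeof(int));
    return result;
}
```

Because `roundTrip` depends only on the abstract class, a second backend could be added without changing the front end, which is the property the slide attributes to the Device interface.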
Ocelot PTX Intermediate Representation (IR)
• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR-to-IR translation
  • From PTX to other IRs
    • LLVM (x86/PowerPC/ARM)
    • CAL (AMD GPUs)
[Figure: PTX kernel]
Ocelot Source Code: Intermediate Representation
• ocelot/
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- parser (to PTX)
    • interface/PTXParser.h
Ocelot PTX Internal Representation
• C++ classes representing a PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX parser target, emitter source
  • ir::PTXInstruction::valid( )
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX emulator
Ocelot PTX IR: Kernels
// ir::Module: the whole listing
.global .f32 globalVariable;                      // ir::Global
.entry sequence (                                 // ir::Kernel
    .param .u64 __cudaparm_sequence_A,            // ir::Parameter
    .param .s32 __cudaparm_sequence_N)
{
    .reg .u32 %r<11>;
    .reg .u64 %rd<6>;
    .local .u32 %rp0;                             // ir::Local
    . . .
    . . .
$LDWbegin_sequence:                               // ir::BasicBlock
    ld.param.s32 %r6, [__cudaparm_sequence_N];
    setp.le.s32 %p1, %r6, %r5;
    @%p1 bra $Lt_0_1026;
    . . .
    . . .
$Lt_0_1026:
    exit;
$LDWend_sequence:
} // sequence
Ocelot PTX IR: Instructions
    add.s32 %r7, %r5, 1;
    ld.param.u64 %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32 %rd2, %r5;
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st.global.s32 [%rd4+0], %r7;
    @%p1 bra $Lt_0_6146;
[Figure: the listing is an ir::BasicBlock and each line is an ir::PTXInstruction with the fields opcode, addressSpace, dataType, d, and a (for example ld | .param | .u64 | %rd1 | [__cudaparm_sequence_A]); @%p1 is a guard predicate. Each ir::PTXOperand carries an addressMode: address, register, immediate, indirect, or label.]
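The instruction fields and operand addressing modes named above can be illustrated with a small self-contained model. This is a sketch only: `Operand` and `Instruction` are hypothetical simplifications whose field names follow the slide, and they are not Ocelot's ir::PTXOperand and ir::PTXInstruction.

```cpp
#include <sstream>
#include <string>

// Hypothetical, reduced operand: one addressing mode plus its payload.
struct Operand {
    enum AddressMode { Register, Immediate, Address, Indirect, Label, Invalid };

    AddressMode addressMode;
    std::string identifier; // register, symbol, or label name
    long long value;        // immediate value, or byte offset for Indirect

    Operand() : addressMode(Invalid), value(0) {}
    Operand(AddressMode mode, const std::string& id, long long v = 0)
        : addressMode(mode), identifier(id), value(v) {}

    std::string toString() const {
        std::ostringstream out;
        switch (addressMode) {
        case Register:  out << identifier; break;
        case Immediate: out << value; break;
        case Address:   out << "[" << identifier << "]"; break;
        case Indirect:  out << "[" << identifier << "+" << value << "]"; break;
        case Label:     out << identifier; break;
        case Invalid:   break;
        }
        return out.str();
    }
};

// Hypothetical, reduced instruction with the fields shown on the slide.
struct Instruction {
    std::string opcode;       // e.g. "ld", "st", "add", "bra"
    std::string addressSpace; // e.g. "param", "global"; empty if unused
    std::string dataType;     // e.g. "u64", "s32"; empty if unused
    std::string guard;        // guard predicate register; empty if unguarded
    Operand d, a, b;          // first operand and up to two further operands

    // Emits a textual form such as "ld.param.u64 %rd1, [symbol];".
    std::string toString() const {
        std::ostringstream out;
        if (!guard.empty()) out << "@" << guard << " ";
        out << opcode;
        if (!addressSpace.empty()) out << "." << addressSpace;
        if (!dataType.empty()) out << "." << dataType;
        const Operand* operands[] = { &d, &a, &b };
        bool first = true;
        for (const Operand* operand : operands) {
            if (operand->addressMode == Operand::Invalid) continue;
            out << (first ? " " : ", ") << operand->toString();
            first = false;
        }
        out << ";";
        return out.str();
    }
};
```

Keeping the operands as structured values rather than strings is what lets a parser fill them in and an emitter print them back out.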
Control and Data-Flow Graphs
• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG transformations:
  • split blocks, edges
• DFG transformations:
  • insert and remove values
  • iterate over def-use
Example: Control-Flow Graphs
// example: splits basic blocks containing barriers
//
for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
     bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks
    unsigned int n = 0;
    ir::BasicBlock::InstructionList::iterator inst_it;
    for (inst_it = (bb_it)->instructions.begin();
         inst_it != (bb_it)->instructions.end();
         ++inst_it, n++) { // iterate over instructions in *bb_it
        const ir::PTXInstruction *inst =
            static_cast<const ir::PTXInstruction *>(*inst_it);
        if (inst->opcode == ir::PTXInstruction::Bar) {
            if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                std::string label = (bb_it)->label + "_bar";
                // split the block containing bar.sync so that the barrier
                // is always the last instruction in a block
                kernel->cfg()->split_block(bb_it, n + 1,
                    ir::BasicBlock::Edge::FallThrough, label);
            }
            break;
        }
    } // end for (inst_it)
} // end for (bb_it)
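The same splitting rule can be tried without Ocelot on a toy data structure. The sketch below assumes a hypothetical `Block` type (a label plus opcode strings) in place of ir::BasicBlock, and it returns a new list of blocks in fall-through order instead of editing a control-flow graph in place.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a basic block: a label and opcode strings.
struct Block {
    std::string label;
    std::vector<std::string> instructions;
};

// Splits each block after every "bar.sync" that is not already the last
// instruction, so that a barrier always ends its block. The block created
// by a split takes the previous label with "_bar" appended.
inline std::vector<Block> splitAtBarriers(const std::vector<Block>& blocks) {
    std::vector<Block> result;
    for (const Block& block : blocks) {
        Block current;
        current.label = block.label;
        for (std::size_t i = 0; i < block.instructions.size(); ++i) {
            current.instructions.push_back(block.instructions[i]);
            const bool isBarrier = block.instructions[i] == "bar.sync";
            const bool isLast = i + 1 == block.instructions.size();
            if (isBarrier && !isLast) {
                result.push_back(current); // barrier closes this block
                Block next;
                next.label = current.label + "_bar";
                current = next;            // remaining code falls through
            }
        }
        result.push_back(current);
    }
    return result;
}
```

Unlike the Ocelot loop, which relies on split_block to maintain edges, this version keeps only the ordering of blocks, which is enough to show where the cuts fall.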
Example: Spilling Live Values
// ocelot/analysis/implementation/RemoveBarrierPass.cpp
//
void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
    const DataflowGraph::Block::RegisterSet& alive )
{
    unsigned int bytes = 0;

    // load the base address of the spill area into a new register
    ir::PTXInstruction move ( ir::PTXInstruction::Mov );
    move.type = ir::PTXOperand::u64;
    move.a.identifier = "__ocelot_remove_barrier_pass_stack";
    move.a.addressMode = ir::PTXOperand::Address;
    move.a.type = ir::PTXOperand::u64;
    move.d.reg = _kernel->dfg()->newRegister();
    move.d.addressMode = ir::PTXOperand::Register;
    move.d.type = ir::PTXOperand::u64;
    _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );
    ...
    // store each live register at the next free offset
    for( DataflowGraph::Block::RegisterSet::const_iterator
        reg = alive.begin(); reg != alive.end(); ++reg ) {
        ir::PTXInstruction save( ir::PTXInstruction::St );
        save.type = reg->type;
        save.addressSpace = ir::PTXInstruction::Local;
        save.d.addressMode = ir::PTXOperand::Indirect;
        save.d.reg = move.d.reg;
        save.d.type = ir::PTXOperand::u64;
        save.d.offset = bytes;
        bytes += ir::PTXOperand::bytes( save.type );
        save.a.addressMode = ir::PTXOperand::Register;
        save.a.type = reg->type;
        save.a.reg = reg->id;
        _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
    }
    _spillBytes = std::max( bytes, _spillBytes );
}
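The offset bookkeeping in the spill loop (`bytes`, `save.d.offset`, `_spillBytes`) can be isolated in a few lines. A minimal sketch, assuming hypothetical `LiveRegister` and `SpillSlot` types in place of Ocelot's register set; like the loop above, it inserts no alignment padding.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of a live register: an id and its size in bytes.
struct LiveRegister {
    unsigned int id;
    unsigned int bytes;
};

// Where a register is saved, relative to the base of the spill area.
struct SpillSlot {
    unsigned int id;
    unsigned int offset;
};

// Assigns consecutive offsets in iteration order and grows `spillBytes`
// to the largest spill area seen so far.
inline std::vector<SpillSlot> layoutSpillSlots(
    const std::vector<LiveRegister>& alive, unsigned int& spillBytes) {
    std::vector<SpillSlot> slots;
    unsigned int bytes = 0;
    for (const LiveRegister& reg : alive) {
        SpillSlot slot = { reg.id, bytes }; // next free offset
        slots.push_back(slot);
        bytes += reg.bytes;
    }
    spillBytes = std::max(bytes, spillBytes);
    return slots;
}
```

Taking the maximum rather than the sum reflects that the same spill area is reused at every barrier, so it only has to fit the largest set of live values.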
IR for AMD and LLVM
• LLVM IR
  • Implements all of the LLVM instruction set
  • Decouples the translator from the LLVM project
  • Easier to construct than LLVM’s actual IR
• AMD IL
  • Supports translation from PTX to the AMD interface
• Emitters construct parseable string representations of modules
AMD Backend: R. Domingo & D. Kaeli (NEU)
PTX PassManager
• Orchestrates analysis and transformation passes
  • Derived from the LLVM model
• Analysis passes generate meta-data
• Meta-data is consumed by transformations
• Transformation passes modify the IR
Using the Pass Manager
• Passes are added to a manager
  • The manager schedules execution
• Manages analysis meta-data
  • Ensures meta-data is available
  • Keeps it up to date; it is not redundantly computed
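How a manager can keep meta-data available without recomputing it is easiest to see in miniature. The following is a sketch under assumptions: every class here (`Kernel`, `Analysis`, `Transformation`, `MiniPassManager`, and the two concrete passes) is hypothetical and much simpler than Ocelot's pass manager, and invalidation is the most conservative policy possible: any modification discards all cached results.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Kernel {
    std::vector<std::string> instructions;
};

// Analysis: derives meta-data (here a single number) from the kernel.
class Analysis {
public:
    virtual ~Analysis() {}
    virtual int run(const Kernel& kernel) = 0;
};

// Transformation: names the analyses it consumes and may modify the kernel.
class Transformation {
public:
    virtual ~Transformation() {}
    virtual std::vector<std::string> required() const = 0;
    // Returns true if the kernel was modified.
    virtual bool run(Kernel& kernel, const std::map<std::string, int>& metadata) = 0;
};

class MiniPassManager {
public:
    MiniPassManager() : analysisRuns(0) {}

    void registerAnalysis(const std::string& name, Analysis* analysis) {
        _analyses[name] = analysis;
    }
    void addPass(Transformation* pass) { _passes.push_back(pass); }

    // Runs passes in the order added; analyses run lazily and are cached.
    void runOnKernel(Kernel& kernel) {
        std::map<std::string, int> metadata; // currently valid results
        for (Transformation* pass : _passes) {
            for (const std::string& name : pass->required()) {
                if (metadata.count(name) == 0) { // missing or invalidated
                    metadata[name] = _analyses.at(name)->run(kernel);
                    ++analysisRuns;
                }
            }
            if (pass->run(kernel, metadata)) {
                metadata.clear(); // the IR changed: invalidate everything
            }
        }
    }

    int analysisRuns; // how many times any analysis was actually computed

private:
    std::map<std::string, Analysis*> _analyses;
    std::vector<Transformation*> _passes;
};

// Example analysis: the number of instructions in the kernel.
class InstructionCount : public Analysis {
public:
    int run(const Kernel& kernel) override {
        return static_cast<int>(kernel.instructions.size());
    }
};

// Example transformation: deletes "nop" instructions.
class RemoveNops : public Transformation {
public:
    std::vector<std::string> required() const override {
        return std::vector<std::string>(1, "count");
    }
    bool run(Kernel& kernel, const std::map<std::string, int>& metadata) override {
        if (metadata.at("count") == 0) return false; // nothing to do
        const std::size_t before = kernel.instructions.size();
        kernel.instructions.erase(
            std::remove(kernel.instructions.begin(), kernel.instructions.end(),
                        std::string("nop")),
            kernel.instructions.end());
        return kernel.instructions.size() != before;
    }
};
```

With three `RemoveNops` passes queued on a kernel that contains one `nop`, the count analysis runs twice: once before the first pass, once more after that pass modifies the kernel, and not at all for the third pass, which reuses the cached result.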
Analysis Passes
• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and post-dominator trees
  • Thread frontiers
Analysis Passes: Supported Analysis Structures
• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h
Transformation Passes
• Modify the PTX IR
  • Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation
Example: Dead Code Elimination Transformation Pass
Dead Code Elimination
• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp
Dead Code Elimination (1 of 5)
• Set up pass dependencies
    DeadCodeEliminationPass::DeadCodeEliminationPass()
        : KernelPass(Analysis::DataflowGraphAnalysis
              | Analysis::StaticSingleAssignment, "DeadCodeEliminationPass")
    {
    }
Dead Code Elimination (2 of 5)
• Run pass
    void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
    {
• Get analysis metadata
        Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
        assert(dfgAnalysis != 0);
        // cast up
        analysis::DataflowGraph& dfg =
            *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
        assert(dfg.ssa());
Dead Code Elimination (3 of 5)
• Loop until nothing changes (the worklist of blocks is empty)
    BlockSet blocks;
    for (iterator block = dfg.begin(); block != dfg.end(); ++block)
    {
        report(" Queueing up BB_" << block->id());
        blocks.insert(block);
    }
    while (!blocks.empty())
    {
        iterator block = *blocks.begin();
        blocks.erase(blocks.begin());
        eliminateDeadInstructions(dfg, blocks, block);
    }
Dead Code Elimination (4 of 5)
• Remove unused live-out values
    AliveKillList aliveOutKillList;
    for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
         aliveOut != block->aliveOut().end(); ++aliveOut)
    {
        if (canRemoveAliveOut(dfg, block, *aliveOut))
        {
            report(" removed " << aliveOut->id);
            aliveOutKillList.push_back(aliveOut);
        }
    }
    for (AliveKillList::iterator killed = aliveOutKillList.begin();
         killed != aliveOutKillList.end(); ++killed)
    {
        block->aliveOut().erase(*killed);
    }
Dead Code Elimination (5 of 5)
• Check if an instruction can be removed
    if (ptx.hasSideEffects()) return false;
    for (RegisterPointerVector::iterator reg = instruction->d.begin();
         reg != instruction->d.end(); ++reg)
    {
        // the reg is alive outside the block
        if (block->aliveOut().count(*reg) != 0) return false;
        InstructionVector::iterator next = instruction;
        for (++next; next != block->instructions().end(); ++next)
        {
            for (RegisterPointerVector::iterator source = next->s.begin();
                 source != next->s.end(); ++source)
            {
                // found a user in the block
                if (*source->pointer == *reg->pointer) return false;
            }
        }
    }
Dead Code Elimination
• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid
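The removal test from the preceding slides (no side effects, not live out, no later user in the block) can be exercised on a single straight-line block. This sketch assumes a hypothetical `Instr` type and SSA-style code in which each register is defined once; it leaves out phi instructions, alive-in values, and the cross-block worklist that the Ocelot pass handles.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical straight-line instruction: one destination, several sources.
struct Instr {
    std::string dest;                 // empty if no value is produced
    std::vector<std::string> sources;
    bool hasSideEffects;              // e.g. stores and barriers
};

// Removes instructions whose result has no later user in the block and is
// not live out of it, sweeping until nothing more can be removed.
// Returns the number of removed instructions.
inline std::size_t eliminateDeadCode(std::vector<Instr>& block,
                                     const std::set<std::string>& aliveOut) {
    std::size_t removed = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        // Walk backwards so that removing a user can expose its producers.
        for (std::size_t i = block.size(); i-- > 0; ) {
            const std::string dest = block[i].dest;
            if (block[i].hasSideEffects || dest.empty()) continue;
            if (aliveOut.count(dest) != 0) continue; // used by another block
            bool used = false;
            for (std::size_t j = i + 1; j < block.size() && !used; ++j) {
                for (const std::string& source : block[j].sources) {
                    if (source == dest) { used = true; break; }
                }
            }
            if (!used) {
                block.erase(block.begin() + static_cast<std::ptrdiff_t>(i));
                ++removed;
                changed = true;
            }
        }
    }
    return removed;
}
```

For example, in a block that computes `%r1`, then `%r2` from `%r1`, then a live-out `%r3` that is stored, both `%r2` and `%r1` are removed: `%r2` has no user, and once it is gone neither does `%r1`.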
Running Passes on PTX
• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp