ocelot: ptx emulator

30
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: PTX EMULATOR 1

Upload: anoki

Post on 22-Mar-2016

80 views

Category:

Documents


2 download

DESCRIPTION

Ocelot: PTX Emulator. Overview. Ocelot PTX Emulator Multicore-Backend NVIDIA GPU Backend AMD GPU Backend. Execution Model . NVIDIA’s PTX Execution Model. Array of multiprocessors each executing a CTA (coarse grain) SIMD multiprocessors (fine grain) - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 1

OCELOT: PTX EMULATOR

Page 2: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2

OverviewOcelot PTX Emulator

Multicore-Backend

NVIDIA GPU Backend

AMD GPU Backend

Page 3: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Execution Model

Parallel thread Execution (PTX) Explicit memory hierarchy Cooperative thread arrays (CTAs)

and ordering constraints

Array of multiprocessors each executing a CTA (coarse grain)

SIMD multiprocessors (fine grain) Single instruction multiple thread (SIMT) execution Enables hardware to exploit

control flow uniformity and data locality

3

NVIDIA’s PTX Execution Model

Page 4: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

The Ocelot PTX EmulatorPerforms functional simulation of the PTX execution model

Access to complete machine state

Enables detailed performance evaluation/debugging

Program correctness checking Alignment checks, out of bounds

etc.

Workload characterization and modeling

Trace generation to drive architecture simulators

Timing information not availablePTX 3.0 (Fermi) support

Abstract machine model

4

4

Page 5: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Emulator ImplementationSerialized execution of CTAs

Implements the CUDA execution model semantics CUDA does not guarantee concurrent CTA execution!

Implements abstract reconvergence mechanism Multiple reconvergence mechanisms are implemented:

IPDOM, Barrier, Thread frontiers

Software support for special functions: Texture sampling

1D, 2D, 3D, cube Nearest, linear

New instructions may be prototyped

Page 6: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Ocelot Source Code: PTX Emulator Device Backend• ocelot/

• executive/

• interface/EmulatorDevice.h• interface/EmulatedKernel.h• interface/EmulatorCallStack.h• interface/CooperativeThreadArray.h• interface/TextureOperations.h

• trace/

• interface/TraceEvent.h• interface/TraceGenerator.h

6

Page 7: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Trace Generator Interface7

7

Emulator broadcasts events to event trace analyzers during execution

Events provide detailed device state: PC, activity mask, operand data, thread ID, etc.

Used for error checking, instrumentation, and simulation

Page 8: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Trace Generator Interface// ocelot/trace/interface/TraceGenerator.h

//

// Base class for generating traces

class TraceGenerator {public:TraceGenerator();

virtual ~TraceGenerator();

// called when a traced kernel is launched to

// retrieve some parameters from the kernel

virtual void initialize( const executive::ExecutableKernel& kernel);

// Called whenever an event takes place.

virtual void event(const TraceEvent & event);

// called when an event is committed

virtual void postEvent(const TraceEvent & event);

// Called when a kernel is finished. There will

// be no more events for this kernel.

virtual void finish();};

8

8

Page 9: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

TraceEvent ObjectCaptures execution of a dynamic instruction, including

Device pointer to access device state Kernel grid, CTA dimensions, parameter values PC and internal representation of PTX instruction Set of active threads executing instruction Memory addresses and size of transfer Branch target(s) and diverging threads

9

9

Page 10: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

trace::TraceEvent

// \file ocelot/trace/interface/TraceEvent.hclass TraceEvent {public:

// ID of the block that generated the event ir::Dim3 blockId;

// PC index into EmulatedKernel's packed instruction sequence

ir::PTXU64 PC;

// Depth of call stack [i.e. number of contexts on the runtime stack]

ir::PTXU32 contextStackSize;

// Instruction const pointer to instruction pointed to by PC

const ir::PTXInstruction* instruction;

// Bit mask of active threads that executed this instruction

BitMask active; // Taken thread mask in case of a branch BitMask taken;

// Fall through thread mask in case of a branch BitMask fallthrough;

// Vector of memory addresses possibly generated for this instruction

U64Vector memory_addresses;

// Vector of sizes of memory operations possibly issued by this

// instruction ir::PTXU32 memory_size;

// Dimensions of the kernel grid that generated the event

ir::Dim3 gridDim;

// Dimensions of the kernel block that generated the event

ir::Dim3 blockDim; // Captures just events related to thread reconvergence ReconvergenceTraceEvent reconvergence;

};

Page 11: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

PTX Emulator – CUDA Debugging – Race Detection

// file: raceCondition.cu__global__ void raceCondition(int *A) {

__shared__ int SharedMem[64];

SharedMem[threadIdx.x] = A[threadIdx.x];

// no synchronization barrier!

A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load}

. . .

raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );

. . .

==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:

==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252]

- Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between.

==Ocelot== Near raceCondition.cu:9:0

11

11

Page 12: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

PTX Emulator – CUDA Debugging- Illegal Memory Accesses

// file: memoryCheck.cu__global__ void badMemoryReference(int *A) {

A[threadIdx.x] = 0; // line 3 - faulting store}

int main() {

int *invalidPtr = 0x0234; // arbitrary pointer does not refer // to an existing memory allocation

int *validPtr = 0; cudaMalloc((void **)&validPtr, sizeof(int)*64);

badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr ); return 0;

} ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range.==Ocelot== ==Ocelot== Nearby Device Allocations==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)==Ocelot== ==Ocelot== Near memoryCheck.cu:3:0

12

12

Page 13: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Interactive DebuggerInteractive command-line debugger implemented using TraceGenerator interface

$ ./TestCudaSequenceA_gpu = 0x16dcbe0(ocelot-dbg) Attaching debugger to kernel 'sequence'(ocelot-dbg)(ocelot-dbg) watch global address 0x16dcbe4 s32[3]set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes(ocelot-dbg)(ocelot-dbg) continuest.global.s32 [%r11 + 0], %r7watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6break on watchpoint(ocelot-dbg)

• Enables • Inspection of application

state • Single-stepping of

instructions• Breakpoints and

watchpoints

• Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

Page 14: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Performance Tuning

....for (int offset = 1; offset < n; offset *= 2){ // line 61

pout = 1 - pout;pin = 1 - pout;__syncthreads();temp[pout*n+thid] = temp[pin*n+thid];

if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid - offset];

}}....

• Identify critical regions• Memory demand• Floating-point intensity• Shared memory bank

conflicts

Page 15: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Example: Inter-thread Data Flow

• Which kernels exchange computed results through shared memory?

• Track id of producer thread• Ensure threads are well

synchronized• Optionally ignore uses of

shared memory to transfer working sets

15

15

Page 16: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Current Trace Generators16

Trace Generators FunctionBranch Measures control flow uniformity and branch

divergenceInstruction Static and dynamic instruction countIntegrated Debugger GDB-like interfaceKernel Dimension Kernel grid and block dimensionsMachine Attributes Observe and record machine characteristics

Memory Working set size, memory intensity, memory efficiency

Memory Checker i) Bounds checks, ii) alignment checks, and iii) uninitialized loads (shared memory)

Memory Race Detector Race conditions on shared memoryParallelism MIMD and SIMD parallelism limitsPerformance Bound Compute and memory throughputShared Computation Extent of data flow among threads

Warp Synchronous Hot-paths/regions for warp synchronous execution

16

Page 17: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Using Trace Generators

Implement TraceGenerator interface

override methods: initialize( ), event( ), postEvent( ), finish( )

Add to Ocelot runtime explicitly: ocelot::addTraceGenerator( )

or, add to trace::TraceConfiguration

Online analysis or serialize event traces

Link applications with libocelotTrace.so

17

17

Page 18: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Configuring & Executing Applications Edit configure.ocelot

Controls Ocelot’s initial state Located in application’s startup directory trace specifies which trace generators are initially

attached executive controls device properties

trace: memoryChecker – ensures raceDetector - enforces synchronized access

to .shared debugger - interactive debugger

executive: devices:

List of Ocelot backend devices that are enabled emulated – Ocelot PTX emulator (trace generators)

Additional devices: nvidia – execution on NVIDIA GPUs llvm – efficient execution of PTX on multicore CPU amd – translation to AMD IL for PTX on AMD RADEON

GPU

trace: { memoryChecker: { enabled: true, checkInitialization: false }, raceDetector: { enabled: true, ignoreIrrelevantWrites: true }, debugger: { enabled: true, kernelFilter:

"_Z13scalarProdGPUPfS_S_ii", alwaysAttach: false }, }, executive: { devices: [ "emulated" ], }}

18

Page 19: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Example: Thread Load Imbalance

//! Computes number of dynamic instructions for each thread

class ThreadLoadImbalance: public trace::TraceGenerator {

public:

std::vector< size_t > dynamicInstructions;

// For each dynamic instruction, increment counters of each

// thread that executes it

virtual void event(const TraceEvent & event) {

if (!dynamicInstructions.size())

dynamicInstructions.resize(event.active.size(), 0);

for (int i = 0; i < event.active.size(); i++) {

if (event.active[i])

dynamicInstructions[i]++;

}

}

};

• Mandelbrot (CUDA SDK)

Dyna

mic

Inst

ruct

ion

Coun

t

19

19

Page 20: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 20

EXAMPLE: CONTROL-FLOW DIVERGENCE

Page 21: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 21

Control-Flow Divergence in PTX Emulator PTX Emulator facilitates customizable handlers for

control flow divergence Currently implements:

Immediate post-dominator (ipdom) Barrier divergence Thread frontiers, sorted stack Thread frontiers, GEN6

Assumes warp is CTA wide Abstract handlers implement potentially divergent control instructions

eg. Bra, Bar, Exit, Vote

Executing instructions drives TraceGenerators Reconvergence affects active threads, dynamic instruction

count, and instruction trace Analysis tools can group threads into warps of arbitrary size

Page 22: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 22

class ReconvergenceMechanism// ocelot/executive/interface/ReconvergenceMechanism.h//class ReconvergenceMechanism {public: ReconvergenceMechanism(CooperativeThreadArray *cta); virtual ~ReconvergenceMechanism();

virtual void initialize() = 0;

virtual void evalPredicate(CTAContext &context) = 0;

virtual bool eval_Bra(CTAContext &context, const ir::PTXInstruction &instr, const boost::dynamic_bitset<> & branch, const boost::dynamic_bitset<> & fallthrough) = 0;

virtual void eval_Bar(CTAContext &context, const ir::PTXInstruction &instr) = 0;

virtual void eval_Reconverge(CTAContext &context, const ir::PTXInstruction &instr) = 0;

virtual void eval_Exit(CTAContext &context, const ir::PTXInstruction &instr) = 0;

virtual void eval_Vote(CTAContext &context, const ir::PTXInstruction &instr);

virtual bool nextInstruction(CTAContext &context, const ir::PTXInstruction &instr, const ir::PTXInstruction::Opcode &) = 0; virtual CTAContext& getContext() = 0;}

Page 23: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 23

Example: Immediate Post-Dominator Reconvergence// ocelot/executive/interface/ReconvergenceMechanism.h//class ReconvergenceIPDOM: public ReconvergenceMechanism {public: ReconvergenceIPDOM(CooperativeThreadArray *cta); ~ReconvergenceIPDOM(); void initialize(); void evalPredicate(CTAContext &context); bool eval_Bra(CTAContext &context, PTXInstruction &instr, dynamic_bitset<> & branch, dynamic_bitset<> & fallthrough);

void eval_Bar(CTAContext &context, PTXInstruction &instr);

void eval_Reconverge(CTAContext &context, PTXInstruction &instr);

void eval_Exit(CTAContext &context, PTXInstruction &instr);

bool nextInstruction(CTAContext &context, PTXInstruction &instr, PTXInstruction::Opcode &);

CTAContext& getContext(); size_t stackSize() const; void push(CTAContext&); void pop(); std::vector<CTAContext> runtimeStack; std::vector<int> pcStack; unsigned int reconvergeEvents;};

Page 24: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 24

Example: Immediate Post-Dominator Reconvergence

// ocelot/executive/implementation/ReconvergenceMechanism.cpp//void executive::ReconvergenceIPDOM::eval_Bar(executive::CTAContext &context, const ir::PTXInstruction &instr) { if (context.active.count() < context.active.size()) { // deadlock - not all threads reach synchronization barrier#if REPORT_BAR report(" Bar called - " << context.active.count() << " of " << context.active.size() << " threads active");#endif std::stringstream message; message << "barrier deadlock:\n"; throw RuntimeException(message.str(), context.PC, instr); }}

Page 25: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 25

Example: Immediate Post-Dominator Reconvergencevoid executive::ReconvergenceIPDOM::eval_Reconverge( executive::CTAContext &context, const ir::PTXInstruction &instr) { if(runtimeStack.size() > 1) { if(pcStack.back() == context.PC) { pcStack.pop_back(); runtimeStack.pop_back(); ++reconvergeEvents; } else { context.PC++; } } else { context.PC++; }}void executive::ReconvergenceIPDOM::eval_Exit(executive::CTAContext &context, const ir::PTXInstruction &instr) { eval_Bar(context, instr); context.running = false;}

Page 26: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 26

Applications: Thread Frontiers Evaluate the impact of novel warp reconvergence

mechanisms on unstructured control-flow graphs Approach:

Control the layout of basic blocks Select threads with highest priority PC Model priority queue in hardware

Evaluation: measure impact on Activity factor - SIMD utilization Dynamic instruction count Effective memory bandwidth

Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, Sudhakar Yalamanchili. SIMD Reconvergence at Thread Frontiers.  44th International Symposium on Microarchitecture (MICRO 44). December 2011.

Page 27: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 27

Applications: Thread Frontiers// ocelot/executive/interface/ReconvergenceMechanism.h//class ReconvergenceTFSortedStack: public ReconvergenceMechanism {public: ReconvergenceTFSortedStack(CooperativeThreadArray *cta); ~ReconvergenceTFSortedStack();

// ... omitted

typedef std::map<int, CTAContext> RuntimeStack; typedef std::vector<RuntimeStack> StackVector; StackVector stack; unsigned int reconvergeEvents;};

Page 28: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 28

Applications: Thread Frontiers// ocelot/executive/implementation/ReconvergenceMechanism.cpp//executive::CTAContext& executive::ReconvergenceTFSortedStack::getContext() { return stack.back().begin()->second;}

void executive::ReconvergenceTFSortedStack::eval_Exit( executive::CTAContext &context, PTXInstruction &instr) { if (stack.back().size() == 1) { context.running = false; } else { throw RuntimeException("not all threads hit the exit: ", context.PC, instr); }}

bool executive::ReconvergenceTFSortedStack::nextInstruction( executive::CTAContext &context, PTXInstruction &instr, PTXInstruction::Opcode &opcode) {

// advance to next instruction if the current instruction // wasn't a branch if (opcode != ir::PTXInstruction::Bra && opcode != ir::PTXInstruction::Call && opcode != ir::PTXInstruction::Ret) { context.PC++; } return context.running;}

Page 29: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 29

Applications: Thread Frontiers

Page 30: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 30

Summary: Ocelot PTX Emulator Used for instruction level analysis

Can be attached to trace generators and trace analyzers

Simple processing and filtering of machine state

Forms the basis for a range of productivity tools Correctness tools Debugging tools Workload characterization

Drives instruction and address traces to MACSIM (Part 2 of this tutorial)