random house cleanup. hw3 answers cis665/answer.zip

Random House Cleanup

Upload: shana-joseph

Post on 14-Dec-2015


TRANSCRIPT

Page 1: Random House Cleanup. Hw3 Answers cis665/answer.zip

Random House Cleanup

Hw3 Answers http://www.seas.upenn.edu/~cis665/answer.zip

Optimizing Parallel Reductions

See PDF

Copyright infringed from:

Homework Problem

Use SimpleTexture for an example of loading a PGM image

Minimum distance:

Understanding and using shared memory

The local and global memory spaces are not cached, which means every access to global (or local) memory generates an explicit memory transaction.

A multiprocessor takes four clock cycles to issue one memory instruction for a warp. Accessing local or global memory incurs an additional 400 to 600 clock cycles of memory latency.

Understanding and using shared memory

CUDA shared memory is divided into equally-sized memory modules that are called memory banks.

Each memory bank holds a successive 32-bit value (like an int or float) so consecutive array accesses by consecutive threads are very fast.

Bank conflicts occur when multiple requests are made for data from the same bank (either the same address or multiple addresses that map to the same bank).

When this happens, the hardware effectively serializes the memory operations, which forces all the threads to wait until all the memory requests are satisfied. (More in a bit…)
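The bank mapping described above can be modelled on the host in a few lines of plain C++. This is only a sketch, not a measurement: it assumes 16 banks of 32-bit words and 16 threads issuing together (figures that are not stated on the slides), and the function name conflictDegree is invented for illustration. Thread t reads word t * stride, and the function returns the largest number of threads that land in one bank, which is how many serialized passes the access would need under this model.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side model of shared-memory banking (assumption: consecutive 32-bit
// words live in consecutive banks, wrapping around after numBanks words).
// Thread t accesses word index t * stride, so every thread touches a
// different word. Returns the largest number of threads hitting one bank.
int conflictDegree(int stride, int numThreads = 16, int numBanks = 16) {
    assert(stride > 0 && numThreads > 0 && numBanks > 0);
    std::vector<int> hits(numBanks, 0);
    for (int t = 0; t < numThreads; ++t) {
        int wordIndex = t * stride;
        ++hits[wordIndex % numBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions a stride of 1 or 3 is conflict-free, a stride of 2 gives a two-way conflict, and a stride of 16 sends all 16 accesses to the same bank.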

Understanding and using shared memory

Declare shared memory:

extern __shared__ int s_data[];

Deciding on the amount of shared memory at runtime requires some setup in both host and device code.

The following code snippet computes the shared memory size for an array of integers with one element per thread in a block:

int sharedMemSize = numThreadsPerBlock * sizeof(int);

GPGPU Toolkit: SlabOps

Main Issue with GPU Programming

The main issue is not writing the code for the graphics card

The main issue is interfacing with the graphics card

Issues with Interfacing with GPUs

1. You forget to do something
   1. Forget to initialize FBOs
   2. Forget to enable the Cg program
   3. Forget to set the viewport correctly
   4. …
2. GPGPU algorithms are hacks
   1. You're rendering a quad to perform an algorithm on an array
3. It's not object oriented

Using SlabOps

The GPGPU methods covered previously are fine for running one or two programs.

What about trying to manage ten or twenty programs performing hundreds of passes?

SlabOps to the rescue!

Using SlabOps

SlabOps were created by Mark Harris while getting his PhD at the University of North Carolina.

He used them in his GPU fluid simulator to manage the large number of fragment programs required for each pass.

Using SlabOps

3 Parts

1. Define
   1. Define the type of SlabOp that you need (more on this later)
2. Initialization
   1. Initialize the program to load
   2. Initialize the parameters to connect
   3. Initialize the output
3. Run
   1. Update any parameters that might have changed
   2. Call Compute() to run the program

Initialization

void initSlabOps() {
    // Load the program
    g_addMatrixfp.InitializeFP(cgContext, "addMatrix.cg", "main");

    // Set the texture parameters
    g_addMatrixfp.SetTextureParameter("tex1", inputYTexID);
    g_addMatrixfp.SetTextureParameter("tex2", inputXTexID);

    // Set the texture coordinates and output rectangle
    g_addMatrixfp.SetTexCoordRect(0, 0, texSizeX, texSizeY);
    g_addMatrixfp.SetSlabRect(0, 0, texSizeX, texSizeY);

    // Set the output texture
    g_addMatrixfp.SetOutputTexture(outputTexID, texSizeX, texSizeY,
                                   textureParameters.texTarget,
                                   GL_COLOR_ATTACHMENT0_EXT);
}

Run

g_addMatrixfp.Compute();

One line to run the program:
- Sets the variables
- Enables the program
- Sets the viewport
- Builds the geometry to perform the processing
- Performs the computation
- Gets the output into the buffer or texture
- Disables the program
- Resets the viewport

Comparing Saxpy (SlabOp)

// Do calculations
for (int i = 0; i < numIterations; i++) {
    g_saxpyfp.SetTextureParameter("textureY", yTexID[readTex]);
    g_saxpyfp.SetOutputTexture(yTexID[writeTex], texSize, texSize,
                               textureParameters.texTarget,
                               attachmentpoints[writeTex]);

    g_saxpyfp.Compute();
    swap();
}

SlabOp

Comparing Saxpy (Non-SlabOp 1)

// attach two textures to FBO
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, attachmentpoints[writeTex],
                          textureParameters.texTarget, yTexID[writeTex], 0);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, attachmentpoints[readTex],
                          textureParameters.texTarget, yTexID[readTex], 0);
// check if that worked
if (!checkFramebufferStatus()) {
    printf("glFramebufferTexture2DEXT():\t [FAIL]\n");
    // PAUSE();
    exit (ERROR_FBOTEXTURE);
} else if (mode == 0) {
    printf("glFramebufferTexture2DEXT():\t [PASS]\n");
}
// enable fragment profile
cgGLEnableProfile(fragmentProfile);
// bind saxpy program
cgGLBindProgram(fragmentProgram);
// enable texture x (read-only, not changed during the iteration)
cgGLSetTextureParameter(xParam, xTexID);
cgGLEnableTextureParameter(xParam);
// enable scalar alpha (same)
cgSetParameter1f(alphaParam, alpha);
// Calling glFinish() is only necessary to get accurate timings,
// and we need a high number of iterations to avoid timing noise.
glFinish();

Comparing Saxpy (Non-SlabOp 2)

for (int i=0; i<numIterations; i++) {
    // set render destination
    glDrawBuffer (attachmentpoints[writeTex]);
    // enable texture y_old (read-only)
    cgGLSetTextureParameter(yParam, yTexID[readTex]);
    cgGLEnableTextureParameter(yParam);
    // and render multitextured viewport-sized quad
    // depending on the texture target, switch between
    // normalised ([0,1]^2) and unnormalised ([0,w]x[0,h])
    // texture coordinates

    // make quad filled to hit every pixel/texel
    // (should be default but we never know)
    glPolygonMode(GL_FRONT,GL_FILL);
    // and render the quad
    if (textureParameters.texTarget == GL_TEXTURE_2D) {
        // render with normalized texcoords
        glBegin(GL_QUADS);
        glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
        glTexCoord2f(1.0, 0.0); glVertex2f(texSize, 0.0);
        glTexCoord2f(1.0, 1.0); glVertex2f(texSize, texSize);
        glTexCoord2f(0.0, 1.0); glVertex2f(0.0, texSize);
        glEnd();
    } else {
        // render with unnormalized texcoords
        glBegin(GL_QUADS);
        glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
        glTexCoord2f(texSize, 0.0); glVertex2f(texSize, 0.0);
        glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
        glTexCoord2f(0.0, texSize); glVertex2f(0.0, texSize);
        glEnd();
    }
    // swap role of the two textures (read-only source becomes
    // write-only target and the other way round):
    swap();
}
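The swap() call that ends each iteration is not shown on the slides. A minimal sketch of the ping-pong bookkeeping it implies might look like the following, assuming readTex and writeTex are indices into a two-element array such as yTexID; the PingPong struct is hypothetical and only wraps the two indices.

```cpp
#include <cassert>
#include <utility>

// Ping-pong bookkeeping: two textures, one read and one written per pass.
// readTex and writeTex are assumed to be indices into a two-element array
// such as yTexID[2]; the struct itself is invented for illustration.
struct PingPong {
    int readTex = 0;
    int writeTex = 1;

    // After a pass, the texture just written becomes the next pass's input.
    void swap() {
        std::swap(readTex, writeTex);
        assert(readTex != writeTex);  // the two roles never share a texture
    }
};
```

Each pass reads from yTexID[readTex] and writes to yTexID[writeTex], so after an even number of passes the roles are back where they started.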

Comparing Saxpy

OK, that looked a little worse than we know it is.

But… using SlabOps did look a little easier.

Saxpy only had one program being run for multiple iterations.

What about something more complicated… Fluid Flow

Fluids

Follow Stam's method

We're not going to cover how to do fluids so much as the program flow and how SlabOps help contain the problem

1. Advection
2. Impulse
3. Vorticity Confinement
4. Viscous Diffusion
5. Project Divergent Velocity
   1. Compute Divergence
   2. Compute Pressure Disturbances
   3. Subtract gradient(p) from u
6. Display

“Fast Fluid Dynamics Simulation on the GPU”, Mark Harris. In GPU Gems.

Let's not forget Boundary Conditions

Boundaries and interior are computed in separate passes and may require separate programs

Implementation

Harris’ implementation contained 15 GPU programs (including 4 for display)

The simulation takes about 20 passes for each time step, not including two 50-pass runs of the Poisson solver (roughly 120 passes per time step in all).

Switch to code (note: the code can be found in GPU Gems 1)

Point:

Creating something as complex as a fluid solver would be very difficult without some kind of abstraction.

So what's so special about SlabOps?
- Versatility
- Policy-Based Design

SlabOp Versatility

Remember we skipped over how to define a SlabOp.

Each SlabOp is actually composed of 6 objects working together.

Each of the six objects can be replaced according to the specific task.

In other words, to alter a SlabOp to display to the screen instead of the back buffer, I just replace the Update object.

The 6 objects that define a SlabOp

- Render Target Policy: sets up / shuts down any special render target functionality needed by the SlabOp
- GL State Policy: sets and unsets the GL state needed for the SlabOp
- Vertex Pipe Policy: sets up / shuts down vertex programs
- Fragment Pipe Policy: sets up / shuts down fragment programs
- Compute Policy: performs the computation (usually via rendering)
- Update Policy: performs any copies or other update functions after the computation has been performed

Defining a SlabOp

Luckily you do not need to create each of those objects.

You just need to replace one when it doesn’t do what you want.

Harris created 3 predefined SlabOps:
- DefaultSlabOp: performs simple fragment program rendered to a quad
- BCSlabOp: performs boundary condition fragment program rendered as lines
- DisplayOp: displays a texture to the screen

More complex SlabOps

Objects defined to perform:
- Flat 3D texture computations (computing for voxel grids)
  - Flat3DTexComputePolicy
  - Flat3DBoundaryComputePolicy
  - Flat3DVectorizedTexComputePolicy
  - Copy3DTexGLUpdatePolicy
- Multi-texture output (rendering with multiple texture outputs)
  - MultiTextureGLComputePolicy
- Volume computations (rendering with multiple texture coordinates)
  - VolumeComputePolicy, VolumeGLComputePolicy

Defining a SlabOp

typedef SlabOp<
    NoopRenderTargetPolicy,
    NoopGLStatePolicy,
    NoopVertexPipePolicy,
    GenericCgGLFragmentPipePolicy,
    SingleTextureGLComputePolicy,
    CopyTexGLUpdatePolicy
> DefaultSlabOp;

Include a Noop where a policy is not used; include the preferred policy where one is needed.
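To make the role of a Noop policy concrete, here is a reduced two-slot sketch of the same idea. None of these names come from Harris' code: MiniOp, NoopStatePolicy, BlendStatePolicy, QuadComputePolicy and LineComputePolicy are all hypothetical stand-ins, and the "GL state" is just a bool. A Noop policy supplies the functions the host expects but leaves their bodies empty.

```cpp
#include <cassert>
#include <string>

// A Noop policy has the expected functions, with nothing in them.
struct NoopStatePolicy {
    void SetState() {}
    void ResetState() {}
};

// A policy that does real work; the bool stands in for actual GL state.
struct BlendStatePolicy {
    bool blending = false;
    void SetState() { blending = true; }
    void ResetState() { assert(blending); blending = false; }
};

struct QuadComputePolicy {
    std::string Compute() { return "quad"; }
};

struct LineComputePolicy {
    std::string Compute() { return "lines"; }
};

// Reduced host class with two policy slots instead of six.
template <class StatePolicy, class ComputePolicy>
class MiniOp : public StatePolicy, public ComputePolicy {
public:
    std::string Run() {
        StatePolicy::SetState();
        std::string result = ComputePolicy::Compute();
        StatePolicy::ResetState();
        return result;
    }
};

// Picking policies is just a matter of filling in the template arguments.
typedef MiniOp<NoopStatePolicy, QuadComputePolicy> DefaultMiniOp;
typedef MiniOp<BlendStatePolicy, LineComputePolicy> BoundaryMiniOp;
```

Because the Noop functions are empty and known at compile time, an unused slot typically costs nothing at run time.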

Next Generation SlabOps?

The version on the course website has been extracted out of Harris' fluid simulator and updated to use framebuffer objects instead of render texture.

It would be easy to update SlabOps to use the geometry processor as well.

Additional policies could be created to render to non-quad surfaces, i.e. an object.

How do SlabOps work?

The rest of this lecture will explain policy-based design. There will be no more GPU talk during the remainder of the lecture.

Why?
- SlabOps were a good implementation of Policy Based Design
- You should have some exposure to design patterns and templates
- Because I'm the one holding the chalk.

Where did Policy Based Design Come from?

Modern C++ Design: Generic Programming and Design Patterns Applied

By: Andrei Alexandrescu

Excellent bedtime reading
- Asleep within 2 pages

Contains unique implementations of design patterns using templates

What is a design pattern?

Design Pattern: A general repeatable solution to a commonly occurring problem in software design.

- Wikipedia (The irrefutable source on everything)

The most commonly known design pattern?

The Singleton

One of the simplest and most useful design patterns

Goal: To only have one instance of an object, no matter where it is created in the program

The Singleton

class Singleton {
public:
    static Singleton& Instance();
    ~Singleton() {}

private:
    Singleton() {}                  // private, so only Instance() can create the object
    static Singleton* m_singleton;  // the single instance, created on first use
};

Singleton& Singleton::Instance() {
    if (m_singleton == nullptr)
        m_singleton = new Singleton();
    return *m_singleton;
}

// in Cpp file
Singleton* Singleton::m_singleton = nullptr;
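For comparison, a common alternative to the pointer-based version on the slide is a function-local static (often called a Meyers singleton). This is a different technique from the one shown above, sketched here under the assumption of a C++11 or later compiler; the counter member and the bumpCounter helper are invented purely to have something to observe.

```cpp
#include <cassert>

class Singleton {
public:
    // The function-local static is constructed on the first call; since
    // C++11 that initialization is thread-safe.
    static Singleton& Instance() {
        static Singleton instance;
        return instance;
    }

    // Copying would create a second instance, so it is disabled.
    Singleton(const Singleton&) = delete;
    Singleton& operator=(const Singleton&) = delete;

    int counter = 0;  // example state shared by every caller

private:
    Singleton() {}    // private: only Instance() can construct
};

// Example use: every call site sees the same object.
void bumpCounter() {
    Singleton& s = Singleton::Instance();
    assert(&s == &Singleton::Instance());
    ++s.counter;
}
```

This variant needs no separate definition in a .cpp file and never leaks the instance, at the price of less control over when it is destroyed.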

C++ Templates

- Templates: functions and classes that can operate with generic types
- The STL is a library of templates, hence its name: Standard Template Library
- Example templates: vector<int>, string (basic_string<char>); cout and cin are objects whose types are template instantiations

Example Template:

template <class myType>
myType GetMax(myType a, myType b)
{
    return (a > b ? a : b);
}

Example Template Use:

int x = 3, y = 7;
GetMax<int>(x, y);

Modern C++ Design: a book on design patterns using templates
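As a slightly fuller illustration of the GetMax example, the sketch below shows the same function template used with several types, including one where the compiler deduces the type argument. MaxOf is a hypothetical helper added only to show a function template working together with the std::vector class template.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Function template: works for any type that supports operator>.
template <class T>
T GetMax(T a, T b) {
    return (a > b) ? a : b;
}

// A function template over a class template from the standard library.
// Requires a non-empty vector.
template <class T>
T MaxOf(const std::vector<T>& values) {
    assert(!values.empty());
    T best = values[0];
    for (const T& v : values) best = GetMax(best, v);
    return best;
}
```

A call such as GetMax(3, 7) lets the compiler deduce T as int, while GetMax<double>(2, 2.5) names the type explicitly so that the int argument is converted.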

Policy Based Design

Defines a class with complex behavior out of many little classes (called policies), each of which takes care of one behavioral or structural aspect.

You can mix and match policies to achieve a combinatorial set of behaviors by using a small core of elementary components.

How it works

Multiple Inheritance
- One class that inherits the properties of numerous other classes

Templates
- Systems that operate with generic types

Multiple Inheritance + Templates => Policy Based Design

Policies

Each policy is a simple class that implements one aspect of the overall goal

Policies do not need to be templates (in many cases they’re not)

Policies do need to have specific known functions that they implement
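The point about "specific known functions" can be sketched as follows. The two update policies and the Op host are hypothetical; only the function name UpdateOutputSlab() is borrowed from the SlabOp code. The policies share no base class, and the host simply calls the function by name.

```cpp
#include <cassert>
#include <string>

// Two hypothetical update policies. They share no base class; the only
// contract is that each provides UpdateOutputSlab().
struct CopyToTextureUpdate {
    std::string UpdateOutputSlab() { return "copied to texture"; }
};

struct ShowOnScreenUpdate {
    std::string UpdateOutputSlab() { return "shown on screen"; }
};

// The host calls the known function by name. A policy without
// UpdateOutputSlab() would fail to compile once Finish() is instantiated.
template <class UpdatePolicy>
class Op : public UpdatePolicy {
public:
    std::string Finish() {
        std::string result = UpdatePolicy::UpdateOutputSlab();
        assert(!result.empty());
        return result;
    }
};
```

Swapping Op<CopyToTextureUpdate> for Op<ShowOnScreenUpdate> changes the behavior without touching the host class, and the check that the policy fits happens at compile time rather than through virtual functions.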

Encapsulation Class

One class needs to use multiple inheritance to combine all the policies together:

template <class RenderTargetPolicy,
          class GLStatePolicy,
          class VertexPipePolicy,
          class FragmentPipePolicy,
          class ComputePolicy,
          class UpdatePolicy>
class SlabOp : public RenderTargetPolicy,
               public GLStatePolicy,
               public VertexPipePolicy,
               public FragmentPipePolicy,
               public ComputePolicy,
               public UpdatePolicy
{
public:
    SlabOp() {}
    ~SlabOp() {}
    void Compute();
};

The Compute Method

    // The only method of the SlabOp host class is Compute(), which
    // uses the inherited policy methods to perform the slab computation.
    // Note that this also defines the interfaces that the policy classes
    // must have.
    void Compute() {
        // Activate the output slab, if necessary
        ActivateRenderTarget();

        // Set the necessary state for the slab operation
        GLStatePolicy::SetState();
        VertexPipePolicy::SetState();
        FragmentPipePolicy::SetState();
        SetViewport();

        // Put the results of the operation into the output slab.
        UpdateOutputSlab();

        // Perform the slab operation
        ComputePolicy::Compute();

        ResetViewport();

        // Reset state
        FragmentPipePolicy::ResetState();
        VertexPipePolicy::ResetState();
        GLStatePolicy::ResetState();

        // Deactivate the output slab, if necessary
        DeactivateRenderTarget();
    }
};

The Other Methods

But wait, what about all the other functions that we called inside our GPU program?

Those exist in the individual policies.

Example:

InitializeFP(CGcontext context, string fpFileName, string entryPoint)

exists in the FragmentPipePolicy.
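The reason those calls work on the SlabOp object itself is public inheritance: every public member of a policy becomes part of the host's interface. A minimal sketch, with hypothetical FragmentPolicy, ComputeStub and Host types and a simplified one-argument InitializeFP (the real one also takes a Cg context and an entry point):

```cpp
#include <cassert>
#include <string>

// Hypothetical fragment policy that owns the program-loading function.
struct FragmentPolicy {
    std::string programFile;
    void InitializeFP(const std::string& fpFileName) {
        assert(!fpFileName.empty());
        programFile = fpFileName;  // a real policy would load the program here
    }
};

// Hypothetical compute policy that just counts passes.
struct ComputeStub {
    int passes = 0;
    void Compute() { ++passes; }
};

// Public inheritance makes every public policy member part of Host's
// interface, so callers never need to name the policy.
template <class FragmentPipePolicy, class ComputePolicy>
class Host : public FragmentPipePolicy, public ComputePolicy {};
```

With this in place, a Host<FragmentPolicy, ComputeStub> object can call both InitializeFP("addMatrix.cg") and Compute() directly, which mirrors how g_addMatrixfp is used earlier in the slides.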

Conclusion

SlabOps are one of many GPGPU abstractions
- They happen to be my favorite because they are the most versatile and are easy to use

Issues:
- They do not include basic GPGPU functions such as Reduce()
- There is a learning curve
- It is difficult to find out where things are actually going on