A smoothed particle hydrodynamic simulation utilizing the parallel processing capabilities of the GPUs

Viktor Lundqvist

2009-09-30

LiU-ITN-TEK-A--09/052--SE

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden

Master's thesis in scientific visualization at the Institute of Technology, Linköping University

Supervisor: Magnus Wrenninge
Examiner: Jonas Unger

Norrköping, 2009-09-30

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Viktor Lundqvist

ABSTRACT

Simulating fluid behavior has proven to be a demanding challenge which requires complex computational models and highly efficient data structures. Smoothed Particle Hydrodynamics (SPH) is a particle-based computational model used to simulate fluid behavior that has been found capable of producing convincing results. However, the SPH algorithm is computationally heavy, which makes it cumbersome to work with.

This master's thesis describes how the SPH algorithm can be accelerated by utilizing the GPU's computational resources. It describes a model for distributing the workload on the GPU and presents a suitable data structure. In addition, it proposes a method to represent and handle moving objects in the fluid's surroundings. Finally, the performance gain due to the GPU is evaluated by comparing processing times with an identical implementation running solely on the CPU.

ACKNOWLEDGMENTS

I would like to thank everyone at Sony Pictures Imageworks, especially the application group, for giving me this opportunity and making me feel part of their team. Special thanks to my supervisor Magnus Wrenninge, who has been an invaluable discussion partner and a great support throughout the work on this thesis.

I am also deeply grateful to Anders Ynnerman and Aida Vitoria at Linköping University, who helped me get into the IPAX program. Thank you!

CONTENTS

INTRODUCTION
    1.1 MOTIVATION
    1.2 WORK ENVIRONMENT
    1.3 OUTLINE OF REPORT

BACKGROUND & RELATED WORK
    2.1 RELATED WORK
    2.2 SMOOTHED PARTICLE HYDRODYNAMICS (SPH)
        2.2.1 Modeling Fluid Dynamics
        2.2.2 Smoothing Kernels
    2.3 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
        2.3.1 NVIDIA's Hardware Architecture
            2.3.1.1 Execution Model
        2.3.2 Programming Interface

IMPLEMENTATION
    3.1 SPH ALGORITHM
        3.1.1 Data Structure
        3.1.2 SPH and Parallel Processing
        3.1.2 Normalize Density
        3.1.3 Smoothing Kernels
    3.2 INTEGRATION METHOD
        3.2.1 Step Size
    3.3 PARTICLE SOURCES
    3.4 COLLISION DETECTION
        3.4.1 Collision Objects
        3.4.2 Collision Algorithm

RESULTS
    4.1 PARTICLE BEHAVIOR
    4.2 PERFORMANCE
    4.3 DISCUSSION
    4.4 FURTHER IMPROVEMENTS
        4.4.1 Performance Improvements
        4.4.2 Functionality

REFERENCES

CHAPTER 1

INTRODUCTION

This chapter aims to introduce the reader to the purpose and the underlying structure of

this thesis.

1.1 MOTIVATION

The motion of fluids has proven to be one of the most challenging physical phenomena to simulate. The complex behavior and physical relations involved in, for example, a rising pillar of smoke, the breaking of an ocean wave or water being poured into a glass are probably what make such phenomena fascinating to watch. However, mimicking these behaviors in computer graphics requires complex computational models and highly efficient data structures.

Smoothed Particle Hydrodynamics (SPH) is a particle-based computational model used to simulate fluid dynamics. The SPH model has been found capable of producing convincing and physically based computer animations of fluid dynamics. However, the computational burden of fluid simulation is heavy, which generally results in low frame rates. Therefore, fluid phenomena are typically simulated off-line and then rendered in a second step. The long simulation times, together with the difficulty of predicting fluid behavior, make fluid simulation somewhat frustrating to work with, since it is often necessary to rerun a simulation several times in order to achieve the desired result.

In recent years the performance growth of single-core general-purpose processors (CPUs) has stagnated. This has led to an increased interest in multicore chip organizations, a field that the vendors of graphics processors (GPUs) have put a lot of research into. Today's GPUs provide a vast number of simple but deeply multithreaded cores with high internal memory bandwidth. Although GPUs have traditionally been dedicated strictly to processing data closely coupled to the rendering process, recent developments have increased their programmability, making them attractive for non-graphical purposes. The SPH simulation algorithm is in many ways suitable for parallel processing.

The purpose of this thesis is to examine how the computational resources of a modern graphics card can be utilized to speed up an SPH simulation.

1.2 WORK ENVIRONMENT

The main work of this thesis was carried out during an internship at Sony Pictures Imageworks. The SPH simulation was implemented to suit the production pipeline at Sony Pictures Imageworks with the aim of being useful in future productions.

The resulting simulation tool is written as a plug-in to the 3D visual effects program Houdini, coded in C++ and Compute Unified Device Architecture (CUDA). CUDA is a general-purpose GPU programming architecture developed by NVIDIA in order to provide users with extended control over the processing capabilities of the GPU. See section 2.3 for a detailed description of CUDA. Sony also provided several in-house C++ libraries facilitating certain tasks, such as handling field data.

1.3 OUTLINE OF REPORT

Chapter 2 presents the background information and previous work on which this thesis is based. The first part describes the basic theory behind simulating liquids using SPH. Thereafter, the CUDA hardware and programming approaches are described. The chapter aims to give the reader the basic understanding of the SPH algorithm and CUDA principles necessary to grasp the following chapters, and can therefore be skipped by readers already familiar with these fields.

Chapter 3 describes how the theory behind the SPH algorithm was implemented in a way that maximizes the utilization of the GPU's capabilities. It also describes the implementation of the other parts necessary to make the simulation practically useful, such as collision handling and the integration technique.

In chapter 4 the results are presented and discussed, in terms of both computational performance and visual aspects. Finally, some suggestions for further improvements are given.

CHAPTER 2

BACKGROUND & RELATED WORK

2.1 RELATED WORK

Computational Fluid Dynamics (CFD) is a well-established research area with a long history. By 1845 Claude-Louis Navier and George Gabriel Stokes had described the dynamics of fluids in the Navier-Stokes equations. In 1977 Gingold and Monaghan introduced SPH in order to simulate non-axisymmetric phenomena in astrophysics (Gingold & Monaghan, 1977). In contrast to grid-based (Eulerian) simulation techniques, SPH models the dynamics of fluids by applying forces to a particle system (a Lagrangian approach), where the applied forces are derived from the Navier-Stokes equations. The core of the method is the use of a smoothing kernel, a function with certain properties which is used to sum up each particle's contribution to various field values in a fluid. SPH was found to be rugged, easily extendable and intuitive to work with. Since then, SPH has been shown to be applicable to a wide variety of fields such as the study of gravity currents near black holes (Evans & Kochanek, 1989), viscous flows (Takeda, Miyama, & Sekiya, 1994), wave propagation (Monaghan & Kos, 1999) and incompressible flows (Monaghan & Humble, 1991).

The recent growth in the computational power of GPUs has resulted in an increased interest in accelerating the simulation process by performing all (Kolb & Cuntz, 2005) or part (Amad, Imura, Yasumoto, Yamabe, & Chihara, 2004) of the computations on the GPU. Harada et al. proposed a method for SPH simulations running on the GPU that uses a flat 3D texture to store the complex data structure in the graphics card's video memory (Harada, Kawaguchi, & Koichiro, 2007). Even though they experienced some limitations due to the design and accessibility of the texture memory, they managed to simulate a particle system containing 60,000 particles in real time and reported an extensive speed-up for complex off-line simulations.

The above GPU implementations were achieved using existing 3D-rendering APIs, DirectX (Microsoft) and OpenGL (Kessenish & Baldwin). Since the original purpose of these APIs was to handle rendering, the implementations need to be posed in the context of polygon rasterization. This leads to difficulties when, for example, implementing complex data structures, which has made this approach cumbersome.

The need for general-purpose computing on the GPU (GPGPU) has lately been recognized by leading graphics card vendors. Compute Unified Device Architecture (CUDA) is a software architecture for general-purpose programming on the GPU, released by NVIDIA in 2007. CUDA allows the user to access the highly parallel performance capabilities of the GPU without the need to go through the whole graphics pipeline. CUDA is supported by the NVIDIA GeForce 8 series and newer NVIDIA cards, as well as some Quadro GPUs and Tesla cards. CUDA has shown excellent potential in parallel processing for a broad spectrum of applications (Che, Boyer, Meng, Tarjan, Sheaffer, & Skadron, 68:1370-1380).

2.2 SMOOTHED PARTICLE HYDRODYNAMICS (SPH)

A fluid is represented by a set of non-ordered particles which hold fluid properties at discrete locations in the fluid. The SPH simulation technique uses interpolation theory to evaluate field properties at certain points in the fluid. A scalar value $A$ can be interpolated at location $\mathbf{r}$ by summing up all particles' contributions,

$$A(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} W(\mathbf{r} - \mathbf{r}_j, h) \tag{1}$$

where $j$ iterates over all particles, $m_j$ is the mass of the particle with index $j$, $A_j$ the approximated scalar quantity at $\mathbf{r}_j$ and $\rho_j$ the density at $\mathbf{r}_j$.

The core of the function is the smoothing kernel $W$, which determines each particle's contribution to the field value. The smoothing kernel has a cut-off radius $h$ such that $W = 0$ when $|\mathbf{r} - \mathbf{r}_j| > h$. The smoothing kernel has to be even (equation 2) and normalized (equation 3) in order to assure second-order accuracy in the interpolation.

$$W(\mathbf{r}, h) = W(-\mathbf{r}, h) \tag{2}$$

$$\int W(\mathbf{r}, h)\, d\mathbf{r} = 1 \tag{3}$$

There are several factors to be considered when designing a suitable smoothing kernel. This is discussed further in section 2.2.2.

The density field of a fluid can be evaluated using the following equation,

$$\rho(\mathbf{r}) = \sum_j m_j W(\mathbf{r} - \mathbf{r}_j, h). \tag{4}$$

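Most fluid equations involve the derivatives of various field quantities. One main advantage of the SPH interpolation technique is that such derivatives only affect the smoothing kernel, since the rest of the variables are constants. The gradient of $A$ is simply

$$\nabla A(\mathbf{r}) = \nabla \sum_j m_j \frac{A_j}{\rho_j} W(\mathbf{r} - \mathbf{r}_j, h) = \sum_j m_j \frac{A_j}{\rho_j} \nabla W(\mathbf{r} - \mathbf{r}_j, h). \tag{5}$$

Likewise, the Laplacian evaluates to

$$\nabla^2 A(\mathbf{r}) = \sum_j m_j \frac{A_j}{\rho_j} \nabla^2 W(\mathbf{r} - \mathbf{r}_j, h). \tag{6}$$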

2.2.1 MODELING FLUID DYNAMICS

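The governing equations for incompressible fluid dynamics are the mass conservation equation (equation 7) and the Navier-Stokes equation (equation 8), which formulates the conservation of momentum,

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0 \tag{7}$$

$$\rho \left( \frac{\partial \mathbf{v}}{\partial t} + \mathbf{v} \cdot \nabla \mathbf{v} \right) = -\nabla p + \rho \mathbf{g} + \mu \nabla^2 \mathbf{v} \tag{8}$$

where $\mathbf{g}$ is the gravitational acceleration, $\mu$ the fluid's viscosity coefficient and $\mathbf{v}$ the velocity. It is important to note that equation 8 represents a simplified version of the Navier-Stokes equation, used for viscous incompressible fluids.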

The conservation of mass is trivially satisfied in an SPH simulation. Since the number of particles is constant and each particle has a constant mass throughout the simulation, the conservation of mass is assured as long as the smoothing kernel is normalized (equation 3).

The right-hand side of the Navier-Stokes equation above consists of three components. The first component ($-\nabla p$) models the pressure, the second ($\rho \mathbf{g}$) represents external forces and the third ($\mu \nabla^2 \mathbf{v}$) models the viscosity of the fluid. The contribution of each component is discussed further in sections 2.2.1.1, 2.2.1.4 and 2.2.1.2, respectively.

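The particles in an SPH system should be considered as point masses, and the forces acting on them have to be described in the form of point forces. However, fluid equations in the literature are often described in terms of force fields. The point force $\mathbf{F}_i$ acting on a particle $i$ occupying the volume $V_i$ in a force density field $\mathbf{f}$ is described in equation 9.

$$\mathbf{F}_i = \int_{V_i} \mathbf{f}(\mathbf{r})\, dV \tag{9}$$

The point force can be approximated using SPH interpolation over one particle (equation 10).

$$\mathbf{F}_i \approx \mathbf{f}(\mathbf{r}_i) \frac{m_i}{\rho_i} \tag{10}$$

Rewriting this relation with Newton's second law in mind gives the connection between applied force and acceleration (equation 11).

$$\mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} = \frac{\mathbf{f}(\mathbf{r}_i)}{\rho_i} \tag{11}$$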

2.2.1.1 PRESSURE

The force acting on a particle due to pressure can be described using Newton's second law (equation 12).

$$\mathbf{f}_i^{pressure} = \rho_i \mathbf{a}_i^{pressure} = -\nabla p(\mathbf{r}_i) \tag{12}$$

Applying the SPH rule (equation 1) to the pressure term results in the following relation (equation 13).

$$\mathbf{f}_i^{pressure} = -\nabla p(\mathbf{r}_i) = -\sum_j m_j \frac{p_j}{\rho_j} \nabla W(\mathbf{r}_i - \mathbf{r}_j, h) \tag{13}$$

The above equation does not produce symmetric forces and is thereby an example of one of the issues with SPH: the relations resulting from an SPH interpolation are not guaranteed to satisfy any physical principles. The symmetry between forces (every action has an equal and opposite reaction) is vital for valid simulation results. Different ways to symmetrize equation 13 have been proposed. Müller et al. suggest using the mean of the interacting particles' pressure values (Müller, Keiser, Nealen, Pauly, Gross, & Alexa, 2003), resulting in equation 14.

$$\mathbf{f}_i^{pressure} = -\sum_j m_j \frac{p_i + p_j}{2 \rho_j} \nabla W(\mathbf{r}_i - \mathbf{r}_j, h) \tag{14}$$

The pressure can then be computed using a modified version of the ideal gas state equation (Desbrun & Cani, 1996),

$$p = k(\rho - \rho_0) \tag{15}$$

where $k$ is a gas constant determined by the speed of sound in the specific liquid and $\rho_0$ is the liquid's rest density.
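The two pressure relations above (equations 14 and 15) can be sketched in C++ as follows. This is a minimal illustration rather than the implementation used in this thesis: the names `pressureFromDensity` and `pressureTerm` are hypothetical, and the kernel gradient is assumed to be supplied by the caller.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Equation 15: pressure from density, with gas constant k and rest density rho0.
double pressureFromDensity(double rho, double k, double rho0) {
    return k * (rho - rho0);
}

// One term of the sum in equation 14: the contribution of neighbour j to the
// pressure force density at particle i. gradW is the kernel gradient evaluated
// at r_i - r_j and is supplied by the caller (an assumption of this sketch).
Vec3 pressureTerm(double m_j, double p_i, double p_j, double rho_j,
                  const Vec3& gradW) {
    const double s = -m_j * (p_i + p_j) / (2.0 * rho_j);
    return {s * gradW[0], s * gradW[1], s * gradW[2]};
}
```

Because the kernel gradient changes sign when the roles of the two particles are swapped, two particles with equal mass and density receive equal and opposite contributions, which is the symmetry the averaged pressure is meant to restore.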

2.2.1.2 VISCOSITY

Viscosity models the way particles with different velocities within the same liquid interact with each other. Applying the SPH interpolation technique to the viscosity term yields the following equation.

$$\mathbf{f}_i^{viscosity} = \mu \nabla^2 \mathbf{v}(\mathbf{r}_i) = \mu \sum_j m_j \frac{\mathbf{v}_j}{\rho_j} \nabla^2 W(\mathbf{r}_i - \mathbf{r}_j, h) \tag{16}$$

This equation suffers from the same asymmetry problem as equation 13, and therefore needs to be modified before being used in the simulation. Müller et al. suggest using the relative velocity between two particles to balance the forces (Müller, Keiser, Nealen, Pauly, Gross, & Alexa, 2003). Viscosity forces depend only on differences in velocity, which makes this a natural approach. This results in equation 17.

$$\mathbf{f}_i^{viscosity} = \mu \sum_j m_j \frac{\mathbf{v}_j - \mathbf{v}_i}{\rho_j} \nabla^2 W(\mathbf{r}_i - \mathbf{r}_j, h) \tag{17}$$

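Clavet et al. present a different approach to modeling viscosity, applying radial pairwise impulses (Clavet, Beaudoin, & Poulin, 2005). The size of the impulses depends on the two particles' speed towards each other (equations 18 and 19).

$$u = (\mathbf{v}_i - \mathbf{v}_j) \cdot \hat{\mathbf{r}}_{ij} \tag{18}$$

$$\mathbf{I} = \Delta t \left(1 - \frac{r_{ij}}{h}\right) \left(\sigma u + \beta u^2\right) \hat{\mathbf{r}}_{ij} \tag{19}$$

Here $\hat{\mathbf{r}}_{ij}$ is the unit vector pointing from particle $i$ to particle $j$, $r_{ij}$ the distance between them and $\Delta t$ the time step. $\sigma$ and $\beta$ are coefficients used to control the viscosity's linear and quadratic dependencies on velocity. Impulses are only applied if $u$ is greater than 0; the proposed algorithm will therefore only cause forces when particles are moving towards each other.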
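The pairwise impulse scheme can be sketched in C++ as follows. This is an illustrative reading of the approach of Clavet et al. under stated assumptions, not code from this thesis: `Particle` and `applyViscosityImpulse` are hypothetical names, both particles are assumed to have equal mass, and the impulse is assumed to be split evenly between them.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

struct Particle {
    Vec3 pos;
    Vec3 vel;
};

// Applies one pairwise viscosity impulse. h is the interaction radius, dt the
// time step, sigma and beta the linear and quadratic viscosity coefficients.
// Equal particle masses are assumed.
void applyViscosityImpulse(Particle& a, Particle& b, double h, double dt,
                           double sigma, double beta) {
    Vec3 d{b.pos[0] - a.pos[0], b.pos[1] - a.pos[1], b.pos[2] - a.pos[2]};
    const double r = std::sqrt(d[0] * d[0] + d[1] * d[1] + d[2] * d[2]);
    if (r <= 0.0 || r >= h) return;  // coincident or outside the radius
    for (double& c : d) c /= r;      // unit vector pointing from a to b

    // Inward radial speed: positive when the two particles are approaching.
    const double u = (a.vel[0] - b.vel[0]) * d[0] +
                     (a.vel[1] - b.vel[1]) * d[1] +
                     (a.vel[2] - b.vel[2]) * d[2];
    if (u <= 0.0) return;            // separating pairs are left untouched

    // Impulse magnitude, applied half to each particle in opposite directions.
    const double impulse = dt * (1.0 - r / h) * (sigma * u + beta * u * u);
    for (int k = 0; k < 3; ++k) {
        a.vel[k] -= 0.5 * impulse * d[k];
        b.vel[k] += 0.5 * impulse * d[k];
    }
}
```

Since the same impulse is subtracted from one particle and added to the other, the total momentum of the pair is unchanged, while their approach speed is reduced.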

2.2.1.3 SURFACE TENSION

The Navier-Stokes model does not include forces due to surface tension. These are therefore often added as a separate term. Surface tension can be seen as a force striving to minimize curvature by applying forces towards the core of the liquid in the direction of the surface normal. The more curved a surface is, the greater the surface tension force that is generated.

Morris proposes the use of a color field to determine the forces acting upon a set of particles due to surface tension (Morris, 2000). The color field is defined as

$$c_S(\mathbf{r}) = \sum_j m_j \frac{1}{\rho_j} W(\mathbf{r} - \mathbf{r}_j, h). \tag{20}$$

This is simply a measure of the particle distribution. The surface normal field of a set of particles is defined as the gradient of the color field (equation 21).

$$\mathbf{n} = \nabla c_S \tag{21}$$

The curvature of a surface can be calculated as

$$\kappa = \frac{-\nabla^2 c_S}{|\mathbf{n}|}. \tag{22}$$

The force due to surface tension can then be calculated using the following equation,

$$\mathbf{f}^{surface} = \sigma_s \kappa \mathbf{n} = -\sigma_s \nabla^2 c_S \frac{\mathbf{n}}{|\mathbf{n}|} \tag{23}$$

where $\sigma_s$ is a scalar coefficient used to scale the amount of surface tension applied. The magnitude of the normal can be used to restrict the surface tension so that it is only applied to parts of the liquid close to a surface.

2.2.1.4 EXTERNAL FORCES

External forces such as gravity and collision forces can be applied directly to the particles, without the use of any interpolation technique. See section 3.4 for a detailed description of the implemented collision handling algorithm.

2.2.2 SMOOTHING KERNELS

The outcome of a simulation in terms of accuracy, speed and stability is greatly affected by the choice of smoothing kernels. As discussed earlier (see section 2.2), it is necessary to choose kernels that are even (equation 2) and normalized (equation 3) in order to limit the interpolation error. In order to obtain stability it is crucial to pick smoothing kernels that are zero with vanishing derivatives at their boundaries (Müller, Keiser, Nealen, Pauly, Gross, & Alexa, 2003).

Apart from these basic constraints it is important to consider computational time when designing a kernel. The kernel has to be evaluated several times for each particle in the simulation at every iteration. Even a small change in the complexity of the kernel can have a severe effect on the simulation's performance.

There are several different smoothing kernels suggested in the literature, designed to achieve different results in terms of speed and particle behavior. The most common approach is to use different kernel types depending on the quantity to be interpolated. Müller et al. suggest the following set of kernels.

$$W_{poly6}(\mathbf{r}, h) = \frac{315}{64 \pi h^9} \begin{cases} (h^2 - r^2)^3 & 0 \le r \le h \\ 0 & \text{otherwise} \end{cases} \tag{24}$$

This kernel has a big computational advantage. It is relatively simple in form, and since the distance variable $r$ only appears squared, the kernel can be evaluated without computing the square root that would otherwise be necessary when calculating the distance between two particles. Müller et al. use this kernel for density computations.

Since the above kernel (equation 24) has vanishing gradients close to its center, it is not useful for calculating pressure forces. The magnitudes of the forces due to pressure depend strictly on the smoothing kernel's gradient (equation 13); the above kernel would hence lead to vanishing forces between particles close to each other. This is an unwanted behavior that tends to cause clustering of particles. The following kernel (equation 25) was designed to address this problem and create strong repelling forces between particles close to each other.

$$W_{spiky}(\mathbf{r}, h) = \frac{15}{\pi h^6} \begin{cases} (h - r)^3 & 0 \le r \le h \\ 0 & \text{otherwise} \end{cases} \tag{25}$$
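The difference between the two kernels near the center can be checked numerically. The sketch below assumes the closed forms published by Müller et al., W_poly6 = 315/(64πh⁹)(h² − r²)³ and W_spiky = 15/(πh⁶)(h − r)³, and differentiates them with respect to r by hand. The function names are hypothetical.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;

// Radial derivative dW/dr of the poly6 kernel (equation 24).
// It is proportional to r and therefore vanishes at the center.
double poly6Derivative(double r, double h) {
    if (r < 0.0 || r > h) return 0.0;
    const double d = h * h - r * r;
    return -945.0 / (32.0 * kPi * std::pow(h, 9)) * r * d * d;
}

// Radial derivative dW/dr of the spiky kernel (equation 25).
// Its magnitude is largest at the center and falls to zero at r = h.
double spikyDerivative(double r, double h) {
    if (r < 0.0 || r > h) return 0.0;
    const double d = h - r;
    return -45.0 / (kPi * std::pow(h, 6)) * d * d;
}
```

For two particles that are almost on top of each other, the poly6 derivative is close to zero, whereas the spiky derivative is close to its maximum magnitude, which is what produces the repelling pressure force.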

As mentioned earlier, Müller et al. use the Laplacian of the smoothing kernel to calculate forces due to viscosity (equation 17). A negative Laplacian would cause the viscosity to increase the difference in velocities rather than smooth out the velocity field. Since both of the above kernels have negative Laplacian values at some points within their cut-off radius, a third kernel with an exclusively positive Laplacian is presented (equation 26).

$$W_{viscosity}(\mathbf{r}, h) = \frac{15}{2 \pi h^3} \begin{cases} -\frac{r^3}{2h^3} + \frac{r^2}{h^2} + \frac{h}{2r} - 1 & 0 \le r \le h \\ 0 & \text{otherwise} \end{cases} \tag{26}$$
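For the viscosity kernel published by Müller et al., the Laplacian has the simple closed form 45/(πh⁶)(h − r), which is positive everywhere inside the cut-off radius. The following C++ sketch uses that closed form, taken from the cited paper as an assumption, together with one term of the symmetrized viscosity sum along a single axis. The function names are hypothetical.

```cpp
// Laplacian of the viscosity kernel: 45 / (pi h^6) * (h - r) for 0 <= r <= h.
double viscosityLaplacian(double r, double h) {
    const double pi = 3.14159265358979323846;
    if (r < 0.0 || r > h) return 0.0;
    const double h3 = h * h * h;
    return 45.0 / (pi * h3 * h3) * (h - r);
}

// One term of the symmetrized viscosity sum for a single velocity component:
// mu * m_j * (v_j - v_i) / rho_j * laplacian(W). With a positive Laplacian the
// term always pulls v_i towards v_j.
double viscosityTerm(double mu, double m_j, double v_i, double v_j,
                     double rho_j, double r, double h) {
    return mu * m_j * (v_j - v_i) / rho_j * viscosityLaplacian(r, h);
}
```

A slower particle next to a faster neighbour receives a positive contribution and the faster one a negative contribution, so the velocity difference shrinks rather than grows.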

2.3 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)

CUDA, or Compute Unified Device Architecture, is NVIDIA's software and hardware architecture for general-purpose programming on the GPU. CUDA was released in 2007 with the purpose of making the parallel processing capabilities of the GPU accessible through an API not related to graphics. A CUDA-compatible GPU can be considered as a device providing a set of parallel co-processing units to the main CPU, the host. From this point on the GPU will be referred to as the device and the CPU as the host.

2.3.1 NVIDIA'S HARDWARE ARCHITECTURE

It is necessary to have a basic understanding of NVIDIA's hardware architecture in order to fully utilize the processing power of a CUDA-compatible GPU.

CUDA-enabled GPUs consist of a number of unified general-purpose processing units referred to as streaming multiprocessors. Each streaming multiprocessor consists of a number of streaming processors. The multiprocessors use a single instruction, multiple data (SIMD) architecture, meaning that each streaming processor, at a given point in time, performs the exact same instruction but on a different set of data.

2.3.1.1 EXECUTION MODEL

A function executed on the device is called a kernel. The same kernel is executed on all of the device's streaming multiprocessors. Each kernel executes over a grid of blocks, where each block consists of a number of threads. It is up to the user to define the dimensions of the grid and the blocks.

FIGURE 1. Illustrates the CUDA execution model. Each kernel is executed over a two-dimensional grid of blocks. A block consists of threads organized in up to three dimensions.

Threads within the same block have access to the same shared memory (see section 2.3.1.2) and can be synchronized through a barrier-like construct. It is, however, not possible to synchronize threads executing in different blocks. The only way to achieve synchronization over all threads is to split the workload into two separate kernels that are executed sequentially. As mentioned earlier, the threads are executed in a SIMD manner. The grid and block organization is typically used to determine which data each thread operates on. All these facts make the grid and block configuration a crucial part of most CUDA implementations.
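How the grid and block organization selects the data for each thread can be illustrated with a host-side C++ sketch. It emulates the common CUDA idiom of computing a global index as `blockIdx.x * blockDim.x + threadIdx.x` for a one-dimensional launch, assuming one thread per particle. That mapping and the function names are assumptions made for illustration, not the configuration used in this thesis.

```cpp
#include <cstddef>

// Number of blocks needed so that blocks * threadsPerBlock >= n.
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Host-side emulation of blockIdx.x * blockDim.x + threadIdx.x.
std::size_t globalIndex(std::size_t blockIdx, std::size_t blockDim,
                        std::size_t threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Counts how many (block, thread) pairs map to a valid particle index.
// Threads in the last block whose index is >= n must do no work.
std::size_t countActiveThreads(std::size_t n, std::size_t threadsPerBlock) {
    std::size_t active = 0;
    const std::size_t blocks = blocksFor(n, threadsPerBlock);
    for (std::size_t b = 0; b < blocks; ++b)
        for (std::size_t t = 0; t < threadsPerBlock; ++t)
            if (globalIndex(b, threadsPerBlock, t) < n) ++active;
    return active;
}
```

Since the number of particles is rarely a multiple of the block size, the grid is rounded up and each thread compares its index against the particle count before touching any data.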

When a kernel is executed, the blocks of its grid are distributed over the available streaming multiprocessors. The threads within a block are then divided into groups of 32, called warps. The threads of a warp are distributed over the streaming processors and executed in parallel. At any point in time, a multiprocessor executes instructions from one selected warp.

2.3.1.2 MEMORY HIERARCHY

There are several different types of memory available on the device, each with different

properties such as accessibility and functionality.

Device memory (VRAM) is the counterpart of the Random Access Memory (RAM) available to the CPU and is divided into global memory, constant memory, texture memory and local memory.

The global memory is the only memory type that gives full access (read/write) to both

the device and the host. It is therefore the only way to handle data transactions from the

device to the host. However, the global memory is not cached, which makes the access

times relatively high. It is therefore desirable to minimize transactions to and from this

memory type.

Both the texture and the constant memory are cached, which substantially reduces

the access times in situations where the same memory is accessed several times. In

addition, the texture memory provides hardware support for certain functions such as

interpolation etc.

The local memory has the same performance characteristics as the global memory, with the difference that it is only accessible from the specific thread that wrote to it. Since it is slow it should be avoided, and it can often be substituted with the multiprocessor's on-chip memory.

There are two types of on-chip memory, shared memory and registers. Both have extremely fast access times. Shared memory is shared between the threads of a block and resides on the streaming multiprocessor that executes the block. It is often used as a manually controlled cache, to which data from the global memory can be copied when it is accessed frequently within the same block (see section 2.3.2.1). Registers are used for local storage of data which is only accessible from a specific thread. The downside, however, is that each streaming multiprocessor has only a very limited amount of on-chip memory. If it is used without care, it might greatly affect the computational performance of an implementation and possibly even prevent an application from executing.

FIGURE 2. Different memory types vary in terms of access speed and accessibility.

Parallel processes generally consume a lot of data. It is therefore extremely important to make the data as easily accessible as possible to prevent memory access time from becoming a bottleneck. This is discussed further in section 2.3.2.1.

2.3.2 PROGRAMMING INTERFACE

There are currently two different interfaces supported for writing CUDA programs: the CUDA driver API and C for CUDA. The first is a lower-level API which offers a very high level of control but also requires more code. C for CUDA, on the other hand, is basically an extension to the C language that exposes the CUDA programming model, and it is the interface I used for my implementation. It provides the programmer with the libraries necessary to transfer data to and from the graphics card, execute kernels, synchronize threads, etc. NVIDIA's CUDA compiler (NVCC) automatically separates the CUDA source files into host code and device code. It then compiles the device code into binary code and leaves the host code to be compiled by the system's standard C compiler.

2.3.2.1 OPTIMIZATION STRATEGIES

Functional CUDA code is far from a guarantee of good performance. The real challenge is to program the CUDA kernels in a way that fully utilizes the GPU's processing capability. There are three main strategies to achieve this: maximizing parallel execution, optimizing memory usage and optimizing instruction usage (NVIDIA Corporation).

One step towards maximizing parallel execution is to provide the streaming multiprocessors with as many warps as possible. This increases the streaming multiprocessors' ability to hide memory latency by switching from warps currently waiting for memory to warps that are ready for execution. However, the on-chip resources limit the number of warps, threads and thread blocks a multiprocessor can handle simultaneously. The on-chip resources available on the GeForce 8 series are summarized below.

• Max 24 warps per multiprocessor

• Max 768 threads per multiprocessor

• Max 8 thread blocks per multiprocessor

• Max 8192 32‐bit registers per multiprocessor

• Max 16384 bytes shared memory per multiprocessor

These resources vary slightly depending on the graphics card available in the machine. However, all card-specific figures presented in this report reflect the resource specifications of the GeForce 8 series, since that was the card type available at Sony Pictures Imageworks during my internship. The ratio of active warps to the maximum number of warps supported by a multiprocessor is called the occupancy. Higher occupancy will not necessarily lead to better performance, but it helps to prevent bottlenecks due to long memory access times.

Memory access times can be reduced by avoiding data transfers to memories with low

bandwidth. Transfers between host and device units should be avoided and transfers

between global and local device memory should be minimized. One way of minimizing

threads' accesses to global memory is to let the threads within a block load parts of

global memory into shared memory. This does however require that threads within

the same block need access to data at the same place in global memory. Another

strategy to reduce global memory access is simply to recalculate data rather than fetch

the result of a previous calculation that has been stored in memory. Moreover, it

is possible to reduce access time by optimizing memory access patterns. It is for example

faster to fetch data that is organized in a sequential order than data that is scattered all

over the memory. Since data required for a specific warp is fetched from the memory at

the same time, a lot can be gained from making sure that the threads within that warp

will require memory that is organized sequentially in memory. Another notable fact

about the device is that it is optimized to fetch data that is organized in groups of four

(traditionally red/green/blue/alpha). Depending on the access pattern, it

is often more efficient to organize and fetch data in groups of four even if one of

the four slots is not used.

As for optimizing instruction usage, some computationally heavy arithmetic functions

can be replaced with less expensive versions especially modified to perform well on the

device. This does however mean some loss of precision and should only be done in cases

where it will not affect the quality of the end result.

All these optimization techniques are discussed in more depth in NVIDIA’s CUDA

programming guide (NVIDIA Corporation).

CHAPTER 3

IMPLEMENTATION

Knowing the advantages and weaknesses of the GPU, it is possible to implement the

algorithm in a way that will result in maximum possible utilization of the available

processing resources. Even though the SPH algorithm is the driving element behind the

whole implementation there are several important parts necessary to set up and control

the behavior of the simulation. Figure 3 gives an overview of the different stages

involved in executing one iteration of a simulation and where the data is processed. Each

stage is described in more detail later in this chapter.

FIGURE 3. Illustrates the different stages involved in the execution of one iteration. The left side lists operations being processed on the host while the right side lists operations being executed on the device.

Note that a CPU version of the entire algorithm has been implemented in addition to the

GPU version. There are two main reasons for this. Firstly, implementing an identical

version running on the CPU makes it possible to investigate the computational gain of

utilizing the GPU’s processing power. Secondly, only a small part of the machines

currently available at Sony have CUDA enabled graphics cards installed.

3.1 SPH ALGORITHM

When considering the SPH algorithm from a practical perspective, it soon becomes

obvious that interpolating particle values is going to be the most resource demanding

part. It requires stepping through each particle to sum up its neighbors' contributions to

the current value. Note that a particle is only considered a neighbor to another particle if

it lies within the specified kernel radius of the smoothing kernel. Bearing in mind that a

full scale simulation can consist of hundreds of thousands of particles and that it is not

unusual that particles have around 20‐50 neighbors, a lot can be gained by speeding up

this process. Firstly, it will be necessary to find a data structure capable of quickly finding a

particle’s neighbors. Secondly, the workload coupled to the interpolation has to be

processed in a way that fully utilizes the GPU’s capacity.

3.1.1 DATA STRUCTURE

As discussed earlier, the nature of the SPH algorithm makes it crucial to efficiently find a

particle’s neighbors within the kernel radius. To exhaustively search through and

compare every particle in the set is a waste of resources and practically unthinkable for

anything else than extremely small particle sets. It is therefore necessary to sort the

particles into some kind of data structure that can be used to limit or guide the search

algorithm through the particle set. There are numerous approaches to doing this;

Kd‐trees and various hash tables are examples of possible solutions. However, I

chose to sort the particles, based on position, into a uniform and wrapped grid

structure and then limit the search for particles within the kernel radius to the

corresponding and neighboring cells. By choosing a cell size equal to the kernel radius,

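no points are neglected. The size and dimension of the grid are static throughout the

simulation, but each point is hashed into the grid using a simple hashing function:

$$c_x = \lfloor p_x / l \rfloor \bmod n_x \quad (27)$$

$$c_y = \lfloor p_y / l \rfloor \bmod n_y \quad (28)$$

$$c_z = \lfloor p_z / l \rfloor \bmod n_z \quad (29)$$

$$\mathit{hash} = c_x + c_y n_x + c_z n_x n_y \quad (30)$$

where $\mathbf{p}$ is the particle position, $l$ the cell size and $n_x$, $n_y$, $n_z$ the number of cells along each axis.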

The above described structure makes it possible to sort an infinitely large geometric span

into the grid without changing its dimensions. Even though the hashing leads to distance

comparisons between particles that are potentially far away from each other this has

turned out to be negligible considering the computational time.
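As an illustration of the wrapped hashing, the following minimal Python sketch maps a position to a flat cell index. It is not taken from the implementation: the function name `cell_hash`, the parameters `cell_size` and `grid_dim`, and the row‐major flattening of the three wrapped cell indices are assumptions made for this example.

```python
import math


def cell_hash(position, cell_size, grid_dim):
    """Map a 3D position to a flat cell index in a wrapped uniform grid.

    The flattening order (x fastest, then y, then z) is an assumption.
    """
    nx, ny, nz = grid_dim
    # Integer cell coordinates; floor keeps negative positions consistent,
    # and the modulo wraps cells outside the grid back into it.
    cx = math.floor(position[0] / cell_size) % nx
    cy = math.floor(position[1] / cell_size) % ny
    cz = math.floor(position[2] / cell_size) % nz
    return cx + cy * nx + cz * nx * ny
```

With a 4x4x4 grid and unit cell size, the positions (0.5, 0.5, 0.5) and (4.5, 0.5, 0.5) end up in the same cell. This wrapping is what allows an unbounded region to be sorted into a grid of fixed size, at the cost of some distance tests between particles that are far apart.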

When a hashing key has been calculated for each particle, the particles are

sorted and the position array is reordered based on grid id. This means that

particles hashed into the same cell will be placed next to each other in the position array.

This grid structure is represented by a set of arrays: one array storing the position of

each particle, sorted in order of their hash keys, and two arrays used to store the

beginning and end of each cell.
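The three arrays can be built on the CPU roughly as follows. This is a sketch under stated assumptions rather than the thesis code: `build_grid` is a hypothetical name, the nested `cell_hash` repeats an assumed wrapped hashing scheme, empty cells are marked with -1, and the end index is exclusive.

```python
import math


def build_grid(positions, cell_size, grid_dim):
    """Sort particles by cell hash and record where each cell starts and ends."""
    nx, ny, nz = grid_dim

    def cell_hash(p):
        # Wrapped cell coordinates flattened to one index (assumed layout).
        cx = math.floor(p[0] / cell_size) % nx
        cy = math.floor(p[1] / cell_size) % ny
        cz = math.floor(p[2] / cell_size) % nz
        return cx + cy * nx + cz * nx * ny

    # Stable sort of particle indices by hash key.
    order = sorted(range(len(positions)), key=lambda i: cell_hash(positions[i]))
    sorted_positions = [positions[i] for i in order]
    sorted_keys = [cell_hash(p) for p in sorted_positions]

    num_cells = nx * ny * nz
    cell_start = [-1] * num_cells  # -1 marks an empty cell
    cell_end = [-1] * num_cells    # exclusive end index
    for i, key in enumerate(sorted_keys):
        if i == 0 or sorted_keys[i - 1] != key:
            cell_start[key] = i    # first particle of a new cell
        cell_end[key] = i + 1      # keeps advancing until the cell changes
    return sorted_positions, cell_start, cell_end
```

All particles of cell `c` are then found in `sorted_positions[cell_start[c]:cell_end[c]]`, so a neighbor search only has to visit the slices of the particle's own cell and its adjacent cells.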

FIGURE 4. The left side illustrates six particles’ spatial distribution in relation to a grid system; the dashed

lines symbolize cells that will be wrapped; the gray circles denote the particles’ interaction radius. The top

array on the right side demonstrates how the x and y positions of the particles are stored sequentially in a flat

array structure. Each particle is given a hash number (second array) and the particle position array is sorted

based on this number (third array). Note how the particle in cell (1, 3) is given the same hash as the particle in

cell (1, 1). The last two arrays are used to keep track of where each cell starts and ends. Only the last three

arrays are necessary to represent the particles sorted into the grid structure.

As mentioned earlier, data is accessed faster if organized into groups of four. For that

reason, all arrays storing three‐dimensional data, such as position, velocity,

acceleration and force, are stored with an extra “dummy” data value. This obviously

requires more memory but it decreases memory access time, which has been

prioritized in this case.
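The padded layout can be illustrated with a few lines of Python. This is only a sketch of the memory layout, assuming 32‐bit floats and a dummy value of zero; `pack_float4` is a hypothetical helper and not part of the implementation.

```python
import struct


def pack_float4(vectors, dummy=0.0):
    """Pack 3D vectors as consecutive groups of four 32-bit floats."""
    buf = bytearray()
    for x, y, z in vectors:
        # Each element occupies 16 bytes: x, y, z and one unused slot.
        buf += struct.pack("<4f", x, y, z, dummy)
    return bytes(buf)
```

Element `i` then always starts at byte offset `16 * i`, so every particle can be fetched as one aligned group of four values, at the price of one third more memory than a tightly packed array of three floats.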

3.1.2 SPH AND PARALLEL PROCESSING

The computations of density, pressure and the different forces acting on the particles are all

evaluated on the device. Note that computations bound to the collision handling are

processed on the host. The reasons for not processing the collision handling on the

device are problems with storing and handling collision object data in

device memory and a lack of implementation time.

Since the result for each particle is independent of the result of the other particles, it

is possible to distribute the workload so that each particle gets its own thread. Little

would be gained by splitting the computations necessary for one particle over multiple

threads, since it would not change the number of registers used by each thread. The

other option, to let one thread handle multiple particles, would lead to a far too high

register count per thread and thereby limit the number of blocks simultaneously loaded

onto a streaming multiprocessor. Therefore, one thread per particle seems to be

both the most intuitive and the most effective solution.

3.1.2.1 GRID AND BLOCK DIMENSION

When it comes to finding the most appropriate grid and block configuration it is important

to determine if there are any dependencies between threads that can be taken

advantage of. For example, if the threads can be organized into blocks where each

thread will require the same data from global memory, a lot can be gained by loading

that specific part of the global memory into the shared memory. Such a relation can be

found between particles sorted into the same cell, since they have the same potential

neighbors.

However, the number of particles in one cell can vary greatly: some cells may be

empty or contain only a few particles, whereas other cells may contain hundreds or

even thousands of particles. Since the number of threads in a block has to be static

throughout a kernel, this uneven distribution makes it hard to take advantage of this

similarity by using the shared memory.

Nonetheless, a certain level of similar or even identical data dependencies can be

achieved by sorting the particle array on cell index, dividing the array into equally

big chunks and distributing them over the blocks. Some blocks will contain particles from

several different cells with rather different data dependencies, while other blocks will

contain particles from only one cell with very similar data dependencies. Even if this

would not make it possible to utilize the shared memory it will at least lead to better

coherence in the memory reads. In addition, this brings up the possibility to make use of

the texture memory. Since the above described setup will result in a lot of identical

memory requests close to each other (in time) it has great potential of taking advantage

of the texture memory’s ability to cache data. Accessing data in the cached memory is

nearly as fast as reading from on‐chip memory. By binding the global memory arrays to

textures and using texture lookups to fetch data, the memory access time is improved

drastically.

The conclusion is that since we cannot find a way to predict the spatial limits of

particles sorted into a specific block, the best solution would be a flat grid and block

configuration.
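A flat configuration amounts to a one‐dimensional grid of one‐dimensional blocks, with one thread per entry of the sorted particle array. The following Python sketch mimics the index arithmetic; the names `launch_config` and `particle_index` and the default of 64 threads per block are assumptions for the example.

```python
def launch_config(num_particles, threads_per_block=64):
    """Return (number of blocks, threads per block) for a flat 1D launch."""
    # Ceiling division so that the last, partially filled block is included.
    num_blocks = (num_particles + threads_per_block - 1) // threads_per_block
    return num_blocks, threads_per_block


def particle_index(block_idx, thread_idx, threads_per_block, num_particles):
    """Global particle index handled by a thread, or None for padding threads."""
    i = block_idx * threads_per_block + thread_idx
    return i if i < num_particles else None
```

Because the particle array is sorted on cell index, consecutive threads in a block work on particles from the same or adjacent cells, which is what makes the cached texture reads effective. Threads in the last block whose index falls beyond the particle count simply do nothing.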

3.1.2.2 OCCUPANCY

To fully take advantage of the GPU’s parallel processing capability it is necessary to

maximize the number of active threads on each multiprocessor. As discussed in section

2.3.2.1, the on‐chip resources limit the number of threads a multiprocessor can handle

simultaneously. The most limiting factor in this implementation turned out to be the

number of registers available on a multiprocessor. In the original kernel used to

interpolate the forces acting on the particles, each thread consumed 28 registers. With

that number of registers per thread and 64 threads per block, the streaming

multiprocessor would only be able to handle 8 warps simultaneously out of the

maximum of 24 active warps. This leads to an occupancy value of 33%. Such a low

occupancy value leaves the streaming multiprocessor with few warps to choose between,

so there is a major risk that memory access times will limit the GPU’s

performance. By moving the least accessed data from the registers to the private global

memory and thereby decreasing the register use from 28 to 20 the occupancy value can

be increased to 50%. Note that such a change can potentially harm the performance of a

kernel since it will increase the access time to the data moved to the global memory. To

fully predict the outcome of such a change is almost impossible. However, extensive testing and

timing of the kernels showed that this change led to significant speedups.
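The occupancy figures quoted above can be reproduced with a short calculation. The sketch below is a simplification: it assumes a warp size of 32 and a limit of eight resident blocks per multiprocessor (the value given in NVIDIA's documentation for this hardware generation), and it ignores shared memory usage and register allocation granularity. The function name `occupancy` is hypothetical.

```python
WARP_SIZE = 32


def occupancy(regs_per_thread, threads_per_block,
              max_warps=24, max_threads=768, max_blocks=8, max_regs=8192):
    """Return (active warps, occupancy) for one multiprocessor."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # The number of resident blocks is set by the scarcest resource.
    blocks = min(
        max_blocks,
        max_warps // warps_per_block,
        max_threads // threads_per_block,
        max_regs // (regs_per_thread * threads_per_block),
    )
    active_warps = blocks * warps_per_block
    return active_warps, active_warps / max_warps
```

With 64 threads per block, `occupancy(28, 64)` gives 8 active warps (33%), since only four blocks of 1792 registers fit into the 8192 available registers. `occupancy(20, 64)` gives 12 active warps (50%), since six blocks of 1280 registers fit.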

3.1.2.3 KERNELS

Since the different forces acting on a particle all depend on density and it is not possible

to synchronize threads across the blocks of a kernel, it is necessary to first execute the density

computations in a separate kernel. This is the only way to guarantee that the density for

all particles has been evaluated before proceeding to calculate forces. The same type of

division into separate kernels is necessary to guarantee that all particles have been

sorted into the grid before computing density. The process order can be summarized as:

1. Load particle position and velocity arrays on to the device.

2. Execute kernel to sort particles into grid structure.

3. Execute kernel to compute density.

4. Execute kernel to compute forces and corresponding accelerations.

5. Load acceleration array from device to host.

3.1.2.4 MEMORY RESTRICTIONS

Regardless of which graphics card is available, there will always be a restriction on how

much data can be loaded onto the device. This limits the number of

particles that can be processed in a single kernel call. One way to handle this issue is to

split the workload into different kernels and execute them separately. Finding an

appropriate way to split the work load is not trivial. It is necessary to find a way to evenly

distribute the work load amongst multiple kernels and still limit and keep track of each

kernel’s data dependencies.

In our case, we do have a clear spatial restriction of data dependencies. If the grid

structure is divided into coherent chunks of cells the particles inside each chunk will only

require particle data from either inside the same division or cells on the outer border of

the division. However, the uneven distribution of particles complicates the process of

splitting a set of particles into even chunks. One approach would be to start with one cell

and its neighboring cells and then step by step start to expand the set of included cells,

continually checking so that the included particles do not exceed the maximum amount

that can be loaded onto the device. There are however some issues with such an

approach. Firstly, particle sets with particles distributed over a very limited set of cells

will require copying the same particle data to more than two different kernels and can

be seen as a waste of resources. It is even theoretically possible that the particle count within a

cell and its neighbors exceeds the device’s particle limit. Secondly, there is no guarantee,

nor is it even probable, that such an approach would be able to find the optimal split. As a

consequence there is a risk of using more kernels than necessary, which would greatly

harm the performance of the algorithm.

It is probably possible to modify and fine tune the above described algorithm to find

solutions to these problems. However, my implementation does not support splitting

work load between kernels and the maximum amount of particles in a simulation is

thereby limited. On the specific card available at Sony the number of particles is limited

to about 3 million. This is a relatively high limit considering that typical simulations

rarely exceed a couple of hundred thousand particles. This and the limited time

frame for this project were the main reasons for not putting more effort into resolving

this potential problem.

3.1.2 NORMALIZE DENSITY

A consequence of approximating the density using equation 4 is that particles in the

outer bounds of a liquid will receive a lower density value than other particles. This is

due to the lower number of neighbors. It is in other words impossible to arrange a set

of particles in a structure resulting in uniform density. As a result, the pressure

force continuously tries to compensate for the difference in density by pushing particles

on the surface towards the core of the liquid, preventing the liquid from reaching a rest state.

One way of compensating for the lower density values near the surface is to use a

“normalized” value which is calculated using the neighboring particles’ density values

(equation 31).

∑ , (31)

This evens out the density in areas where the density is changing rapidly and

compensates for the low density of particles near the surface. However, normalizing the

density field requires an additional loop through all the particles and will thereby lead to

a considerable amount of additional computations. It is therefore implemented as an

option to be considered in cases where problems due to an uneven density field are

experienced.

3.1.3 SMOOTHING KERNELS

Choosing an appropriate set of smoothing kernels is crucial for the outcome of the

simulation. A badly chosen set of kernels will harm the performance of the simulation,

not only in terms of computational time but, more importantly, because it may cause

inappropriate particle behavior.

I have experimented with multiple kernel sets, both sets proposed in the literature

(Clavet, Beaudoin, & Poulin, 2005) and my own suggestions. However, the best performance

was achieved with the set proposed by Müller et al., described in section 2.2.5; hence

these are the kernels used in the final implementation.

3.2 INTEGRATION METHOD

The integration step is what drives the simulation forward by integrating the particles’

attributes (position and velocity) to move the particles through space. The discontinuous

and non‐linear nature of most computer simulations makes it impossible to find an exact

solution to differential equations of this sort. There are however several methods that

can be used to find approximate solutions.

The Euler method simply updates the particles’ velocities using the result of the

applied forces, and then uses the velocity to move the particles to a new position. The

simplicity and intuitive approach of Euler has made it a commonly used integration

method. However, the error arising from the Euler method is directly proportional to the

step size, which can lead to numerical instability under stiff conditions. It is therefore

often necessary to use higher‐order techniques to maintain stability and accuracy

throughout a simulation. Runge‐Kutta is a commonly used higher‐order integration

technique which uses a weighted average of a particle’s velocity at different points

between its current and next position. The higher level of stability and accuracy comes at

the cost of calculating the resulting forces/velocities at additional times.

The implemented SPH solver supports both Euler and a fourth‐order Runge‐Kutta

method.
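The difference between the two integrators can be sketched for a single scalar degree of freedom. The code below is an illustration and not the solver's implementation: `accel(x, v)` is an assumed callback returning the acceleration, the Euler variant moves the particle with the freshly updated velocity, and the Runge‐Kutta variant is the classical fourth‐order scheme.

```python
def euler_step(x, v, accel, dt):
    """Update the velocity from the acceleration, then move with that velocity."""
    v_new = v + accel(x, v) * dt
    x_new = x + v_new * dt
    return x_new, v_new


def rk4_step(x, v, accel, dt):
    """Classical fourth-order Runge-Kutta step for dx/dt = v, dv/dt = accel."""
    k1x, k1v = v, accel(x, v)
    k2x = v + 0.5 * dt * k1v
    k2v = accel(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
    k3x = v + 0.5 * dt * k2v
    k3v = accel(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
    k4x = v + dt * k3v
    k4v = accel(x + dt * k3x, v + dt * k3v)
    # Weighted average of the four slope estimates.
    x_new = x + dt / 6.0 * (k1x + 2.0 * k2x + 2.0 * k3x + k4x)
    v_new = v + dt / 6.0 * (k1v + 2.0 * k2v + 2.0 * k3v + k4v)
    return x_new, v_new
```

The Runge‐Kutta step evaluates the acceleration four times per step instead of once. In an SPH solver each evaluation means a full density and force computation, which is the additional cost referred to above.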

3.2.1 STEP SIZE

The time difference between two iterations in a simulation is vital for the outcome. A too

big step size will lead to instabilities and poor accuracy, while a too short step size will

lead to an unnecessarily long computation time. The challenge is to find a step size that is

just short enough to give an acceptable level of accuracy. The distance a particle is moved

between two iterations affects the accuracy of the approximation of the

particle’s path: the longer the distance a particle moves between two time steps,

the bigger the error and instability of the simulation become. This is a common

problem encountered when explicit time‐marching schemes are used to find numerical

solutions to differential equations. Courant et al. proposed a partial solution to this

problem by using the velocity to constrain the size of the time step (Courant, Friedrichs,

& Lewy, 1928); this constraint is often referred to as the CFL‐condition.

The level of accuracy can be controlled by using the fastest moving particle to

determine the step size. In my implementation the smoothing kernel’s cut‐off radius is

used to limit the allowed maximum distance a particle can move during one time step.

The time step is calculated as

$$\Delta t = \lambda \frac{h}{v_{max}} \quad (32)$$

where $h$ is the cut‐off radius, $v_{max}$ the speed of the fastest moving particle and $\lambda$ is a user defined coefficient used to control the strictness of the displacement

condition. The above expression is used in combination with a user defined minimum

and maximum step size to handle certain extreme cases and to be able to limit the

maximum processing time for one iteration.
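A possible reading of this step size rule is sketched below. The exact expression is an assumption: the step is taken as the strictness coefficient times the cut‐off radius divided by the speed of the fastest particle, and then clamped to the user defined interval. All names are hypothetical.

```python
import math


def adaptive_time_step(velocities, cutoff_radius, strictness, dt_min, dt_max):
    """Limit the step so the fastest particle moves at most
    strictness * cutoff_radius, clamped to [dt_min, dt_max]."""
    v_max = max(
        (math.sqrt(vx * vx + vy * vy + vz * vz) for vx, vy, vz in velocities),
        default=0.0,
    )
    if v_max == 0.0:
        # Nothing moves, so only the upper bound applies.
        return dt_max
    dt = strictness * cutoff_radius / v_max
    return min(max(dt, dt_min), dt_max)
```

The lower bound keeps a single very fast particle from stalling the simulation, while the upper bound caps the step when all particles are nearly at rest.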

3.3 PARTICLE SOURCES

Particles are added to the simulation by initializing particles within the bounds of

user‐defined source objects. These objects are represented by signed distance fields

(Bærentzen & Aanæs, 2002). Signed distance fields are discrete scalar fields where each

voxel contains the shortest distance to the surface of an object. A positive value

indicates that the voxel is outside an object and a negative value that it is located inside

an object.

The initial distance between the particles within the bounding volume determines not

only the amount of particles but also the initial behavior of the liquid. Birthing particles

close to each other (relative to the interaction radius) will lead to a higher initial

density than adding particles with a long distance between each other. In most cases it is

desirable to initialize particles with an initial mean density close to the rest density. Such

a particle set would neither expand nor contract but simply fill up the user‐defined

volume which makes it easy to handle.

Particle source objects are passed to the solver at every frame. Particles are

added to the simulation if their interpolated distance value is negative and they

are at least a user‐specified distance away from any other particle in the simulation. The particles

are uniformly distributed within the defined source object and organized in a tetrahedral

pattern. Their initial velocities are set according to a separate vector field passed in

together with the distance field.
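The sign test used for seeding can be illustrated with an analytic distance function. The sketch below deviates from the implementation in several labeled ways: it uses an analytic sphere instead of a sampled and interpolated distance field, a simple cubic lattice instead of a tetrahedral pattern, and it omits the minimum distance check against existing particles. The names `sphere_sdf` and `seed_particles` are hypothetical.

```python
import math


def sphere_sdf(p, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius


def seed_particles(sdf, lower, upper, spacing):
    """Return the lattice points in the box [lower, upper] that lie inside
    the object, i.e. where the signed distance is negative."""
    counts = [int(round((hi - lo) / spacing)) + 1 for lo, hi in zip(lower, upper)]
    points = []
    for i in range(counts[0]):
        for j in range(counts[1]):
            for k in range(counts[2]):
                p = (lower[0] + i * spacing,
                     lower[1] + j * spacing,
                     lower[2] + k * spacing)
                if sdf(p) < 0.0:
                    points.append(p)
    return points
```

The `spacing` parameter plays the role of the initial particle distance: a smaller spacing gives more particles and a higher initial density for the same source object.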

3.4 COLLISION DETECTION

Objects in the particles’ environment will interact with and restrict the particles’ movement

in different ways. Such objects will from here on be referred to as collision objects. This

interaction between particles and external objects has to be handled in an efficient yet

reasonably accurate manner. It requires not only a suitable collision handling

algorithm but also a way to represent the collision objects in a precise and efficient

manner.

3.4.1 COLLISION OBJECTS

Collision objects are represented by a signed distance field and a velocity field, just like

the particle source objects described in section 3.3. The surface normal is approximated

and stored in a separate field by calculating the gradient of the distance field. The

velocity field will be searched through to determine if the collision object contains any

moving points at the current iteration or if it can be flagged as a static object.

3.4.1.1 ADVECTION OF COLLISION OBJECTS

Collision objects are passed to the simulation at every frame and the solver might be

forced to run substeps to acquire stability (see section 3.2.1). This leads to an

unacceptable discontinuity in the collision object representation. However, the

corresponding velocity field can be used to approximate a collision object’s position at a

specific substep. The basic idea is to use the velocity field to track the movement of the

distance field values. This is a very similar approach to the technique Doyub et al.

proposed for achieving Eulerian motion blur (Doyub & Hyeong‐Seok, 2007). It is a

relatively expensive operation, especially if a large number of substeps are required, but

algorithms of a similar sort have, due to their isolated and repetitive nature, proven

suitable for processing on the GPU (Fridlund, 2009). Fridlund’s implementation was

available at Sony Pictures Imageworks and has been used as a base to track the collision

objects between frames.

3.4.2 COLLISION ALGORITHM

All particles are checked for possible collision at every iteration. This is done by first

checking whether the particle's new position is within the bounds of any

collision object and, if so, interpolating its distance value from the distance field. If the

value is negative a collision has occurred.

There are basically two different reasons for collision: a particle moves into an object

or an object moves into a particle. The main difference when handling the two cases is

that if a particle moves into an object the point of collision can be calculated using linear

interpolation,

(33)

where is the particle’s new position, and the distance from the collision object

to and respectively and the velocity of the particle.
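For the case where a particle moves into an object, the linear interpolation can be sketched as follows. This is an assumed formulation: it is written in terms of the old and new positions and their signed distances, with the displacement between the two positions playing the role of the velocity times the step size. The name `collision_point` is hypothetical.

```python
def collision_point(x_old, d_old, x_new, d_new):
    """Linearly interpolate the point where the signed distance crosses zero.

    d_old is the distance at the old position (outside, >= 0) and
    d_new the distance at the new position (inside, < 0).
    """
    if not (d_old >= 0.0 > d_new):
        raise ValueError("expected d_old >= 0 (outside) and d_new < 0 (inside)")
    s = d_old / (d_old - d_new)  # fraction of the step travelled before impact
    return tuple(a + s * (b - a) for a, b in zip(x_old, x_new))
```

The interpolation treats the distance as varying linearly along the particle's path, which is only an approximation near curved surfaces but is cheap to evaluate.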

Determining the point of impact is harder if the object moves into the particle.

However, in that case the particle can be pushed out of the collision object by using the

gradient and the point on the surface where the particle surfaces can be used as the

collision point. When the point of impact has been determined, the particle’s velocity

is divided into two components, velocity in the direction of the surface normal ($\mathbf{v}_n$) and

velocity along the tangent of the object ($\mathbf{v}_t$),

$$\mathbf{v}_n = (\mathbf{v} \cdot \mathbf{n})\,\mathbf{n} \quad (34)$$

$$\mathbf{v}_t = \mathbf{v} - \mathbf{v}_n. \quad (35)$$

The normal component $\mathbf{v}_n$ is damped and reflected,

(36)

Friction is modeled by applying an opposing force acting over time to $\mathbf{v}_t$,

(37)

If the velocity of the collision object (along its surface normal) is greater than $\mathbf{v}_n$ then

$\mathbf{v}_n$ is set to match the collision object's velocity.

The particle is then moved the distance between the collision point and its new position in the direction of its newly

calculated $\mathbf{v}_n$ and $\mathbf{v}_t$. Figure 5 illustrates how the involved parameters relate.
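The velocity handling at a collision can be sketched as follows. The decomposition into a normal and a tangential component is standard, but the response itself is an assumption made for this example: the normal component is reflected and scaled by a hypothetical `damping` coefficient, and friction is simplified to scaling the tangential component by `1 - friction` rather than applying an opposing force over time. The surface normal `n` is assumed to have unit length and the collision object is assumed to be static.

```python
def collision_response(v, n, damping, friction):
    """Split v into normal and tangential parts, reflect and damp the normal
    part, and scale down the tangential part (simplified friction)."""
    v_dot_n = sum(a * b for a, b in zip(v, n))
    v_n = tuple(v_dot_n * c for c in n)          # component along the normal
    v_t = tuple(a - b for a, b in zip(v, v_n))   # component along the surface
    v_n_new = tuple(-damping * c for c in v_n)   # reflected and damped
    v_t_new = tuple((1.0 - friction) * c for c in v_t)
    return tuple(a + b for a, b in zip(v_n_new, v_t_new))
```

A `damping` value of 1 gives a perfectly elastic bounce and 0 removes the normal velocity completely, while `friction` between 0 and 1 controls how much of the sliding motion survives the impact.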

FIGURE 5. Illustrates the different values involved in the collision handling algorithm.

CHAPTER 4

RESULTS

This chapter will focus on the results of the previously described implementation. The

results can be judged both by how well the particles’ behavior manages to represent a

fluid and by how computationally efficient the implementation is.

4.1 PARTICLE BEHAVIOR

One of the most challenging parts of achieving good particle behavior has been to

scale the forces between particles. If the force is too strong it will create a hyperactive

particle behavior and prolong the simulation time by increasing the number of iterations.

On the other hand, a too weak force would not be able to maintain the distance

between particles. This becomes extra obvious when a strong gravitational force is acting

upon the particles and pushing them towards a collision object. If the force is too weak

the particles will be pressed flat onto the object, while a too strong force would cause an

“explosion” of particles. It is important to find a balance: a force strong enough to

prevent gravity from compressing the liquid, yet not overcompensating for

particles that are too close to each other.

The implemented algorithm manages to produce particle sets that well represent

basic fluid behavior. Figure 6 shows the result of a simulation of a liquid first being

poured into a cup and then poured from the cup onto a ground plane. Approximately 300,000

particles were used in the simulation and it took about 12 seconds per frame to process

it.

FIGURE 6. Illustrates the result of a simulation involving about 300,000 particles, organized in time order,

from left to right and top to bottom. The particles have been transformed into a polygon surface before being

lit and rendered.

Figure 7 shows the result of a simulation consisting of 320,000 particles. A big solid

sphere of liquid is dropped into a tank. A while later when the liquid is close to its rest

state, a second sphere of liquid is dropped into the tank. The simulation took about nine

seconds per frame to process.

FIGURE 7. Illustrates the result of a simulation consisting of approximately 320,000 particles. The particles

are organized in time order, from left to right and top to bottom. The particles have been transformed into a

polygon surface before being lit and rendered.

The above simulations show that the particles manage to mimic various important fluid

behaviors. They do “fill up” containers and react to changes in their environment. In

addition, there seems to be a level of “randomness” on a detailed level, creating

interesting shapes and irregular patterns. The rather complex liquid pattern on the

ground in the lower right part of figure 6 is a good example.

The particles’ ability to maintain a certain distance from each other is an absolute

requirement in order to obtain good particle behavior since this is what “drives” the

whole simulation. The lower left part of figure 7 is a good example of how a set of

particles near a rest state reacts when a second set of particles suddenly is added.

It is important to note that my implementation only produces a set of particles

describing a liquid. To transform a set of particles into a polygon surface is far from trivial

but is not considered within the scope of this thesis.

4.2 PERFORMANCE

The computational time strongly depends on several parameters: the number of

particles, the radius of the smoothing kernel and the spatial distribution of the particles.

A CPU version of the algorithm was implemented in order to easily investigate the

performance gained by utilizing the GPU.

A benchmark test was set up to compare execution times between the two

implementations. Particles were initialized within the bounds of a sphere at an initial

mean density equal to the rest density. No gravity was applied and no collision objects

were included in the setup. Each liquid was simulated for 4 seconds (100 frames, 25

frames/second); the time values presented in figure 8 are the mean execution time per

frame. As the figure shows, the GPU accelerated algorithm is approximately 40 times

faster than the one running solely on the CPU.

FIGURE 8. Illustrates the difference in terms of execution time when processing the simulation on the GPU

versus the CPU. The time values presented are the mean execution time per frame, based on a 100 frame long

simulation.

Number of particles    25k     100k    200k    400k    600k
GPU (seconds/frame)    0.03    0.10    0.18    0.35    0.49
CPU (seconds/frame)    1.32    3.40    6.36    12.31   20.18
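The speedup per particle count follows directly from the values of figure 8. The short calculation below assumes that the bar labels have been read correctly, with the five small values belonging to the GPU and the five large values to the CPU.

```python
particles = ["25k", "100k", "200k", "400k", "600k"]
gpu_seconds = [0.03, 0.10, 0.18, 0.35, 0.49]
cpu_seconds = [1.32, 3.40, 6.36, 12.31, 20.18]

# Speedup is the CPU time divided by the GPU time for the same particle count.
speedups = [cpu / gpu for cpu, gpu in zip(cpu_seconds, gpu_seconds)]
for n, s in zip(particles, speedups):
    print(f"{n:>5}: {s:.1f}x")
```

Under that assumption the speedup lies between roughly 34 and 44 times across the tested particle counts, which agrees with the figure of approximately 40 times stated above.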

Figures 9 and 10 show how the execution time is divided amongst the different parts of a

typical simulation executed on the CPU and the GPU respectively. Note that this is only

an example and the processing time of the different parts varies a lot depending on

several factors, such as particle distribution and number of collisions. However, it is

clear that by using the GPU to accelerate the sorting, density and force computations the

processing of the collision handling has become a large part of the total processing time.

Sorting particles into grid: 5%
Compute density: 21%
Compute forces: 73%
Handle collisions: 1%

FIGURE 9. Shows how the processing time is distributed amongst the different parts of a typical simulation

executed on the CPU.

Sorting particles into grid: 3%
Compute density: 14%
Compute forces: 48%
Handle collisions: 35%

FIGURE 10. Shows how the processing time is distributed amongst the different parts of a typical simulation

executed on the GPU.

4.3 DISCUSSION

The main goal of this thesis was to investigate how the parallel processing power of a

CUDA enabled graphics card can be used to speed up an SPH simulation. By looking at figure 8

in the previous section it is easy to see that utilizing the GPU’s processing capabilities led

to a performance gain. It is possible that the CPU implementation could be fine tuned in

a way that would increase its performance slightly, hence the exact values should be

looked upon with a certain level of skepticism. However, there is no doubt that a

performance gain has been achieved.

The SPH algorithm is in many ways suitable to process in a parallel manner. It requires

a large number of rather independent elements to be processed in a very similar way

which fits well into the CUDA principle of executing the same instructions on different

data. However, the unpredictable spatial distribution of particles complicates the matter.

Firstly, it complicates the process of splitting a too resource demanding kernel into to a

series of separate kernels. Secondly, it makes it hard to utilize the GPU’s shared memory

resources. However, this thesis shows that good performance can be achieved by

processing the particles in a way that take advantage of the particles’ similar data

dependencies by utilizing the device’s cached memory resources.
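
One way to give neighboring particles similar data dependencies is to order them by the grid cell they fall into, so that particles close in space end up close in memory. The sketch below illustrates this idea on the host. It assumes a uniform cubic grid, and the names `cell_index` and `sort_by_cell` are hypothetical. It is not the kernel code of this thesis.

```python
import math


def cell_index(position, cell_size, grid_dim):
    """Map a 3D position to a linear index in a uniform grid.

    Assumes the position lies inside [0, grid_dim * cell_size) on every axis.
    """
    ix, iy, iz = (int(math.floor(c / cell_size)) for c in position)
    return ix + grid_dim * (iy + grid_dim * iz)


def sort_by_cell(positions, cell_size, grid_dim):
    """Return particle indices ordered by the grid cell they fall into."""
    return sorted(range(len(positions)),
                  key=lambda i: cell_index(positions[i], cell_size, grid_dim))


positions = [(0.9, 0.1, 0.1), (0.1, 0.1, 0.1), (0.8, 0.2, 0.1), (0.2, 0.1, 0.2)]
order = sort_by_cell(positions, cell_size=0.5, grid_dim=2)
print(order)  # particles sharing a cell become adjacent: [1, 3, 0, 2]
```

After such a reordering, threads that process consecutive particles mostly read the same neighboring cells, which is the situation in which a cache is most effective.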

The performance gain makes it practical to work with bigger sets of particles. This is of great importance when simulating liquids for the motion picture industry, which has very high demands on the level of detail. Fast processing is especially important when working with simulation tools where the result is hard to predict, since a simulation then often has to be re-executed several times before the desired result is obtained.

It is of course harder to measure the visual quality of the simulation, but the implemented algorithm clearly manages to mimic the most important fluid behaviors. The particles conserve the volume of the liquid by maintaining their distances from each other, and they are affected by their environment. The simulation produces visually appealing results with a high level of detail. The quality of the visual results is, in my opinion, on par with the results of commercial products such as Houdini's particle fluid simulation tool. It would, however, be desirable to provide better control over the direction of the simulation and to be able to simulate more advanced fluid behavior, such as elasticity and plasticity. This is discussed further in section 4.4.

4.4 FURTHER IMPROVEMENTS

There is room for improvement in the implemented algorithm, both in terms of optimization and of extended functionality.

4.4.1 PERFORMANCE IMPROVEMENTS

The biggest performance gain would probably come from performing the collision handling on the device. This would not only decrease the computational time spent on handling collisions but also drastically decrease the amount of data that has to be copied back and forth between the host and the device between two time steps. However, storing and handling large environments of collision objects would demand a lot of the device, and issues due to memory limitations would have to be solved. An alternative way to speed up the collision handling would be to multithread the host implementation. Even if the gain in performance would probably be smaller, there would be no problem handling or storing large environments.


There is also room for improvement through increased occupancy of the streaming multiprocessors. As discussed in section 3.1.2.2, the occupancy of the current implementation is 50%. If the register usage can be reduced further (without moving data from the registers to global memory), a performance gain is likely. The size of that gain is hard to predict, since it depends on how large the current delays caused by memory access are.
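
The relation between register usage and occupancy can be illustrated with a simplified calculation. The sketch below assumes that registers are the only limiting resource and ignores shared memory and allocation granularity. The hardware limits used as defaults (16384 registers, 32 warps and 8 blocks per multiprocessor) and the register counts in the example are illustrative assumptions, not measurements from the implementation.

```python
def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=16384, max_warps_per_sm=32,
              max_blocks_per_sm=8, warp_size=32):
    """Estimate occupancy when register usage is the only limiting resource.

    The default hardware limits are illustrative assumptions; shared memory
    and register allocation granularity are ignored.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    blocks_by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks_by_warps = max_warps_per_sm // warps_per_block
    blocks = min(blocks_by_regs, blocks_by_warps, max_blocks_per_sm)
    return blocks * warps_per_block / max_warps_per_sm


print(occupancy(regs_per_thread=32, threads_per_block=256))  # 0.5
print(occupancy(regs_per_thread=16, threads_per_block=256))  # 1.0
```

In this hypothetical configuration, halving the register usage per thread doubles the number of resident blocks and raises the occupancy from 50% to 100%. Whether that translates into shorter frame times depends on how much memory latency there is to hide.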

4.4.2 FUNCTIONALITY

The current implementation supports the parameters necessary to set up and simulate the basic behavior of a liquid. However, to make an SPH simulation tool really useful in the motion picture industry, it is important to provide ways to control the outcome of a simulation. It is important to support simulation of a wide spectrum of liquid types, but it is also useful to have tools that control the outcome in more explicit ways.

4.4.2.1 ELASTICITY AND PLASTICITY

A perfectly elastic fluid remembers its original rest shape and strives to regain it, while a perfectly plastic substance always considers its current state to be its rest state. Elastic behavior can be simulated by adding springs between pairs of neighboring particles, and plastic behavior can be modeled by changing these springs (Clavet, Beaudoin, & Poulin, 2005).
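
The spring idea can be sketched as follows. This is a deliberately simplified illustration with a Hookean spring and a linear relaxation of the rest length. It is not the exact formulation of Clavet et al., and the function names and the `plasticity` parameter are assumptions introduced here.

```python
import math


def spring_force(pos_a, pos_b, rest_length, stiffness):
    """Hookean force on particle a from a spring connecting a and b."""
    delta = [b - a for a, b in zip(pos_a, pos_b)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist == 0.0:
        return [0.0, 0.0, 0.0]
    # Positive when stretched (pulls a toward b), negative when compressed.
    magnitude = stiffness * (dist - rest_length)
    return [magnitude * d / dist for d in delta]


def relax_rest_length(rest_length, dist, plasticity, dt):
    """Move the rest length toward the current length (plastic flow).

    plasticity = 0 keeps the original rest shape (purely elastic); larger
    values make the spring forget its rest shape faster.
    """
    rate = min(plasticity * dt, 1.0)
    return rest_length + rate * (dist - rest_length)


force = spring_force((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), rest_length=1.0, stiffness=10.0)
new_rest = relax_rest_length(rest_length=1.0, dist=2.0, plasticity=0.5, dt=1.0)
print(force, new_rest)
```

With `plasticity` set to zero the rest length never changes, which corresponds to the perfectly elastic case. When `plasticity * dt` reaches one the rest length snaps to the current length, which corresponds to the perfectly plastic case.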

4.4.2.2 STICKINESS

Stickiness between particles and collision objects can be simulated by simply adding an

impulsive force, in the direction of the surface normal, to particles within a certain range

from a collision object’s surface.
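
A minimal sketch of such an impulse is given below. It assumes that the distance to the surface and the outward unit normal are already known, for example from a signed distance field, and that the impulse pulls the particle back toward the surface. The linear falloff and the parameter names are assumptions made for the example.

```python
def stickiness_impulse(distance, normal, max_distance, strength):
    """Velocity change pulling a particle toward a nearby surface.

    distance     -- distance from the particle to the surface (>= 0 outside)
    normal       -- unit surface normal, pointing away from the object
    max_distance -- range within which the surface attracts particles
    strength     -- impulse magnitude at the surface itself
    """
    if distance < 0.0 or distance >= max_distance:
        return (0.0, 0.0, 0.0)
    # Linear falloff is an assumption: full strength at the surface, zero
    # at the edge of the range.
    scale = -strength * (1.0 - distance / max_distance)
    return tuple(scale * n for n in normal)


print(stickiness_impulse(0.5, (0.0, 1.0, 0.0), max_distance=1.0, strength=2.0))
```

The returned vector would be added to the particle's velocity after the collision step, so that particles close to the surface are held back while particles outside the range are unaffected.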

4.4.2.3 SINKS

Sinks would give the user the possibility to remove certain particles from the simulation. They can be used to remove particles that are too far away or that, for other reasons, are not important for the outcome of the simulation.
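
A sink can be sketched as a predicate that decides which particles to drop. The example below is an illustration only. The helper names `remove_particles` and `too_far_from` are hypothetical, and particles are represented by their positions alone.

```python
def remove_particles(positions, should_remove):
    """Return the particles for which the sink predicate is false."""
    return [p for p in positions if not should_remove(p)]


def too_far_from(center, max_distance):
    """Build a sink predicate that removes particles beyond max_distance."""
    limit = max_distance * max_distance

    def predicate(position):
        dist2 = sum((c - o) ** 2 for c, o in zip(position, center))
        return dist2 > limit

    return predicate


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
kept = remove_particles(positions, too_far_from((0.0, 0.0, 0.0), 5.0))
print(kept)
```

Expressing the sink as a predicate keeps the removal step independent of the criterion, so the same function could serve a bounding volume, a particle age limit or any other user-defined rule.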

4.4.2.4 FORCE FIELDS

Force fields allow the user to add external forces to a particle set. The amount and

direction of the force to be added can be explicitly defined through a force field.
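
As an illustration, a force field can be modeled as a function from a position to a force vector that is applied to every particle in each time step. The sketch below assumes a simple explicit velocity update and a constant particle mass. The names `apply_force_field` and `uniform_wind` are hypothetical.

```python
def apply_force_field(positions, velocities, field, mass, dt):
    """Advance the velocities by the acceleration caused by an external field.

    field -- function mapping a position to a force vector
    """
    new_velocities = []
    for p, v in zip(positions, velocities):
        f = field(p)
        new_velocities.append(tuple(vi + dt * fi / mass for vi, fi in zip(v, f)))
    return new_velocities


def uniform_wind(position):
    """Example field: the same force along the x axis everywhere."""
    return (2.0, 0.0, 0.0)


velocities = apply_force_field([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)],
                               uniform_wind, mass=1.0, dt=0.5)
print(velocities)
```

Because the field is evaluated at each particle position, the same interface covers uniform fields such as wind as well as position-dependent fields such as vortices or attractors.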

4.4.2.5 FRICTION FIELD

The current implementation only supports the use of a constant friction coefficient. A better solution would be to read the friction coefficient from a scalar field coupled to each collision object. That way it would be possible to define different friction coefficients for different objects and even for different parts of an object.
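
A minimal sketch of such a lookup is shown below. It assumes that the scalar field is stored as a regular 3D grid and sampled with a nearest-cell lookup at the contact point, and that friction simply damps the tangential velocity. Both the storage layout and the damping model are assumptions made for the example.

```python
def sample_friction(field, origin, cell_size, point):
    """Nearest-cell lookup of a friction coefficient in a 3D scalar field.

    field -- nested lists indexed as field[z][y][x]
    """
    sizes = (len(field[0][0]), len(field[0]), len(field))
    idx = []
    for p, o, n in zip(point, origin, sizes):
        i = int((p - o) // cell_size)
        idx.append(min(max(i, 0), n - 1))  # clamp to the field bounds
    x, y, z = idx
    return field[z][y][x]


def apply_friction(tangential_velocity, friction):
    """Damp the tangential velocity; friction = 1 stops the particle."""
    return tuple((1.0 - friction) * v for v in tangential_velocity)


# A field with two cells along x: a slippery part and a rough part.
field = [[[0.1, 0.9]]]
mu = sample_friction(field, origin=(0.0, 0.0, 0.0), cell_size=1.0,
                     point=(1.5, 0.2, 0.3))
print(mu, apply_friction((2.0, 0.0, 0.0), mu))
```

A smoother result could be obtained with trilinear interpolation between cells, at the cost of eight lookups per contact instead of one.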

4.4.2.6 SUPPORT FOR UNLIMITED NUMBER OF PARTICLES

As discussed in section 3.1.2.4, the current implementation can only handle a limited number of particles, determined by the memory available on the device. A solution to this issue would be to split the workload into multiple kernels and execute them one after another.
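
The splitting itself can be sketched as a partition of the particle index range into batches that each fit on the device. The helper below is hypothetical and only shows the index arithmetic. A real implementation would also have to make the neighbors of every batch available to the kernel that processes it.

```python
def batches(num_particles, max_per_batch):
    """Yield (start, end) index ranges, each holding at most max_per_batch."""
    for start in range(0, num_particles, max_per_batch):
        yield start, min(start + max_per_batch, num_particles)


for start, end in batches(num_particles=10, max_per_batch=4):
    print(f"process particles {start}..{end - 1}")
```

If the particles are ordered spatially before they are split, each batch covers a compact region of the grid, which limits the amount of neighbor data that has to accompany it.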


CHAPTER 5

REFERENCES

Amada, T., Imura, M., Yasumuro, Y., Manabe, Y., & Chihara, K. (2004). Particle-based fluid simulation on GPU. ACM Workshop on General-Purpose Computing on Graphics Processors.

Bærentzen, J. A., & Aanæs, H. (2002). Computing discrete signed distance fields from

triangle meshes. Lyngby: Informatics and Mathematical Modelling, Technical University

of Denmark, DTU.

Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., & Skadron, K. (2008). A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68:1370-1380.

Clavet, S., Beaudoin, P., & Poulin, P. (2005). Particle-based Viscoelastic Fluid Simulation. Eurographics/ACM SIGGRAPH Symposium on Computer Animation, (pp. 219-228).

Courant, R., Friedrichs, K., & Lewy, H. (1928). Über die partiellen Differenzengleichungen

der mathematischen Physik. Mathematische Annalen , 32‐74.

Desbrun, M., & Cani, M. P. (1996). Smoothed particles: A new paradigm for animating highly deformable bodies. In Proceedings of EG Workshop on Computer Animation and Simulation, (pp. 61-76).

Kim, D., & Ko, H.-S. (2007). Eulerian Motion Blur. ACM SIGGRAPH / Eurographics Symposium on Computer Animation, (pp. 120-131).

Evans, C., & Kochanek, C. (1989). The tidal disruption of a star by a massive black hole.

Astrophysical Journal, Part 2 ‐ Letters (ISSN 0004‐637X) , 346:L13‐L16.

Fridlund, A. (2009). Voxel Processing for Visual Effects. Norrköping: Linköpings

Universitet.

Gingold, R., & Monaghan, J. (1977). Smoothed particle hydrodynamics: theory and application to non-spherical stars. Monthly Notices of the Royal Astronomical Society, 181:375-389.

Harada, T., Koshizuka, S., & Kawaguchi, Y. (2007). Smoothed Particle Hydrodynamics on GPUs. Computer Graphics International.

Kessenich, J., & Baldwin, R. (n.d.). OpenGL Shading Language. Retrieved March 1, 2009, from OpenGL: http://www.opengl.org/documentation/glsl/

Kolb, A., & Cuntz, N. (2005). Dynamic Particle Coupling for GPU-based Fluid Simulation. Proc. of 18th Symposium on Simulation Technique, (pp. 722-727).

Lucy, L. (1977). A Numerical Approach to the Testing of the Fission Hypothesis. Astronomical Journal, 82(12):1013-1024.

Microsoft. (n.d.). DirectX 10. Retrieved March 1, 2009, from http://www.microsoft.com/games/en-US/aboutgfw/Pages/directx10-a.aspx


Monaghan, J., & Humble, R. (1991). Arbitrary Incompressible Flow with SPH.

Monaghan, J., & Kos, A. (1999). Solitary waves on a Cretan Beach. J. Waterway, Port, Coastal, and Ocean Engrg, ASCE, 125(3).

Morris, J. (2000). Simulating surface tension with smoothed particle hydrodynamics.

International Journal for Numerical Methods in Fluids , 33(3):333‐353.

Müller, M., Keiser, R., Nealen, A., Pauly, M., Gross, M., & Alexa, M. (2003). Particle-based fluid simulation for interactive applications. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, (pp. 154-159).

NVIDIA Corporation. (n.d.). CUDA Zone. Retrieved June 1, 2009, from NVIDIA CUDA Programming Guide 2.0: http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf

Takeda, H., Miyama, S., & Sekiya, M. (1994). Numerical Simulation of Viscous Flow by

Smoothed Particle Hydrodynamics. Progress of Theoretical Physics , 92:939‐960.
