
  • Proceedings of the DCICTIA-MICCAI 2012 Workshop

    Data- and Compute-Intensive Clinical and Translational Imaging Applications

    Edited by Johan Montagnat

    http://www.i3s.unice.fr/~johan/DCICTIA-MICCAI

  • Preface

    The proliferation of and increased reliance on high-resolution, multimodality biomedical images and videos present significant challenges for a broad spectrum of clinical practitioners and investigators throughout the clinical and research communities. A growing spectrum of new and improved imaging modalities and new image processing techniques can significantly affect diagnostic and prognostic accuracy and facilitate progress in the areas of biomedical research and discovery. However, the impact of these new technologies in both time-critical clinical applications and high-throughput research pursuits depends, in large part, on the speed and reliability with which the imaging data can be visualized, analyzed and interpreted. Conventional serial computation is grossly inadequate and inefficient for managing these increasing amounts of data, and the employment of the new advances in medical imaging is often limited by insufficient compute and storage resources.

    High-performance computing (HPC) and distributed computing infrastructures (DCI), ranging from multi-core CPUs and GPU-based processing to parallel machines, Grids and Clouds, are effective mechanisms for overcoming such limitations. They allow for a significant reduction of computation time and the running of large experiment campaigns, and they speed up the development of new algorithms while increasing the availability of new methods for the research community and supporting large-scale multi-centric collaborations.

    The Data- and Compute-Intensive Clinical and Translational Imaging Applications Workshop (DCICTIA-MICCAI'12) was organized with the goals of understanding current trends in HPC/DCI medical imaging research and demonstrating some of the latest developments in the field. Ten papers were submitted to the workshop, out of which 7 were selected for presentation. They report original work related to the application of GPU-based, High-Performance or High-Throughput Computing techniques to medical image analysis, as well as distributed medical database management. In addition, a keynote talk addressing the challenges and future trends of GPU computing was delivered by Manuel Ujaldón, from the University of Malaga.

    The workshop chairs would like to thank all contributors who made the organization of this workshop possible, in particular the Program Committee members, the MICCAI organizers, the invited speakers and the authors.

    Nice, October 1st, 2012,

    Johan Montagnat, I3S laboratory, CNRS / U. Nice Sophia Antipolis, France
    Joel H. Saltz, Center for Comprehensive Informatics, Emory University, USA

  • Workshop chairs

    Johan Montagnat, CNRS / I3S, France
    Joel H. Saltz, Emory University, USA

    Program committee

    Christian Barillot, CNRS / IRISA, France
    Umit Catalyurek, The Ohio State University, USA
    Yann Cointepas, CEA, France
    Alan Evans, MNI / McGill University, Canada
    David J. Foran, Cancer Institute of New Jersey, USA
    Dagmar Krefting, University of Applied Sciences Berlin, Germany
    Tristan Glatard, CNRS / CREATIS, France
    Leiguang Gong, IBM Watson Research Center, USA
    Ron Kikinis, Harvard Medical School, USA
    Casimir Kulikowski, Rutgers University, USA
    Tahsin Kurc, Emory University, USA
    Johan Montagnat, CNRS / I3S, France
    Toshiharu Nakai, National Center for Geriatrics and Gerontology, Japan
    Silvia D. Olabarriaga, AMC, Netherlands
    Joel H. Saltz, Emory University, USA
    Tony Solomonides, CCRI, NorthShore University HealthSystem, USA
    George Teodoro, Emory University, USA
    Manuel Ujaldón, University of Malaga, Spain
    Lin Yang, University of Kentucky, USA

  • Table of content

    The GPU on High Performance Biomedical Computing (page 1)
      Manuel Ujaldón (invited speaker, University of Malaga)

    Accelerating MI-based B-spline Registration using CUDA Enabled GPUs (page 11)
      James Shackleford, Nagarajan Kandasamy and Gregory Sharp

    A Fast Parallel Implementation of Queue-based Morphological Reconstruction using GPUs (page 21)
      George Teodoro, Tony Pan, Tahsin Kurc, Lee Cooper, Jun Kong and Joel Saltz

    Optimization and Parallelization of the Matched Masked Bone Elimination Method for CTA (page 31)
      Henk Marquering, Jeroen Engelbers, Paul Groot, Charles Majoie, Ludo Beenen, Antoine van Kampen and Silvia Olabarriaga

    Semantic Federation of Distributed Neurodata (page 41)
      Gaignard Alban, Johan Montagnat, Catherine Faron Zucker and Olivier Corby

    Content-based Image Retrieval on Imaged Peripheral Blood Smear Specimens using High Performance Computation (page 51)
      Xin Qi, Fuyong Xing, Meghana Ghadge, Ivan Rodero, Moustafa Abdelbaky, Manish Parashar, Evita Sadimin, David Foran and Lin Yang

    Content-based Parallel Sub-image Retrieval (page 61)
      Fuyong Xing, Xin Qi, David Foran, Tahsin Kurc, Joel Saltz and Lin Yang

    A MapReduce Approach for Ridge Regression in Neuroimaging-Genetic Studies (page 71)
      Benoit Da Mota, Michael Eickenberg, Soizic Laguitton, Vincent Frouin, Gael Varoquaux, Jean-Baptiste Poline and Bertrand Thirion

  • The GPU on High Performance Biomedical Computing

    Manuel Ujaldón

    CUDA Fellow at Nvidia and A/Prof at the Computer Architecture Department,

    University of Malaga (Spain)

    Abstract. GPUs (Graphics Processing Units) nowadays constitute a solid alternative for high-performance computing [1]. An increasing number of today's supercomputers include commodity GPUs to bring us unprecedented levels of performance in terms of GFLOPS/cost. In this paper, we provide an introduction to the CUDA programming paradigm, which can exploit SIMD parallelism and high memory bandwidth on GPUs. Less demanding alternatives from a biomedical perspective, like OpenCL and OpenACC, are also briefly described.

    Keywords: GPU, High Performance Computing, CUDA, OpenCL, OpenACC.

    1 Introduction

    We are witnessing a steady transition from multi-core to many-core systems. Many-core GPUs offer excellent performance/cost ratios for scientific applications which fit their architectural idiosyncrasies, and an increasing number of developers are learning to program these processors to take full advantage of their resources via data parallelism.

    In this paper, we summarize the ideas required to program GPUs with CUDA (Compute Unified Device Architecture) [2], Nvidia's programming solution for many-core architectures. We also introduce OpenCL [3], a standardization effort to provide an open standard API, and OpenACC [4], a higher-level API using directives for many-core architectures.

    1.1 GPU advantages

    GPUs gather a number of attractive features as high-performance platforms. At the user level, we may highlight two issues: (1) popularity and low cost, thanks to being manufactured as commodity processors, which means wide availability at an affordable budget, and (2) easier-than-ever programming, evolving from the old challenging architecture into a new straightforward platform endowed with an increasing number of libraries, legacy codes, debuggers, profilers, cross-compilers and back-ends.


    At the hardware level, we remark two other important features: (1) raw computational power, where GPUs are an order of magnitude ahead (more than a TFLOPS versus 100-150 GFLOPS for a CPU of similar price), and (2) memory bandwidth, where GDDR5 video memory works at up to 6 GHz on a data path 256 bits wide to deliver up to 192 GB/s (a typical CPU operates on a dual-channel motherboard to deliver 128 bits at 1600 MHz using the popular DDR3 memory, which results in 25.6 GB/s of data bandwidth).

    1.2 Code programming

    Multicore systems can be programmed using task-level parallelism, either relying on the scheduler of the operating system or programming explicitly via pthreads. But when we enter many-core territory, a different programming method based on data parallelism has to be embraced to enhance scalability. The SIMD paradigm (Single Instruction Multiple Data) re-emerges in this context as a viable solution with a simple idea: we run the same program on all cores, but each thread instantiates on a different data subset, so that each core works effectively in parallel. This way, we face a key issue for massive parallelism: data partitioning. Straightforward decompositions are usually derived from 1D, 2D and 3D matrices as input data sets, especially with regular access patterns. Particularly interesting from our biomedical perspective is the area of high-resolution image processing, where massive parallelism benefits from regular data partitioning. At the opposite end, the goal becomes more difficult in the presence of irregular elements, such as data addressed through pointers or arrays accessed through indices unknown at compile time (for example, indices stored in another array).
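
    To make the data-partitioning idea concrete, the following minimal CUDA sketch (hypothetical kernel and variable names, not taken from the paper) assigns one thread to each pixel of a 2D image; the regular, row-major access pattern is exactly the kind of straightforward decomposition described above.

      __global__ void threshold2D(const float *in, float *out,
                                  int width, int height, float level)
      {
          // Each thread runs the same code on a different pixel (data parallelism).
          int x = blockIdx.x * blockDim.x + threadIdx.x;
          int y = blockIdx.y * blockDim.y + threadIdx.y;
          if (x < width && y < height) {
              int idx = y * width + x;                     // regular access pattern
              out[idx] = (in[idx] > level) ? 1.0f : 0.0f;  // per-pixel threshold
          }
      }

    A 2D grid of 16 x 16 thread blocks covering the image would launch this kernel; irregular, pointer-chasing access patterns would not map this cleanly.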

    2 GPUs for general-purpose computing

    In general, expectations for a particular algorithm to reach growing speedup factors when running on GPUs depend on a number of requirements being fulfilled. From less to more important, we may cite:

    1. Small local data requirements (memory and registers): programmer skills to decompose computations into tiles.

    2. Stream computing, or the ability to avoid recursive algorithms and control and data dependencies.

    3. Arithmetic intensity, which is usually coupled with high data reuse.
    4. Memory bandwidth, that is, fast movement of high data volumes.
    5. Data parallelism; in short, data independence and balanced partitioning.

    Fortunately, a generic biomedical application is more likely to satisfy the criteria listed at the end, which makes such applications good candidates for GPU acceleration. During the rest of the paper, we illustrate in Section 3 the basic concepts needed to implement codes on GPUs using CUDA. Section 4 describes some optimizations for thread deployment and memory usage, and finally Section 5 introduces some alternatives that have emerged lately for programming GPUs in high-performance computing.


    Fig. 1. CUDA highlights on the architectural and software sides: (a) hardware resources; (b) programming model.

    3 CUDA

    CUDA [2] is a programming interface and set of supported hardware to enable general-purpose computation on Nvidia GPUs. The initiative started in late 2006 and, in five years, according to Nvidia and the NSF, it has recruited more than 100,000 developers, 1,000,000 compiler downloads, and 15,000,000 developers across very diverse problem domains worldwide, with more than 500 research papers published annually and more than 500 institutions including CUDA in their academic programs.

    The CUDA programming interface is ANSI C extended by several keywords and constructs which derive into a set of C language library functions; a specific compiler generates the executable code for the GPU in conjunction with the counterpart version running on the CPU, which acts as the host.

    Since CUDA is particularly designed for generic computing, it can leverage special hardware features not visible to more traditional graphics-based GPU programming, such as small cache memories, explicit massive parallelism and lightweight context switching between threads.

    3.1 Hardware platforms

    All the latest Nvidia developments in graphics hardware are CUDA-enabled processors: for low-end users and gamers, we have the GeForce series starting from its 8th generation; for high-end users and professionals, the Quadro series; and for general-purpose computing, the Tesla boards. Overall, it is estimated that more than three hundred and fifty million CUDA-enabled GPUs existed at the end of July 2012.


  • 3.2 Execution model

    CUDA architectures are organized into multiprocessors, each having a number of cores (see Figure 1.a). As technology evolves, future architectures from Nvidia will support the same CUDA executables, but they will run faster because they include more multiprocessors per die, and/or more cores, registers or shared memory per multiprocessor (see Table 1). For example, the GF 100 (Fermi) architecture doubles the number of cores and increases the clock frequency with respect to the previous generation, the GT200; and that one almost doubles the number of multiprocessors and 32-bit registers within each multiprocessor compared to the original G80.

    The CUDA parallel architecture is a SIMD processor endowed with hundreds of cores which are organized into multiprocessors. Each multiprocessor can run a variable number of threads, and the local resources are divided among them. In any given cycle, each core in a multiprocessor executes the same instruction on different data based on its threadId, and communication between multiprocessors is performed through global memory.

    The CUDA programming model guides the programmer to expose fine-grained parallelism as required by massively multi-threaded GPUs, while at the same time providing scalability across the broad spectrum of physical parallelism available in the range of GPU devices.

    Another key design goal of CUDA is to guarantee a smooth learning curve for most programmers. To do so, it extends a popular sequential programming language like C/C++ with a minimalist set of abstractions for expressing parallelism. This lets the programmer focus on the important issues of parallelism rather than grappling with the mechanics of an unfamiliar and cumbersome language.

    3.3 Memory spaces

    The CPU host and the GPU device maintain their own DRAM and address space, referred to as host memory and device memory. The latter can be of three different types. From inner to outer, we have constant memory, texture memory and global memory. They all can be read from or written to by the host and are persistent through the life of the application. Texture memory is the most versatile one, offering different addressing modes as well as data filtering for some specific data formats. Global memory is the actual on-board video memory, usually exceeding 1 GB of capacity and embracing GDDR3/GDDR5 technology. Constant memory has a standard size of 64 KB and a latency close to that of the register set. Texture memory is cached to a few kilobytes. Global and constant memory are not cached at all.

    Multiprocessors have on-chip memory that can be of two types: registers and shared memory (see Figure 1.b). Each processor has its own set of local 32-bit read-write registers, whereas a parallel data cache of shared memory is shared by all the processors within the same multiprocessor. The local and global memory spaces are implemented as read-write regions of device memory and were not cached until the Fermi architecture arrived.
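
    As a hedged illustration of these memory spaces (names and sizes are illustrative, not from the paper), the host-side sketch below allocates global memory and initializes constant memory; shared memory and registers are declared inside kernels.

      #include <cuda_runtime.h>

      __constant__ float c_coeffs[64];   // constant memory: small, host-written, kernel-readable

      void setupDeviceMemory(const float *h_image, const float *h_coeffs, size_t nPixels)
      {
          float *d_image = NULL;                                   // global (on-board) memory
          cudaMalloc((void **)&d_image, nPixels * sizeof(float));  // persistent across kernel calls
          cudaMemcpy(d_image, h_image, nPixels * sizeof(float), cudaMemcpyHostToDevice);
          cudaMemcpyToSymbol(c_coeffs, h_coeffs, 64 * sizeof(float));
          // ... launch kernels that read d_image and c_coeffs ...
          cudaFree(d_image);
      }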


  • 3.4 Programming Elements

    Major elements involved in the conception of a CUDA program which are key for understanding the programming model are shown in Figure 1.b. We summarize them below; a short kernel sketch follows the list.

    – A program is decomposed into blocks running in parallel. Assembled by the developer, a block is a group of threads that is mapped to a single multiprocessor, where they can share 16 KB or 48 KB of SRAM memory. All the threads of the blocks concurrently assigned to a single multiprocessor divide the multiprocessor's resources equally amongst themselves. The data are also divided amongst all of the threads in SIMD fashion, explicitly managed by the developer.

    – A warp is a collection of 32 threads that can physically run concurrently on all of the multiprocessors. The developer has the freedom to determine the number of threads to be executed, and when there are more threads than the warp size, they are time-shared on the actual hardware resources. This can be advantageous, since time-sharing the ALU resources amongst multiple threads can overlap the memory latencies when fetching ALU operands.

    – A kernel is a code function compiled to the instruction set of the device, downloaded onto it and executed by all of its threads. Threads run on different processors of the multiprocessors, sharing the same executable and global address space, though they may not follow exactly the same path of execution, since conditional execution of different operations on each multiprocessor can be achieved based on a unique thread ID. Threads also work independently on different data according to the SIMD model. A kernel is organized into a grid as a set of thread blocks.

    – A grid is a collection of all the blocks in a single execution, explicitly defined by the application developer, that is assigned to a multiprocessor. The parameters invoking a kernel function call define the sizes and dimensions of the thread blocks in the grid thus generated, and the way the hardware groups threads into warps affects performance, so it must be accounted for.

    – A thread block is a batch of threads executed on a single multiprocessor. They can cooperate by efficiently sharing data through the block's shared memory and can synchronize their execution to coordinate memory accesses using the __syncthreads() primitive. Synchronization across thread blocks can only be safely accomplished by terminating a kernel. Each thread has its own thread ID, which is the number of the thread within a 1D, 2D or 3D array of arbitrary size. The use of multidimensional identifiers helps to simplify memory addressing when processing multidimensional data. Threads placed in different blocks of the same grid cannot communicate, and threads belonging to the same block must all share the registers and shared memory of a given multiprocessor. This tradeoff between parallelism and thread resources must be wisely solved by the programmer to maximize execution efficiency on a given architecture and its limitations. These limitations are listed in the lower rows of Table 1 for the past three hardware generations and CUDA compute capabilities.
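
    The following minimal kernel (hypothetical names; a sketch rather than code from the paper) shows the elements listed above working together: a grid of thread blocks, per-block shared memory, unique thread identifiers, and __syncthreads() for intra-block cooperation.

      // Reverses the elements of each block-sized tile in place.
      // Assumes it is launched with blockDim.x <= 256.
      __global__ void reverseWithinBlocks(float *data, int n)
      {
          __shared__ float s_buf[256];                    // shared by the threads of one block
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index within the grid
          if (i < n)
              s_buf[threadIdx.x] = data[i];
          __syncthreads();                                // wait until the whole tile is staged
          int j = blockDim.x - 1 - threadIdx.x;           // element written by a peer thread
          if (i < n && blockIdx.x * blockDim.x + j < n)
              data[i] = s_buf[j];
      }

    Blocks never exchange data here; any result that spans blocks would have to go through global memory or a second kernel, as explained above.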


  • Table 1. The evolution of CUDA hardware and programming constraints over the past three generations. We take as the flagship of each generation what we consider the most popular GPU, together with a representative video memory for graphics cards at that time (around $400 cost at launch date). We highlight in bold font those values which are around an order of magnitude ahead of CPUs of similar cost, namely peak performance and memory bandwidth.

    GPU architecture for each hardware generation        G80        GT 200     GF 100
    Representative commercial model (GeForce)             8800 GTX   GTX 280    GTX 480
    Year released on the marketplace                      2006       2008       2010
    Millions of transistors within the silicon die        681        1400       3000
    GPU frequency (MHz)                                   575        602        700

    Hardware resources and raw processing power:
      Number of multiprocessors (SMs)                     16         30         15
      SIMD cores on each multiprocessor                   8          8          32
      Total number of SIMD cores                          128        240        480
      Clock frequency for the SIMD cores (GHz)            1.35       1.30       1.40
      Peak performance in GFLOPS (single precision)       345        622        1344
      Peak performance in GFLOPS (double precision)       none       78         672

    SRAM memory hierarchy:
      Shared memory for each multiprocessor               16 KB      16 KB      16/48 KB
      L1 cache for each multiprocessor                    none       none       48/16 KB
      L2 cache for the entire GPU                         none       none       768 KB

    DRAM memory subsystem:
      Global memory size                                  512 MB     1 GB       1.5 GB
      Global memory technology                            GDDR3      GDDR5      GDDR5
      Frequency for the memory clock (MHz)                2 x 900    2 x 1107   4 x 924
      Bus memory width (bits)                             384        512        384
      Memory bandwidth (GB/s)                             86.4       141.7      177.4

    CUDA compute capabilities                             1.0, 1.1   1.2, 1.3   2.0
    Threads per warp (the warp size)                      32         32         32

    Programming constraints:
      Maximum number of thread blocks per multiprocessor  8          8          8
      Maximum number of threads per block                 512        512        1024
      Maximum number of threads per multiprocessor        768        1024       1536
      32-bit registers for each multiprocessor            8192       16384      4096

    At the highest level, a program is decomposed into kernels mapped to the hardware by a grid composed of blocks of threads scheduled in warps. No inter-block communication or specific schedule-ordering mechanism for blocks or threads is provided, which guarantees that each thread block can run on any multiprocessor, even from different devices, at any time.

    The number of threads in a thread block is limited to 512 on the G80 and GT200 generations, and to 1024 on the GF 100 (Fermi). Therefore, blocks of equal dimension and size that execute the same kernel can be batched together into a grid of thread blocks. This comes at the expense of reduced thread cooperation, but in return this model allows thread blocks of the same kernel grid to run on any multiprocessor, even from different devices, at any time. Again, each block is identified by its block ID, which is the number of the block within a 1D or 2D array of arbitrary size, for the sake of simpler memory addressing.

    Kernel threads are extremely lightweight, i.e. creation overhead is negligible and context switching between threads and/or kernels is essentially free.

    3.5 Programming style

    CUDA tries to simplify the programming model by hiding thread handling from programmers, i.e. there is no need to write explicit threaded code in the conventional sense. Instead, CUDA includes C/C++ software development tools that allow programmers to combine host code for the CPU with device code for the GPU [7]. To do so, CUDA programming requires a single program written in C/C++ with some extensions (a minimal example follows the list) [8]:

    – Function type qualifiers to specify whether a function executes on the host or on the device and whether it is callable from the host or from the device (__device__, __global__ and __host__).

    – Variable type qualifiers to specify the memory location on the device of a variable (__device__, __constant__ and __shared__).

    – Four built-in variables that specify the grid and block dimensions, the block index within the grid and the thread index within the block (gridDim, blockDim, blockIdx and threadIdx), accessible in __global__ and __device__ functions.

    – An execution configuration construct to specify the dimensions of the grid and blocks when launching a kernel, declared with the __global__ qualifier, from host code (for example, function<<<GridDim, BlockDim>>>(parameter list)).
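
    Putting the extensions together, here is a minimal, self-contained host/device example (hypothetical kernel; a sketch, not code from the paper) using the qualifiers, the built-in variables and the execution configuration construct.

      #include <cuda_runtime.h>

      // __global__: runs on the device, callable from the host.
      __global__ void scale(float *v, int n, float s)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables
          if (i < n) v[i] *= s;
      }

      int main(void)
      {
          const int n = 1 << 20;
          float *d_v = NULL;
          cudaMalloc((void **)&d_v, n * sizeof(float));
          cudaMemset(d_v, 0, n * sizeof(float));

          dim3 block(256);                           // dimensions of each thread block
          dim3 grid((n + block.x - 1) / block.x);    // dimensions of the grid
          scale<<<grid, block>>>(d_v, n, 2.0f);      // execution configuration construct

          cudaDeviceSynchronize();
          cudaFree(d_v);
          return 0;
      }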

    4 Optimizations

    Managing the limits introduced by the major hardware and software constraints is critical when optimizing applications. Programmers still have a great degree of freedom, though side effects may occur when deploying strategies to avoid one limit, causing other limits to be hit.

    We consider two basic pillars when optimizing a CUDA code: first, organize threads in blocks to maximize parallelism, enhance hardware occupancy and avoid memory bank conflicts; second, access shared memory wisely to maximize arithmetic intensity and reduce global memory usage. We now address each of these issues separately.

    4.1 Threads deployment

    Each multiprocessor contains a number of registers which are split evenly among all the threads of the blocks assigned to that multiprocessor. Hence, the number of registers needed in the computation affects the number of threads which can be executed simultaneously, and given the constraints outlined in Table 1, the management of registers becomes important as a limiting factor for the amount of parallelism we can exploit.

    The CUDA documentation suggests a block contain between 128 and 256 threads to maximize execution efficiency. A tool developed by Nvidia, the CUDA Occupancy Calculator [9], may also be used as guidance to attain this goal. For example, when a kernel instance consumes 16 registers, only 512 threads can be assigned to a single multiprocessor on a G80 GPU. This can be achieved by using one block with 512 threads, two blocks of 256 threads, and so on.

    An iterative process is usually followed to achieve the lowest execution time: first, the initial implementation is compiled using the CUDA compiler and a special -cubin flag that outputs the hardware resources (memory and registers) consumed by the kernel on a given hardware generation. Using these values in conjunction with the CUDA Occupancy Calculator, we can analytically determine the number of threads and blocks needed to use a multiprocessor with maximum efficiency. When sufficient efficiency is not achieved, the code has to be revised to reduce the register requirements.
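
    As a hypothetical illustration of the reasoning (the numbers follow the G80 example above; the kernel is not from the paper): with 16 registers per thread, as reported by the compiler's resource summary, a G80 multiprocessor with 8192 registers can keep 8192 / 16 = 512 threads resident, so a 256-thread block size keeps two blocks per multiprocessor.

      __global__ void processVoxels(float *v, int n)   // assume this uses 16 registers per thread
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) v[i] = 0.5f * v[i] + 1.0f;
      }

      void launchProcessVoxels(float *d_voxels, int numVoxels)
      {
          int threadsPerBlock = 256;                    // within the 128-256 guideline
          int blocks = (numVoxels + threadsPerBlock - 1) / threadsPerBlock;
          processVoxels<<<blocks, threadsPerBlock>>>(d_voxels, numVoxels);
      }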

    4.2 Memory usage

    Even though video memory delivers a magnificent bandwidth, it is still frequently the bottleneck when running an application, because of its poor latency (around 400 times slower than shared memory) and the high floating-point performance of the GPU.

    Attention must be paid to how threads access banks of shared memory, since each bank only supports one memory access at a time; simultaneous accesses to the same bank are serialized, stalling the rest of the multiprocessor's running threads until their operands arrive. The use of shared memory is explicit within a thread, which allows the developer to solve bank conflicts wisely. Although such optimization may represent a daunting effort, it can sometimes be very rewarding: execution times may decrease by as much as 10x for vector operations, and latency hiding may increase by up to 2.5x [10].
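
    A classic instance of this advice (a sketch under the assumption of a square width x width image whose side is a multiple of 16, on 16-bank hardware) is padding a shared-memory tile by one column so that column-wise reads map to distinct banks:

      #define TILE 16

      __global__ void transposeTile(const float *in, float *out, int width)
      {
          __shared__ float tile[TILE][TILE + 1];       // +1 column avoids same-bank collisions
          int x = blockIdx.x * TILE + threadIdx.x;
          int y = blockIdx.y * TILE + threadIdx.y;
          tile[threadIdx.y][threadIdx.x] = in[y * width + x];
          __syncthreads();
          int tx = blockIdx.y * TILE + threadIdx.x;    // transposed block origin
          int ty = blockIdx.x * TILE + threadIdx.y;
          out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free
      }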

    5 Programming alternatives

    5.1 OpenCL

    Almost 1.5 years after the CUDA SDK became public, Apple initiated a new standardization effort, and in December 2008 the OpenCL 1.0 specification [11] was released by the Khronos Group, an industry consortium focused on the creation of open-standard, royalty-free APIs. The initiative is supported by all major GPU and CPU vendors, including Nvidia, AMD/ATI and Intel.

    OpenCL is based on C99, a dialect of C that was adopted as an ANSI standard in May 2000, but it is a low-level standard when compared with other alternatives, primarily targeted at expert developers with few productivity-enhancing abstractions. OpenCL aims to provide a programming environment for those developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs. This way, OpenCL should fuel growth and advancement in the utilization of multicore and many-core technologies by addressing two critical problems: programming challenges and protection of the software investment as hardware technology evolves rapidly.

    OpenCL supports both data and task parallelism. Even though the latter does not fit well on GPUs, it is applicable to other target architectures like multi-cores. We may see OpenCL as a superset of CUDA, so programmers may decide to start learning CUDA as an intermediate step towards OpenCL, or even use Swan [12], a source-to-source translator from CUDA to OpenCL.

    OpenCL will enable developers to parallelize an application on a multicore CPU, a manycore GPU, or both, amplifying the possibilities for high-performance computation on commodity hardware. Given that each generation of GPU evolution adds flexibility to previous high-throughput GPU designs, software developers in many fields are likely to take interest in the extent to which CPU/GPU architectures and programming systems ultimately converge.

    5.2 OpenACC

    OpenACC is an open parallel programming standard aimed at average programmers and originally developed by PGI, Cray, CAPS and Nvidia. Initial support by multiple vendors will help accelerate its rate of adoption, providing a stable and portable platform for developers to target.

    OpenACC uses directives as its core technology to enable GPU acceleration. These directives allow programmers to provide simple hints to the compiler, identifying parallel regions of the code to accelerate without requiring the programmer to modify or adapt the underlying code. The compiler automatically parallelizes loops in the application, which gives access to the massively parallel processing power of the GPU while making programming straightforward. On platforms that do not support OpenACC directives, the compiler will simply ignore them, ensuring that the code remains portable across many-core GPUs and multi-core CPUs.
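
    A hedged sketch of the directive-based style (loop and names are illustrative): the pragma asks the compiler to offload the loop, and compilers without OpenACC support simply ignore it, so the same C source still builds for multi-core CPUs.

      void scaleImage(float *img, int n, float s)
      {
          // Offload the loop to the accelerator; copy img to the device and back.
          #pragma acc parallel loop copy(img[0:n])
          for (int i = 0; i < n; ++i)
              img[i] *= s;
      }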

    5.3 Compilers and tools

    A CUDA source code can also be compiled for target platforms other than Nvidia GPUs (namely, AMD GPUs or x86-64 CPUs) using cross-compilers and compilation environments like the PGI CUDA compiler, Ocelot and MCUDA. CUDA also has third-party wrappers available for Python, Perl, Java, Fortran, Ruby, Lua, Haskell, MATLAB and IDL.

    It is also worth mentioning DirectCompute, Microsoft's approach to GPU programming. The latest effort on CUDA interoperability is Libra, released in April 2012 by the company GPU Systems as a platform and SDK to uniformly accelerate software applications across the major operating systems (Windows, Mac and Linux). This is done via the standard programming environments C/C++, Java, C# and Matlab, to execute math operations on top of GPUs and x86-64 CPUs, with broad support for compute devices compatible with CUDA, OpenCL and even OpenGL.


  • 6 Concluding remarks

    We have described the CUDA programming model and hardware interface as a compelling alternative for assisting non-computer scientists in adapting high-performance biomedical applications to GPUs.

    GPUs provide computational power, great scalability and low cost, and may be combined in a cluster of GPUs to enhance parallelism and provide even faster responses. Alternatively, we may think of a CPU-GPU hybrid system where an application is decomposed into two parts to take advantage of the benefits of this bi-processor platform. Recent movements in the hardware industry confirm a tendency towards stronger CPU-GPU coupling: not only have GPUs moved closer to CPUs in terms of functionality and programmability, but CPUs have also acquired functionality similar to that of GPUs. Two good examples are the Fusion project led by AMD, which integrates a CPU and a GPU on a single chip, and Knights Corner by Intel, which develops another hybrid platform between multicore x86 and a GPU.

    The GPU will continue to adapt to the usage patterns of both graphics and general-purpose programmers, with a focus on additional processor cores, number of threads and memory bandwidth available for applications, to become a solid alternative for high-performance computing. As challenging and computationally demanding areas, biomedical applications constitute one of the most exciting fields to take the opportunity offered by CUDA, OpenCL and OpenACC, and those programming models must evolve to include the programming of heterogeneous many-core systems comprising both CPUs and GPUs.

    References

    1. GPGPU, "General-Purpose Computation on GPUs", http://www.gpgpu.org.
    2. CUDA Home Page, http://developer.nvidia.com/object/cuda.html.
    3. The Khronos Group, "The OpenCL Core API Specification, Headers and Documentation", http://www.khronos.org/registry/cl.
    4. CAPS, Cray, Nvidia and PGI, "OpenACC: Directives for Accelerators", http://openacc.org.
    5. Nvidia, "CUDA showcase", http://www.nvidia.com/object/cuda-apps-flash-new.html.
    6. The PhysX Co-processor, http://www.nvidia.com/object/nvidia_physx.html.
    7. T. R. Halfhill, "Parallel Processing with CUDA", Microprocessor Report Online, January 2008.
    8. "NVIDIA CUDA C Programming Guide, version 4.0", June 2011.
    9. "The CUDA Occupancy Calculator", http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls, 2009.
    10. M. Fatica, D. Luebke, I. Buck, D. Owens, M. Harris, J. Stone, C. Phillips, and B. Deschizeaux, "CUDA tutorial at Supercomputing'07", November 2007.
    11. The Khronos Group, "OpenCL: The open standard for parallel programming of heterogeneous systems", http://www.khronos.org/opencl/, November 2011, v. 1.2.
    12. G. De Fabritiis et al., "Swan: Converting CUDA codebases to OpenCL", http://www.multiscalelab.org/swan.


  • Accelerating MI-based B-spline Registration using CUDA Enabled GPUs

    James A. Shackleford1, Nagarajan Kandasamy2 and Gregory C. Sharp1

    1 Department of Radiation Oncology, Massachusetts General Hospital, Boston, MA 02114, USA

    2 Electrical and Computer Engineering Department, Drexel University, Philadelphia, PA 19104, USA

    Abstract. This paper develops methods for accelerating mutual information based B-spline deformable registration suitable for use on multi-core processors. The registration process is iterative and consists of the following major steps: computing a deformation field from B-spline coefficients, computing the mutual information similarity metric, and computing the change in the mutual information with respect to the B-spline coefficients for the purpose of quasi-Newtonian optimization. We develop parallel algorithms to realize these steps, specifically tailored to multi-core processors including graphics processing units (GPUs). Experimental results using clinical and synthetic data indicate that our GPU-based implementations achieve, on average, a speedup of 21 times with respect to the reference implementation and 7.5 times with respect to a quad-core CPU implementation.

    Keywords: B-spline registration, deformable registration, multi-modal registration, mutual information, multi-core, graphics processing units.

    1 Introduction

    When registering multi-modal images, mutual information (MI) is a useful similarity metric for quantifying the quality of a spatial transform T mapping a reference fixed image F to a destination moving image M by quantifying the amount of information content the two images share in common under the influence of the transform. An optimal registration that accurately maps the voxels of F into M may be obtained by iterative optimization of T such that the shared information content is maximized.

    However, using MI as a similarity metric introduces a significant additional computational burden to the overall registration process. Therefore, we focus on developing parallel algorithms for B-spline based multi-modal registration using MI that are tailored to multi-core processors including graphics processing units (GPUs). Specifically, our efforts have focused on developing a comprehensive solution that leverages parallelization for the most computationally rigorous aspects of the registration process, namely the generation of the deformation field T, the generation of the intensity histograms needed to compute the MI, and the calculation of the change in the MI with respect to the B-spline coefficients needed for optimization.

    Experimental results on clinical and synthetic images are presented with focus on both execution speed and registration quality for the following implementations: a highly optimized single-threaded CPU implementation, a multi-core CPU implementation using OpenMP, and a GPU implementation using the compute unified device architecture (CUDA) interface. For large images, our GPU-based implementations achieve, on average, a speedup of 21 times with respect to the reference implementation and 7.5 times with respect to the OpenMP implementation. These CPU and GPU implementations are freely available online as part of the Plastimatch image registration software and can be downloaded under a BSD-style open source license from www.plastimatch.org.

    2 Related Work

    Image registration algorithms can be accelerated via the judicious use of parallel processing; in many cases, operations can be performed independently on different portions of the image. For example, the authors of [4, 7] develop parallel deformable registration algorithms on a computer cluster using MI as the similarity metric. Results reported in [7] for B-splines show a speedup of n/2 for n processors compared to a sequential implementation; two 512×512×459 images are registered in 12 minutes using a cluster of 10 computers, each with a 3.4 GHz CPU, compared to 50 minutes for a sequential program.

    Shams et al. consider the rigid registration of multi-modality images using mutual information and focus on accelerating the histogram generation step on the GPU [6]. The authors apply a concept termed "sort and count" to compute histogram bins in a collision-free manner, thereby eliminating the need for atomic operations. The basic idea uses a parallel version of bitonic sort to order the array of input elements while simultaneously counting the number of occurrences of each unique element in the sorted set. However, the technique is designed for unsigned integers and is therefore incompatible with the partial-volume interpolation required for the effective optimization of the denser parameterization afforded by B-splines. Additionally, the Nvidia SDK provides histogram generation methods specific to indexed images with unsigned integer valued pixels, making the method ill-suited for direct application to floating-point image volumes and partial-volume interpolation.

    Since the method presented in this paper poses the registration process as an analytic optimization problem, the change in the MI similarity metric with respect to the B-spline parameterization must be computed. This cost-function gradient can be calculated in parallel via the technique developed by Shackleford et al. [5] for accelerating unimodal B-spline registration using the sum of squared differences (SSD). To this end, we provide an extension to this method which allows it to be applied to B-spline registration using MI as the similarity metric.



    Fig. 1. (a) Example of partial volume interpolation performed on a 2D image, showing (b) the associated histogram update rule.

    3 Theory

    Given a three-dimensional fixed image F with voxel coordinates θ = (x1, y1, z1) and voxel intensity f = F(θ), and a moving image M with voxel coordinates φ = (x2, y2, z2) and voxel intensity m = M(φ) representing the same underlying anatomy as F within the image overlap domain Ω, a transform T(φ) = θ + v is sought that best registers M within the coordinate system of F. Because F and M are obtained using differing imaging modalities, an optimal T will not minimize the error between f and m as in unimodal registration. Instead, we attempt to maximize the amount of information common to F and M using mutual information (MI) as a similarity metric:

    I = \sum_{f,m} p_j(f, T(m)) \ln \frac{p_j(f, T(m))}{p(f)\, p(T(m))} \qquad (1)

    which requires that we now view f and m as random variables with associated probability distribution functions p(f) and p(m) and joint probability p_j(f, m). Note that applying T(φ) to M may result in voxels being displaced outside Ω, thus altering p(m). Additionally, if T maps the point θ to a location between points in the voxel grid of M, some form of interpolation must be employed to obtain m, which will modify p(m) in addition to p_j(f, m). These effects are implied within the notation p(T(m)) and p_j(f, T(m)).
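
    As a hedged host-side sketch (not the paper's code) of how Eq. (1) is evaluated once the distributions have been estimated from normalized histograms:

      #include <math.h>

      /* p_f and p_m hold the marginal distributions (nbins entries each);
         p_j holds the joint distribution in row-major order (nbins * nbins). */
      double mutualInformation(const double *p_f, const double *p_m,
                               const double *p_j, int nbins)
      {
          double I = 0.0;
          for (int f = 0; f < nbins; ++f)
              for (int m = 0; m < nbins; ++m) {
                  double pj = p_j[f * nbins + m];
                  if (pj > 0.0 && p_f[f] > 0.0 && p_m[m] > 0.0)
                      I += pj * log(pj / (p_f[f] * p_m[m]));   /* Eq. (1) */
              }
          return I;
      }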

    The method of interpolation used to obtain m given T(φ) can lead to false local maxima emerging in the objective function as T is iteratively optimized. By employing partial volume interpolation [2], it is possible for very small sub-voxel-grid changes in T to manifest in the probability distributions. This is vitally important since these distributions are used to compute both the cost and the gradient used to optimize T.

    Fig. 1(a) depicts a voxel in F at θ being mapped to a point φ in M that falls between points in the voxel grid. The interpolation method defines a unity volume with vertices at the center points of the nearest neighbors and then constructs partial volumes that each share the interpolation point φ as a common vertex. Once the partial volumes have been computed, they are placed into histogram bins corresponding to the intensity values of the voxels comprising the unity-volume vertices, as shown in Fig. 1(b). For example, the fractional weight w1 defined by partial volume 1 is placed into the histogram bin associated with neighbor 1, w2 in the bin for neighbor 2, etc. This method is used in computing the histograms that estimate p(T(m)) and p_j(f, T(m)).
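
    The following device-side sketch (hypothetical names; it assumes the mapped point lies strictly inside the moving image, intensities start at zero, and native float atomics are available, i.e. compute capability 2.0 or later) illustrates how the eight trilinear partial-volume weights are scattered into the moving-image histogram.

      __device__ void scatterPartialVolumes(float *hist, const float *moving,
                                            int mx, int my,               /* moving image width, height */
                                            float px, float py, float pz, /* mapped point T(theta) */
                                            int nbins, float binWidth)
      {
          int ix = (int)floorf(px), iy = (int)floorf(py), iz = (int)floorf(pz);
          float fx = px - ix, fy = py - iy, fz = pz - iz;   // fractional offsets in the unity volume
          for (int k = 0; k < 2; ++k)
              for (int j = 0; j < 2; ++j)
                  for (int i = 0; i < 2; ++i) {
                      float w = (i ? fx : 1.0f - fx) *
                                (j ? fy : 1.0f - fy) *
                                (k ? fz : 1.0f - fz);        // partial volume opposite this neighbor
                      int vox = (iz + k) * my * mx + (iy + j) * mx + (ix + i);
                      int bin = (int)(moving[vox] / binWidth);
                      if (bin >= nbins) bin = nbins - 1;
                      atomicAdd(&hist[bin], w);              // accumulate the fractional weight
                  }
      }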

    As mentioned previously, the gradient used for optimizing T is also computed using the voxel intensity probability distributions. Furthermore, because we parameterize the coordinate transformation T in terms of a sparse set of B-spline coefficients P, we may maximize the MI I through direct optimization of P. An analytic expression for this gradient may be developed by utilizing the chain rule to obtain:

    \frac{\partial I}{\partial P} = \frac{\partial I}{\partial v} \times \frac{\partial v}{\partial P} \qquad (2)

    where the second term is easily obtained by taking the derivative of the B-spline interpolation function used to compute the displacement vectors v(θ) that characterize T:

    v(\theta) = \sum_{l=1}^{4} \sum_{m=1}^{4} \sum_{n=1}^{4} \beta_l(x)\, \beta_m(y)\, \beta_n(z)\, P_{l,m,n} \qquad (3)

    where (l, m, n) form the indices of the 64 control points providing local support to the voxel at θ, to obtain:

    \frac{\partial v}{\partial P} = \sum_{l=1}^{4} \sum_{m=1}^{4} \sum_{n=1}^{4} \beta_l(x)\, \beta_m(y)\, \beta_n(z), \qquad (4)

    which is simply a function of the B-spline basis. Referring again to the first term of (2), it is important to realize that I and v are coupled through the probability distribution p_j and are therefore directly affected by the 8-neighborhood partial volume interpolation:

    \frac{\partial I}{\partial v} = \frac{\partial I}{\partial p_j(f, M(n))} \times \frac{\partial p_j(f, M(n))}{\partial v} \qquad (5)

    = \sum_{n=1}^{8} \frac{\partial I}{\partial p_j(f, M(n))} \times \frac{\partial w_n}{\partial v} \qquad (6)

    where the first term may be found by taking the derivative of (1) with respect to the joint distribution p_j:

    \frac{\partial I}{\partial p_j(f, M(n))} = \ln \frac{p_j(f, M(n))}{p(f)\, p(M(n))} - I, \qquad (7)


    and the terms ∂w_n/∂v for n ∈ [1, 8] are found by taking the derivatives of the eight partial volumes with respect to their common shared vertex.

    4 Implementation

    To construct the image intensity histograms, F and M are each partitioned into many non-overlapping subregions. Each subregion is assigned to a group of threads called a "thread block," which ultimately generates a "sub-histogram" for its assigned subregion of the image. The number of voxels in a subregion equals the number of threads within a thread block. The process begins with each thread obtaining the deformation vector v corresponding to its assigned voxel θ within the subregion of F delegated to the thread block. The thread then computes the eight partial volume weights in M corresponding to θ in F and subsequently places them into the moving-image histogram bins in parallel as per Fig. 1.

    Since all threads within a thread block perform these operations concurrently, we must ensure that threads wishing to modify the same histogram bin do not modify the same memory location simultaneously. This is accomplished by representing the histogram in memory as shown in Fig. 2. This data structure, s_partitions, resides within the GPU's shared memory and allows each thread to have its own copy of each histogram bin, which prevents write collisions. Once s_partitions is populated, all threads within the thread block are synchronized and the partitions are merged to generate the sub-histogram data structure sub_hist shown in Fig. 2. After all thread blocks have processed their assigned subregions, the sub-histograms are merged to acquire the final voxel intensity histogram.
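
    A simplified sketch of this layout (hypothetical names and sizes; the real code accumulates fractional partial-volume weights rather than integer counts) is shown below: each bin owns one slot per thread, so no two threads of a block ever write the same address, and the slots are merged after a barrier.

      #define NUM_BINS 32
      #define THREADS  64    // threads per block = voxels per subregion

      __global__ void subHistograms(const int *binOfVoxel,   // precomputed bin index per voxel
                                    unsigned int *sub_hist,  // gridDim.x * NUM_BINS entries
                                    int nVoxels)
      {
          __shared__ unsigned int s_partitions[NUM_BINS * THREADS];
          for (int b = threadIdx.x; b < NUM_BINS * THREADS; b += blockDim.x)
              s_partitions[b] = 0;
          __syncthreads();

          int v = blockIdx.x * blockDim.x + threadIdx.x;
          if (v < nVoxels)                                    // private slot: bin * THREADS + thread
              s_partitions[binOfVoxel[v] * THREADS + threadIdx.x] += 1;
          __syncthreads();

          for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
              unsigned int sum = 0;                           // merge the per-thread partitions
              for (int t = 0; t < THREADS; ++t)
                  sum += s_partitions[b * THREADS + t];
              sub_hist[blockIdx.x * NUM_BINS + b] = sum;      // this block's sub-histogram
          }
      }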

    While this method efficiently computes p(f) and p(T(m)), it cannot be used to generate p_j(f, T(m)) since current GPUs do not have adequate shared memory to maintain individual copies of the joint histogram for each thread in a thread block. We therefore employ atomic operations recently made available in modern GPU models to guarantee mutually exclusive access to histogram bins. The image is once again divided into subregions, each of which is assigned to an individual thread block to generate the corresponding sub-histogram. However, instead of maintaining a separate partition for each thread within a bin, all threads write to the same bin in shared memory. The GPU instruction atomicExch(x, y) allows a thread to swap a value x in a shared-memory location with a register value y that is private to the thread, while returning the previous value of x. If multiple threads attempt to exchange their private values with the same memory location simultaneously, only one will succeed. This mechanism is used in combination with a thread fence to achieve a locking mechanism that emulates atomic floating-point addition on GTX 200 series and newer Nvidia GPUs. Starting with the GTX 300 series, atomic floating-point addition is natively supported, providing further increases in execution speed.
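
    The paper's locking scheme is built on atomicExch and a thread fence; a commonly used alternative emulation of floating-point atomic addition, sketched below for devices without native support, spins on atomicCAS instead (this is a generic pattern, not the paper's code).

      __device__ float atomicAddFloatCAS(float *address, float val)
      {
          int *address_as_int = (int *)address;
          int old = *address_as_int, assumed;
          do {
              assumed = old;                                  // snapshot the current bits
              old = atomicCAS(address_as_int, assumed,
                              __float_as_int(val + __int_as_float(assumed)));
          } while (assumed != old);                           // retry if another thread intervened
          return __int_as_float(old);
      }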

    Since medical images tend to contain a predominant intensity, that of air for example, the use of atomic operations may lead to severe serialization due to a majority of threads contending to update a single histogram bin. This is avoided by inferring the value of the predominant bin rather than accumulating it directly.


    [Figure 2 layout: s_partitions[ ] stores, for each bin (Bin 0 ... Bin N), one slot per thread (thread 0 ... thread N); sub_hist[ ] stores one complete sub-histogram (Bin 0 ... Bin N) per thread block.]

    Fig. 2. Memory organization of data structures used to generate voxel intensity histograms for the fixed and moving images.

    Once all other bins have been populated, the predominant bin is inferred; the inference is simple: all bins must sum to the number of voxels within the overlap domain Ω. Initially, the inferred bin is chosen based on the image modality, but as the registration is performed over multiple iterations, the largest bin is tracked and inferred for the next iteration. Experimental results for CT images indicate a noticeable speedup since, on average, 80% of all GPU threads attempt to bin values corresponding to air, which would otherwise be serialized; thus, inferring effectively mitigates the negative impact on execution speed normally encountered when using atomic memory operations.

    Parallel computations for the deformation vectors v and the optimization gradient ∂I/∂P are performed similarly to our unimodal implementation based on the sum of squared differences [5]. These methods carry over easily since the computation is largely dominated by the B-spline parameterization and not by the employed similarity metric. However, a slight extension to the optimization gradient computation is required due to the new form of the ∂I/∂v term (6). Broadly viewed, the gradient computation consists of three highly parallel stages:

    – Stage 1 is the fetching stage. In this stage a group of 64 threads works in unison to fetch a set of 64 contiguous ∂I/∂v values. Once fetched, Stage 1 ends with each thread computing the local coordinates within the local support region for its ∂I/∂v value at θ.

    – Stage 2 is the processing stage. This stage sequentially cycles through each of the 64 possible B-spline piecewise function combinations based on the local coordinates computed in Stage 1. Each function combination is applied to each of the 64 ∂I/∂v values in parallel; the result is stored into a temporary 64-float-wide sum reduction array located in shared memory. Once fully populated, the array is reduced to a single sum, which is then accumulated into a region of shared memory contributions, which is indexed by the piecewise function combination, as in Fig. 3. Stage 2 ends once these operations have been performed for all 64 possible piecewise function combinations. Upon completion, control is returned to Stage 1, which proceeds to begin another cycle by fetching the next set of 64 contiguous ∂I/∂v values.

    – Stage 3 is the distribution stage. Once Stage 2 has processed all the ∂I/∂v values within a region of local support, there will be 64 gradient contributions stored within shared memory that need to be distributed to the control points they influence. Because these contributions are indexed


    [Figure 3 layout: Stage 1 input di_dv_x[ ], with each region's values zero-padded to a multiple of 64; Stage 2 output s_reduction_x[ ], 64 contribution values per control point; Stage 3 output di_dp[ ], X/Y/Z coefficients per control point.]

    Fig. 3. Memory organization and data flow involved in the parallel computation of the cost-function gradient.

    in shared memory using the piecewise basis combination numbers, and since the location of an influenced control point may be determined from this combination number, the distribution becomes a simple sorting problem. There are 64 threads and 64 gradient contributions that require distribution; Stage 3 assigns one contribution to each thread and distributes them appropriately based on their combination number. To enable multiple groups of 64 threads to perform this operation in massive parallelism upon different regions of local support, each control point has 64 "slots" in memory for storing the 64 contributions it will receive. Once all 64 slots are populated for all control-point coefficients, a simple sum-reduction kernel is employed to add up each control-point coefficient's 64 slots, thus producing the gradient ∂I/∂P (a reduction sketch follows).
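
    The final sum reduction can be sketched as follows (hypothetical names; one 64-thread block per control-point coefficient, matching the 64 slots described above):

      __global__ void reduceSlots(const float *slots,   // numCoeff * 64 gradient contributions
                                  float *di_dp,         // numCoeff entries of dI/dP
                                  int numCoeff)
      {
          __shared__ float s_red[64];
          int c = blockIdx.x;                            // one block per coefficient
          if (c >= numCoeff) return;
          s_red[threadIdx.x] = slots[c * 64 + threadIdx.x];
          __syncthreads();
          for (int stride = 32; stride > 0; stride >>= 1) {   // tree reduction over the 64 slots
              if (threadIdx.x < stride)
                  s_red[threadIdx.x] += s_red[threadIdx.x + stride];
              __syncthreads();
          }
          if (threadIdx.x == 0)
              di_dp[c] = s_red[0];
      }

    Launched as reduceSlots<<<numCoeff, 64>>>(...), every coefficient's 64 slots collapse to one gradient value.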

    5 Results

    Here we present runtime and registration quality results for common multi-core CPU and GPU architectures; for comparison, a highly optimized single-core CPU implementation is also characterized (see footnote 1). The multi-core CPU implementation uses OpenMP [1] to parallelize the processes of histogram generation, cost-function evaluation, and gradient computation. The GPU implementation uses the Nvidia CUDA programming interface to perform both histogram generation and gradient computation on the GPU; cost-function evaluation and quasi-Newtonian optimization remain on the CPU and account for less than 2% of the total compute time. The tests reported here use a machine equipped with a 2.6 GHz quad-core Intel i7 920 processor and 12 GB of RAM.

    Footnote 1: The single-core CPU implementation uses the SSE instruction set when applicable and exhibits a very low miss rate of 0.1% in both the L1 and L2 data caches according to the Valgrind profiler [3] for Linux.


    Fig. 4. (a) A 512 × 384 × 16 MR volume (red) superimposed upon a 512 × 512 × 115 CT volume (blue) prior to deformable registration. (b) Post-registration superimposition produced after 20 L-BFGS-B optimizer iterations using the GPU-accelerated implementation. The B-spline control-point grid spacing is 100 × 100 × 100 voxels.

    Performance results are acquired for two different GPUs: the 240-core Tesla C1060 clocked at 1.5 GHz with 4 GB of onboard memory and the 448-core Tesla C2050 clocked at 1.1 GHz with 2.6 GB of onboard memory.

    Fig. 4 shows the registration results produced by the GPU implementation for a CT-MR pair of thoracic images: a 512 × 384 × 16 voxel MR volume (red) and a 512 × 512 × 115 voxel CT volume (blue). Fig. 4(a) superimposes the images prior to registration and Fig. 4(b) after registration. The CT is the fixed image F and the MR the moving image M; the B-spline control-point spacing is 100 × 100 × 100 voxels. The image intensity probability distributions are estimated using histograms of 32 equally wide bins each for F and M; thus, the joint distribution is estimated using 1024 bins. The result shown is produced after 20 L-BFGS-B optimizer iterations.

    The effect of input volume resolution on execution time is determined by setting a constant control-point spacing of 15 × 15 × 15 voxels and measuring the execution time of a single optimization iteration for increasingly large synthetically generated input volumes in 10 × 10 × 10 voxel increments. Fig. 5(a) summarizes the results. As expected, the execution time increases linearly with the number of voxels. The OpenMP version offers slightly better performance with respect to the reference implementation, with a speedup of 2.5 times. We attribute this to the level of serialization imposed by the memory locks used in histogram construction. The GPU achieves a speedup of 21 times with respect to the reference implementation and 8.5 times with respect to the OpenMP implementation. Also, the relatively simple optimization step of inferring the value of the histogram bin expected to incur the most write conflicts (in other words, the most atomic-exchange operations) significantly improves performance on the Tesla C1060 GPU; for actual CT data, this optimization step speeds up the histogram generation phase by five times. Finally, the atomicAdd instruction available on newer models such as the Tesla C2050 allows histogram bins to be


  • 0

    20

    40

    60

    80

    100

    120

    300x

    300x

    300

    320x

    320x

    320

    340x

    340x

    340

    360x

    360x

    360

    380x

    380x

    380

    400x

    400x

    400

    420x

    420x

    420

    Volume Size (voxels)

    Exe

    cuti

    on

    Tim

    e (

    seco

    nd

    s)

    (a)

    10 20 30 40 50 60 70 80 90 1000

    5

    10

    15

    20

    25

    30

    35

    Control Point Spacing (voxels)

    Exe

    cuti

    on

    Tim

    e (

    seco

    nd

    s)

    Serial Implementation

    OpenMP Implementation

    Tesla C1060 (no inference)

    Tesla C1060

    Tesla C2050

    (b)

    Fig. 5. (a)Single iteration execution time vs volume size for a control-point spacingof 153. (b)Single iteration execution time vs control-point spacing for a volume size of260× 260× 260 voxels.

    incremented in a thread-safe manner without the need for atomic exchanges and thread fences. Consequently, the Tesla C2050 spends only 60% of its processing time generating histograms, whereas the C1060 spends slightly over 70% of its time.

    The method used for parallel gradient computation exploits key attributes of the uniform control-point spacing scheme to achieve fast execution times. Each local support region defined by 64 control points is farmed out to an individual core. If the volume size is constant, increasing the control-point spacing results in fewer, yet larger, work units for the cores to process. Fig. 5(b) shows the impact of varying the control-point spacing in increments of 5 × 5 × 5 voxels for a 260 × 260 × 260 voxel fixed image. Note that the execution times are relatively unaffected by the control-point spacing for all implementations. The slight sub-linear increase in execution time for the GPU implementations is attributed to the work-unit size becoming large enough that the processing time begins to dominate the time required to swap work units in and out of the cores.

    6 Conclusions

    We have developed a multi-core accelerated B-spline based deformable registration process for aligning multi-modal images, suitable for use on multi-core processors and GPUs. We developed and implemented parallel algorithms for the major steps of the registration process: generating a deformation field using the B-spline control-point grid, calculating the image histograms needed to compute the mutual information, and calculating the change in the mutual information


    with respect to the B-spline coefficients for the gradient-descent optimizer. We have evaluated the multi-core CPU and GPU implementations in terms of both execution speed and registration quality. Our results indicate that the speedup varies with volume size but is relatively insensitive to the control-point spacing. Our GPU-based implementations achieve, on average, a speedup of 21 times with respect to the reference implementation and 7.5 times with respect to a multi-core CPU implementation using four cores, with identical registration quality. We hope that such improvements in processing speed will result in deformable registration methods being routinely utilized in interventional procedures such as image-guided surgery and image-guided radiotherapy.

The source code for the algorithms presented here has been published online as part of Plastimatch and can be downloaded freely under an open source BSD-style license from www.plastimatch.org.

    Acknowledgments

This work was supported in part by NSF ERC Innovation Award EEC-0946463, the Federal share of program income earned by MGH on C06CA059267, and is part of the National Alliance for Medical Image Computing (NAMIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 2-U54-EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics.




A Fast Parallel Implementation of Queue-based Morphological Reconstruction using GPUs

George Teodoro, Tony Pan, Tahsin M. Kurc, Lee Cooper, Jun Kong, and Joel H. Saltz

    {gteodor,tony.pan,tkurc,lee.cooper,jun.kong,jhsaltz}@emory.edu

    Center for Comprehensive Informatics, Emory University, Atlanta, GA 30322

Abstract. In this paper we develop and experimentally evaluate a novel GPU-based implementation of the morphological reconstruction operation. This operation is commonly used in the segmentation and feature computation steps of image analysis pipelines, and often serves as a component in other image processing operations. Our implementation builds on a fast hybrid CPU algorithm, which employs a queue structure for efficient execution, and is the first GPU-enabled version of the queue-based hybrid algorithm. We evaluate our implementation using state-of-the-art GPU accelerators and images obtained by high resolution microscopy scanners from whole tissue slides. The experimental results show that our GPU version achieves up to a 20× speedup as compared to the sequential CPU version. Additionally, our implementation's performance is superior to the previously published GPU-based morphological reconstruction, which is built on top of a slower baseline version of the operation.

    1 Introduction

Image data and image analysis have applications in a wide range of domains, including satellite data analyses, astronomy, computer vision, and biomedical research. In biomedical research, for instance, digitized microscopy imaging plays a crucial role in the quantitative characterization of biological structures in great detail at cellular and sub-cellular levels. Instruments to capture high resolution images from tissue slides and tissue microarrays have advanced significantly in the past decade. A state-of-the-art slide scanner can capture a 50K×50K pixel image in a few minutes from a tissue sectioned at 3-5 micron thickness, enabling the application of this technology in research and healthcare delivery. Systems equipped with auto-focus mechanisms can process batches of slides with minimal manual intervention and are making large databases of high resolution slides feasible.

In brain tumor studies, for instance, images created using these devices may contain up to 20 billion pixels (with digitization at 40X objective magnification), or approximately 56 GB in size when represented in 8-bit uncompressed RGB format. To extract meaningful information from these images, they usually are processed through a series of steps such as color normalization, segmentation, feature computation, and classification in order to extract anatomic structures at cellular and sub-cellular levels and classify them [1]. The analysis steps consist of multiple data- and compute-intensive operations.


The processing of a large image can take tens of hours.

On the computational hardware front, there is a shift towards heterogeneous high-performance computing architectures consisting of multi-core CPUs and multiple general-purpose graphics processing units (GPGPUs), and GPGPUs have become a popular implementation platform for scientific applications. Image analysis algorithms and their underlying operations, nevertheless, need to take advantage of GPGPUs to fully exploit the processing power of such systems.

In this paper, we develop and experimentally evaluate an efficient GPU-based implementation of the morphological reconstruction operation. Morphological reconstruction is commonly used in the segmentation, filtering, and feature computation steps of image analysis pipelines. Morphological reconstruction is essentially an iterative dilation of a marker image, constrained by the mask image, until stability is reached (the details of the operation are described in Section 2). The processing structure of morphological reconstruction is also common in other application domains. The techniques for morphological reconstruction can be extended to such applications as the determination of Euclidean skeletons [8] and skeletons by influence zones [5]. Additionally, Delaunay triangulations [10], Gabriel graphs [2] and relative neighborhood graphs [11] can be derived from these methods, and obtained in arbitrary binary pictures [12].

    2 Morphological Reconstruction Algorithms

A number of morphological reconstruction algorithms have been developed and evaluated by Vincent [13]. We refer the reader to his paper for a more formal definition of the operation and the details of the algorithms. In this section we briefly present the basics of morphological reconstruction, including definitions and algorithm versions.

Morphological reconstruction algorithms were developed for both binary and gray scale images. In gray scale images the value of a pixel p, I(p), comes from a set {0, ..., L−1} of gray levels in a discrete or continuous domain, while in binary images there are only two levels. Figure 1 illustrates the process of gray scale morphological reconstruction in 1-dimension. The marker intensity profile (red) is propagated spatially but is bounded by the mask image's intensity profile (blue). In a simplified form, the reconstruction ρ_I(J) of mask I from marker image J is done by performing elementary dilations (i.e., dilations of size 1) in J by a structuring element G. Here, G is a discrete grid, which provides the neighborhood relationships between pixels. A pixel p is a neighbor of pixel q if and only if (p, q) ∈ G. G is usually a 4-connected or 8-connected grid. An elementary dilation from a pixel p corresponds to propagation from p to its immediate neighbors in G. The basic algorithm carries out elementary dilations successively over the entire image J, updates each pixel in J with the pixelwise minimum of the dilation's result and the corresponding pixel in I (i.e., J(p) ← (max{J(q), q ∈ N_G(p) ∪ {p}}) ∧ I(p), where N_G(p) is the neighborhood of a pixel p on grid G and ∧ is the pixelwise minimum operator), and stops when stability is reached, i.e., when no more pixel values are modified.
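For illustration, the following host-side sketch (not taken from [13]) performs one such elementary dilation of a gray scale marker J under mask I on a 4-connected grid; iterating it until it reports no change yields the basic reconstruction described above. The function name and the 8-bit pixel type are assumptions.

#include <algorithm>
#include <vector>

// One elementary dilation of marker J bounded by mask I (4-connected grid).
// Returns true if any pixel changed, so the caller can loop until stability.
bool elementary_dilation(const std::vector<unsigned char>& I,
                         std::vector<unsigned char>& J, int W, int H)
{
    std::vector<unsigned char> out(J);
    bool changed = false;
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int p = y * W + x;
            unsigned char m = J[p];                       // include p itself
            if (x > 0)     m = std::max(m, J[p - 1]);     // neighbors in N_G(p)
            if (x < W - 1) m = std::max(m, J[p + 1]);
            if (y > 0)     m = std::max(m, J[p - W]);
            if (y < H - 1) m = std::max(m, J[p + W]);
            unsigned char v = std::min(m, I[p]);          // pixelwise minimum with the mask
            if (v != J[p]) changed = true;
            out[p] = v;
        }
    }
    J.swap(out);
    return changed;
}

The SR, QB, and FH variants described next are progressively faster ways of reaching the same fixed point as repeated application of this elementary step.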


Fig. 1. Gray scale morphological reconstruction in 1-dimension. The marker image intensity profile is represented as the red line, and the mask image intensity profile is represented as the blue line. The final image intensity profile is represented as the green line. The arrows show the direction of propagation from the marker intensity profile to the mask intensity profile. The green region shows the changes introduced by the morphological reconstruction process.

Vincent has presented several morphological reconstruction algorithms, which are based on this core technique [13]:

Sequential Reconstruction (SR): Pixel value propagation in the marker image is computed by alternating raster and anti-raster scans. A raster scan starts from the pixel at (0, 0) and proceeds to the pixel at (N − 1, M − 1) in a row-wise manner. An anti-raster scan starts from the pixel at (N − 1, M − 1) and moves to the pixel at (0, 0) in a row-wise manner. Here, N and M are the resolutions of the image in the x and y dimensions, respectively. In each scan, values from pixels in the upper left or the lower right half neighborhood are propagated to the current pixel in raster or anti-raster fashion, respectively. The raster and anti-raster scans allow changes in a pixel to be propagated within the current iteration. The SR method iterates until stability is reached, i.e., no more changes in pixels are computed.

Queue-based Reconstruction (QB): In this method, a first-in first-out (FIFO) queue is initialized with the pixels in the regional maxima. The computation then proceeds by removing a pixel from the queue, scanning the pixel's neighborhood, and queuing the neighbor pixels that changed. The overall process continues until the queue is empty. The regional maxima needed to initialize the queue are, however, costly to compute.

Fast Hybrid Reconstruction (FH): This approach incorporates the characteristics of the SR and QB algorithms, and is about one order of magnitude faster than the others. It first makes one pass using the raster and anti-raster scans as in SR. After that pass, it continues the computation using a FIFO queue as in QB.

A pseudo-code implementation of FH is presented in Algorithm 1, where N+_G(p) and N−_G(p) denote the sets of neighbors in N_G(p) that are reached before and after touching pixel p during a raster scan, respectively.


Algorithm 1 Fast Hybrid gray scale reconstruction

Input:
  I: mask image
  J: marker image, defined on domain D_I, J ≤ I
 1: Scan D_I in raster order.
 2:   Let p be the current pixel
 3:   J(p) ← (max{J(q), q ∈ N+_G(p) ∪ {p}}) ∧ I(p)
 4: Scan D_I in anti-raster order.
 5:   Let p be the current pixel
 6:   J(p) ← (max{J(q), q ∈ N−_G(p) ∪ {p}}) ∧ I(p)
 7:   if ∃ q ∈ N−_G(p) such that J(q) < J(p) and J(q) < I(q) then
 8:     fifo_add(p)
 9: {Queue-based propagation step}
10: while fifo_empty() = false do
11:   p ← fifo_first()
12:   for all q ∈ N_G(p) do
13:     if J(q) < J(p) and I(q) ≠ J(q) then
14:       J(q) ← min{J(p), I(q)}
15:       fifo_add(q)

    3 Fast Parallel Hybrid Reconstruction Using GPUs

To the best of our knowledge, the only existing GPU implementation of morphological reconstruction is a modified version of SR [4]: SR GPU. As discussed before, SR is about an order of magnitude slower than FH. Therefore, SR GPU is built on top of a slow baseline algorithm, limiting its gains when compared to the fastest CPU algorithm, FH. This section describes our GPU parallelization of the Fast Hybrid Reconstruction algorithm, referred to in this paper as FH GPU.

FH GPU, like its CPU-based counterpart, consists of a raster and anti-raster scan phase followed by a queue-based phase. To implement the raster and anti-raster scanning in our algorithm we extended the approach of Karas [4]. We have made several changes to SR GPU: (1) we have templated our implementation to support binary images as well as gray scale images with integer and floating point data types; (2) we have optimized the implementation and multi-threaded execution for 2-dimensional images (the original implementation was designed for 3-dimensional images), which allows us to specialize the Y-direction propagation during the scans; and (3) we have tuned the shared memory utilization to minimize communication between memory hierarchies. For a detailed description of the GPU-based raster scan phase we direct the reader to [4]. The rest of this section discusses the queue-based phase, which is the key to efficient execution of morphological reconstruction and is the focus of this work.

    3.1 Queue-based Phase Parallelization Strategy

After the raster scan phase, active pixels, i.e., those that satisfy the propagation condition to a neighbor pixel, are inserted into a global queue for computation.


The global queue is then equally partitioned into a number of smaller queues. Each of these queues is assigned to a GPU thread block. Each block can carry out computations independently of the other blocks.

Two levels of parallelism can be implemented in the queue-based computation performed by each thread block: (i) pixel level parallelism, which allows pixels queued for computation to be processed independently; and (ii) neighborhood level parallelism, which performs the concurrent evaluation of a pixel's neighborhood. The parallelism in both cases, however, comes at the cost of dealing with potential race conditions that may occur when a pixel is updated concurrently.

Algorithm 2 GPU-based queue propagation phase

 1: {Split initial queue equally among thread blocks}
 2: while queue_empty() = false do
 3:   while (p = dequeue(...)) != EMPTY do in parallel
 4:     for all q ∈ N_G(p) do
 5:       if J(q) < J(p) and I(q) ≠ J(q) then
 6:         oldval = atomicMax(&J(q), min{J(p), I(q)})
 7:         if oldval < min{J(p), I(q)} then
 8:           queue_add(q)
 9:   if tid = 0 then
10:     queue_swap_in_out()
11:   syncthreads()
12:   threadfence_block()

During execution, a maximum operation is performed on the value of the current pixel being processed (p) and that of each pixel in the neighborhood (q ∈ N_G(p)), limited by the fixed mask value I(q). To perform this computation correctly, atomicity is required in the maximum and update operations. This can only be achieved at the cost of extra synchronization overheads. The use of atomic CAS operations is sufficient to solve this race condition. The efficiency of atomic operations in GPUs has been significantly improved in the last generation of NVidia GPUs (Fermi) [9] because of the use of cache memory. Atomic operations, however, are still more efficient in cases where threads do not concurrently try to update the same memory address. When multiple threads attempt to update the same memory location, they are serialized in order to ensure correct execution. As a result, the number of operations successfully performed per cycle is reduced [9].

To lessen the impact of this potential serialization, our GPU parallelization employs pixel level parallelism only. Neighborhood level parallelism is problematic in this case because the concurrent processing and updating of pixels in a given neighborhood likely affect one another, since the pixels are located in the same region of the image. Unlike traditional graph computing algorithms (e.g., Breadth-First Search [3, 6]), the lack of parallelism in neighborhood processing does not result in load imbalance in the overall execution, because all pixels have the same number of neighbors, which is defined by the structuring element G.


The GPU-based implementation of the queue propagation operation is presented in Algorithm 2. After splitting the initial queue, each block of threads enters a loop in which pixels are dequeued in parallel and processed, and new pixels may be added to the local queue as needed. This process continues until the queue is empty. Within each loop, the pixels queued in the last iteration are uniformly divided among the threads in the block and processed in parallel (Lines 3–8 in Algorithm 2). The value of each queued pixel p is compared to that of every pixel q in the neighborhood of p. An atomic maximum operation is performed when a neighbor pixel q should be updated. The value of pixel q before the maximum operation is returned (oldval) and used to determine whether the pixel's value has really been changed (Line 7 in Algorithm 2). This step is necessary because another thread might have changed the value of q between the time the pixel's value is read to perform the update test and the time the atomic operation is performed (Lines 5 and 6 in Algorithm 2). If the maximum operation performed by the current thread has changed the value of pixel q, q is added to the queue for processing in the next iteration of the loop. Even with this control, pixel q may be modified again between the test in Line 7 and the addition to the queue. In this case, q is added multiple times to the queue; although this may impact performance, the correctness of the algorithm is not affected.
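Expressed in CUDA, the core of this update (Lines 5–8 of Algorithm 2) might look like the following sketch for an integer-valued marker; the function signature and the queue helper are hypothetical, and only the atomicMax/re-queue pattern reflects the description above.

// Hypothetical per-thread queue insertion (stands in for the multi-level
// queue of Section 3.2); not part of CUDA.
__device__ void push_to_queue(int q);

// Per-neighbor propagation step for an integer marker J and mask I.
__device__ void propagate(int* J, const int* I, int p, int q)
{
    if (J[q] < J[p] && I[q] != J[q]) {
        int newval = min(J[p], I[q]);
        // atomicMax returns the previous value of J[q]; only the thread that
        // actually raised it re-queues q. A later concurrent update merely
        // causes a harmless duplicate queue entry.
        int oldval = atomicMax(&J[q], newval);
        if (oldval < newval)
            push_to_queue(q);
    }
}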

After the pixels from the last iteration have been computed, the pixels added to the queue in the current iteration are made available for the next iteration, and all writes performed by threads in a block are guaranteed to be consistent (Lines 9 to 12 in the algorithm). As described in the next section, the choice to process pixels in rounds, instead of making each pixel inserted into the queue immediately accessible, is made in our design to implement a queue with very efficient read (dequeue) performance and with low synchronization costs to provide consistency among threads when multiple threads read from and write to the queue.

    3.2 Parallel Queue Implementation and Management

The parallel queue is a core data structure employed by FH GPU. It is used to store information about pixels that are active in the computation and should be processed. An efficient parallel queue for GPUs is a challenging problem [3, 4, 6] due to the sequential and irregular nature of accesses.

A straightforward implementation of a queue (named Naïve here), as presented by Hong et al. [3], could be done by employing an array to store items in sequence and using atomic additions to calculate the position where each item should be inserted. Hong et al. stated that this solution worked well for their use case. The use of atomic operations, however, is inefficient when the queue is heavily employed, as in our algorithm. Moreover, a single queue that is shared among all thread blocks introduces additional overheads to guarantee consistency across threads in the entire device.

We have designed a parallel queue that operates independently on a per-thread-block basis to avoid inter-block communication, and is built as a cascade of storage levels in order to exploit the fast GPU memories for efficient write and read accesses.


Fig. 2. Multi-level parallel queue.

The parallel queue (depicted in Figure 2) is constructed in multiple levels: (i) Per-Thread Queues (TQ), which are very small queues private to each thread, residing in shared memory; (ii) the Block Level Queue (BQ), which also resides in shared memory but is larger than a TQ, and to which write operations are performed on a thread-warp basis; and (iii) the Global Block Level Queue (GBQ), which is the largest queue and uses the global memory of the GPU to accumulate data stored in BQ when the size of BQ exceeds the shared memory size.

The use of multiple levels of queues improves the scalability of the implementation, since portions of the parallel queue are utilized independently and the fast memories are employed more effectively. In our queue implementation, each thread maintains an independent queue (TQ) that does not require any synchronization. In this way threads can efficiently store pixels from a given neighborhood in the queue for computation. Whenever this level of queue is full, a warp-level communication is performed to aggregate the items stored in the local queues and write them to BQ. In this phase, a parallel thread-warp prefix-sum is performed to compute the total number of items queued in the individual TQs. A single shared memory atomic operation is performed on the Block Level Queue to identify the position where the warp of threads should write its data. This position is returned, and the warp of threads writes its local queues to BQ.
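The warp-level flush can be sketched as follows. This is a hedged illustration of the prefix-sum-and-single-atomic pattern just described, written with warp shuffle intrinsics from later CUDA versions rather than the shared-memory scan the Fermi-era code presumably used; all names and sizes are assumptions, a full warp is assumed active, and overflow handling into GBQ is omitted.

// Flush a per-thread queue (tq, tq_count items) into the shared-memory block
// level queue bq; bq_count is a shared-memory counter.
__device__ void flush_tq_to_bq(const int* tq, int tq_count, int* bq, int* bq_count)
{
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // Inclusive prefix sum of tq_count across the warp (Kogge-Stone).
    int incl = tq_count;
    for (int d = 1; d < 32; d <<= 1) {
        int n = __shfl_up_sync(full, incl, d);
        if (lane >= d) incl += n;
    }
    int offset = incl - tq_count;                  // exclusive offset for this lane
    int total  = __shfl_sync(full, incl, 31);      // total items queued by the warp

    // A single atomic on the shared-memory counter reserves space for the warp.
    int base = 0;
    if (lane == 0) base = atomicAdd(bq_count, total);
    base = __shfl_sync(full, base, 0);

    // Each thread copies its private queue into its reserved slice of BQ.
    for (int i = 0; i < tq_count; ++i)
        bq[base + offset + i] = tq[i];
}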

Whenever a BQ is full, the threads in the corresponding thread block are synchronized. The current size of the queue is used in a single operation to calculate the offset in GBQ at which the BQ contents should be stored. After this step, all the threads in the thread block work collaboratively to copy the contents of BQ to GBQ in parallel. This data transfer operation is able to achieve high throughput, because the memory accesses are coalesced. It should be noted that, in all levels of the parallel queue, an array is used to hold the queue content. While the contents of TQ and BQ are copied to the lower level queues when full, GBQ does not have a lower level queue to which its content can be copied. In the current implementation, the size of GBQ is initialized with a fixed, tunable amount of memory.


If the limit of GBQ is reached during an iteration of the algorithm, excess pixels are dropped and not stored. The GPU method returns a boolean value to the CPU to indicate that some pixels have been dropped during the execution of the kernel. In that case, the algorithm has to be re-executed, using the output of the previous execution as input, because this output already holds a partial solution of the problem. Hence, recomputing using the output of the previous step as input results in the same solution as if the execution had been carried out in a single step. The operation of reading items from the queue for computation is embarrassingly parallel, as we statically partition the pixels queued at the beginning of each iteration.
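The recovery path for a GBQ overflow can be pictured with the following hedged host-side sketch; run_reconstruction_pass is a hypothetical wrapper around one GPU execution (raster scans plus queue phase) that returns true when the overflow flag was set, and the loop simply resumes from the partial solution.

// Hypothetical wrapper: performs one full GPU execution over the marker/mask
// pair and returns true if GBQ overflowed and pixels were dropped.
bool run_reconstruction_pass(int* d_marker, const int* d_mask, int width, int height);

// Re-execution loop described above: each pass starts from the partial
// solution left in d_marker, so the final result equals a single
// uninterrupted execution.
void morphological_reconstruction(int* d_marker, const int* d_mask, int width, int height)
{
    while (run_reconstruction_pass(d_marker, d_mask, width, height)) {
        // queue overflowed; re-run with the partial result as the new marker
    }
}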

    4 Experimental Results

We evaluate the performance of the GPU-enabled fast hybrid morphological reconstruction algorithm (FH GPU), comparing it to the fastest existing CPU and GPU implementations under different configurations. Test images used in the experiments were extracted from the set of images collected in a brain tumor research study. Each test image had been scanned at 20× magnification, resulting in roughly 50K×50K pixels. The images used in this study are partitioned into tiles to perform image analyses on a parallel machine. A tile is processed by a single GPU or a single CPU core. Hence, the performance numbers presented in this paper are for individual tiles.

We used a system with a contemporary CPU (Intel i7, 2.66 GHz) and two NVIDIA GPUs (C2070 and GTX 580). Codes used in our evaluation were compiled using gcc 4.2.1 with the "-O3" optimization flag, and NVidia CUDA SDK 4.0. The experiments were repeated 3 times and, unless otherwise stated, the standard deviation was not observed to be higher than 1.3%.

    4.1 Results

The evaluation of morphological reconstruction, as the input data size varies, using 4-connected and 8-connected grids is presented in Table 1. The speedups are calculated using the single-core CPU version as the baseline.

The results show that both SR GPU and FH GPU achieve higher speedups with 8-connected grids. The performance differences are primarily a consequence of the GPU's ability to accommodate the higher bandwidth demands of morphological reconstruction with an 8-connected grid, thanks to the higher random memory access bandwidth of the GPU. For example, when the input tile is 4K×4K, the CPU throughput, measured as the number of pixels visited and compared per second, increases from 67 to 85 million pixels per second as the grid changes from 4-connected to 8-connected. For the same scenario, FH GPU has a much larger throughput improvement, going from 746 to 1227 million pixels per second. This difference in computing rates highlights that morphological reconstruction has abundant parallelism primarily in memory operations (like other graph algorithms [3, 7]), which limits the performance of the CPU for the 8-connected case.


Input      FH CPU    SR GPU (ms)         FH GPU (ms)         SR GPU speedup      FH GPU speedup
Size       (ms)      C2070    GTX 580    C2070    GTX 580    C2070    GTX 580    C2070    GTX 580

4-connected grid
4K×4K       1990      1578      1007       281       181      1.26      1.97      7.08     10.99
8K×8K       8491      5393      3404      1034       646      1.57      2.49      8.21     13.14
12K×12K    19174     11773      6688      2334      1350      1.62      2.86      8.21     14.2
16K×16K    34818     19278     10990      3960      2256      1.80      3.16      8.79     15.43

8-connected grid
4K×4K       1314       599       401       169       115      2.19      3.27      7.77     11.42
8K×8K       5267      1646      1089       502       316      3.19      4.83     10.49     16.66
12K×12K    11767      3156      1952      1023       656      3.72      6.02     11.5      17.93
16K×16K    20965      5076      3088      1736      1078      4.13      6.78     12.07     19.44

Table 1. Impact of the input tile size on the performance of (i) FH: the fastest CPU version, (ii) SR GPU: the previous GPU implementation [4], and (iii) FH GPU: our queue-based GPU version.

The use of an 8-connected grid is usually more beneficial for SR GPU, since the propagation of pixel values is more effective for larger grids and, as a result, fewer raster scans are needed to achieve stability.

The performance of FH GPU as the input data size varies is also interesting. As larger input tiles are processed, keeping the number of threads fixed, the algorithm increases its efficiency. The change from 4K×4K to 8K×8K tiles improved the throughput from about 1227 to 1533 million pixels visited and compared per second, using an 8-connected grid and the GTX 580. The throughput increase is a consequence of better utilization of the GPU computing power, owing to the smaller amortized synchronization costs at the end of each iteration. The speedup over the CPU version is also higher, since there is no throughput improvement for the CPU as the tile size increases.

For tiles larger than 8K×8K, no further improvement in throughput is observed for FH GPU. Small increases in speedup compared to the CPU version are observed for the 12K×12K and 16K×16K image tiles. These gains are a consequence of the better performance achieved by the multiple iterations of the raster scan phase, which is the initial phase of FH GPU before the queue-based execution phase. The raster scans perform better for larger image tiles because the overhead of launching the 4 GPU kernels used by the raster scan phase is better amortized, since more pixels are modified per pass.

    5 Conclusions

We have presented an implementation of a fast hybrid morphological reconstruction algorithm for GPUs. Unlike a previous implementation, the new GPU algorithm is based on an efficient sequential algorithm and employs a queue-based computation stage to speed up processing. Our experimental evaluation on two state-of-the-art GPUs, using high resolution microscopy images, shows that (1) multiple levels of parallelism can be leveraged to implement an efficient queue and (2) a multi-level queue implementation allows for better utilization of the fast memory hierarchies in a GPU and reduces synchronization overheads.


Acknowledgments. This research was funded, in part, by grants from the National Institutes of Health through contract HHSN261200800001E by the National Cancer Institute; and contracts 5R01LM009239-04 and 1R01LM011119-01 from the National Library of Medicine, R24HL085343 from the National Heart Lung and Blood Institute, NIH NIBIB BISTI P20EB000591, RC4MD005964 from the National Institutes of Health, and PHS Grant UL1TR000454 from the Clinical and Translational Science Award Program, National Institutes of Health, National Center for Advancing Translational Sciences. This research used resources of the Keeneland Computing Facility at the Georgia Institute of Technology, which is supported by the National Science Foundation under Contract OCI-0910735. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We also want to thank Pavel Karas for releasing the SR GPU implementation used in our comparative evaluation.

    References

1. L. Cooper, J. Kong, D. Gutman, F. Wang, S. Cholleti, T. Pan, P. Widener, A. Sharma, T. Mikkelsen, A. Flanders, D. Rubin, E. Van Meir, T. Kurc, C. Moreno, D. Brat, and J. Saltz. An Integrative Approach for In Silico Glioma Research. IEEE Transactions on Biomedical Engineering, Oct. 2010.

2. R. K. Gabriel and R. R. Sokal. A New Statistical Approach to Geographic Variation Analysis. Systematic Zoology, 18(3):259–278, Sept. 1969.

3. S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA graph algorithms at maximum warp. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, 2011.

4. P. Karas. Efficient Computation of Morphological Greyscale Reconstruction. In MEMICS, volume 16 of OASICS, pages 54–61. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2010.

5. C. Lantuéjoul. Skeletonization in quantitative metallography. In R. Haralick and J.-C. Simon, editors, Issues in Digital Image Processing, volume 34 of NATO ASI Series E, pages 107–135, 1980.

6. L. Luo, M. Wong, and W.-m. Hwu. An effective GPU implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52–55, 2010.

7. A. Mandal, R. Fowler, and A. Porterfield. Modeling memory concurrency for multi-socket multi-core systems. In IEEE International Symposium on Performance Analysis of Systems Software (ISPASS), 2010.

8. F. Meyer. Digital Euclidean Skeletons. In SPIE Vol. 1360, Visual Communications and Image Processing, pages 251–262, 1990.

9. Nvidia. Fermi Compute Architecture Whitepaper.

10. F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer-Verlag New York, Inc., New York, NY, USA, 1985.

11. G. T. Toussaint. The relative neighbourhood graph of a finite planar set. Pattern Recognition, 12(4):261–268, 1980.

12. L. Vincent. Exact Euclidean Distance Function By Chain Propagations. In IEEE Int. Computer Vision and Pattern Recognition Conference, pages 520–525, 1991.

13. L. Vincent. Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms. IEEE Transactions on Image Processing, 2:176–201, 1993.


Optimization and Parallelization of the Matched Masked Bone Elimination Method for CTA

Henk Marquering, Jeroen Engelberts, Paul Groot, Charles Majoie, Ludo Beenen, Antoine van Kampen, and Silvia D. Olabarriaga

Dept. of Biomedical Engineering and Physics, Academic Medical Center, NL
Dept. of Radiology, Academic Medical Center, NL

SARA, NL
Dept. of Clinical Epidemiology, Biostatistics and Bioinformatics, AMC, NL

Abstract. The fast radiologic evaluation of intracranial arteries in CT angiography data sets is hampered by the presence of the dense skull. An algorithm for the removal of the skull based on subtraction of additional CT images was therefore previously developed and deployed as a service in the Academic Medical Center in Amsterdam (matched masked bone elimination service). The method relies on a registration step that requires a long computation time, which made this service unsuitable for acute care patient work-up. We reduced the computation time by parallelizing the algorithm. The total time to generate an enhanced scan dropped from 20 min to below 10 min on 8 cores, enabling the usage of this service for acute care. A proof-of-concept validation on 18 scans indicates that this speed-up has been achieved without quality loss, but further work