GPU Compute in Medical and Print Imaging

Amey Deosthali

Director, Embedded Imaging

Medical Imaging Trends

SYSTEM OPTIMIZATION AND MINIATURIZATION

High-end systems of yesterday becoming portables of today

INCREASED USE OF 3D/4D IMAGING

Advances in visualization and increased use of 3D/4D imaging for improved diagnosis

INTEGRATION OF MODALITIES & ADVANCED FEATURES

Endoscopic ultrasound, Augmented reality, Robotic endoscopy

INCREASED SYSTEM COST PRESSURES

Expanding emerging markets, regulatory pressures, increased competition

Print Imaging Trends

Traditional Multi-Function Printer Architecture

GPU Compute based Multi-Function Printer Architecture

SoC with GPU

SCALABLE SOFTWARE

SCALABLE ARCHITECTURE

SYSTEM COST SAVINGS

GPU Compute and AMD APU

GPU Compute in Imaging

Medical and Print Imaging workloads are well suited for GPU compute

HSA architecture can deliver significant benefits in the field of Imaging

AMD APUs integrate GPU with support for Heterogeneous System Architecture (HSA)

GPU COMPUTE IN MEDICAL IMAGING

Typical Ultrasound Imaging Pipeline

Figure (block diagram). Blocks shown: Transducer, Transmitter, Receiver, Beamforming, IQ Demodulation, Envelope Detection, Log Compression, Filters (Edge Enhancement, Speckle Reduction), Frequency/Time Compounding, Frame Averaging, 2D Image Formation, Wall Filter, Velocity Estimation, Color Flow Analysis, Spatial Doppler, Scan Conversion.

Stage groups: Echo Processing, Color Flow Processing.

Legend: GPU Friendly.
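As an illustration of two of the echo-processing blocks named above, here is a minimal sketch of envelope detection followed by log compression on one line of I/Q samples. It is not the implementation from the slides: the 60 dB dynamic range, the 8-bit output, and the function names are assumptions chosen for the example. Each output sample depends only on its own input sample, which is the kind of per-sample independence that maps well to a GPU.

```python
import math

def envelope(i_samples, q_samples):
    """Envelope detection: magnitude of each complex I/Q sample."""
    return [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]

def log_compress(env, dynamic_range_db=60.0):
    """Map envelope values to 0..255 gray levels.

    dynamic_range_db is an assumed display range, not a value from the slides.
    """
    peak = max(env)
    if peak <= 0.0:
        return [0] * len(env)
    out = []
    for e in env:
        # dB relative to the peak; floor tiny values so log10 never sees zero.
        db = 20.0 * math.log10(max(e, 1e-12) / peak)
        # Clip to [-dynamic_range_db, 0] and rescale to 8-bit gray levels.
        db = max(db, -dynamic_range_db)
        out.append(round(255.0 * (db + dynamic_range_db) / dynamic_range_db))
    return out

# Hypothetical I/Q line: strong echo, echo 20 dB down, silence, echo 60 dB down.
i_line = [3.0, 0.3, 0.0, 0.003]
q_line = [4.0, 0.4, 0.0, 0.004]
pixels = log_compress(envelope(i_line, q_line))
print(pixels)
```

The strongest echo maps to white, and anything at or below the assumed dynamic range maps to black.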

FASTER SCANS

Evolution in algorithm complexity with GPU

Reconstruct whole image plane

IMPROVED IMAGE QUALITY

ACCESS TO RAW DATA

Fast data transfer and efficient use of system memory

SIMPLIFIED ARCHITECTURE

Scalable SW defined architecture

GPU Compute for SW Beamforming

Bridge: Convert JESD-204b to PCIe

JESD-204b: 64-256 I/O Channels

Gen 3 PCIe® x16: dGMA support for 10+ GB/s

GPU: coherent compounding

GPU + CPU: post processing

Image Formation

Plane Wave Imaging

• FK Stolts with optimized FFT/iFFT

• IQ Demodulation and Log Compression

Image Post Processing

Separable Filters

• Sobel and Box filters

Non-separable Filter

• Laplacian of Gaussian

De-speckle Filter

• Median filter

Frequency Domain Filter

• Gaussian blur and Edge Enhancement filters
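The slide lists Sobel and box filters as separable. The sketch below shows what that means in practice: a 3x3 Sobel kernel is the outer product of a smoothing vector and a difference vector, so the 2D filter can be applied as two 1D passes with the same result. The helper names and the test image are made up for the example, and this is plain Python rather than the OpenCL code used in the measurements.

```python
def correlate2d(img, kernel):
    """Direct 2D correlation over the valid region (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[y + a][x + b]
                 for a in range(kh) for b in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]

def separable_filter(img, col_taps, row_taps):
    """Two 1D passes; equals correlate2d with the outer product of the taps."""
    h, w = len(img), len(img[0])
    n, m = len(col_taps), len(row_taps)
    # Pass 1: horizontal 1D filter along every row.
    tmp = [[sum(row_taps[b] * img[y][x + b] for b in range(m))
            for x in range(w - m + 1)]
           for y in range(h)]
    # Pass 2: vertical 1D filter down every column of the intermediate result.
    return [[sum(col_taps[a] * tmp[y + a][x] for a in range(n))
             for x in range(len(tmp[0]))]
            for y in range(h - n + 1)]

SMOOTH = [1, 2, 1]   # vertical smoothing taps
DIFF = [-1, 0, 1]    # horizontal difference taps
SOBEL_X = [[s * d for d in DIFF] for s in SMOOTH]

# Small synthetic integer image (5 rows x 6 columns), for illustration only.
image = [[(3 * x + 5 * y * y) % 17 for x in range(6)] for y in range(5)]
full = correlate2d(image, SOBEL_X)
fast = separable_filter(image, SMOOTH, DIFF)
print(full == fast)
```

For a k x k separable kernel the two-pass form needs 2k multiplies per pixel instead of k*k, which is 10 instead of 25 for a 5x5 filter. A non-separable kernel such as the Laplacian of Gaussian cannot be split this way.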

SW Beamforming on AMD APU

OpenCL™ implementation of FK Stolts algorithm

Figure (data flow): Acquisition Device sends RF Data over Direct GMA (> 10 GB/s) to the Software Beamformer running on iGPU or dGPU.

Software Beamformer blocks: 1D FFT, Transpose, X Shift & Transpose, FK Interpolation, 1D IFFT, Z Shift & Transpose.
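The block names in this diagram (1D FFT, Transpose, 1D FFT) suggest that the 2D transform is built from 1D transforms and transposes, so that each pass reads contiguous rows. That reading is an assumption; the slides do not spell it out. The sketch below only checks the underlying identity on a tiny matrix, using a naive DFT in place of an optimized FFT, and does not attempt the FK interpolation step itself.

```python
import cmath

def dft_1d(row):
    """Naive O(n^2) 1D DFT; a real pipeline would call an optimized FFT."""
    n = len(row)
    return [sum(row[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dft_2d_rows_transpose(m):
    """2D DFT as: 1D DFT on rows, transpose, 1D DFT on rows, transpose back."""
    step1 = [dft_1d(r) for r in m]
    step2 = [dft_1d(r) for r in transpose(step1)]
    return transpose(step2)

def dft_2d_direct(m):
    """Reference: the 2D DFT definition evaluated directly."""
    h, w = len(m), len(m[0])
    return [[sum(m[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]

# Hypothetical 3 x 4 block of RF samples.
rf = [[float((3 * y + x * x) % 5) for x in range(4)] for y in range(3)]
a = dft_2d_rows_transpose(rf)
b = dft_2d_direct(rf)
max_err = max(abs(a[u][v] - b[u][v]) for u in range(3) for v in range(4))
print(max_err < 1e-9)
```

The inverse transform decomposes the same way, which would account for the 1D IFFT and transpose blocks on the output side.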

SW Beamformer Performance [1]

Configuration             | APU     | dGPU
256 Channel, 2048 Samples | 1.95 ms | 0.47 ms
128 Channel, 2048 Samples | 1.15 ms | 0.29 ms

Speckle Noise Reduction

Figure (block diagram): IQ Demodulation Output is processed into Speckle Reduction Output (Processed Output). Blocks shown: 5x5 Median Filter, Down-Sample by 2, Up-Sample by 2, Subtract, Multiply with Coefficients, Sobel Diffusion, Gamma Correction, Pixel Correction.

Speckle Noise Reduction Optimization

• Combine multiple functions into a single kernel

• Get more compute per byte of global memory access

• Reduce kernel launch delay overheads

• Reduce use of temporary buffers and buffer copies

• Reduce CPU bottlenecks that require blocking calls by moving operations to GPU

• Optimize pipeline with “in order” enqueue of OpenCL commands

Figure (CPU path vs GPU path): the pipeline (color conversion, edge detection, diffusion, normalization, gamma correction, image enhancement) is grouped into Block A & B, Block C & D, and Block E, each made up of multiple OpenCL kernels. Other blocks shown: Downsample + memcpy, Downsample + optimized memcpy.

CPU Path: 4.10 ms

GPU Path [2]: 1.01 ms

Code Migration and Optimization Process

1. Profile: Identify target workloads to convert

2. Convert: Target workloads from CPU to GPU

3. Block Optimization: Combine multiple CPU calls into a single OpenCL kernel

4. Buffer Optimization: Reduce use of temporary buffers and buffer copies

5. Pipeline Optimization: Move low-workload CPU operations to GPU to reduce blocking calls

6. Reduce kernel launch delay: “In order” enqueue of OpenCL commands

Sobel Filter Optimization

Migrate Sobel filter to GPU with OpenCL

Figure (pipeline, shown before and after migration): 8-bit Grayscale Image (1920x1080) → Median Filter IPP → 8 to 32-bit Float → Sobel & Sobel Magnitude → Max & Min. Legend: CPU Optimized Modules, GPU Optimized Modules (OpenCL Optimized).

Timings shown in the figure: 6.51 ms and 19.47 ms

2X faster computation time with migration of single module to GPU [3]

GPU COMPUTE IN PRINT IMAGING

Print and Scan Image Pipeline

Accelerated RIP Pipeline

Open source Ghostscript PostScript renderer accelerated using GPU [4]

Software Stack: AMD G-Series Reference Board, Ubuntu 14.04 Linux OS, KMD GFX Driver, OCL 2.0 Runtime, OGL 4.3 Runtime, C Libraries, OCL Code, GLSL Libraries

Figure (pipeline): PDF Files on Disk are processed by the Ghostscript App into a Bitmap File on RAMdisk. Blocks shown: PDL Interpreter, Element Decompose, Generate Glyph Bitmaps, Raster, Planarize, Color Conversion, Bitmap.

Legend: CPU Operating in Host Memory, GPU Operating in Device Memory, OpenCL, GL Shader Language (GLSL)
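To make two of the block names above concrete, here is a minimal sketch of a color conversion step followed by a planarize step. It is an illustration only: the naive RGB to CMYK formula, the order of the two steps, and the function names are assumptions, and a production RIP such as Ghostscript would normally use profile-based color management rather than this formula. Both steps work pixel by pixel, which is why they are natural candidates for GPU offload.

```python
def rgb_to_cmyk(r, g, b):
    """Naive 8-bit RGB -> CMYK conversion (no color profile); assumed formula."""
    k = 255 - max(r, g, b)
    if k == 255:
        # Pure black: avoid dividing by zero, put everything on the K channel.
        return (0, 0, 0, 255)
    scale = 255 - k  # equals max(r, g, b)
    c = (scale - r) * 255 // scale
    m = (scale - g) * 255 // scale
    y = (scale - b) * 255 // scale
    return (c, m, y, k)

def planarize(pixels):
    """Interleaved pixels [(c, m, y, k), ...] -> one list per color plane."""
    return [list(plane) for plane in zip(*pixels)]

# Hypothetical scanline: red, white, black, mid gray.
scanline_rgb = [(255, 0, 0), (255, 255, 255), (0, 0, 0), (128, 128, 128)]
scanline_cmyk = [rgb_to_cmyk(*p) for p in scanline_rgb]
c_plane, m_plane, y_plane, k_plane = planarize(scanline_cmyk)
print(k_plane)
```

After planarizing, each colorant sits in its own contiguous buffer, the layout a print engine typically consumes one plane at a time.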

GPU compute can deliver large increase in PPM performance [4]

RIP Pipeline acceleration: PPM performance

Figure (two bar charts): PPM for Test case 2 @ 600 dpi and Test case 2 @ 1200 dpi on GX-412 and GX-424, comparing legacy code (no GPU acceleration) with GPU accelerated code.

PPM: Pages per Minute performance of Ghostscript RIP pipeline

GPU compute can free up CPU for other value added tasks [4]

RIP Pipeline acceleration: CPU Load Reduction

Figure (two charts): Average CPU Load for Test case 2 @ 600 DPI and Test case 2 @ 1200 DPI, comparing legacy code (no GPU acceleration) with GPU accelerated code on GX-424 and GX-412.

CPU Load: Average load across all 4 CPU cores of G-series devices under test

Optical Character Recognition: Tesseract Project

Tesseract: Open source Optical Character Recognition (OCR) engine

Figure (Tesseract flow). Legend: Accelerated using GPU

GPU Compute for OCR

Most of the image preprocessing and character recognition is GPU friendly

The data structures in the word recognition phase are not very GPU friendly

Expected Future Improvements

Deep Neural Network (DNN) for character recognition

Optical Character Recognition: Demo Performance

Processing time measured for above modules with CPU processing and GPU accelerated processing [5]

Configuration         | AMD APU 95W (time in seconds) | AMD APU 35W (time in seconds)
Non OpenCL (CPU only) | 23.65                         | 46.2
OpenCL (GPU Compute)  | 16.79                         | 36.3
Gain                  | 41%                           | 27%
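The "Gain" row is consistent with a throughput gain, that is, CPU time divided by GPU time minus one, rather than a reduction in run time. The short check below reproduces the figures from the measured times; the function names are made up for the example, and the interpretation is inferred from the numbers rather than stated on the slide.

```python
def throughput_gain_percent(cpu_seconds, gpu_seconds):
    """Extra work done per unit time after acceleration, as a percentage."""
    return (cpu_seconds / gpu_seconds - 1.0) * 100.0

def time_reduction_percent(cpu_seconds, gpu_seconds):
    """Share of the original run time that was saved, as a percentage."""
    return (1.0 - gpu_seconds / cpu_seconds) * 100.0

# (CPU only, GPU compute) processing times in seconds, copied from the slide.
results = {"AMD APU 95W": (23.65, 16.79), "AMD APU 35W": (46.2, 36.3)}

for name, (cpu_s, gpu_s) in results.items():
    gain = round(throughput_gain_percent(cpu_s, gpu_s))
    saved = round(time_reduction_percent(cpu_s, gpu_s))
    print(f"{name}: gain {gain}%, run time reduced by {saved}%")
```

Read as run-time savings, the same measurements correspond to roughly 29% and 21% less processing time.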

Core Scan Processing Algorithms

• AMD worked with customer to accelerate partial scan pipeline using OpenCL on AMD APU and GPU

• Scan pipeline includes several image processing algorithms such as grayscale conversion, edge detection, rotation, color conversion, etc.

• GPU compute can deliver significant improvement in processing time compared to CPU based processing [6]

– Translates to faster scan time and higher scan ppm

Iterative algorithm optimization on AMD APU

Algorithm          | CPU Optimized (execution time)* | OpenCL Optimized (execution time) | OpenCL Optimized Fused Code (execution time)
Grayscale          | 13.5 ms                         | 4.6 ms (2.9x)                     | -
Median             | 25.6 ms                         | 3.1 ms (8.3x)                     | -
Grayscale + Median | 39.1 ms                         | 7.9 ms (5.0x)                     | 5.9 ms (6.6x)
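The fused result comes from combining the grayscale and median steps into one kernel so that no intermediate grayscale image is written to and re-read from memory. The sketch below shows the idea in plain Python on a tiny image. It is not the customer code that was measured: the 3x3 window, the integer luma weights, and the function names are assumptions.

```python
def to_gray(pixel):
    """Integer luma approximation (weights sum to 256); an assumed choice."""
    r, g, b = pixel
    return (77 * r + 150 * g + 29 * b) >> 8

def grayscale(img):
    """Unfused step 1: writes a full intermediate grayscale image."""
    return [[to_gray(p) for p in row] for row in img]

def median3x3(gray):
    """Unfused step 2: reads the intermediate image back (valid region only)."""
    h, w = len(gray), len(gray[0])
    return [[sorted(gray[y + a][x + b] for a in range(3) for b in range(3))[4]
             for x in range(w - 2)]
            for y in range(h - 2)]

def fused_gray_median3x3(img):
    """Fused: converts each neighbour to gray on the fly, no gray image stored."""
    h, w = len(img), len(img[0])
    return [[sorted(to_gray(img[y + a][x + b])
                    for a in range(3) for b in range(3))[4]
             for x in range(w - 2)]
            for y in range(h - 2)]

# Small synthetic RGB image (5 rows x 6 columns), for illustration only.
image = [[((x * 7 + y * 13) % 256, (x * x + y) % 256, (x + 3 * y * y) % 256)
          for x in range(6)] for y in range(5)]
unfused = median3x3(grayscale(image))
fused = fused_gray_median3x3(image)
print(fused == unfused)
```

The fused version repeats the grayscale arithmetic for every window position, trading extra compute for fewer memory round trips and one kernel launch instead of two.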

Color Conversion

Partial scan pipeline acceleration

Document Detect and Alignment correction

Quality Improvement

CONCLUSION

The Future is bright with GPU Compute

Improve quality of human care with improved accuracy

Empower new experiences with next generation technology

Enhance performance while reducing system cost

Endnotes

1. Testing by AMD performance labs. Measured performance of OpenCL™ implementation of FK Stolts algorithm on AMD APU and AMD FirePro GPU.

System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 15.200.1045-150622a

2. Testing by AMD performance labs. Measured performance of Speckle Noise Reduction pipeline with and without GPU acceleration, multi-threaded CPU compiler option. Image size: 768 x 252, active ROI was 712 x 252.

System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 16.20-160405a-301215E

3. Testing by AMD performance labs. Measured performance of Sobel Filter with and without GPU acceleration. 8.2 Multi Threaded Library. Image resolution: 1920x1080. Sobel filter size: 5x5

System Configuration: Advantech ComE board with Windows 7 64-bit, AMD RX425BB, 35W, 2.5/3.4 GHz, 1866 MHz DDR3, 4GB RAM, AMD driver version: 14.502.1001.1001, OpenCL 1.2

4. Testing by AMD performance labs. Measured performance of Raster Image Processing with and without GPU acceleration.

System Configuration: AMD GX-424CC: 25W, 2.4 GHz, 1866 MHz DDR3, 8GB RAM, AMD GX-412HC: 7W, 1.2 GHz, 1333 MHz DDR3, 8 GB RAM. Ubuntu 14.04 with AMD Catalyst Driver 14.301.1001

5. Testing by AMD performance labs. Measured performance of Optical Character Recognition using Tesseract open source code with and without GPU acceleration.

System Configuration: AMD APU 95W: AMD A10-7850K APU with Radeon™ HD Graphics, 3.7/4.0 GHz, AMD APU 35W: AMD A10-7400P APU with Radeon™ HD Graphics, 2.7/3.6 GHz. Windows® 8.1, OpenCL™ 1.2, version 1084.4

6. Testing by AMD performance labs. Measured performance of scan pipeline performance using proprietary customer code with and without GPU acceleration.

System Configuration: AMD Olive Hill+ development board, AMD RX427BB: 25W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Windows 8.1, AMD Catalyst 14.29 drivers and OpenCL™ 1.2

7. Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.

System Configuration: AMD Olive Hill+ development board with AMD RX427BB: 35W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Ubuntu 14.04 and AMD Catalyst driver 14.29

8. Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.

System Configuration: 2015 MacBook Pro with Intel Core i7-4980HQ 2.8 GHz, 16 GB DDR3L RAM. AMD Radeon™ R9 M370X Graphics, 2GB GDDR5, Mac OS X 10.10.3. AMD Catalyst 14.29

Disclaimer

The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

AMD does not provide a license/sublicense to any intellectual property rights relating to any standards, including but not limited to any audio and/or video codec technologies such as AVC/H.264/MPEG-4, AVC, VC-1, MPEG-2, and DivX/xVid.

AMD, the AMD Arrow logo, AMD Catalyst, AMD CrossFire, AMD CrossFireX, AMD Radeon, ATI Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Windows and DirectX are registered trademarks of Microsoft Corporation. ARM is a registered trademark of ARM Limited. 3DMark is a trademark of Futuremark Corporation. DivX is a registered trademark of DivX, Inc. HDMI is a trademark of HDMI Licensing, LLC. Linux is a registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission of Khronos. PCIe and PCI Express are registered trademarks of PCI-SIG Corporation.

THANK YOU
