gpu compute in medical and print imaging
Post on 07-Jan-2017
GPU Compute in Medical and Print Imaging
Amey Deosthali, Director, Embedded Imaging
Medical Imaging Trends
SYSTEM OPTIMIZATION AND MINIATURIZATION
High-end systems of yesterday becoming portables of today
INCREASED USE OF 3D/4D IMAGING
Advances in visualization and increased use of 3D/4D imaging for improved diagnosis
INTEGRATION OF MODALITIES & ADVANCED FEATURES
Endoscopic ultrasound, augmented reality, robotic endoscopy
INCREASED SYSTEM COST PRESSURES
Expanding emerging markets, regulatory pressures, increased competition
Print Imaging Trends
Traditional Multi-Function Printer Architecture
GPU Compute based Multi-Function Printer Architecture
SoC with GPU
SCALABLE SOFTWARE, SCALABLE ARCHITECTURE, SYSTEM COST SAVINGS
GPU Compute and AMD APU
GPU Compute in Imaging
Medical and Print Imaging workloads are well suited for GPU compute
AMD APUs integrate a GPU with support for Heterogeneous System Architecture (HSA)
HSA can deliver significant benefits in the field of imaging
GPU COMPUTE IN MEDICAL IMAGING
Typical Ultrasound Imaging Pipeline
[Figure: block diagram of the pipeline. Blocks: Transducer, Transmitter, Receiver, Beamforming, IQ Demodulation, Envelope Detection, Log Compression, Filters (Edge enhancement, Speckle Reduction), Frequency/Time Compounding, Frame Averaging, 2D Image formation, Wall Filter, Velocity Estimation, Color flow analysis, Spatial Doppler, Scan Conversion. Stage groups: Echo Processing, Color Flow Processing. Legend: GPU Friendly.]
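Envelope Detection and Log Compression, two of the blocks named in the pipeline, are per-sample operations, which is one reason such stages tend to suit GPU compute. The following is a minimal Python sketch, assuming baseband I/Q samples as input. The function names, the 60 dB dynamic range, and the 8-bit output mapping are illustrative assumptions and do not come from the slides.

```python
import math


def envelope(i_samples, q_samples):
    """Envelope detection: magnitude of each complex I/Q sample."""
    return [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]


def log_compress(env, dynamic_range_db=60.0):
    """Map envelope values to 0..255 on a decibel scale relative to the peak.

    dynamic_range_db is an assumed display range, not a value from the slides.
    """
    peak = max(env, default=0.0)
    if peak <= 0.0:
        return [0] * len(env)
    out = []
    for e in env:
        # Zero samples have no finite dB value, so they start at the floor.
        db = 20.0 * math.log10(e / peak) if e > 0.0 else -dynamic_range_db
        # Anything below the floor is clamped to the floor.
        db = max(db, -dynamic_range_db)
        out.append(round(255.0 * (db + dynamic_range_db) / dynamic_range_db))
    return out


if __name__ == "__main__":
    env = envelope([3.0, 0.0], [4.0, 1.0])
    print(env)
    print(log_compress(env))
```

Each output sample depends only on the matching input sample (plus one global peak), so the loop body could become one work-item per sample in a GPU kernel.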
FASTER SCANS
Evolution in algorithm complexity with GPU
Reconstruct whole image plane
IMPROVED IMAGE QUALITY
ACCESS TO RAW DATA
Fast data transfer and efficient use of system memory
SIMPLIFIED ARCHITECTURE
Scalable SW defined architecture
GPU Compute for SW Beamforming
Bridge: Convert JESD-204b to PCIe
JESD-204b: 64-256 I/O Channels
Image Formation
Plane Wave Imaging
• FK Stolts with optimized FFT/iFFT
• IQ Demodulation and Log Compression
Image Post Processing
Separable Filters
• Sobel and Box filters
Non-separable Filter
• Laplacian of Gaussian
De-speckle Filter
• Median filter
Frequency Domain Filter
• Gaussian blur and Edge Enhancement filters
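A separable filter, such as the box filter listed above, can be applied as a horizontal 1D pass followed by a vertical 1D pass instead of one 2D pass. The sketch below is a plain Python illustration of that idea, not the OpenCL code behind the slides. The function names and the clamp-to-edge border handling are assumptions made for the example.

```python
def box_blur_1d(row, radius):
    """Average each sample with its neighbours; borders are clamped."""
    n = len(row)
    width = 2 * radius + 1
    out = []
    for x in range(n):
        total = 0.0
        for k in range(-radius, radius + 1):
            xi = min(max(x + k, 0), n - 1)  # clamp-to-edge (assumed policy)
            total += row[xi]
        out.append(total / width)
    return out


def transpose(img):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*img)]


def box_blur_separable(img, radius):
    """2D box blur as two 1D passes: rows first, then columns."""
    rows_done = [box_blur_1d(r, radius) for r in img]
    cols_done = [box_blur_1d(c, radius) for c in transpose(rows_done)]
    return transpose(cols_done)


def box_blur_direct(img, radius):
    """Reference 2D box blur that visits the full square window."""
    h, w = len(img), len(img[0])
    size = (2 * radius + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), h - 1)
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / size
    return out
```

For a window of width k, the two-pass form reads 2k samples per pixel instead of k squared, which is why separable filters are usually treated separately from non-separable ones such as the Laplacian of Gaussian.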
Gen 3 PCIe® x16: dGMA support for 10+ GB/s
GPU: coherent compounding
GPU + CPU: post processing
SW Beamforming on AMD APU
OpenCL™ implementation of FK Stolts algorithm
[Figure: RF Data flows from the Acquisition Device over Direct GMA (> 10 GB/s) to the Software Beamformer running on an iGPU or dGPU. Beamformer blocks shown: 1D FFT (two instances), Transpose (two instances), X Shift & Transpose, FK interpolation, Z Shift & Transpose, 1D IFFT (two instances).]
SW Beamformer Performance [1]
                             APU        dGPU
256 Channel, 2048 Samples    1.95 ms    0.47 ms
128 Channel, 2048 Samples    1.15 ms    0.29 ms
Speckle Noise Reduction
[Figure: speckle noise reduction pipeline from IQ Demodulation Output to Speckle Reduction Output (Processed Output). Blocks shown: 5x5 Median Filter, Down-Sample by 2 (three instances), Up-Sample by 2 (two instances), Subtract (two instances), Multiply With Coefficients, Sobel Diffusion, Gamma Correction (three instances), Pixel Correction.]
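The 5x5 median filter at the start of the speckle reduction pipeline replaces each pixel with the median of its 5x5 neighbourhood, which removes isolated outliers while keeping edges. A minimal Python sketch follows. The function name and the clamp-to-edge border handling are assumptions for illustration, since the slides do not say how borders are treated.

```python
def median_filter(img, radius=2):
    """Median filter over a (2*radius+1) square window; radius=2 gives 5x5.

    Borders are handled by clamping coordinates to the image edge
    (an assumed policy, not one stated in the slides).
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = []
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), h - 1)
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), w - 1)
                    window.append(img[yy][xx])
            window.sort()
            # The window has an odd number of samples, so the middle
            # element is the median.
            out[y][x] = window[len(window) // 2]
    return out


if __name__ == "__main__":
    # Flat image with a single bright speckle in the middle.
    image = [[10] * 5 for _ in range(5)]
    image[2][2] = 255
    for row in median_filter(image):
        print(row)
```

Every output pixel is computed independently from a small neighbourhood, which is the access pattern that makes this filter a natural candidate for one GPU work-item per pixel.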
Speckle Noise Reduction Optimization
• Combine multiple functions into a single kernel
• Get more compute per byte of global memory access
• Reduce kernel launch delay overheads
• Reduce use of temporary buffers and buffer copies
• Reduce CPU bottlenecks that require blocking calls by moving operations to GPU
• Optimize pipeline with “in order” enqueue of OpenCL commands
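The first, second, and fourth bullets describe kernel fusion: several per-pixel steps are merged so that each input value is read once and no intermediate buffer is written. The Python sketch below only illustrates the data-flow difference and is not the OpenCL code from the talk. The three steps (subtract, scale, gamma) are borrowed from the pipeline's block names, while the function names and parameter values are hypothetical.

```python
def subtract(a, b):
    """Step 1 as its own pass: writes a full temporary buffer."""
    return [x - y for x, y in zip(a, b)]


def scale(a, k):
    """Step 2 as its own pass: reads one buffer, writes another."""
    return [x * k for x in a]


def gamma(a, g):
    """Step 3 as its own pass: clamp negatives, then apply the power law."""
    return [max(x, 0.0) ** g for x in a]


def unfused(a, b, k, g):
    """Three passes and two temporary buffers (t1, t2)."""
    t1 = subtract(a, b)
    t2 = scale(t1, k)
    return gamma(t2, g)


def fused(a, b, k, g):
    """One pass: each input pair is read once and no temporaries are stored."""
    return [max((x - y) * k, 0.0) ** g for x, y in zip(a, b)]


if __name__ == "__main__":
    a = [0.9, 0.5, 0.2]
    b = [0.1, 0.5, 0.4]
    print(unfused(a, b, 2.0, 0.5))
    print(fused(a, b, 2.0, 0.5))
```

Both functions perform the same arithmetic in the same order, so they return identical values. On a GPU the fused form also replaces three kernel launches with one, which relates to the launch-overhead bullet.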
[Figure: pipeline Blocks A to E regrouped for the GPU as Block A & B, Block C & D, and Block E, each implemented as multiple OpenCL kernels. Other labels: Downsample + memcpy; Downsample + Optimized memcpy. Operations: color conversion, edge detection, diffusion, normalization, gamma correction, image enhancement.]
CPU Path: 4.10 ms
GPU Path [2]: 1.01 ms
Code Migration and Optimization Process
1. Profile: identify target workloads to convert
2. Convert: move target workloads from CPU to GPU
3. Block Optimization: combine multiple CPU calls into a single OpenCL kernel
4. Buffer Optimization: reduce use of temporary buffers and buffer copies
5. Pipeline Optimization: move low-workload CPU operations to the GPU to reduce blocking calls
6. Reduce kernel launch delay: “in order” enqueue of OpenCL commands
Sobel Filter Optimization
Migrate Sobel filter to GPU with OpenCL
[Figure: two versions of the same pipeline, labeled A and B. Stages in both: 8-bit Grayscale Image (1920x1080), Median Filter IPP, 8 to 32-bit Float, Sobel & Sobel Magnitude, Max & Min. Legend: CPU Optimized Modules, GPU Optimized Modules (OpenCL Optimized). Timings shown: 6.51 ms and 19.47 ms.]
2X faster computation time with migration of single module to GPU [3]
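The "8 to 32-bit Float", "Sobel & Sobel Magnitude", and "Max & Min" stages can be sketched in a few lines of Python. This is an illustration under stated assumptions: it uses the common 3x3 Sobel kernels for brevity (the endnote for this measurement reports a 5x5 filter), clamps coordinates at the borders, and uses hypothetical function names.

```python
import math

# Standard 3x3 Sobel kernels (assumed here; the measured filter was 5x5).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the gradient magnitude of an 8-bit grayscale image as floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = 0.0
            gy = 0.0
            for j in range(3):
                yy = min(max(y + j - 1, 0), h - 1)
                for i in range(3):
                    xx = min(max(x + i - 1, 0), w - 1)
                    p = float(img[yy][xx])  # the 8-bit to float conversion
                    gx += SOBEL_X[j][i] * p
                    gy += SOBEL_Y[j][i] * p
            out[y][x] = math.hypot(gx, gy)
    return out


def min_max(mag):
    """Min and max of the magnitude image, e.g. for later normalization."""
    flat = [v for row in mag for v in row]
    return min(flat), max(flat)


if __name__ == "__main__":
    # Vertical edge: dark left half, bright right half.
    image = [[0, 0, 255, 255] for _ in range(4)]
    print(min_max(sobel_magnitude(image)))
```

The per-pixel convolution is independent across pixels, while the min and max step is a reduction over the whole image, so the two stages parallelize in different ways on a GPU.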
GPU COMPUTE IN PRINT IMAGING
Print and Scan Image Pipeline
Accelerated RIP Pipeline
Open source Ghostscript PostScript renderer accelerated using GPU [4]
Software Stack: AMD G-Series Reference Board, Ubuntu 14.04 Linux OS, KMD GFX Driver, OCL 2.0 Runtime, OGL 4.3 Runtime, C Libraries, OCL Code, GLSL Libraries
[Figure: RIP data flow. Input: PDF Files on Disk. Ghostscript App stages: PDL Interpreter, Element Decompose, Generate Glyph Bitmaps, Bitmap. GPU stages: Raster, Color Conversion, Planarize, implemented with OpenCL and GL Shader Language (GLSL). Transfers between host and device memory use DMA. Output: Bitmap File on RAMdisk. Legend: CPU Operating in Host Memory, GPU Operating in Device Memory.]
RIP Pipeline acceleration: PPM performance
GPU compute can deliver a large increase in PPM performance [4]
PPM: Pages per Minute performance of Ghostscript RIP pipeline

PPM - Test case 2 @ 600 dpi
                             GX-412    GX-424
Legacy code (no GPU accl)    101.8     164
GPU accelerated code         244.3     370
Speedup                      2.4x      2.3x

PPM - Test case 2 @ 1200 dpi
                             GX-412    GX-424
Legacy code (no GPU accl)    27.6      44
GPU accelerated code         76.6      111
Speedup                      2.8x      2.5x
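The speedup labels on the two charts can be checked from the PPM values. The short Python sketch below reads the four values per chart as legacy and GPU-accelerated PPM for the GX-412 and GX-424. That pairing is an interpretation of the chart data, and the variable and function names are made up for the example.

```python
# PPM values as read from the two bar charts (test case 2).
# Each entry is (legacy PPM, GPU-accelerated PPM).
PPM = {
    "600 dpi": {"GX-412": (101.8, 244.3), "GX-424": (164.0, 370.0)},
    "1200 dpi": {"GX-412": (27.6, 76.6), "GX-424": (44.0, 111.0)},
}


def speedup(legacy_ppm, gpu_ppm):
    """Throughput ratio of the GPU-accelerated path over the legacy path."""
    return gpu_ppm / legacy_ppm


def speedup_table(ppm):
    """Speedup per resolution and device, rounded to one decimal place."""
    return {
        resolution: {
            device: round(speedup(legacy, gpu), 1)
            for device, (legacy, gpu) in devices.items()
        }
        for resolution, devices in ppm.items()
    }


if __name__ == "__main__":
    for resolution, devices in speedup_table(PPM).items():
        for device, factor in devices.items():
            print(f"{resolution} {device}: {factor}x")
```

Under that reading the ratios round to 2.4x and 2.3x at 600 dpi and to 2.8x and 2.5x at 1200 dpi, matching the four labels on the charts.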
RIP Pipeline acceleration: CPU Load Reduction
GPU compute can free up CPU for other value added tasks [4]
CPU Load: Average load across all 4 CPU cores of G-series devices under test
[Figure: Average CPU Load - Test case 2 @ 600 DPI. X axis: PPM (30 to 150). Y axis: % CPU Load (Avg), 0 to 60. Series: Legacy code (no GPU accl) on GX-424 and GX-412; GPU accelerated code on GX-424 and GX-412.]
[Figure: Average CPU Load - Test case 2 @ 1200 DPI. X axis: PPM (5 to 40). Y axis: % CPU Load (Avg), 0 to 80. Series: Legacy code (no GPU accl) on GX-424 and GX-412; GPU accelerated code on GX-424 and GX-412.]
Optical Character Recognition: Tesseract Project
Tesseract: open source Optical Character Recognition (OCR) engine, accelerated using GPU
[Figure: Tesseract flow]
GPU Compute for OCR
• Most of the image preprocessing and character recognition is GPU friendly
• The data structures in the word recognition phase are not very GPU friendly
Expected Future Improvements
• Deep Neural Network (DNN) for character recognition
Optical Character Recognition: Demo Performance
Processing time measured for above modules with CPU processing and GPU accelerated processing [5]
                          AMD APU 95W          AMD APU 35W
                          (Time in seconds)    (Time in seconds)
Non OpenCL (CPU only)     23.65                46.2
OpenCL (GPU Compute)      16.79                36.3
Gain                      41%                  27%
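The Gain row is consistent with reading gain as the increase in throughput, that is, CPU time divided by GPU time, minus one. The Python sketch below works that out next to the alternative "time saved" reading. Treating gain this way is an inference from the numbers rather than a definition given on the slide, and the function names are hypothetical.

```python
def throughput_gain_pct(cpu_seconds, gpu_seconds):
    """Percent more work per unit time when using the faster path."""
    return (cpu_seconds / gpu_seconds - 1.0) * 100.0


def time_reduction_pct(cpu_seconds, gpu_seconds):
    """Percent of the original processing time that is saved."""
    return (1.0 - gpu_seconds / cpu_seconds) * 100.0


if __name__ == "__main__":
    results = {"AMD APU 95W": (23.65, 16.79), "AMD APU 35W": (46.2, 36.3)}
    for name, (cpu, gpu) in results.items():
        gain = round(throughput_gain_pct(cpu, gpu))
        saved = round(time_reduction_pct(cpu, gpu))
        print(f"{name}: gain {gain}%, time saved {saved}%")
```

With the table's values the throughput reading gives 41% and 27%, matching the Gain row, while the time-saved reading gives 29% and 21%.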
Core Scan Processing Algorithms
• AMD worked with a customer to accelerate a partial scan pipeline using OpenCL on AMD APU and GPU
• The scan pipeline includes several image processing algorithms such as grayscale conversion, edge detection, rotation, and color conversion
• GPU compute can deliver significant improvement in processing time compared to CPU based processing [6]
– Translates to faster scan time and higher scan ppm
Iterative algorithm optimization on AMD APU
                      CPU Optimized        OpenCL Optimized     OpenCL Optimized Fused Code
                      (Execution Time)*    (Execution Time)     (Execution Time)
Grayscale             13.5 ms              4.6 ms (2.9x)
Median                25.6 ms              3.1 ms (8.3x)
Grayscale + Median    39.1 ms              7.9 ms (5.0x)        5.9 ms (6.6x)
Partial scan pipeline acceleration [7] [8]
Stages: Color Conversion, Document Detect and Alignment correction, Quality Improvement
CONCLUSION
The Future is bright with GPU Compute
Improve quality of human care with improved accuracy
Empower new experiences with next generation technology
Enhance performance while reducing system cost
Endnotes
[1] Testing by AMD performance labs. Measured performance of OpenCL™ implementation of FK Stolts algorithm on AMD APU and AMD FirePro GPU.
System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro ™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 15.200.1045-150622a
[2] Testing by AMD performance labs. Measured performance of Speckle Noise Reduction pipeline with and without GPU acceleration, multi-threaded CPU compiler option. Image size: 768 x 252, active ROI was 712 x 252.
System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro ™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 16.20-160405a-301215E
[3] Testing by AMD performance labs. Measured performance of Sobel Filter with and without GPU acceleration. 8.2 Multi Threaded Library. Image resolution: 1920x1080. Sobel filter size: 5x5
System Configuration: Advantech ComE board with Windows 7 64-bit, AMD RX425BB, 35W, 2.5/3.4 GHz, 1866 MHz DDR3, 4GB RAM, AMD driver version: 14.502.1001.1001, OpenCL 1.2
[4] Testing by AMD performance labs. Measured performance of Raster Image Processing with and without GPU acceleration.
System Configuration: AMD GX-424CC: 25W, 2.4 GHz, 1866 MHz DDR3, 8GB RAM, AMD GX-412HC: 7W, 1.2 GHz, 1333 MHz DDR3, 8 GB RAM. Ubuntu 14.04 with AMD Catalyst Driver 14.301.1001
[5] Testing by AMD performance labs. Measured performance of Optical Character Recognition using Tesseract open source code with and without GPU acceleration.
System Configuration: AMD APU 95W: AMD A10-7850K APU with Radeon™ HD Graphics, 3.7/4.0 GHz, AMD APU 35W: AMD A10-7400P APU with Radeon™ HD Graphics, 2.7/3.6 GHz. Windows® 8.1, OpenCL™ 1.2, version 1084.4
[6] Testing by AMD performance labs. Measured performance of scan pipeline using proprietary customer code with and without GPU acceleration.
System Configuration: AMD Olive Hill+ development board, AMD RX427BB: 25W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Windows 8.1, AMD Catalyst 14.29 drivers and OpenCL™ 1.2
[7] Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.
System Configuration: AMD Olive Hill+ development board with AMD RX427BB: 35W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM Ubuntu 14.04 and AMD Catalyst driver 14.29
[8] Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.
System Configuration: 2015 MacBook Pro with Intel Core i7-4980HQ 2.8 GHz, 16 GB DDR3L RAM. AMD Radeon™ R9 M370X Graphics, 2GB GDDR5, Mac OS X 10.10.3. AMD Catalyst 14.29
Disclaimer
The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD's products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.
AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.
AMD does not provide a license/sublicense to any intellectual property rights relating to any standards, including but not limited to any audio and/or video codec technologies such as AVC/H.264/MPEG-4, AVC, VC-1, MPEG-2, and DivX/xVid.
AMD, the AMD Arrow logo, AMD Catalyst, AMD CrossFire, AMD CrossFireX, AMD Radeon, ATI Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Windows and DirectX are registered trademarks of Microsoft Corporation. ARM is a registered trademark of ARM Limited. 3DMark is a trademark of Futuremark Corporation. DivX is a registered trademark of DivX, Inc. HDMI is a trademark of HDMI Licensing, LLC. Linux is a registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission of Khronos. PCIe and PCI Express are registered trademarks of PCI-SIG Corporation.
© 2016 Advanced Micro Devices, Inc. All rights reserved.
THANK YOU