

Ilmenau University of Technology

Faculty of Computer Science and Automation

Institute for Computer Engineering

Computer Architecture and Embedded Systems Group

www.tu-ilmenau.de/ra

Dr.-Ing. Alexander Pacholik, Dipl.-Inf. Marcus Müller, Prof. Dr. Wolfgang Fengler,

Dipl.-Ing. Torsten Machleidt, PD Dr. Karl-Heinz Franke

{alexander.pacholik, marcus.mueller, wolfgang.fengler, torsten.machleidt}@tu-ilmenau.de

www.tu-ilmenau.de

[Figure: overall application latency of the two architectural concepts]

FPGA-only solution: up to 140 s for max. image stack; ca. 8 s for FPGA implementation; ca. 0.1 s

FPGA & PC-based solution: up to 140 s for max. image stack; ca. 2 s; ca. 14 s for CPU implementation; ca. 0.5 s for GPU implementation

Conclusions:

- GPU implementation of main processing chain clearly outperforming all other implementations due to massive parallelism

- Low software development cost and availability for PC-based solutions

- Feasibility and adequate performance of purely FPGA-based implementation => Possibility to integrate WLI application in embedded and mobile sensor devices

- Overall application feasibility greatly profiting from data reduction on embedded hardware close to the sensor; especially PC-based solutions rely on limited data volume for transfer and storage

Nano-measuring technology

White-light interferometry (WLI) and processing chain

FPGA vs. GPU implementations

Architectural concepts for WLI processing

GPU VS FPGA: EXAMPLE APPLICATION ON WHITE-LIGHT INTERFEROMETRY

Abstract - High performance image processing applications are a challenging field when targeting embedded processing. Field-programmable gate arrays (FPGA) receive a growing interest as implementation platforms, but these solutions have to compete with the state of the art in image processing, which is co-defined by graphics processing units (GPU). This paper provides a case study which analyzes the potential of embedded or hybrid implementation on FPGA and GPU for image stack processing in a white-light interferometry (WLI) application.

The work presented here is related to the research within the Collaborative Research Centre 622 "Nano-positioning and Nano-measuring Machines", funded by the German Research Council (DFG) under grant SFB 622 / CRC 622.

A goal in the advancing field of micro- and nanotechnology is to measure and manipulate objects on a sub-micrometer scale. Modern high-precision positioning stages are able to position a sample in all three dimensions with an accuracy of less than one nanometer and operational ranges of up to several hundred millimeters. These comparatively vast operational ranges require methods to quickly obtain an overview of the sample surface and to locate an area of interest, to which probes or tools with nanometer-scale resolution can then be applied. White-light interferometry (WLI) is a technology for determining 3D topology information by scanning with conventional optical CCD image sensors (cameras), a method used e.g. for wafer inspection.

Experimental hardware setup:

- FPGA: Xilinx Virtex-5 LX85 w/ 128-bit-wide DDR2-RAM

- CPU: Intel Core 2 Duo E8600 @ 3.33 GHz

- GPU: NVIDIA GeForce GTX 280

Capturing - Camera capturing image stack while being vertically re-positioned in 72 nm steps over sample surface.

Height map - Assembling the height map to display the 3D topology of the sample surface.

Preprocessing - Isolating the interference correlogram sections of stack pixel data for each surface point.
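The poster does not state how the correlogram section is located within the stack. As a minimal sketch of the idea only, the following assumes a hypothetical rule: take a fixed-length window (64 samples, matching the per-point data reduction stated elsewhere on the poster) centred on the sample that deviates most from the mean intensity of that pixel. The function name `isolate_section` and the selection rule are illustrative assumptions, not the authors' FPGA design.

```python
def isolate_section(pixel_stack, length=64):
    """Return (start, section): `length` samples around the strongest modulation.

    Assumption for illustration: the strongest modulation is taken to be the
    sample with the largest absolute deviation from the mean intensity.
    """
    n = len(pixel_stack)
    if n <= length:
        return 0, list(pixel_stack)
    mean = sum(pixel_stack) / n
    peak = max(range(n), key=lambda i: abs(pixel_stack[i] - mean))
    # Centre the window on the peak, clamped to the valid index range.
    start = min(max(peak - length // 2, 0), n - length)
    return start, list(pixel_stack[start:start + length])


# One surface point observed over 1000 images, with a single bright fringe sample.
stack = [100] * 1000
stack[500] = 180
start, section = isolate_section(stack)
```

A streaming implementation close to the sensor would have to make this decision while frames arrive, which this offline sketch does not attempt to model.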

[Figure: image stack along z; correlogram section from image m to image n, plotted over image number]

Demodulation - Computing the envelope function of the correlogram for each surface point.

[Figure: envelope of the correlogram over image number, images m to n]

$$H(z) = \tfrac{1}{2}\sqrt{\Bigl|\,[K(z+1) - K(z+3)]^{2} - [K(z) - K(z+2)]\cdot[K(z+2) - K(z+4)]\,\Bigr|}$$
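The demodulation step can be sketched in a few lines. The sketch assumes the square-root form of the five-sample estimator, H(z) = ½·sqrt(|[K(z+1) − K(z+3)]² − [K(z) − K(z+2)]·[K(z+2) − K(z+4)]|), and checks it on synthetic data with a 90° phase step between samples. The synthetic signal and the function name `envelope` are assumptions made for illustration.

```python
import math


def envelope(k):
    """Five-sample envelope estimate of a correlogram k (sequence of intensities).

    Returns len(k) - 4 values; entry z uses the samples k[z] .. k[z + 4].
    """
    h = []
    for z in range(len(k) - 4):
        a = k[z + 1] - k[z + 3]
        b = (k[z] - k[z + 2]) * (k[z + 2] - k[z + 4])
        h.append(0.5 * math.sqrt(abs(a * a - b)))
    return h


# Synthetic correlogram section: offset 100, constant modulation 20,
# 90 degree phase step per sample (an assumption of this sketch).
samples = [100 + 20 * math.cos(0.3 + z * math.pi / 2) for z in range(64)]
env = envelope(samples)
```

For this idealised signal the estimator recovers the modulation amplitude at every position, independent of the offset and of the starting phase. A 64-sample section yields 60 envelope values because each output consumes five consecutive samples.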

Gaussian Fitting - Approximating the irregular envelope with a Gaussian function using the Gauss-Newton method.

[Figure: Gaussian fit of the envelope over image number (images m to n) with mean value µ; result axis: height [nm]]

The mean value µ represents the height value of the surface point with a higher resolution than the vertical capturing interval.
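A minimal Gauss-Newton sketch of the fitting step follows. It fits the model f(x) = A·exp(−(x − µ)² / (2σ²)) to envelope samples. The moment-based starting values, the fixed iteration count, and the helper names `solve3` and `fit_gaussian` are assumptions of this sketch; the poster gives no implementation details beyond naming the method.

```python
import math


def solve3(m, v):
    """Solve the 3x3 system m * x = v by Gaussian elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x


def fit_gaussian(xs, ys, iterations=20):
    """Gauss-Newton fit of amp * exp(-(x - mu)^2 / (2 sigma^2)) to (xs, ys)."""
    # Starting values from the data: peak value, centroid, second moment.
    total = sum(ys)
    mu = sum(x * y for x, y in zip(xs, ys)) / total
    sigma = math.sqrt(sum(y * (x - mu) ** 2 for x, y in zip(xs, ys)) / total)
    amp = max(ys)
    for _ in range(iterations):
        jtj = [[0.0] * 3 for _ in range(3)]
        jtr = [0.0] * 3
        for x, y in zip(xs, ys):
            d = x - mu
            e = math.exp(-d * d / (2.0 * sigma * sigma))
            # Partial derivatives of the model w.r.t. amp, mu, sigma.
            grad = (e, amp * e * d / sigma**2, amp * e * d * d / sigma**3)
            r = y - amp * e
            for i in range(3):
                jtr[i] += grad[i] * r
                for j in range(3):
                    jtj[i][j] += grad[i] * grad[j]
        # Normal equations (J^T J) delta = J^T r give the update step.
        da, dm, ds = solve3(jtj, jtr)
        amp, mu, sigma = amp + da, mu + dm, sigma + ds
    return amp, mu, sigma


# Noise-free synthetic envelope with its peak between two capture positions.
xs = list(range(60))
ys = [20.0 * math.exp(-((x - 27.3) ** 2) / (2.0 * 6.0**2)) for x in xs]
amp, mu, sigma = fit_gaussian(xs, ys)
height_nm = mu * 72  # image-number units converted with the 72 nm step
```

The fitted µ falls between integer image numbers, which is how the fit yields a height resolution finer than the 72 nm capturing interval. Real envelopes are noisy and irregular, so a production version would need convergence checks and safeguards against a vanishing σ.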

Capturing points every 72 nm

5-bucket method with error correction (demodulation formula H(z))

FPGA implementations - single component performance and scalability limits on the test hardware.

Limits to the scalability of FPGA implementations:

- Bandwidth of off-chip memory interface relative to processing throughput

- Chip area relative to function's resource requirements

Performance comparison - The computational throughput of the respective implementations for all three algorithmic stages of the WLI processing chain.

CPU and GPU implementations are compared to single FPGA cores and to a scaled solution with the maximum parallelism supported by the FPGA platform constraints.

WLI application requirements:

- Image resolution: 1024 x 1024 pixels x 8-bit gray-scale

- Sensor interface throughput: 500 fps (-> CameraLink, etc.)

- Max. stack size ca. 70,000 images -> up to 70 GByte raw image data to be handled

=> Great effort for purely PC-based solutions, especially due to data transfer throughput limitations
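The figures in these requirements follow from simple arithmetic, reproduced here as a short check. The 72 nm step is taken from the capturing description; the variable names are illustrative.

```python
WIDTH, HEIGHT = 1024, 1024   # pixels per image
BYTES_PER_PIXEL = 1          # 8-bit gray-scale
FRAME_RATE = 500             # frames per second
MAX_STACK = 70_000           # images in the largest stack
STEP_NM = 72                 # vertical step between images

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # one image
stack_bytes = frame_bytes * MAX_STACK            # raw data of the largest stack
sensor_rate = frame_bytes * FRAME_RATE           # bytes per second from the camera
capture_time_s = MAX_STACK / FRAME_RATE          # time to capture the largest stack
scan_range_mm = MAX_STACK * STEP_NM / 1e6        # vertical range covered

print(f"{stack_bytes / 2**30:.1f} GiB raw, {sensor_rate / 2**20:.0f} MiB/s, "
      f"{capture_time_s:.0f} s capture, {scan_range_mm:.2f} mm range")
```

One image is 1 MiB, so the largest stack is roughly 70 GByte and the camera delivers about 500 MiB/s. Capturing that stack at 500 fps takes 140 s, which matches the "up to 140 s for max. image stack" figure on the poster.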

Data volume of floor(range_Z / 72 nm) pixel values per surface point.

Data reduction to 64 pixel values per surface point.

Data reduction to 60 function values per surface point.

Data reduction to one height value per surface point.
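The per-point data volumes of the processing chain can be put into numbers with a short calculation. The scan range of 5,040,000 nm is an assumed example chosen so that it corresponds to the maximum stack of 70,000 images. Reading the 60 function values as 64 − 4 is an interpretation that fits a five-sample demodulation window and is not stated on the poster.

```python
POINTS = 1024 * 1024   # surface points (one per pixel)
STEP_NM = 72           # vertical step between images


def samples_per_point(range_nm):
    """Raw pixel values per surface point: floor(range / 72 nm)."""
    return range_nm // STEP_NM


raw = samples_per_point(5_040_000)   # assumed example range for the maximum stack
pre = 64                             # after preprocessing
demod = pre - 4                      # after demodulation with a five-sample window
fit = 1                              # after Gaussian fitting

reduction_factor = raw / pre              # data reduction achieved by preprocessing
preprocessed_bytes = POINTS * pre * 1     # 8-bit values handed to demodulation
```

With these numbers preprocessing shrinks the data by a factor of about 1,000, from roughly 70 GByte to 64 MiB, which is a volume that a PC-based CPU or GPU stage can receive and store without difficulty.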

General considerations:

- Data reduction while image capturing to reduce overall application latency

- Higher integration on embedded hardware for more effective access to camera interface

- Decoupling of variable-effort Capturing & Preprocessing from fixed-effort Demodulation & Fitting

=> Preprocessing realized with an embedded FPGA-based solution preferable!