University of Tennessee, Knoxville
TRACE: Tennessee Research and Creative Exchange
Masters Theses Graduate School
8-2012
Real-Time Mobile Stereo Vision
Bryan Hale Bodkin [email protected]
Follow this and additional works at: https://trace.tennessee.edu/utk_gradthes
Part of the Artificial Intelligence and Robotics Commons, Robotics Commons, and the Software
Engineering Commons
Recommended Citation
Bodkin, Bryan Hale, "Real-Time Mobile Stereo Vision." Master's Thesis, University of Tennessee, 2012. https://trace.tennessee.edu/utk_gradthes/1313
This Thesis is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected].
To the Graduate Council:
I am submitting herewith a thesis written by Bryan Hale Bodkin entitled "Real-Time Mobile
Stereo Vision." I have examined the final electronic copy of this thesis for form and content and
recommend that it be accepted in partial fulfillment of the requirements for the degree of
Master of Science, with a major in Computer Engineering.
Hairong Qi, Major Professor
We have read this thesis and recommend its acceptance:
Qing Cao, Jens Gregor
Accepted for the Council:
Carolyn R. Hodges
Vice Provost and Dean of the Graduate School
(Original signatures are on file with official student records.)
Real-Time Mobile Stereo Vision
A Thesis Presented for
The Master of Science
Degree
The University of Tennessee, Knoxville
Bryan Hale Bodkin
August 2012
Acknowledgments
I would like to thank Dr. Qi for giving me the time and resources to pursue my
research interests. I would also like to thank Dr. Cao and Dr. Gregor for serving on
my committee.
Abstract
Computer stereo vision is used to extract depth information from two aligned cameras
and there are a number of hardware and software solutions to solve the stereo
correspondence problem. However few solutions are available for inexpensive mobile
platforms where power and hardware are major limitations. This thesis proposes a
method that competes with an existing OpenCV stereo correspondence method in
speed and quality, and is able to run on generic multi core CPUs.
Contents
List of Tables viii
List of Figures ix
1 Introduction 1
1.1 Introduction to Stereo Vision . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Reference and Target . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Camera Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Stereo Correspondence Methods 5
2.1 Pixel based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Area-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Feature-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Cost Matching Functions . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Common Cost Matching Functions . . . . . . . . . . . . . . . . . . . 7
2.6 Correspondence Functions . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Block Matching 10
3.1 Core Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 OpenCV's Block Matching Function . . . . . . . . . . . . . . . . . . 11
3.3 Naive Stereo Matching Function . . . . . . . . . . . . . . . . . . . . . 11
3.4 Test Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5 Test Stereo Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Naive Stereo Correspondence and OpenCV BM Comparison . . . . . 12
3.7 Sliding Window Technique . . . . . . . . . . . . . . . . . . . . . . . . 13
3.7.1 Intermediate Data Structures for the Sliding Window Technique 14
3.7.2 The Four Main Processes . . . . . . . . . . . . . . . . . . . . . 14
3.7.3 Sliding Window Performance . . . . . . . . . . . . . . . . . . 15
3.8 Improvements with SIMD Instructions . . . . . . . . . . . . . . . . . 15
3.8.1 Intel SSE overview . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8.2 SSE with C++ . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8.3 Changes in Intermediate Data Structures for SSE . . . . . . . 19
3.8.4 SSE Intrinsics Overview . . . . . . . . . . . . . . . . . . . . . 19
3.8.5 SSE Row Difference . . . . . . . . . . . . . . . . . . . . . . . . 19
3.8.6 SSE Row Summing Window . . . . . . . . . . . . . . . . . . . 22
3.8.7 SSE Column Summing Window . . . . . . . . . . . . . . . . . 22
3.8.8 SSE Winner Take All . . . . . . . . . . . . . . . . . . . . . . . 22
3.8.9 SSE with Sliding Window Performance . . . . . . . . . . . . . 24
3.9 Improving performance with Parallelism . . . . . . . . . . . . . . . . 25
3.9.1 Dividing the Data . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.9.2 Parallelism Performance . . . . . . . . . . . . . . . . . . . . . 26
4 Applying Stereo Method to Real Cameras 28
4.1 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.2 Calibration Procedure . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Real-time Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Optimized Nearest-Neighbor . . . . . . . . . . . . . . . . . . . 31
4.2.2 Optimized Nearest-Neighbor Performance . . . . . . . . . . . 33
4.3 Complete System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Conclusions and Future Work 37
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Bibliography 39
A Listings of Code 43
A.1 Chapter 3 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A.2 Chapter 4 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Vita 53
List of Tables
3.1 Naive Stereo Performance . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Windowing Stereo Performance . . . . . . . . . . . . . . . . . . . . . 15
3.3 SSE Intrinsic Reference . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 SIMD Stereo Performance . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Parallel Stereo Performance . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Tsai’s Model Estimation Parameters . . . . . . . . . . . . . . . . . . 28
4.2 24-bit Color Remapping . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 8-bit Grey Remapping . . . . . . . . . . . . . . . . . . . . . . . . . . 35
List of Figures
1.1 Epipolar Line Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.1 Tsukuba Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Naive Correspondence Comparison . . . . . . . . . . . . . . . . . . . 13
3.3 Intermediate Data Structures . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Intermediate Data Structures . . . . . . . . . . . . . . . . . . . . . . 16
3.5 SSE Register Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 SSE Array Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.7 Intermediate Data Structures with Over-fitting . . . . . . . . . . . . . 20
3.8 SSE Row Absolute Difference . . . . . . . . . . . . . . . . . . . . . . 20
3.9 SSE Row Summing Window Process . . . . . . . . . . . . . . . . . . 23
3.10 SSE Column Summing Process . . . . . . . . . . . . . . . . . . . . . 23
3.11 SSE Winner Take All Process . . . . . . . . . . . . . . . . . . . . . . 24
3.12 Parallel Data Division and Recombination . . . . . . . . . . . . . . . 26
3.13 Multi-Threading Performance . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Cameras and Mounting Rig . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Calibration Images with cvDrawChessboardCorners . . . . . . . . . 32
4.3 Rectified Stereo Image Pair . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Nearest Neighbor Rectification . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Complete Stereo Imaging Process . . . . . . . . . . . . . . . . . . . . 36
Chapter 1
Introduction
In computer stereo vision, two cameras are placed horizontally offset from one another, and the differences between their views are used to map the depth of the environment. This method of depth mapping stands in contrast to Light Detection and Ranging (LiDAR). Stereo vision is cheaper and completely passive, but computationally expensive, while LiDAR is relatively expensive and requires active laser projection, but requires much less processing.
This thesis will focus on real-time, general-purpose CPU computation of computer stereo vision for a swarm of mobile robots using generic USB webcams. A number of hardware choices are well suited to the massive parallel computing potential of stereo vision, but cost and power requirements limit the choices. Application-Specific Integrated Circuits (ASICs) would offer the best combination of low power and speed, but the astronomical setup cost of manufacturing and the unconfigurable nature of ASICs make them a poor choice.
Consumer-level Graphics Processing Units (GPUs) are rich in parallel processing potential and very configurable. GPUs can be purchased for as little as $80, but still require a host CPU to run. The largest problem with GPU computing is the power and cooling requirements, often demanding fifty (50) to one-hundred (100) Watts and an air-conditioned environment. This much power is not available to mobile robots, which often work outdoors.
Field-Programmable Gate Arrays (FPGAs) are very well suited to mobile computer stereo vision. FPGAs require little power, are inexpensive, and are very configurable. The main caveat of FPGAs is the difficulty of programming them. They require intimate knowledge of Verilog or VHDL, and a slight adjustment in how the stereo images are processed may require a large change in code. Easier-to-use graphical FPGA programming suites exist, for example National Instruments LabVIEW, but they may be cost prohibitive.
The best processing hardware for this application of mobile stereo vision is a generic consumer-level, low-power motherboard/CPU combination, such as the Intel Atom integrated systems or ARM processors. These low-power processors feature different implementations of Single Instruction Multiple Data (SIMD) operations and multi-core support.
1.1 Introduction to Stereo Vision
Stereo vision is an important branch of research in digital image processing and computer vision. One of its most important techniques is binocular stereo vision. Binocular stereo vision, based on processing two (2) images, is an active research topic because it closely reflects the physiological function of the human eyes when observing one scene from two (2) different perspectives. Using the principle of triangulation, three-dimensional (3D) points are mapped to pixel disparities as the two images are processed, allowing the visual depth information to be recovered. In this manner, the object surface shape can be acquired from the differences between the points.
1.2 Reference and Target
From the two (2) images, one is declared the reference image and the other is referred to as the target image. For every single pixel in the reference image, the target image must be searched to find its corresponding location. The distance in pixels between the final corresponding locations is the disparity. When all pixels in the reference image have been corresponded to a pixel location in the target image, a complete dense disparity map is found.

Either image of the stereo pair can serve as the reference. Most often the left stereo image is the reference and the right is the target image. Some stereo matching techniques may use the left and right stereo images as both the reference and target images to get better accuracy and reduce occlusion errors. These methods may take it a step further and generate a depth map from a point of view between the stereo camera perspectives.
The entire area of the target image does not need to be searched for a single point on the reference image. Consider both stereo images in Figure 1.1 as projections [Ml, Mr] of 3D space onto 2D planes from two (2) independent camera origins [Cl, Cr]. Let the left image [Ml] be the reference image and the right image [Mr] be the target image. When a point on the reference image [Ml] is selected to be solved for its disparity, a line is formed from the camera origin [Cl] through that point. The projection of this line onto the target image [Mr] forms the epipolar line constraint. So, for a point selected on the reference image, the search space on the target image is only along the corresponding epipolar line.
1.3 Camera Alignment
The alignment of the two (2) stereo cameras is paramount for proper and efficient stereo depth perception. When the two (2) stereo cameras are not aligned parallel to each other, the epipolar lines skew across the image. It may be acceptable in some stereo systems to process images with skewed epipolar lines; however, for real-time systems it is much more computationally efficient to have the epipolar lines parallel to one another and row-aligned with the image scan lines. This way, when scanning the line of pixels that makes up the epipolar line, the next pixel in the line is the next place in memory, instead of having to refer to a lookup table.
Figure 1.1: Epipolar Line Structure
Chapter 2
Stereo Correspondence Methods
Stereo correspondence is the cornerstone of stereo vision. It is the process of matching areas and/or specific locations between the two stereo camera views. Many methods exist to solve this problem, often trading speed for accuracy or vice versa.
2.1 Pixel based Methods
Pixel-based methods compare the brightness intensity values of the two (2) images pixel by pixel along the epipolar lines and choose the corresponding disparity that best fits the cost function. Pixel-based methods by themselves are sensitive to noise, but may show finer detail around the corners of objects in the stereo image pair. Two common techniques are used to reduce the noise sensitivity and increase the accuracy of pixel-based methods. First, pre-filters may be used on the stereo image pair to reduce sources of noise, sacrificing some fine detail Hirschmuller (2007). Second, global-optimization algorithms can be used to find the best fit for all pixels across an epipolar line Scharstein (2001). The global-optimization methods are the most accurate but require the most processing time.
2.2 Area-based Methods
Area-based or block matching methods aggregate brightness intensity values over a local support window in each of the two (2) comparable images. The local support window is convolved across the epipolar lines of the two (2) stereo images. Common aggregation methods include square window, Gaussian convolution Scharstein (2001), multi-window, and shiftable window techniques Kuk-Jin (2006).

Area-based methods are less computationally expensive than globally optimized pixel-based correspondence. They are also less sensitive to noise, and they produce complete dense disparity maps like pixel-based methods. But at the cost of this reduced noise sensitivity, accuracy is lost in high-contrast areas of the image pair.
2.3 Feature-based Methods
Feature-based methods use detected feature points to highlight areas of high diversity in the image pair. Feature points may include edges, lines, regions, and gradient peaks. By only including feature points with high texture diversity, hard-to-match areas of low texture diversity are never matched with one another. Since only the feature points are correlated, the computation cost is greatly reduced; however, a complete dense disparity map cannot be obtained. Additionally, feature-based methods have great immunity to noise.

It should be noted that feature-based methods are intended to reduce processing time for stereo matching, so consideration must be taken in choosing a feature extractor that does not consume more computation time than it saves.
2.4 Cost Matching Functions
All stereo correspondence algorithms have a way of measuring the similarity of
image pixel locations. Typically, a matching cost is computed at each pixel for
all disparities under consideration. The simplest matching costs assume constant
intensities at matching image locations, but more robust costs model certain
radiometric changes and/or noise. Common pixel-based matching costs include
absolute differences, squared differences, sampling-insensitive absolute differences, or
truncated versions, both on gray and color images. Common window-based matching
costs include the sum of absolute or squared differences (SAD / SSD), normalized
cross correlation (NCC), and rank and census transforms. Some window-based
costs can be implemented efficiently using filters. For example, the rank transform
can be computed using a rank filter followed by absolute differences of the filter
results. Similarly, there are other filters that try to remove bias or gain changes, e.g.,
logarithmic and mean filters Hirschmuller (2007).
2.5 Common Cost Matching Functions
There are three (3) commonly used cost matching functions: Sum of Absolute Differences, Sum of Square Differences, and Maximum of Absolute Differences. In all three (3) cost matching functions, shown in equations 2.1, 2.2, and 2.3, N represents the number of pixels from the center pixel of the cost window, so the full window size is 2N + 1. The coordinates x and y are the location of the base pixel to be compared. Finally, d is the disparity between the target and reference pixels to be matched.

The Sum of Absolute Differences (SAD) function is the most commonly used cost matching function; it has decent accuracy and computational speed. The Sum of Square Differences (SSD), another common cost function, sacrifices speed for accuracy. Finally, the Maximum of Absolute Differences (MAD) can run faster than the previous two methods but is the least accurate. For this thesis, the Sum of Absolute Differences will be used.
Sum of Absolute Differences (SAD)

    \sum_{n=-N}^{N} \sum_{m=-N}^{N} \left| Ml_{n+x,\,m+y} - Mr_{n+x+d,\,m+y} \right|    (2.1)

Sum of Square Differences (SSD)

    \sum_{n=-N}^{N} \sum_{m=-N}^{N} \left( Ml_{n+x,\,m+y} - Mr_{n+x+d,\,m+y} \right)^2    (2.2)

Maximum of Absolute Differences (MAD)

    \max_{-N \le n \le N} \left[ \max_{-N \le m \le N} \left| Ml_{n+x,\,m+y} - Mr_{n+x+d,\,m+y} \right| \right]    (2.3)
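The three cost functions can be sketched directly from equations 2.1–2.3. This is an illustrative sketch, not the thesis's actual code: the names sad, ssd, mad, Ml, Mr, and the row-major 8-bit grayscale layout are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Ml/Mr are row-major 8-bit grayscale images of width w. (x, y) is the base
// pixel, d the candidate disparity, N the window half-size (full size 2N+1).
int sad(const uint8_t* Ml, const uint8_t* Mr, int w,
        int x, int y, int d, int N) {
    int cost = 0;
    for (int n = -N; n <= N; ++n)
        for (int m = -N; m <= N; ++m)
            cost += std::abs(int(Ml[(m + y) * w + (n + x)]) -
                             int(Mr[(m + y) * w + (n + x + d)]));
    return cost;
}

int ssd(const uint8_t* Ml, const uint8_t* Mr, int w,
        int x, int y, int d, int N) {
    int cost = 0;
    for (int n = -N; n <= N; ++n)
        for (int m = -N; m <= N; ++m) {
            int diff = int(Ml[(m + y) * w + (n + x)]) -
                       int(Mr[(m + y) * w + (n + x + d)]);
            cost += diff * diff;               // squared difference
        }
    return cost;
}

int mad(const uint8_t* Ml, const uint8_t* Mr, int w,
        int x, int y, int d, int N) {
    int cost = 0;
    for (int n = -N; n <= N; ++n)
        for (int m = -N; m <= N; ++m)
            cost = std::max(cost,              // keep only the worst pixel
                            std::abs(int(Ml[(m + y) * w + (n + x)]) -
                                     int(Mr[(m + y) * w + (n + x + d)])));
    return cost;
}
```

At the true disparity of a noise-free pair, all three costs drop to zero, which is the property the winner-take-all search in Section 2.6 exploits.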
2.6 Correspondence Functions
Correspondence cost functions are used to compare row-aligned stereo image pairs to one another. These cost functions take a specified pixel location (x, y) from the left and right images, Ml and Mr respectively, and compare an area of pixels. In a naive search, the cost function is cross-correlated with every pixel in the left and right image rows to find the lowest-cost pixel pairs. The final disparity is chosen at the minimum value of the cost matching function. This is also known as the Winner Take All (WTA) method. Many more advanced correspondence methods exist that can achieve higher accuracy, but the WTA method is simple and very fast.
To limit the search area, a minimum and maximum disparity is often chosen. This helps reduce computation time at the expense of limiting the searchable depth range of the stereo cameras. A small disparity is associated with objects far in the distance, while larger disparities correspond to objects very close to the stereo cameras. This thesis's algorithm fixes the minimum disparity at zero (0) and allows a maximum searchable disparity to be set.
Chapter 3
Block Matching
3.1 Core Technologies
Before delving into stereo vision block matching, a few technologies need to be addressed for their integral role in this thesis. First, a core part of this thesis uses the OpenCV (Open Source Computer Vision) API (Application Programming Interface). It is a consolidation of hundreds of computer vision algorithms and provides basic structures for managing images in C++. This thesis mainly uses OpenCV to acquire images from cameras, display images to the screen, and save images to the hard drive. OpenCV also provides the competing block matching function this thesis attempts to improve upon.
The second technology that needs to be addressed is Single Instruction Multiple Data (SIMD) operations. As the acronym implies, SIMD is a way of processing arrays of data with a single instruction in very few processor cycles. Many processor manufacturers support some flavor of SIMD operations; for this thesis, Intel's Streaming SIMD Extensions (SSE) will be used.
3.2 OpenCV's Block Matching Function
OpenCV provides an existing stereo correspondence block matching function known as the very fast SAD-based stereo correspondence algorithm. It was written by Kurt Konolige and utilizes SSE2 instructions. The algorithm supports many parameters, from the maximum searchable disparity to the starting disparity value. It also includes a Sobel pre-filter to help remove noise, as well as texture uniqueness testing. For this thesis, the Sobel pre-filter and uniqueness testing will not be used, to allow a fairer comparison with this thesis's algorithm.
3.3 Naive Stereo Matching Function
The first iteration of this thesis's stereo matching algorithm is a functional but naive algorithm. The algorithm uses three (3) nested loops to search over the maximum disparity and the SAD windows, and chooses the pixel disparity with the associated minimum SAD window cost. For a given image with N pixels, the time complexity is O(N^3). See the C++ code in Appendix A.1.
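The naive loop structure can be sketched as follows. This is a hedged, simplified sketch of the approach described above, not the actual implementation from Appendix A.1; the function name naiveStereo and the row-major 8-bit image layout are assumptions.

```cpp
#include <climits>
#include <cstdint>
#include <cstdlib>
#include <vector>

// ref/tgt are row-major 8-bit grayscale images of size width x height.
// N is the SAD window half-size; maxDisp bounds the disparity search.
std::vector<int> naiveStereo(const std::vector<uint8_t>& ref,
                             const std::vector<uint8_t>& tgt,
                             int width, int height, int N, int maxDisp) {
    std::vector<int> disparity(width * height, 0);
    for (int y = N; y < height - N; ++y) {
        for (int x = N; x < width - N; ++x) {
            int bestCost = INT_MAX, bestDisp = 0;
            // Loop 1: every candidate disparity along the scan line.
            for (int d = 0; d <= maxDisp && x + d + N < width; ++d) {
                int sad = 0;
                // Loops 2 and 3: re-sum the whole SAD window each time --
                // the redundancy the sliding window technique later removes.
                for (int m = -N; m <= N; ++m)
                    for (int n = -N; n <= N; ++n)
                        sad += std::abs(int(ref[(y + m) * width + (x + n)]) -
                                        int(tgt[(y + m) * width + (x + n + d)]));
                if (sad < bestCost) { bestCost = sad; bestDisp = d; }  // WTA
            }
            disparity[y * width + x] = bestDisp;
        }
    }
    return disparity;
}
```

Every window sum is recomputed from scratch for every pixel and disparity, which is why the runtimes in Table 3.1 are so large.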
3.4 Test Hardware
For performance comparisons, two (2) computers are used to run the stereo algorithms: first, a desktop computer with a 2.33 GHz Intel Core 2 Quad processor (Q8200) and 8 GB of memory; second, a Dell laptop with a quad-core Intel i7 processor (i7-720Q) clocked at 1.6 GHz with 4 GB of memory. Both computers run Windows 7.
(a) Left Image (b) Right Image (c) Ground Truth
Figure 3.1: Tsukuba Data Set
3.5 Test Stereo Images
The first set of performance tests uses the Tsukuba image set from the Middlebury 2001 dataset Scharstein (2001). The images are pre-calibrated with row-aligned epipolar lines, and the set also includes a ground truth disparity image.
3.6 Naive Stereo Correspondence and OpenCV
BM Comparison
To compare the disparity maps of the naive stereo correspondence method and the highly optimized OpenCV BM method, both maximum disparities are set at 64 and the SAD window size is set at 11x11. Comparing each disparity map to the ground truth with the average squared difference (lower is better), the OpenCV BM method scored 6457 and the naive stereo correspondence algorithm scored 4746. It can also be seen visually that the OpenCV BM algorithm produces quite a bit of white speckled noise.
The average runtimes of the two algorithms are evaluated on two separate computers. The first computer hosts an Intel Core 2 Quad Q8200 processor running at 2.33 GHz with 8 GB of RAM. The second computer hosts an Intel Core i7-720Q processor running at 1.6 GHz with 4 GB of RAM. Both are quad-core processors, which have four (4) processor cores on a single chip. Looking at Table 3.1, the OpenCV BM algorithm devastates the runtime of the naive stereo correspondence algorithm.

Table 3.1: Naive Stereo Performance

                  OpenCV BM   Naive
Q8200 2.33 GHz      8.75 ms   1808 ms
i7-720Q 1.6 GHz    20.64 ms   4378 ms

(a) OpenCV BM (b) Naive BM
Figure 3.2: Naive Correspondence Comparison
3.7 Sliding Window Technique
Examining the operation of the naive stereo matching algorithm, most of the data is redundantly calculated. To cut down on redundancy, the sliding window technique can be used Muhlmann (2001). Instead of recalculating a window sum from scratch each time, the previous window sum can be reused: for any window size, the element leaving the window is subtracted from the old sum and the element entering it is added, yielding the next window sum. This process can reduce processing time by one (1) order of magnitude, and in the case of the two-dimensional window sum in the SAD window algorithm, by two (2) orders of magnitude.
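The one-dimensional version of the technique can be sketched in a few lines. This is a minimal illustrative sketch; the function name slidingWindowSums is an assumption made for the example.

```cpp
#include <vector>

// Returns the sums of every length-W window of v. Only the first window is
// summed in full; each subsequent sum is the previous one plus the entering
// element minus the leaving element, turning O(n*W) work into O(n).
std::vector<int> slidingWindowSums(const std::vector<int>& v, int W) {
    std::vector<int> sums;
    if (static_cast<int>(v.size()) < W) return sums;
    int s = 0;
    for (int i = 0; i < W; ++i) s += v[i];    // first window: full sum
    sums.push_back(s);
    for (std::size_t i = W; i < v.size(); ++i) {
        s += v[i] - v[i - W];                 // slide: add new, drop old
        sums.push_back(s);
    }
    return sums;
}
```

Applying the same update once along rows and once along columns is what yields the two-orders-of-magnitude reduction for the 2-D SAD window.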
(a) D Matrix (b) S1/S2 Matrix
Figure 3.3: Intermediate Data Structures
3.7.1 Intermediate Data Structures for the Sliding Window
Technique
Intermediate data structures are needed to support the sliding window process. First is the Difference Matrix (D). This matrix stores all of the unique pixel differences across a single image row, within the bounds of the maximum disparity. Next are the S1n matrices. Each S1 matrix stores the sliding window sum of each row of the D matrix. As many instances of S1 are needed as the summing window is tall, so that n = window size. The final storage structure is the S2 matrix, which stores the column sums of the S1n matrices. In this thesis all of the intermediate storage matrices are presented as two-dimensional matrices; however, each matrix is stored in memory as a single continuous array. These intermediate data structures are created once and reused for each subsequent stereo image pair that is processed.
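The sizing of these buffers can be sketched as below. The struct and field names are assumptions for illustration; the dimensions follow the description above, with each matrix kept as one flat array (here a vector) per row of (maxDisp + 1) x width entries.

```cpp
#include <vector>

// Holds the reusable intermediate buffers for the sliding window matcher:
// D for one row of unique pixel differences, `window` copies of S1 for the
// horizontal row sums, and S2 for the vertical (column) sums of the S1s.
struct SlidingWindowBuffers {
    std::vector<int> D;                    // (maxDisp + 1) * width entries
    std::vector<std::vector<int>> S1;      // window matrices, same shape as D
    std::vector<int> S2;                   // running vertical sum of the S1s
    SlidingWindowBuffers(int width, int maxDisp, int window)
        : D(static_cast<std::size_t>(maxDisp + 1) * width, 0),
          S1(window, std::vector<int>(
                 static_cast<std::size_t>(maxDisp + 1) * width, 0)),
          S2(static_cast<std::size_t>(maxDisp + 1) * width, 0) {}
};
```

Allocating once and reusing across frames avoids per-frame heap churn, matching the reuse described in the text.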
3.7.2 The Four Main Processes
There are four main processes that transform the stereo image pairs into a final disparity map, represented by Figure 3.4. Note that the example figures for the following processes are for an image pair that is 12 pixels wide with a maximum disparity of 4. First is the pixel absolute differences process, Figure 3.4a. This process scans a single selected row of the target and reference stereo images and fills the D matrix with all of the unique pixel differences.
The row summing window process, Figure 3.4b, uses the sliding window technique to sum the unique pixel differences from the D matrix into a specified S1 matrix. The column summing window process, Figures 3.4c and 3.4d, takes place after all of the S1 matrices have been filled; they are summed element by element into the S2 matrix. Once the S2 matrix is filled, the winner take all (WTA) process, also seen in Figures 3.4c and 3.4d, scans the columns of the matrix to find the minimum element. The row number of the smallest column element is then set as the disparity of that pixel. This results in the first completely solved disparity row.

When the next row is needed, the oldest S1 matrix is subtracted from S2, and the previously oldest S1 matrix is reused to store the newest row data from the row summing window process (see Figure 3.4d).
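The S2 update when advancing a row can be sketched in one pass. This is a hedged sketch of the subtract-oldest/add-newest step described above; the function name advanceS2 is an assumption.

```cpp
#include <vector>

// Advance S2 to the next image row: remove the contribution of the oldest
// S1 matrix and add the freshly computed one, element by element, so S2
// always holds the vertical sum of the last `window` S1 matrices without
// ever re-summing the whole stack.
void advanceS2(std::vector<int>& S2,
               const std::vector<int>& oldestS1,
               const std::vector<int>& newestS1) {
    for (std::size_t i = 0; i < S2.size(); ++i)
        S2[i] += newestS1[i] - oldestS1[i];
}
```

In the full pipeline, the buffer that held oldestS1 is then reused to receive the next row's sums, so no extra allocation is needed.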
3.7.3 Sliding Window Performance
The sliding window technique drastically improves performance over the naive stereo correspondence algorithm while producing the same result. For a given image, the sliding-window-improved algorithm performs with a time complexity of almost O(N). However, the improved algorithm is still slower than OpenCV's block matching algorithm.
3.8 Improvements with SIMD Instructions
Many consumer-level processors have extended instruction sets to improve performance in certain situations. Common extended instruction sets include Single Instruction Multiple Data (SIMD) operations. As the name implies, SIMD operations
Table 3.2: Windowing Stereo Performance

                  OpenCV BM   Naive     Window
Q8200 2.33 GHz      8.75 ms   1808 ms    62.05 ms
i7-720Q 1.6 GHz    20.64 ms   4378 ms   172.70 ms
(a) First is the Pixel Absolute Differences (b) Row Summing Window
(c) column summing window 1st (d) column summing window 2nd
Figure 3.4: Intermediate Data Structures
are able to process an array of data with the same instruction in a single operation.
For arrays and matrices of data, this can drastically improve computation performance.
A number of processor manufacturers have built technologies into their chips to support SIMD operations. They include ARM with NEON, AMD with 3DNow!, and Intel with MMX, Streaming SIMD Extensions (SSE), and Advanced Vector Extensions (AVX). SSE operations will be used in this thesis since SSE is the most widely supported processor extension, available across multiple Intel and AMD processors. While SSE will be the only technology used, expanding the code in this thesis to another SIMD technology is trivial once the code is optimally vectorized.
3.8.1 Intel SSE overview
Intel SSE introduces a set of 128-bit registers, separate from the standard CPU registers, that can be packed with a number of different data types. Eight (8) registers are available for x86 (32-bit processors and operating systems) and sixteen (16) registers for x64 (64-bit processors and operating systems).

There are several revisions of SSE that have incrementally added new operations. SSE was first introduced in 1999 and followed by revisions every few years: SSE2 in 2001, SSE3 in 2004, Supplemental Streaming SIMD Extensions (SSSE3) in 2006, and SSE4 in 2007. The first revision of SSE only supported storing and operating on four (4) 32-bit floating point values in the 128-bit registers. With SSE2, the ability to store and operate on additional data types was added. The SSE instruction set is almost complete, including arithmetic, logical, compare, shuffle, conversion, load, and store operations.
3.8.2 SSE with C++
Processor-specific SSE instructions are normally only accessible via direct assembly instructions. However, Intel provides a header file <emmintrin.h> that makes the assembly instructions available in C and C++ via intrinsic functions. Intrinsics are basic C mappings for assembly routines that are directly inserted into the main code.

An additional requirement for reading memory efficiently into the SSE registers is that the memory needs to be 16-byte aligned, meaning the starting memory address of each 16-byte segment is a multiple of sixteen (16). There are instructions for loading unaligned memory segments into the SSE registers, but they are considerably slower. Declaring an aligned array of memory requires the use of the _aligned_malloc function Microsoft Developer Network (2005). This function guarantees that the starting point of an array is 16-byte aligned, but does not guarantee the entire array will fit in the 16-byte aligned space. This is the prime reason that OpenCV's block matching stereo algorithm limits the maximum disparity to multiples of 16.
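Aligned allocation can be sketched portably as below. The thesis targets Windows and _aligned_malloc; this sketch uses POSIX's posix_memalign instead, and the helper name alloc16Ints is an assumption made for the example.

```cpp
#include <cstdint>
#include <cstdlib>

// Allocate `count` ints whose starting address is a multiple of 16, so the
// block can be streamed into SSE registers with fast aligned loads.
int* alloc16Ints(std::size_t count) {
    void* p = nullptr;
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0) return nullptr;
    return static_cast<int*>(p);   // release with std::free
}
```

As the text notes, aligning the start of the array is not enough on its own: the array's length must also be managed so every SSE load stays inside the allocation, which motivates the over-fitting scheme below.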
There are two (2) ways to handle data that does not fit neatly into 16-byte chunks. First, the data vector can be allocated exactly the space it needs. In Figure 3.6, assume each square represents a 32-bit integer and each four (4) square section exactly fills a standard SSE register. In Figure 3.6a, sections 1 and 2 fill the SSE registers, but section 3 only partially fills a register, so processing the whole vector may have to fall back on standard CPU instructions for section 3. In Figure 3.6b, each vector has been allocated a few sizes larger than it needs to fit all of the data. Due to the over-fitting, SSE instructions can be used on all three sections for element-by-element operations, where the extra random data will not pollute the resulting vector. The over-fitting method is used in this thesis.

Figure 3.5: SSE Register Packing
3.8.3 Changes in Intermediate Data Structures for SSE
Applying the over-fitting method to the existing intermediate data structures produces something similar to Figure 3.7. The figure, like before, is for a stereo image pair with a 12-pixel width and a maximum disparity of 4. For each matrix row, filler data is added at the end so each row will be a multiple of 16 bytes. If the only operations to be performed on the matrices were element by element, fill data would only need to be added at the end of the matrix. But for the winner take all process to work efficiently, the SSE registers need to iterate both vertically and horizontally through the matrix.
3.8.4 SSE Intrinsics Overview
There are many SSE intrinsics (C++-mapped assembly instructions), but only a few
are needed in this thesis's algorithm. Table 3.3 lists and describes all of the
SSE intrinsic functions used in this thesis. Note the SSE version number after
each description; the highest version number is the limiting factor for supported
processors (Kowal, 2011).
3.8.5 SSE Row Difference
The row difference process, as before, takes the absolute differences of the left and
right image rows from zero disparity up to the maximum disparity. The SSE intrinsic
functions improve the process by allowing 16 bytes to be processed at the same time
(note that in Figure 3.8 only 4 numbers are processed in a single SSE instruction,
for simplicity). First, a 16-byte unaligned chunk from each image row is loaded into
the SSE registers with the _mm_loadu_si128 instruction.
(a) Exact Fitting (b) Over Fitting
Figure 3.6: SSE Array Fitting
(a) D Matrix (b) S1/S2 Matrix
Figure 3.7: Intermediate Data Structures with Over-fitting
(a) First Difference (b) Second Difference
Figure 3.8: SSE Row Absolute Difference
Table 3.3: SSE Intrinsic Reference
Intrinsic                                                    SSE Version  Description
__m128i _mm_lddqu_si128(__m128i const* p)                    SSE3         loads an unaligned 128-bit value
__m128i _mm_sub_epi8(__m128i a, __m128i b)                   SSE2         subtracts the 16 8-bit integers of b from a
__m128i _mm_abs_epi8(__m128i a)                              SSSE3        calculates the absolute value of the 16 8-bit integers of a
__m128i _mm_min_epi16(__m128i a, __m128i b)                  SSE2         returns the 8 minimum 16-bit integers between a and b
__m128i _mm_cmplt_epi16(__m128i a, __m128i b)                SSE2         returns 8 16-bit integers valued 0xFFFF where a is less than b and 0x0000 otherwise
__m128i _mm_blendv_epi8(__m128i a, __m128i b, __m128i mask)  SSE4.1       returns the 8-bit values from a or b depending on the given mask
__m128i _mm_packs_epi16(__m128i a, __m128i b)                SSE2         packs the 16 16-bit integers from a and b into 16 8-bit integers with saturation
void _mm_storeu_si128(__m128i *p, __m128i a)                 SSE2         stores the 128-bit value a to unaligned memory at p
The two registers are subtracted and have their absolute value taken with the
_mm_sub_epi8 and _mm_abs_epi8 instructions. The resulting SSE register is stored
into aligned memory. The innermost while loop iterates this process horizontally
across the matrix, while the outer for loop iterates it vertically over the
different disparities. See the C++ code in Appendix A.2.
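A scalar sketch of what this step computes may help (the names are illustrative; the actual code in Appendix A.2 performs the same arithmetic 16 bytes per instruction): for each disparity d, one D-matrix row holds the absolute differences between the left row and the disparity-shifted right row.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Scalar equivalent of the SSE row-difference step: one D-matrix row per
// disparity, where D[d][j] = |left[j] - right[j + d]|, mirroring the
// disparity offset applied to the right-image pointer in Appendix A.2.
std::vector<std::vector<uint8_t>> rowAbsDiff(const std::vector<uint8_t>& left,
                                             const std::vector<uint8_t>& right,
                                             int maxDisparity) {
    const int span = static_cast<int>(left.size()) - maxDisparity;
    std::vector<std::vector<uint8_t>> D(maxDisparity,
                                        std::vector<uint8_t>(span));
    for (int d = 0; d < maxDisparity; ++d)
        for (int j = 0; j < span; ++j)
            D[d][j] = static_cast<uint8_t>(std::abs(left[j] - right[j + d]));
    return D;
}
```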
3.8.6 SSE Row Summing Window
The row summing process for the SSE-improved stereo correspondence is nearly
identical to the previous row summing process. The only difference is that it has
been adjusted to accept the new SSE-compatible over-fitted matrices; however, it is
not able to take advantage of any SSE instructions. The row summing process also
converts the 8-bit pixel differences in the D matrix into 16-bit sums in the S1 matrix.
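The sliding row sum can be sketched in scalar form as follows (illustrative names; the thesis's actual routine operates on the over-fitted matrices in place). Each new window sum is the previous sum plus the entering element minus the leaving one, with the 8-bit differences widened to 16 bits:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sliding-window row sum: S1[j] holds the sum of the 8-bit differences
// D[j] .. D[j + win - 1], widened to 16 bits to avoid overflow. After
// the first window, each sum is updated in O(1) by adding the entering
// element and subtracting the leaving one.
std::vector<uint16_t> rowSum(const std::vector<uint8_t>& D, int win) {
    std::vector<uint16_t> S1(D.size() - win + 1);
    uint16_t sum = 0;
    for (int j = 0; j < win; ++j) sum = static_cast<uint16_t>(sum + D[j]);
    S1[0] = sum;
    for (std::size_t j = 1; j < S1.size(); ++j) {
        sum = static_cast<uint16_t>(sum + D[j + win - 1] - D[j - 1]);
        S1[j] = sum;
    }
    return S1;
}
```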
3.8.7 SSE Column Summing Window
The SSE-improved column summing window process sums the individual S1 matrices
into the S2 matrix as before. However, the old summing window process from Section
3.7.2 subtracts the old S1 data from S2, calculates a new S1 matrix, and then adds it
into S2. The SSE-improved column window process instead requires that the old S1 and
new S1 data both be available: the old S1 matrix is subtracted from the new S1 matrix
and the result added to the existing S2 matrix via the _mm_sub_epi16 and
_mm_add_epi16 intrinsics. See the C++ code in Appendix A.3.
3.8.8 SSE Winner Take All
The SSE-improved winner-take-all process is significantly changed from the non-SSE
version. For simplicity, the SSE registers in Figure 3.11 only show storing
4 values each; the real process stores 8 16-bit values in each SSE register.
The process begins by comparing the first two 128-bit data chunks, at disparity
levels 0 and 1, with the _mm_cmplt_epi16 intrinsic function. The resulting
(a) First WTA Operation (b) Second WTA Operation
Figure 3.11: SSE Winner Take All Process
value is stored in the mask SSE register. The minimum 8 16-bit values are then found
with the _mm_min_epi16 intrinsic and stored in the ssVal SSE register. Two registers,
one filled with zeros and the other filled with ones, both representing the disparity
level, are mixed together according to the mask register using the _mm_blendv_epi8
intrinsic, with the result put into the ssDisp register. The ssDisp register now
stores the disparity level of the minimum values.
Each following disparity level is compared against the existing ssVal register, the
minimum 8 16-bit values are stored back into ssVal, and the existing ssDisp register
is blended with the new disparity index. When all disparities have been examined,
the ssDisp register holds the final 16-bit disparity value for each pixel.

Since the ssDisp register holds the disparity values as 8 16-bit numbers, they need
to be packed into 8-bit numbers. The _mm_packs_epi16 intrinsic does exactly that,
and the final disparity values are stored to the unaligned memory of the output
image. See the C++ code in Appendix A.4.
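In scalar form (illustrative, not the SSE code itself), the winner-take-all step is simply an argmin over the disparity dimension of the S2 matrix:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar equivalent of the SSE winner-take-all: for each column j, track
// the running minimum window sum and the disparity index that produced
// it -- the same pair of quantities the ssVal/ssDisp registers track for
// 8 columns at a time in Appendix A.4.
std::vector<uint8_t> winnerTakeAll(
        const std::vector<std::vector<uint16_t>>& S2) {
    std::vector<uint8_t> disp(S2[0].size());
    for (std::size_t j = 0; j < S2[0].size(); ++j) {
        uint16_t best = S2[0][j];
        uint8_t bestD = 0;
        for (std::size_t d = 1; d < S2.size(); ++d) {
            if (S2[d][j] < best) {
                best = S2[d][j];
                bestD = static_cast<uint8_t>(d);
            }
        }
        disp[j] = bestD;
    }
    return disp;
}
```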
3.8.9 SSE with Sliding Window Performance
The vectorization and inclusion of SSE instructions have more than tripled the
performance over the summing window technique alone. However, further improvement
is still needed to catch up with the OpenCV block matching algorithm.
3.9 Improving performance with Parallelism
While consumer-level CPUs are available at greater speeds every year, the physical
limit of processor speed is quickly approaching. To keep improving performance with
each new generation, processor manufacturers have been adding multi-core support
to their processors. However, adding multi-threaded support to the algorithm
introduces new problems.
First, developing multi-threaded processes is not natively supported by C++ and
requires operating-system-specific functions to start new threads. For any cross-
platform project, this is a huge hindrance. Several packages bridge the platform
gap in multi-threading, such as the Boost libraries, the Intel Threading Building
Blocks (TBB), and the updated C++11 standard, which includes a thread object in
its standard library. This thesis uses the Intel Threading Building Blocks.
Other problems introduced with parallelism are shared resources, race conditions,
and deadlocks. While there are a number of techniques to manage and avoid these
problems, this thesis sidesteps most of them by giving each thread a completely
separate instance of the intermediate data and by not allowing threads to pass data
to one another. This technique is sometimes called embarrassingly parallel.
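The idea can be sketched with std::thread from C++11 (the thesis itself uses TBB; the names here are illustrative). Each thread writes only its own slice of the output and owns its own scratch state, so no synchronization beyond the final join is needed:

```cpp
#include <thread>
#include <vector>

// Embarrassingly parallel sketch: split the output into equal strips,
// one per thread. Each thread touches only its own strip, so there are
// no shared resources, race conditions, or deadlocks to manage.
// (Remainder handling is omitted: size must divide evenly by threads.)
void processStrips(std::vector<int>& out, int numThreads) {
    std::vector<std::thread> pool;
    const int strip = static_cast<int>(out.size()) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        pool.emplace_back([&out, t, strip] {
            for (int i = t * strip; i < (t + 1) * strip; ++i)
                out[i] = i * 2;   // stand-in for per-pixel stereo work
        });
    }
    for (auto& th : pool) th.join();
}
```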
3.9.1 Dividing the Data
To easily manage the threaded data, the stereo image pair is split into
independent stereo matching problems. In the example of Figure 3.12, the stereo
image pair is separated into two stereo matching problems that are solved in
parallel and then stitched back together into the whole depth-mapped image. For the
Table 3.4: SIMD Stereo Performance

                 OpenCV BM   Naive     Window     SIMD
Q8200 2.33 GHz   8.75 ms     1808 ms   62.05 ms   20.35 ms
i7-720Q 1.6 GHz  20.64 ms    4378 ms   172.70 ms  53.22 ms
stereo image in the example being separated for two threads, the crossover space
between the red and blue divisions is given to both threads. Since the threads do
not share any data with each other, this crossover space is redundantly calculated
by both threads. An attempt was made to pass this pre-calculated data along to
adjacent threads, but any gain in performance was lost to the added overhead.
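The division with redundant crossover can be sketched as a small range computation (the names and the overlap parameter are illustrative; in practice the overlap depends on the matching window size):

```cpp
#include <algorithm>
#include <utility>

// Compute the row range [begin, end) that thread t of `threads` should
// process. Interior strip edges are extended by `overlap` rows -- the
// crossover space -- which both neighboring threads compute redundantly
// so that no data ever has to be passed between threads.
std::pair<int, int> stripRange(int rows, int threads, int overlap, int t) {
    const int strip = rows / threads;
    const int begin = std::max(0, t * strip - overlap);
    const int end   = std::min(rows, (t + 1) * strip + overlap);
    return {begin, end};
}
```

For a 480-row image split between two threads with a 10-row overlap, thread 0 gets rows [0, 250) and thread 1 gets rows [230, 480), so rows 230 through 249 are computed twice.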
3.9.2 Parallelism Performance
The test data for the multi-threaded improved stereo correspondence algorithm once
again shows the average of 100 tests with the Tsukuba data set under different
numbers of threads. With a single thread, the multi-threaded algorithm has about the
same performance as the SIMD-improved process, as it should. The process drastically
improves when split into two threads, almost halving the time. At four threads the
diminishing return of splitting the process into more threads
Figure 3.12: Parallel Data Division and Recombination
can be seen. At this point the best performance is reached for the quad-core Q8200
processor. The peak performance for the quad-core i7-720Q processor is, however, at
16 threads, most likely due to the processor's support of Hyper-Threading
technology.

With this final performance improvement, the proposed stereo matching method
finally surpasses the OpenCV block matching algorithm in not only depth accuracy but
also computation time on quad-core and higher processors.
Table 3.5: Parallel Stereo Performance

                 OpenCV BM   Naive     Window     SIMD      Best Threaded
Q8200 2.33 GHz   8.75 ms     1808 ms   62.05 ms   20.35 ms  6.40 ms (4)
i7-720Q 1.6 GHz  20.64 ms    4378 ms   172.70 ms  53.22 ms  14.30 ms (16)
Figure 3.13: Multi-Threading Performance
Chapter 4
Applying Stereo Method to Real
Cameras
Often, no matter how well the cameras are physically aligned, imperfections in
the complementary metal-oxide-semiconductor (CMOS) or charge-coupled device
(CCD) imaging cell placement, along with lens distortion, can prevent pixel-for-pixel
alignment between the epipolar lines and the image scan lines. Several camera
calibration and rectification techniques exist to correct this problem. The most
popular rectification technique is Tsai's model for pinhole perspective projection
(Tsai, 1987). Tsai's model uses eleven estimation parameters to project the 3D world
space into the 2D image space.
Table 4.1: Tsai's Model Estimation Parameters

Parameter    Description
f            Focal length of camera
k            Radial lens distortion coefficient
Cx, Cy       Coordinates of center of radial lens distortion
S            Scale factor
Rx, Ry, Rz   Rotation angles for transformation
tx, ty, tz   Translation components for transformation
4.1 Calibration
To properly set the estimation parameters in Tsai's camera model, a set of calibration
images with a known calibration pattern, such as a checkerboard, can be used. Knowing
the exact size of the checkerboard, its exact location in 3D space can be found from
the 2D image alone. With several images of the checkerboard at different angles and
depths from the camera, a gradient descent routine can solve for the best estimation
parameters of the pinhole camera model.
4.1.1 Hardware
The cameras used for this thesis are two Logitech C250 webcams, capable of capturing
640 by 480 resolution images at 30 frames per second via USB 2.0. Both of the
cameras' cases have been removed, and the camera printed circuit boards have been
screwed directly to a mounting rig made from expanded PVC board.
Figure 4.1: Cameras and Mounting Rig
4.1.2 Calibration Procedure
To begin calibrating, a chessboard-patterned calibration target is chosen. The
chessboard pattern may have any number of rows and columns; however, having
fewer than five in either dimension may reduce the quality of the rectified
result. The chessboard may be printed on paper, but it needs a firm backing: any
physical distortion of the chessboard pattern will significantly reduce the accuracy
of calibration. Note that choosing black and white for the chessboard test image
gives the maximum contrast for the subsequent algorithms to detect the chessboard
corners. The chessboard test pattern below is printed on an 8.5 x 11 inch piece of
paper and taped to the face of a book. The test pattern has 2.0 cm wide squares with
11 columns and 8 rows.
Acquiring a set of test images is fairly simple. First, the test pattern is presented
in front of the stereo cameras; the pattern needs to be in full view of both cameras.
A snapshot is taken and saved to memory or a file. Through experimentation, five
image pairs seem to be sufficient to calibrate the stereo cameras. Using too many
calibration image pairs may cause OpenCV's calibration, which uses Hartley's
algorithm (Hartley, 1992), to take an excessive amount of time to reach its final
values, or worse, to never converge to a final set of camera calibration constants.
To extract the chessboard point data, OpenCV provides the function
cvFindChessboardCorners. The function only needs to know the number of
chessboard cross sections vertically and horizontally and will return an array
of points locating the cross sections. OpenCV provides another function,
cvFindCornerSubPix, to refine the found chessboard corners to sub-pixel
accuracy. The detected chessboard corners can be displayed on the image via
cvDrawChessboardCorners.

Once all of the chessboard corners have been harvested from the image pairs, the
points can be fed into another OpenCV function, cvStereoCalibrate. This function
is based on Hartley's stereo camera calibration algorithm. The algorithm computes
the best-fitted estimation values for Tsai's pinhole camera model (Tsai, 1987) so
that the epipolar lines are row-aligned with each other.
4.2 Real-time Rectification
While it would be simple to apply Tsai's transforms directly to the stereo image
pair for every video frame, that would take too much processing time for the real-time
requirements. Instead, the camera model is used to calculate a large lookup table that
represents the remapping of each individual pixel in an image. This method speeds
up the image correction at the expense of system memory. OpenCV once again
provides a function, remap, that will remap an image given a mapping table. The
OpenCV remap function requires two tables, one for the horizontal mapping and the
other for the vertical mapping. Both tables store the remapped sub-pixel locations
as 32-bit floating point numbers, which allows the function to use a number of pixel
interpolation methods such as nearest-neighbor, bilinear, bicubic, and Lanczos. For
the purposes of this thesis, the least accurate but fastest of these methods,
nearest-neighbor, is used to remap the images. But because the OpenCV remap function
is designed to support multiple interpolation methods, it is not well optimized for
nearest-neighbor specifically. To get the best performance out of remapping, a new
function is programmed that is optimized in three specific ways.
4.2.1 Optimized Nearest-Neighbor
First, the mapping tables are pre-calculated to their nearest-neighbor pixels and
stored in a single table. The table no longer holds the horizontal and vertical pixel
locations in the image but instead holds memory offset locations into the image pixel
array stored in memory. The second optimization only works for full-color images.
In the original OpenCV remap function, each color channel (red, green, and blue)
is copied separately, one byte at a time. In the new remap function, the red, green, and
(a) Left Calibration Image 1 (b) Right Calibration Image 1
(c) Left Calibration Image 2 (d) Right Calibration Image 2
Figure 4.2: Calibration Images with cvDrawChessboardCorners
Figure 4.3: Rectified Stereo Image Pair
(a) Original Camera Image (b) Corrected Camera Image
Figure 4.4: Nearest Neighbor Rectification
blue channels are copied all at once in a single 4-byte (32-bit) copy instruction.
Each color channel occupies only 8 bits (1 byte), so an extra dummy byte is also
copied over to the new remapped image. The next 32-bit copy instruction overwrites
the dummy byte, so each color pixel is placed correctly in the image pixel array in
memory. Note that the last pixel in the image must be copied one byte at a time,
or an extra byte may be written outside of the image pixel array bounds in memory.
The final optimization separates the remapping task across multiple processor cores
for full parallel processing. See the C++ code in Appendices A.5 and A.6.
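The first two optimizations can be illustrated with a self-contained sketch (names are illustrative; unlike the appendix code, this version uses memcpy for the 32-bit moves to stay portable, and it assumes the source buffer has at least one spare byte after its last pixel, since each 4-byte read grabs one byte past an RGB triple):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of the optimized color remap: `offsetMap` holds precomputed
// byte offsets into the source pixel array (negative = no source pixel,
// write black). Each 24-bit RGB pixel is moved with one 32-bit copy;
// the dummy 4th byte is overwritten by the next pixel's copy, and the
// final pixel is copied 3 bytes at a time to stay inside the output.
void remapColor(const std::vector<uint8_t>& src,
                std::vector<uint8_t>& dst,
                const std::vector<int>& offsetMap) {
    uint8_t* out = dst.data();
    const std::size_t last = offsetMap.size() - 1;
    for (std::size_t i = 0; i < last; ++i, out += 3) {
        uint32_t px = 0;                              // black for unmapped pixels
        if (offsetMap[i] >= 0)
            std::memcpy(&px, &src[offsetMap[i]], 4);  // RGB plus one dummy byte
        std::memcpy(out, &px, 4);                     // dummy overwritten next pass
    }
    if (offsetMap[last] >= 0)
        std::memcpy(out, &src[offsetMap[last]], 3);   // last pixel: 3 bytes only
    else
        std::memset(out, 0, 3);
}
```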
4.2.2 Optimized Nearest-Neighbor Performance
The Optimized Remapping (OR) algorithm is benchmarked on the two computer
systems used previously. The test image is a 640 x 480 resolution image, the same
resolution as the USB cameras. Referring to Tables 4.2 and 4.3, the performance of
the optimized remapping algorithm has drastically improved upon OpenCV's standard
nearest-neighbor remapping method. For both quad-core systems, choosing 4 to 8
threads gives the best results; choosing more than 8 concurrent threads gives quite
erratic performance. Note that the remapping process needs to be run for both the
left and right stereo images, so the total system time for remapping is simply double
the values in the tables.
4.3 Complete System
The complete stereo correspondence system begins with capturing images from the
cameras with OpenCV's VideoCapture module. The VideoCapture module retrieves the
two images from the cameras in easy-to-manipulate cv::Mat objects. If the
VideoCapture module is called faster than the camera's frame rate will allow, the
module simply blocks execution until the camera is ready; the Logitech USB web
cameras used in this thesis are limited to 30 frames per second. The stereo images
are then passed to OpenCV's cv::cvtColor function, which converts the 24-bit color
images to 8-bit grey scale. This color conversion is necessary since both OpenCV's
block matching and this thesis's proposed method can only parse 8-bit images.

Once the stereo images are grey, they are rectified with this thesis's optimized
rectification algorithm from Section 4.2.1. The rectified stereo images are then
passed to this thesis's proposed stereo block matching method, which produces the
final disparity image. Since this thesis is only developing the block matching
method, the disparity map is displayed to the screen and the program returns to the
main loop to acquire another stereo image pair from the cameras. However, for an
autonomous robot, further processing of the disparity map is required to use it
effectively, mainly for object detection and/or avoidance.
Table 4.2: 24-bit Color Remapping
                 OpenCV    OR        OR         OR         OR         OR          OR
                 remap     1 Thread  2 Threads  4 Threads  8 Threads  16 Threads  32 Threads
Q8200 2.33 GHz   2.29 ms   0.95 ms   0.42 ms    0.44 ms    0.48 ms    0.43 ms     0.41 ms
i7-720Q 1.6 GHz  4.81 ms   1.55 ms   0.79 ms    0.75 ms    0.41 ms    0.64 ms     0.66 ms
Table 4.3: 8-bit Grey Remapping
                 OpenCV    OR        OR         OR         OR         OR          OR
                 remap     1 Thread  2 Threads  4 Threads  8 Threads  16 Threads  32 Threads
Q8200 2.33 GHz   1.93 ms   0.43 ms   0.23 ms    0.13 ms    0.15 ms    0.14 ms     0.14 ms
i7-720Q 1.6 GHz  3.97 ms   1.08 ms   0.55 ms    0.28 ms    0.31 ms    0.62 ms     0.32 ms
Chapter 5
Conclusions and Future Work
5.1 Conclusions
This thesis's proposed stereo block matching method is able to surpass OpenCV's
block matching method in both quality and processing time on quad-core systems. For
single and dual-core systems, OpenCV's method remains superior in processing time.
Additionally, this thesis's block matching method is not limited to maximum
disparities at multiples of sixteen, as OpenCV's block matching method is, thanks to
the over-fitting of the intermediate data structures for the SIMD operations.
5.2 Future Work
Continued work on this thesis's method will include expanding the architectures
the stereo block matching method supports. First, Intel's Advanced Vector
Extensions (AVX), a 256-bit extension of the SSE instruction set, will be added for
greater performance on Intel Sandy Bridge and Ivy Bridge processors. Next, ARM's
SIMD instruction set, NEON, may be added for ARM chip-sets.
Since OpenCV's block matching method still runs quicker on single-core systems,
there may still be room for improvement in optimizing the code. Further improvement
may come from directly optimizing critical loop sections in assembly language to
efficiently stream image data.

The disparity image quality may also be improved by implementing uniqueness
testing in the winner-take-all method or by applying pre-filters to the stereo image
pair.
Bibliography
Hartley, R. I. (1992). Estimation of relative camera positions for uncalibrated
cameras. In Proceedings of the Second European Conference on Computer Vision,
pages 579–587. Springer-Verlag.

Hirschmuller, H. and Scharstein, D. (2007). Evaluation of cost functions for stereo
matching. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2007), pages 1–8.

Kowal, G. (2011). Intrinsics guide for Intel Advanced Vector Extensions 2.

Yoon, K.-J. and Kweon, I.-S. (2006). Adaptive support-weight approach for
correspondence search. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28:650–656.

Microsoft Developer Network (2005). _aligned_malloc.

Muhlmann, K., Maier, D., Hesser, J., and Manner, R. (2001). Calculating dense
disparity maps from color stereo images, an efficient implementation. In
Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001).

Scharstein, D. and Szeliski, R. (2001). A taxonomy and evaluation of dense two-frame
stereo correspondence algorithms. In Proceedings of the IEEE Workshop on Stereo
and Multi-Baseline Vision (SMBV 2001), pages 131–140.

Tsai, R. (1987). A versatile camera calibration technique for high-accuracy 3D
machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal
of Robotics and Automation, 3:323–344.
Appendix A
Listings of Code
A.1 Chapter 3 Code
Listing A.1: Naive Stereo Matching
void Stereo::SAD::stVisual(cv::Mat &leftImg, cv::Mat &rightImg, cv::Mat &outImg)
{
  // both stereo images must match
  CV_Assert(leftImg.size() == rightImg.size());
  CV_Assert(leftImg.type() == rightImg.type());

  cv::Mat displayL;
  cv::Mat displayR;
  cv::Mat displayO;

  // note: the output is 3-channel, so the type check matches CV_8UC3
  if(leftImg.size() != outImg.size() || outImg.type() != CV_8UC3)
  {
    outImg.release();
    outImg.create(leftImg.size(), CV_8UC3);
    for(int i = 0; i < leftImg.rows * leftImg.cols * 3; i++)
      outImg.data[i] = 0;
  }

  int width_bytes = leftImg.cols;
  int max_row_addresoffset = width_bytes - windowoffset * 2;
  int end_row_skip = windowoffset * 2;

  uchar* dataL_ptr = leftImg.data;
  uchar* dataR_ptr = rightImg.data;
  uchar* out_ptr = outImg.data;

  // create easy-access 2D arrays
  uchar** dataDL = new uchar*[leftImg.rows];
  uchar** dataDR = new uchar*[leftImg.rows];
  uchar** dataDO = new uchar*[leftImg.rows];
  for(int i = 0; i < leftImg.rows; i++)
  {
    dataDL[i] = leftImg.ptr<uchar>(i);
    dataDR[i] = rightImg.ptr<uchar>(i);
    dataDO[i] = outImg.ptr<uchar>(i);
  }

  int rows = leftImg.rows;
  int cols = leftImg.cols;

  for(int i = windowoffset; i < rows - windowoffset; i++)
  {
    for(int j = windowoffset + target_size; j < cols - windowoffset; j++)
    {
      int min_diff = 10000;
      int disparity = 0;
      for(int k = 0; k <= target_size; k++)
      {
        int val = 0;
        // SAD correspondence
        for(int l = -windowoffset; l <= windowoffset; l++)
        {
          for(int m = -windowoffset; m <= windowoffset; m++)
          {
            val += std::abs(dataDL[i+l][j+m] - dataDR[i+l][j-k+m]);
          }
        }
        if(val < min_diff)
        {
          min_diff = val;
          disparity = k;
        }
      }
      if(disparity > 255)
      {
        dataDO[i][j*3]   = 0;
        dataDO[i][j*3+1] = 255;
        dataDO[i][j*3+2] = 255;
      }
      else
      {
        dataDO[i][j*3]   = disparity;
        dataDO[i][j*3+1] = disparity;
        dataDO[i][j*3+2] = disparity;
      }
    }
  }

  delete[] dataDL;
  delete[] dataDR;
  delete[] dataDO;
}
Listing A.2: SSE Row Absolute Difference
inline
void Stereo::Stereo_Win_SIMD_PLL5::calcRow(uchar* left, uchar* right,
                                           uchar** out, int max, int colsize)
{
  const int span = colsize - max;
  __m128i* ss_end = (__m128i*)left + span / 16;

  for(int i = 0; i < max; i++)
  {
    __m128i* ssA_ptr = (__m128i*)left;
    __m128i* ssB_ptr = (__m128i*)(right + i);
    __m128i* ssC_ptr = (__m128i*)(out[i]);
    __m128i temp, ssLeft, ssRight;   // renamed to avoid shadowing left/right

    while(ssA_ptr < ss_end)
    {
      ssLeft  = _mm_loadu_si128(ssA_ptr);
      ssRight = _mm_loadu_si128(ssB_ptr);
      temp    = _mm_sub_epi8(ssLeft, ssRight);
      *ssC_ptr = _mm_abs_epi8(temp);
      ssA_ptr++;
      ssB_ptr++;
      ssC_ptr++;
    }
  }
}
Listing A.3: SSE Column Summing Process
ssA_ptr = (__m128i*)S1_ptr_new;
ssB_ptr = (__m128i*)S1_ptr_old;
ssC_ptr = (__m128i*)S2_ptr;
__m128i ssTemp;
ss_end = ssA_ptr + S1_el_ss;

while(ssA_ptr < ss_end)
{
  ssTemp   = _mm_sub_epi16(*ssA_ptr, *ssB_ptr);  // 8 16-bit subtracts
  *ssC_ptr = _mm_add_epi16(*ssC_ptr, ssTemp);    // 8 16-bit adds
  ssA_ptr++;
  ssB_ptr++;
  ssC_ptr++;
}
Listing A.4: SSE Winner Take All Process
void inline Stereo::Stereo_Win_SIMD_PLL5::calcWTArow3(Stype* input, uchar* output,
                                                      int S1_rowsize, Stype* val,
                                                      Stype* disp)
{
  int ss_span = S1_rowsize_2 >> 3;
  __m128i* ssIn1;
  __m128i* ssIn2;
  __m128i* ssIn1_start = (__m128i*)input;
  __m128i ssVal;
  __m128i ssDisp;
  __m128i* ss_end = ssIn1_start + ss_span;
  __m128i temp, mask, ssZeros;
  __m128i* ssOut = (__m128i*)output;
  int depth_index;

  ssZeros = _mm_set1_epi16(0);

  while(ssIn1_start < ss_end)
  {
    ssIn1 = ssIn1_start;
    ssIn2 = ssIn1_start + ss_span;

    ssVal = _mm_min_epi16(*ssIn1, *ssIn2);
    // mask selects disparity 1 where its value is the smaller of the two
    mask  = _mm_cmplt_epi16(*ssIn2, *ssIn1);
    ssDisp = _mm_blendv_epi8(ssZeros, _mm_set1_epi16(1), mask);

    ssIn2 += ss_span;
    depth_index = 2;
    while(depth_index < max_disparity)
    {
      mask   = _mm_cmplt_epi16(*ssIn2, ssVal);
      ssVal  = _mm_min_epi16(ssVal, *ssIn2);
      ssDisp = _mm_blendv_epi8(ssDisp, _mm_set1_epi16(depth_index), mask);
      ssIn2 += ss_span;
      depth_index++;
    }

    temp = _mm_packs_epi16(ssDisp, ssZeros);
    _mm_storeu_si128(ssOut, temp);

    ssIn1_start++;
    output += 8;
    ssOut = (__m128i*)output;
  }
}
A.2 Chapter 4 Code
Listing A.5: Optimized Grey Scale Nearest-Neighbor Remapping
void refinedCV::pplremap_grey(const cv::Mat &input, cv::Mat &output,
                              std::vector<int> &map, int chunksize)
{
  // consistency check
  CV_Assert(input.isContinuous());
  CV_Assert((input.cols * input.rows) == map.size());

  // allocate output Mat memory
  if(input.size() != output.size())
  {
    output.release();
    output.create(input.size(), input.type());
  }

  int chunk_width = map.size() / chunksize;
  int map_width = map.size() / chunksize;
  int max_addr_offset = map_width;

  Concurrency::parallel_for(0, chunksize, [&](int chunk)
  {
    int m;
    int* mapptr = &(map[0]) + map_width * chunk;
    uchar* data = output.data + chunk_width * chunk;
    uchar* input_data = input.data;
    int* maxaddess = mapptr + max_addr_offset;

    for(; mapptr < maxaddess; mapptr++)
    {
      m = mapptr[0];
      if(m < 0)
        data[0] = 0;              // no valid source pixel: write black
      else
        data[0] = input_data[m];
      data++;
    }
  });
}
Listing A.6: Optimized Color Nearest-Neighbor Remapping
void refinedCV::pplremap_color(const cv::Mat &input, cv::Mat &output,
                               std::vector<int> &map, int chunksize)
{
  // consistency check
  CV_Assert(input.isContinuous());
  CV_Assert((input.cols * input.rows) == map.size());

  // allocate output Mat memory
  if(input.size() != output.size())
  {
    output.release();
    output.create(input.size(), input.type());
  }

  int chunk_width = map.size() * 3 / chunksize;
  int map_width = map.size() / chunksize;
  int max_addr_offset = map_width - 1;

  Concurrency::parallel_for(0, chunksize, [&](int chunk)
  {
    int m;
    int* mapptr = &(map[0]) + map_width * chunk;
    uchar* data = output.data + chunk_width * chunk;
    int* indata;
    int* maxaddess = mapptr + max_addr_offset;

    // copy each RGB pixel (plus one dummy byte) with a single 32-bit
    // move; the dummy byte is overwritten by the next pixel's copy
    for(; mapptr < maxaddess; mapptr++)
    {
      m = mapptr[0];
      if(m < 0)
      {
        *(int*)data = 0;
      }
      else
      {
        indata = (int*)&(input.data[m]);
        *(int*)data = *indata;
      }
      data += 3;
    }

    // the last pixel of the chunk is copied one byte at a time so the
    // 32-bit write cannot run past the end of the output array
    m = *mapptr;
    if(m == -1)
    {
      data[0] = 0;
      data[1] = 0;
      data[2] = 0;
    }
    else
    {
      indata = (int*)&(input.data[m]);
      data[0] = ((uchar*)indata)[0];
      data[1] = ((uchar*)indata)[1];
      data[2] = ((uchar*)indata)[2];
    }
  });
}
Vita
Bryan Bodkin was born in Los Angeles, California, in 1985. After graduating in 2004
from Brentwood High School, he attended Middle Tennessee State University, where
he received a Bachelor of Science in 2009, double majoring in Computer Science and
Electrical Mechanical Engineering Technology. In the fall of 2009, Bryan enrolled in
the Master's program at the University of Tennessee, Knoxville, in the Department of
Electrical Engineering and Computer Science. A year later he joined the Advanced
Imaging and Collaborative Information Processing (AICIP) group as a research
assistant, where he will complete his Master's degree in the summer of 2012. His
major research areas are real-time image processing, firmware programming,
graphical user interface (GUI) programming, and electronic hardware development.