
Fast and Robust Endoscopic Motion Estimation in High-Speed Laryngoscopy

Dimitar Deliyski§, Szymon Cieciwa*, Tomasz Zielinski*
§Communication Sciences and Disorders, University of South Carolina, Columbia, SC, USA
*Instrumentation and Measurement Department, AGH University of Science and Technology, Krakow, Poland
Email: [email protected]

Seven methods for endoscopic motion compensation for laryngeal high-speed videoendoscopy (HSV) are compared. Two of them are based on tracking the maximum of the cross-correlation function of two images; two are based, respectively, on the minimization of the $L_2$-norm and the magnitude difference distance between two images; and the other three utilize properties of the FFT-based cross-power spectrum of two images. All seven methods were applied to compensate the motion, at the sub-pixel level, of the endoscopic lens relative to the vocal folds in HSV recordings. The motion compensation methods based on the FFT cross-power spectrum demonstrated remarkable computational speed and satisfactory accuracy, while also offering a wider motion-tracking range. The proposed two-step least-squares fitting of the FFT cross-spectrum phase plane was found to be the fastest among all seven approaches.

Keywords: high-speed videoendoscopy, motion compensation, voice evaluation, kymography

1. Introduction

The endoscopic (camera lens) motion in high-speed videoendoscopy (HSV) affects the time alignment of the laryngeal anatomic structures in the image. Sub-pixel endoscopic motion compensation (MC) is an important preprocessing operation in certain visual-perceptual and automated techniques for evaluation of vocal fold movement [1].

The problem of endoscopic MC for HSV is complex due to the dynamics of the vocal folds during phonation (Fig. 1a). Laryngeal HSV is essentially different from any other medical imaging because it registers the motion of an organ that moves very fast (70-1000 Hz), affecting practically all connected tissues and creating motion across the whole image. The motion of the connected tissues contains a fast component, comparable in speed with the vocal folds, but also slower components, some of which are comparable with the speed of the endoscopic motion (less than 15 Hz). No clear spatial outlier can show the motion relative to the camera lens located on the tip of the endoscope.

Fortunately, the endoscopic motion and the changes in the glottis during phonation have different dynamics. This dynamic difference has been successfully used [1] for building the missing outlier. In order to dynamically separate the fast vocal fold movements from the slow camera lens motion, it is necessary to smooth the HSV. Smoothing of the time differential of the HSV image (Fig. 1b) has been found to be very effective when building the missing spatial outlier for endoscopic motion tracking [1].

Fig. 1. a) Open and closed phase of the vocal folds in two different x-y positions.
Fig. 1. b) Smoothed time-differential images of vocal folds in two different x-y positions.

The global displacement between two consecutive frames in video sequences has typically been estimated using direct image correlation or image difference minimization ($L_2$-norm or magnitude difference) techniques. Fast motion estimation has been realized using cross-power spectrum (CPS) algorithms based on the 2D fast Fourier transform (FFT) [2]. However, the FFT method in its original version is limited to integer shifts only and requires enhancement for sub-pixel image registration [3-7]. The accuracy of different well-known sub-pixel motion estimation methods is compared in [8], including the polyphase decomposition approach of Foroosh [3, 4], the frequency domain masking technique of Stone [5] and the sub-space identification extension algorithm of Hoge [6, 7]. The latter method, based on the singular value decomposition (SVD), was shown to be the most robust in an almost noise-free environment.
A novel fast sub-pixel algorithm using a two-step least-squares (LS) approximation of the CPS phase plane was recently developed [9]. In the present work of the same team, that method is described in more detail, additionally optimized with respect to its parameter values, and compared not only to the classical image alignment techniques, such as image correlation and difference minimization, but also to the SVD-based method [6], which was shown to be the best in [8]. All techniques were implemented in the Matlab environment and their efficiency was tested on laryngeal HSV images with natural and artificially added displacement. This study was aimed at the estimation of translations. The methods estimating image rotation and scaling [10, 11] are outside the scope of this work.

2. Methodology

2.1. Known Motion Compensation Methods

Given that $f_1(x, y)$ and $f_2(x, y)$ are two continuous functions, in this case two images, where the second function is a spatially shifted version of the first one, $f_2(x, y) = f_1(x - x_0, y - y_0)$, we can find the displacement $\{x_0, y_0\}$ by making use of one of the following methods.

Fig. 2. Detection matrices (similarity measures, plotted as z over dx and dy) for different methods used for HSV motion detection: up - higher searching range; down - lower searching range after interpolation (dx = a, dy = b); a) maximum of correlation similarity; b) minimum of $L_2$-like similarity; c) peak for cross-power spectrum similarity.

Correlation function method. The classic method for $\{x_0, y_0\}$ detection relies on the properties of the cross-correlation function of $f_1(x, y)$ and $f_2(x, y)$, which is defined as follows:

$$D_{corr}(a, b) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_1(x, y)\, f_2(x + a, y + b)\, dx\, dy. \tag{1}$$

The function $D_{corr}(a, b)$ reaches its maximum for $a = x_0$ and $b = y_0$. However, the maximum is flat (as shown in Fig. 2a) and the computation is time consuming.
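As a rough illustration of the discrete counterpart of (1), the sketch below evaluates the correlation only over a small window of integer shifts and returns the position of the maximum. It is written in Python with NumPy rather than the Matlab used in this study; the function name `correlation_shift`, the circular boundary handling through `np.roll`, the default `max_shift` and the choice of array axis 0 as x are all assumptions made for the example.

```python
import numpy as np

def correlation_shift(f1, f2, max_shift=2):
    """Return the integer (a, b) that maximizes a discrete version of D_corr(a, b).

    Only shifts in [-max_shift, max_shift] are tested (an assumed search window).
    Image borders are handled circularly, which is a simplification.
    """
    best_score, best_ab = -np.inf, (0, 0)
    for a in range(-max_shift, max_shift + 1):
        for b in range(-max_shift, max_shift + 1):
            # Build f2(x + a, y + b): element [x, y] of the rolled array is f2[x + a, y + b].
            shifted = np.roll(f2, shift=(-a, -b), axis=(0, 1))
            score = float(np.sum(f1 * shifted))
            if score > best_score:
                best_score, best_ab = score, (a, b)
    return best_ab
```

For an image pair where `f2` is `f1` circularly shifted by one row and minus two columns, the function is expected to return that pair of shifts. Sub-pixel accuracy would still require interpolating the scores around the maximum.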
We can observe that $f_2(x, y)$ is shifted back in the x and y dimensions by a and b to fit the original image $f_1(x, y)$. Speed optimization can be achieved by evaluating equation (1) for a limited range of a and b. Such an approach is appropriate for HSV recordings when estimating small shifts within 2 pixels [9].

The functional (1) also denotes the 2D convolution $D_{conv}(a, b) = D_{corr}(a, b)$ of the images $f_1(x, y)$ and $f_2(x, y)$. The method described in [1] utilized the FFT-based implementation of the 2D convolution function, namely conv2 in the Matlab environment. That computational approach is further termed the Convolution method and will serve as a baseline for evaluating the other methods. The Convolution approach is algorithmically fast, but not optimal for HSV images because the function $D_{conv}(a, b)$ is calculated in this case superfluously for all possible integer shifts a and b (depending on the image size), not only for their small values. This makes the method unnecessarily computationally intensive when shifts are small, but also very robust in detecting large shifts.

$L_2$-norm and magnitude difference minimization methods. The spatial shifts can be determined simply by minimizing the difference between two images while artificially shifting one of them and computing a similarity measure of their difference, such as:

$$D_{|\cdot|^2}(a, b) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left| f_1(x, y) - f_2(x + a, y + b) \right|^2 dx\, dy \tag{2}$$

$$D_{|\cdot|}(a, b) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left| f_1(x, y) - f_2(x + a, y + b) \right| dx\, dy \tag{3}$$

where (2) represents the $L_2$-norm measure of the image difference, and (3) is the magnitude difference (MD) measure. The minima of such functions are flat in the range of 2 pixels, as shown in Fig. 2b.

FFT-based cross-power spectrum method. It is known that the Fourier spectra of two images $f_1(x, y)$ and $f_2(x, y) = f_1(x - x_0, y - y_0)$ are related as:

$$F_2(\omega_x, \omega_y) = F_1(\omega_x, \omega_y)\, e^{-j(\omega_x x_0 + \omega_y y_0)} \tag{4}$$

where $F_1(\omega_x, \omega_y)$ and $F_2(\omega_x, \omega_y)$ denote their Fourier transforms. The normalized cross-power spectrum of these images is equal to:

$$G_{12}(\omega_x, \omega_y) = \frac{F_2(\omega_x, \omega_y)\, F_1^*(\omega_x, \omega_y)}{\left| F_2(\omega_x, \omega_y)\, F_1^*(\omega_x, \omega_y) \right|} = e^{j\varphi_{12}(\omega_x, \omega_y)} = e^{-j(\omega_x x_0 + \omega_y y_0)}, \tag{5}$$

where the operation * denotes complex conjugation. Therefore, the inverse Fourier transform of $G_{12}(\omega_x, \omega_y)$ results in:

$$g_{12}(x, y) = \mathrm{Fourier}^{-1}\left( G_{12}(\omega_x, \omega_y) \right) = \mathrm{Fourier}^{-1}\left( e^{-j(\omega_x x_0 + \omega_y y_0)} \right) = \delta(x - x_0, y - y_0), \tag{6}$$

and it is characterized by a sharp Dirac delta function centered at $(x_0, y_0)$. This property is very useful for motion detection.

In the discrete case, $f_1(m, n)$ and $f_2(m, n) = f_1(m - m_0, n - n_0)$, the above property still holds when FFT algorithms for the direct and inverse discrete Fourier transform (DFT) are applied. Now (4) and (5) can be rewritten in a discrete form as follows:

$$F_2(k, l) = F_1(k, l)\, e^{-j 2\pi (k m_0 / M + l n_0 / N)}, \tag{7}$$

$$G_{12}(k, l) = \frac{F_2(k, l)\, F_1^*(k, l)}{\left| F_2(k, l)\, F_1^*(k, l) \right|} = e^{-j 2\pi (k m_0 / M + l n_0 / N)} = e^{j\varphi_{12}(k, l)} = e^{-j(\alpha k + \beta l)}, \tag{8}$$

where:

$$F(k, l) = \mathrm{DFT}\left( f(m, n) \right) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, e^{-j 2\pi (mk / M + nl / N)}, \tag{9}$$

M, N denote the number of pixels in rows and columns, $m_0$, $n_0$ designate real-valued row and column shifts, and $k, m = 0, 1, 2, \ldots, M - 1$; $l, n = 0, 1, 2, \ldots, N - 1$. Now, the Dirac impulse takes the form of a 2D sinc function, centered at $(m_0, n_0)$:

$$g_{12}(m, n) = \mathrm{DFT}^{-1}\left[ G_{12}(k, l) \right] = \frac{\sin\left( \pi (m - m_0) \right)}{\pi (m - m_0)} \cdot \frac{\sin\left( \pi (n - n_0) \right)}{\pi (n - n_0)} \tag{10}$$

shown in Fig. 2c-up, the spline interpolation of which is presented in Fig. 2c-down.

Sub-pixel extensions. The motion detection matrices $D_{corr}(a, b)$ (1), $D_{|\cdot|^2}(a, b)$ (2) and $D_{|\cdot|}(a, b)$ (3) can be easily interpolated, as shown in Fig. 2ab-down, for sub-ranges $(-a_{max}, a_{max})$ and $(-b_{max}, b_{max})$ around their maximum or minimum. Adaptive strategies for changing the values of a and b can be applied. The same is true for $g_{12}(m, n)$ (10), which can be interpolated near its maximum $(m_0, n_0)$, as shown in Fig. 2c-down.

LS fitting of cross-power spectrum phase plane. Further, only the discrete case will be discussed.
Instead of calculating $g_{12}(m, n)$ from (10) and interpolating it near its maximum, it is also possible to estimate directly the shifts $m_0$ and $n_0$ between the images $f_1(m, n)$ and $f_2(m, n) = f_1(m - m_0, n - n_0)$ using two points (samples) $\varphi_{12}(k_1, l_1)$ and $\varphi_{12}(k_2, l_2)$ of the phase plane $\varphi_{12}(k, l)$ (8) and solving the following set of two equations with two unknowns $m_0$ and $n_0$:

$$\begin{aligned} \varphi_{12}(k_1, l_1) &= -2\pi (k_1 m_0 / M + l_1 n_0 / N) = -(\alpha k_1 + \beta l_1) \\ \varphi_{12}(k_2, l_2) &= -2\pi (k_2 m_0 / M + l_2 n_0 / N) = -(\alpha k_2 + \beta l_2) \end{aligned} \tag{11}$$

where

$$\varphi_{12}(k, l) = \tan^{-1}\left( \frac{\mathrm{Im}\left( G_{12}(k, l) \right)}{\mathrm{Re}\left( G_{12}(k, l) \right)} \right) = \mathrm{atan2}\left( \mathrm{Im}\left( G_{12}(k, l) \right), \mathrm{Re}\left( G_{12}(k, l) \right) \right) \tag{12}$$

$$\alpha = 2\pi m_0 / M, \quad \beta = 2\pi n_0 / N \tag{13}$$

An example of a phase function $\varphi_{12}(k, l)$, representing a plane in 3D space with characteristic angles $\alpha$ and $\beta$, is shown in Fig. 3. Due to the effect of phase wrapping, as illustrated in Fig. 4, it is necessary to evaluate $\varphi_{12}(k, l)$ only for small values of k and l.

Fig. 3. Example of a phase function $\varphi_{12}(k, l)$ (8), plotted over k and l.

Fig. 4. Phase plane $\varphi_{12}(k, l)$ (8) for preprocessed HSV images of vocal folds for: a) small, and b) large values of their relative displacement (only part of the matrix $\varphi_{12}(k, l)$ is shown).

Since the operations involved in (12) are very noise sensitive, using more samples of the phase plane and least-squares estimation is recommended:

$$\begin{bmatrix} k_1 & l_1 \\ k_2 & l_2 \\ \vdots & \vdots \\ k_K & l_K \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = -\begin{bmatrix} \varphi_{12}(k_1, l_1) \\ \varphi_{12}(k_2, l_2) \\ \vdots \\ \varphi_{12}(k_K, l_K) \end{bmatrix}, \quad \mathbf{A}\mathbf{x} = \mathbf{b} \tag{14}$$

$$\mathbf{x} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{b} = \mathrm{pinv}(\mathbf{A})\, \mathbf{b} \tag{15}$$

In the Matlab language, the operation (15) corresponds to x = A\b. The shifts $m_0$ and $n_0$ are computed from (13) by finding the optimal (in the LS sense) values of $\alpha$ and $\beta$.

Subspace estimation of cross-power spectrum phase plane [3]. Another interesting approach to exploiting the cross-power spectrum for calculating the displacement between two images $f_1(m, n)$ and $f_2(m, n)$ is based on the singular value decomposition (SVD) of the matrix $G_{12}(k, l)$ (8), which is of rank one:

$$G_{12}(k, l) = e^{-j\alpha k}\, e^{-j\beta l} = \mathbf{u}\mathbf{v}^H \tag{16}$$

where H denotes complex-conjugate transpose.
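The rank-one structure in (16) implies that the phases of the dominant singular vectors change linearly with the index, with slopes determined by $\alpha$ and $\beta$. The following Python/NumPy sketch illustrates that idea under stated assumptions; it is not the Matlab implementation evaluated in this study. The names `subpixel_shift` and `svd_shift` are hypothetical, the synthetic shift is circular and noise-free, and the sign convention follows $f_2(m, n) = f_1(m - m_0, n - n_0)$.

```python
import numpy as np

def subpixel_shift(f, m0, n0):
    """Hypothetical test helper: circularly shift f by (m0, n0) pixels via the DFT."""
    M, N = f.shape
    k = np.fft.fftfreq(M) * M  # signed row-frequency indices
    l = np.fft.fftfreq(N) * N  # signed column-frequency indices
    ramp = np.exp(-2j * np.pi * (k[:, None] * m0 / M + l[None, :] * n0 / N))
    return np.fft.ifft2(np.fft.fft2(f) * ramp).real

def svd_shift(f1, f2):
    """Estimate (m0, n0) from the dominant singular vectors of a low-frequency block of G12."""
    M, N = f1.shape
    cross = np.fft.fft2(f2) * np.conj(np.fft.fft2(f1))
    G = cross / np.maximum(np.abs(cross), 1e-12)  # normalized cross-power spectrum
    K, L = max(M // 10, 2), max(N // 10, 2)       # submatrix size floor(M/10) x floor(N/10)
    U, s, Vh = np.linalg.svd(G[:K, :L])
    u, vh = U[:, 0], Vh[0, :]                     # G ~ s[0] * outer(u, vh)
    # Fit straight lines to the unwrapped phases; the slopes are -alpha and -beta.
    slope_u = np.polyfit(np.arange(K), np.unwrap(np.angle(u)), 1)[0]
    slope_v = np.polyfit(np.arange(L), np.unwrap(np.angle(vh)), 1)[0]
    return -slope_u * M / (2 * np.pi), -slope_v * N / (2 * np.pi)
```

The arbitrary constant phase that the SVD assigns to each singular vector only changes the intercept of the fitted lines, so it does not affect the estimated slopes.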
For the given matrix $G_{12}(k, l)$ one should find the dominant complex vectors $\mathbf{u}$ and $\mathbf{v}$ associated with the maximal singular value, calculate and unwrap their phases, and perform 1D least-squares fitting of these phases to straight lines, in a similar manner as shown above, in order to calculate the values of $\alpha$ and $\beta$ (see Fig. 5). Only part of the $G_{12}(k, l)$ matrix can be used in the calculation to speed up the computations (e.g. 25 × 11 pixels were used in the example presented in Fig. 5). For laryngeal HSV recordings the best results have been obtained for submatrices with dimensions $\lfloor M/10 \rfloor \times \lfloor N/10 \rfloor$, where $\lfloor \cdot \rfloor$ means rounding down to the nearest integer value.

Fig. 5. Calculation of motion parameters $\alpha$ and $\beta$ from the dominant complex vectors $\mathbf{u}$ and $\mathbf{v}$ of the matrix $G_{12}(k, l)$ (16). Panels: unwrap(angle(u)) [rd] and unwrap(angle(v)) [rd], each plotted against the sample index.

2.2. Proposed Motion Compensation Method

The block diagram of the proposed two-step hierarchical LS motion estimation algorithm is presented in Fig. 6. Modules marked with dashed lines are only used when relative displacements greater than 1 pixel occur, which is often the case with laryngeal HSV recordings. Thus, in the proposed method, the integer shift is first computed and subtracted, and then the phase of the cross-power spectrum is calculated. Next, the first LS iteration (LS #1) is performed making use of (13)-(15), with the $\varphi_{12}(k, l)$ samples that fulfill the condition:

$$-0.5\ \mathrm{rd} < \varphi_{12}(k, l) < 0.5\ \mathrm{rd}, \quad 0 \le k \le 0.05M, \quad 0 \le l \le 0.05N. \tag{17}$$

The computed shifts $m_0$ and $n_0$ are used for calculating the reference phase plane (denoted as Ref) and the standard deviation of $\varphi_{12}(k, l)$ around it (denoted as Std). Finally, the second LS iteration (LS #2) is executed, this time using the $\varphi_{12}(k, l)$ samples that fulfill the condition:

$$\mathrm{Ref} - 3\,\mathrm{Std} < \varphi_{12}(k, l) < \mathrm{Ref} + 3\,\mathrm{Std}, \quad 0 \le k \le 0.1M, \quad 0 \le l \le 0.1N, \tag{18}$$

allowing higher precision in finding the image shifts $m_0$ and $n_0$. The optimal sizes of the $\varphi_{12}(k, l)$ submatrices used in (17) and (18), assuring both minimum mean square error and fast computation, were found experimentally.
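The two LS iterations selected by (17) and (18) can be sketched as follows. This is a simplified Python/NumPy illustration, not the Matlab implementation used in this study. It assumes a displacement below one pixel, so the integer-shift stage is omitted; it computes Std over the whole LS #2 window, which is one possible reading of the description; and it adds a small numerical floor to the 3·Std band so that noise-free data do not produce an empty selection. The names `subpixel_shift`, `ls_plane_fit` and `two_step_ls` are hypothetical.

```python
import numpy as np

def subpixel_shift(f, m0, n0):
    """Hypothetical test helper: circularly shift f by (m0, n0) pixels via the DFT."""
    M, N = f.shape
    k = np.fft.fftfreq(M) * M
    l = np.fft.fftfreq(N) * N
    ramp = np.exp(-2j * np.pi * (k[:, None] * m0 / M + l[None, :] * n0 / N))
    return np.fft.ifft2(np.fft.fft2(f) * ramp).real

def ls_plane_fit(phi, k, l, mask):
    """Solve A x = b in the LS sense, with rows [k, l], x = [alpha, beta], b = -phi."""
    A = np.column_stack((k[mask], l[mask])).astype(float)
    b = -phi[mask]
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[0], x[1]

def two_step_ls(f1, f2):
    """Two-step LS estimate of sub-pixel shifts (m0, n0); integer pre-shift omitted."""
    M, N = f1.shape
    # Phase of the cross-power spectrum, as in (8) and (12).
    phi = np.angle(np.fft.fft2(f2) * np.conj(np.fft.fft2(f1)))
    k, l = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")

    # LS #1: small phases inside the 0.05M x 0.05N window, condition (17).
    mask1 = (np.abs(phi) < 0.5) & (k <= 0.05 * M) & (l <= 0.05 * N)
    alpha, beta = ls_plane_fit(phi, k, l, mask1)

    # Reference plane and spread of the phase around it (assumed: over the LS #2 window).
    window2 = (k <= 0.1 * M) & (l <= 0.1 * N)
    ref = -(alpha * k + beta * l)
    std = np.std((phi - ref)[window2])
    band = max(3.0 * std, 1e-9)  # assumed floor for noise-free input

    # LS #2: samples within Ref +/- 3 Std inside the 0.1M x 0.1N window, condition (18).
    mask2 = window2 & (np.abs(phi - ref) < band)
    alpha, beta = ls_plane_fit(phi, k, l, mask2)
    return alpha * M / (2 * np.pi), beta * N / (2 * np.pi)
```

On noisy images the second pass is expected to discard phase samples that deviate strongly from the plane found in the first pass, which is the reason for widening the window only after a reference plane is available.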
The reason for choosing a scaling factor of 0.1 in (18) is explained by the data shown in Fig. 7, which is based on results from tests conducted on artificially shifted smoothed differential HSV images, as in [1]. Even though the minimum MSE of the motion estimate is obtained for a scaling factor equal to 0.2, a more conservative, safe value of 0.1 is chosen in LS #2 due to observations made when applying the proposed method to noisy laryngeal HSV images.

[Fig. 6 block diagram; recoverable labels: Image 1, Image 2, FFT 2D, Cross-power spectrum, (.)*, Subtract integer shift, Angle(.), LS #1, Threshold, Find Ref phase plane, LS #2; LS #1 dimensions: 0.05M × 0.05N; choosing samples of Angle: 0.5 rd]