

Displays 34 (2013) 142–152


Natural perception in dynamic stereoscopic augmented reality environments

Fabio Solari*, Manuela Chessa, Matteo Garibotti, Silvio P. Sabatini
Department of Informatics, Bioengineering, Robotics and System Engineering – DIBRIS, University of Genoa, Via all'Opera Pia 13, 16145 Genova, Italy


Article history: Available online 11 August 2012

Keywords: 3D display; Eyes and head tracking; Asymmetric frustums; Depth perception; Stereoscopic virtual reality; Moving viewer

http://dx.doi.org/10.1016/j.displa.2012.08.001

* Corresponding author. E-mail address: [email protected] (F. Solari). URL: http://www.pspc.unige.it (F. Solari).

Abstract

Notwithstanding the recent diffusion of stereoscopic 3D technologies for the development of powerful human-computer interaction systems based on augmented reality environments, with the conventional approaches an observer freely moving in front of a 3D display can experience a misperception of the depth and of the shape of virtual objects. Such distortions can cause eye fatigue and stress in entertainment applications, and they can have serious consequences in scientific and medical fields, where a veridical perception of the scene layout is required. We propose a novel technique to obtain augmented reality systems capable of correctly rendering 3D virtual objects to an observer who changes his/her position in the real world and acts in the virtual scenario. By tracking the positions of the observer's eyes, the proposed technique generates the correct virtual viewpoints through asymmetric frustums, thus obtaining the correct left and right projections on the screen. The natural perception of the scene layout is assessed through three experimental sessions with several observers.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The current aim in the field of Virtual Reality (VR) and Augmented Reality (AR) systems is to create environments and situations reasonably similar to those of the real world. This is known as the ecological validity of such environments [1]. In order to simulate reality, VR/AR systems should allow the users to interact with the virtual environment through different modalities, the most natural of which is the movement of the body. In this way, the user's movements and the sensory flow of the AR environment are tightly related. Thus, a major requirement for such systems is to take into account the coupling between the observer's movements and the generation of the VR stimuli. The introduction of such new modalities produces new kinds of stimulation; for instance, it has been shown that motion parallax evokes vergence eye movements [2]. In order to use this evidence as an additional multimodal stimulation, instead of letting it become a further cause of visual fatigue for the observer (see [3,4] for recent reviews), it is necessary to generate binocular disparity signals that do not conflict with the depth cues coming from motion parallax.

These aspects are particularly relevant, since there has been a rapidly growing interest in technologies for presenting stereo 3D imagery, both for professional applications, e.g. scientific visualization, medicine and rehabilitation systems [5–7,1], and for entertainment applications, e.g. 3D cinema and videogames [8,9].


1.1. Objectives of this work

In the conventional systems used for presenting stereoscopic 3D stimuli, an observer looking at a stereoscopic screen misperceives the depth, the shape, and the layout of the virtual scene whenever his/her eyes are in a position different from that of the virtual stereo cameras that generate the left and right image pairs projected on the screen. This problem is always present, since the observer is not fixed in the position of the virtual cameras. It is worth noting that translations of the observer's position produce a distortion of the perceived layout of the virtual scene, and that rotations of the observer's head yield a misalignment of the stereo pairs, thus producing difficulties in binocular fusion. Both effects can lead to misperception and visual fatigue. Tracking the position of the observer's head and, consequently, changing the position of the virtual stereo cameras does not properly solve the issue. We will show that it is necessary to detect the positions of the two eyes with respect to the screen, and to generate two appropriate projections (i.e. asymmetric frustums that originate from the detected eye locations and have their focal planes coincident with the screen), as further described in Section 3.

In particular, to overcome the limits of common VR/AR systems (see Section 1.2 for a complete review), we propose a novel technique for the rendering of 3D contents that is capable of minimizing the 3D shape misperception and the visual fatigue problems that arise when the viewer changes his/her position with respect to the screen. This yields a more natural interaction with the virtual environment. In particular, to design such a system we consider the following requirements:


– To maintain a consistent augmented reality environment, it is necessary to have a virtual world that is at each instant a virtual replica of the real world.

– The off-axis technique (i.e. the stereo rendering technique commonly adopted in conventional stereoscopic systems) is modified by using generalized asymmetric frustums, in order to take into account the different positions of the observer's eyes.

– For each different position of the observer's eyes, a corresponding stereo image pair is generated and displayed on the screen.

– A natural interaction between the user and the virtual environment should be achieved.

– The proposed technique can be implemented through affordable, everyday-use technologies.

1.2. Background and related work

Several studies devised specific geometrical parameters of the stereo acquisition setup (both actual and virtual) in order to induce the perception of depth in an observer [10]. In this way, one can create stereo pairs that are displayed on stereoscopic devices without introducing vertical disparity, and thus without causing discomfort to the users [11]. Moreover, spatial imperfections of the stereo images are another cause of visual discomfort while viewing binocular image pairs. In [12] the authors experimentally determined the level of discomfort experienced by an observer over a wide range of possible imperfections and distortions (see also [3,4] for recent reviews).

Two open issues are the vergence–accommodation conflict and the misperception of the 3D layout of the stimuli. The former is addressed by several works in the literature that describe the difficulty of perceptually rendering a large interval of 3D space without visual stress, since the eyes of the observer have to maintain accommodation on the display screen (i.e., at a fixed distance), thus breaking the natural relationship between accommodation, vergence eye movements [9], and the distance of the objects [13].

Fig. 1. A sketch of the geometry of the stereoscopic augmented reality environment when using the standard stereo rendering technique (a), and when using the proposed approach (b). (a) In the virtual environment a target is positioned in T, and the stereo cameras are placed in C0L,R, thus generating the left and right projections tL and tR on the projection plane. A real observer in the same position O0L,R of the virtual cameras will perceive the target T̂0 correctly, whereas he/she will misperceive the position (T̂1 and T̂2) of the target when looking at the screen from different positions (O1L,R and O2L,R). (b) In the proposed technique, the virtual cameras (C1L,R and C2L,R) are moved according to the different positions of the observer. This yields different projections of the target (t0L, t0R, t1L, t1R, t2L, t2R) on the projection plane, thus allowing a coherent perception of the target for the observer.

For a recent review on the vergence–accommodation conflict, see [14].

In this paper, we focus on the fact that the 3D shape, the depth and the scene layout are often misperceived when a viewer is freely positioned in front of stereoscopic displays [15]. It is worth noting that the augmented reality systems based on head-mounted projective displays rely on a different technology [16] and are affected by different problems [17,3,18]. Only few works in the literature address the problem of examining depth judgment in augmented or virtual reality environments in the peripersonal space (i.e. distances less than 1.5 m), where binocular disparity is a dominant depth cue [19]. Among them, in [20] the authors investigated depth estimation via a reaching task, but in their experiment the subjects could not freely move in front of the display. Moreover, only correcting methods useful in specific situations have been proposed in the literature, e.g. see [21–23]. In principle, the CAVE system [24,25] copes with the misperception problem in virtual reality scenarios, but it is an expensive and room-sized system, thus its solution cannot be easily applied for everyday use.

Despite the different solutions presented in the literature, to the knowledge of the authors there are no works that propose an effective and general solution for an observer who freely moves in front of a 3D monitor. Thus, the solution presented in this paper is relevant both in entertainment applications, where such distortions might cause visual fatigue and should be avoided [3], and in medical and surgery applications, or in cognitive rehabilitation systems and applications for the study of visuo-motor coordination, where they can have serious implications. This is especially true in augmented reality applications, where the user perceives real and virtual stimuli at the same time; it is therefore necessary that the rendering of the 3D information does not introduce undesired distortions.

2. The misperception problem with the standard approach

When looking at a virtual scene, in order to obtain a natural perception of the 3D scene layout and to allow an observer "to see" the virtual objects as if they were real, both eyes of the observer must be positioned in the centers of projection defined by the positions of the virtual stereo cameras [21,15]. If the eyes are in the correct position, then the retinal images originated by viewing the 3D stereo display and the ones originated by looking at the real scene are identical. If this constraint is not satisfied, a misperception of the 3D positions of the scene points occurs, see Fig. 1a. The virtual stereo cameras positioned in C0 (for the sake of simplicity, we omit the superscript, i.e. C0L and C0R denote the positions of the left and right cameras, respectively) determine the left and right projections tL and tR of the virtual target T on the projection plane. The straight lines that link the projections and the observer's eyes are the visual rays. An observer located in the same position of the virtual cameras (O0 = C0) will perceive the target in a position T̂0 coincident with the true position. Otherwise, an observer located in a different position (Oi ≠ C0) will experience a misperception of the location of the target (T̂i ≠ T). As a consequence, when looking at a segment one misperceives both its position and its orientation (see Fig. 2a) and, in general, when looking at an object (e.g. a triangle) the observer experiences a misperception of the location and of the shape of the object (see Fig. 2b). It is worth noting that also small rotations of the observer's head about the vertical axis (yaw rotation) or about the forward axis (roll rotation) produce non-intersecting (skew) visual rays, thus a misperception of the 3D stimuli occurs [15].

Fig. 2. The deformation of the 3D shape when displaying a horizontal bar and a triangle with the standard technique. The solid lines represent the visual rays and the virtual target from a point of view coincident with the position of the virtual stereo camera. When the observer is in a different position, a misperception of the position and of the shape occurs (dotted lines).

Fig. 3. The view volume of a virtual camera is defined by the straight lines that originate from the center of projection C (i.e. the camera position) and intersect the focal plane (projection plane) P in the points defined by the top left (TL), bottom left (BL), and top right (TR) values. The frustum of the camera is bounded by the near (N) and the far (F) planes. The camera frame and the monitor frame are centered in C and M, respectively.
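To make the misperception geometry of Fig. 1a concrete, the following is a minimal C++ sketch (illustrative, not taken from the paper) that estimates the position perceived by a displaced observer as the point closest to both visual rays, i.e. the rays joining each eye to the corresponding on-screen projection; when head rotations make the rays skew, the midpoint of their common perpendicular is used. All numerical values (camera distance, baseline, eye displacement) are assumptions chosen only for the example.

```cpp
#include <cstdio>

struct V3 { double x, y, z; };
static V3 operator-(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 operator+(V3 a, V3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static V3 operator*(V3 a, double k) { return {a.x * k, a.y * k, a.z * k}; }
static double dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Point perceived by the observer: the midpoint of the common perpendicular
// of the two visual rays (eye -> on-screen projection), which equals the
// ray intersection when the rays actually meet.
V3 perceivedPoint(V3 eyeL, V3 tL, V3 eyeR, V3 tR) {
    V3 d1 = tL - eyeL, d2 = tR - eyeR, w = eyeL - eyeR;
    double a = dot(d1, d1), b = dot(d1, d2), c = dot(d2, d2);
    double d = dot(d1, w),  e = dot(d2, w);
    double den = a * c - b * b;                 // > 0 for non-parallel rays
    double s = (b * e - c * d) / den;
    double t = (a * e - b * d) / den;
    V3 q1 = eyeL + d1 * s, q2 = eyeR + d2 * t;
    return (q1 + q2) * 0.5;
}

int main() {
    // Screen plane at z = 0; projections of a target at (0, 0, 800) mm,
    // computed for virtual cameras at z = 1200 mm with a 65 mm baseline.
    V3 tL{ 65.0, 0.0, 0.0}, tR{-65.0, 0.0, 0.0};
    // Observer's eyes displaced 100 mm to the right of the virtual cameras.
    V3 eyeL{ 67.5, 0.0, 1200.0}, eyeR{132.5, 0.0, 1200.0};
    V3 T = perceivedPoint(eyeL, tL, eyeR, tR);
    std::printf("perceived target: (%.1f, %.1f, %.1f) mm\n", T.x, T.y, T.z);
    return 0;
}
```

With these values the displaced observer perceives the target at roughly (67, 0, 800) mm instead of its true position (0, 0, 800) mm, which is the kind of layout distortion illustrated in Fig. 2.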

3. The correct perception with the proposed approach: True Dynamic 3D

In order that the observer always perceives the 3D shape of the objects coherently, for each different position of his/her eyes a corresponding stereo image pair should be generated and displayed on the screen (see Fig. 1b). The solution proposed in this work, True Dynamic 3D (TD3D), relies upon the following concepts:

– Compensation of the observer's movements and rotations by computing the positions of his/her eyes and by positioning the virtual left and right cameras in the same locations.

– Consistent maintenance of an augmented reality environment by generating a virtual world that is at each moment a virtual replica of the real world. Thus, the real screen and the virtual projection plane should always coincide.

– Correct generation of the left and right views projected on the screen through asymmetric frustums that are updated with respect to the eyes' positions, and not by roto-translating the virtual cameras, as described in the following.

In VR/AR systems, to project the virtual scene on the screen, a virtual camera has to be placed inside the virtual environment. Such a camera has a field of view defined by its view volume or frustum, which is described by the near (N), the focal or projection (P) and the far (F) planes, and by their distances from the camera, dnear, dfocal and dfar, respectively [26,27] (see Fig. 3). The focal plane is described by three points, top left (TL), bottom left (BL), and top right (TR), that are its intersections with the straight lines originating from the center of projection (C) of the camera. In order to display a stereo 3D scene and generate a perception of depth for an observer, the method known as "parallel axis asymmetric frustum perspective projection", or off-axis technique, is commonly adopted. In the off-axis technique, the stereo images are obtained by projecting the objects in the scene onto the focal plane of each camera; such a projection plane has the same position and orientation for both camera projections (see Fig. 4). Nevertheless, if the observer moves in front of the screen, it is necessary to take into account the position of his/her eyes, as previously discussed (see Fig. 1). A roto-translation of the stereo cameras, according to the position of the observer, produces a "wrong" projection on the screen, since the left and the right projection planes rotate with respect to the screen, thus yielding an augmented reality environment that is not consistent with the real one (see Fig. 5). Consequently, the observer misperceives the depth and the 3D scene layout.

Fig. 4. The asymmetric frustums (top view) for the standard stereo off-axis technique: the technique usually used to generate a perception of depth for an observer. The baseline denotes the distance between the observer's eyes.

Fig. 5. The asymmetric frustums (top view) roto-translated according to the position of the eyes of the observer: the left and the right focal planes are not coincident with the screen, thus the augmented reality environment is not consistent with the real world.

Fig. 6. The generalized asymmetric frustums (top view) obtained by using our technique: the left and right focal planes are coincident with the screen, thus the virtual views are a replica of the real-world ones.

The TD3D solution proposed here overcomes this problem through a generalization of the asymmetric frustums, in order to keep the left and the right focal planes always coincident with the screen (see Fig. 6). This generalization, which is original since it is not implemented in the available systems for the stereoscopic rendering of 3D scenes, can be formulated as follows (see Fig. 3 for the notation; a minimal code sketch follows the list):

– The focal plane is described by the 3D points MTL, MBL and MTR with respect to the monitor coordinate system, denoted by M.

– The 3D positions of the observer's eyes and, consequently, of the left and right cameras MC(n)L and MC(n)R are computed with respect to the monitor frame. These positions are continuously updated as a function of the time step n.

– The left and the right focal planes have to be described with respect to the left and right camera frames. To achieve this transformation, it is necessary:
  – to compute the translations T(n)L,R = −MC(n)L,R;
  – to apply the translations to the 3D points MTL, MBL and MTR in order to compute CTL(n)L,R, CBL(n)L,R and CTR(n)L,R, i.e. the coordinates of the focal planes with respect to the left and right camera frames.
– Once CTL(n)L,R, CBL(n)L,R and CTR(n)L,R have been computed, the generalized left and right asymmetric frustums are defined as a function of the time step n (see Appendix A for further details).
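The minimal C++ sketch referenced above illustrates this per-frame update; it assumes the screen corners are already expressed in the monitor frame and that the tracked eye positions are available in the same frame. The Vec3 and Frustum types and all function names are illustrative, not the paper's code.

```cpp
#include <array>

// Minimal 3D vector type used only for this illustration.
struct Vec3 { double x, y, z; };

static Vec3 operator-(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

// Focal-plane corners of one generalized asymmetric frustum, expressed in the
// camera (eye) frame, plus the near and far clipping distances.
struct Frustum {
    Vec3 TL, BL, TR;     // top-left, bottom-left, top-right of the focal plane
    double dNear, dFar;
};

// One eye at position eyeM (monitor frame): the screen corners, fixed in the
// monitor frame, are translated into the camera frame, i.e. T(n) = -MC(n).
Frustum updateFrustum(const Vec3& eyeM,
                      const Vec3& screenTL_M, const Vec3& screenBL_M,
                      const Vec3& screenTR_M,
                      double dNear, double dFar) {
    Frustum f;
    f.TL = screenTL_M - eyeM;   // CTL(n)
    f.BL = screenBL_M - eyeM;   // CBL(n)
    f.TR = screenTR_M - eyeM;   // CTR(n)
    f.dNear = dNear;
    f.dFar = dFar;
    return f;
}

// Called at every tracker update (time step n) for both eyes.
std::array<Frustum, 2> updateStereoFrustums(const Vec3& leftEyeM, const Vec3& rightEyeM,
                                            const Vec3& TL_M, const Vec3& BL_M, const Vec3& TR_M,
                                            double dNear, double dFar) {
    return {{ updateFrustum(leftEyeM,  TL_M, BL_M, TR_M, dNear, dFar),
              updateFrustum(rightEyeM, TL_M, BL_M, TR_M, dNear, dFar) }};
}
```

Because the screen corners never move in the monitor frame, each focal plane stays coincident with the physical screen regardless of where the observer stands, which is the property the generalized frustums are designed to preserve.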

4. Implementation of the proposed technique

Considering the availability of commercial products with high performance and affordable costs, we decided to use off-the-shelf devices to design and develop an AR system that implements our TD3D solution. This choice was made in response to one of the common criticisms of virtual reality, i.e. the relatively high cost of advanced VR systems [1]. However, the hardware devices and the software libraries used to implement our technique can be replaced with equivalent ones at any time. The key component of the system is the implementation of the proposed generalized asymmetric frustums, which are continuously updated as a function of the position of the observer's eyes (see Section 3).

4.1. Hardware and software components

To measure the 3D position of the real-world points, we used the XBox Kinect, a motion sensing input device developed by Microsoft for the XBox 360 video game console, based on an RGB camera and on an infrared (IR) depth sensor. The depth sensor consists of an IR projector combined with a monochrome camera, which captures video data at a frame rate of 30 Hz, in an operation range between 0.6 m and 3.5 m, with a resolution in depth of 1 cm at 1 m of distance from the sensor [28].

All the software modules have been developed in C++, using Microsoft Visual Studio 10. To render the stereoscopic virtual scene in quad buffer mode we used the Coin3D graphic toolkit (www.coin3D.org), a high-level 3D graphics toolkit for developing cross-platform real-time 3D visualization, properly modified in order to implement our solution, as will be detailed in the following. To access the data provided by the Microsoft XBox Kinect, we used the open source driver released by PrimeSense (www.primesense.com), the company that developed the 3D technology of the Kinect. The localization and the tracking of the head and of the hand rely on the OpenNI framework (www.openni.org), a set of open source Application Programming Interfaces (APIs). These APIs provide support for access to natural interaction devices, allowing body motion tracking, hand gesture and voice recognition. The processing of the images acquired by the Kinect RGB camera has been performed through the OpenCV library (opencv.willowgarage.com).

Fig. 7. The developed setup for the augmented reality system. The reported measures refer to the specific setup considered for the experimental results.

Both the development and the testing phases have been conducted on a PC equipped with an Intel Core i7 processor, 12 GB of RAM, an Nvidia Quadro 2000 video card with 1 GB of RAM enabled for 3D Vision Pro, and a 27-in. Acer HN274H 3D monitor.

Fig. 7 shows the setup scheme of our system. The XBox Kinect is located on the top of the monitor, centered on the X axis, and slightly rotated around the same axis. This configuration was chosen because it allows the Kinect to have good visibility of the user, without the Kinect being interposed between the user and the monitor.

4.2. The developed software

After the startup phase, each time new data is available from the Kinect, the proposed system performs a series of steps to recompute the stereoscopic representation of the 3D virtual world. These steps can be summarized as follows:

– Detection of the position of the eyes of the user in the image plane of the Kinect RGB camera. This is achieved first by obtaining the position of the head from the skeleton produced by the OpenNI libraries, and then by performing a segmentation and a tracking of colored markers, positioned around the observer's eyes, in the sub-image centered in the detected position of the head. The segmentation and the computation of the orientation of the segmented areas are performed by using the OpenCV libraries.

– Computation of the positions of the eyes in the real world by combining their positions in the image plane and the corresponding depths (from the Kinect depth camera). The spatial displacement between the RGB and depth cameras has been taken into account (a back-projection sketch is given after this list).

– Computation and generation of the asymmetric frustums following the solution described in Section 3. Since the existing graphics libraries do not support the generation of such stereoscopic frustums, we have extended the behavior of the Coin3D camera class. In particular, we have derived a new class from the SoPerspectiveCamera class in order to modify the glRender method to implement our solution. In this way, each time the asymmetric frustums are recomputed, according to the detected position of the eyes, the left and the right images are rendered on the screen.
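As mentioned in the second step above, the following is a minimal sketch of the pixel-plus-depth back-projection, assuming a pinhole model for the RGB camera and a depth map already registered to it; the intrinsic parameters and pixel values are placeholders, not those of the actual device.

```cpp
#include <cstdio>

struct Point3 { double x, y, z; };

// Assumed pinhole intrinsics of the RGB camera (placeholder values only).
struct Intrinsics { double fx, fy, cx, cy; };

// Back-project a pixel (u, v) with measured depth z (in mm) to a 3D point
// expressed in the camera frame.
Point3 backProject(const Intrinsics& K, double u, double v, double z) {
    return { (u - K.cx) * z / K.fx,
             (v - K.cy) * z / K.fy,
             z };
}

int main() {
    const Intrinsics K{525.0, 525.0, 319.5, 239.5};   // illustrative intrinsics
    // Pixel position of a tracked eye marker and its measured depth.
    const Point3 eye = backProject(K, 400.0, 220.0, 1200.0);
    std::printf("eye position (sensor frame): %.1f %.1f %.1f mm\n",
                eye.x, eye.y, eye.z);
    return 0;
}
```

The resulting point is expressed in the sensor frame; the roto-translation obtained from the calibration of Section 4.3 would then map it into the monitor frame used by the frustum computation.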

To achieve the interaction with the user, we compute the position of the index finger of the user, through a colored marker, in the image plane of the Kinect RGB camera. The 3D position of the finger is obtained by a procedure similar to the one used to detect the position of the eyes.

4.3. System calibration

To align the coordinate systems of the XBox Kinect and of the monitor, we performed a calibration step by taking into consideration a set of world points, whose coordinates are known with respect to the monitor coordinate system, and their positions derived from the Kinect. In this way, it is possible to obtain the roto-translation matrix between the Kinect coordinate system and the monitor reference frame. In particular, the angle ay between the Y axes of the two coordinate systems in the considered experimental setup configuration is 8°.
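For illustration, the following is a minimal sketch, not the authors' calibration code, of one standard way to recover such a roto-translation from point correspondences (the Kabsch/Procrustes method via SVD), written with the OpenCV types already used by the system; the function name and the assumption of at least three non-collinear correspondences are ours.

```cpp
#include <opencv2/core/core.hpp>
#include <vector>

// Estimate the rigid transform q ~= R*p + t that maps Kinect-frame points p
// onto monitor-frame points q, from N >= 3 non-collinear correspondences.
void estimateRigidTransform(const std::vector<cv::Point3d>& kinectPts,
                            const std::vector<cv::Point3d>& monitorPts,
                            cv::Mat& R, cv::Mat& t) {
    const int n = static_cast<int>(kinectPts.size());

    // Centroids of the two point sets.
    cv::Mat pMean = cv::Mat::zeros(3, 1, CV_64F);
    cv::Mat qMean = cv::Mat::zeros(3, 1, CV_64F);
    for (int i = 0; i < n; ++i) {
        cv::Mat pi = (cv::Mat_<double>(3, 1) << kinectPts[i].x, kinectPts[i].y, kinectPts[i].z);
        cv::Mat qi = (cv::Mat_<double>(3, 1) << monitorPts[i].x, monitorPts[i].y, monitorPts[i].z);
        pMean += pi;
        qMean += qi;
    }
    pMean = pMean * (1.0 / n);
    qMean = qMean * (1.0 / n);

    // Cross-covariance H = sum_i (q_i - qMean)(p_i - pMean)^T.
    cv::Mat H = cv::Mat::zeros(3, 3, CV_64F);
    for (int i = 0; i < n; ++i) {
        cv::Mat pi = (cv::Mat_<double>(3, 1) << kinectPts[i].x, kinectPts[i].y, kinectPts[i].z);
        cv::Mat qi = (cv::Mat_<double>(3, 1) << monitorPts[i].x, monitorPts[i].y, monitorPts[i].z);
        H += (qi - qMean) * (pi - pMean).t();
    }

    // R = U * diag(1, 1, det(U*Vt)) * Vt guarantees a proper rotation.
    cv::Mat w, U, Vt;
    cv::SVD::compute(H, w, U, Vt);
    cv::Mat D = cv::Mat::eye(3, 3, CV_64F);
    D.at<double>(2, 2) = (cv::determinant(U * Vt) < 0.0) ? -1.0 : 1.0;
    R = U * D * Vt;
    t = qMean - R * pMean;
}
```

In a setup like the one described above, the recovered (R, t) would be applied to every eye and finger position measured by the Kinect before the per-frame frustum update of Section 3.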

In order to evaluate the precision of the measurements achievable by the system, we performed a test session, from which we obtained an uncertainty of about 1 cm (at 1 m of distance from the sensor) on the three axes in real-world coordinates, both for the 3D positions of the eyes and for that of the finger.

Fig. 8. Perceived positions of the target point (red circles), and the related eyes' positions (yellow circles) for the first experiment. The straight line between the yellow circles represents the binocular baseline; the dotted lines indicate the visual rays between the eyes and the perceived target. The virtual target is positioned at (0, 0, 800) mm. The less scattered perceived positions of the target confirm a natural perception with the TD3D rendering technique (bottom); see text for details.

5. Experimental results

To quantitatively assess whether the proposed TD3D rendering technique and the developed augmented reality system allow a natural perception of the 3D position and shape of virtual objects (i.e. a coherent perception of both real and virtual objects positioned in the scene), and a better interaction with the user, we have performed three types of experiments. In the first one, the observer was asked to reach a virtual target point. In the other two cases, the task was to trace the profile of a virtual horizontal bar and of a triangle, respectively. In all cases, the scene has been observed from different positions, and we have measured the positions of the eyes and of the finger of the observer during the execution of the tasks. The experiments have been performed both by using a standard rendering technique and the proposed TD3D approach, which actively modifies the rendered images with respect to the position of the observer's eyes.

The aim of these experiments is to verify whether the observer is able to estimate the position and the shape of virtual objects similarly to what he/she would do with real objects positioned in the same 3D locations.

The virtual scenes have been designed in order to minimize the effects of depth cues other than the stereoscopic one, such as shadows and perspective, for the first and second experiments. In the third case a spatial structure, and thus perspective effects, is considered. It is worth noting that, since we use a real finger to point where a virtual object is perceived by the observer, we obtain a real estimate of the scene layout. If the task were performed by using a virtual finger, e.g. a virtual cursor that replicates the user's finger position inside the virtual environment, we would obtain only a relative estimate of the 3D position, since both the virtual target and the cursor would be affected by the same distortions [21].

For the first two experiments, 18 subjects were chosen, with ages ranging from 22 to 44. Each subject performed the two tasks looking at the scene from four different positions. For the third experiment, 10 subjects were chosen, with ages ranging from 22 to 32. Each one performed the task for three different positions. All participants (males and females) had normal or corrected-to-normal vision. Each experiment has been performed twice, by using a standard rendering technique and our TD3D technique.

Fig. 9. The mean errors and the standard deviation values of the perceived 3D position of the virtual target, obtained with the standard rendering technique (darker gray) and with the developed TD3D technique (lighter gray). The resulting values, (49 ± 42, 31 ± 18, 81 ± 66) and (14 ± 10, 14 ± 7, 18 ± 14) respectively, show a better perception of the 3D position with our technique.

The participants were not informed of the differences between the two techniques, and the presentation of the stimuli with the two rendering techniques was chosen randomly. The movements of the subjects in front of the monitor were not predetermined, since they were completely free to observe the scene from any position, with the only constraint of maintaining reaching distance.

5.1. First experiment: 3D position perception of a point

In this experiment, the virtual target point is represented by the nearest right bottom vertex of a frontal textured cube, whose width is 25 mm, positioned 800 mm from the center of the monitor towards the observer; thus the 3D perception of the virtual object is between the user and the screen.

Fig. 8 shows the 3D positions of the eyes (yellow circles) and of the perceived target point (red circles). The results confirm that the 3D position estimation obtained with the TD3D technique is better than that obtained with the standard technique. Accordingly, the volume where the target is perceived is very small. This indicates a veridical perception of the virtual target from any observer eye position. Such a volume can be described in terms of the mean and standard deviation values of the perceived 3D points. With the standard technique, the values, expressed in mm, are (18 ± 65, −16 ± 22, 775 ± 102), whereas with the TD3D we obtain (21 ± 14, −2 ± 7, 799 ± 23). Since the positions of the observers are uniformly distributed in the work space, the perceived positions of the target are uniformly distributed around the actual target position, thus yielding mean values comparable between the two systems. The misperception is represented by the spread of the perceived positions of the target, and it can be quantified by the standard deviation values.

Fig. 10. Perceived positions of the target line (red segments), and the related eyes' positions (yellow circles) for the second experiment. The straight line between the yellow circles represents the binocular baseline; the dotted lines indicate the visual rays between the eyes and the perceived end points of the segment. The virtual line is positioned at (0, 0, 800) mm. The less scattered perceived positions of the target (bottom) confirm a veridical perception with the TD3D rendering technique (see text for details).

Fig. 11. The mean errors and the standard deviation values of the perceived 3D position of the midpoint of the virtual target line, obtained with the standard rendering technique (darker gray) and with the developed TD3D technique (lighter gray). The resulting values, (52 ± 43, 22 ± 16, 50 ± 46) and (7 ± 6, 12 ± 7, 15 ± 9) respectively, show a better perception of the 3D position with our technique.

Fig. 12. The mean errors and the standard deviation values of the perceived length and 3D orientation of the virtual target line, obtained with the standard rendering technique (darker gray) and with the developed TD3D technique (lighter gray). The resulting values for the perceived length, expressed in mm, are 26 ± 16 and 15 ± 11, respectively, and for the perceived orientation, expressed in degrees, are 8.3 ± 5.8 and 5.6 ± 3.9, respectively. Both results confirm a better perception of the 3D scene layout with our technique.

In order to take into account where the users have moved, we also computed the mean and the standard deviation values of the midpoint of the eyes' positions, with both rendering techniques. The eyes' positions acquired when the stimuli are presented with the standard rendering technique can be spread over a wider volume with respect to the TD3D technique. This happens since, as a consequence of the active modification of the point of view, the target may not be visible from lateral views. In particular, the TD3D technique moves the stereo projections on the screen in order to compensate for the observer's movements, thus such projections can also fall outside the screen. This behavior is consistent with real-world situations, in which objects that are seen through a window disappear when the observer moves laterally. The resulting mean and standard deviation values of the midpoint of the eyes' positions, expressed in mm, are (−10 ± 97, −39 ± 38, 1216 ± 113) for the standard technique, and (3 ± 74, −23 ± 26, 1232 ± 104) for the developed TD3D; thus the movement volume V_mov^std is twice V_mov^TD3D. To determine how such movement volumes are related to the perceived target volumes, V_perc^std and V_perc^TD3D, we computed the ratio μ between the two volumes in both cases, μ^std = V_mov^std / V_perc^std and μ^TD3D = V_mov^TD3D / V_perc^TD3D. The greater the obtained value, the better the performance of the system, since a large movement volume is related to a small uncertainty in the perception of the target. In this experiment, the ratio μ^TD3D = 89 is much greater than the one computed for the standard technique, μ^std = 3, thus indicating a relevant improvement in the 3D localization of the true position of the virtual target with our technique (i.e. the virtual target is perceived as if it were a real object).

Moreover, Fig. 9 shows the mean errors and the standard deviation values of the perceived coordinates of the 3D point, computed with respect to the known values, for the two techniques. The considerably smaller errors and standard deviations obtained with the TD3D technique confirm the validity of the approach.

5.2. Second experiment: 3D position and orientation perception of a line

In this experiment, the virtual target line is represented by a frontal horizontal textured bar, whose width and height are 100 mm and 25 mm respectively, and whose midpoint is positioned 800 mm from the center of the monitor, towards the observer.

Fig. 10 shows the 3D positions of the eyes (yellow circles) and the 3D positions of the perceived bar (red segments). The mean errors and the standard deviation values of the perceived 3D position of the midpoint of the virtual target line (Fig. 11), and of the perceived 3D orientation and length (Fig. 12), are much smaller with our technique than the ones obtained with the standard approach.

The results obtained in the first experiment are confirmed for this task, too. In particular, with our solution, we obtain a consistent reduction of the volume in which the perceived positions of the bar are scattered. With the standard technique the mean and the standard deviation values of the midpoint of the target bar, expressed in mm, are (7 ± 67, −8 ± 26, 794 ± 68), whereas with the TD3D technique the values are (7 ± 7, −5 ± 13, 796 ± 17). As for the first experiment, the ratio between the movement volume and the perceived bar position volume for the TD3D technique (μ^TD3D = 85) is much larger than the one in the conventional case (μ^std = 3).

5.3. Third experiment: 3D position and shape perception of a triangle

To test the perception of the object shape, we consider a simple geometric shape oriented in 3D. In particular, in this experiment, the virtual target is represented by a textured equilateral triangle, whose sides are 100 mm, whose barycenter is positioned at (0, 14.3, 775) mm from the center of the monitor, and whose normal is (0, 0.87, 0.5).

Fig. 13 shows the 3D positions of the eyes (yellow circles) and the 3D positions of the perceived triangle (red segments). Fig. 14 shows the mean errors and the standard deviation values of the perceived 3D position of the barycenter of the virtual target triangle, and of the perceived length of the triangle sides. Moreover, we have computed the interior angles and the normal vector of the perceived triangle by using the values of the perceived 3D vertexes. Fig. 15 shows the mean errors and the standard deviations of such values.

Fig. 13. Perceived positions of the target triangle (red segments), and the related eyes' positions (yellow circles) for the third experiment. The notation is the same as in Figs. 8 and 10, simplified for representational purposes. The less scattered perceived positions of the target (bottom) confirm a natural perception with the TD3D rendering technique (see text for details).

Also in this case, with our solution, we obtain a consistent reduction of the volume in which the perceived positions of the triangle are scattered. With the standard technique the mean and the standard deviation values of the barycenter of the target triangle, expressed in mm, are (−1 ± 66, 24 ± 25, 761 ± 82), whereas with the TD3D technique the values are (10 ± 7, 31 ± 10, 749 ± 15). As for the previous experiments, the ratio between the movement volume and the perceived triangle position volume for the TD3D technique (μ^TD3D = 119) is much larger than the one in the conventional case (μ^std = 2).

6. Conclusions and future work

A novel stereoscopic rendering technique, True Dynamic 3D (TD3D), has been devised and described in this paper. Moreover, we have developed an augmented reality system that implements such a technique and allows a coherent perception of both virtual and real objects to a user acting in a virtual environment, by minimizing the misperception of the 3D position and of the 3D layout of the objects in the scene. This is achieved through a continuous tracking of the eye positions of the observer and a consequent recomputation of our generalized asymmetric frustums, in order to produce the correct left and right stereo image pairs displayed on the screen, and to allow a natural interaction between the user and the virtual environment.

The proposed solution overcomes the problems related to the conventional stereoscopic systems, in which, when the user freely moves in front of the screen, or simply rotates his/her head, distortions of the shape and of the position of virtual objects occur. This issue is relevant when an accurate interaction of a real observer with a virtual world is required, especially in scientific visualization, rehabilitation systems, or psychophysical experiments. Moreover, the 3D misperception can be a cause of visual fatigue also in entertainment applications.

To develop a working system we have modified a graphics library in order to implement the proposed TD3D rendering technique, since the existing libraries do not take into account the previously discussed problems, and we have used off-the-shelf technologies for the detection and the tracking of the eye positions (e.g., the Microsoft XBox Kinect) and for the visualization of the stereoscopic virtual scenario (e.g., a 3D monitor with shutter glasses). It is worth noting that the proposed TD3D solution is valid in general, and does not depend on the specific choice of the hardware used to implement the system.

Fig. 14. The mean errors and the standard deviation values of the perceived 3D position of the barycenter of the virtual target triangle and of the perceived length of the triangle sides, obtained with the standard rendering technique (darker gray) and with the developed TD3D technique (lighter gray). The resulting values, expressed in mm, are (49 ± 45, 23 ± 13, 67 ± 48) and (10 ± 6, 18 ± 8, 26 ± 15) respectively for the barycenter, and 23 ± 18 and 16 ± 13 respectively for the sides' length, showing a better perception of the 3D position with our technique.

Fig. 15. The mean errors and the standard deviation values of the perceived interior angles and of the normal vector of the virtual target triangle, obtained with the standard rendering technique (darker gray) and with the developed TD3D technique (lighter gray). The resulting values for the perceived interior angles, expressed in degrees, are 13.1 ± 10.86 and 8.4 ± 6.4, respectively, and for the perceived normal vector, expressed in degrees, are 20.6 ± 21.5 and 11.4 ± 9.7, respectively. Both results confirm a better perception of the 3D scene layout with our technique.

The performance of the proposed technique has been assessed by a quantitative analysis in three experiments performed by several participants acting in an augmented reality scenario. The results confirmed a better estimation of the 3D position and of the 3D scene layout with the proposed TD3D technique with respect to the conventional one; thus an augmented reality environment, where real and virtual objects are mixed, is coherent with respect to a real environment.

As future work, in order to further validate the proposed approach, an extensive experimental phase is foreseen, with a larger number of participants and with different and more complex tasks.

Acknowledgments

This work has been partially supported by EU Projects FP7-ICT 217077 "EYESHOTS" and FP7-ICT 215866 "SEARISE", and by the Italian MIUR Project (PRIN 2008) "Bio-inspired models for the control of robot ocular movements during active vision and 3D exploration".

Appendix A. Generalized asymmetric frustums

To describe the relationship between the coordinates of the focal planes CTL(n)L,R, CBL(n)L,R and CTR(n)L,R, computed with respect to the camera reference frame C, and the projections t(n)L and t(n)R of the virtual target point CTL,R, it is necessary to define the left and right projection matrices M(n)L,R as a function of the time step n.

In order to define such matrices it is necessary to describe the coordinates of the lower left (ll(n)L,R, bb(n)L,R, −dnear(n)L,R) and upper right (rr(n)L,R, tt(n)L,R, −dnear(n)L,R) corners of the near plane with respect to the camera reference frame C. The values ll(n)L,R, bb(n)L,R, rr(n)L,R, and tt(n)L,R are computed from the coordinates of the focal plane in the following way:

$$
\begin{aligned}
ll(n)^{L,R} &= \left( {}^{C}BL(n)^{L,R}\,\frac{d_{near}(n)^{L,R}}{d_{focal}(n)^{L,R}} \right)_{x}, &
bb(n)^{L,R} &= \left( {}^{C}BL(n)^{L,R}\,\frac{d_{near}(n)^{L,R}}{d_{focal}(n)^{L,R}} \right)_{y}, \\
rr(n)^{L,R} &= \left( {}^{C}TR(n)^{L,R}\,\frac{d_{near}(n)^{L,R}}{d_{focal}(n)^{L,R}} \right)_{x}, &
tt(n)^{L,R} &= \left( {}^{C}TR(n)^{L,R}\,\frac{d_{near}(n)^{L,R}}{d_{focal}(n)^{L,R}} \right)_{y}
\end{aligned}
\qquad \text{(A.1)}
$$

where (·)x and (·)y denote the x and y components. Thus the projection matrices M(n)L,R are defined as follows:

$$
M(n)^{L,R} = \begin{pmatrix}
m_{11} & 0 & m_{13} & 0 \\
0 & m_{22} & m_{23} & 0 \\
0 & 0 & m_{33} & m_{34} \\
0 & 0 & -1 & 0
\end{pmatrix}
\qquad \text{(A.2)}
$$

where

$$
\begin{aligned}
m_{11} &= \frac{2\,d_{near}(n)^{L,R}}{rr(n)^{L,R} - ll(n)^{L,R}}, &
m_{13} &= \frac{rr(n)^{L,R} + ll(n)^{L,R}}{rr(n)^{L,R} - ll(n)^{L,R}}, \\
m_{22} &= \frac{2\,d_{near}(n)^{L,R}}{tt(n)^{L,R} - bb(n)^{L,R}}, &
m_{23} &= \frac{tt(n)^{L,R} + bb(n)^{L,R}}{tt(n)^{L,R} - bb(n)^{L,R}}, \\
m_{33} &= -\frac{d_{far}(n)^{L,R} + d_{near}(n)^{L,R}}{d_{far}(n)^{L,R} - d_{near}(n)^{L,R}}, &
m_{34} &= -\frac{2\,d_{far}(n)^{L,R}\,d_{near}(n)^{L,R}}{d_{far}(n)^{L,R} - d_{near}(n)^{L,R}}
\end{aligned}
\qquad \text{(A.3)}
$$


Such matrices are applied to each virtual point to obtain the corresponding clipping coordinates. The target point CTL,R, described in homogeneous coordinates, is transformed in the following way:

$$
{}^{clip}T(n)^{L,R} = M(n)^{L,R}\;{}^{C}T^{L,R}
\qquad \text{(A.4)}
$$

The clipping coordinates are normalized through the perspective division, scaled and translated in order to obtain the projections t(n)L and t(n)R of the target point onto the projection plane.
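As a concrete illustration, the computation of Eqs. (A.1)-(A.3) maps onto a standard off-axis perspective matrix; the following minimal C++ sketch (illustrative, not the paper's code) builds it, column-major, from the camera-frame focal-plane corners. The Mat4 alias and parameter names are assumptions.

```cpp
#include <array>

// Column-major 4x4 matrix, as used by OpenGL-style pipelines.
using Mat4 = std::array<double, 16>;

// Build the generalized asymmetric projection matrix of Eqs. (A.1)-(A.3)
// from the focal-plane corners expressed in the camera (eye) frame.
// blX, blY: x and y of CBL(n); trX, trY: x and y of CTR(n);
// dFocal: distance of the focal plane (the screen) from the eye.
Mat4 asymmetricProjection(double blX, double blY, double trX, double trY,
                          double dFocal, double dNear, double dFar) {
    // Eq. (A.1): scale the focal-plane corners onto the near plane.
    const double s  = dNear / dFocal;
    const double ll = blX * s, bb = blY * s;
    const double rr = trX * s, tt = trY * s;

    Mat4 m{};  // all elements zero-initialized
    // Eqs. (A.2)-(A.3), written column-major.
    m[0]  = 2.0 * dNear / (rr - ll);                 // m11
    m[5]  = 2.0 * dNear / (tt - bb);                 // m22
    m[8]  = (rr + ll) / (rr - ll);                   // m13
    m[9]  = (tt + bb) / (tt - bb);                   // m23
    m[10] = -(dFar + dNear) / (dFar - dNear);        // m33
    m[11] = -1.0;
    m[14] = -2.0 * dFar * dNear / (dFar - dNear);    // m34
    return m;
}
```

The resulting matrix coincides with the one produced by OpenGL's glFrustum(ll, rr, bb, tt, dNear, dFar), so in a fixed-function pipeline the near-plane extents of Eq. (A.1) could equivalently be passed to glFrustum directly.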

References

[1] C.J. Bohil, B. Alicea, F.A. Biocca, Virtual reality in neuroscience research and therapy, Nature Reviews Neuroscience 12 (12) (2011) 752–762.
[2] J. Frey, D.L. Ringach, Binocular eye movements evoked by self-induced motion parallax, The Journal of Neuroscience 31 (47) (2011) 17069–17073.
[3] K. Ukai, P. Howarth, Visual fatigue caused by viewing stereoscopic motion images: background, theories, and observations, Displays 29 (2) (2008) 106–116.
[4] T. Bando, A. Iijima, S. Yano, Visual fatigue caused by stereoscopic images and the search for the requirement to prevent them: a review, Displays 33 (2) (2012) 76–83.
[5] S. Subramanian, L. Knaut, C. Beaudoin, B. McFadyen, A. Feldman, M. Levin, Virtual reality environments for post-stroke arm rehabilitation, Journal of NeuroEngineering and Rehabilitation 4 (1) (2007) 20–24.
[6] P. Ferre, R. Aracil, M. Sanchez-Uran, Stereoscopic human interfaces, IEEE Robotics & Automation Magazine 15 (4) (2008) 50–57.
[7] L.A. Knaut, S.K. Subramanian, B.J. McFadyen, D. Bourbonnais, M.F. Levin, Kinematics of pointing movements made in a virtual versus a physical 3-dimensional environment in healthy and stroke subjects, Archives of Physical Medicine and Rehabilitation 90 (5) (2009) 793–802.
[8] A. Kratky, Re-viewing 3D – implications of the latest developments in stereoscopic display technology for a new iteration of 3D interfaces in consumer devices, in: Proceedings of the First International Conference on Advances in New Technologies, Interactive Interfaces, and Communicability, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 112–120.
[9] M. Lambooij, W. IJsselsteijn, I. Heynderickx, Visual discomfort of 3D TV: assessment methods and modeling, Displays 32 (4) (2011) 209–218.
[10] V. Grinberg, G. Podnar, M. Siegel, Geometry of binocular imaging, in: Proc. of the IS&T/SPIE Symp. on Electronic Imaging, Stereoscopic Displays and Applications, vol. 2177, 1994, pp. 56–65.
[11] D. Southard, Transformations for stereoscopic visual simulation, Computers & Graphics 16 (4) (1992) 401–410.
[12] F. Kooi, A. Toet, Visual comfort of binocular and 3D displays, Displays 25 (2–3) (2004) 99–108.
[13] J.P. Wann, S. Rushton, M. Mon-Williams, Natural problems for stereoscopic depth perception in virtual environments, Vision Research 35 (19) (1995) 2731–2736.
[14] T. Shibata, J. Kim, D.M. Hoffman, M.S. Banks, The zone of comfort: predicting visual discomfort with stereo displays, Journal of Vision 11 (8) (2011) 1–29.
[15] R.T. Held, M.S. Banks, Misperceptions in stereoscopic displays: a vision science perspective, in: Proceedings of the 5th Symposium on Applied Perception in Graphics and Visualization, APGV '08, 2008, pp. 23–32.
[16] H. Hua, L. Brown, C. Gao, Scape: supporting stereoscopic collaboration in augmented and projective environments, IEEE Computer Graphics and Applications 24 (2004) 66–75.
[17] K. Ukai, A. Kibe, Counterroll torsional eye movement in users of head-mounted displays, Displays 24 (2) (2003) 59–63.
[18] S. Sharples, S. Cobb, A. Moody, J.R. Wilson, Virtual reality induced symptoms and effects (VRISE): comparison of head mounted display (HMD), desktop and projection display systems, Displays 29 (2) (2008) 58–69.
[19] J.E. Cutting, P.M. Vishton, Perceiving layout and knowing distances: the integration, relative potency and contextual use of different information about depth, in: W. Epstein, S. Rogers (Eds.), Handbook of Perception and Cognition, Perception of Space and Motion, vol. 5, 1995, pp. 69–117.
[20] G. Singh, J.E. Swan II, J.A. Jones, S.R. Ellis, Depth judgment measures and occluding surfaces in near-field augmented reality, in: APGV '10, ACM, 2010, pp. 149–156.
[21] Z. Wartell, L. Hodges, W. Ribarsky, A geometric comparison of algorithms for fusion control in stereoscopic HTDs, IEEE Transactions on Visualization and Computer Graphics 8 (2) (2002) 129–143.
[22] L. Lin, P. Wu, J. Huang, J. Li, Precise depth perception in projective stereoscopic display, in: The 9th International Conference for Young Computer Scientists (ICYCS 2008), 2008, pp. 831–836.
[23] M. Vesely, N. Clemens, A. Gray, Stereoscopic images based on changes in user viewpoint, US 2011/0122130 A1, 2011.
[24] C. Cruz-Neira, D. Sandin, T. DeFanti, Surround-screen projection-based virtual reality: the design and implementation of the CAVE, in: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, 1993, pp. 135–142.
[25] W.J. Li, C.C. Chang, K.Y. Hsu, M.D. Kuo, D.L. Way, A PC-based distributed multiple display virtual reality system, Displays 22 (5) (2001) 177–181.
[26] J. Wernecke, The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor, Release 2 (OpenGL), Addison-Wesley Professional, 1994.
[27] R. Wright, B. Lipchak, N. Haemel, OpenGL SuperBible: Comprehensive Tutorial and Reference, fourth ed., Addison-Wesley, 2007.
[28] A. Maimone, J. Bidwell, K. Peng, H. Fuchs, Enhanced personal autostereoscopic telepresence system using commodity depth cameras, Computers & Graphics (2012), http://dx.doi.org/10.1016/j.cag.2012.04.011.