
Proton 2: Increasing the Sensitivity and Portability of a Visuo-haptic Surface Interaction Recorder

Alex Burka1, Abhinav Rajvanshi1, Sarah Allen1, and Katherine J. Kuchenbecker1

Abstract—The Portable Robotic Optical/Tactile ObservatioN PACKage (PROTONPACK, or Proton for short) is a new handheld visuo-haptic sensing system that records surface interactions. We previously demonstrated system calibration and a classification task using external motion tracking. This paper details improvements in surface classification performance and removal of the dependence on external motion tracking, necessary before embarking on our goal of gathering a vast surface interaction dataset. Two experiments were performed to refine data collection parameters. After adjusting the placement and filtering of the Proton's high-bandwidth accelerometers, we recorded interactions between two differently-sized steel tooling ball end-effectors (diameter 6.35 and 9.525 mm) and five surfaces. Using features based on normal force, tangential force, end-effector speed, and contact vibration, we trained multi-class SVMs to classify the surfaces using 50 ms chunks of data from each end-effector. Classification accuracies of 84.5% and 91.5%, respectively, were achieved on unseen test data, an improvement over prior results. In parallel, we pursued onboard motion tracking, using the Proton's camera and fiducial markers. Motion tracks from the external and onboard trackers agree within 2 mm and 0.01 rad RMS, and the accuracy decreases only slightly to 87.7% when using onboard tracking for the 9.525 mm end-effector. These experiments indicate that the Proton 2 is ready for portable data collection.

I. MOTIVATION

Our long-term research goal is to collect a large dataset of matched visuo-haptic surface data for use by robots in applications such as grasping unfamiliar objects and walking or driving over unknown terrain. In essence, we want to give robots the power to "feel with their eyes", as humans do, so they can choose motor patterns that are well matched to the physics of their environment. We previously built a data collection rig for this purpose [1], as shown in Fig. 1. It is important to tune our hardware to a high level of performance before embarking on extensive data collection.

The Proton consists of multimodal sensors mounted on a handheld rig, as well as a backpack containing a battery and computer to support untethered data collection. System operation is controlled through a self-hosted web page, designed to be loaded on a smartphone. The choice of a human operator (rather than a robotically controlled sensing rig) reduces system complexity, allows us to leverage prior human intuition as to how best to explore a novel surface, and reduces contamination of the recordings by actuator noise.

This material is based upon work supported by the U.S. National Science Foundation (NSF) under award number 1426787 as part of the National Robotics Initiative. Alex Burka was also supported by the NSF under the Complex Scene Perception IGERT program, award number 0966142.
1 MEAM, ESE, CIS Departments, GRASP Lab, University of Pennsylvania. {aburka,arajv,sarallen,kuchenbe}@seas.upenn.edu

Fig. 1. A CAD model of the Proton 1, shown with the OptoForce end-effector contacting a surface.

The main components of the handheld rig, shown in Fig. 1, are a finger-shaped end-effector that may be easily changed depending on the desired contact sensor, a lower handle containing inertial sensors, and dual cameras at the top looking down toward the end-effector and the surface it is contacting. Specifically, the cameras are a high-resolution mvBlueFOX3 RGB camera and a Structure depth sensor. The three interchangeable end-effectors are a steel tooling ball, an OptoForce OMD three-axis force sensor, and a SynTouch BioTac. Regardless of the attached end-effector, the Proton includes an ATI Mini40 six-axis force/torque sensor behind the end-effector, dual high-bandwidth accelerometers and a microphone to measure contact vibration, plus an inertial measurement unit (IMU). In this work, we collect data using the steel tooling ball end-effector, recording from the force/torque sensor and high-bandwidth accelerometers at 3000 Hz, the IMU at 800 Hz (L3GD20H gyroscope) and 1344 Hz (LSM303 accelerometer), the depth sensor at 30 FPS, and the camera at 7.5 FPS (for the motion tracking experiment) and 20 FPS (for the end-effector experiment).

Henceforth the original instrument previously described by Burka et al. [1] will be referred to as the "Proton 1", and the modified system described here as the "Proton 2". (See the supplemental video for a demonstration of the Proton 1.) Sec. II details hardware modifications to the sensors and end-effectors, plus an experiment confirming that they

TABLE I
End-effector diameter of various haptic sensing systems. The value for the WHaT was estimated from Fig. 1 of Pai and Rizun [6].

Device                         End-effector diameter
WHaT [6]                       1.98 mm
Haptic Camera [7], [8], [9]    1.6 mm, 3.175 mm, 3.25 mm
Haptic stylus [10]             5 mm
Proton 1 [1]                   19.05 mm
Proton 2                       6.35 mm, 9.525 mm

improve classification performance. The Proton 1 was tethered to an external stationary Vicon tracking system. Sec. III details the upgrade to onboard optical motion tracking and another experiment to validate its efficacy. Secs. II and III have parallel structure, including literature review, experimental methods, and results. Sec. IV contains conclusions and directions for further research.

II. END-EFFECTOR SELECTION

Based on the range of texture probes used in the literature, and the unsatisfying performance of the Proton 1 on surface classification (82.5% test set accuracy [1]), we elected to equip the Proton 2 with smaller tooling ball end-effectors to capture contact vibrations with higher fidelity.

A. Theory

When a human or robot contacts a surface through a tool, the geometry of the tool affects the resulting sensations [2]. In particular, considering a spherical or hemispherical tip, the tip size strongly affects the size of the surface features that can be resolved. Klatzky and Lederman studied roughness perception through thimbles and rigid sticks [3]. That study used probes with contact areas 2 mm and 4 mm wide, and surfaces with inter-element spacing of 0.5–3.5 mm. They found higher roughness sensitivity with the larger probe, although subjects gave higher roughness ratings to surfaces with low inter-element spacing when using the smaller probe (presumably due to larger vibration amplitude). Klatzky et al. further explored the relationship between tip size, inter-element spacing and roughness perception [4], including a condition where the applied force and speed were controlled by the apparatus. They confirmed a significant effect of probe diameter as well as end-effector speed. More recently, Lawrence et al. conducted similar experiments [5] with a tip diameter of 9 mm and gratings with groove widths of 0.125–8.5 mm. Although only one tip diameter was studied, the ratio of groove width to tip diameter can be considered. They similarly found higher perceived roughness at higher groove widths, which would correspond to a smaller probe tip if spatial frequency were held constant.

For direct comparison, we consider the tip diameters of extant haptic surface probes. Table I compares end-effector diameter between the three pen-shaped haptic sensors that inspired this research, and the two versions of the Proton. The Proton 1 end-effector had a diameter more than ten times larger than those of prior work; the new end-effector diameters are within an order of magnitude. We did not evaluate even smaller diameters because we found they tend to damage the surfaces when attached to the heavy Proton.

Fig. 2. Surface samples used in the validation experiment: "ABS Plastic", "Glitter Paper", "Silk 2", "Vinyl 2", and "Stained Wood" from the Penn Haptic Texture Toolkit [11]. All materials are mounted on a hard plate.

B. System Improvements

For the experiments presented here, we prepared two interchangeable end-effector modules for the Proton. They attach rigidly to the force/torque sensor and hold tooling balls with diameters of 6.35 mm and 9.525 mm. Interactions were recorded with a set of five sample surfaces selected from the Penn Haptic Texture Toolkit [11]: ABS plastic, glitter paper, silk, vinyl, and wood (shown in Fig. 2). This set was selected with the goal of using flat surfaces that would not damage or be damaged by our steel end-effectors, and that represent distinct values of roughness and softness. The ABS plastic and glitter paper are quite rough to the touch, while the wood is smoother. The silk is very slippery, while the vinyl has high friction.

Several substantive modifications were made to the system since the Proton 1:

Accelerometer Filtering: The performance of the Proton 1 may have been degraded by the low-pass filtering applied to the accelerometer signals by the breakout boards to which they are attached (Adafruit product ID 1018). We expect valuable information about the surface to be present in the high-frequency vibration data [9], [10]. Accordingly, in the Proton 2, the output capacitors were changed from their original value of 0.1 µF to 4.7 nF, increasing the acceleration bandwidth from 50 Hz to 1064 Hz, which is comparable to the bandwidths of the accelerometers used by Culbertson et al. [7], [8] and Romano et al. [9].
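These bandwidths are consistent with a first-order RC low-pass filter formed by the output capacitor and the accelerometer's internal output resistance. The 32 kΩ resistance used in the check below is a typical datasheet value and an assumption on our part, not a figure from the paper:

\[
f_c = \frac{1}{2\pi R C}, \qquad
\frac{1}{2\pi\,(32\,\mathrm{k\Omega})(0.1\,\mu\mathrm{F})} \approx 50\,\mathrm{Hz}, \qquad
\frac{1}{2\pi\,(32\,\mathrm{k\Omega})(4.7\,\mathrm{nF})} \approx 1.06\,\mathrm{kHz}.
\]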

Accelerometer Placement: Even after this change, we found that the acceleration signals were muffled (were low magnitude) when the accelerometers were mounted inside the plastic handle. We conjecture that this muffling stems from several factors. Considering the handheld rig as a rigid body, the center of rotation is likely closer to the handle than to the end-effector, so the acceleration felt at the handle will be lower. Furthermore, the vibration signal must propagate from the point of contact through several centimeters of plastic and a few material interfaces to reach the internal accelerometers, which incurs attenuation. To fix this problem, we moved the accelerometers to the outside of the end-effector, very close to the point of contact. Fig. 3 shows the modified end-effector. Plotted in Fig. 4 is a section of data gathered while dragging the end-effector on ABS plastic, with one accelerometer in its original position and the other mounted externally. The difference in sensitivity is striking.

Frequency bins: One important feature used to model surfaces is the vibration power at various frequencies during interaction. We previously followed the machine learning pipeline of Romano et al. [9], including their perceptual frequency binning strategy, in which the power values are combined via sums weighted by Gaussian coefficients. Though it led to a slight benefit in classification accuracy for that work, on further investigation we found that the more complex binning strategy sometimes provides no benefit, so in this work we also evaluate a simpler linear binning strategy, in which vibration powers are divided evenly and summed as in a histogram.

C. Experimental Methods

Using a procedure similar to the one outlined by Burka et al. [1], we recorded 30-second interactions between each of the two end-effectors and the five surfaces, for a total of 10 interactions and 300 seconds of data. Each surface was affixed to a horizontal table. The end-effector was attached to the Proton 2 and dragged across the surface by a human experimenter, deliberately varying speed, direction and normal force over time. During post-processing, the Vicon motion tracking output was used to estimate the end-effector speed, and to transform the force readings into the world frame. The three-axis acceleration readings were merged into a single dimension using the DFT321 algorithm [12], and the mean was subtracted to reveal pure vibration of the end-effector. For machine learning, the 30 s recordings were chopped into 50 ms chunks, each of which was treated as a labeled example of the surface from which it was recorded. These chunks were randomly divided into a training set, consisting of 80% of the data, and a held-out testing set, consisting of 20%. Finally, feature vectors were formed from the mean and standard deviation of the end-effector speed, normal force and tangential force, plus a frequency histogram of the vibration power. They were normalized by calculating the training set mean and range by column, subtracting the mean and dividing by the range, constraining all feature values to the range (−1, 1). Using these vectors, we trained a multi-class support vector machine (ν-SVM [13]).
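The chunking, feature, and training steps above can be summarized in a short sketch. This is not the authors' code: the data are synthetic placeholders, five linear frequency bins stand in for the perceptual bins ultimately chosen, and scikit-learn's NuSVC stands in for the ν-SVM implementation of [13].

```python
# Minimal sketch of the chunking / feature / nu-SVM pipeline described above
# (not the authors' code). Synthetic stand-in data, five linear frequency
# bins, and scikit-learn's NuSVC are assumptions made for illustration.
import numpy as np
from sklearn.svm import NuSVC

FS = 3000      # high-bandwidth sampling rate (Hz)
CHUNK = 150    # 50 ms chunks at 3000 Hz

def chunk_features(speed, f_normal, f_tangential, vibration, n_bins=5):
    """Build one feature vector from a 50 ms chunk of aligned signals."""
    spectrum = np.abs(np.fft.rfft(vibration - vibration.mean()))[1:]  # drop DC
    bins = [s.sum() for s in np.array_split(spectrum, n_bins)]        # linear bins
    stats = []
    for sig in (speed, f_normal, f_tangential):
        stats += [sig.mean(), sig.std()]
    return np.array(bins + stats)

def normalize(train, test):
    """Scale columns by the training-set mean and range, as described above."""
    mean = train.mean(axis=0)
    span = train.max(axis=0) - train.min(axis=0)
    span[span == 0] = 1.0
    return (train - mean) / span, (test - mean) / span

# Synthetic stand-in for real recordings: 40 chunks from each of 5 surfaces.
rng = np.random.default_rng(0)
X = np.stack([chunk_features(rng.normal(10, 2, CHUNK),        # speed
                             rng.normal(3, 1, CHUNK),         # normal force
                             rng.normal(1, 0.5, CHUNK),       # tangential force
                             rng.normal(0, 1 + label, CHUNK)) # vibration
              for label in range(5) for _ in range(40)])
y = np.repeat(np.arange(5), 40)

order = rng.permutation(len(y))
train_idx, test_idx = order[:160], order[160:]     # 80% / 20% split
X_train, X_test = normalize(X[train_idx], X[test_idx])
clf = NuSVC(nu=0.1, gamma=10, kernel="rbf").fit(X_train, y[train_idx])
print("test accuracy:", clf.score(X_test, y[test_idx]))
```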

Figure 5 shows a visual representation of the process used to transform the recorded data into feature vectors for the SVM, using snippets from the ABS plastic and vinyl trials.

As mentioned above, post-processing of the acceleration data used the DFT321 algorithm as described by Landin et al. [12]. Given a three-dimensional time-domain signal $\vec{a}(t)$, we transformed it into the frequency domain, $\vec{A}(f)$, using a Fast Fourier Transform. Then the components were combined using the following equation:
\[
|\vec{A}(f)| = \sqrt{|A_x(f)|^2 + |A_y(f)|^2 + |A_z(f)|^2}
\]
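A minimal sketch of this spectral combination, assuming NumPy. The full DFT321 algorithm [12] additionally assigns a phase so that a single-axis time-domain signal can be reconstructed; only the magnitude combination used for the features here is shown.

```python
# Sketch of the spectral-magnitude combination in the equation above (NumPy
# assumed). The full DFT321 algorithm [12] also reconstructs a time-domain
# signal by choosing a phase, which is omitted here.
import numpy as np

def combined_spectrum(accel_xyz, fs=3000):
    """accel_xyz: (N, 3) array of accelerations for one chunk.
    Returns the one-sided frequencies (Hz) and the combined magnitude |A(f)|."""
    a = accel_xyz - accel_xyz.mean(axis=0)          # remove DC / gravity offset
    A = np.fft.rfft(a, axis=0)                      # per-axis FFT
    mag = np.sqrt((np.abs(A) ** 2).sum(axis=1))     # sqrt(|Ax|^2 + |Ay|^2 + |Az|^2)
    freqs = np.fft.rfftfreq(a.shape[0], d=1.0 / fs)
    return freqs, mag

chunk = np.random.default_rng(1).normal(size=(150, 3))  # 50 ms at 3000 Hz
freqs, mag = combined_spectrum(chunk)
print(freqs.shape, mag.shape)   # 76 one-sided values from 0 to 1500 Hz
                                # (the paper counts 75, presumably dropping DC)
```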

The original time-domain signal $\vec{a}(t)$ was a 50 ms chunk sampled at 3000 Hz, containing 150 samples in each dimension, and therefore the combined frequency-domain signal

Fig. 3. 6.35 mm tooling ball end-effector with external accelerometers. Left: bottom view. Right: side view.

Fig. 4. Comparison of signals received by external (top) and internal (bottom) accelerometers. External mounting yields far better signals.

$|\vec{A}(f)|$ consists of 75 values representing linearly spaced frequencies from 0–1500 Hz. Using a linear binning strategy, it makes sense to use 3, 5, 15 or 25 bins in order to have an equal number of values in each bin. Using perceptual binning [9], the 75 values are combined in a weighted sum, where the weight vector is a Gaussian function whose width increases at higher frequencies, modulated by a parameter α. As such, the number of bins is not constrained to be a divisor of 75, so we evaluated bin numbers between 20 and 60.
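The two binning strategies can be sketched as follows. The linear bins follow the description above directly; the perceptual bins shown here are only one plausible reading of the Gaussian-weighted scheme of Romano et al. [9], with a center spacing and width-growth law tied to α that are our assumptions rather than the authors' exact parameterization.

```python
# Sketch of the two binning strategies, assuming NumPy. The linear bins match
# the description directly; the "perceptual" bins use assumed Gaussian centers
# and an assumed width-growth law controlled by alpha, not the exact weights
# of Romano et al. [9].
import numpy as np

def linear_bins(power, n_bins):
    """Split the 75 power values evenly and sum each group (histogram-style)."""
    return np.array([p.sum() for p in np.array_split(power, n_bins)])

def perceptual_bins(power, freqs, n_bins, alpha):
    """Weighted sums with Gaussian windows that widen at higher frequencies."""
    centers = np.linspace(freqs[0], freqs[-1], n_bins)
    feats = []
    for c in centers:
        sigma = 10.0 + alpha * c            # assumed width growth with frequency
        w = np.exp(-0.5 * ((freqs - c) / sigma) ** 2)
        feats.append((w * power).sum())
    return np.array(feats)

freqs = np.linspace(20, 1500, 75)           # 75 non-DC frequencies, 20 Hz apart
power = np.random.default_rng(2).random(75)
print(linear_bins(power, 5))
print(perceptual_bins(power, freqs, 50, alpha=0.1))
```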

The model is configured with several hyperparameters, including ν, the learning parameter for the SVM; γ, the width of the SVM kernel; n, the number of frequency bins; and the frequency binning strategy. When using perceptual bins, α describes the bin width. The optimal values for these hyperparameters were found using grid search over three-fold cross-validation on the training set.
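A sketch of such a search, using scikit-learn's GridSearchCV as a stand-in for the authors' implementation and placeholder feature data. Because n and α change the feature matrix itself, in practice an outer loop would regenerate the features for each (n, α) before running this inner ν/γ search.

```python
# Sketch of the hyperparameter search over nu and gamma with three-fold
# cross-validation. GridSearchCV and the placeholder data are stand-ins, not
# the authors' code; the binning hyperparameters n and alpha would be handled
# by an outer loop that rebuilds the feature matrix.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

X_train = np.random.default_rng(3).normal(size=(400, 11))   # placeholder features
y_train = np.repeat(np.arange(5), 80)                        # 5 surface labels

param_grid = {"nu": [0.05, 0.1, 0.2], "gamma": [1, 10, 30, 90]}
search = GridSearchCV(NuSVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```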

D. Results

Table II shows the training and test set accuracies achieved by the optimal models for each end-effector, as well as the optimal hyperparameter values. Perceptual binning gave equal or better performance in all cases. Fig. 6 visualizes the confusion matrix for the 6.35 mm end-effector, quantifying the classification mistakes, and Table III shows numerical measures of the same end-effector's performance.

Overall, the classification accuracy is high, with most of

Fig. 5. Visualization of the feature extraction process for 0.5 s snippets of data from ABS plastic (left) and vinyl (right), using the best-performing parameters from Table II. Top row: time-domain force, vibration and speed data, with boundaries of the 50 ms chunks shown in gray. Bottom row: feature vectors, including frequency bins and summary statistics of force and speed. Each chunk in (a) and (b) corresponds to one row in (c) and (d). The feature vectors are scaled by subtracting the mean and dividing by the range. For display (though not for the machine learning) the frequency bin features are scaled together, so their magnitudes may be directly compared.

TABLE II
Classification accuracy and optimal hyperparameters for each end-effector diameter using Vicon motion tracking.

Diameter    ν     γ    n    α     Train accuracy   Test accuracy
6.35 mm     0.1   10   50   0.1   84.4%            84.5%
9.525 mm    0.1   30   60   0.2   88.2%            91.5%

the mistakes involving confusion between the wood and silk or the ABS plastic and glitter paper surfaces. Both pairs are quite similar in roughness and friction characteristics. Though the former pair seems like quite different surfaces due to their difference in hardness, this property is not explicitly considered in this experiment.

Among these results, the larger of the two tested end-effectors produces somewhat higher classification accuracies on the training and test sets. In theory, a smaller end-effector should be able to better resolve surface details, but these details may be irrelevant for our classification task. Another factor is the human operator, who tended to exert

TABLE III
Test set performance of the 6.35 mm model from Table II using Vicon motion tracking.

Surface         Precision   Recall   F1 score
ABS             0.822       0.945    0.879
glitter paper   0.863       0.800    0.830
silk            0.829       0.807    0.818
vinyl           0.904       0.918    0.911
wood            0.805       0.744    0.773
mean            0.845       0.843    0.842

slightly more force and move slightly slower when using the 9.525 mm end-effector; such differences in exploratory motion can change the difficulty of the task.

III. MOTION TRACKING

We want our system to be completely portable and allow the use of features that depend on measurements of the end-effector force and speed. We achieve this goal by repurposing the RGB camera to provide egomotion tracking rather than using the expensive and stationary Vicon system.

Fig. 6. Confusion matrix for the 6.35 mm model from Table II using Vicon motion tracking. Entries are normalized by the total number of samples of each material.

Actual \ Detected   ABS     glitter paper   silk    vinyl   wood
ABS                 0.945   0.031           0.000   0.016   0.008
glitter paper       0.100   0.800           0.036   0.009   0.055
silk                0.044   0.000           0.807   0.009   0.140
vinyl               0.037   0.030           0.007   0.918   0.007
wood                0.038   0.045           0.105   0.068   0.744

A. Theory

Using a camera to track the motion of a mobile platform is a common problem in recent robotics research. One application is an unmanned aerial vehicle that needs to track its own motion through space with one or more cameras. Examples include Tournier et al. [14], who used a single camera and orientable fiducial markers to track position and orientation, and Achtelik et al. [15], who compared the effectiveness of pose tracking using a laser rangefinder and a stereo camera. Instead of adding fiducial markers to the environment, Achtelik et al. [15] use a combination of the popular FAST and Harris corner detectors to find likely features, and they calculate the optical flow of those features in subsequent frames; the results of this visual odometry are fused with IMU data using an extended Kalman filter (EKF). Another typical approach is exemplified by Deretey et al. [16], who use a single camera to acquire images, then extract features and look them up in a pre-calculated database of 3D locations.

The design of the Proton is influenced by previous texture sensing systems, such as the work of Xu et al. [17] with a robot-mounted BioTac, Lepora et al.'s evaluation of active touch [18], and devices such as Pai and Rizun's WHaT [6], Battaglia et al.'s ThimbleSense [19], and Culbertson et al.'s Haptic Camera [7]. Of these, the WHaT originally had no motion tracking, but an external visual tracker was added by Andrews et al. [20]. The devices of Xu et al. and Lepora et al. are mounted on robots with relatively precise position control, so additional tracking is not necessary. Lastly, the ThimbleSense [19] and Haptic Camera [7] use stationary external motion trackers (visual and magnetic, respectively). For another approach, Strese et al. [10] demonstrate surface classification using features that are independent of normal force and scanning speed, which obviates some of the need for motion tracking, but the Proton still needs motion tracking to calculate tangential force.

B. System Improvements

By design, the Proton's camera will often be pointed at flat, homogeneous surfaces where easy-to-track features may not be available. In order to perform highly accurate motion

Fig. 7. Fiducial frame on top of a desk surface.

tracking in a wide variety of scenes, we introduce known fiducial markers (AprilTags [21]) into the environment. These markers can be clearly identified with the mvBlueFOX3 camera in a variety of lighting conditions, and being of known size and shape, their position and orientation with respect to the camera can be determined from a single image without a depth sensor. Accordingly, we fabricated a U-shaped frame covered with an array of AprilTags, as pictured in Fig. 7. During data collection, this frame is placed over the observed surface so that both the tags and the surface are within the camera's field of view. In this way the mvBlueFOX3 camera is used both for motion tracking and to record the visual appearance of the surface.

Motion tracking is more accurate when given the chance to track more spatially distributed features. Therefore, we use a Computar M1224-MPW2 lens, which has a wider field of view than the Computar M2518-MPW2 used by Burka et al. [1] (30.4° as opposed to 15°), trading off some image detail for better tracking performance.

C. Optical motion tracking algorithms

Four motion tracking algorithms were compared:

PNP: The goal of the Perspective-n-Point (PNP) problem [22] is to recover the camera pose given point correspondences between locations in the world frame and the pixel coordinates on the camera's image plane. Fiducial markers provide four points per tag (the tag corners), which are located at known world-frame coordinates (referenced to an origin placed arbitrarily at the corner of the fiducial frame). This algorithm uses a closed-form solution [23] to compute the camera pose at every frame, using all visible tags.
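A sketch of recovering the camera pose from detected tag corners. The paper cites a closed-form solution [23]; here OpenCV's solvePnP serves as a readily available substitute, and the tag layout, pixel detections, and camera intrinsics are placeholder values rather than the Proton's actual calibration.

```python
# Sketch of camera-pose recovery from tag-corner correspondences, using
# OpenCV's solvePnP as a substitute for the closed-form solution cited in the
# paper [23]. All numerical values below are placeholders; in the real system
# they come from the AprilTag detector, the known fiducial-frame layout, and
# camera calibration.
import numpy as np
import cv2

# Known world-frame 3D coordinates of the visible tag corners (meters).
object_points = np.array([[0.00, 0.00, 0.0], [0.05, 0.00, 0.0],
                          [0.05, 0.05, 0.0], [0.00, 0.05, 0.0],
                          [0.10, 0.00, 0.0], [0.15, 0.00, 0.0],
                          [0.15, 0.05, 0.0], [0.10, 0.05, 0.0]], dtype=np.float64)
# Corresponding pixel coordinates reported by the tag detector.
image_points = np.array([[320, 240], [400, 242], [398, 320], [318, 318],
                         [480, 244], [560, 246], [558, 322], [478, 320]],
                        dtype=np.float64)
# Intrinsic matrix and distortion coefficients (placeholder calibration).
K = np.array([[1400.0, 0.0, 640.0],
              [0.0, 1400.0, 480.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)           # rotation taking world coords to camera coords
camera_position = -R.T @ tvec        # camera position expressed in the world frame
print(ok, camera_position.ravel())
```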

PNP+IMU: The fiducial frame contains over 70 April-Tags, so there are usually many visible in every cameraimage. However, quick motion can induce blur, or the Protonmay be tilted at an extreme angle, so that few tags arecaptured and the PNP solver may output an inaccurate poseor none at all. For this reason, the Proton has an inertialmeasurement unit (IMU) containing a three-axis accelerome-ter and gyroscope. We integrate the acceleration and angularvelocity and fuse with the PNP estimate using an EKF.
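A highly simplified sketch of this fusion idea: a linear Kalman filter over position and velocity, with IMU acceleration as the control input and the PNP position estimate as the measurement. The Proton's actual filter is an EKF that also tracks orientation from the gyroscope; that part is omitted here, and all noise parameters are made-up placeholders.

```python
# Simplified stand-in for the PNP+IMU fusion: a linear Kalman filter on
# [position; velocity], driven by world-frame IMU acceleration and corrected
# by PNP position fixes. Orientation estimation and all noise values are
# placeholders, not the authors' EKF.
import numpy as np

dt = 1.0 / 800.0                                     # IMU update period
F = np.block([[np.eye(3), dt * np.eye(3)],
              [np.zeros((3, 3)), np.eye(3)]])        # constant-velocity model
B = np.block([[0.5 * dt**2 * np.eye(3)], [dt * np.eye(3)]])
H = np.hstack([np.eye(3), np.zeros((3, 3))])         # PNP measures position only
Q = 1e-4 * np.eye(6)                                 # process noise (placeholder)
R_meas = (0.002**2) * np.eye(3)                      # ~2 mm PNP noise (placeholder)

x = np.zeros(6)                                      # [position; velocity]
P = np.eye(6)

def predict(x, P, accel_world):
    """Propagate the state with the world-frame acceleration from the IMU."""
    x = F @ x + B @ accel_world
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, pnp_position):
    """Correct the state with a position fix from the PNP solver."""
    S = H @ P @ H.T + R_meas
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (pnp_position - H @ x)
    P = (np.eye(6) - K @ H) @ P
    return x, P

# Fake data: many IMU prediction steps between each camera-frame update.
rng = np.random.default_rng(4)
for frame in range(5):
    for _ in range(40):
        x, P = predict(x, P, rng.normal(0, 0.1, 3))
    x, P = update(x, P, rng.normal(0, 0.002, 3))
print("fused position estimate:", x[:3])
```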

OF: The PNP solver requires the fiducial markers to be anchored to a world frame. Alternative algorithms can start with an initial estimate of the camera pose and track it from there, using any visible features. We again use the AprilTag

TABLE IV
End-effector pose errors (average RMS error across three runs) on test datasets using each motion tracking algorithm.

Algorithm        PNP     PNP+IMU   OF      SLAM     SLAM
Initialization   N/A     N/A       IMU     IMU      PNP
X (mm)           2.765   2.899     9.405   20.979   21.823
Y (mm)           2.682   2.711     15.453  15.433   16.570
Z (mm)           1.889   2.112     16.641  21.465   19.495
Roll (rad)       0.010   0.011     0.082   0.048    0.053
Pitch (rad)      0.012   0.013     0.081   0.092    0.123
Yaw (rad)        0.004   0.006     0.041   0.125    0.103

corners to find robust features to track, but then we use the Structure depth sensor to find their 3D coordinates. The optical flow (OF) algorithm uses point correspondences to align successive frames and extract the rigid transformation representing the change in camera pose [24]. These incremental transformations are applied cumulatively to the initial pose estimate, making that estimate very important.
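One standard way to extract a rigid transformation from matched 3D points is the SVD-based (Kabsch) method, sketched below as a stand-in for the formulation cited in the paper [24]; the matched points would come from AprilTag corners back-projected with the Structure depth sensor.

```python
# Sketch of recovering the frame-to-frame rigid transformation from matched
# 3D points via the standard SVD-based (Kabsch) method, used here as a
# stand-in for the lecture-note formulation the paper cites [24].
import numpy as np

def rigid_transform(p_prev, p_curr):
    """Find R, t minimizing ||R @ p_prev_i + t - p_curr_i|| over matched points."""
    c_prev, c_curr = p_prev.mean(axis=0), p_curr.mean(axis=0)
    H = (p_prev - c_prev).T @ (p_curr - c_curr)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ D @ U.T
    t = c_curr - R @ c_prev
    return R, t

# Example: rotate and translate a small point cloud, then recover the motion.
rng = np.random.default_rng(5)
p_prev = rng.normal(size=(20, 3))
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
p_curr = p_prev @ R_true.T + np.array([0.01, -0.02, 0.005])
R_est, t_est = rigid_transform(p_prev, p_curr)
print(np.allclose(R_est, R_true, atol=1e-6), t_est)
```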

SLAM: To jointly track the pose of the camera and the features, a simultaneous localization and mapping (SLAM) algorithm can be used. Similar to the optical flow algorithm, the features do not need to be registered to the world frame. Our implementation uses an error-state Kalman filter (ESKF) tracking the Proton's position, orientation, and velocity, plus the world-frame coordinates of the AprilTag corners [25].

The optical flow and SLAM methods both require an initial orientation estimate, which can be found by using the IMU to measure the direction of the gravity vector, or by running the PNP solver on the first image frame.
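A minimal sketch of the IMU-based option: roll and pitch from the gravity direction measured while the device is held still. Yaw is unobservable from gravity alone, which is one reason the PNP-based initialization is the alternative. The axis conventions here are assumptions, not the Proton's actual frame definitions.

```python
# Sketch of gravity-based orientation initialization from a stationary
# accelerometer reading. Axis conventions are assumed, not the Proton's.
import numpy as np

def roll_pitch_from_gravity(accel):
    """accel: average accelerometer reading (m/s^2) while stationary."""
    ax, ay, az = accel
    roll = np.arctan2(ay, az)
    pitch = np.arctan2(-ax, np.hypot(ay, az))
    return roll, pitch

print(roll_pitch_from_gravity(np.array([0.0, 0.7, 9.78])))  # slight roll
```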

D. Experimental Methods

We compared the output of the four motion tracking algorithms with ground truth from the Vicon system during motions that are typical of data collection, with the Proton mostly level and the end-effector moving in straight lines across a flat surface, as well as motion exhibiting more extreme orientations in order to test the limits of the tracking.

We then tested the best of these algorithms in the classification test described in the previous section. Motion tracking was done with both the Vicon system and the onboard camera. For the second experiment, the same machine learning pipeline was run again, using the pose estimates from camera tracking in place of Vicon data.

E. Results

Results from the motion tracking comparison (see Table IV) show that the PNP algorithm performs the best, with RMS tip position errors of about 2 mm and RMS angle errors of about 0.01 rad. Surprisingly, IMU fusion did not help in our testing. Nevertheless, it is possible that better filtering could make it more beneficial, or that varying environmental conditions may make the fusion necessary (e.g., there were no prolonged periods of missing or blurry fiducials in the test data, which may occur during practical data collection). The higher error rates of the OF and SLAM algorithms may be explained by the fact that they operate on less information, since they are not given the world-frame AprilTag locations;

Fig. 8. Graphical comparison of motion tracking using the Vicon system and onboard camera, during the interaction shown in Fig. 4. Plots show the position (mm) and Euler angles (rad) of the tracked Vicon markers over time.

TABLE V
Classification accuracy and optimal hyperparameters vs. end-effector diameter and tracking method.

Tracking   Diameter    ν     γ    n    α      Train accuracy   Test accuracy
Vicon      6.35 mm     0.1   10   50   0.1    84.4%            84.5%
Vicon      9.525 mm    0.1   30   60   0.2    88.2%            91.5%
Camera     6.35 mm     0.1   10   60   0.1    83.0%            84.7%
Camera     9.525 mm    0.1   90   30   0.15   87.4%            87.7%

also, they persist estimates from frame to frame, compounding any errors, while the PNP estimate is recalculated.

Fig. 8 shows the camera pose over time according to the Vicon system and the winning PNP algorithm, during a representative test. Since the two tracking algorithms use different world frames, they have been manually brought into alignment here for ease of visual comparison. Qualitatively, the tracks are nearly identical, confirming the results above.

Table V repeats the results from Table II (first two rows), and shows the results of the same experiment with the Vicon motion tracking replaced with onboard camera-based tracking, using the PNP algorithm. A visualization of the confusion matrix for the larger end-effector is shown in Fig. 9 and analyzed in Table VI. For the 6.35 mm end-effector, peak test set accuracy increases slightly from 84.5% to 84.7%, while for the 9.525 mm end-effector, the test set accuracy decreases from 91.5% to 87.7%. We consider these results to be comparable. The optimal hyperparameter values are similar, but not identical to their previous values. As before, the perceptual frequency bins are the better choice.

IV. CONCLUSIONS

Moving the accelerometers close to the contact point, removing the low-pass filter, and reducing the end-effector diameter improved surface classification performance over our prior work. Onboard motion tracking using the RGB camera and PNP solver achieved comparable performance in an end-to-end proof-of-concept surface classification task. Taken together, these results make us confident that the Proton 2 is ready for portable operation and large-scale data collection.

Fig. 9. Confusion matrix for the 9.525 mm model from Table V using onboard motion tracking. Entries are normalized by the total number of samples of each material.

Actual \ Detected   ABS     glitter paper   silk    vinyl   wood
ABS                 0.948   0.022           0.007   0.015   0.007
glitter paper       0.047   0.849           0.019   0.009   0.075
silk                0.034   0.068           0.846   0.017   0.034
vinyl               0.067   0.022           0.030   0.867   0.015
wood                0.040   0.080           0.096   0.024   0.760

TABLE VI
Test set performance of the 9.525 mm model from Table V using onboard motion tracking.

Surface         Precision   Recall   F1 score
ABS             0.847       0.948    0.894
glitter paper   0.789       0.849    0.818
silk            0.839       0.846    0.843
vinyl           0.936       0.867    0.900
wood            0.864       0.760    0.809
mean            0.855       0.854    0.853

A key hardware improvement was mounting the accelerometers externally, on the end-effector, instead of inside the Proton's handle. However, this location makes it harder to switch end-effectors. Future work includes improving this mounting system to preserve the goal of easily interchangeable end-effectors. Once that is complete, we can move forward with collection of a large-scale multimodal dataset of surface interactions, which will be used to discover cross-sensory correspondences using deep learning, as in the work of Gao et al. [26]. The database will also be made publicly available for use in research or applications by the general haptics and robotics communities.

ACKNOWLEDGMENTS

The authors thank Naomi Fitter and Jennifer Hui for their design advice, and Stu Helgeson for helping make the video associated with this paper.

REFERENCES

[1] A. Burka, S. Hu, S. Helgeson, S. Krishnan, Y. Gao, L. A. Hendricks, T. Darrell, and K. Kuchenbecker, "Proton: A visuo-haptic data acquisition system for robotic learning of surface properties," in Proc. IEEE Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems, 2016.
[2] C. G. McDonald and K. J. Kuchenbecker, "Dynamic simulation of tool-mediated texture interaction," in Proc. IEEE World Haptics Conf., 2013, pp. 307–312.
[3] R. L. Klatzky and S. J. Lederman, "Tactile roughness perception with a rigid link interposed between skin and surface," Perception & Psychophysics, vol. 61, no. 4, pp. 591–607, 1999.
[4] R. L. Klatzky, S. J. Lederman, C. Hamilton, M. Grindley, and R. H. Swendsen, "Feeling textures through a probe: effects of probe and surface geometry and exploratory factors," Perception & Psychophysics, vol. 65, no. 4, pp. 613–631, 2003.
[5] M. A. Lawrence, R. Kitada, R. L. Klatzky, and S. J. Lederman, "Haptic roughness perception of linear gratings via bare finger or rigid probe," Perception, vol. 36, no. 4, pp. 547–557, 2007.
[6] D. K. Pai and P. Rizun, "The WHaT: a wireless haptic texture sensor," in Proc. IEEE Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems, 2003.
[7] H. Culbertson, J. Unwin, B. E. Goodman, and K. J. Kuchenbecker, "Generating haptic texture models from unconstrained tool-surface interactions," in Proc. IEEE World Haptics Conf., 2013, pp. 295–300.
[8] H. Culbertson, J. Unwin, and K. J. Kuchenbecker, "Modeling and rendering realistic textures from unconstrained tool-surface interactions," IEEE Transactions on Haptics, vol. 7, no. 3, pp. 381–393, 2014.
[9] J. M. Romano and K. J. Kuchenbecker, "Methods for robotic tool-mediated haptic surface recognition," in Proc. IEEE Haptics Symposium, 2014, pp. 49–56.
[10] M. Strese, C. Schuwerk, and E. Steinbach, "Surface classification using acceleration signals recorded during human freehand movement," in Proc. IEEE World Haptics Conf., 2015, pp. 214–219.
[11] H. Culbertson, J. J. Lopez Delgado, and K. J. Kuchenbecker, "One hundred data-driven haptic texture models and open-source methods for rendering on 3D objects," in Proc. IEEE Haptics Symposium, 2014, pp. 319–325.
[12] N. Landin, J. M. Romano, W. McMahan, and K. J. Kuchenbecker, "Dimensional reduction of high-frequency accelerations for haptic rendering," in Haptics: Generating and Perceiving Tangible Sensations (Proc. EuroHaptics, Part II), A. M. L. Kappers, J. B. F. van Erp, W. M. Bergmann Tiest, and F. C. T. van der Helm, Eds., Lecture Notes in Computer Science, vol. 6191, 2010, pp. 79–86.
[13] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, "New support vector algorithms," Neural Computation, vol. 12, pp. 1207–1245, 2000.
[14] G. P. Tournier, M. Valenti, J. P. How, and E. Feron, "Estimation and control of a quadrotor vehicle using monocular vision and moiré patterns," in AIAA Guidance, Navigation, and Control Conf. and Exhibit, 2006, pp. 1–16.
[15] M. Achtelik, A. Bachrach, R. He, S. Prentice, and N. Roy, "Stereo vision and laser odometry for autonomous helicopters in GPS-denied indoor environments," in Proc. Int. Society for Optical Engineering, vol. 1, 2009, pp. 733219–733219-10.
[16] E. Deretey, M. T. Ahmed, J. A. Marshall, and M. Greenspan, "Visual indoor positioning with a single camera using PnP," in Proc. Int. Conf. on Indoor Positioning and Indoor Navigation, 2015.
[17] D. Xu, G. E. Loeb, and J. A. Fishel, "Tactile identification of objects using Bayesian exploration," in Proc. IEEE Int. Conf. on Robotics and Automation, 2013, pp. 3056–3061.
[18] N. F. Lepora, U. Martinez-Hernandez, and T. J. Prescott, "Active touch for robust perception under position uncertainty," in Proc. IEEE Int. Conf. on Robotics and Automation, 2013, pp. 3020–3025.
[19] E. Battaglia, G. Grioli, M. G. Catalano, M. Santello, and A. Bicchi, "ThimbleSense: an individual-digit wearable tactile sensor for experimental grasp studies," in Proc. IEEE Int. Conf. on Robotics and Automation, 2014, pp. 2728–2735.
[20] S. Andrews and J. Lang, "Interactive scanning of haptic textures and surface compliance," in Proc. IEEE Int. Conf. on 3-D Digital Imaging and Modeling, 2007, pp. 99–106.
[21] E. Olson, "AprilTag: A robust and flexible visual fiducial system," in Proc. IEEE Int. Conf. on Robotics and Automation, 2011, pp. 3400–3407.
[22] K. Daniilidis and J. Shi, "Lecture 26 - pose from 3D point correspondences: The Procrustes problem," in Robotics: Perception. Coursera, 2016.
[23] ——, "Lecture 27 - pose from projective transformations," in Robotics: Perception. Coursera, 2016.
[24] ——, "Lecture 37 - 3D velocities from optical flow," in Robotics: Perception. Coursera, 2016.
[25] J. Solá, "Quaternion kinematics for the error-state KF," Laboratoire d'Analyse et d'Architecture des Systèmes–Centre National de la Recherche Scientifique, Toulouse, France, Tech. Rep., 2012.
[26] Y. Gao, L. A. Hendricks, K. J. Kuchenbecker, and T. Darrell, "Deep learning for tactile understanding from visual and haptic data," in Proc. IEEE Int. Conf. on Robotics and Automation, 2016, pp. 536–543.