

Massachusetts Institute of Technology AgeLab Technical Report 2020-2

MIT DriveSeg (Semi-auto) Dataset: Large-scale Semi-automated Annotation of Semantic Driving Scenes

Li Ding, Michael Glazer, Jack Terwilliger, Bryan Reimer, and Lex Fridman

Massachusetts Institute of Technology (MIT)

Abstract

This technical report summarizes and provides detailed information about the MIT DriveSeg (Semi-auto) dataset, including technical aspects of data collection and annotation, as well as potential research directions.

1. Introduction

Solving the external perception problem for autonomous vehicles and driver-assistance systems requires accurate and robust driving scene perception in both regularly occurring driving scenarios (termed “common cases”) and rare outlier driving scenarios (termed “edge cases”). In order to develop and evaluate driving scene perception models at scale, and, more importantly, to cover potential edge cases from the real world, we take advantage of the MIT Advanced Vehicle Technology Consortium’s MIT-AVT Clustered Driving Scene Dataset and build a subset for the semantic scene segmentation task. We hereby present the MIT DriveSeg (Semi-auto) Dataset: a large-scale video driving scene dataset that contains 20,100 video frames with pixel-wise semantic annotation. We propose semi-automatic annotation approaches leveraging both manual and computational efforts to annotate the data more efficiently and at lower cost than manual annotation.

2. Background

2.1. Semantic Scene Segmentation Datasets

Recent progress in deep learning has significantly pushed forward the benchmarks of many popular computer vision problems, such as image classification and segmentation. The improved solutions and algorithms make it possible to explore more challenging problems and real-world applications. Scene segmentation, based on semantic image segmentation, is one of those challenging computer vision problems: the goal is to partition the full image into semantically meaningful parts.

However, training neural networks requires large-scale data, and datasets for scene segmentation are not as widely available as those for traditional image classification. Some popular challenges and benchmarks for semantic segmentation and scene parsing have been released in recent years, including PASCAL VOC [3], MS COCO [7], MIT Scene Parsing [9], and CityScapes [1]. DAVIS [8] is another dataset, for video object segmentation. Although some of these datasets have a relatively large number of annotated images (20k images for both MIT Scene Parsing [9] and CityScapes [1]), the manual annotation effort invested in them is difficult to deploy at scale. For example, in [1] annotation and quality control required more than 1.5 hours on average for a single image. In our earlier work proposing the MIT DriveSeg (Manual) Dataset, we optimized the manual annotation process with enhanced tooling and frameworks. While that gave a 10× cost reduction for annotation compared to existing work, it is still too expensive to annotate at scale, especially for high-frequency video. In addition to solving the actual scene segmentation problem, it is therefore also important to develop efficient mechanisms for the annotation process, which is a prerequisite for obtaining large-scale datasets.

2.2. MIT-AVT Dataset

The MIT-AVT study [4] as a whole seeks to gather a large variety of driver, scene, telemetry, and vehicle state data from different vehicles equipped with varying types of advanced vehicle technologies. Cameras include dash (internal cabin of the car), face, front, and sometimes instrument-cluster-facing orientations, recording at 720p resolution and 30 frames per second (fps) for the full duration of any trip taken in the vehicle. Telemetry data gathered includes IMU, GPS, and selected signals from the vehicle’s CAN bus.



Figure 1: Visualization via t-SNE 3D embedding of clips from 6 sample clusters (each point in the plot is a clip) in the MIT-AVT Clustered Driving Scene Dataset. Each cluster is associated with manually assigned semantic labels related to road type, weather, and other categories. Representative image samples are shown for each cluster.

2.3. MIT-AVT Clustered Driving Scene Dataset

Driving encompasses a wide variety of external scenes, many of which are periodic (exits and intersections) or infrequent (unique bridges). Weather and light conditions, among other factors, may further augment those scenes. Approaches that go beyond randomly sampling a large-scale naturalistic driving dataset may include manually implemented filters for conditions such as weather, season, road type, or time of day based on date and GPS data. However, these filters do not account for differences in the visual characteristics of the scene, since the imagery data is not utilized in the sampling process. For example, bridges and overpasses may cause more varied illuminance conditions at the same location in similar weather, compared to normal highways. Moreover, in some cases the visual sensor may be partially occluded or blurred by the presence of obscuring materials such as snow, ice, rain, fog, or dirt.

In order to overcome these challenges, a vision-based approach to obtaining both representative samples and a variety of edge cases from the AVT data was proposed. The method entails extracting 10-second clips from the MIT-AVT dataset. Five evenly spaced images are selected from each clip and run through a ResNet-50 [6] model pre-trained on ImageNet [2] to obtain feature embeddings. These embeddings are then reduced in dimension using PCA, combined chronologically, and clustered using Mini-Batch K-means. Clusters are then manually annotated for basic, high-level scene characteristics to assign contextual classes to each cluster. These algorithms were selected to minimize processing time, cost, and local memory use in order to locally process terabytes of driving data while still effectively differentiating vision-based scene characteristics.
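This pipeline can be summarized in a few lines of code. The following is only a minimal sketch, not the authors’ implementation; the PCA dimensionality and the number of clusters are assumptions, since the report does not specify them.

```python
# Sketch of the clip-clustering pipeline: ResNet-50 embeddings of 5 frames
# per 10-second clip, PCA reduction, chronological concatenation, and
# Mini-Batch K-means clustering. Hyperparameter values are assumptions.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

# ResNet-50 pre-trained on ImageNet, classifier head removed so that
# each image yields a 2048-d embedding.
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_clip(frames):
    """frames: 5 evenly spaced PIL images from one 10-second clip."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch).numpy()          # (5, 2048)

def cluster_clips(clips, n_components=50, n_clusters=6):
    """clips: list of frame lists, one list of 5 frames per clip."""
    per_frame = np.concatenate([embed_clip(f) for f in clips])   # (5N, 2048)
    reduced = PCA(n_components=n_components).fit_transform(per_frame)
    # Concatenate the reduced frame vectors of each clip in chronological
    # order to form a single descriptor per clip.
    per_clip = reduced.reshape(len(clips), -1)                   # (N, 5*n_components)
    return MiniBatchKMeans(n_clusters=n_clusters).fit_predict(per_clip)
```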

3. MIT DriveSeg (Semi-auto) Dataset

For the purpose of training and evaluating driving scene perception models at scale, and, more importantly, of covering potential edge cases from the real world, we take advantage of the MIT-AVT Clustered Driving Scene Dataset for semantic scene segmentation. We hereby propose the MIT DriveSeg (Semi-auto) Dataset: a large-scale video driving scene dataset with semi-automatic pixel-wise semantic annotation.

Figure 2: Examples of the inferred drivable area overlaid on the front camera image. The vehicle trajectory is shown on the bottom bird’s-eye-view image, and the drivable area is inferred by perspective transformation. Different shades of green represent 1 second, 2 seconds, and 3 seconds of future vehicle trajectory.

3.1. Dataset Specifications

In order to keep camera positioning and reflection patterns constant, a single vehicle (Range Rover Evoque) was sampled from the MIT-AVT Clustered Driving Scene Dataset. This sub-sampling was carried out to focus efforts on the variability of the external scene independent of vehicle factors. Data from this vehicle, spread across different clusters, was used so as to obtain common cases and edge cases from the naturalistic driving scenarios in equal measure.

The dataset contains 20,100 video frames (67 video clips, 10 seconds each at 30 frames per second) at 720p resolution (1280×720). We semi-automatically annotate pixel-wise semantic labels for each frame with 12 classes in our dataset: vehicle, pedestrian, road, sidewalk, bicycle, motorcycle, building, terrain (horizontal vegetation), vegetation (vertical vegetation), pole, traffic light, and traffic sign. The annotation is done through semi-automatic approaches leveraging both manual and computational efforts.
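For reference, a label map over the 12 classes might look like the following. The integer ids are purely an assumption for illustration; only the class names come from the dataset specification above.

```python
# Hypothetical class-id mapping for the 12 annotated classes.
DRIVESEG_CLASSES = {
    0: "vehicle",   1: "pedestrian",    2: "road",
    3: "sidewalk",  4: "bicycle",       5: "motorcycle",
    6: "building",  7: "terrain",       8: "vegetation",
    9: "pole",     10: "traffic light", 11: "traffic sign",
}
```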

3.2. Semi-automatic Annotation

Existing full scene segmentation datasets [1, 9] rely on hiring a very small group of professional annotators to do full-frame annotation, which is usually highly costly and time-consuming. This is understandable, because pixel-wise annotation is a high-complexity task that reaches the limits of human perception and motor function. However, the driving task requires annotated data at a potentially much larger scale in order to cover the long tail of edge cases. For this purpose, we explore and develop several approaches for automatic and semi-automatic annotation that greatly improve annotation efficiency while maintaining a level of noise in the annotated data similar to that of manual annotation.

3.2.1 Drivable Area Projection from Vehicle Trajectory

Naturalistic driving data contains sequential information about the visual scene and vehicle motion. Since steering commands result in future vehicle motion, we can use the sequential vehicle control data to infer the drivable area by projecting the future vehicle trajectory onto the current front scene. The inferred drivable area has high precision in the sense that it consists only of road and on-road fast-moving objects (vehicles), because it is the area actually driven through according to the future information.

To obtain the drivable area, we first warp the front scene into the bird’s-eye view with a calibrated visual perspective transformation. We then calculate the future vehicle trajectory from the steering angle, speed, and other vehicle specifications. We draw the vehicle trajectory on the bird’s-eye view of the front scene and project the single-lane area back onto the front scene image. Fig. 2 shows examples of the inferred drivable area overlaid on the front camera image. We use the projected drivable area for further relabeling, as described in the next section.
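As a rough illustration of this projection step, the sketch below integrates a kinematic bicycle model from speed and steering signals, draws a single-lane strip in the bird’s-eye view, and warps it back into the camera perspective. The bicycle model, the wheelbase and lane-width values, the bird’s-eye-view origin, and the homography H_bev_to_img are assumptions; the report only states that a calibrated perspective transform and the vehicle’s control signals are used.

```python
# Illustrative sketch of drivable-area projection from vehicle trajectory.
# Assumed inputs: per-timestep speed (m/s) and steering angle (rad), plus a
# calibrated homography H_bev_to_img mapping bird's-eye-view pixels back to
# the front camera image.
import cv2
import numpy as np

def future_trajectory(speed_mps, steering_rad, wheelbase=2.66, dt=0.1, horizon_s=3.0):
    """Integrate a kinematic bicycle model to get future (x, y) positions
    in the vehicle frame (meters): x forward, y to the left."""
    x, y, yaw = 0.0, 0.0, 0.0
    pts = []
    for k in range(int(horizon_s / dt)):
        v = speed_mps[min(k, len(speed_mps) - 1)]
        delta = steering_rad[min(k, len(steering_rad) - 1)]
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += v * np.tan(delta) / wheelbase * dt
        pts.append((x, y))
    return np.array(pts)

def drivable_area_mask(img_shape, H_bev_to_img, traj_m, meters_to_px, lane_width_m=3.0):
    """Draw the single-lane strip along the trajectory in the bird's-eye view
    (origin assumed at the bottom center) and warp it back to the camera view."""
    h, w = img_shape[:2]
    bev = np.zeros((h, w), dtype=np.uint8)
    # Metric trajectory -> bird's-eye-view pixel coordinates.
    px = np.array([(w / 2 - y * meters_to_px, h - x * meters_to_px) for x, y in traj_m])
    half = lane_width_m / 2 * meters_to_px
    # Single-lane polygon: left edge followed by reversed right edge.
    polygon = np.concatenate([px + [-half, 0], (px + [half, 0])[::-1]])
    cv2.fillPoly(bev, [polygon.astype(np.int32)], 255)
    # Project the bird's-eye-view strip back into the camera perspective.
    return cv2.warpPerspective(bev, H_bev_to_img, (w, h))
```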

3.2.2 Automatic Segmentation Relabeling with Information Fusion

Previous work [5] shows that the disagreement between different deep learning models can be viewed as a strong signal that, given human supervision, is sufficient to improve the accuracy of the overall system. In addition to using a pre-trained full scene segmentation model for automatic labeling, we introduce a detection model that predicts bounding boxes for certain objects, including vehicles and pedestrians. In principle, a pixel predicted as vehicle by the segmentation model should also lie within the bounding box of a vehicle predicted by the detection model.

Figure 3: Examples of automatically relabeling segmentation through different sources of information. From top to bottom: video frame input, raw segmentation output, detection and drivable area output, automatically relabeled segmentation.

Model fusion, or ensembling, is a common method in machine learning that averages predictions from different sources in order to get a more accurate and robust result. In the scene segmentation task this idea can also be utilized, but in a more structured and principled way, given prior knowledge of the nature of the driving scene. We propose an information fusion method that leverages different sources of perception to combine the advantages of each and automatically fix some obvious errors introduced by the main scene segmentation model. The sources and their descriptions are listed below:

• Full Scene Segmentation - Dense and precise labeling of every pixel in the whole scene, similar to the format of the annotation; not robust to edge cases (lighting conditions, unseen objects).

• Object Detection - Bounding box prediction for a few object classes (e.g., car, person); accurate and robust because the model is trained on a much larger scale of data compared to scene segmentation.

• Drivable Area - Road prediction from steering commands as described in Sec. 3.2.1; high precision but low recall, as it is only available for part of the scene.

In order to make the most of the advantages of the above methodologies, we design the automatic annotation framework to use the more robust and accurate algorithms to guide the labeling in certain parts. This can be briefly described by two principles (a code sketch follows the list):

1. A pixel can be labeled as car or person by the segmentation model only if it is within the bounding box of the corresponding object predicted by the detection model;

2. A pixel within the drivable area projected from steering commands can only be labeled as road or car (a fast-moving object that can be in the path of driving).
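The following is a minimal sketch of how these two principles might be applied to a per-pixel class map. The class ids, the unlabeled fallback, the box format, and the choice of road as the fallback label inside the drivable area are all assumptions for illustration, not the authors’ implementation.

```python
# Applying the two relabeling principles to a segmentation class map.
# Class ids and the unlabeled/fallback convention are assumptions.
import numpy as np

CAR, PERSON, ROAD, UNLABELED = 0, 1, 2, 255  # hypothetical ids

def relabel(seg, det_boxes, drivable_mask):
    """seg: (H, W) int class map from the segmentation model.
    det_boxes: dict mapping class id -> list of (x1, y1, x2, y2) boxes.
    drivable_mask: (H, W) bool mask projected from steering commands."""
    out = seg.copy()
    # Principle 1: car/person pixels must fall inside a matching detection box.
    for cls in (CAR, PERSON):
        inside = np.zeros(seg.shape, dtype=bool)
        for x1, y1, x2, y2 in det_boxes.get(cls, []):
            inside[y1:y2, x1:x2] = True
        out[(seg == cls) & ~inside] = UNLABELED
    # Principle 2: inside the drivable area, only road or car is allowed.
    # Falling back to road here is an assumption; the report does not specify it.
    out[drivable_mask & ~np.isin(out, [ROAD, CAR])] = ROAD
    return out
```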


Figure 4: Examples of semi-automatically fixing segmentation by adjusting confidence levels. From left to right: 1. front camera input; 2. segmentation after automatic fixing; 3. mid confidence level; 4. low confidence level. The confidence level is manually adjusted to eliminate most of the false annotations.

Thus, we can automatically fix the segmentation prediction and obtain more accurate dense scene labeling, which serves as the input for semi-automatic annotation. Some examples of relabeling are shown in Fig. 3.

3.2.3 Semi-automated Annotation with Confidence-level Adjustment

State-of-the-art semantic segmentation methods utilize deep convolutional architectures that are good at detecting image features such as shapes, edges, and colors. While such models may not generalize well to unseen or edge-case situations appearing in some parts of a full scene image, most other parts are predicted with reasonably good performance. In order to combine the strengths of human perception and machine efficiency, we propose a semi-automated annotation framework that performs image annotation by adjusting the predictions of strong deep neural networks.

For a deep convolutional neural network performing semantic image segmentation, a common form of prediction is

Y_{i,j} = \arg\max_{c} \, y_{i,j,c}, \qquad \forall\, i \in \{1, \dots, h\},\ j \in \{1, \dots, w\},

where h and w are the height and width of the image and c indexes the classes. The network predicts a probability for each class at each pixel and takes the class with the highest probability as its prediction. In this setting, a wrong prediction, usually caused by domain shift or poor generalization, is likely to have a lower probability than correctly predicted ones. Based on this observation, we design a semi-automatic annotation framework in which the human user can remove the prediction of a single class with a simple keyboard entry and adjust the confidence level to remove less-confident predictions, resulting in a coarsely annotated image with high precision. An example is shown in Fig. 4.
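A minimal sketch of this confidence-level adjustment is given below, assuming the network exposes its per-class probabilities as a (C, H, W) array; the threshold value and the unlabeled id are assumptions for illustration.

```python
# Confidence-level adjustment over a per-class probability map.
# The unlabeled id (255) and the threshold value are assumptions.
import numpy as np

UNLABELED = 255

def adjust_confidence(probs, threshold, removed_classes=()):
    """probs: (C, H, W) per-class probabilities from the network.
    Pixels whose top class was removed by the annotator, or whose top
    probability is below the chosen confidence level, are left unlabeled."""
    labels = probs.argmax(axis=0).astype(np.int32)   # (H, W) class map
    confidence = probs.max(axis=0)                   # (H, W) top probability
    labels[np.isin(labels, list(removed_classes))] = UNLABELED
    labels[confidence < threshold] = UNLABELED
    return labels

# Example: drop class 7 entirely and keep only predictions above 0.6.
# coarse = adjust_confidence(softmax_output, threshold=0.6, removed_classes=[7])
```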

3.2.4 Annotation Process

We first sample 67 video clips across different scene clusters from the MIT-AVT Clustered Driving Scene Dataset. The video clips are validated through a manual process to make sure the data quality is reasonable after the Automatic Segmentation Relabeling process. We then perform the Semi-automatic Interactive Segmentation process to further improve the annotation.

4. Research Directions

There are many potential research directions that can be pursued with this large-scale video scene dataset. We provide a few open research questions where people may find the dataset helpful:

Spatio-temporal semantic segmentation. We are interested in further research finding novel ways to utilize temporal data, such as optical flow and driving state, to improve perception beyond using static images only.

Predictive modeling. Can we know ahead of time what is going to happen on the road? Predictive power is an important component of human intelligence and can be crucial to the safety of autonomous driving. The dataset we provide is consistent in time and can therefore be used for predictive perception research.

Deep learning with video encoding. Most current deep learning systems are based on RGB-encoded images. However, preserving exact RGB values for every single frame of a video is too expensive in computation and storage.

Solving redundancy of video frames. How can we efficiently find useful data among visually similar frames? What is the best frame rate for a good perception system? One of the most important problems for real-time applications is the trade-off between efficiency and accuracy.

5. Conclusion

The MIT DriveSeg (Semi-auto) dataset has been presented to train and evaluate driving scene perception models at large scale, covering a wide range of real-world driving scenarios. The semi-automatic annotation methods proposed in this work can be further explored and extended to other computer vision tasks.


Acknowledgments

This work was in part supported by the Toyota Collaborative Safety Research Center. The views and conclusions expressed are those of the authors and do not necessarily reflect those of Toyota.

References

[1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

[4] L. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, A. Patsekin, J. Kindelsberger, L. Ding, et al. MIT Advanced Vehicle Technology Study: Large-scale naturalistic driving study of driver behavior and interaction with automation. IEEE Access, 7:102021–102038, 2019.

[5] L. Fridman, L. Ding, B. Jenik, and B. Reimer. Arguing machines: Human supervision of black box AI systems that make life-critical decisions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[8] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.

[9] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Appendix

A. License Agreement

The MIT DriveSeg (Semi-auto) Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, and scientific publications. Permission is granted to use the data given that you agree:

1. That the dataset comes “AS IS”, without express or implied warranty. Although every effort has been made to ensure accuracy, we (MIT, Toyota) do not accept any responsibility for errors or omissions.

2. That you include a reference to the MIT DriveSeg (Semi-auto) Dataset in any work that makes use of the dataset. For research papers, cite our preferred publication as listed on our website; for other media, cite our preferred publication as listed on our website or link to the website.

3. That you do not distribute this dataset or modified versions of it. It is permissible to distribute derivative works insofar as they are abstract representations of this dataset (such as models trained on it, or additional annotations that do not directly include any of our data) and do not allow the dataset, or anything similar in character, to be recovered.

4. That you may not use the dataset or any derivative work for commercial purposes, such as licensing or selling the data, or using the data with the purpose of procuring commercial gain.

5. That all rights not expressly granted to you are reserved by us (MIT, Toyota).
