
CloudCV: Deep Learning and Computer Vision in the Cloud

Harsh Agrawal

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Master of Science
in
Computer Engineering

Dhruv Batra, Chair
Devi Parikh
A. Lynn Abbott

May 2, 2016
Blacksburg, Virginia

Keywords: Deep Learning, Computer Vision, Cloud Computing
Copyright 2016, Harsh Agrawal


CloudCV: Deep Learning and Computer Vision in the Cloud

Harsh Agrawal

ABSTRACT

We are witnessing a proliferation of massive visual data. Visual content is arguably the fastest growing data on the web. Photo-sharing websites like Flickr and Facebook now host more than 6 and 90 billion photos, respectively. Unfortunately, scaling existing computer vision algorithms to large datasets leaves researchers repeatedly solving the same algorithmic and infrastructural problems. Designing and implementing efficient and provably correct computer vision algorithms is extremely challenging. Researchers must repeatedly solve the same low-level problems: building & maintaining a cluster of machines, formulating each component of the computer vision pipeline, designing new deep learning layers, writing custom hardware wrappers, etc. This thesis introduces CloudCV, an ambitious system that contains algorithms for end-to-end processing of visual content. The goal of the project is to democratize computer vision; one should not have to be a computer vision, big data and deep learning expert to have access to state-of-the-art distributed computer vision algorithms. We provide researchers, students and developers access to state-of-the-art distributed computer vision and deep learning algorithms as a cloud service through a web interface & APIs.


CloudCV: Deep Learning and Computer Vision in the Cloud

Harsh Agrawal

GENERAL AUDIENCE ABSTRACT

We are witnessing a proliferation of massive visual data. Visual content is arguably the fastest growing data on the web. Photo-sharing websites like Flickr and Facebook now host more than 6 and 90 billion photos, respectively. People upload more than 72 hours worth of content to Youtube every minute. In recent years, there has been a lot of progress in computer vision technology that can make computers understand this visual data. Facebook can today describe an image to a visually impaired person, Google can sort the photo collection present on our mobile phones into albums and stories and can detect people in images, and Tesla can run its cars on autopilot by analyzing the car's surroundings. This thesis introduces CloudCV, an ambitious system that contains such algorithms for end-to-end processing of visual content. The goal of the project is to democratize computer vision; one should not have to be a computer vision, big data and deep learning expert to have access to state-of-the-art distributed computer vision algorithms. We provide researchers, students and developers access to state-of-the-art distributed computer vision and deep learning algorithms as a cloud service through a web interface & APIs so that they can use them to build their own artificial intelligence (AI) systems.


To my parents and sister: for all the love, affection and guidance.
To Dr. Dhruv Batra and Dr. Devi Parikh: for being the best advisors one can hope for.


Acknowledgement

I am truly honored and privileged to have worked in the Machine Learning and Perception Lab. Working in the lab was an amazing experience and joining it was one of the best decisions I have ever made. First, I want to thank my advisor, Dr. Dhruv Batra, for providing an atmosphere where I could work on anything and everything I wanted to work on. I cannot imagine a better and more resourceful advisor than Dr. Batra. He helped me strike the right balance between working on interesting problems and managing CloudCV. Under his guidance, I learned important skills that not only were useful during my Masters but will also be valuable in my career: working on the right problems, coming up with efficient solutions, presenting my work and writing a good paper. He has been an amazing mentor and teacher. His class on introduction to machine learning was one of the best courses I have taken. I remember looking forward to his class every week and doing the homework, especially the Kaggle competitions. That course, along with the Deep Learning course, helped me understand the machine learning area better and made me more confident of my abilities as a researcher. I am also extremely grateful to have had the opportunity to work with Dr. Devi Parikh. I hope, someday, I will be able to have a work ethic and passion for the problem similar to hers. She has taught me how to plan ahead, be systematic and, consequently, be less stressed when a deadline is approaching. Her insights and intuitions for the problem, along with her meticulous and thorough nature, have really set the bar for me and influenced my approach to research. I also enjoyed the post-deadline dinners with them and the rest of the lab. It used to be the time when we would not only celebrate the hard work everybody put in, but also reflect back on things that could have been done better. The best part of these celebrations was the personal interactions with Dr. Batra and Dr. Parikh, which gave us insights into life in academia.

I am also grateful to Dr. Lynn Abbott for being part of the committee, for his input on this thesis and for his flexibility throughout this process. I have enjoyed interacting with him during poster presentations and the reading group.

Working with Clint and Neelima on CloudCV in the early days of the project was a lot of fun. Despite the stress of deadlines and demos breaking down at the last moment, it was a lot of fun and I learnt a lot. We were often dealing with things that we had never used and building things that we had never built, but in hindsight, it helped me better deal with uncertainties and is a skill I could not have learnt otherwise.


I was extremely fortunate to work with amazing people on CloudCV. Working with Ahmed, Deshraj, Shubham, Prashant and Mohit was a great way to spend summers building things that will hopefully benefit the computer vision community. Their hard work, dedication and willingness to solve difficult problems inspired us to think big and aim higher. Ahmed was also a great program manager and his participation helped us conduct the GSOC program smoothly and build a great collaborative environment for the students.

I am fortunate to have been part of a fantastic Computer Vision, Machine Learning and Perception group. The group consists of some of the smartest people I have met in my life, and having constant access to them made my graduate studies a very rewarding experience. I have had the pleasure of working with Abhishek on the Visual Attention project. His patience, knowledge about the deep learning frameworks we used and deep understanding of the problem made the discussions very fruitful. Working with him, I was also able to pick up some of the skills that I will find helpful in my career. I was also fortunate to team up with Neelima and Aroma for the Object Proposals project. Running hundreds of experiments for this project wouldn't have been possible without their support.

A person is defined by the company he keeps. I have made some great friends here and I hope our friendship will last a lifetime. I would like to take this opportunity to thank Rama, for his honest feedback, advice to constantly improve as a researcher, and for sharing the excitement on seeing new TMUX and VIM and other productivity tricks; Ram, for letting us feast on the delicious food he cooked, for his poor Hindi and for giving us our daily dose of laughter; Arjun, for those interesting hour-long discussions about life, the universe and science, and for teaching me a thing or two about work-life balance; and Ashwin and Khushi, for the useful study group discussions and for our hilarious coffee-break musings. Graduate school wouldn't have been as much fun without them.

I am also extremely lucky to have Shashank, Tanya and Manan as friends. They helped me navigate tricky moments better and were around to motivate me whenever I was unsure of my abilities. Without their support and encouragement, it is unlikely that I would have taken the decisions that I now consider to be some of my best decisions.

Most of all, I want to express my deepest gratitude to four very important people in my life: my parents for being the best teachers and resource for advice, my sister who has been my role model for her hard work and dedication, and last but certainly not the least, my niece Pritisha, who manages to bring a wide smile to my face on the busiest of days and in the toughest of times.


Contents

List of Figures

List of Tables

1 Introduction
1.1 Challenges
1.2 System Overview
1.3 Application
1.4 Contribution
1.5 Related publications

2 Related Efforts

3 System Architecture
3.1 CloudCV Back-end Infrastructure
3.1.1 Web Servers
3.1.2 Distributed Processing and Job Scheduler
3.1.3 Caffe - Deep Learning Framework
3.2 Front-end Platforms
3.2.1 Web interface
3.2.2 Python API
3.2.3 Matlab API

4 Functionalities
4.1 Visual Question Answering
4.2 Classification
4.3 Feature Extraction
4.4 Train a New Category
4.5 Object Detection
4.6 VIP: Finding Important People in Group Images
4.7 Gigapixel Image Stitching

5 Impact On Community
5.1 Google Summer of Code (GSOC), 2015
5.1.1 Open Source Contributions to Digits
5.1.2 CloudCV Containers for Easy Distribution and Installation
5.1.3 Improvements in Python API
5.2 Google Summer of Code, 2016

6 Future Work
6.1 CloudCV-fy your Vision/Deep-Learning Code
6.2 Build Deep Learning Models Online
6.3 Tutorials and Courses on CloudCV

7 Conclusion

Appendices

A Object-Proposals Evaluation Protocol is Gameable
A.1 Introduction
A.2 Related Work
A.3 Object Proposals Library
A.4 Evaluating Object Proposals
A.5 A Thought Experiment: How to Game the Evaluation Protocol
A.6 Evaluation on Fully and Densely Annotated Datasets
A.6.1 Fully Annotated Dataset
A.6.2 Densely Annotated Datasets
A.7 Bias Inspection
A.7.1 Assessing Bias Capacity
A.8 Conclusion

B Role of Visual Attention for Visual Question Answering (VQA)
B.1 Introduction
B.2 Dataset: Collection and Analysis
B.2.1 Types of Interfaces
B.2.2 Attention Map Examples
B.3 Experiments
B.4 Discussion
B.5 Conclusions

Bibliography


List of Figures

1.1 Overview of CloudCV.

3.1 Users can access CloudCV using a web interface, Python or MATLAB API. The back-end consists of web servers which communicate with the client in real-time through HTTP and Web-sockets. The job scheduler at the master node distributes incoming jobs across multiple compute servers (worker nodes).

3.2 Flow describing the execution of a job, starting from the user connecting to the CloudCV system to the back-end sending results to the user in real-time during execution of the job.

3.3 Celery Flow Chart.

3.4 (a) shows the upload section. Users can upload images either from his/her Dropbox or local disk, (b) shows the terminal which receives real-time progress updates from the server and (c) shows the visualization interface for a given job. Note that in this case the task was classification and the result displays the category and corresponding confidence for a given image.

3.5 The web interface allows users to upload images and save feature or model files inside his/her Dropbox account. (a) shows the upload interface where a user can select one or multiple images and (b) shows the save interface where the user can save all the data generated for the given job inside Dropbox. In this example, the user was trying to save features extracted in the form of Mat files inside a Dropbox account.

3.6 MATLAB API Screenshot: Users can access CloudCV services within MATLAB. These APIs run in the background so that while the user is waiting for a response, the user can run other tasks; the API call is non-blocking.

4.1 Visual Question Answering demo on the CloudCV website.

4.2 Model for Visual Question Answering described in [29]

4.3 CaffeNet model architecture

4.4 Classification Pipeline

4.5 Feature Extraction Pipeline

4.6 Train a New Category pipeline

4.7 VIP

4.8 Who are the most important individuals in these pictures?

4.9 VIP Pipeline

4.10 Gigapixel Image Stitching

4.11 Image stitching web interface.

A.1 (a) shows PASCAL annotations natively present in the dataset in green. Other objects that are not annotated but present in the image are shown in red; (b) shows Method 1 and (c) shows Method 2. Method 1 visually seems to recall more categories such as plates, glasses, etc. that Method 2 missed. Despite that, the computed recall for Method 2 is higher because it recalled all instances of PASCAL categories that were present in the ground truth. Note that the number of proposals generated by both methods is equal in this figure.

A.2 (a) shows PASCAL annotations natively present in the dataset in green. Other objects that are not annotated but present in the image are shown in red; (b) shows Method 1 and (c) shows Method 2. Method 1 visually seems to recall more categories such as lamps, picture, etc. that Method 2 missed. Clearly the recall for Method 1 should be higher. However, the calculated recall for Method 2 is significantly higher, which is counter-intuitive. This is because Method 2 recalls more PASCAL category objects.

A.3 Github page of the Object Proposals Library

A.4 Steps for generating proposals

A.5 Steps for evaluating proposals

A.6 Performance of different object proposal methods (dashed lines) and our proposed 'fraudulent' method (DMP) on the PASCAL VOC 2010 dataset. We can see that DMP significantly outperforms all other proposal generators. See text for details.

A.7 (a),(b) Distribution of object classes in PASCAL Context with respect to different attributes. (c),(d) Augmenting PASCAL Context with instance-level annotations. (Green = PASCAL 20 categories; Red = new objects)

A.8 Performance of different methods on PASCAL Context, MS COCO and NYU Depth-V2 with different sets of annotations.

A.9 Performance of RCNN and other proposal generators vs number of object categories used for training. We can see that RCNN has the most 'bias capacity' while the performance of other methods is nearly (or absolutely) constant.

B.1 All the three interfaces that were experimented with for the data collection process

B.2 Attention Map Examples


List of Tables

B.1 Human study to compare the quality of annotations collected by different interfaces

B.2 Evaluation of machine-generated attention maps


Chapter 1

Introduction

A recent World Economic Forum report [20] and a New York Times article [96] declared data to be a new class of economic asset, like currency or gold. Visual content is arguably the fastest growing data on the web. Photo-sharing websites like Flickr and Facebook now host more than 6 and 90 billion photos, respectively. Every day, users share 300 million more images on Facebook. Every minute, users upload 72 hours or 3 days worth of video to Youtube. Besides consumer data, diverse scientific communities (Civil & Aerospace Engineering, Computational Biology, Bioinformatics, and Astrophysics, etc.) are also beginning to generate massive archives of visual content [39,90,124], without necessarily having access to the expertise, infrastructure and tools to analyze them.

This data revolution presents both an opportunity and a challenge. Extracting value from this asset will require converting meaningless data into perceptual understanding and knowledge. This is challenging but has the potential to fundamentally change the way we live: from self-driving cars bringing mobility to the visually impaired, to in-home robots caring for the elderly and physically impaired, to augmented reality with Microsoft Hololens-like wearable computing units. With recent advancements in deep learning pushing the state of the art in applications such as object recognition, object detection and other domains of computer vision such as visual question answering [93,119], more companies are using these techniques to process visual data [21,22,24,25].

1.1 Challenges

In order to convert this raw visual data into knowledge and intelligence, we need to address a number of key challenges:

• Scalability. The key challenge for image analysis algorithms in the world of big data is scalability. In order to fully exploit the latest hardware trends, we must address the challenge of developing fully distributed computer vision algorithms. Unfortunately, scaling existing computer vision algorithms to large datasets leaves researchers repeatedly solving the same infrastructural problems: building and maintaining a cluster of machines, designing multi-threaded primitives for each algorithm and distributing jobs, pre-computing and caching features, etc.

Consider for instance the image categorization system by the Google/Stanford team [92]. The system achieved an impressive 70% relative improvement over the previous best known algorithm for the task of recognizing 20,000 object categories in the ImageNet dataset [53]. To achieve this feat, the system required a sophisticated engineering effort in exploiting model parallelism and had to be trained on a cluster with 2,000 machines (32,000 cores) for one week. While this is a commendable effort, the lack of such infrastructural support and intimate familiarity with parallelism in computer vision algorithms leaves most research groups marginalized, computer vision experts and non-experts alike. Very recently, NVIDIA also released a new server appliance platform dubbed DGX-1 at GTC 2016, its annual conference, at a price of $129,000. Though such systems help reduce the training time for a classification model on ImageNet from 150 hours on a CPU to 2 hours on a GPU, they are too expensive for most research labs and developers.

• Provably Correct Parallel/Distributed Implementations. Designing and implementing efficient and provably correct parallel computer vision algorithms is also extremely challenging. Some tasks, like extracting statistics from image collections, are embarrassingly parallel, i.e., they can be parallelized simply by distributing the images to different machines. This is where frameworks such as MapReduce have demonstrated success. Unfortunately, most tasks in computer vision and machine learning, such as training a face detector, are not embarrassingly parallel: there are data and computational dependencies between images and various steps in the algorithm. Moreover, for each such parallel algorithm, researchers must solve the same low-level problems over and over again: formulating parallelizable components in computer vision algorithms, designing multi-threaded primitives, writing custom hardware wrappers, implementing mechanisms to avoid race conditions, deadlocks, etc.

• Reusability. Computer vision researchers have developed vision algorithms that solve specific tasks, but software developers building end-to-end systems find it extremely difficult to integrate these algorithms into their systems due to different software stacks, dependencies and data formats. For instance, the advancements made in deep learning have enabled a lot of open-source tools for building deep neural networks. However, with more than 50 deep learning tools [23] available, the user is left with heterogeneous open-source implementations in different languages like Python, Lua, C++ and Julia that are difficult to incorporate in a software pipeline. Additionally, hardware designers have developed various dedicated computer vision processing platforms to overcome the problem of intensive computation. However, these solutions have created another problem: heterogeneous hardware platforms have made it time-consuming and difficult to port computer vision systems from one hardware platform to another.

Figure 1.1: Overview of CloudCV.

1.2 System Overview

In order to overcome these challenges, we are building CloudCV, a comprehensive system that will provide access to state-of-the-art distributed computer vision algorithms on the cloud.

As shown in Fig. 1.1, CloudCV today consists of a group of virtual machines running on Amazon Web Services capable of running a large number of tasks in a distributed and parallel setting. Popular datasets are already cached on these servers to facilitate researchers trying to run popular computer vision algorithms on them. Users can access these services through a web interface which allows them to upload a few images from either Dropbox or a local system and obtain results in real-time. For larger datasets, the system makes it possible to embed CloudCV services into a bigger end-to-end system by utilizing the Python and MATLAB APIs. Since the APIs are fairly easy to install through standard package managers, researchers can quickly run image analysis algorithms on huge datasets in a distributed fashion without worrying about infrastructure, efficiency, algorithms and technical know-how. At the back-end, on receiving the list of images and the algorithm that needs to be executed, the server distributes these jobs to worker nodes that process the data in parallel and communicate the results to the user in real time. Therefore, the user does not need to wait for the processing to finish on the entire dataset and can monitor the progress of the job through streaming real-time updates.

1.3 Application

CloudCV will benefit three different audiences in different ways:

• Computer vision researchers: who do not have the resources to set up the necessary infrastructure or do not want to reinvent a large-scale distributed computer vision system. For such users, CloudCV can serve as a unified data and code repository, providing cached versions of all relevant data representations and features. We envision a system where a program running on CloudCV simply "calls" for a feature; if it is cached, the features are immediately loaded from distributed storage such as HDFS [121]; if it is not cached, then the feature extraction code is run seamlessly in the background and the results are cached for future use. Eventually, CloudCV becomes the ultimate repository for "standing on the shoulders of giants".

• Scientists who are not computer vision experts: but have large image collections that need to be analyzed. Consider a biologist who needs to automate the process of cell-counting in microscopy images. Today such researchers must find computer vision collaborators and then invest in the logistical infrastructure required to run large-scale experiments. CloudCV can eliminate both these constraints by providing access to state-of-the-art computer vision algorithms and compute-time on the cloud.

• Non-scientists: who simply want to learn about computer vision by demonstration. There is a tremendous demand from industry professionals and developers for learning about computer vision. Massive Open Online Courses (MOOCs) like Udacity and Coursera have demonstrated success. CloudCV can build on this success by being an important teaching tool for learning computer vision by building simple apps on the cloud. Imagine a student writing 4 lines of code in the CloudCV development environment to run a face detector on a stream of images captured from their laptop webcam.

1.4 Contribution

The primary thrust of this thesis is CloudCV. First, we describe the architecture and various functionalities of CloudCV. We also briefly mention how we have been able to engage with the open source community and the future directions of the project. Apart from CloudCV, we also describe two other contributions: an object proposals toolbox that allows researchers to run analyses of recent object proposal methods, and a study to understand the role of visual attention for the task of visual question answering.

The outline of the thesis is as follows:

• In Chapter 2, we put our work in the context of related efforts in this direction. We discuss the various types of computer vision and deep learning software that are popular today.

• In Chapter 3, we describe in detail the system architecture of CloudCV. First, we talk about the components of the back-end architecture such as the web servers and the job scheduler. This chapter also describes the lifecycle of a job, starting from the initiation of a request by the user to sending back results. Next, we describe the three front-end platforms through which users can access the CloudCV services. CloudCV consists of various components that are built on top of existing open source implementations, and this chapter explains how we integrated these components to build a complex pipeline that allows us to run deep learning algorithms in the cloud.

• In Chapter 4, we describe the various functionalities of CloudCV such as Classification, Feature Extraction, Train a New Category and Visual Question Answering.

– Some of the functionalities discussed in this chapter were built in collaboration with Clint, Yash and Neelima.

• In Chapter 5, we talk about the impact of the project on the student community. We discuss our past experience of working as a mentoring organization in an open source program for students. The team mentored three undergraduate students who continue to contribute to CloudCV and improve the system.

• In Chapter 6, we briefly mention some of the new features planned for the next release of CloudCV.

• In Appendix A, we present the Object Proposals Toolbox. We use the toolbox to conduct a thorough experimental analysis of the existing evaluation protocol for object proposals and show that the evaluation protocol is gameable. We also propose ways to alleviate this problem.

– The work presented in this chapter was joint work with Neelima and Aroma.

• In Appendix B, we look at the popular task of visual question answering (VQA) in the field of computer vision and study the role of visual attention in solving the task. We collect a new dataset to model visual attention and present deep learning architectures that incorporate this data to solve VQA.

– The work presented in this chapter was joint work with Abhishek Das.


1.5 Related publications

Most ideas described in this thesis have appeared in the following publications:

• CloudCV: Large Scale Distributed Computer Vision as a Cloud Service.
Harsh Agrawal, Clint Solomon Mathialagan, Yash Goyal, Neelima Chavali, Prakriti Banik, Akrit Mohapatra, Ahmed Osman, Dhruv Batra.
Book Chapter, Mobile Cloud Visual Media Computing. Editors: Gang Hua, Xian-Sheng Hua. Springer, 2015.

• Object-Proposal Evaluation Protocol is 'Gameable'.
Neelima Chavali∗, Harsh Agrawal∗, Aroma Mahendru∗, Dhruv Batra.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

∗ equal contribution


Chapter 2

Related Efforts

Before describing the architecture and capabilities of CloudCV, let us first put it in the context of related efforts in this direction. Open-source computer vision software can be broadly categorized into three types:

• General-Purpose Libraries: There are a number of general purpose computer vision libraries available in different programming languages:

– C/C++: OpenCV [42], IVT [8], VXL [16]

– Python: OpenCV (via wrappers), PyVision [40]

– .NET: AForge.NET [2].

The most comprehensive effort among these is OpenCV, which is a library aimed at real-time computer vision. It contains more than 2500 algorithms and has been downloaded more than 5 million times by a user community of 47K people [42]. The library has C, C++, Java and Python interfaces and runs on Windows, GNU/Linux, Mac, iOS & Android operating systems.

• Narrow-Focus Libraries: A number of toolboxes provide specialized implementations for specific tasks, e.g. the Camera Calibration Toolbox [41], Structure-from-Motion toolboxes [69, 123, 136], the Visual Features Library [131], and deep learning frameworks such as Caffe [79], Theano [36,38], Torch [17], etc.

• Specific Algorithm Implementations: released by authors on their respective websites. Popular examples include object detection [66], articulated body-pose estimation [139], graph-cuts for image segmentation [84], etc.

Unfortunately, all three source-code distribution mechanisms suffer from at least one of these limitations:


1. Lack of Parallelism: Most of the existing libraries have fairly limited or no support for parallelism. OpenCV and VLFeat, for instance, have multi-threading support, allowing programs to utilize multiple cores on a single machine. Unfortunately, modern datasets are so large that no single machine may be able to hold all the data. This makes it necessary to distribute jobs (with computational & data dependencies) on a cluster of machines. CloudCV will have full support for three levels of parallelism: i) single machine with multiple cores; ii) multiple machines in a cluster with distributed storage; and iii) "cloud-bursting" or dynamic allocation of computing resources via a professional elastic cloud computing service (Amazon EC2 [3]).

2. Burden on the User not the Provider: Today, computer vision libraries place infrastructural responsibilities squarely on the user of these systems, not the provider. The user must download the said library, resolve dependencies, compile code, arrange for computing resources, and parse bugs and faulty outputs. CloudCV will relieve the user of such burdens: the user uploads the data (or points to a cached database in the repository) and simply specifies what computation needs to be performed.

Finally, we stress that CloudCV is not another computer vision toolbox. Our focus is not on re-implementing algorithms; rather, we build on the success of comprehensive efforts such as OpenCV, Caffe & others. Our core contribution will be to provide fully distributed implementations on the cloud and make them available as a service.

Efforts closest to the goal of CloudCV: Multiple online services exist which provide specific algorithms such as face, concept and celebrity recognition [12], audio and video understanding [5], and personalized object detectors [18] as a service. Recently, Google, Microsoft and others have started to provide specific vision and natural language algorithms through software as a service (SaaS) platforms [19,26,27]. Unlike these services, CloudCV is an open-source architecture that aims to provide the capability of running users' own algorithms and systems on cloud services such as Amazon Web Services, Microsoft Azure, etc. This will also enable researchers to make their own architectures and models available as a service.


Chapter 3

System Architecture

3.1 CloudCV Back-end Infrastructure

In this section, we describe in detail all the components that form the back-end architecture of CloudCV.

The back-end system shown in Fig. 3.1 mainly consists of a web server that is responsible for listening to incoming job requests and sending real-time updates to the user. A job scheduler takes these incoming jobs and distributes them across a number of worker nodes. The system uses a number of open-source frameworks to ensure an efficient design that can scale to a production system.

3.1.1 Web Servers

The back-end consists of two servers that are constantly listening for incoming requests. We use a Python based web framework (Django) which handles Hypertext Transfer Protocol (HTTP) requests made by the web interface or the APIs. These requests contain details about the job such as the list of images, which executable to run, executable parameters, user information, etc. One drawback of HTTP requests is that they allow only a single request-response pair, i.e., for a given request the server can only return one response, after which the connection breaks and the server cannot communicate with the client unless the client sends another request. This leads to serious limitations because a persistent real-time connection cannot be established for the server to send updates to the user. To solve this problem, we use the web-socket protocol (Socket.IO) on top of another server (Node.js).


Figure 3.1: Users can access CloudCV using a web interface, Python or MATLAB API. The back-end consists of web servers which communicate with the client in real-time through HTTP and Web-sockets. The job scheduler at the master node distributes incoming jobs across multiple compute servers (worker nodes).

Django

CloudCV uses Django [6], which is a high-level Python HTTP web framework based on the Model View Controller (MVC) pattern. MVC defines a way of developing software so that the code for defining and accessing data (the model) is separate from the request routing logic (the controller), which in turn is separate from the user interface (the view).

A key advantage of such an approach is that components are loosely coupled and each serves a single key purpose. The components can be changed independently without affecting the other pieces. For example, a developer can change the URL for a given part of the application without affecting the underlying implementation. A designer can change a page's HTML code without having to touch the Python code that renders it. A database administrator can rename a database table and specify the change in a single place, rather than having to search and replace through a dozen files.


Scaling up to serve thousands of web requests is a crucial requirement. Django adopts a "share nothing" philosophy in which each part of the web stack is broken down into single components so that inexpensive servers can be added or removed with minimum fuss.

In the overall CloudCV back-end architecture, Django is responsible for serving the web pages, translating requests into jobs and calling the job scheduler to process these jobs. The output of the algorithm that the job is running is pipelined to a message queue system. The receiver of the message queue system sends the output back to the user. In CloudCV, the message queue system is Redis and the receiver is Node.js; both of these are explained in the next two sections.
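To make the request-to-job translation concrete, the following is a minimal, illustrative Django view; the endpoint name, task name and payload fields are assumptions rather than CloudCV's actual code (a sketch of the corresponding Celery task appears in Section 3.1.2).

# Hypothetical sketch of a Django view that turns an HTTP request into a job.
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

from tasks import classify_images  # hypothetical Celery task (see Section 3.1.2)


@csrf_exempt
def submit_job(request):
    job = json.loads(request.body)          # e.g. {"images": [...], "socketid": "..."}
    # Hand the job to the scheduler; progress streams back via Redis and Node.js.
    result = classify_images.apply_async(args=[job['images'], job['socketid']],
                                         queue='classification')
    return JsonResponse({'job_id': result.id})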

Node.js

Node.js [10] is an event-driven web framework that excels in real-time applications such as push updates and chat applications.

CloudCV uses Node.js for real-time communication with the user, so that all updates related to a particular job can be communicated to the user.

Unlike traditional frameworks that use the stateless request-response paradigm, such as Django, Node.js can establish a two-way communication channel with the client so that the server can send updates without the client needing to query the server to check for them. This is in contrast to the typical web response paradigm, where the client always initiates communication. Real-time communication with the client is important because completing a job that contains large amounts of data will take some time; delaying communication with the client until the end of the job makes for a poor user experience, and having the client query the server periodically is wasteful.

The de facto standard for building real-time Node.js applications is Socket.IO [14]. It is an event-based bi-directional communication layer which abstracts many low-level details and transports, including AJAX long-polling and WebSockets, into a single cross-browser compatible API. Whenever an event is triggered inside the Node.js server, an event callback mechanism can send a response to the client.
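From the client's side, subscribing to these events takes only a few lines. The sketch below uses the python-socketio client purely as an illustration; the server URL and the 'job_update' event name are assumptions, not the actual event names used by CloudCV.

# Illustrative Socket.IO client that listens for real-time job updates.
import socketio

sio = socketio.Client()


@sio.event
def connect():
    print('connected; socket id:', sio.sid)


@sio.on('job_update')               # hypothetical event name
def on_job_update(data):
    print('progress update from server:', data)


sio.connect('http://cloudcv.org')   # the production endpoint may differ
sio.wait()                          # keep listening until disconnected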

Redis

One of the use-cases of real-time communication with the user is the ability to send algorithm output to the user during execution. To make this possible, there needs to be a system in place that can pipeline the algorithm output to the Node.js server, which is responsible for communicating the output back to the client.

In the case of CloudCV, this system is Redis [13], a high-performance in-memory key-value data store. Since the data is stored in RAM (in-memory), looking up keys and returning a value is very fast.

Figure 3.2: Flow describing the execution of a job, starting from the user connecting to the CloudCV system to the back-end sending results to the user in real-time during execution of the job.

Redis can also act as a message queue between two processes: the worker process executing a particular algorithm and the Node.js server. Whenever a text output is generated by the executable, the worker process sends the output string through Redis. Node.js triggers an event whenever it receives the message from the message queue. The message consists of the output string and the socket id of the client to which this output needs to be sent. Consequently, the event handler sends the output string to the user associated with that particular socket id.
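A minimal sketch of the worker-side publish step is shown below, using the redis-py client; the channel name and message fields are illustrative assumptions, not CloudCV's exact wire format.

# Illustrative worker-side publish of intermediate output to Node.js via Redis.
import json
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)


def send_update(socket_id, text):
    # Node.js subscribes to this (hypothetical) channel and relays the payload
    # to the client identified by socket_id over Socket.IO.
    payload = {'socketid': socket_id, 'output': text}
    r.publish('job_updates', json.dumps(payload))


send_update('abc123', 'Processed 10/100 images')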

Fig. 3.2 describes the process of executing a job in detail. The flow is as follows:

1. At the start of a job, the user establishes a two-way socket connection with the server. Each user is recognized by the unique socket id associated with this connection.

2. The details about the job, such as the list of images to process, the name of the functionality that needs to be executed and its associated parameters, are sent to the server using an HTTP request.

3. The server saves the images in the database and sends a response back to the user.


Figure 3.3: Celery Flow Chart.

4. The server then distributes the job to worker nodes by serializing all the data. An idle worker node pops the job from the queue, fetches the image from the network file server and starts executing the functionality associated with the job.

5. Whenever the executable generates an output, the worker node informs the master node by sending the generated output through a message queue.

6. Upon receiving the message, the master node sends the output to the client. This is made possible by the event-driven framework of Node.js (as explained in the previous sections).


3.1.2 Distributed Processing and Job Scheduler

Celery

Celery [4] is an asynchronous task queue based on distributed message passing. The execution units, called tasks, are executed concurrently on one or more worker servers using their multiprocessing architecture. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).

The CloudCV infrastructure contains a heterogeneous group of virtual machines that act as worker nodes, also called 'consumers'. The master node (the 'producer'), on receiving a job request, converts the request into a task by serializing the input data using a format such as JSON [9] and sends it to a 'broker'. The job of the broker is to receive a task from the producer and send it to a consumer. The broker consists of two components: an exchange and queues. Based on certain bindings or rules, the exchange sends each task to a particular queue. For instance, GPU-optimized tasks (such as image classification, Section 4.2) are sent to the 'Classification Queue', which is processed by worker nodes that have GPUs. On the other hand, image stitching tasks that utilize multiple CPUs are sent to CPU-only machines via the 'Image Stitching Queue'. A queue is simply a buffer that stores the messages.

This protocol is known as the AMQP protocol [1], and Celery abstracts away the details of the protocol efficiently, allowing the system to scale.
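As an illustration of this routing, the sketch below configures a Celery application with two queues mirroring the split described above; the broker URL, task names and queue names are assumptions, not CloudCV's actual configuration.

# Illustrative Celery app with per-queue routing of GPU- and CPU-bound tasks.
from celery import Celery

app = Celery('cloudcv', broker='amqp://guest@localhost//')

# Route GPU-bound classification tasks and CPU-bound stitching tasks
# to different queues, which are consumed by different worker machines.
app.conf.task_routes = {
    'tasks.classify_images': {'queue': 'classification'},
    'tasks.stitch_images': {'queue': 'image_stitching'},
}


@app.task(name='tasks.classify_images')
def classify_images(image_paths, socket_id):
    # ... run the Caffe classification pipeline and publish updates via Redis ...
    return {'status': 'done', 'count': len(image_paths)}


# Producer side (master node): serialize the job and hand it to the broker.
# classify_images.apply_async(args=[['img1.jpg'], 'abc123'], queue='classification')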

GraphLab

GraphLab [97] is a high-level abstraction for distributed computing that efficiently and intuitively expresses data and computational dependencies with a sparse data-graph. Unlike other abstractions such as Map-Reduce, computation in GraphLab is expressed as a program which is executed in parallel on each processing node (potentially on different machines), while maintaining data consistency between machines and appropriate locking.

We implemented a parallel image stitching algorithm by creating GraphLab wrappers for the image stitching pipeline in OpenCV [42], a widely used open-source computer vision library. The implementation is open-source and is available in GraphLab's computer vision toolkit [15].

3.1.3 Caffe - Deep Learning Framework

Caffe [79] is a deep learning framework initially developed by the Berkeley Vision group and now an open-source project with multiple contributors. In performance tests, it consistently ranks as one of the fastest Convolutional Neural Network (CNN) implementations available online. Caffe is widely used in academic research projects and in industrial applications pertaining to vision, speech, etc. A number of state-of-the-art models implemented in Caffe are publicly available for download.

CloudCV uses Caffe at the back-end to provide services such as classification, feature extraction and object detection. CloudCV also allows adding a new category to a pre-trained CNN model without retraining the entire model. This functionality is described in detail in Section 4.4.

3.2 Front-end Platforms

CloudCV computer vision algorithms are accessible via three front-end platforms: 1) Web interface, 2) Python API, and 3) MATLAB API.

3.2.1 Web interface

Modern web browsers offer tremendous capabilities in terms of accessing online content, multimedia, etc. We built a web interface available at http://cloudcv.org so that users can access CloudCV from any device via any operating system without having to install any additional software.

As illustrated in the screen capture in Fig. 3.4, users can test CloudCV services by trying them out on a few images uploaded through the local system or from third-party cloud storage such as Dropbox (shown in Fig. 3.5).

We are also working on providing user authentication such that users can have access to all the trained models, training images, and job history. This will enable the user to seamlessly transfer data across multiple data sources.

3.2.2 Python API

To enable building end-to-end applications on top of CloudCV, we make our services accessible via a Python API.

Python has seen significant growth in terms of libraries developed for scientific computation because of its holistic language design. It also offers an interactive terminal and user interface, which makes data analysis, visualization and debugging easier.

Loading necessary packages: To use the CloudCV Python API, a user only needs to import the PCloudCV class.

from pcloudcv import PCloudCV

import utility.job as uj


import json

import os

At this point, the pcloudcv object may be used to access the various functionalities provided in CloudCV. These functionalities are detailed in Section 4.

Setting the configuration path: When used in the above manner, the user needs to provide details about the job (executable name, parameter settings, etc.) for each such API call. In order to reduce this burden, our API includes a configuration file that stores all such necessary job information. A sample configuration file is shown below:


1 {2 "exec": "classify",

3 "maxim": 500,

4 "config": [

5 {6 "name": "ImageStitch",

7 "path": "dropbox:/1/",

8 "output": "/home/dexter/Pictures/test_download",

9 "params": {10 "warp": "plane"

11 }12 },13 {14 "name": "classify",

15 "path": "local: /home/dexter/Pictures/test_download

/3",

16 "output": "/home/dexter/Pictures/test_download",

17 "params": {18 }19 },20 {21 "name": "features",

22 "path": "local: /home/dexter/Pictures/test_download

/3",

23 "output": "/home/dexter/Pictures/test_download",

24 "params": {25 "name": "decaf",

26 "verbose": "2",

27 }28 }29 ]

30 }

The user must simply provide the full path to the configuration file.

# full path of the config.json file
config_path = os.path.join(os.getcwd(), "config.json")

# optional job settings; overrides the "exec" entry in the configuration file
job_opts = {"exec": "classify"}

Creating a PCloudCV object: To run a job, the user simply needs to create a PCloudCV object. The constructor takes the path to the configuration file, a dictionary that contains optional settings for the input directory, output directory and executable, and a Boolean parameter that tells the API whether the user wishes to log in to his/her account using third party authorization (Google accounts or Dropbox). If the Boolean parameter is false, then the job request is treated as anonymous.

p = PCloudCV(config_path, job_opts, True)

p.start()

3.2.3 Matlab API

MATLAB is a popular high-level language and interactive environment that offers high-performance numerical computation, data analysis, visualization capabilities, and application development tools. MATLAB has become a popular language in academia, especially for computer vision researchers, because it provides easy access to thousands of low-level building-block functions and algorithms written by experts, in addition to those specifically written by computer vision researchers. Therefore, CloudCV includes a MATLAB API, as shown in the screenshot in Fig. 3.6.


(a) Upload interface (b) Terminal updates

(c) Result Visualization

Figure 3.4: (a) shows the upload section. Users can upload images either from his/her Dropbox or local disk, (b) shows the terminal which receives real-time progress updates from the server and (c) shows the visualization interface for a given job. Note that in this case the task was classification and the result displays the category and corresponding confidence for a given image.


(a) Upload interface (b) Terminal updates

Figure 3.5: The web interface allows users to upload images and save feature or model files inside his/her Dropbox account. (a) shows the upload interface where a user can select one or multiple images and (b) shows the save interface where the user can save all the data generated for the given job inside Dropbox. In this example, the user was trying to save features extracted in the form of Mat files inside a Dropbox account.


Figure 3.6: MATLAB API Screenshot: Users can access CloudCV services within MATLAB. These APIs run in the background so that while the user is waiting for a response, the user can run other tasks; the API call is non-blocking.


Chapter 4

Functionalities

We now describe the functionalities and algorithms currently implemented in CloudCV.

4.1 Visual Question Answering

In order to completely solve artificial intelligence, multi-modal understanding of vision and natural language is paramount. To evaluate a system's fine-grained understanding of both image and language, the task of visual question answering was introduced recently. In this task, the system takes as input an image and a question about that image and produces an answer as output, as shown in Fig. 4.1.

The model used for this functionality is based on the VQA model used in [29] and is shown in Fig. 4.2. The VQA model uses an LSTM unit to convert the question into a 1024-dimensional encoding. The LSTM model takes the word embedding representation of the question words as input, and the image features (the same CNN features as above) are passed through a linear transformation to 1024 dimensions to match the LSTM encoding of the question. The question and image encodings are fused via element-wise multiplication. The fused features are then passed through a multi-layer perceptron (MLP) neural network classifier with 2 hidden layers of 1000 hidden units each. Each fully connected layer is followed by a dropout layer with a dropout ratio of 0.5 and a tanh non-linearity. The output is a 1000-way Softmax classifier that predicts one of the top-1000 answers in the training dataset. It has been observed in [29] that classifying into the 1000 most frequent answers covers 82.67% of the answers present in the train and validation datasets.
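The sketch below is an illustrative PyTorch re-implementation of the fusion architecture described above; the original model in [29] was not built with this code, so the class, hyperparameters copied from the text, and variable names are assumptions for exposition only.

# Illustrative sketch of the LSTM + CNN-feature fusion model for VQA.
import torch
import torch.nn as nn


class VQAModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024,
                 img_feat_dim=4096, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)     # image -> 1024-d
        self.classifier = nn.Sequential(                        # MLP, 2 hidden layers
            nn.Linear(hidden_dim, 1000), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(1000, 1000), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(1000, num_answers))                       # logits for softmax

    def forward(self, question_tokens, image_features):
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_enc = h_n[-1]                          # 1024-d question encoding
        i_enc = torch.tanh(self.img_proj(image_features))
        fused = q_enc * i_enc                    # element-wise multiplication
        return self.classifier(fused)            # scores over the top-1000 answers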


Figure 4.1: Visual Question Answering demo on the CloudCV website.

4.2 Classification

'Image Classification' refers to predicting the class labels of objects present in an image. This finds myriad applications in visual computing. Knowing what object is visible to the camera is an immense capability in mobile applications. CloudCV image classification tackles this problem in the cloud. The classification API can be invoked to get a list of the top five objects present in the image with the corresponding confidence scores.

The CloudCV classification implementation uses the 'CaffeNet' model (bvlc_reference_caffenet in Caffe), shown in Fig. 4.3, which is based on the AlexNet [87] architecture. The AlexNet architecture consists of 5 convolutional layers and 3 fully connected layers. The last fully connected layer (also known as the FC8 layer) has 1000 nodes, each node corresponding to one ImageNet category.
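For reference, the snippet below shows how such a top-five prediction could be produced locally with the pycaffe interface; the file paths are placeholders and the 227x227 input size of CaffeNet is assumed, so this is a sketch rather than the code running behind the CloudCV API.

# Illustrative top-5 classification with pycaffe and the CaffeNet model.
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 227, 227)        # single-image batch

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))     # HWC -> CHW
transformer.set_raw_scale('data', 255)           # [0, 1] -> [0, 255]
transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR

image = caffe.io.load_image('example.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)
probs = net.forward()['prob'][0]                 # softmax over 1000 ImageNet classes
top5 = probs.argsort()[::-1][:5]
print(top5, probs[top5])                         # category indices and confidences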

4.3 Feature Extraction

It has been shown [58,114] that features extracted from the activations of a deep convolutional network trained in a fully-supervised fashion on an image classification task (with a fixed but large set of categories) can be utilized for novel generic tasks that may differ significantly from the original task of image classification. These features are popularly called DeCAF features. A computer vision researcher who just needs DeCAF features on his/her dataset is currently forced to set up the entire deep learning framework, which may or may not be relevant otherwise. CloudCV alleviates this overhead by providing APIs that can be used to extract DeCAF features on the cloud and then download them as a 'mat' file for further use.


Figure 4.2: Model for Visual Question Answering described in [29]

Figure 4.3: CaffeNet model architecture

The CloudCV feature extraction implementation uses the same architecture as Fig. 4.3. The DeCAF features are the activations in the second-to-last fully connected layer (also known as the FC7 layer), which consists of 4096 nodes. The CaffeNet model uses the FC7 activations computed from 10 sub-images: 4 corner regions, the center region and their horizontal reflections. Therefore, the output is a matrix of size (10, 4096).
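The following is a minimal sketch of how these (10, 4096) FC7 features could be extracted locally with pycaffe and saved as a 'mat' file; the model paths are placeholders, and this is shown for illustration rather than as the exact server-side implementation.

# Illustrative extraction of 10-crop FC7 (DeCAF) features with pycaffe.
import scipy.io as sio
import caffe

# caffe.Classifier handles resizing, channel swapping and 10-crop oversampling.
clf = caffe.Classifier('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel',
                       image_dims=(256, 256), raw_scale=255,
                       channel_swap=(2, 1, 0))

img = caffe.io.load_image('example.jpg')
clf.predict([img], oversample=True)          # forward pass over the 10 crops
fc7 = clf.blobs['fc7'].data.copy()           # shape (10, 4096)

sio.savemat('features.mat', {'fc7': fc7})    # download-ready 'mat' file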

4.4 Train a New Category

Note: This functionality was added to CloudCV with the help of Yash Goyal.

The classification task described above is limited to a pre-defined set of 1000 ImageNet categories. In a number of situations, a user may need a classification model with categories other than


Figure 4.4: Classification Pipeline

ImageNet but may not have sufficient data or resources to train a new model from scratch. CloudCV contains a ‘Train a New Category’ capability that can be used to efficiently add new categories to the existing CaffeNet model with 1000 ImageNet categories.

The new model is generated by appending additional nodes to the last fully connected layer (the FC8 layer) of the existing CaffeNet model. Each new node corresponds to a new category. The weights and biases for these additional nodes are computed using Linear Discriminant Analysis (LDA), which is equivalent to learning a Gaussian Naive Bayes classifier with equal covariance matrices for all categories. All other weights and biases are kept the same as in the existing CaffeNet model.

The LDA weight vector $w_k$ and bias $b_k$ for a new category $k$ are computed as:

$$w_k = \Sigma^{-1}\mu_k, \qquad b_k = \log \pi_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k \tag{4.1}$$

where $\pi_k$ is the prior probability of the $k$-th category, $\Sigma$ is the covariance matrix of the FC7 (second-to-last fully connected layer in the CaffeNet model) feature vectors, and $\mu_k$ is the mean of the FC7 feature vectors of the given training images for the new category. The prior distribution is assumed to be uniform for this demo, so the prior probability $\pi_k$ is simply the reciprocal of the number of categories. Notice that the covariance matrix $\Sigma$ can be computed offline using all images in the ImageNet training dataset, and its inverse can be cached. This is the most computationally expensive step in calculating the new parameters (weights and biases), but


Figure 4.5: Feature Extraction Pipeline

Figure 4.6: Train a New Category pipeline

is done once offline. The mean vector $\mu_k$ is computed from the training images for the new category in real time. Thus, a new category can be added to the network instantaneously!
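A minimal NumPy sketch of this update is shown below; the function name and arguments are illustrative, assuming the inverse covariance matrix has already been cached offline as described.

```python
import numpy as np

def lda_new_category(fc7_train, sigma_inv, num_categories):
    """Compute FC8 weights/bias for one new category from its FC7 training features.

    fc7_train:       (N, 4096) FC7 features of the new category's training images
    sigma_inv:       (4096, 4096) cached inverse covariance of FC7 features (offline)
    num_categories:  total number of categories, used for the uniform prior
    """
    mu_k = fc7_train.mean(axis=0)                 # class mean, computed in real time
    pi_k = 1.0 / num_categories                   # uniform prior
    w_k = sigma_inv.dot(mu_k)                     # w_k = Sigma^{-1} mu_k
    b_k = np.log(pi_k) - 0.5 * mu_k.dot(sigma_inv).dot(mu_k)  # Eq. (4.1)
    return w_k, b_k
```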

We have also experimented with fine-tuning the Softmax layer, and the entire network, from this LDA initialization; however, that is useful only when significant training data is available for the new category.

4.5 Object Detection

The computer vision community shares a strong sentiment that to accelerate advancements in the field, there is a need for more data. Datasets and corresponding benchmarks like PASCAL VOC [63] and ImageNet [117] have played a crucial role in


advancing computer vision algorithms. One of the important problems in computer vision is detecting and localizing objects in an image. The ImageNet dataset was collected to develop new techniques for object detection that can leverage large-scale data and to evaluate object recognition and detection algorithms. It consists of 1,461,406 images of 1000 object classes. There are two tasks associated with the dataset. In the first task, the algorithm needs to detect each instance of the 200 object categories present in the image; each annotation consists of the class label and a bounding box. In the second task, the algorithm needs to predict whether a particular object category is present in an image or not. These two tasks have allowed for unprecedented advancement in object recognition and detection. Unfortunately, with minor exceptions, such challenges also result in massive duplication of effort, with each research group developing its own infrastructure and codebase.

CloudCV tries to unify these fragmented efforts by acting as a unified data and code repository. We provide cached versions of 16 popular features for all 1.2 million images in the ILSVRC2014 classification and localization dataset that the community can build on. These features amount to terabytes of data, and having them readily accessible allows research teams to focus on the other parts of the pipeline. The list of features can be found on the CloudCV webpage1.

Recent advances in object detection are driven by the success of object proposals. Object proposals are a set of candidate regions in an image that may potentially contain an object. Object proposal algorithms have replaced the sliding window approach in the object detection pipeline. To allow researchers to run different object proposal algorithms, we released a MATLAB toolbox that contains most of the recent object proposal approaches. We describe the toolbox and the relevant analysis in Appendix A.

4.6 VIP: Finding Important People in Group Images

Note: This functionality was added to CloudCV in collaboration with Clint S. Mathialagan.

When multiple people are present in a photograph, there is usually a story behind the situation that brought them together: a concert, a wedding, a presidential swearing-in ceremony (Fig. 4.7), or just a gathering of a group of friends. In this story, not everyone plays an equal part. Some person(s) are the main character(s) and play a more central role.

Consider the picture in Fig. 4.8a. Here, the important characters are the couple who appear to be the British Queen and the Lord Mayor. Notice that their identities and social status

1http://cloudcv.org/objdetect


Figure 4.7: VIP: Predict the importance of individuals in group photographs (without assuming knowledge about their identities).

(a) Socially prominent people. (b) Non-celebrities. (c) Equally important people.

Figure 4.8: Who are the most important individuals in these pictures? (a) the couple (the British Queen and the Lord Mayor); (b) the person giving the award and the person receiving it play the main role; (c) everyone seems to be nearly equally important. Humans have a remarkable ability to understand social roles and identify important players, even without knowing the identities of the people in the images.

play a role in establishing their positions as the key characters in that image. However, it is clear that even someone unfamiliar with the oddities and eccentricities of the British Monarchy, who simply views this as a picture of an elderly woman and a gentleman in costume receiving attention from a crowd, would consider those two to be central characters in that scene.

Fig. 4.8b shows an example with people who do not appear to be celebrities. We can see that the two people in the foreground are clearly the focus of attention, while the two others in the background are not. Fig. 4.8c shows a common group photograph, where everyone is nearly equally important. It is clear that even without recognizing the identities of people, we as humans have a remarkable ability to understand social roles and identify important players.

Goal. The goal of CloudCV VIP is to automatically predict the importance of individuals in group photographs. In order to keep our approach general and applicable to any input image,


we focus purely on visual cues available in the image, and do not assume identification of the individuals. Thus, we do not use social prominence cues. For example, in Fig. 4.8a, we want an algorithm that identifies the elderly woman and the gentleman as the two most important people in that image without utilizing the knowledge that the elderly woman is the British Queen.

A number of applications can benefit from knowing the importance of people. Algorithms for im2text (generating sentences that describe an image) can be made more human-like if they describe only the important people in the image and ignore unimportant ones. Photo cropping algorithms can do “smart-cropping” of images of people by keeping only the important people. Social networking sites and image search applications can benefit from improving the ranking of photos where the queried person is important. Intelligent social robots can benefit from identifying important people in any scenario.

Who is Important? In defining importance, we can consider the perspective of three parties (which may disagree):

• the photographer, who presumably intended to capture some subset of people, and perhaps had no choice but to capture others;

• the subjects, who presumably arranged themselves following social inter-personal rules; and

• neutral third-party human observers, who may be unfamiliar with the subjects of the photo and the photographer’s intent, but may still agree on the (relative) importance of people.

Navigating this landscape of perspectives involves many complex social relationships: the social status of each person in the image (an award winner, a speaker, the President), and the social biases of the photographer and the viewer (e.g., gender or racial biases); many of these cannot be easily mined from the photo itself. At its core, the question is subjective: if the British Queen “photo-bombs” while you are taking a picture of your friend, is she still the most important person in that photo?

In CloudCV VIP, to establish a quantitative protocol, we rely on the wisdom of the crowd to estimate the “ground-truth” importance of a person in an image. Our relative importance models are trained using real-valued importance scores obtained using Amazon Mechanical Turk.

Pipeline. The basic flow of CloudCV VIP is shown in Fig. 4.9. First, face detection is performed using third-party face detectors. As described in the paper [102], we used SkyBiometry’s API [122] for face detection; CloudCV VIP instead uses OpenCV [42] to avoid network latency. For every pair of detected faces, features are extracted that describe the relative configuration of these faces. These features are fed to our pre-trained regressors to derive a


Figure 4.9: VIP Pipeline.

relative importance score for this pair. Finally, the faces are sorted in descending order of importance. The models and features are described in detail in Mathialagan et al. [102]. In order to be fast at test time, CloudCV VIP does not use DPM-based pose features.
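The following is a simplified Python sketch of this flow using OpenCV's built-in face detector. The pairwise feature vector and the regressor interface are illustrative stand-ins for the actual models of [102], not the CloudCV implementation.

```python
import cv2
import numpy as np

def rank_faces_by_importance(image_bgr, pairwise_regressor):
    """Sketch of the VIP flow: detect faces, score face pairs, sort by importance."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    h, w = gray.shape
    scores = np.zeros(len(faces))
    for i, (xi, yi, wi, hi) in enumerate(faces):
        for j, (xj, yj, wj, hj) in enumerate(faces):
            if i == j:
                continue
            # Simplified relative-configuration features for the pair
            # (face sizes, offsets, scale ratios).
            feat = np.array([wi * hi / float(w * h), wj * hj / float(w * h),
                             (xi - xj) / float(w), (yi - yj) / float(h),
                             wi / float(wj), hi / float(hj)])
            # Pre-trained regressor predicts relative importance of face i vs. face j.
            scores[i] += pairwise_regressor.predict(feat.reshape(1, -1))[0]

    order = np.argsort(-scores)               # descending importance
    return [tuple(faces[k]) for k in order], scores[order]
```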

4.7 Gigapixel Image Stitching

The goal of image stitching is to create a composite panorama from a collection of images. The standard pipeline [43] for image stitching consists of four main steps:

1. Feature Extraction: distinctive points (or key-points) are identified in each image and a feature descriptor [37, 98] is computed for each key-point.

2. Image/Feature Matching: features are matched between pairs of images to estimate relative camera transformations.

3. Global Refinement: camera transformation parameters are refined jointly across all images.

4. Seam Blending: seams are estimated between pairs of images and blending is performed.

Consider a data graph G = (V, E) where each vertex corresponds to an image and two vertices are connected by an edge if the two images overlap in content, i.e., capture a part of the scene from two viewpoints. In the context of this graph, different steps of the stitching pipeline have vastly different levels of parallelism. Step 1 (feature extraction) is vertex-parallel since


Figure 4.10: Gigapixel Image Stitching. (a) Data-graph and panorama. (b) Stitching pipeline: feature extraction (vertex-parallel), image/feature matching (edge-parallel), global camera refinement, and seam blending (edge-parallel).

feature extraction at each vertex/image may be run completely independently. Step 2 (image/feature matching) and step 4 (seam blending) are edge-parallel since these computations may be performed completely independently at each edge in this graph. Together, these steps are data-parallel, where parallelism is achieved simply by splitting the data onto different machines with no need for coordination.

Step 3 (global refinement) is the most sophisticated step since it is not embarrassingly parallel. This global refinement of camera parameters involves minimizing a nonlinear error function (called re-projection error) that necessarily depends on all images [128], and ultimately slows the entire pipeline down.

We formulate this optimization as a “message-passing” operation on the data graph: each vertex simply gathers some quantities from its neighbors and makes local updates to its camera parameters. Thus, different images can be processed on different machines as long as they communicate their camera parameters to their neighbors.

Thus, while this pipeline may not be data-parallel, it is graph-parallel, as shown in Fig. 4.10b, meaning that data and computational dependencies are captured by a sparse undirected graph and all computation can be written as vertex-programs. Thinking about visual sensing algorithms as vertex programs is a powerful abstraction.
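The sketch below shows the vertex-program structure of this refinement step in Python. The graph representation and the local_update routine are placeholders; an actual implementation would plug in the non-linear re-projection-error solver at that point.

```python
def refine_cameras(graph, cameras, local_update, num_iters=10):
    """Minimal sketch of the graph-parallel (vertex-program) view of global refinement.

    graph:        dict mapping each image id to the ids of overlapping images
    cameras:      dict mapping each image id to its camera-parameter vector
    local_update: placeholder for the per-vertex solver that refines one camera
                  given its neighbors' current parameters
    """
    for _ in range(num_iters):
        new_cameras = {}
        for v, params in cameras.items():
            # Gather: collect camera parameters from neighboring (overlapping) images.
            neighbor_params = [cameras[u] for u in graph[v]]
            # Apply: update this vertex's parameters using only local information.
            new_cameras[v] = local_update(params, neighbor_params)
        # Synchronous update; vertices could live on different machines.
        cameras = new_cameras
    return cameras
```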

The CloudCV image stitching functionality can be accessed through the web interface, a screenshot of which is shown in Fig. 4.11.


(a) Sample images and upload interface. (b) Result for the sample images.

Figure 4.11: Image stitching web interface.


Chapter 5

Impact On Community

From the very start of the project, CloudCV was built with the aim of democratizing computer vision and reducing the barrier of entry into the world of deep learning and computer vision for students, young researchers and interested software developers. In this regard, we open sourced the entire architecture of CloudCV on GitHub1. This allowed us to build a symbiotic relationship with the open source community. We now have more than 500 commits by 6 contributors on our project, and it has a total of more than 100 forks (a fork is a user's copy of the repository to which they can add their own changes).

In our effort to spread the knowledge of computer vision and deep learning and make students aware of the possibilities, we also participated in one of the most coveted and highly selective open source programs, the ‘Google Summer of Code’ (GSOC), and were selected as one of the 180 organizations for two consecutive years (2015 and 2016). GSOC is an annual international program sponsored by Google focused on connecting post-secondary students with an open source mentoring organization to work on a project for the summer. The program attracts thousands of smart and enthusiastic students from around the world who spend their summer vacations working on an open source project under the guidance of mentors from participating organizations.

5.1 Google Summer of Code (GSOC), 2015

In 2015, we selected three students who helped us improve existing functionalities and add some core features to the CloudCV project. Next, we briefly describe some of those contributions:

1https://github.com/batra-mlp-lab/CloudCV


5.1.1 Open Source Contributions to DIGITS

Recently, NVIDIA released DIGITS [11], an interactive Deep Learning GPU Training System, which provides access to the rich set of functionalities provided by Caffe [79] through an intuitive browser-based graphical user interface. DIGITS complements Caffe by introducing an extensive set of visualizations commonly used by data scientists and researchers. Moreover, since DIGITS runs as a web service, it facilitates seamless collaboration between large teams by enabling sharing of data, trained CNN models and results.

During the GSOC 2015 program, we worked on integrating DIGITS with the functionalities provided by CloudCV. CloudCV and DIGITS provide complementary functionalities: while CloudCV provides the ability to expose any deep learning algorithm as a service, DIGITS acts as a front-end for training new models. Therefore, integrating the two frameworks allows users to train a model and make it available as a cloud service. Specifically:

• We built workspaces into DIGITS, borrowing functionality from the user authentication system in CloudCV. All jobs can now be created inside a specific workspace, limiting the scope of the project to that particular workspace. This allows different teams to share the computing infrastructure without overriding each other’s work.

• We also added feature extraction functionality to DIGITS. Given a trained model, one can upload one or more images to perform prediction and export layer-wise filter weights and activation values. Users can specify ‘all’ to export/visualize information for all layers, or specify a single layer name or a subset of layers as comma-separated entries (e.g., conv1,conv2,fc7). Data can then be exported as a .npy file (Python NumPy array) or a .mat file (MATLAB cell array).

• We also extended the data hosting functionality to include cloud storage such as Dropbox and Amazon S3.

5.1.2 CloudCV Containers for Easy Distribution and Installation

CloudCV consists of multiple components and depends on many open source tools. We described each of these components in Chapter 3. To enable other developers to implement new features and build their own custom CloudCV systems for their research, it was important to build a highly portable CloudCV system that can be deployed in minutes without any hassle. With the help of contributions from GSOC participants, we released Docker [7] containers for CloudCV. Docker is an open source project that allows any application to be packaged so that it can be run on any infrastructure. Docker provides us many advantages, including: (1) lightweight resource utilization: instead of virtualizing a whole operating system, containers isolate at the process level and use the host’s kernel; (2) portability: a container can run on any operating system since all of the dependencies for a containerized application are bundled inside the container; (3) reliability: the interface between the host machine and the container is standardized and the interactions are predictable.


Containerizing the various modules of the application resulted in a lot of refactoring of the code. Consequently, deploying CloudCV now takes just a few minutes and we can easily run CloudCV on different machines.

5.1.3 Improvements in Python API

As discussed in Section 3.2.2, we maintain Python APIs for CloudCV to allow developers and researchers to plug CloudCV services into their own pipelines. We also focused some of our efforts on improving the Python API, adding two core features: (1) Resumability: for large jobs (for example, classifying thousands of images), if the job gets suspended due to external factors like loss of internet connection, it can be resumed from where it was suspended. (2) Multi-threaded architecture: some of the tasks that the API needs to perform, such as uploading images to the CloudCV server, reading images from disk, downloading results and pre-processing images, can happen in parallel. We implemented a multi-threaded architecture that performs all these operations asynchronously. Although it is a work in progress, the improvements in speed were very promising.
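The snippet below sketches this multi-threaded client behaviour using Python's concurrent.futures. The upload and download_result callables are hypothetical stand-ins for the API's network calls, not the actual CloudCV client functions.

```python
import concurrent.futures

def process_images_async(image_paths, upload, download_result, max_workers=8):
    """Illustrative sketch: overlap uploads across threads and fetch each result
    as soon as its upload completes."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Uploads run in parallel across worker threads.
        futures = {pool.submit(upload, path): path for path in image_paths}
        for fut in concurrent.futures.as_completed(futures):
            path = futures[fut]
            job_id = fut.result()
            # Results are fetched as each upload/job completes.
            results[path] = download_result(job_id)
    return results
```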

5.2 Google Summer of Code, 2016

To continue our efforts to build an extensive open source platform for computer vision algorithms and share it with the rest of the community, we will be participating in GSOC 2016 to collaborate with students interested in learning more about computer vision. Like last year, we will be mentoring three students, and one of the participants from last year will serve as a mentor for the new participants. In the next chapter, we describe some of the future directions we are interested in exploring.


Chapter 6

Future Work

In this chapter, we discuss some of the future features we will be working on:

6.1 CloudCV-fy your Vision/Deep-Learning Code

Deep learning and its application in AI sub-fields (computer vision, natural language processing) has seen tremendous growth in recent years. Driven in part by code, data, and manuscript sharing on GitHub and arXiv, we are witnessing increasing public access to state-of-the-art deep learning models for object detection, classification, image captioning, and visual question answering.

However, running someone’s released implementation and making sense of the results often involves painstaking preparation and execution: setting up the environment and dependencies (installing Torch, Caffe, TensorFlow, Keras or Theano), setting up the I/O pipeline, keeping track of inter-package consistencies, etc.

We want to build a system that can automatically CloudCV-fy the code and create an online demo and a corresponding API that other researchers and developers can use without understanding fine-grained details about how the algorithm works. Testing or experimenting with the model should be as simple as going to a web page and uploading images to look at the results.

6.2 Build Deep Learning Models Online

One of the other problems facing young researchers who want to learn more about deep learning is the amount of effort it takes to learn these new frameworks. There are more


than 70 deep learning frameworks available1, each suited to different purposes. The goal of this project is to provide an online platform for trying deep learning algorithms and models that will reduce the barrier of entry to the world of deep learning and its applications in computer vision. We want to build a drag-and-drop interface consisting of various modules such as Convolution, Max-Pool, ReLU, LSTM and Softmax units that will allow users to plug together a system for training/testing their deep learning model. This allows for a rapid, interactive experience without having to worry about setting up infrastructure.

6.3 Tutorials and Courses on CloudCV

CloudCV can play a role in demonstrating the role and promise of computer vision to the broader public via educational outreach. We believe this is a timely effort and an immense opportunity is at hand. There is an overwhelming demand for learning about computer vision. Massive Open Online Courses (MOOCs) like Udacity and Coursera have demonstrated success and offer many machine learning, computer vision and deep learning courses. However, computer vision is by its nature an interactive and experimental field of study, and cannot be learnt through passive lectures or multiple choice quizzes alone. CloudCV can build on the success of MOOCs by serving as a teaching tool for learning computer vision by building simple apps on the cloud. Imagine a student writing 4 lines of code in the CloudCV development environment to run a face detector on a stream of images captured from their laptop webcam. We plan to offer interactive tutorials and implementations of popular computer vision algorithms and architectures so that students can learn and play with these algorithms from the convenience of their browsers.
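As a flavor of what such an exercise might look like, here is a short, self-contained script for the webcam face-detection example. It uses plain OpenCV rather than the envisioned CloudCV teaching API (which is hypothetical at this stage), which is why it is longer than the four-line version imagined above.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
cap = cv2.VideoCapture(0)                     # laptop webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    faces = cascade.detectMultiScale(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1.1, 5)
    for (x, y, w, h) in faces:                # draw a box around each detected face
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow('faces', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):     # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```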

1http://bit.ly/1Nsv3QJ


Chapter 7

Conclusion

In this thesis, we presented CloudCV, a comprehensive system that provides access to state-of-the-art distributed computer vision algorithms on the cloud. CloudCV consists of a group of virtual machines running on Amazon Web Services capable of running state-of-the-art computer vision and deep learning algorithms in a distributed and parallel setting. These algorithms are accessible to the user through multiple front-end platforms. The web interface allows users to simply play with the algorithms from the convenience of a web browser by uploading images through Dropbox or the local file system. For larger datasets, the system enables users to embed CloudCV services into a bigger end-to-end system by utilizing the Python and MATLAB APIs. CloudCV acts as a unified data and code repository, providing easy comparisons to baselines and the necessary infrastructure to scale algorithms to bigger datasets. CloudCV also acts as a teaching tool and a playground for vision algorithms, helping young researchers and students learn and understand computer vision through interactive demos. To further extend the reach of CloudCV, we open sourced its entire architecture, which allowed us to build a symbiotic relationship with the open source community. In our effort to spread the knowledge of computer vision and deep learning and make students aware of the possibilities, we also participated in the Google Summer of Code (GSOC) in 2015 and 2016. Through these programs, we are closely working with students to further develop the functionalities of CloudCV in order to make the field of computer vision more accessible to young researchers and help scale computer vision research to bigger and more massive datasets.


Appendices


Appendix A

Object-Proposals Evaluation Protocol is Gameable

Object proposals have quickly become the de-facto pre-processing step in a number of vision pipelines (for object detection, object discovery, and other tasks). Their performance is usually evaluated on partially annotated datasets. In this section, we argue that the choice of using a partially annotated dataset for evaluation of object proposals is problematic – as we demonstrate via a thought experiment, the evaluation protocol is ‘gameable’, in the sense that progress under this protocol does not necessarily correspond to a “better” category independent object proposal algorithm.

To alleviate this problem, we: (1) introduce a nearly-fully annotated version of the PASCAL VOC dataset, which serves as a test-bed to check whether object proposal techniques are over-fitting to a particular list of categories; (2) perform an exhaustive evaluation of object proposal methods on this nearly-fully annotated PASCAL dataset and perform cross-dataset generalization experiments; and (3) introduce a diagnostic experiment to detect the bias capacity of an object proposal algorithm. This tool circumvents the need for a densely annotated dataset, which can be expensive and cumbersome to collect. Finally, we have released an easy-to-use toolbox that combines various publicly available implementations of object proposal algorithms and standardizes proposal generation and evaluation so that new methods can be added and evaluated on different datasets. We hope that the results presented here will motivate the community to test the category independence of various object proposal methods by carefully choosing the evaluation protocol.


(a) (Green) Annotated, (Red) Unannotated

(b) Method 1 with recall 0.6 (c) Method 2 with recall 1

Figure A.1: (a) shows PASCAL annotations natively present in the dataset in green. Other objects that are not annotated but present in the image are shown in red; (b) shows Method 1 and (c) shows Method 2. Method 1 visually seems to recall more categories, such as plates, glasses, etc., that Method 2 missed. Despite that, the computed recall for Method 2 is higher because it recalled all instances of PASCAL categories that were present in the ground truth. Note that the number of proposals generated by both methods is equal in this figure.

A.1 Introduction

In the last few years, the computer vision community has witnessed the emergence of a new class of techniques called object proposal algorithms [28, 31, 48, 59, 78, 85, 100, 112, 113, 130, 144].

Object proposals are a set of candidate regions or bounding boxes in an image that may potentially contain an object.

Object proposal algorithms have quickly become the de-facto pre-processing step in a number of vision pipelines – object detection [51, 60, 71, 73, 74, 89, 127, 129, 135, 143], segmentation [30, 44, 45, 52, 142], object discovery [49, 54, 80, 116], weakly supervised learning of object-object interactions [50, 111], content-aware media re-targeting [125], action recognition in still images [120] and visual tracking [91, 134]. Of all these tasks, object proposals have been particularly successful in object detection systems. For example, nearly all top-performing entries [73, 94, 108, 126] in the ImageNet Detection Challenge 2014 [118] used object proposals. They are preferred over the formerly used sliding window paradigm due to their computational efficiency. Objects present in an image may vary in location, size, and aspect ratio, and performing an exhaustive search over such a high dimensional space is difficult. By using object proposals, computational effort can be focused on a small number of candidate windows.

The focus of this paper is the protocol used for evaluating object proposals. Let us begin by asking – what is the purpose of an object proposal algorithm?

In early works [28,59,100], the emphasis was on category independent object proposals, where


(a) (Green) Annotated, (Red) Unannotated

(b) Method 1 with recall 0.5 (c) Method 2 with recall 0.83

Figure A.2: (a) shows PASCAL annotations natively present in the dataset in green. Other objects that are not annotated but present in the image are shown in red; (b) shows Method 1 and (c) shows Method 2. Method 1 visually seems to recall more categories, such as lamps, pictures, etc., that Method 2 missed. Clearly the recall for Method 1 should be higher. However, the calculated recall for Method 2 is significantly higher, which is counter-intuitive. This is because Method 2 recalls more PASCAL category objects.

the goal is to identify instances of all objects in the image irrespective of their category. While it can be tricky to precisely define what an “object” is1, these early works presented cross-category evaluations to establish and measure category independence.

More recently, object proposals are increasingly viewed as detection proposals [85, 86, 130, 144] where the goal is to improve the object detection pipeline, focusing on a chosen set of object classes (e.g., the ~20 PASCAL categories). In fact, many modern proposal methods are learning-based [48, 70, 78, 81, 85, 86, 88, 109] where the definition of an “object” is the set of annotated classes in the dataset. This increasingly blurs the boundary between a proposal algorithm and a detector.

Notice that the former definition has an emphasis on object discovery [49, 54, 116], while the latter emphasizes the ultimate performance of a detection pipeline. Surprisingly, despite the two different goals of ‘object proposal,’ there exists only a single evaluation protocol, which is the following:

1. Generate proposals on a dataset: The most commonly used dataset for evaluation today is the PASCAL VOC [61] detection set. Note that this is a partially annotated dataset where only the 20 PASCAL category instances are annotated.

2. Measure the performance of the generated proposals: typically in terms of ‘recall’ of the annotated instances. Commonly used metrics are described in Section A.4.

The central thesis of this paper is that the current evaluation protocol for object proposal methods is suitable for the object detection pipeline but is a ‘gameable’ and misleading protocol for category independent tasks. By evaluating only on a specific set of object categories,

1Most category independent object proposal methods define an object as a “stand-alone thing with a well-defined closed boundary”. For a “thing” vs. “stuff” discussion, see [75].


we fail to capture the performance of the proposal algorithms on all the remaining object categories that are present in the test set but not annotated in the ground truth.

Figs. A.1 and A.2 illustrate this idea on images from PASCAL VOC 2010. Column (a) shows the ground-truth object annotations (in green, the annotations natively present in the dataset for the 20 PASCAL categories – ‘chairs’, ‘tables’, ‘bottles’, etc.; in red, the annotations that we added to the dataset by marking objects such as ‘ceiling fan’, ‘table lamp’, ‘window’, etc. originally annotated as ‘background’ in the dataset). Columns (b) and (c) show the outputs of two object proposal methods. The top row shows the case when both methods produce the same number of proposals; the bottom row shows an unequal number of proposals. We can see that the proposal method in Column (b) seems to be more “complete”, in the sense that it recalls or discovers a larger number of instances. For instance, in the top row it detects a number of non-PASCAL categories (‘plate’, ‘bowl’, ‘picture frame’, etc.) but misses out on finding the PASCAL category ‘table’. In both rows, the method in Column (c) is reported as achieving a higher recall, even in the bottom row, when it recalls strictly fewer objects, not just different ones. The reason is that Column (c) recalls/discovers instances of the 20 PASCAL categories, which are the only ones annotated in the dataset. Thus, Method 2 appears to be a better object proposal generator simply because it focuses on the annotated categories in the dataset.

While intuitive (and somewhat obvious) in hindsight, we believe this is a crucial finding because it makes the current protocol ‘gameable’ or susceptible to manipulation (both intentional and unintentional) and misleading for measuring improvement in category independent object proposals.

Some might argue that if the end task is to detect a certain set of categories (20 PASCAL or 80 COCO categories), then it is enough to evaluate on them and there is no need to care about other categories that are not annotated in the dataset. We agree, but it is important to keep in mind that object detection is not the only application of object proposals. There are other tasks for which it is important for proposal methods to generate category independent proposals. For example, in semi/unsupervised object localization [49, 54, 80, 116] the goal is to identify all the objects in a given image that contains many object classes, without any specific target classes. In this problem, there are no image-level annotations, no assumption of a single dominant class, or even a known number of object classes [49]. Thus, in such a setting, using a proposal method that has tuned itself to 20 PASCAL objects would not be ideal – in the worst case, we may not discover any new objects. As mentioned earlier, there are many such scenarios including learning object-object interactions [50, 111], content-aware media re-targeting [125], visual tracking [91], etc.

To summarize, the contributions of this paper are:

• We report the ‘gameability’ of the current object proposal evaluation protocol.

• We demonstrate this ‘gameability’ via a simple thought experiment where we propose a ‘fraudulent’ object proposal method that significantly outperforms all existing object proposal techniques on current metrics, but would under no circumstances be


considered a category independent proposal technique. As a side contribution of our work, we present a simple technique for producing state-of-the-art object proposals.

• After establishing the problem, we propose three ways of improving the current evaluation protocol to measure the category independence of object proposals:

1. evaluation on fully annotated datasets,
2. cross-dataset evaluation on densely annotated datasets, and
3. a new evaluation metric that quantifies the bias capacity of proposal generators.

For the first test, we introduce a nearly-fully annotated PASCAL VOC 2010 where we annotated all instances of all object categories occurring in the images.

• We thoroughly evaluate existing proposal methods on this nearly-fully annotated dataset and two densely annotated datasets.

• We have released all code and data for our experiments2, as well as an object proposals library that allows for comparison of popular object proposal techniques.

A.2 Related Work

Types of Object Proposals: Object proposal algorithms can be broadly divided into two categories:

• Window scoring: In these methods, the space of all possible windows in an image is sampled to get a subset of the windows (e.g., via sliding window). These windows are then scored for the presence of an object based on the image features from the windows. The algorithms that fall under this category are [28, 47, 48, 70, 112, 144].

• Segment based: These algorithms involve over-segmenting an image and merging the segments using some strategy. These methods include [31, 59, 78, 85, 88, 100, 109, 113, 130, 133]. The generated region proposals can be converted to bounding boxes if needed.

Beyond RGB proposals: Beyond the ones listed above, a wide variety of algorithms fall under the umbrella of ‘object proposals’. For instance, [68, 103, 107, 137, 141] used spatio-temporal object proposals for action recognition, segmentation and tracking in videos. Another direction of work [34, 35, 72] explores the use of RGB-D cuboid proposals for object detection and semantic segmentation in RGB-D images. While the scope of this paper is limited to proposals in RGB images, the central thesis of the paper (i.e., the gameability of the evaluation protocol) is broadly applicable to other settings.

Evaluating Proposals: While a variety of approaches have been proposed for generating object proposals, there has been relatively limited analysis of proposal methods or of the proposal evaluation protocol. Hosang et al. [77] focus on the evaluation of object proposal algorithms, in particular the stability of such algorithms under parameter changes

2Data and code can be accessed at: https://filebox.ece.vt.edu/~aroma/web/object-proposals.html


and image perturbations. Their work shows that a large number of category independent proposal algorithms indeed generalize well to non-PASCAL categories, for instance on the ImageNet 200-category detection dataset [118]. Although these findings are important (and consistent with our experiments), they are unrelated to the ‘gameability’ of the evaluation protocol. In [76], the authors present an analysis of various proposal methods regarding proposal repeatability, ground truth annotation recall, and their impact on detection performance. They also introduce a new evaluation metric (Average Recall). Their argument for a new metric is the need for better localization between generated proposals and ground truth. While this is a valid and significant concern, it is orthogonal to the ‘gameability’ of the evaluation protocol, which to the best of our knowledge has not been previously addressed. Another recent related work is [110], which analyzes various methods for segment-based object proposals, focusing on the challenges faced when going from PASCAL VOC to MS COCO. They also analyze how aligned the proposal methods are with the bias observed in MS COCO towards small objects and the center of the image, and propose a method to boost their performance. Although they discuss biases in datasets, their theme is unlike ours, which is ‘gameability’ due to these biases. As stated earlier, while early papers [28, 59, 100] reported cross-dataset or cross-category generalization experiments similar to the ones reported in this paper, with the trend of learning-based proposal methods these experiments and concerns seem to have fallen out of standard practice, which we show is problematic.

A.3 Object Proposals Library

In this section, we provide an overview of the Object Proposals Library – a GitHub repository for object proposal algorithms.

It is common practice in the computer vision research community for authors to make their algorithms open source. While this is a great service to the community, one downside is that we end up with disparate implementations of these algorithms. For example, about 10 different algorithms have been proposed in the literature for generating object proposals, but these algorithms do not generate proposals in a common format. Example formats are:

• image segment regions

• bounding boxes. Even within bounding boxes, the proposals may specify

– top-left (x,y) corner pixel, bottom-right (x,y) corner pixel of the sub-window

– top-left (y,x) corner pixel, bottom-right (y,x) corner pixel

– top-left (y,x) corner pixel, height and width, etc.


Thus, there is a lack of consistency in the output of these algorithms, even though they are all solving the same problem. In order to standardize the output of all the object proposal algorithms and have a central repository for all the existing algorithms, we built an easy-to-use Object Proposals Library. Its features are:

• Generate proposals using all existing object proposal algorithms

• Outputs of all the algorithms are standardized into a single format

• Evaluate the object proposals using the following metrics:

1. Recall at specific threshold

2. Recall at specific number of proposals

3. Area under recall curves

4. Average Best Overlap

Fig. A.3, Fig. A.4 and Fig. A.5 depict the library and its usage. The library can be accessed from our GitHub repository3.

A.4 Evaluating Object Proposals

Before we describe our evaluation and analysis, let us first look at the object proposal evaluation protocol that is widely used today. The following two factors are involved:

1. Evaluation Metric: The metrics used for evaluating object proposals are all typically functions of the intersection over union (IOU, or Jaccard Index) between generated proposals and ground-truth annotations. For two boxes/regions $b_i$ and $b_j$, IOU is defined as:

$$\text{IOU}(b_i, b_j) = \frac{\text{area}(b_i \cap b_j)}{\text{area}(b_i \cup b_j)} \tag{A.1}$$

The following metrics are commonly used:

• Recall @ IOU Threshold t: For each ground-truth instance, this metric checks whether the ‘best’ proposal from list $L$ has IOU greater than a threshold $t$. If so, this ground-truth instance is considered ‘detected’ or ‘recalled’. Then average recall is measured over all the ground-truth instances:

$$\text{Recall @ } t = \frac{1}{|G|} \sum_{g_i \in G} \mathbb{I}\left[\max_{l_j \in L} \text{IOU}(g_i, l_j) > t\right], \tag{A.2}$$

where $\mathbb{I}[\cdot]$ is an indicator function for the logical proposition in the argument. Object proposals are evaluated using this metric in two ways:

3https://github.com/batra-mlp-lab/object-proposals


Figure A.3: GitHub page of the Object Proposals Library


Figure A.4: Steps for generating proposals


Figure A.5: Steps for evaluating proposals


– plotting Recall-vs.-#proposals by fixing t
– plotting Recall-vs.-t by fixing the #proposals in L.

• Area Under the recall Curve (AUC): AUC summarizes the area under the Recall-vs.-#proposals plot for different values of t in a single plot. This metric measures AUC-vs.-#proposals. It can also be plotted by varying the #proposals in L and plotting AUC-vs.-t.

• Volume Under Surface (VUS): This measures the average recall by linearly varying t and varying the #proposals in L on either a linear or log scale. Thus it merges both kinds of AUC plots into one.

• Average Best Overlap (ABO): This metric eliminates the need for a threshold. We first calculate the overlap between each ground truth annotation $g_i \in G$ and the ‘best’ object hypothesis in $L$. ABO is calculated as the average:

$$\text{ABO} = \frac{1}{|G|} \sum_{g_i \in G} \max_{l_j \in L} \text{IOU}(g_i, l_j) \tag{A.3}$$

ABO is typically calculated on a per-class basis. Mean Average Best Overlap (MABO) is defined as the mean ABO over all classes.

• Average Recall (AR): This metric was recently introduced in [76]. Here, average recall (for IOU between 0.5 and 1)-vs.-#proposals in L is plotted. AR also summarizes proposal performance across different values of t. AR was shown to correlate with ultimate detection performance better than other metrics. (A short code sketch of IOU, Recall @ t and ABO follows this list.)

2. Dataset: The most commonly used datasets are the PASCAL VOC [61] detection datasets. Note that these are partially annotated datasets where only the 20 PASCAL category instances are annotated. Recently, analyses have also been performed on ImageNet [77], which has more categories annotated than PASCAL but is still a partially annotated dataset.
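As referenced above, here is a minimal Python sketch of the IOU, Recall @ t and ABO computations (Eqs. A.1–A.3), assuming boxes are given as (x1, y1, x2, y2) tuples. It is illustrative code, not the toolbox implementation.

```python
def iou(box_a, box_b):
    """Intersection over union (Eq. A.1) for boxes given as (x1, y1, x2, y2)."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    return inter / union if union > 0 else 0.0

def recall_at_t(gt_boxes, proposals, t):
    """Recall @ IOU threshold t (Eq. A.2): fraction of ground-truth boxes whose
    best proposal has IOU greater than t."""
    hits = [max(iou(g, l) for l in proposals) > t for g in gt_boxes]
    return sum(hits) / float(len(gt_boxes))

def abo(gt_boxes, proposals):
    """Average Best Overlap (Eq. A.3): mean over ground truth of the best IOU."""
    return sum(max(iou(g, l) for l in proposals) for g in gt_boxes) / float(len(gt_boxes))
```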

A.5 A Thought Experiment: How to Game the Evaluation Protocol

Let us conduct a thought experiment to demonstrate that the object proposal evaluation protocol can be ‘gamed’.

Imagine yourself reviewing a paper claiming to introduce a new object proposal method called DMP.

Before we divulge the details of DMP, consider the performance of DMP shown in Fig. A.6 on the PASCAL VOC 2010 dataset, under the AUC-vs.-#proposals metric.

As we can clearly see, the proposed method DMP significantly exceeds all existing proposal methods [28, 31, 48, 59, 85, 100, 112, 130, 144] (which seem to have little variation over one


another). The improvement in AUC at some points on the curve (e.g., at M=10) seems to be an order of magnitude larger than all previous incremental improvements reported in the literature! In addition to the gain in AUC at a fixed M, DMP also achieves the same AUC (0.55) with an order of magnitude fewer proposals (M=10 vs. M=50 for edgeBoxes [144]). Thus, fewer proposals need to be processed by the ensuing detection system, resulting in an equivalent run-time speedup. This seems to indicate that significant progress has been made in the field of generating object proposals.

So what is our proposed state-of-the-art technique DMP? It is a mixture-of-experts model, consisting of 20 experts, where each expert is a deep feature (fc7)-based [57] objectness detector. At this point, you, the savvy reader, are probably already beginning to guess what we did.

DMP stands for ‘Detector Masquerading as Proposal generator’. We trained object detectors for the 20 PASCAL categories (in this case with RCNN [71]), and then used these 20 detectors to produce the top-M most confident detections (after NMS), and declared them to be ‘object proposals’.

The point of this experiment is to demonstrate the following fact – clearly, no one would consider a collection of 20 object detectors to be a category independent object proposal method. However, our existing evaluation protocol declared the union of these top-M detections to be state-of-the-art.

Why did this happen? Because the protocol today involves evaluating a proposal generator on a partially annotated dataset such as PASCAL. The protocol does not reward recall of non-PASCAL categories; in fact, early recall (near the top of the list of candidates) of non-PASCAL objects results in a penalty for the proposal generator! As a result, a proposal generator that tunes itself to these 20 PASCAL categories (either explicitly via training or implicitly via design choices or hyper-parameters) will be declared a better proposal generator when it may not be (as illustrated by DMP). Notice that as learning-based object proposal methods improve on this metric, “in the limit” the best object proposal technique is a detector for the annotated categories, similar to our DMP. Thus, we should be cautious of methods proposing incremental improvements on this protocol – improvements on this protocol do not necessarily lead to a better category independent object proposal method.

This thought experiment exposes the inability of the existing protocol to evaluate category independence.

A.6 Evaluation on Fully and Densely Annotated Datasets

As described in the previous section, the problem of ‘gameability’ arises from evaluating proposal methods on partially annotated datasets. An intuitive solution is to evaluate on a fully annotated dataset.


In the next two subsections, we evaluate the performance of 7 popular object proposal methods [28, 31, 48, 100, 112, 130, 144] and two DMPs (RCNN [71] and DPM [67]) on one nearly-fully and two densely annotated datasets containing many more object categories. This is to quantify how much the performance of our ‘fraudulent’ proposal generators (DMPs) drops once the bias towards the 20 PASCAL categories is diminished (or completely removed). We begin by creating a nearly-fully annotated dataset by building on the effort of PASCAL Context [105] and evaluate on this nearly-fully annotated, modified instance-level PASCAL Context; we then perform cross-dataset evaluation on other partially, but densely, annotated datasets, MS COCO [95] and NYU-Depth V2 [106].

Experimental Setup: On the MS COCO and PASCAL Context datasets we conducted experiments as follows:

• Use the existing evaluation protocol, i.e., evaluate only on the 20 PASCAL categories.

• Evaluate on all the annotated classes.

• For the sake of completeness, we also report results on all the classes except the 20 PASCAL classes.4

Training of DMPs: The two DMPs we use are based on two popular object detectors: DPM [67] and RCNN [71]. We train DPM on the 20 PASCAL categories and use it as an object proposal method. To generate a large number of proposals, we chose a low threshold for Non-Maximum Suppression (NMS). Proposals are generated for each category and assigned a score by the corresponding per-category DPM. These proposals are then merge-sorted on the basis of this score, and the top M proposals are selected from the sorted list, where M is the number of proposals to be generated.

Another (stronger) DMP is RCNN, a detection pipeline that uses 20 SVMs (one per PASCAL category) trained on deep features (fc7) [57] extracted on selective search boxes. Since RCNN itself uses selective search proposals, it should be viewed as a trained re-ranker of selective search boxes; as a consequence, it ultimately equals selective search performance once the number of candidates becomes large. We used the pre-trained SVM models released with the RCNN code, which were trained on the 20 classes of the PASCAL VOC 2007 train-val set. For every test image, we generate the Selective Search proposals using the ‘FAST’ mode and calculate the 20 SVM scores for each proposal. The ‘objectness’ score of a proposal is then the maximum of the 20 SVM scores. All the proposals are then sorted by this score and the top M proposals are selected.5
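The scoring and ranking step of this RCNN-based DMP can be summarized by the short sketch below; the array names are hypothetical and the per-box SVM scores are assumed to have been computed already.

```python
import numpy as np

def dmp_proposals(selective_search_boxes, svm_scores, M):
    """Sketch of the RCNN-based DMP ranking described above.

    selective_search_boxes: (N, 4) candidate boxes from Selective Search
    svm_scores:             (N, 20) per-category SVM scores for each box
    M:                      number of 'proposals' to return
    """
    objectness = svm_scores.max(axis=1)        # max over the 20 PASCAL SVMs
    order = np.argsort(-objectness)[:M]        # sort by score, keep the top M
    return selective_search_boxes[order], objectness[order]
```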

4On NYU-Depth V2, evaluation is done on all categories. This is because only 8 PASCAL categories are present in this dataset.

5It was observed that merge-sorting calibrated/rescaled SVM scores led to inferior performance compared to merge-sorting without rescaling.


A.6.1 Fully Annotated Dataset

PASCAL Context: This dataset (introduced by Mottaghi et al. [105]) contains additional annotations for all images of the PASCAL VOC 2010 dataset [62]. The annotations are semantic segmentation maps, where every pixel previously annotated as ‘background’ in PASCAL was assigned a category label. In total, annotations have been provided for 459 categories. This includes the original 20 PASCAL categories and new classes such as keyboard, fridge, picture, cabinet, plate and clock. Unfortunately, the dataset contains only category-level semantic segmentations, not instance-level segmentations. For our task, we needed instance-level bounding box annotations, which cannot be reliably extracted from category-level segmentation masks because the masks for several instances (of, say, chairs) may be merged together into a single ‘blob’ in the category-level mask.

Creating Instance-Level Annotations for PASCAL Context: Thus, we created instance-level bounding box annotations for all images in the PASCAL Context dataset. First, out of the 459 category labels in PASCAL Context, we identified 396 categories to be ‘things’, and ignored the remaining ‘stuff’ or ‘ambiguous’ categories6 – neither of these lend themselves to bounding-box-based object detection (see7 for details). We selected the 60 most frequent non-PASCAL categories from this list of ‘things’ and manually annotated all their instances. Selecting only the top 60 categories is a reasonable choice because the average per-category frequency in the dataset for all the other categories (even after including background/ambiguous categories) was roughly one third that of the chosen 60 categories (Fig. A.7a). Moreover, the percentage of pixels in an image left unannotated (as ‘background’) drops from 58% in the original PASCAL to 50% in our nearly-fully annotated PASCAL Context. This manual annotation was performed with the aid of the semantic segmentation maps present in the PASCAL Context annotations. Example annotations are shown in Fig. A.7d. Detailed statistics are available in our arXiv paper.

Results and Observations: We now explore how changes in the dataset and annotated categories affect the results of the thought experiment from Section A.5. Figs. A.8a, A.8b, A.8c and A.8h compare the performance of DMPs with a number of existing proposal methods [28, 31, 48, 59, 85, 100, 112, 130, 144] on PASCAL Context.

We can see in Column (a) that when evaluated on only the 20 PASCAL categories, DMPs trained on these categories appear to significantly outperform all proposal generators. However, they are not category independent: they suffer a big drop in performance when evaluated on the 60 non-PASCAL categories in Column (b). Notice that on PASCAL Context, all proposal generators suffer a drop in performance between the 20 PASCAL categories and the 60 non-PASCAL categories. We hypothesize that this is due to the fact that the non-PASCAL categories tend to be generally smaller than the PASCAL categories (which

6e.g., a ‘tree’ may be a ‘thing’ or ‘stuff’ subject to camera viewpoint.
7See Appendix in: http://arxiv.org/abs/1505.05836


were the main targets of the dataset curators) and hence harder to detect. But this could also be because the authors of these methods made certain design choices that catered better to the 20 annotated categories. However, the key observation (as shown in Fig. A.8h) is that DMPs suffer the biggest drop, much greater than that of all the other approaches. It is interesting to note that due to the ratio of instances of the 20 PASCAL categories to the other 60 categories, DMPs continue to slightly outperform proposal generators when evaluated on all categories, as shown in Column (c).

A.6.2 Densely Annotated Datasets

Besides being expensive, “full” annotation of images is somewhat ill-defined due to the hierarchical nature of object semantics (e.g., are object-parts such as a bicycle wheel, windows in a building, or eyes in a face also objects?). One way to side-step this issue is to use datasets with dense annotations (albeit at the same granularity) and conduct cross-dataset evaluation.

MS COCO: The Microsoft Common Objects in Context (MS COCO) dataset [95] contains 91 common object classes (82 of them having more than 5,000 labeled instances). It not only has a significantly higher number of instances per class than PASCAL, but also considerably more object instances per image (7.7) as compared to ImageNet (3.0) and PASCAL (2.3).

NYU-Depth V2: The NYU-Depth V2 dataset [106] is comprised of video sequences from a variety of indoor scenes recorded by both RGB and depth cameras. It features 1449 densely labeled pairs of aligned RGB and depth images with instance-level annotations. We used these 1449 densely annotated RGB images for evaluating object proposal algorithms. To the best of our knowledge, this is the first paper to compare proposal methods on such a dataset.

Results and Observations: Figs. A.8d, A.8e, A.8f and A.8i show plots similar to PASCAL Context on MS COCO. Again, DMPs outperform all other methods on PASCAL categories but fail to do so for the non-PASCAL categories. Fig. A.8g shows results for NYU-Depth V2. We see that when many classes in the test dataset are not PASCAL classes, DMPs tend to perform poorly, although it is interesting that the performance is still not as poor as that of the worst proposal generators. Results on other evaluation criteria are in the supplement.

A.7 Bias Inspection

So far, we have discussed two ways of detecting ‘gameability’ – evaluation on a nearly-fully annotated dataset and cross-dataset evaluation on densely annotated datasets. Although these methods are fairly useful for bias detection, they have certain limitations. Datasets

Page 68: CloudCV: Deep Learning and Computer Vision in the Cloud

55

can be unbalanced. Some categories can be more frequent than others while others can behard to detect (due to choices made in dataset collection). These issues need to be resolvedfor perfectly unbiased evaluation. However, generating unbiased datasets is an expensiveand time-consuming process. Hence, to detect the bias without getting unbiased datasets,we need a method which can measure performance of proposal methods in a way thatcategory specific biases can be accounted for and the extent or the capacity of this bias canbe measured. We introduce such a method in this section.

A.7.1 Assessing Bias Capacity

Many proposal methods [48, 70, 78, 81, 85, 86, 88, 109] rely on explicit training to learn an "objectness" model, similar to DMPs. Depending upon which, and how many, categories they are trained on, these methods could have a biased view of "objectness". One way of measuring the bias capacity of a proposal method is to plot its performance vs. the number of 'seen' categories while evaluating on some held-out set. A method that involves little or no training will produce a flat curve on this plot, whereas biased methods such as DMPs will get better and better as more categories are seen in training. Thus, this analysis can help us find biased or 'gameability-prone' methods like DMPs that are (or can be) tuned to specific classes. To the best of our knowledge, no previous work has attempted to measure bias capacity by varying the number of 'object' categories seen at training time.

In this experiment, we compared the performance of one DMP method (RCNN), one learning-based proposal method (objectness), and two non-learning-based proposal methods (Selective Search [130], EdgeBoxes [144]) as a function of the number of 'seen' categories (the categories trained on; see footnote 8) on the MS COCO [95] dataset. Method names 'RCNNTrainN' and 'objectnessTrainN' indicate that they were trained on images that contain annotations for only N categories (50 instances per category). The total number of images for all 60 categories was ~2,400 (because some images contain > 1 object). Once trained, these methods were evaluated on a randomly-chosen set of ~500 images, which had annotations for all 60 categories.

Fig. A.9a shows the area under the recall vs. #proposals curve for learning-based methods trained on different sets of categories. Fig. A.9b and Fig. A.9c show, respectively, the variation of AUC vs. #seen categories and the improvement due to an increase in training categories (from 10 to 60) vs. #proposals, for RCNN and objectness when trained on different sets of categories. The key observation here is that, even with a modest increase in 'seen' categories and the same amount of additional training data, the performance improvement of RCNN is significantly larger than that of objectness. Selective Search [130] and EdgeBoxes [144] appear as dashed straight lines since no training is involved.

8 The seen categories are picked in the order they are listed in the MS COCO dataset (i.e., no specific criterion was used).
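To make the metric used in Figs. A.8 and A.9 concrete, the following is a minimal sketch of how the 'area under the recall vs. #proposals curve' could be computed for one image. It is illustrative only (it is not the evaluation library released with this work), and it assumes that proposals are a ranked list of [x1, y1, x2, y2] boxes and uses the standard 0.5 IoU localization threshold.

    import numpy as np

    def box_iou(a, b):
        """IoU of two boxes given as [x1, y1, x2, y2]."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-12)

    def recall_at_k(gt_boxes, proposals, k, iou_thresh=0.5):
        """Fraction of ground-truth boxes covered by at least one of the top-k proposals.
        proposals: boxes already ranked by the method's objectness score."""
        top = proposals[:k]
        hits = sum(any(box_iou(g, p) >= iou_thresh for p in top) for g in gt_boxes)
        return hits / max(len(gt_boxes), 1)

    def area_under_recall(gt_boxes, proposals, ks=(1, 10, 100, 1000, 10000)):
        """Normalized area under the recall vs. #proposals curve, integrated on a
        log10 axis to mirror the log-scaled x-axis of Figs. A.8 and A.9."""
        recalls = [recall_at_k(gt_boxes, proposals, k) for k in ks]
        logk = np.log10(ks)
        return np.trapz(recalls, logk) / (logk[-1] - logk[0])

Per-image values such as these would then be averaged over the test images at each proposal budget to produce curves like those in Fig. A.9a.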

These results clearly indicate that as RCNN sees more categories, its performance improves. One might argue that this is because the method is learning more 'objectness' as it sees more data. However, as discussed above, the increase in dataset size is marginal (~40 images per category), so it is unlikely that such a significant improvement is due to that alone. Thus, it is reasonable to conclude that the improvement comes from the method learning class-specific features.

Thus, this approach can be used to reason about 'gameability-prone' and 'gameability-immune' proposal methods without creating an expensive fully annotated dataset. We believe this simple but effective diagnostic experiment can help detect, and thus contribute to managing, the category-specific bias in all learning-based methods.

A.8 Conclusion

To conclude, the main message of this chapter is simply this: the current evaluation protocol for object proposal algorithms is not suitable if we view them as category-independent object proposal methods (meant to discover instances of all categories [49, 54]). By evaluating the 'recall' of instances on a partially annotated dataset, we fail to capture the performance of a proposal algorithm on all the remaining object categories that are present in the test set but not annotated in the ground truth.

We demonstrate this 'gameability' via a simple thought experiment in which we propose a 'fraudulent' object proposal method that outperforms all existing object proposal techniques on current metrics. We introduce a nearly-fully annotated version of PASCAL VOC 2010 in which we annotated all instances of 60 object categories beyond the 20 PASCAL categories occurring in all images. We perform an exhaustive evaluation of object proposal methods on this modified instance-level PASCAL dataset and perform cross-dataset generalization experiments on MS COCO and NYU-Depth V2. We have also released an easy-to-use library to evaluate and compare various proposal methods, which we believe will be useful to the community. Furthermore, since densely annotating a dataset is a tedious and costly task, we proposed a diagnostic experiment to detect and quantify the bias capacity of object proposal methods.

As modern proposal methods become more learning-based and trained in an end-to-end fashion, the distinction between detectors and proposal generators is becoming blurred. With that in mind, it is important to recognize and safeguard against the flaws in the protocol, lest we over-fit as a community to a specific set of object classes.


[Plot: area under recall vs. # proposals (10^0 to 10^4) for DMP, edgeBoxes, objectness, randomPrim, rahtu, mcg, selectiveSearch and bing; annotations on the plot read "-50% runtime" and "+53% accuracy".]

Figure A.6: Performance of different object proposal methods (dashed lines) and our proposed 'fraudulent' method (DMP) on the PASCAL VOC 2010 dataset. We can see that DMP significantly outperforms all other proposal generators. See text for details.


(a) Average #annotations for different categories.

(b) Fraction of image area covered by different categories.

(c) PASCAL Context annotations [105]. (d) Our augmented annotations.

Figure A.7: (a), (b) Distribution of object classes in PASCAL Context with respect to different attributes. (c), (d) Augmenting PASCAL Context with instance-level annotations. (Green = PASCAL 20 categories; Red = new objects)


[Plots: area under recall vs. # proposals (10^0 to 10^4) for DPM, RCNN, edgeBoxes, objectness, randomPrim, rahtu, mcg, selectiveSearch and bing.
(a) Performance on PASCAL Context, only 20 PASCAL classes annotated.
(b) Performance on PASCAL Context, only 60 non-PASCAL classes annotated.
(c) Performance on PASCAL Context, all classes annotated.
(d) Performance on MS COCO, only 20 PASCAL classes annotated.
(e) Performance on MS COCO, only 60 non-PASCAL classes annotated.
(f) Performance on MS COCO, all classes annotated.
(g) Performance on NYU-Depth V2, all classes annotated.
(h) Difference in area under recall (AUC @ 20 categories - AUC @ 60 categories) on PASCAL Context.
(i) Difference in area under recall (AUC @ 20 categories - AUC @ 60 categories) on MS COCO.]

Figure A.8: Performance of different methods on PASCAL Context, MS COCO and NYU-Depth V2 with different sets of annotations.


[Plots for the bias-capacity experiment:
(a) Area under recall vs. # proposals for RCNN and objectness trained on 1, 5, 10, 20, 40 and 60 categories, with selectiveSearch and edgeBoxes as untrained baselines.
(b) AUC @ # proposals = 100 vs. # seen categories (10 to 60) for RCNN, objectness, selectiveSearch and edgeBoxes.
(c) Improvement in AUC (AUC @ # categories 60 - AUC @ # categories 10) vs. # proposals for RCNN, objectness, selectiveSearch and edgeBoxes.]

Figure A.9: Performance of RCNN and other proposal generators vs. the number of object categories used for training. We can see that RCNN has the most 'bias capacity', while the performance of the other methods is nearly (or absolutely) constant.


Appendix B

Role of Visual Attention for Visual Question Answering (VQA)

We discussed the Visual Question Answering task in Section 4.1. In this chapter, we look at ways of incorporating visual attention into the task of visual question answering. Since humans are capable of attending to specific regions of a scene based on their need, we believe that using attention for visual question answering will be useful. This form of attention differs from earlier work, which focused on finding salient regions in an image: attention for VQA is task-specific, since the region to attend to depends on the question and will be different for different images. We collect human annotation data by showing subjects image-question pairs and asking them to mark the regions necessary to answer the question. We incorporate this attention into the VQA pipeline in various ways and study the resulting performance.

B.1 Introduction

Multimodal understanding of vision and natural language is one of the big challenges of artificial intelligence. The aim is to build models powerful enough to recognize objects in a scene, understand and reason about their relationships, and express these relationships in natural language. To this end, there have been several recent efforts on tasks such as visual question answering [29, 99, 115] and image caption generation [46, 56, 64, 82, 83, 101, 132].

A visual question answering (VQA) system takes as input an image and a question about that image and produces an answer as output. This goal-driven task is applicable to scenarios where visually impaired humans or intelligence analysts obtain visual information from an intelligent system. Unlike image captioning, where a coarse understanding is often sufficient to describe images in a vague sense [55], a VQA model needs to pay attention to fine-grained details to answer a question correctly.


Prior work [65] has shown that humans have the ability to quickly perceive a scene, saccading many times per second while scanning a complex scene. Visual questions selectively target different areas of an image, including background details and underlying context. This suggests that attention mechanisms should play a crucial role in any successful VQA system. For instance, if the question is 'What color are the man's socks?', a VQA system that doesn't attend to the region in the image corresponding to the man's socks is unlikely to answer the question correctly.

There has been a lot of recent interest in modeling attention in a variety of tasks such as object recognition [32, 104], caption generation [138] and machine translation [33]. We believe attention is critical in VQA, and in this chapter, we want to study the role attention can play in VQA. We design and conduct human studies to collect 'human attention maps', i.e., where do people look when asked to answer a question about an image. These maps can be used both for evaluating generated attention maps and for explicitly training attention-based models.

Concretely, our contributions are two-fold:

• We experiment with multiple variants of our VQA attention-annotation interface that typically requires the subject to sharpen regions of a blurred image to answer visual questions accurately. Using the interface that captures attention best, we collect human annotations for the VQA dataset, which will be made publicly available.

• We perform qualitative and quantitative evaluation of the maps generated by an attention-based VQA model [140] against our human attention maps via metrics such as Earth Mover's Distance and Intersection-over-Union.

B.2 Dataset: Collection and Analysis

In order to accurately capture attention regions that help in answering visual questions, we experimented with three variants of our attention-annotation interface. In all of these, we present a blurred image and a question about the image, and ask subjects to sharpen regions of the image that are relevant to answering the question correctly, in a smooth, click-and-drag, 'coloring' motion with the mouse.
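For concreteness, below is a minimal sketch of how such a blur-and-sharpen composite could be rendered from a subject's accumulated scribble mask; the function name, blur radius, and mask format are illustrative assumptions rather than the exact implementation of our interface.

    import numpy as np
    from PIL import Image, ImageFilter

    def render_sharpened_view(image_path, scribble_mask, blur_radius=12):
        """Blend a blurred copy of the image with the original, revealing full-resolution
        pixels only where the subject has 'colored' the image (scribble_mask close to 1).

        scribble_mask: H x W array in [0, 1], the same size as the image, accumulated
        from the click-and-drag strokes (hypothetical input format)."""
        img = Image.open(image_path).convert("RGB")
        blurred = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
        m = np.clip(np.asarray(scribble_mask, dtype=np.float32), 0.0, 1.0)[..., None]
        out = m * np.asarray(img, np.float32) + (1.0 - m) * np.asarray(blurred, np.float32)
        return Image.fromarray(out.astype(np.uint8))

The accumulated mask itself would also serve as the recorded attention annotation for that image-question pair.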

B.2.1 Types of Interfaces

Starting with only a blurred image and question, as in our first interface (Fig. B.1a), has two consequences. Sometimes, we are able to capture exploratory attention, where the subject lightly sharpens large regions of an image to find salient regions that eventually lead them to the answer. However, for certain question types, such as counting ("How many ...") and binary ("Is there ...") questions, the captured attention maps are often incomplete, and hence inaccurate.


Table B.1: Human study to compare the quality of annotations collected by different interfaces

Interface Type                          Human Accuracy
Blurred Image without Ans.              75.17%
Blurred Image with Ans.                 78.69%
Blurred & Original Image with Ans.      71.2%
Original Image                          80.0%

So, in addition to the question and blurred image, in our second interface (Fig. B.1b), we show the correct answer and ask subjects to sharpen as few regions as possible such that someone can answer the question just by looking at the blurred image with sharpened regions.

To encourage exploitation instead of exploration, in our third interface (Fig. B.1c), we show them the question-answer pair and the full-resolution image. While we thought presenting the full-resolution image and answer would enable subjects to sharpen the regions most relevant to answering the question correctly, this task turns out to be counter-intuitive: we show full-resolution images but ask subjects to imagine a scenario where someone else has to answer the question without seeing the full-resolution image.

We ensure that we don't present the same image to the same subject twice. Since each image has three associated questions, it was important that the subjects didn't become familiar with the image, as that would bias the attention annotations.

We also performed a human study to compare the quality of the annotations collected by the different interfaces. Table B.1 shows that Interface 2 (Blurred Image with Answer) yields good annotations that are almost as accurate as having subjects answer by looking at the entire image and the question. We therefore used Interface 2 for collecting the rest of the annotations. For this project, we collected annotations on 60,000 image-question pairs in the training set and 5,000 image-question pairs in the validation set. Fig. B.1 shows all three interfaces.

B.2.2 Attention Map Examples

Some of the attention maps we collected are shown in Fig. B.2.

B.3 Experiments

Now that we have collected these human attention maps, we can ask the following question: do unsupervised attention models learn to predict attention maps that are similar to human attention maps? To rephrase, do neural networks look at the same regions where humans look to answer a visual question?


We evaluate the maps generated in an unsupervised manner by the Stacked Attention Network model proposed by Yang et al. [140]. The goal of our experiment is to perform a controlled study and quantify how indicative the generated attention maps are of the interpretability associated with human visual attention, based on the Earth Mover's Distance and Intersection-over-Union metrics.

Earth Mover's Distance (EMD). EMD is a measure of the distance between two probability distributions. Intuitively, given two distributions, which can be interpreted as two different ways of piling up dirt over a region, EMD measures the least amount of work required to turn one pile into the other, where a unit of work corresponds to transporting a unit of earth by a unit of ground distance.

Given an H × W machine-generated attention distribution and an H × W human attention map, we first normalize the human attention map to be a probability distribution. Computing the EMD between them can then be formalized as the following linear programming problem. Let $P = \{(r_1, p_{r_1}), \ldots, (r_{H \times W}, p_{r_{H \times W}})\}$ be the first 'signature', where $r_i$ is a spatial region and $p_{r_i}$ is the weight of that region, i.e., the attention probability; let $Q = \{(s_1, p_{s_1}), \ldots, (s_{H \times W}, p_{s_{H \times W}})\}$ be the second signature; and let $D = [d_{ij}]$ be the ground distance matrix, where $d_{ij}$ is the ground distance between regions $r_i$ and $s_j$. We want to find a set of flows $f_{ij}$ that minimize the overall cost

$$\sum_{i=1}^{H \times W} \sum_{j=1}^{H \times W} d_{ij} f_{ij}.$$

Once we have found the optimal flow $F$, EMD is defined as

$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{H \times W} \sum_{j=1}^{H \times W} d_{ij} f_{ij}}{\sum_{i=1}^{H \times W} \sum_{j=1}^{H \times W} f_{ij}},$$

where the denominator is a normalization factor that avoids favoring signatures with smaller total weights, and can be ignored when comparing probability distributions.

The attention distributions produced by the Stacked Attention Network (SAN) model [140] are 14 × 14. We use SAN with one attention layer (SAN-1). We scale down our human attention maps to 14 × 14, spatially normalize them, and compute EMD. We experiment with two different choices of ground distance $d_{ij}$: L1 and L2. Table B.2 reports the mean EMD (mEMD).
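As an illustration of the computation above, the sketch below builds the L1/L2 ground-distance matrix over a 14 × 14 grid and solves the transportation linear program directly with SciPy. This is a didactic implementation rather than the exact code used in our experiments; for 14 × 14 maps the dense LP is tractable but slow, and a dedicated optimal-transport solver would be preferable at scale. The function names are our own.

    import numpy as np
    from scipy.optimize import linprog

    def ground_distance(h, w, norm=2):
        """Pairwise L1 (norm=1) or L2 (norm=2) distance between grid-cell centers."""
        ys, xs = np.mgrid[0:h, 0:w]
        pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)   # (h*w, 2)
        diff = pts[:, None, :] - pts[None, :, :]
        return np.linalg.norm(diff, ord=norm, axis=-1)                   # (h*w, h*w)

    def emd(p, q, dist):
        """EMD between two probability maps p and q (flattened to length n), with
        ground-distance matrix dist (n x n), via the transportation LP."""
        n = p.size
        c = dist.ravel()                                  # cost of each flow f_ij
        A_eq = np.zeros((2 * n, n * n))
        for i in range(n):
            A_eq[i, i * n:(i + 1) * n] = 1.0              # row constraint: sum_j f_ij = p_i
            A_eq[n + i, i::n] = 1.0                       # column constraint: sum_i f_ij = q_j
        b_eq = np.concatenate([p, q])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        flow = res.x
        return float(c @ flow) / float(flow.sum())        # normalization as in the text

    # Example: compare a (downscaled, spatially normalized) human map with a SAN-1 map.
    # Random arrays stand in for real attention maps here.
    H = W = 14
    human = np.random.rand(H, W)
    human /= human.sum()
    machine = np.random.rand(H, W)
    machine /= machine.sum()
    D = ground_distance(H, W, norm=2)                     # norm=1 gives the L1 variant
    print(emd(human.ravel(), machine.ravel(), D))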

Intersection-over-Union. We binarize our spatially normalized human attention maps and the SAN-1 generated attention maps by setting thresholds on region probabilities, t = {0.003, 0.005, 0.007}, and compute the intersection-over-union (IoU):


$$\mathrm{IoU}(P, Q)_t = \frac{\sum_{i=1}^{H \times W} \mathbf{1}\{p_{r_i} > t \wedge p_{s_i} > t\}}{\sum_{i=1}^{H \times W} \mathbf{1}\{p_{r_i} > t \vee p_{s_i} > t\}}$$
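A matching sketch of this thresholded IoU, assuming both maps are 14 × 14 arrays that have already been spatially normalized; the names are ours, and the thresholds follow Table B.2.

    import numpy as np

    def attention_iou(p, q, t):
        """Thresholded IoU between two spatially normalized attention maps p and q."""
        a, b = p > t, q > t
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union > 0 else 0.0

    # Mean IoU over a set of (human, machine) map pairs at the thresholds of Table B.2
    # ('pairs' is a hypothetical list of such array pairs):
    # for t in (0.003, 0.005, 0.007):
    #     print(t, np.mean([attention_iou(h, m, t) for h, m in pairs]))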

Table B.2 shows the mean IoU for different choices of t. We compare with uniform attention maps, where every region has equal probability, and random attention maps, where we sample each region probability from [0, 1] and normalize spatially.

Method                      SAN-1    Uniform    Random
mEMD (L1)                   3.29     2.21       2.25
mEMD (L2)                   2.63     1.75       1.79
mean IoU (t = 0.003)        0.263    0.573      0.453
mean IoU (t = 0.005)        0.168    0.343      0.253
mean IoU (t = 0.007)        0.092    0.0        0.151

Table B.2: Evaluation of machine-generated attention maps

B.4 Discussion

Necessary vs. Sufficient. If human attention maps are sufficient as ground truth for attention supervision, then so is any superset. For example, if attention mass is concentrated on a 'cat' for 'What animal is present in the picture?', then an attention map that assigns weight to any arbitrarily sized region that includes the 'cat' is sufficient as well. On the contrary, a necessary and sufficient attention map would be the smallest visual region important for answering the question accurately. It is hard to define necessary and sufficient attention maps in the space of pixels: random pixels can be blacked out, and chances are that humans would still be able to answer the question given the resulting subset attention map. In pixel space, human attention maps will always be approximate. However, it may be possible to work within a more constrained framework, for example hard attention blobs or bounding boxes, and define necessary and sufficient within that framework. Ideally, human attention maps should be necessary and sufficient in the semantic space. This distinction is important because it makes sense to train a model to emulate the best machines, i.e., humans, so attention maps that are necessary and sufficient for humans would make for the 'best' ground truth.


B.5 Conclusions

We experiment with multiple interfaces to collect human attention maps for visual question answering. Using the interface that gives the highest human accuracy, we collect a human attention map dataset, which will be made publicly available. This dataset can be used to evaluate attention maps generated in an unsupervised manner by attention-based neural network models, or to explicitly train models with attention supervision for visual question answering.


(a) Interface 1: Blurred image and question without answer

(b) Interface 2: Blurred image and question with answer

(c) Interface 3: Blurred image and original image with question and answer

Figure B.1: The three interfaces we experimented with for the data collection process


(a) Q: What is in the vase?

(b) Q: What is the man using to take a picture?

(c) Q: What is floating in the sky?

Figure B.2: Attention Map Examples


Bibliography

[1] Advanced Message Queueing Protocol. Website - https://www.amqp.org. 14

[2] AForge.NET Image Processing Lab. Website - http://www.aforgenet.com/. 7

[3] Amazon elastic compute cloud (amazon ec2). http://aws.amazon.com/ec2/. 8

[4] Celery: Distributed Task Queue. Website - http://www.celeryproject.org/. 14

[5] Clarifai. http://www.clarifai.com/. 8

[6] Django: The web framework for perfectionists with deadlines. Website - https://www.djangoproject.com/. 10

[7] Docker. https://www.docker.com/. 34

[8] Integrating Vision Toolkit. Website - http://ivt.sourceforge.net/. 7

[9] javaScript Object Notation. Website - http://www.json.org/. 14

[10] Node.js. Website - https://nodejs.org/. 11

[11] Nvidia digits. https://developer.nvidia.com/digits. 34

[12] Orbeus rekognition. https://rekognition.com/. 8

[13] Redis. Website - http://redis.io/. 11

[14] Socket.IO. Website - http://socket.io/. 11

[15] The Graphlab Computer Vision Toolkit. http://graphlab.org/toolkits/computer-vision/. 14

[16] The Vision-something-Libraries. Website - http://vxl.sourceforge.net/. 7

[17] Torch:A scientific computing framework for LUAJIT. Website - http://torch.ch/. 7

[18] vision.ai. http://vision.ai/. 8

[19] Watson services. https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/services-catalog.html. 8

[20] Big data, big impact: New possibilities for international development. World Economic Forum Report – http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development, 2012. 1

[21] Distributed neural networks with gpus in the aws cloud. http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html, 2014. 1

[22] Why nvidia thinks it can power the ai revolution. http://gigaom.com/2014/03/31/why-nvidia-thinks-it-can-power-the-ai-revolution/, 2014. 1


[23] 50 deep learning software tools and platforms, updated. http://www.kdnuggets.com/2015/12/deep-learning-tools.html, 2015. 2

[24] Google computer vision research at cvpr 2015. http://googleresearch.blogspot.com/2015/06/google-computer-vision-research-at-cvpr.html, 2015. 1

[25] Baidu eyes deep learning strategy in wake of new gpu options. http://www.nextplatform.com/2016/04/22/baidu-eyes-deep-learning-strategy-wake-new-gpu-options/, 2016. 1

[26] Google cloud vision api. https://cloud.google.com/vision/, 2016. 8

[27] Microsoft cognition services. https://www.microsoft.com/cognitive-services/en-us/apis, 2016. 8

[28] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI,2012. 41, 44, 45, 50, 52, 53

[29] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visualquestion answering. In ICCV, 2015. xi, 22, 24, 61

[30] P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. D. Bourdev, and J. Malik. Semantic segmen-tation using regions and parts. 2012. 41

[31] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorialgrouping. 2014. 41, 44, 50, 52, 53

[32] J. L. Ba, V. Mnih, and K. Kavukcuoglu. Multiple Object Recognition With Visual Attention.Iclr-2015, 2015. 62

[33] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to alignand translate. CoRR, abs/1409.0473, 2014. 62

[34] D. Banica and C. Sminchisescu. CPMC-3D-O2P: semantic segmentation of RGB-D imagesusing CPMC and second order pooling. CoRR, abs/1312.7715, 2013. 44

[35] D. Banica and C. Sminchisescu. Second-order constrained parametric proposals and sequen-tial search-based structured prediction for semantic segmentation in rgb-d images. In CVPR,2015. 44

[36] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard,and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsu-pervised Feature Learning NIPS 2012 Workshop, 2012. 7

[37] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Comput.Vis. Image Underst., 110(3):346–359, June 2008. 30

[38] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. InProceedings of the Python for Scientific Computing Conference (SciPy), June 2010. 7

[39] G. B. Berriman and S. L. Groom. How will astronomy archives survive the data tsunami?Queue, 9(10):21:20–21:27, Oct. 2011. 1

[40] D. S. Bolme and S. O’Hara. Pyvision - computer vision toolkit. Website - http://pyvision.sourceforge.net, 2008. 7

[41] J. Y. Bouguet. Camera calibration toolbox for Matlab. Website - http://www.vision.caltech.edu/bouguetj/calib_doc/, 2008. 7


[42] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library.O’Reilly, 2008. Website - http://opencv.org. 7, 14, 29

[43] M. Brown and D. Lowe. Automatic panoramic image stitching using invariant features. IJCV,74(1):59–73, August 2007. 30

[44] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. 2012. 41

[45] J. Carreira and C. Sminchisescu. CPMC: Automatic Object Segmentation Using ConstrainedParametric Min-Cuts. PAMI, 34, 2012. 41

[46] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent visual representation for imagecaption generation. June 2015. 61

[47] X. Chen, H. Ma, X. Wang, and Z. Zhao. Improving object proposals with multi-thresholdingstraddling expansion. In CVPR, 2015. 44

[48] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. Bing: Binarized normed gradients forobjectness estimation at 300fps. In CVPR, 2014. 41, 42, 44, 50, 52, 53, 55

[49] M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localizationin the wild: Part-based matching with bottom-up region proposals. CoRR, abs/1501.06170,2015. 41, 42, 43, 56

[50] R. G. Cinbis and S. Sclaroff. Contextual object detection using set-based classification. 2012.41, 43

[51] R. G. Cinbis, J. Verbeek, and C. Schmid. Segmentation Driven Object Detection with FisherVectors. 2013. 41

[52] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmen-tation. In CVPR, 2015. 41

[53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-ScaleHierarchical Image Database. In CVPR, 2009. 2

[54] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning withgeneric knowledge. (3):275–293, 2012. 41, 42, 43, 56

[55] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring Nearest NeighborApproaches for Image Captioning. arXiv preprint, 2015. 61

[56] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, andT. Darrell. Long-term recurrent convolutional networks for visual recognition and description.CoRR, abs/1411.4389, 2014. 61

[57] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. De-caf: A deep convolutional activation feature for generic visual recognition. arXiv preprintarXiv:1310.1531, 2013. 51, 52

[58] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: Adeep convolutional activation feature for generic visual recognition. In ICML, 2014. 23

[59] I. Endres and D. Hoiem. Category-independent object proposals with diverse ranking. PAMI,36(2):222–234, 2014. 41, 44, 45, 50, 53

[60] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deepneural networks. CoRR, abs/1312.2249, 2013. 41


[61] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. ThePASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. 42, 50

[62] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. ThePASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html. 53

[63] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. ThePASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html. 26

[64] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He,M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts andback. CoRR, abs/1411.4952, 2014. 61

[65] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona. What do we perceive in a glance of a real-worldscene? Journal of Vision, 7(1):10, 2007. 62

[66] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformablepart models, release 4. http://www.cs.brown.edu/ pff/latent-release4/. 7

[67] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection withdiscriminatively trained part based models. PAMI, 32(9):1627–1645, 2010. 52

[68] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik. Spatio-temporal moving object propos-als. arXiv preprint arXiv:1412.6504, 2014. 44

[69] Y. Furukawa. Clustering Views for Multi-view Stereo (CMVS). Website - http://grail.cs.washington.edu/software/cmvs/. 7

[70] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. Van Gool. Deepproposal: Huntingobjects by cascading deep convolutional layers. 2015. 42, 44, 55

[71] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurateobject detection and semantic segmentation. CoRR, abs/1311.2524, 2013. 41, 51, 52

[72] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning Rich Features from RGB-DImages for Object Detection and Segmentation. arXiv preprint arXiv:1407.5736, 2014. 44

[73] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networksfor visual recognition. CoRR, 2014. 41

[74] S. He and R. W. Lau. Oriented object proposals. In ICCV, 2015. 41

[75] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In Proc. 10thEuropean Conference on Computer Vision, 2008. 42

[76] J. Hosang, R. Benenson, P. Dollar, and B. Schiele. What makes for effective detectionproposals? arXiv preprint arXiv:1502.05082, 2015. 45, 50

[77] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In BMVC,2014. 44, 50

[78] A. Humayun, F. Li, and J. M. Rehg. Rigor- recycling inference in graph cuts for generatingobject regions. In CVPR, 2014. 41, 42, 44, 55


[79] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, andT. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprintarXiv:1408.5093, 2014. 7, 14, 34

[80] C. Kading, A. Freytag, E. Rodner, P. Bodesheim, and J. Denzler. Active learning anddiscovery of object categories in the presence of unnameable instances. In CVPR, 2015. 41,43

[81] H. Kang, M. Hebert, A. A. Efros, and T. Kanade. Data-driven objectness. IEEE Trans.Pattern Anal. Mach. Intell., 37(1):189–195, 2015. 42, 55

[82] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions.CoRR, abs/1412.2306, 2014. 61

[83] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings withmultimodal neural language models. CoRR, abs/1411.2539, 2014. 61

[84] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts?PAMI, 26(2):147–159, 2004. 7

[85] P. Krahenbuhl and V. Koltun. Geodesic object proposals. 2014. 41, 42, 44, 50, 53, 55

[86] P. Krahenbuhl and V. Koltun. Learning to propose objects. In CVPR, 2015. 42, 55

[87] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutionalneural networks. In NIPS, 2012. 23

[88] W. Kuo, B. Hariharan, and J. Malik. Deepbox: Learning objectness with convolutionalnetworks. 2015. 42, 44, 55

[89] A. Kuznetsova, S. Ju Hwang, B. Rosenhahn, and L. Sigal. Expanding object detector’shorizon: Incremental learning framework for object detection in videos. In CVPR, 2015. 41

[90] K. Kvilekval, D. Fedorov, B. Obara, A. Singh, and B. Manjunath. Bisque: A platform forbioimage analysis and management. Bioinformatics, 26(4):544–552, Feb 2010. 1

[91] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid. Unsupervised object discovery andtracking in video collections. CoRR, abs/1505.03825, 2015. 41, 43

[92] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Buildinghigh-level features using large scale unsupervised learning. In International Conference inMachine Learning, 2012. 2

[93] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. 1

[94] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. 41

[95] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick.Microsoft COCO: Common objects in context. In ECCV, 2014. 52, 54, 55

[96] S. Lohr. The age of big data. New York Times – http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all, 2012. 1

[97] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab:A new parallel framework for machine learning. In UAI, 2010. 14

[98] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision,60(2):91–110, Nov. 2004. 30


[99] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neuralnetwork. CoRR, abs/1506.00333, 2015. 61

[100] S. Manen, M. Guillaumin, and L. Van Gool. Prime object proposals with randomized prim’salgorithm. 2013. 41, 44, 45, 50, 52, 53

[101] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodalrecurrent neural networks. CoRR, abs/1410.1090, 2014. 61

[102] C. S. Mathialagan, D. Batra, and A. C. Gallagher. Vip: Finding important people in groupimages. In Computer Vision and Pattern Recognition, 2015. 29, 30

[103] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning forobject detectors from video. In CVPR, 2015. 44

[104] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent Models of Visual Attention.2014. 62

[105] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille.The role of context for object detection and semantic segmentation in the wild. In CVPR,2014. 52, 53, 58

[106] P. K. Nathan Silberman, Derek Hoiem and R. Fergus. Indoor segmentation and supportinference from rgbd images. In ECCV, 2012. 52, 54

[107] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection propos-als. In ECCV 2014, 2014. 44

[108] W. Ouyang, P. Luo, X. Zeng, S. Qiu, Y. Tian, H. Li, S. Yang, Z. Wang, Y. Xiong, C. Qian,et al. Deepid-net: multi-stage and deformable deep convolutional neural networks for objectdetection. arXiv preprint arXiv:1409.3505, 2014. 41

[109] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. CoRR,abs/1506.06204, 2015. 42, 44, 55

[110] J. Pont-Tuset and L. Van Gool. Boosting object proposals: From pascal to coco. In Interna-tional Conference on Computer Vision, 2015. 45

[111] A. Prest, C. Schmid, and V. Ferrari. Weakly supervised learning of interactions betweenhumans and objects. PAMI, 34(3):601–614, 2012. 41, 43

[112] E. Rahtu, J. Kannala, and M. B. Blaschko. Learning a category independent object detectioncascade. 2011. 41, 44, 50, 52, 53

[113] P. Rantalankila, J. Kannala, and E. Rahtu. Generating object segmentation proposals usingglobal and local search. 2014. 41, 44

[114] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: anastounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014. 23

[115] M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering.arXiv preprint arXiv:1505.02074, 2015. 61

[116] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery andsegmentation in internet images. In CVPR. IEEE, 2013. 41, 42, 43

[117] O. Russakovsky, J. Deng, J. Krause, A. Berg, and L. Fei-Fei. The ImageNetLarge Scale Visual Recognition Challenge 2012 (ILSVRC2012). http://www.image-net.org/challenges/LSVRC/2012/. 26


[118] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual RecognitionChallenge, 2014. 41, 45

[119] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117,2015. 1

[120] F. Sener, C. Bas, and N. Ikizler-Cinbis. On recognizing actions in still images via multiplefeatures. 2012. 41

[121] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. InProceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies,pages 1–10, 2010. 4

[122] SkyBiometry. https://www.skybiometry.com/. 29

[123] N. Snavely. Bundler: Structure from Motion (SfM) for Unordered Image Collections. Website- http://phototour.cs.washington.edu/bundler/. 7

[124] N. H. Strickland. Pacs (picture archiving and communication systems): filmless radiology.Archives of Disease in Childhood, 83(1):82–86, 2000. 1

[125] J. Sun and H. Ling. Scale and object aware image retargeting for thumbnail browsing. 2011.41, 43

[126] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.41

[127] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection.CoRR, abs/1412.1441, 2014. 41

[128] B. Triggs, P. Mclauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment – a modernsynthesis. In Vision Algorithms: Theory and Practice, volume 1883 of Lecture Notes inComputer Science, pages 298–372. Springer-Verlag, 1999. 31

[129] Y.-H. Tsai, O. C. Hamsici, and M.-H. Yang. Adaptive region pooling for object detection.In CVPR, 2015. 41

[130] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for objectrecognition. IJCV, 2013. 41, 42, 44, 50, 52, 53, 55

[131] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer visionalgorithms. http://www.vlfeat.org/, 2008. 7

[132] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image captiongenerator. CoRR, abs/1411.4555, 2014. 61

[133] C. Wang, L. Zhao, S. Liang, L. Zhang, J. Jia, and Y. Wei. Object proposal by multi-branchhierarchical segmentation. In CVPR, 2015. 44

[134] N. Wang, S. Li, A. Gupta, and D. Yeung. Transferring rich feature hierarchies for robustvisual tracking. CoRR, abs/1501.04587, 2015. 41

[135] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. 2013. 41

[136] C. Wu. VisualSFM : A Visual Structure from Motion System. Website - http://www.cs.washington.edu/homes/ccwu/vsfm/. 7


[137] Z. Wu, F. Li, R. Sukthankar, and J. M. Rehg. Robust video segment proposals with painlessocclusion handling. In CVPR, 2015. 44

[138] K. Xu, A. Courville, R. S. Zemel, and Y. Bengio. Show , Attend and Tell : Neural ImageCaption Generation with Visual Attention. Arxiv, 2015. 62

[139] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. InCVPR, pages 1385–1392, 2011. 7

[140] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for imagequestion answering. CoRR, abs/1511.02274, 2015. 62, 64

[141] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In CVPR,2015. 44

[142] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia. Semantic object segmentation via detectionin weakly labeled video. In CVPR, 2015. 41

[143] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segdeepm: Exploiting segmentationand context in deep neural networks for object detection. In CVPR, 2015. 41

[144] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. 2014. 41, 42,44, 50, 51, 52, 53, 55