

VIP: Finding Important People in Images

Clint Solomon Mathialagan
Virginia Tech
[email protected]

Andrew C. Gallagher
Google
[email protected]

Dhruv Batra
Virginia Tech
[email protected]

Abstract

People preserve memories of events such as birthday parties, weddings, or vacations by capturing photos, often depicting groups of people. Invariably, some persons in the image are more important than others given the context of the event. This paper analyzes the concept of the importance of specific individuals in photos of multiple people. Two questions that have several practical applications are addressed: Who are the most important person(s) in an image? And, given multiple images of a person, which one depicts the person in the most important role? We introduce an importance measure of people in images and investigate the correlation between importance and visual saliency. We find that not only can we automatically predict the importance of people from purely visual cues, but that incorporating this predicted importance results in significant improvement in applications such as im2text (generating sentences that describe images of groups of people).

1. Introduction

When multiple people are present in a photograph, there is usually a story behind the situation that brought them together: a concert, a wedding, or just a gathering of a group of friends. In this story, not everyone plays an equal part. Some person(s) are the main character(s) and play a more important role.

Consider the picture in Fig. 1a. Here, the central characters are two people who appear to be the British Queen and the Bishop. Notice that their identities and social status play a role in establishing their positions as the central characters. However, it is clear that even someone unfamiliar with the oddities and eccentricities of the British Monarchy, who simply views this as a picture of an elderly woman and a gentleman in costume receiving attention from a crowd, would consider those two to be the central characters in that scene.

Fig. 1b shows an example with people who do not appear to be celebrities. We can see that two people in the foreground are clearly the focus of attention, and two others in the background are not. Fig. 1c shows a common kind of photograph, with a group of friends, where everyone is nearly equally important. It is clear that even without recognizing the identities of the people, we as humans have a remarkable ability to understand social roles and identify the important players.

Goal and Overview. The goal of our work is to automatically predict the importance of people in group photographs. In order to keep our approach general and applicable to any new image, we focus purely on visual cues available in the image, and do not assume identification of the people. Thus, we do not use social prominence cues. For example, given Fig. 1a, we want an algorithm that identifies the elderly woman and the gentleman as the top-2 most important people (among all people in the image) without utilizing the knowledge that the elderly woman is the British Queen.

What is importance? In defining importance, we can consider the perspective of three parties, which do not necessarily agree:

• the photographer, who presumably intended to capture some subset of people, and perhaps had no choice but to capture others;

• the subjects, who presumably arranged themselves following social inter-personal rules; and

• neutral third-party human observers, who may be unfamiliar with the subjects of the photo and the photographer's intent, but may still agree on the (relative) importance of people.

Navigating this landscape of perspectives involves many complex social relationships: the social status of each person in the image (an award winner, a speaker, the President), and the social biases of the photographer and the viewer (e.g., gender or racial biases); none of these can be easily mined from the photo itself. At its core, the question itself is subjective: if the British Queen "photo-bombs" while you are taking a picture of your friend, who is more important in that photo?



(a) Socially prominent people. (b) Relatively less famous people. (c) Equally important people.
Figure 1: Who are the most important persons in these pictures? In (a), the two important people appear to be the British Queen and the Bishop. In (b), the person giving the award and the person receiving it play the main roles, with two others in the background. In (c), everyone seems to be nearly equally important. Often, people agree on the importance judgment even without knowing the identities of the people in the images.

In this work, to establish a quantitative protocol, we rely on the wisdom of the crowd to estimate the "ground-truth" importance of a person in an image. We found the design of the annotation task and the interface to be particularly important, and discuss these details in the paper.

Applications. A number of applications can benefit from knowing the importance of people. Algorithms for im2text (generating sentences that describe images) can be made more human-like if they describe only the important people in the image and ignore unimportant ones. Photo cropping algorithms can do "smart cropping" of images of people by keeping only the important people. Social networking sites and image search applications can benefit by ranking photos where the queried person is important higher than photos where that person is simply present in the background.

Contributions. This paper makes the following contributions. First, we learn a model for predicting the importance of people in photos based on a variety of features that capture the pose and arrangement of the people. Second, we collect two importance datasets that serve to evaluate our approach, and will be broadly useful to others in the community studying related problems. Finally, we show that not only can we automatically predict the importance of people from purely visual cues, but that incorporating this predicted importance results in significant improvement in applications such as im2text. Despite the naturalness of the task, to the best of our knowledge, this is the first paper to directly infer the importance of people in the context of a single group image.

2. Related Work

At a high level, our work is related to a number of previous works that study the concept of importance.

General object importance. The importance of general object categories has been studied in several recent works in computer vision [16] [8] [1]. In the approach of Berg et al. [1], importance is defined as the likelihood that an object in an image will be mentioned in a sentence describing the image, written by a person. The key distinction between their work and ours is that they study the problem at a category level ("are people more important than dogs?"), while we study it at an instance level ("is person A more important than person B?"), restricted only to instances of people. One result from [1] is that the person category tends to be important in most kinds of scenes. Differentiating the importance of different individuals in an image is beneficial, as it produces a more fine-grained understanding of the image.

Visual saliency. A number of works [3] [12] [7] have studied visual saliency, identifying which parts of an image draw viewer attention. Humans tend to be naturally salient content in images. Perhaps the closest to our goal is the work of Jiang et al. [9], who study visual saliency in group photographs and crowded scenes. Their objective is to build a visual saliency model that takes into account the presence of faces in the image. Although they study the same content as our work (group photographs), the goals of the two are different – saliency vs. importance. At a high level, saliency is about what draws the viewer's attention; importance is a higher-level concept about social roles. We conduct extensive human studies, and discuss this comparison in the paper. Saliency is correlated with, but not identical to, importance. People in photos may be salient but not important, important but not salient, both, or neither.

Understanding group photos. A line of work in computer vision studies photographs of groups of people [5] [13] [14] [4] [6], addressing issues such as the structural formation and attributes of groups. Li et al. [11] predict the aesthetics of a group photo. If the measure is below a threshold, photo cropping is suggested by eliminating unimportant faces and regions that do not seem to fit in with the general structure of the group. While their goal is closely related to ours, they study aesthetics, not importance. They suggest that a face be retained or cropped depending on how it affects the aesthetics of the group shot. To the best of our knowledge, we are the first to predict the importance of individuals in a group photo.


(a) Image-Level annotation interface. (b) Corpus-Level annotation interface.
Figure 2: Annotation interfaces: (a) Image level: hovering over a button (A or B) highlights the person associated with it. (b) Corpus level: hovering over a frame shows where the person is located in the frame.

3. Approach

Recall that our goal is to model and predict the importance of people in images. We model importance in two ways:

• Image-Level importance: In this setting, we are interested in the question – "Who is the most important person in this image?" This reasoning is local to the image in question, and the objective is to predict an importance score for each person in the image.

• Corpus-Level importance: Here, the question is "In which image is this specific person most important?" This reasoning is across a corpus of photos (each containing a person of interest), and the objective is to assign an importance score to each image.

3.1. Dataset Collection

In order to study both of these settings, we curate and annotate two datasets (one for each setting).

Image-Level Dataset. We need a dataset of images, each containing at least three people with varying levels of importance. While the 'Images of Groups' dataset [5] has a number of photos with multiple people, these are not ideal for studying importance, as most images are group shots, as in Fig. 1c, where everyone poses for the camera and everyone is nearly equally important.

We collected a dataset of 200 images by mining Flickr for images with appropriate licenses using search queries such as "people+events", "gathering", and so on. Each image has three or more people, at varying levels of importance. In order to predict the importance of people in an image, the people must first be annotated. For the scope of this work, we treat face detection as a solved problem. Specifically, the images were first annotated using a face detection API [15]. This face detection service has a remarkably low false-positive rate. Missing faces and heads were annotated manually. There are 1315 total annotated people in the dataset, with ∼6.5 persons per image. Example images are shown throughout the paper, and more images are available in the supplement.

Corpus-Level Dataset. In this setting, we need a dataset that has multiple pictures of the same person, and multiple sets of such photos. The ideal source for such a dataset would be social networking sites. However, privacy concerns hinder the annotation of these images via crowdsourcing. TV series, on the other hand, have multiple frames with the same people and are good sources for such a dataset. Since temporally-close frames tend to be visually similar, these videos must be sampled properly to get varied images.

The personID dataset by Tapaswi et al. [17] contains face track annotations with character identification for the first six episodes of the Big Bang Theory TV series. The track annotation of a person gives the coordinates of the face bounding boxes for that person in every frame. By selecting only one frame from each track of a character, one can get diverse frames for that character from the same episode. From each track, we selected the frame that has the most people. Some selected frames have only one person in them, but that is acceptable, since the task is to pick the most important frame for a person. In this manner, a distinct set of frames was obtained for each of the five main characters in each episode.

3.2. Importance Annotation

We collected ground-truth importance for both datasets via Amazon Mechanical Turk (AMT). We conducted pilot experiments to identify the best way to annotate these datasets and pose the question of importance. We found that when people were posed an absolute question – "please mark the important people in this image" – they found the task difficult. The Turkers commented that they had to redefine their notion of importance every time a new image was shown, making it difficult to be consistent. Indeed, we found low inter-human agreement, and a general tendency for some workers to annotate everyone as important and others to annotate only one or two.

To avoid these artifacts, we redesigned the tasks to be pairwise questions. This made the tasks simpler, and the annotations more consistent.

Image-Level Importance Annotation. From each image in the Image-Level Dataset, random pairs of faces were selected to produce a set of 1078 pairs. These pairs cover 91.82% of the total faces in these images. For each selected pair, ten AMT workers were asked to pick the more important of the two. The interface is shown in Fig. 2a, and an HTML page is provided in the supplement. In addition to clicking on a face, the workers were also asked to report the magnitude of the difference in importance between the two people: significantly different, slightly different, or almost the same. This forms a three-tier scoring system, as depicted in Table 1.

Turker selection: A is ...             A's score   B's score
significantly more important than B    1.00        0.00
slightly more important than B         0.75        0.25
about as important as B                0.50        0.50

Table 1: Pairwise annotations to importance scores.

For an annotated pair (p_i, p_j), the relative importance scores s_i and s_j range from 0 to 1 and indicate the relative difference in importance between p_i and p_j. Note that s_i and s_j are not absolute, as they are not calibrated for comparison to another person, say p_k from another pair.
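As a concrete illustration, the mapping from a worker's forced choice to the score pair follows Table 1 directly. The sketch below is minimal; the label names are hypothetical, not taken from our annotation pipeline.

```python
# Hypothetical label names for the three response options in Table 1.
# "A" is the person the worker selected as more important.
PAIRWISE_SCORES = {
    "significantly_more": (1.00, 0.00),
    "slightly_more":      (0.75, 0.25),
    "about_same":         (0.50, 0.50),
}

def annotation_to_scores(choice, a_is_first):
    """Return (s_i, s_j), given the worker's choice and whether the
    selected person A is the first member of the pair (p_i, p_j)."""
    s_a, s_b = PAIRWISE_SCORES[choice]
    return (s_a, s_b) if a_is_first else (s_b, s_a)
```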

Corpus-Level Importance Annotation. From the Corpus-Level Dataset, approximately 1000 pairs of frames were selected. Each pair contains frames depicting the same person but from different episodes. This ensures that the pairs do not contain similar-looking images. AMT workers were shown a pair of frames for a character and asked to pick the frame where the character appears to be more important. The interface is shown in Fig. 2b, and an HTML page is provided in the supplement.

Similar to the previous case, workers were asked to pick a frame and indicate the magnitude of the difference in importance of the character. The magnitude choices are converted into scores as shown in Table 1.

Table 2 shows a breakdown of both datasets along the magnitude of differences in importance. We note some interesting similarities and differences. Both datasets have nearly the same percentage of 'almost-same' pairs. The image-level dataset has significantly more pairs in the 'significantly-more' category than the corpus-level dataset. This is because in a TV series, the characters in a scene are usually playing some sort of role in the scene, unlike typical consumer photographs, which tend to contain many people in the background. Overall, both datasets contain a healthy mix of the three categories.

Pair Category        Image-Level   Corpus-Level
significantly-more   32.65%        18.30%
slightly-more        20.41%        39.70%
almost-same          46.94%        42.00%

Table 2: Distribution of pairs in the datasets.

3.3. Importance Model

We now formulate a general importance prediction model that is applicable to both setups – image-level and corpus-level. As we can see from the dataset characteristics in Table 2, our model should not only be able to say which person is more important, but also predict the relative strengths between pairs of people/images. Thus, we formulate this as a regression problem. Specifically, given a pair of people (p_i, p_j) (coming from the same or different images) with scores s_i, s_j, the objective is to build a model M that regresses to the difference in ground-truth importance scores:

M(p_i, p_j) ≈ s_i − s_j    (1)

We use a linear model: M(p_i, p_j) = wᵀφ(p_i, p_j), where φ(p_i, p_j) are the features extracted for this pair, and w are the regressor weights. We use ν-Support Vector Regression to learn these weights.

Our pairwise features φ(p_i, p_j) are composed from features extracted for the individual people, φ(p_i) and φ(p_j). In our experiments, we compare two ways of composing these individual features – using the difference of features, φ(p_i, p_j) = φ(p_i) − φ(p_j), and concatenating the two individual features, φ(p_i, p_j) = [φ(p_i); φ(p_j)].
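For concreteness, a minimal sketch of this training setup is shown below; it assumes per-person features have already been extracted, and uses scikit-learn's NuSVR with a linear kernel as a stand-in for the exact ν-SVR implementation, which the paper does not specify.

```python
import numpy as np
from sklearn.svm import NuSVR

def compose_pair(phi_i, phi_j, mode="difference"):
    """Compose a pairwise feature from two per-person feature vectors."""
    if mode == "difference":
        return phi_i - phi_j
    return np.concatenate([phi_i, phi_j])   # "concatenation" variant

def train_importance_model(phi, pairs, s, C=1.0):
    """phi: (num_people, 45) per-person features; pairs: list of (i, j)
    index pairs; s: ground-truth importance score per person."""
    X = np.stack([compose_pair(phi[i], phi[j]) for i, j in pairs])
    y = np.array([s[i] - s[j] for i, j in pairs])   # target: s_i - s_j
    model = NuSVR(kernel="linear", C=C)             # linear nu-SVR
    model.fit(X, y)
    return model
```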

3.4. Person Features

We now describe the features we used to assess the importance of a person. Recall that we assume all faces in the images have been detected.

Distance Features. We use a number of different ways to capture distances between faces in the image.

Photographers often frame their subjects. In fact, a number of previous works [19] [2] [18] have reported a "center bias" – the objects or people closest to the center tend to be the most important. Thus, we compute two distance features. The image is first rescaled to the unit square [1, 1].
Normalized distance from center: the distance from the center of the head bounding box to the center of the image, [0.5, 0.5].
Weighted distance from center: the previous feature divided by the maximum dimension of the face bounding box, so that larger faces are not considered to be farther from the center.

We compute two more features to capture how far a person is from the center of the group of people.
Normalized distance from centroid: first, we find the centroid of all the center points of the heads; then, we compute the distance of a face to this centroid.

Normalized distance from weighted centroid: here, the centroid is calculated as the weighted average of the center points of the heads, the weight of a head being the ratio of the area of the head to the total area of heads in the image.
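A sketch of these four distance features follows, assuming head boxes are given as (x, y, w, h) in coordinates already rescaled to the unit square; the helper name is ours, not the paper's.

```python
import numpy as np

def distance_features(boxes):
    """boxes: (n, 4) array of (x, y, w, h) head boxes in unit-square
    coordinates. Returns (n, 4): [distance from center, weighted distance
    from center, distance from centroid, distance from weighted centroid]."""
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0       # head-box centers

    d_center = np.linalg.norm(centers - 0.5, axis=1)  # image center is (0.5, 0.5)
    d_weighted = d_center / boxes[:, 2:].max(axis=1)  # divide by max box dimension

    centroid = centers.mean(axis=0)                   # centroid of head centers
    d_centroid = np.linalg.norm(centers - centroid, axis=1)

    areas = boxes[:, 2] * boxes[:, 3]                 # weights: share of total head area
    w_centroid = (centers * (areas / areas.sum())[:, None]).sum(axis=0)
    d_wcentroid = np.linalg.norm(centers - w_centroid, axis=1)

    return np.stack([d_center, d_weighted, d_centroid, d_wcentroid], axis=1)
```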

Scale of the bounding box. Large faces in the image often correspond to people who are closer to the camera, and perhaps more important. This feature is the ratio of the area of the head bounding box to the area of the image.

Sharpness. Photographers often use a narrow depth of field to keep the intended subjects in focus while blurring the background. In order to capture this phenomenon, we compute a sharpness feature for every face. We apply a Sobel filter to the image and compute the sum of the gradient energy in a face bounding box, normalized by the sum of the gradient energy in all the bounding boxes in the image.
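A sketch of this feature, assuming OpenCV's Sobel operator (the paper does not name a library) and face boxes in pixel coordinates:

```python
import cv2
import numpy as np

def sharpness_features(gray, boxes):
    """gray: grayscale image; boxes: list of (x, y, w, h) face boxes in
    pixels. Returns each face's share of the total gradient energy."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    energy = gx ** 2 + gy ** 2                       # gradient energy map
    face_energy = np.array([energy[y:y + h, x:x + w].sum()
                            for (x, y, w, h) in boxes])
    return face_energy / face_energy.sum()           # normalize over all faces
```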

Face Pose Features. The facial pose of a person can be a good indicator of their importance, because important people often tend to be looking directly at the camera.

DPM face pose features: we resize the head bounding box patch from the image to 128×128 pixels, and run the face pose and landmark estimation algorithm of Zhu et al. [20]. Note that [20] is a mixture model in which each component corresponds to an angle of orientation of the face, in the range of −90° to +90° in steps of 15°. Our pose feature is this component id, which can range from 1 to 13. We also use a 13-dimensional indicator feature that has a 1 in the component with the maximum score and zeros elsewhere.

Aspect ratio: we also use the aspect ratio of the head bounding box as a feature. While the aspect ratio of a person's head is normally 1:1, this ratio can differentiate between some head poses, such as frontal and lateral poses.

DPM face pose difference: it is often useful to know where the crowd is looking, and where a particular person is looking. To capture this pose difference between a person and the others in the image, we compute, as a feature, the pose of the person minus the average pose of every other person in the image.
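A minimal sketch of this pose-difference feature, assuming each person's pose is summarized by the component id from [20]:

```python
import numpy as np

def pose_difference(poses):
    """poses: (n,) array of pose component ids (1..13), one per person.
    Returns each person's pose minus the mean pose of everyone else."""
    poses = np.asarray(poses, dtype=float)
    n = len(poses)
    others_mean = (poses.sum() - poses) / (n - 1)  # mean excluding person i
    return poses - others_mean
```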

Face Occlusion. Unimportant people are often occluded by others in the photo. Thus, we extract features to indicate whether a face might be occluded.

DPM face scores: we use the difficulty of being detected as a proxy for occlusion. Specifically, we use the score of each of the thirteen components in the face detection model of [20] as a feature. We also use the score of the dominant component.

Face detection success: this is a binary feature indicating whether the face detection API [15] we used was successful in detecting the face, or whether it required human annotation. The API achieved a nearly zero false-positive rate on our dataset. Thus, this feature serves as a proxy for occlusion, since that is where the API usually failed.

In total, we extracted a 45-dimensional feature vector for every face.

4. Results

For both datasets, we perform cross-validation on the annotated pairs. Specifically, we split the annotated pairs into 20 folds. We train the SVRs on 8 folds, pick hyper-parameters (C in the SVR) on 1 validation fold, and make predictions on 1 test fold. This process is repeated for each test fold, and we report the average across all 20 test folds.

Baselines. We compare our proposed approach to three natural baselines: center, scale, and sharpness, where the person closer to the center, larger, or more in focus (respectively) than the other is considered more important. The center measure we use is the weighted distance from center, which accounts not only for the distance from the center but also for the size of the face. It can be seen from examples that it is quite robust, as it is essentially a combination of two features.

Metrics. We use a pairwise classification accuracy metric – the percentage of pairs where the more important person is correctly identified. This metric focuses on the sign of the predicted difference and not on its magnitude, which is appropriate for some applications. We also use a weighted accuracy measure, where the weight for each pair (p_i, p_j) is the ground-truth importance score of the more important of the two, i.e., max{s_i, s_j}. This metric cares about the 'significantly-more' pairs more than the other pairs. To evaluate regressor quality, we also report the mean squared error from the ground-truth importance difference.
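A sketch of these three metrics follows, under the simplifying assumption that a tied pair counts as correct only when the predicted difference is exactly zero (the paper does not spell out its tie-handling):

```python
import numpy as np

def pairwise_metrics(pred_diff, s_i, s_j):
    """pred_diff: predicted M(p_i, p_j) per pair; s_i, s_j: ground-truth
    importance scores per pair (all 1-D arrays of equal length)."""
    true_diff = s_i - s_j
    correct = np.sign(pred_diff) == np.sign(true_diff)  # picked the right person
    accuracy = correct.mean()

    weights = np.maximum(s_i, s_j)                      # score of the more important
    weighted_accuracy = (weights * correct).sum() / weights.sum()

    mse = np.mean((pred_diff - true_diff) ** 2)         # regressor quality
    return accuracy, weighted_accuracy, mse
```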

Image-Level Importance Results. Table 3 shows the results for the different methods. For the center baseline, we used the weighted distance from center, as it encourages a larger face at the center to be more important than a smaller face at the center. We can see that the best baseline correctly classifies 69.24% of the pairs, whereas our approach achieves 73.42%. Overall, we achieve an improvement of 4.18 percentage points (6.37% relative improvement). The mean squared error is 0.1480.

Method               Accuracy         Weighted Accuracy
Our Approach         73.42 ± 1.80%    92.67 ± 0.89%
Center baseline      69.24 ± 1.76%    89.59 ± 1.11%
Scale baseline       64.95 ± 1.93%    88.51 ± 1.13%
Sharpness baseline   65.31 ± 1.92%    87.50 ± 1.20%

Table 3: Image-Level: performance compared to baselines.

Table 4 shows a breakdown of the accuracies into the three categories of annotations.

Pair-Category        Ours     Baseline   Improvement
significantly-more   94.66%   86.65%     8.01%
slightly-more        78.80%   76.36%     2.44%
almost-same          55.98%   52.96%     3.02%

Table 4: Image-Level: category-wise distribution of correct predictions compared to the Center baseline.

We can see that our approach outperforms the strongest baseline (Center) in every category, and the largest difference occurs in the 'significantly-more' category.

Fig. 3 shows some qualitative results. We can see that individual features such as center, sharpness, scale, and face occlusion help in different cases. In 3(c), the woman in blue is judged to be the most important, presumably because she is a bride. Unfortunately, our approach does not contain any features that can pick up on such social roles.

Corpus-Level Importance Results. Table 5 shows the results for the corpus-level experiments. Interestingly, the strongest baseline in this setting is sharpness, rather than center. This makes sense, since the dataset is derived from professional videos – the important person is more likely to be in focus than the others. Table 6 shows the category breakdown. While the method does extremely well on 'significantly-more' pairs, it is poor on the 'almost-same' category.

Our approach outperforms all baselines, with an improvement of 7.88 percentage points (11.1% relative improvement). The mean squared error is 0.1044.

Method                Accuracy         Weighted Accuracy
Our Approach          78.91 ± 2.03%    93.19 ± 0.69%
Center Baseline       68.46 ± 1.56%    86.15 ± 1.09%
Scale Baseline        67.86 ± 0.61%    85.94 ± 1.00%
Sharpness Baseline    71.03 ± 1.69%    88.63 ± 1.13%

Table 5: Corpus-Level: performance compared to baselines.

Fig. 3 also shows qualitative results for the corpus-level experiments.

Sorting of Photos. One key application of learning importance in the corpus-level setting is enhancing photo collection management and search. When a query to view images of a person comes to a popular social networking site such as Facebook, the returned photos can be sorted by the person's importance in them.

In such an application, the viewer's friendship graph or other social relationship cues can be used to produce different variants of the sorting. As an example, if Person A is a close friend of Person B and Person A looks up photos of Person C, then a photo where C is important and B is also present can be ranked higher than a photo where C is important but surrounded by people A does not know. A bare-bones sketch of the basic ranking appears below.
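In the sketch, `importance_of` is a hypothetical hook into the corpus-level model, with no social-graph reweighting:

```python
def rank_photos_by_importance(photos, person_id, importance_of):
    """Sort a person's photos so those where they are predicted most
    important come first. importance_of(photo, person_id) is assumed
    to return the model's importance score for that person."""
    return sorted(photos, key=lambda p: importance_of(p, person_id), reverse=True)
```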

Pair-Category        Ours     Baseline   Improvement
significantly-more   96.35%   68.33%     28.02%
slightly-more        83.18%   71.82%     11.36%
almost-same          58.36%   69.93%     −11.57%

Table 6: Corpus-Level: category-wise distribution of correct predictions compared to the Sharpness baseline.

5. Importance and Saliency

Now that we know we can effectively predict importance, it is worth investigating how importance compares with visual saliency. At a high level, saliency studies where people look in an image. Eye-gaze tracking systems are often used to track human eye fixations and thus estimate pixel-level saliency maps for an image. Saliency is different from importance, because saliency is controlled by low-level human visual processing, while importance involves understanding more nuanced social and semantic cues. However, as concluded by [3], important objects stand out in an image, so in general they also tend to be salient. So, how much does the salience of a face correlate with the importance of the person?

We study this question via the dataset collected by Jiang et al. [9] to study saliency in group photos and crowded scenes. The dataset contains eye fixation annotations and face bounding boxes. For the purpose of this evaluation, we restricted the dataset to images with a minimum of 3 and a maximum of 7 people, resulting in 103 images. In each image, the absolute salience of a face was calculated as the ratio of the fixation points in the face bounding box to the total number of fixation points in all the face boxes in the image. This results in a ranking of people according to their saliency scores.
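A sketch of this per-face saliency score, assuming fixations and face boxes share the same pixel coordinates (the function name is ours):

```python
import numpy as np

def face_saliency(fixations, boxes):
    """fixations: (m, 2) array of (x, y) fixation points; boxes: list of
    (x, y, w, h) face boxes. Returns each face's fraction of the
    fixation points that fall inside any face box."""
    counts = np.array([
        np.sum((fixations[:, 0] >= x) & (fixations[:, 0] < x + w) &
               (fixations[:, 1] >= y) & (fixations[:, 1] < y + h))
        for (x, y, w, h) in boxes])
    return counts / counts.sum()
```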

We then collected pairwise importance annotations for this dataset on Mechanical Turk, using the same interface as for the Image-Level Importance dataset. Since this dataset is smaller, we annotated all possible face pairs (from the same image). Thus, we can extract a full ranking of persons based on their importance in each image. Human judgment-based pairwise annotations are often inconsistent (e.g., s_i > s_j, s_j > s_k, and s_k > s_i). Thus, we used the Elo rating system to obtain a full ranking.

We measured the correlation between the importance and saliency rankings using Kendall's Tau. The Kendall's Tau was 0.5256. Moreover, the most salient face was also the most important person in 52.56% of the cases.
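The following sketch shows one standard way to turn inconsistent pairwise judgments into a full ranking via Elo updates and then compare the two rankings with Kendall's Tau; the K-factor and initial rating are conventional defaults, not values from the paper.

```python
import numpy as np
from scipy.stats import kendalltau

def elo_ranking(num_people, pairwise_results, k=32.0):
    """pairwise_results: iterable of (i, j, outcome), where outcome is
    1.0 if person i won the comparison, 0.0 if j won, 0.5 for a tie.
    Returns a rating per person; higher rating = more important."""
    ratings = np.full(num_people, 1500.0)
    for i, j, outcome in pairwise_results:
        expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
        ratings[i] += k * (outcome - expected_i)
        ratings[j] += k * ((1.0 - outcome) - (1.0 - expected_i))
    return ratings

# Rank correlation between the two orderings of the same people:
# tau, _ = kendalltau(importance_ratings, saliency_scores)
```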

Fig. 4 shows qualitative examples of people who are judged by humans to be salient but not important, important but not salient, both salient and important, and neither.

Table 7 shows the 'confusion matrix' of saliency vs. importance, broken down over the three strength categories.


Figure 3: Some Results: (a)(b)(c)(d) for Image-Level Prediction and (e)(f) for Corpus-Level Prediction

Figure 4: Examples showing the relationship between visual saliency and person importance

                     Importance
Saliency             significantly-more   slightly-more   about-same
significantly-more   38.33%               38.33%          23.33%
slightly-more        22.66%               32.81%          44.53%
about-same           3.82%                19.51%          76.67%

Table 7: Distribution of saliency pair categories among importance pair categories.

It can be seen that most face-pairs that are 'about-same' in importance are also 'about-same' in saliency, whereas the other two categories do not agree as much – given a pair (p_i, p_j), person p_i may be more salient than p_j but less important, and vice versa.

6. Application: Importance for Improved Im2Text

We now show that our importance estimation can help improve im2text approaches in producing more human-like descriptions of images.

Sentence generation algorithms [10, 13] often approach the problem by first predicting attributes, actions, and other relevant information for every person in an image. These predictions are then combined to produce a description of the photo. In group photos or crowded scenes, such an algorithm would identify several people in the image and may end up producing overly lengthy, rambling descriptions. If the relative importance of the people in the photo is known, the algorithm can focus on the most important people, and the rest can be either weighted less or ignored as appropriate. How beneficial can predicting the importance of a person be in such cases? This experiment addresses this question quantitatively.
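The pruning step itself is simple once importance scores are available; a minimal sketch follows (function and parameter names are illustrative):

```python
def describe_image(people, descriptions, importance_scores):
    """people: list of person ids in the image; descriptions: person id ->
    1-sentence description; importance_scores: person id -> predicted
    importance. Returns the description of the most important person."""
    most_important = max(people, key=lambda p: importance_scores[p])
    return descriptions[most_important]
```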

Setup. Our test dataset for this experiment is a set of 50 randomly selected images from the Image-Level Dataset. The train set comprises the remaining 150 images. Since implementations of im2text methods are not available online, we simulated them in the following way.

Figure 5: Qualitative results for the description-pruning experiment

First, we collected 1-sentence descriptions for every individual in the test set on Mechanical Turk. The annotation interface for these tasks asked Turkers to describe only the individual in question, and is available in the supplement.

Prediction. We trained the importance model on the 150 train images and made predictions on the test set. We use the predicted importance to find the most important person in the image, according to our approach. Similarly, we compute the most important person according to the Center baseline. We also pick a random person as a baseline. For each method, we chose the 1-sentence description corresponding to its predicted most-important person. We performed forced-choice tests on Mechanical Turk with these descriptions, asking Turkers to evaluate which description was better for the image.

Results. The methods were evaluated by how often their descriptions 'won', i.e., were ranked first in these forced-choice tests. The results (Table 8) show that reasoning about the importance of people in an image helps in selecting the best description. The ground truth of the Image-Level Dataset is used as an Oracle that picks the sentence corresponding to the single most important person in the image. This provides an upper bound (71.43%) on how well we can hope to do if we are describing an image with a single sentence about a person in it. While this experiment was done with single sentences, a similar approach can be used to obtain multiple sentences and get better descriptions.

Method         Accuracy
Our Approach   57.14%
Center         48.98%
Random         22.45%
Oracle         71.43%

Table 8: Better selection of sentences.

7. Conclusions

To summarize, we studied the problem of automatically predicting the importance of people in group photos, using a variety of features that capture the pose and arrangement of the people. We formulated two versions of this problem: (1) given a single image, ordering the people in it by their relative importance, and (2) given a corpus of images of a person, ordering the images by the importance of that person. We collected two importance datasets that served to evaluate our approach, and will be broadly useful to others in the community studying related problems. We compared and related importance to previous work in visual saliency, and showed that while correlated, saliency is not the same as importance. People in photos may be salient but not important, important but not salient, both, or neither. Finally, we showed that not only can we successfully predict the importance of people from purely visual cues, but that incorporating this predicted importance results in significant improvement in applications such as im2text.

Compiling a larger dataset for image-level importance prediction, using richer attributes such as gender and age, and incorporating social relationships are the next steps in this line of work.


References

[1] A. C. Berg, T. L. Berg, and H. Daumé III. Understanding and predicting importance in images. In CVPR, 2012.
[2] A. Borji, D. N. Sihite, and L. Itti. Quantifying the relative influence of photographer bias and viewing strategy on scene viewing. Journal of Vision, 11(11):166, 2011.
[3] L. Elazary and L. Itti. Interesting objects are visually salient. Journal of Vision, 8(3):1–8, 2008.
[4] A. Gallagher and T. Chen. Finding rows of people in group images. In ICME, pages 602–605, 2009.
[5] A. Gallagher and T. Chen. Understanding images of groups of people. In CVPR, 2009.
[6] A. C. Gallagher and T. Chen. Using group prior to identify people in consumer images. In CVPR, 2007.
[7] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In NIPS, pages 545–552, 2006.
[8] S. J. Hwang and K. Grauman. Learning the relative importance of objects from tagged images for retrieval and cross-modal search. International Journal of Computer Vision, 100(2):134–153, 2012.
[9] M. Jiang, J. Xu, and Q. Zhao. Saliency in crowd. In ECCV, 2014.
[10] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 2891–2903, 2013.
[11] C. Li, A. C. Loui, and T. Chen. Towards aesthetics: A photo quality assessment and photo selection system. In ACM Multimedia, 2010.
[12] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):353–367, 2011.
[13] A. Sadovnik, A. Gallagher, and T. Chen. Not everybody's special: Using neighbors in referring expressions with uncertain attributes. In CVPR Workshop on Language for Vision (V&L Net), 2013.
[14] H. Shu, A. C. Gallagher, H. Chen, and T. Chen. Face-graph matching for classifying groups of people. In ICIP, pages 2425–2429, 2013.
[15] SkyBiometry. https://www.skybiometry.com/.
[16] M. Spain and P. Perona. Measuring and predicting object importance. International Journal of Computer Vision, 2010.
[17] M. Tapaswi, M. Bäuml, and R. Stiefelhagen. "Knock! Knock! Who is it?" Probabilistic person identification in TV series. In CVPR, 2012.
[18] B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14), 2007.
[19] P. Tseng, R. Carmi, I. G. M. Cameron, D. P. Munoz, and L. Itti. The impact of content-independent mechanisms on guiding attention. In Proc. Vision Science Society Annual Meeting (VSS), 2007.
[20] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In CVPR, 2012.