
Name-It: Naming and Detecting Faces in News Videos

Shin’ichi Satoh, National Center for Science Information Systems
Yuichi Nakamura, University of Tsukuba
Takeo Kanade, Carnegie Mellon University

We developed Name-It, a system that associates faces and names in news videos. It processes information from the videos and can infer possible name candidates for a given face or locate a face in news videos by name. To accomplish this task, the system takes a multimodal video analysis approach: face sequence extraction and similarity evaluation from videos, name extraction from transcripts, and video-caption recognition.

The Name-It system1,2 associates names and faces in news videos. Assume that we're watching a TV news program. When persons we don't know appear in the news video, we can eventually identify most of them by watching only the video. To do this, we detect faces from a news video, locate names in the sound track, and then associate each face to the correct name. For face-name association, we use as many hints as possible based on structure, context, and meaning of the news video. We don't need any additional knowledge such as newspapers containing descriptions of the persons or biographical dictionaries with pictures. Similarly, Name-It can associate faces in news videos with their right names without using an a priori face-name association set. In other words, Name-It extracts face-name correspondences only from news videos.

Name-It takes a multimodal approach to accomplish this task. Specifically, it uses several information sources available from news videos—image sequences, transcripts, and video captions. Name-It detects face sequences from image sequences and extracts name candidates from transcripts. It's possible to obtain transcripts from audio tracks by using the proper speech recognition technique with an allowance for recognition errors. However, most news broadcasts in the US already have closed captions. (In the near future, the worldwide trend will be for broadcasts to feature closed captions.) Thus we use closed-caption texts as transcripts for news videos. In addition, we employ video-caption detection and recognition. We used "CNN Headline News" as our primary source of news for our experiments.

Given image sequences, transcripts, and video captions as information sources, Name-It associates extracted faces with extracted name candidates using the correlation of their timing information and face similarity information. Video captions are also taken into account as supplementary information. To associate faces and names, Name-It integrates several advanced image-processing and natural-language-processing techniques—face sequence extraction and similarity evaluation from videos, name extraction from transcripts, and video-caption recognition. Although these technologies aren't always highly accurate, integrating their results helps the system achieve more accurate output.

With respect to face-name association, the Piction system3 works similarly to Name-It. Piction identifies faces within a given captioned newspaper photograph by extracting faces from the photograph and analyzing the caption to obtain geometric constraints among faces. The system then labels each face with a name. A drawback of Piction is that face location information is assumed to be described in captions—for example, "top row, from left, are Michael, Brian ...." On the other hand, Name-It doesn't assume such a description. Instead, while Piction deals with one photograph and caption at a time, Name-It processes many videos—including many news topics—to collect a fraction of a hint from each video fragment to infer face-name association. In doing this, Name-It uses face similarity while Piction doesn't.

To realize Name-It, further analysis of video semantic content proves necessary. State-of-the-art video analysis technologies automatically extract video structure information.4,5 Typically, once a video is given as a target, it's decomposed into segments or shots. These shots are then classified based on the video's structure. This process employs several techniques, such as cut detection, color histogram calculation and comparison, camera motion analysis, motion segmentation, and so on. Among them, cut detection and color histogram calculation and comparison are incorporated into Name-It to provide hints for video content analysis.

Since Name-It primarily handles face information, face detection and face similarity evaluation play essential roles. Much research has targeted face detection and matching (for an extensive survey, see Chellappa, Wilson, and Sirohey6). It's noteworthy to contrast face identification and Name-It. In face identification, a face (category) set for comparison with given faces is given a priori. Although Name-It primarily associates faces and names in videos, it automatically generates a face-name association set from given videos that may even be used for face identification.

By providing face-name association, Name-It performs "individual detection" rather than mere "face detection," because associated faces and names correspond to certain individuals who are of interest in news video topics. As a result, Name-It enables several potential applications (see Figure 1), including

❚ a news video viewer that interactively provides a personal description of the displayed face,

❚ a news text browser that gives facial information in response to names, and

❚ an automated video annotation generator for faces.

Figure 1. Potential applications of Name-It.

Overview of Name-It
Typical news video consists of several topics, each having a corresponding person or persons of interest. Figure 2 shows the typical structure of a news topic in which US President Bill Clinton is the person of interest.

Figure 2. Typical composition of a news topic.

We set the primary goal of Name-It as associating faces and names of persons of interest in news video topics. To achieve this goal, we employ the process shown in Figure 3. Name-It must extract faces from image sequences and names from transcripts—both the faces and names correspond to persons of interest. However, these tasks are hard to accomplish. Faces of persons of interest tend to appear under several conditions such as frontal views or close-ups, or they're on screen for a long duration. But faces that meet these conditions don't always correspond to persons of interest. That is, there's no perfect method to extract faces of persons of interest by image-sequence analysis alone. Meanwhile, extracting names of persons of interest requires an in-depth semantic analysis of the transcript. Several studies reported at the Message Understanding Conference7 achieved sufficient accuracy in selecting all names from text. However, selecting names of only persons of interest proved a much harder problem. Therefore, we extract faces and names likely to correspond to persons of interest. The system employs face detection and tracking to extract face sequences and natural-language processing techniques using a dictionary, thesaurus, and parser to locate names in transcripts.

Figure 3. The Name-It process. First, Name-It extracts faces from video and names from transcripts. Then it associates the names and faces.

Given extracted faces and names, Name-It associates those that correspond to one another. Since transcripts don't necessarily give explanations of videos, no straightforward method exists for associating faces in videos and names in transcripts. However, by observing the typical news video composition given in Figure 2, we can assume that a corresponding face and name are likely to coincide and may be an associated face-name pair. However, potential difficulties exist in associating faces and names: the lack of necessary faces or names and possible multiple correspondences of faces and names. For example, even if the system successfully extracts a person of interest's face in a topic, it might not find the correct name that coincides with that face.

As an example of multiple correspondence, assume that topic A is about Clinton and former US Senator Robert Dole, and topic B is about Clinton and former House Speaker Newt Gingrich. The system can't decide whether Clinton's face shown in topic A corresponds to name "Clinton" or "Dole" or whether Clinton's face shown in topic B corresponds to name "Clinton" or "Gingrich." To compensate for this drawback, Name-It gives priority to a face-name pair that coincides more frequently and outputs the pair as an associated face-name pair. Obviously, face similarity is required to evaluate face-name association (for example, to match the faces in topic A and topic B). Thus the system regards these faces as identical and can infer that the face coincides in more topics with the name "Clinton" than with others ("Dole" or "Gingrich"). Evaluating face similarity may also resolve the problem of lack of faces or names. Even if a face doesn't coincide with the correct name, we expect other faces identical to this face in other topics to coincide with the correct name.

The system also employs video-caption recognition to obtain face-name association. Video captions are superimposed text on video frames, therefore representing literal information. The captions are directly attached to image sequences and give an explanation of the video. In many cases, a video caption is attached to a face and usually represents a person's name. Thus, video-caption recognition provides rich information for face-name association. However, because video captions don't necessarily appear for all faces of persons of interest, Name-It uses the video captions as supplements to the transcripts. For example, some faces appear like persons of interest in a news program, but aren't mentioned in the transcripts. Instead, their names often show up in the video captions. To achieve video-caption recognition, we use text detection and character-recognition techniques (see Figure 3).

Finally, results obtained by these techniques should be integrated to provide face-name association. As a unified measurement integrating multimodal analysis, we use a co-occurrence factor, which represents a likelihood factor that a face and a name correspond to each other. This integration should give better face-name association results, even though the results of analyses are imperfect. As shown in Figure 3, to compensate for the problems of the lack of faces or names and multiple correspondence of faces and names, the system employs face similarity to evaluate the co-occurrence factor. Since character recognition for video captions can't be perfect because of the poor quality of video images, it's compensated for by an inexact string match method. As a result, although each analysis may not discriminate faces or names of persons of interest in topics, association results may eventually correspond to face-name pairs of persons of interest in topics.

Face information extraction
Here we describe extraction of those faces that might correspond to persons of interest in topics. We first employ face detection and tracking to detect face sequences in videos. Then, using an eigenface-based method, we evaluate face similarity. To enhance the face similarity evaluation, we select the most frontal view of a detected face sequence and use the selected face for the eigenface method. Finally, given videos as input, the system outputs a two-tuple list: timing information (start~end frame) and face identification information.

Face tracking
Face tracking consists of three components—face detection, skin-color model extraction, and skin-color region tracking (see Figure 4). We describe the face tracking components in the following subsections.

Figure 4. The face-tracking process, which involves face detection, skin-color model extraction, and skin-color region tracking.

Face detection. First, Name-It applies face detection to every frame within a certain interval of frames. This interval should be small enough so that the detector doesn't miss important face sequences, yet large enough to ensure reasonable processing time. Optimally, we apply the face detector at intervals of 10 frames. The neural network-based face detector8 employs a neural network arbitration method and bootstrap algorithm to detect mostly frontal faces of various sizes and at various locations. The system outputs the detected face as a rectangular region that includes most of the skin, but excludes the hair and the background. The face detector can also detect eyes. To ensure that the faces are frontal and close up, we use only faces in which eyes are detected successfully. A detected face is tracked bidirectionally in time to obtain a face sequence.

Skin-color model extraction and tracking. Once the system detects a face, it extracts the skin-color model. In several cases, researchers have used the Gaussian model in (r, g) space (r = R/(R + G + B), g = G/(R + G + B)) as a general skin-color model for face tracking.9,10 However, for our research we used the Gaussian model in (R, G, B) space because this model is more sensitive to the skin color's brightness, and thus much more suitable for the model tailored for each face sequence.

Let F be the detected face region and I(x, y) be color intensities [R G B]^t at (x, y). A skin-color model consists of a covariance matrix C, a mean M, and a distance d:

$$M = \frac{1}{N} \sum_{(x,y) \in F} I(x, y) \quad (1)$$

$$C = \frac{1}{N} \sum_{(x,y) \in F} \bigl(I(x, y) - M\bigr)\bigl(I(x, y) - M\bigr)^t \quad (2)$$

where N is the number of pixels in F. We used a constant for d. The system extracts a model for each detected face and uses it to extract skin candidate pixels in the subsequent frames. (A pixel I(x, y) is a skin candidate pixel if $(I(x,y) - M)^t\, C^{-1}\, (I(x,y) - M) < d^2$.) Then the system composes a binary image of the skin candidate pixels. It applies noise reduction with region enlarging or shrinking and contour tracing of regions to obtain skin candidate regions. The overlap between each of these regions and each of the face regions of the previous frame is evaluated to decide whether one of the skin candidate regions is the succeeding face region. In addition, the system applies the scene-change detection method based on the subregion color histogram matching.11 Face-region tracking continues until the system encounters a scene change or until it can't find a succeeding face region.
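For concreteness, here is a minimal sketch in Python (with NumPy) of Equations 1 and 2 and the skin-candidate test. The array layout and the value of the constant d are illustrative assumptions, not the article's implementation.

```python
import numpy as np

def skin_color_model(frame_rgb, face_mask):
    """Build the per-face Gaussian skin-color model of Equations 1 and 2.

    frame_rgb: H x W x 3 array of RGB intensities.
    face_mask: H x W boolean array marking the detected face region F.
    Returns the mean M (3,) and covariance C (3, 3) over the face pixels.
    """
    pixels = frame_rgb[face_mask].astype(np.float64)   # N x 3 samples I(x, y)
    M = pixels.mean(axis=0)                            # Equation 1
    diff = pixels - M
    C = (diff.T @ diff) / len(pixels)                  # Equation 2
    return M, C

def skin_candidates(frame_rgb, M, C, d=2.5):
    """Mark pixels whose Mahalanobis distance to the model is below d.

    The threshold d is described as a constant in the article; 2.5 here is
    only a placeholder value for illustration.
    """
    h, w, _ = frame_rgb.shape
    diff = frame_rgb.reshape(-1, 3).astype(np.float64) - M
    C_inv = np.linalg.inv(C)
    # (I - M)^t C^{-1} (I - M) < d^2 for every pixel
    maha_sq = np.einsum('ij,jk,ik->i', diff, C_inv, diff)
    return (maha_sq < d * d).reshape(h, w)
```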

Face similarity evaluation
To evaluate face similarity, we employ a face similarity measurement based on the eigenface method. Since this method is very sensitive to face orientation, we prefer using frontal faces for evaluation. However, faces detected by the method described above aren't necessarily frontal enough. On the other hand, since we have face sequences, we can choose any face from the sequence. Therefore, we first select the best frontal view of a face—that is, the most frontal face from each face sequence. Then we apply the eigenface method to the selected faces for similarity evaluation of face sequences.

The most frontal face selection. To choose the most frontal face from all detected faces, the system first applies a face-skin region clustering method. For each detected face, cheek regions—which we presume have skin color—are located by using the eye locations the face detector obtained. Using the cheek regions as initial samples, the system employs the region growing method in the (R, G, B, x, y) space to obtain the face-skin region. We assume a Gaussian model in (R, G, B, x, y) space; (R, G, B) contributes by making the region have skin color, and (x, y) contributes by keeping the region almost circular. Then, the system calculates the face-skin region's center of gravity (xf, yf). Let the locations of the right and left eyes of the face be (xr, yr) and (xl, yl), respectively. We assume that the most frontal face has the smallest difference between xf and (xl + xr)/2 and the smallest difference between yl and yr. To evaluate these conditions, we calculate the frontal factor Fr for every detected face:

$$w_f = \frac{5}{3}\,(x_l - x_r) \quad (3)$$

$$F_r = \left(1 - \frac{\left|\,x_f - \frac{x_l + x_r}{2}\right|}{w_f}\right) + \frac{1}{2}\left(1 - \frac{|y_l - y_r|}{w_f}\right) \quad (4)$$

where wf is the normalized face-region size. This size is determined so that a square wf × wf overlaps with the eyes, nose, and mouth, but barely overlaps with the background in most cases. The factor for an ideal frontal face is 1.5. The first term of Equation 4 becomes 1 iff xf equals (xl + xr)/2, less than 1 otherwise. The second term of Equation 4 becomes 1/2 iff yl equals yr, less than 1/2 otherwise. The system chooses the face having the largest Fr as the most frontal face of the face sequence. Figure 5 shows example faces, extracted face-skin regions, and frontal factors.

Figure 5. Frontal face selections showing example faces, extracted face-skin regions, and frontal factors.
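As a concrete reading of Equations 3 and 4, the following sketch computes the frontal factor from the eye locations and the face-skin centroid. The variable names and the dictionary layout for a detected face are assumptions made for illustration.

```python
def frontal_factor(x_f, y_f, x_l, y_l, x_r, y_r):
    """Frontal factor Fr of Equations 3 and 4.

    (x_f, y_f): center of gravity of the face-skin region.
    (x_l, y_l), (x_r, y_r): left and right eye locations.
    An ideal frontal face reaches the maximum value of 1.5.
    """
    w_f = 5.0 / 3.0 * abs(x_l - x_r)                       # Equation 3
    horizontal = 1.0 - abs(x_f - (x_l + x_r) / 2.0) / w_f  # 1 iff the centroid is centered
    vertical = 0.5 * (1.0 - abs(y_l - y_r) / w_f)          # 1/2 iff the eyes are level
    return horizontal + vertical                           # Equation 4

def most_frontal(faces):
    """Pick the face with the largest Fr from one face sequence.

    faces: iterable of dicts with 'centroid', 'left_eye', 'right_eye'
    entries, each an (x, y) pair (an assumed structure).
    """
    def fr(face):
        (x_f, y_f) = face['centroid']
        (x_l, y_l) = face['left_eye']
        (x_r, y_r) = face['right_eye']
        return frontal_factor(x_f, y_f, x_l, y_l, x_r, y_r)
    return max(faces, key=fr)
```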

Eigenface-based face similarity evaluation. To evaluate face similarity, we employ the eigenface-based method.12 Although it doesn't necessarily achieve the best performance, we chose this method because it's less restrictive to input face images (that is, it doesn't require face features detection, such as eyes, nose, mouth corners, and so on). Plus, it's compatible with face-similarity values. Each of the most frontal faces is normalized into a 64 × 64 image using the eye positions, then converted to a point in the 16-dimensional eigenface space. As we'll describe later, we processed five hours of news videos, and the system extracted 556 face sequences. To train eigenfaces, we used all 556 faces (that is, the training set equals the evaluation set). Since we fixed the video corpus to five-hour news videos, we can still take this approach. However, to incrementally process news videos, for example, we need to fix the training face set in advance and apply trained eigenfaces to faces other than the training set. Face similarity can be evaluated as the face distance—the Euclidean distance between two corresponding points in the eigenface space. Let df(Fi, Fj) be the face distance of faces Fi and Fj. We define the similarity Sf(Fi, Fj) as

$$S_f(F_i, F_j) = e^{-\frac{d_f(F_i, F_j)^2}{2\sigma_f^2}} \quad (5)$$

where σf is a standard deviation of the Gaussian filter in the eigenface space. The range of similarity is from 0 to 1, where similarity of the same face is 1.
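Equation 5 in code, assuming faces have already been projected to 16-dimensional eigenface coordinates (for example with a PCA library); σf = 1,000 follows the value chosen in the evaluation below.

```python
import numpy as np

SIGMA_F = 1000.0   # standard deviation of the Gaussian filter in eigenface space

def face_distance(p_i, p_j):
    """Euclidean distance d_f between two points in the eigenface space."""
    return float(np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)))

def face_similarity(p_i, p_j, sigma_f=SIGMA_F):
    """Equation 5: similarity in (0, 1], equal to 1 for identical faces."""
    d = face_distance(p_i, p_j)
    return float(np.exp(-(d * d) / (2.0 * sigma_f * sigma_f)))
```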

Evaluation
Figure 6 shows the start and end frames of face sequences and the selected frontal face frames using the face extraction method. In Figure 6a, although the faces appearing in the start and end frames aren't frontal, the system successfully selected the frontal face. Figure 6b shows that the system can handle face sequences having scene changes by using special effects (such as wiping) in the sequence's start and end frames. A 30-minute video takes roughly 30 hours to process on a Silicon Graphics 200-MHz R4400 workstation. Most of the time goes to face detection.

To evaluate face-sequence detection, we examined the face-sequence extraction results of a half-hour news video. The system extracted 65 face sequences and missed four sequences due to face-detection failure (two cases had specular reflection on eye glasses and two cases had shade on faces). The system output one nonface sequence due to a face-detection error and two sequences composed of two face sequences merged into one sequence. In one case, the system failed to detect the scene change because the scene dissolved between two face sequences. In another case, the system failed to track the face because the video segment was monochrome. This video was one of the most difficult for face-sequence extraction, yet the system extracted more than 90 percent of actual face sequences with only one false extraction.

To examine face-similarity evaluation results, we manually named each face sequence. Among 556 face sequences taken from five hours of news videos, we manually named 308 sequences and left 248 unknown. Then we examined every pair of face sequences to obtain distances df(Fi, Fj). We labeled pairs whose distance fell below a certain threshold (df(Fi, Fj) < θ) as expected identical pairs. On the other hand, we called pairs having the same name real identical pairs. To evaluate the appropriateness of face distance, we used a precision-recall graph comparing expected and real identical pairs while varying threshold θ (see Figure 7). Although precision is 99 percent and recall is 14 percent for θ = 1,000, if we increase θ, precision decreases very rapidly. To preserve precision as high as possible, we used 1,000 for σf in Equation 5. The graph depicts that the defined face distance doesn't achieve good separation between identical and nonidentical face pairs. However, we will show in our experimental results that this face similarity evaluation method still works fairly well when integrated into the Name-It system.
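The precision-recall curve of Figure 7 can be reproduced from the pairwise distances with a sketch like the following; the input format (a list of (distance, same-person) pairs) is an assumption.

```python
def precision_recall(pairs, theta):
    """Precision and recall of "expected identical" pairs at threshold theta.

    pairs: list of (d_f, is_real_identical), where d_f is the face distance
    of a face-sequence pair and is_real_identical is True when both
    sequences carry the same manually assigned name.
    """
    expected = [(d, same) for d, same in pairs if d < theta]
    real_total = sum(1 for _, same in pairs if same)
    true_pos = sum(1 for _, same in expected if same)
    precision = true_pos / len(expected) if expected else 1.0
    recall = true_pos / real_total if real_total else 0.0
    return precision, recall

# Sweep the threshold as in Figure 7, for example theta = 500, 1000, 1500, 2000.
```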

Figure 6. Face extraction results. (a, c, and d) Successful selection of a frontal face. (b) Even with scene changes using special effects, the system can handle face sequences.

Figure 7. Precision-recall graph of an identical face-pair selection (the marked operating points correspond to θ = 1,000, 1,500, and 2,000).

Name information extraction
Here we describe the extraction of names that might correspond to persons of interest in topics. The system uses advanced natural-language processing to extract name candidates from transcripts. We'll also describe how the name candidate extraction uses lexical and grammatical analysis and knowledge of a topic's structure in news videos. The system outputs a three-tuple list: a name candidate, timing information, and a score representing the likelihood of that name belonging to a person of interest.

Typical structure of news videos
The highest component in news video is an individual topic. Each topic contains one or more paragraphs, which roughly correspond to scenes. In closed-caption texts of "CNN Headline News," the components can be easily distinguished—a topic is preceded by >>> and a paragraph by >> (see Figure 8). We use this literal information to discriminate between an anchor paragraph and a live or file video shot from videos. For other news programs, we could use news video parsing techniques.4,5
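Because topics and paragraphs are marked literally in the closed captions, splitting a transcript can be done with a sketch like the one below; the single-string input format is an assumption, not the article's actual pipeline.

```python
def split_topics(closed_caption_text):
    """Split closed-caption text into topics and paragraphs.

    In "CNN Headline News" closed captions, '>>>' precedes a topic and
    '>>' precedes a paragraph, so each topic becomes a list of paragraphs.
    """
    topics = []
    for topic_text in closed_caption_text.split('>>>'):
        topic_text = topic_text.strip()
        if not topic_text:
            continue
        paragraphs = [p.strip() for p in topic_text.split('>>') if p.strip()]
        topics.append(paragraphs)
    return topics
```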

An anchor paragraph typically appears at the beginning of the topic, in which an anchor person gives an overview of the topic. Live or file video paragraphs—actual videos related to the topic or speeches by a person of interest—typically appear after an anchor paragraph. A live or file video paragraph, especially one containing close-up scenes of a person of interest, proves quite useful for Name-It. In some cases, such a paragraph includes the narrator's explanations of the person in the close-up. Since the face coincides with its name in corresponding transcripts, Name-It simply evaluates the coincidence of extracted name candidates with face sequences in order to obtain face-name association. However, in other cases, the live or file video paragraph consists of the speech of the person in the close-up. Since the person rarely mentions his or her own name in the speech, corresponding transcripts may not contain the desired name. This situation requires extra care. In such cases, the system offsets time lags in name information. (We provide detailed descriptions in the following sections.) Finally, each name candidate is output with the score that represents the likelihood that the name corresponds to a person of interest.

Conditions of name candidates
Each name candidate should satisfy some of the following conditions:

1. The candidate should be a noun that represents a person's name or that describes a person (president, fireman, and so on).

2. The candidate should preferably be an agent of an act—especially an act of speech, attendance at a meeting, or a visit. For example, a speaker is usually centered in the speech scene, while other people aren't always shown in videos even if they're mentioned.

3. The candidate tends to be mentioned earlier than others in the topic in transcripts. (In a news video, important information that might have corresponding images is usually mentioned earlier rather than later.)

4. The candidate tends to be mentioned just before a live video is shown. The person appearing in a live video rarely mentions his or her own name. Instead, just before the live video, an anchor person tends to introduce the candidate (see Figure 8).

The system evaluates these conditions for each word in the transcripts by using a dictionary (the Oxford Advanced Learner's Dictionary of Current English13), a thesaurus (WordNet14), and a parser (Link Parser15). Then the system outputs the three-tuple list: a word, timing information (frame), and a normalized score reflecting the above conditions.

Figure 8. An anchor person shot with accompanying anchor paragraph and a live video feed with its corresponding paragraph. Positional scores for the live video are shown on the right-hand side.

Score calculation
Referring to the dictionaries and parsing results, the system calculates the score for each word in the transcripts. The score is normalized so that a score close to 1 corresponds to a word that most likely corresponds to a person of interest. We define the score calculation as follows:

❚ Grammatical score: After consulting the dictionary, the system gives 1 to proper nouns, 0.8 to common nouns, and 0 to other words. By consulting the parsing results, the system gives 1 to nouns and 0.5 to other words. If the system fails to parse, it gives 0.5 to all words. The net grammatical score equals the product of the two scores.

❚ Lexical score: After consulting the thesaurus, the system gives 1 to persons, 0.8 to social groups, and 0.3 to other words.

❚ Situational score: The act corresponding to the word is represented by the verb in the sentence that includes the word. After consulting the thesaurus, the system gives 1 to speech, 0.8 to attendance at meetings, and 0.3 otherwise.

❚ Positional score: The system gives 1 to words that appear in the first sentence of a topic, 0.5 to words that appear in the last sentence of a paragraph, and a linearly interpolated score to other words according to the position of the sentence in which the word appears. For live or file video paragraphs, the system also outputs the same tuples as those of the paragraph that appears before the live or file video paragraphs (possibly the anchor paragraph), replacing the timing information with that of the live or file video (see Figure 8). In addition, the system replaces the positional score according to the position of the sentence in the anchor paragraph: 1 for the sentence just before the live video, 0.5 for the first sentence of the topic, and a linearly interpolated score otherwise.

Finally, the system calculates the net score as the product of all four scores. The execution time for a 30-minute news video is approximately 1.5 hours on an SGI 200-MHz R4400 workstation. Most of that time goes to parsing. We determined several parameters used in score calculation empirically. Although our impression is that the face-name association results aren't very sensitive to these parameters, there's still room for an in-depth study of score calculation for name candidates.
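The scoring rules above reduce to a handful of lookups multiplied together. The following sketch mirrors them, with the dictionary, thesaurus, and parser lookups passed in as precomputed boolean flags (an assumption made to keep the sketch self-contained).

```python
def grammatical_score(is_proper_noun, is_common_noun, parsed_as_noun, parse_failed):
    """Dictionary part times parser part, as described above."""
    dict_part = 1.0 if is_proper_noun else 0.8 if is_common_noun else 0.0
    parse_part = 0.5 if parse_failed else (1.0 if parsed_as_noun else 0.5)
    return dict_part * parse_part

def lexical_score(is_person, is_social_group):
    """Thesaurus category of the word itself."""
    return 1.0 if is_person else 0.8 if is_social_group else 0.3

def situational_score(verb_is_speech, verb_is_meeting):
    """Thesaurus category of the governing verb in the sentence."""
    return 1.0 if verb_is_speech else 0.8 if verb_is_meeting else 0.3

def net_score(grammatical, lexical, situational, positional):
    """Net likelihood that a word names a person of interest."""
    return grammatical * lexical * situational * positional
```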

Evaluation
We examined one 30-minute news video and manually extracted 105 name words from a transcript containing 3,462 words. While the system automatically extracted 752 words as name candidates, only 94 of them were correct (it missed 9, and 658 were false alarms; that is, precision was 13 percent and recall 91 percent). This excessive name-candidate extraction resulted from the system extracting words that were proper nouns or nouns used as agents, in order not to miss any "name." But even with this poor name-candidate extraction, the overall system achieved good performance in face-name association, as we'll show in the experimental results section.

Video-caption recognition
Attached directly to image sequences, video captions provide text information. In many cases, they're attached to faces and usually represent a person's name. Thus, video-caption recognition provides rich information for face-name association, although not necessarily attached to all faces of persons of interest. We briefly describe video-caption recognition in this section. (See Sato et al.16 for further information.)

Figure 9 shows a typical frame with video captions. Since we use "CNN Headline News" for target news videos, captions appear in bright color, superimposed directly onto the background images. To achieve video-caption recognition, the system first detects text regions from video frames. Several filters, including differential filters and smoothing filters, help achieve this task. Clusters with bounding regions that satisfy several size constraints are selected as text regions. The detected text regions are preprocessed to enhance video-caption image quality. First, the system applies the filter that minimizes intensities among frames. This filter suppresses complicated and moving backgrounds, yet enhances characters because they're placed at the exact position for a sequence of frames. Next, the system applies the linear interpolation filter to quadruple the resolution. Then it applies template-based character recognition. The current system can recognize only uppercase letters, but it has achieved a 76 percent character-recognition rate.

Figure 9. A typical video caption.

Since character recognition results aren't perfect, inexact matching between the results and character strings is essential for face-name association. To cope with this problem, we extended the edit-distance method.17 Assume that C is the character-recognition result and N is a word. Define the distance dc(C, N) to represent how much C differs from N. When C and N are the same, the distance is 0, whereas when C and N differ (that is, C and N don't share any characters), the distance is 1. The system calculates the distance by using a dynamic programming algorithm.
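A minimal sketch of a normalized edit distance dc(C, N) in [0, 1] computed by dynamic programming; the article's actual extension of the edit-distance method may differ, so the normalization by the longer string length used here is an assumption.

```python
def caption_name_distance(caption, name):
    """Normalized edit distance: 0 for identical strings, 1 for no match."""
    c, n = caption.upper(), name.upper()   # the video OCR emits uppercase only
    if not c or not n:
        return 1.0
    # Standard Levenshtein dynamic program, row by row.
    prev = list(range(len(n) + 1))
    for i, ch_c in enumerate(c, start=1):
        cur = [i]
        for j, ch_n in enumerate(n, start=1):
            cost = 0 if ch_c == ch_n else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / max(len(c), len(n))
```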

Face-name information association
Here we'll describe the algorithm and co-occurrence factor that work together to associate face-name information. The system calculates the co-occurrence factor taking into account analysis results of face-sequence extraction, face matching, name-candidate extraction, and video-caption recognition. The inaccuracy of each technology is compensated for in this process.

Algorithm
In this section, we describe the algorithm for retrieving face candidates by a given name. We use the co-occurrence factors that take advantage of face extraction and similarity evaluation, name extraction, and video-caption recognition. Let N and F be a name and a face, respectively. The co-occurrence factor C(N, F) measures the degree to which face F matches name N. Think of the names Na, Nb, ..., and the faces Fp, Fq, ..., where Na corresponds to Fp. Then C(Na, Fp) should have the largest value among co-occurrence factors of any combination of Na and the other faces (such as C(Na, Fq) and so on) or of the other names and Fp (such as C(Nb, Fp) and so on). To retrieve face candidates by a given name or name candidates by a given face, we use the co-occurrence factor as follows (a short sketch in code appears after the steps):

1. Calculate the co-occurrences of combinations of all face candidates with a given name or vice versa (name candidates with a given face).

2. Sort the co-occurrences.

3. Output faces (or names) that correspond to the N largest co-occurrences.
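A minimal sketch of this retrieval loop; co_occurrence is assumed to be a callable implementing Equation 9, which is defined below.

```python
def retrieve_faces(name, faces, co_occurrence, top_n=5):
    """Name-to-face retrieval: rank all faces by co-occurrence with a name."""
    scored = [(co_occurrence(name, face), face) for face in faces]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def retrieve_names(face, names, co_occurrence, top_n=5):
    """Face-to-name retrieval: rank all name candidates for a given face."""
    scored = [(co_occurrence(name, face), name) for name in names]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```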

Co-occurrence factor
Here we define the co-occurrence factor C(N, F) of a face F and a name N. Extracted face sequences are obtained as a two-tuple list (timing, face identification): {(tFi, Fi)} = {(tF1, F1), (tF2, F2), ...}, where tFi = tFi^start ~ tFi^end. We can define the duration of a face sequence by the function dur(tFi) = tFi^end − tFi^start. Name extraction results are given as a three-tuple list (word, timing, score):

$$\{(N_j, t_k^{N_j}, s_k^{N_j})\} = \{(N_1, t_1^{N_1}, s_1^{N_1}), (N_1, t_2^{N_1}, s_2^{N_1}), \ldots, (N_2, t_1^{N_2}, s_1^{N_2}), \ldots\}$$

Note that a name Nj may occur several times in a video, so each occurrence is indexed by k.

Timing similarity, St(tF, tN), between the timing of a face F and a name N is defined as follows:

$$S_t(t_F, t_N) = \begin{cases} e^{-\frac{(t_N - t_F^{\mathrm{start}})^2}{2\sigma_t^2}} & (t_N < t_F^{\mathrm{start}}) \\ 1 & (t_F^{\mathrm{start}} \le t_N < t_F^{\mathrm{end}}) \\ e^{-\frac{(t_N - t_F^{\mathrm{end}})^2}{2\sigma_t^2}} & (t_F^{\mathrm{end}} \le t_N) \end{cases} \quad (6)$$

This is basically a step function having 1 if tN falls in the range between tF^start and tF^end, but its edges are dispersed using a Gaussian filter with standard deviation σt. The Gaussian filter should compensate for the time delay between the video and transcript.
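Equation 6 as a small function; the article doesn't state the value of σt, so the default below is only a placeholder.

```python
import math

def timing_similarity(t_start, t_end, t_name, sigma_t=30.0):
    """Equation 6: 1 inside the face sequence, Gaussian falloff outside.

    t_start, t_end: start and end frames of the face sequence.
    t_name: frame at which the name occurs in the transcript.
    sigma_t: standard deviation of the edge-dispersing Gaussian (placeholder).
    """
    if t_name < t_start:
        return math.exp(-((t_name - t_start) ** 2) / (2.0 * sigma_t ** 2))
    if t_name < t_end:
        return 1.0
    return math.exp(-((t_name - t_end) ** 2) / (2.0 * sigma_t ** 2))
```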

The caption recognition results are obtained as a two-tuple list (timing, recognition result):

$$\{(t_{C_i}, C_i)\} = \{(t_{C_1}, C_1), (t_{C_2}, C_2), \ldots\}$$

where $t_{C_i} = t_{C_i}^{\mathrm{start}} \sim t_{C_i}^{\mathrm{end}}$ because each caption has a duration. First, the system chronologically compares the caption-recognition result with a face. We simply define the timing similarity, S't(tC, tF), of timing of caption C and face F, as follows:

$$S'_t(t_C, t_F) = \begin{cases} 0 & (t_F^{\mathrm{end}} < t_C^{\mathrm{start}} \ \text{or} \ t_C^{\mathrm{end}} < t_F^{\mathrm{start}}) \\ 1 & \text{otherwise} \end{cases} \quad (7)$$

Next, the similarity between a caption recognition result C and a name N is defined using the distance dc(C, N). To take into account only pairs of a caption and name that match well—that is, pairs of a caption and name whose distance is very small—we define the similarity Sc(C, N) of a caption C and a name N as

$$S_c(C, N) = \begin{cases} 0 & (d_c(C, N) > \theta_c) \\ 1 & \text{otherwise} \end{cases} \quad (8)$$

where θc is the threshold value for the distance between captions and names. We set θc to 0.2 for our experimental system.

Finally, we define the co-occurrence factor C(N, F) of the name N and the face F as

$$C(N, F) = \frac{\displaystyle\sum_i S_f(F_i, F)\left(\sum_k s_k^N\, S_t(t_{F_i}, t_k^N) + *\right)}{\displaystyle\sqrt{\sum_i S_f(F_i, F)^2\, \mathrm{dur}(t_{F_i})}\ \sqrt{\sum_k \bigl(s_k^N\bigr)^2}} \quad (9)$$

$$* = w_c \sum_j S'_t(t_{C_j}, t_{F_i})\, S_c(C_j, N) \quad (10)$$

where wc is the weight for caption-recognition results. Roughly speaking, when a name and a caption match and the caption and a face match at the same time, the face equivalently coincides with wc occurrences of that name. We use 1 for the value of wc. Figure 10 depicts the calculation process for the main portion of the numerator in Equation 9. Intuitively, the numerator represents the number of occurrences of the name N that coincide with face F, taking face similarities and name scores into account. That number is then normalized with the denominator to prevent the "anchor person problem." An anchor person coincides with almost any name. A face or name that coincides with any name or face should correspond to no name or face. The more names an anchor person coincides with, the larger the denominator becomes by summing for each coincident name. Thus the resulting co-occurrence factor becomes small for anchor persons.

Figure 10. Calculating the co-occurrence factor. This figure depicts the calculation of co-occurrence between the face of Clinton (top) and the name "Clinton" (bottom). Edges from Clinton's face to other faces represent face similarity Sf(Fi, FClinton); edges from faces to names represent timing similarity St(tFi, tNk); and edges from names to "Clinton" represent name scores sNk in the numerator of Equation 9.

Experiments
We implemented the Name-It system on an SGI workstation. We processed 10 "CNN Headline News" videos (30 minutes each) for a total of five hours of video. The system extracted 556 face sequences from the videos.

Name-It performs name-candidate retrieval from a given face and face-candidate retrieval from a given name. In face-to-name retrieval, the system is given a face, then outputs name candidates with co-occurrence factors in descending order. Likewise, in name-to-face retrieval, the system outputs face candidates of a given name with co-occurrence factors in descending order.
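For reference, Equations 9 and 10 can be combined in a sketch like the following. The data structures and helper callables mirror the notation above but are assumptions, and the square-root normalization in the denominator follows our reading of Equation 9 rather than a confirmed implementation.

```python
import math

def co_occurrence(name_occurrences, face, faces, face_similarity,
                  timing_similarity, caption_overlap, caption_name_similarity,
                  captions, w_c=1.0):
    """Sketch of the co-occurrence factor C(N, F) of Equations 9 and 10.

    name_occurrences: list of (t_name, score) for every occurrence of name N.
    face: the query face sequence; faces: all extracted face sequences,
    each assumed to expose t_start and t_end attributes.
    captions: the recognized video captions; caption_overlap implements
    Equation 7 and caption_name_similarity implements Equation 8 for N.
    """
    numerator = 0.0
    face_norm = 0.0
    for f_i in faces:
        s_f = face_similarity(f_i, face)
        transcript_term = sum(score * timing_similarity(f_i.t_start, f_i.t_end, t_name)
                              for t_name, score in name_occurrences)
        caption_term = w_c * sum(caption_overlap(c, f_i) * caption_name_similarity(c)
                                 for c in captions)          # Equation 10
        numerator += s_f * (transcript_term + caption_term)
        face_norm += s_f ** 2 * (f_i.t_end - f_i.t_start)    # dur(t_Fi)
    name_norm = sum(score ** 2 for _, score in name_occurrences)
    denominator = math.sqrt(face_norm) * math.sqrt(name_norm)
    return numerator / denominator if denominator > 0 else 0.0
```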

Figure 11 shows the results of face-to-name retrieval. Each result shows an image of a given face and ranked name candidates associated with co-occurrence factors. Correct answers are denoted by the circled number rankings. Figure 12 shows the results of name-to-face retrieval. The top four face candidates are shown in order from left to right with corresponding co-occurrence factors. Although the correct answers acquire higher ranking, the results might be recognized as imperfect due to many incorrect candidates within the top four results. However, when we recall that Name-It extracts face and name information and combines these unreliable sets of information to obtain face-name association, inevitably the results contain unnecessary candidates. Thus, the results demonstrate good performance in face-to-name and name-to-face retrieval.

Figure 11. Face-to-name retrieval results.

Figure 12. Name-to-face retrieval results.

After the experiments, we evaluated the Name-It system in terms of accuracy. We used 308 manually named face sequences as the correct answer. Figures 13 and 14 depict the accuracy of face-to-name and name-to-face retrieval. In this accuracy evaluation, if the correct answer is output in the top N candidates, we regard this output as correct. (The output is correct with N allowed candidates. Note that a name may correspond to several identical faces, and a face may correspond to both the given name and the family name.) Thus these graphs represent relationships between accuracy and the number of allowed candidates. They also show results using both name scores and video-caption recognition, results without name scores (all scores set to 1.0), results without video-caption recognition (wc set to 0), and results without either name scores or video-caption recognition.

By comparing the results using both name scores and video captions to the results without video captions for both graphs, we can say that video-caption recognition contributes to higher accuracy. Actually, some faces aren't mentioned in the transcripts, but described in video captions. These faces can be named only by incorporating video-caption recognition (such as Figure 11d and Figure 12e). Figure 13 shows that name score evaluation proves effective for face-to-name retrieval. However, according to Figure 14, it doesn't cause any major difference in accuracy for name-to-face retrieval. This result indicates that name scores properly reflect whether each word corresponds to a person of interest in topics (in face-to-name retrieval). By contrast, name scores cannot represent which occurrence of a certain word coincides with a face sequence of the person of the name in name-to-face retrieval. In other words, name scores succeed in inferring which word is likely to correspond to a person of interest. However, they fail to infer which word actually coincides with the face sequence. The main reason for this is the fact that transcripts don't explain videos directly. To overcome this problem, the system may need in-depth transcript recognition, as well as in-depth scene understanding, and a proper way to integrate these analysis results. The graphs also disclose that Name-It achieves an accuracy of 33 percent in face-to-name retrieval and 46 percent in name-to-face retrieval with five candidates allowed.

Figure 13. Accuracy of face-to-name retrieval.

Figure 14. Accuracy of name-to-face retrieval.
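The "correct within N allowed candidates" measure used in Figures 13 and 14 reduces to a membership test over the ranked candidate list; a sketch follows, with the ground-truth format as an assumption.

```python
def retrieval_accuracy(queries, retrieve, ground_truth, allowed=5):
    """Fraction of queries whose correct answer appears in the top candidates.

    queries: the query faces (or names).
    retrieve: function mapping a query to a ranked candidate list.
    ground_truth: maps a query to the set of acceptable answers (a name may
    correspond to several identical faces, and a face to both the given
    name and the family name).
    """
    hits = 0
    for q in queries:
        candidates = retrieve(q)[:allowed]
        if any(c in ground_truth[q] for c in candidates):
            hits += 1
    return hits / len(queries) if queries else 0.0
```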

Conclusions
Name-It associates faces and names in news videos by integrating face-sequence extraction and similarity evaluation, name extraction, and video-caption recognition into a unified factor: co-occurrence. The successful experimental results demonstrate the effectiveness of a multimodal approach in video content extraction. Although the performance of each individual technology is not always high, our experiments demonstrate that Name-It achieves good face-name association. Further research will aim to enhance each technique, as well as analyze and improve the integration method.

Acknowledgment
This material is based on work supported by the National Science Foundation under Cooperative Agreement No. IRI-9411299. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References
1. S. Satoh and T. Kanade, "Name-It: Association of Face and Name in Video," Proc. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos, Calif., 1997, pp. 368-373.
2. S. Satoh, Y. Nakamura, and T. Kanade, "Name-It: Naming and Detecting Faces in Video by the Integration of Image and Natural-Language Processing," Proc. Int'l Joint Conf. on Artificial Intelligence, Morgan Kaufmann, San Francisco, 1997, pp. 1488-1493.


3. R. Chopra and R.K. Srihari, "Control Structures for Incorporating Picture-Specific Context in Image Interpretation," Proc. Int'l Joint Conf. on Artificial Intelligence, Morgan Kaufmann, San Francisco, 1995.
4. D. Swanberg, C.F. Shu, and R. Jain, "Knowledge Guided Parsing in Video Databases," Proc. Symp. on Electronic Imaging: Science and Technology, SPIE Press, Bellingham, Wash., Vol. 1908, 1993, pp. 13-24.
5. S.W. Smoliar and H. Zhang, "Content-Based Video Indexing and Retrieval," IEEE MultiMedia, Vol. 1, No. 2, April-June (Summer) 1994, pp. 62-72.
6. R. Chellappa, C.L. Wilson, and S. Sirohey, "Human and Machine Recognition of Faces: A Survey," Proc. IEEE, Vol. 83, No. 5, 1995, pp. 705-740.
7. Proc. Sixth Message Understanding Conference, Morgan Kaufmann, San Francisco, 1995.
8. H.A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection," IEEE Trans. on PAMI, Vol. 20, No. 1, 1998, pp. 23-38.
9. J. Yang and A. Waibel, "Tracking Human Faces in Real Time," Tech. Report CMU-CS-95-210, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1995.
10. H.M. Hunke, "Locating and Tracking of Human Faces with Neural Networks," Tech. Report CMU-CS-94-155, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1994.
11. M. Smith and T. Kanade, "Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques," Proc. Computer Vision and Pattern Recognition, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 775-781.
12. M. Turk and A. Pentland, "Eigenfaces for Recognition," J. Cognitive Neuroscience, Vol. 3, No. 1, 1991, pp. 71-86.
13. Oxford Advanced Learner's Dictionary of Current English (computer-usable version), R. Mitton, ed., The Oxford Text Archive, Oxford, UK, June 1992, http://ota.ox.ac.uk/.
14. G. Miller et al., "Introduction to WordNet: An Online Lexical Database," Int'l J. Lexicography, Vol. 3, No. 4, 1990, pp. 235-244.
15. D. Sleator, "Parsing English with a Link Grammar," Proc. Third Int'l Workshop on Parsing Technologies, Assoc. for Computational Linguistics, 1993.
16. T. Sato et al., "Video OCR for Digital News Archives," Proc. IEEE Workshop on Content-Based Access of Image and Video Databases (ICCV 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 52-60.
17. P.A.V. Hall and G.R. Dowling, "Approximate String Matching," ACM Computing Surveys, Vol. 12, No. 4, 1980, pp. 381-402.

Shin’ichi Satoh is an associate professor at the National Center for Science Information Systems (NACSIS), Japan. He received a BE in 1987 and ME and PhD degrees in 1989 and 1992, respectively, from the University of Tokyo. His research interests include video analysis and multimedia databases. He was a visiting scientist at the Robotics Institute, Carnegie Mellon University, Pittsburgh, from 1995 to 1997.

Yuichi Nakamura is an assistant professor at the University of Tsukuba, Japan. He received a BE in 1985 and ME and PhD degrees in electrical engineering from Kyoto University in 1987 and 1992, respectively. His research interests and activities include video analysis and video utilization for knowledge sources. He was a visiting scientist at the Robotics Institute, Carnegie Mellon University, Pittsburgh, in 1996.

Takeo Kanade is currently the U.A. and Helen Whitaker University Professor of computer science and Director of the Robotics Institute at Carnegie Mellon University. He received a PhD in electrical engineering from Kyoto University, Japan, in 1974. He holds more than ten patents and has worked in multiple areas of robotics—computer vision, manipulators, autonomous mobile robots, and sensors. He has been elected to the National Academy of Engineering and is a Fellow of the IEEE, a Fellow of the ACM, a Founding Fellow of the American Association for Artificial Intelligence, and the founding editor of the International Journal of Computer Vision. He has received several awards, including the Joseph Engelberger Award, Japan Robot Association Award, Otto Franc Award, Yokogawa Prize, and Marr Prize.

Readers may contact Satoh at the National Center for Science Information Systems, 3-29-1 Otsuka, Bunkyo-ku, Tokyo 112-8640, Japan, e-mail [email protected].
