Lessons Learned from Building a Terabyte Digital Video Library

Howard D. Wactlar, Michael G. Christel, Yihong Gong, and Alexander G. Hauptmann
Carnegie Mellon University

The Informedia Project has integrated artificial intelligence techniques in creating a digital library providing full-content search and discovery in the video medium.

The Informedia Project at Carnegie Mellon University has created a terabyte digital video library in which automatically derived descriptors for the video are used for indexing, segmenting, and accessing the library contents. The project, begun in 1994, was one of six funded by the US National Science Foundation, the US Defense Advanced Research Projects Agency, and the National Aeronautics and Space Administration, under the US Digital Library Initiative.

Digital video presented a number of interesting challenges for library creation and deployment: the way it embeds information, its voluminous file size, and its temporal characteristics. In the Informedia Project, we addressed these challenges by

• automatically extracting information from digitized video,
• creating interfaces that allowed users to search for and retrieve videos based on extracted information, and
• validating the system through user testbeds.

We met these objectives during the course of the project, learning many lessons along the way.

VIDEO PROCESSING
Our library consisted of two types of video: news video (from the Cable News Network) and documentary video (from the British Open University, QED Communications, the Discovery Channel, and a number of US government agencies, including NASA, the National Park Service, and the US Geological Survey).

As of May 1998, the library contained more than 1,000 hours of news and 400 hours of documentary video, with additional video being added daily. The news video often included tags which marked story boundaries within longer broadcasts. At first, we added these story boundaries manually for the documentary video; in our subsequent experiments we looked at generating story boundaries automatically. The stories, or video segments, averaged a few minutes in length, so that the entire library contained more than 40,000 video segments.

We used artificial intelligence techniques to create metadata, the data that describes video content. We found that all the AI techniques we used were applicable to both the news corpus and the documentary corpus. These techniques included speech recognition, image processing, and information retrieval. We learned that by integrating across these three areas we were able to compensate for limitations in accuracy, coverage, and communicative power.

Speech processing
We analyzed the audio component of the video with the CMU Sphinx speech recognition system.1 This created a complete transcript for text-based retrieval from the speech and aligned existing imperfect transcripts to the video. The tightly synchronized transcript was subsequently used in the library interface for quickly locating regions of interest within relevant video segments.

In creating transcripts, CMU Sphinx's word error rate is inversely proportional to the amount of processing time devoted to the task. Processing time varies from real time, which offers relatively poor transcription (more than half the words were wrong), to several hundred times real time, which offers optimal performance. Our standard processing is around 35 times real time, and it offers very accurate transcription. CMU Sphinx, however, can run on parallel machines, providing excellent word recognition in two to three times real time.

In our initial experiments on news data in 1994, word error rate (the sum of erroneous word insertions, deletions, and substitutions, taken as a fraction of the words actually spoken) was 65 percent. The latest version of the speech recognition system, CMU Sphinx III, which was trained on 66 hours of news broadcasts, provides a word error rate of about 24 percent. We have been able to lower that to 19 percent through the use of general language models interpolated with the current "news of the day" as obtained from the Web sites of CNN, Reuters, and the Associated Press. In the interpolated news-language model, the speech recognition vocabulary reflects the names and places of today's news, and the probabilities of word pairs and triplets are adjusted to reflect expressions used in today's news.
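For reference, word error rate can be computed by aligning the recognizer's output against a reference transcript with a minimum-edit-distance alignment and counting substitutions, insertions, and deletions. The following is a minimal sketch of that calculation, not code from the Informedia system:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed here with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitute, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative made-up transcripts: 2 substitutions and 1 insertion against
# a 5-word reference give a 60 percent word error rate.
print(word_error_rate("president met the press today",
                      "president met a press to day"))
```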

Information retrieval
In Informedia, we needed to provide accurate searching of audio content despite relatively high word error rates. Our initial experiments suggested that information retrieval would still be effective despite the transcription mistakes by the speech recognition system. With word error rates up to 30 percent, information retrieval precision and recall were degraded by less than 10 percent. This is a reflection of the redundancy in language, where key words repeat multiple times within a document and different key terms describe the same thing. Redundancy means that even if the speech recognizer misses one or two instances, the document will still be considered appropriately relevant to a user's query.
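Precision and recall, the measures cited above, compare a retrieved result set against relevance judgments. A small illustrative sketch with hypothetical segment identifiers (not Informedia's evaluation code):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved segments that are relevant.
    Recall: fraction of relevant segments that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 8 of 10 retrieved segments are relevant (precision 0.8),
# and those 8 cover half of the 16 relevant segments in the library (recall 0.5).
print(precision_recall(range(10), range(2, 18)))
```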

Other experiments showed that transcript errors due to words missing from the speech recognizer's vocabulary could be mitigated by using a phonetic transcription together with a word transcription of the document.2,3 We were also able to improve information retrieval by having the speech recognizer produce multiple alternate candidate transcription hypotheses rather than a single best guess.4

From these experiments we learned that even relatively high word error rates in speech recognition nevertheless permit relatively effective information retrieval. Using standard information retrieval techniques, the state of the art in speech recognition was adequate for information retrieval from broadcast news video. Additional techniques specialized for speech transcripts could further improve the retrieval results.

However, there are still many open questions related to information retrieval from video documents. We need to verify that research results obtained with relatively small amounts of spoken data will apply as well to very large collections. There was initial evidence that retrieval from heterogeneous sources (such as a corpus of some perfect text documents and some error-laden speech transcripts) would be more adversely affected by speech errors. Finally, errors in the automatic partitioning of video streams into video segments may also adversely affect information retrieval effectiveness. If a single relevant news story is inadvertently broken up into two pieces, with some key terms in each piece, the resulting relevance rank for the individual halves of the story may be substantially lower than for the complete news story segmented properly.

Image processing
The ultimate goal for image processing in the context of digital video libraries is the ability to fully characterize the scene and all objects within it and to provide efficient, effective access to this information through indices and similarity measures. Other research groups have made progress toward this goal by restricting the domain, for example, to a particular sport or to news broadcasts from a particular agency.5-7 Programs can then look for certain cues, like the appearance of scoreboards or news anchors. The Informedia Project, however, focused on general-purpose processing without narrowing the candidate videos to a particular domain.

MPEG-1 compression was used for the Informedia library, with video at 30 frames per second and individual frame resolution at 352 × 240 pixels. Our image processing techniques worked with the compressed video frames to reduce processing time requirements.

The relatively low resolution and compression artifacts adversely affected the quality of the metadata produced through image processing, such as face and text detection. By taking advantage of the temporal redundancy in video, we minimized the effects of such errors. For example, text that appears on one frame of video is likely to remain visible in subsequent frames of video for at least the next few seconds.

In the Informedia Project, we used image processing to

• partition each video segment into shots, defined as a set of contiguous frames representing a continuous action in time or space,5,6 and choose a representative frame (key frame) for each shot;
• identify and index features to support image similarity matching; and
• create metadata derived from imagery for use in text matching.

Shots and key frames. To enable quick access to content within video segments and to provide visual summaries, we decomposed each video segment into shots. Each shot was bounded by frames in which significant color shifts occurred.5 Shot detection was improved by incorporating optical flow characteristics for tracking camera and object motion. For example, a camera pan was left intact as a single shot.

By default, a shot's middle frame became its key frame representing that section of video. We found we could improve the choice of key frame by applying cinematic heuristics. For example, when there is camera motion in a shot, the point where it stops is probably the most informative and significant, so this point becomes the key frame. Similarly, low-intensity images are likely less important, so they are excluded as key frames.
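A minimal sketch of this style of shot detection and default key-frame choice, assuming decoded RGB frames are available as arrays; the histogram size and difference threshold are illustrative, not the project's actual parameters:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Coarse RGB histogram, normalized so frames of any size compare equally."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins), range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def detect_shots(frames, threshold=0.4):
    """Split a frame sequence into shots wherever the histogram difference
    between consecutive frames exceeds a threshold (a significant color shift)."""
    boundaries = [0]
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    boundaries.append(len(frames))
    # Each shot is a (start, end) frame range; its middle frame serves as the
    # default key frame, as described above.
    shots = list(zip(boundaries[:-1], boundaries[1:]))
    key_frames = [(start + end) // 2 for start, end in shots]
    return shots, key_frames
```

Optical-flow analysis and the cinematic heuristics described above would then refine both the boundaries (merging cuts caused only by camera motion) and the key-frame choice.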

Face and color detection. In support of image matching and richer visual summaries, we weighted the key frame selection to use images with faces wherever possible. We also provided library users with a face-searching capability based on this technique.

Face detection was adequate in the case of full frontal shots (where both eyes were distinguishable) and when faces were well lit. In news stories, then, it did well for anchorpersons in the news studio but often not for subjects of stories in the field. With improved face detection, the other key players besides just the anchorperson could be identified, resulting in more representative key frames for the video and more complete face matching across the whole library.

Library users could also match images based purely on color. However, matching on the basis of color histograms often failed to deliver relevant images. The color histogram technique failed in part because it uses a primitive representation of image color properties and in part because it cannot accomplish partial mapping (matching images in finer granularities). The inability to use partial mapping meant that users could not retrieve images on the basis of regions of interest: For example, they could not retrieve all the video that contained a green lawn without regard to the rest of the imagery.

We developed a new method for image retrieval based on human perceptual color clustering.8 This perceptual color clustering method first clusters color images into a few prominent colors and uniform regions based on human color perception. It then creates a multidimensional feature vector for each of the prominent colors and uniform regions. Finally, it uses these feature vectors to form indices of the original color images.

To retrieve images using this technique, the user specifies colors and regions of interest in the sample image. The system retrieves relevant images by first constructing feature vectors of the user-specified colors and regions, and it then uses these feature vectors as search keys to perform a nearest-neighbor search on the database. An experimental evaluation showed that this new image retrieval method surpassed color histogram-based methods.8
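A much-simplified sketch of the retrieval step, assuming each library image has already been reduced to a fixed-length feature vector; the Euclidean metric, the brute-force scan, and the vector dimensions stand in for whatever distance measure and index the deployed system used:

```python
import numpy as np

def nearest_neighbors(query_vector, library_vectors, k=5):
    """Return indices of the k library images whose feature vectors lie
    closest to the query vector (smallest Euclidean distance)."""
    diffs = library_vectors - query_vector          # shape: (num_images, dims)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    return np.argsort(distances)[:k]

# Hypothetical usage: 10,000 images, each described by a 12-dimensional
# vector built from its prominent colors and regions.
library = np.random.rand(10_000, 12)
query = np.random.rand(12)
print(nearest_neighbors(query, library))
```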

Video OCR (VOCR). We used text superimposed on the video to identify key frames and enrich searching.9 The VOCR process involves three steps; a simplified sketch of the first two steps appears after the list.

1. Identify video frames with probable text. Text captions in video can typically be characterized as a horizontal rectangular structure of clustered sharp edges. A horizontal differential filter highlights vertical edge features, with postprocessing to eliminate extraneous fragments and connect character segments that may have become detached. The resulting clusters are analyzed for their horizontal and rectangular text-like appearance. Clusters satisfying this analysis are identified as text regions in a frame.
2. Filter the probable text region across the multiple video frames containing it, producing a higher quality image than if a single frame were used.
3. Use commercial OCR software to process the final filtered image.
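A rough sketch of steps 1 and 2 under simplifying assumptions: grayscale frames as NumPy arrays, a plain horizontal pixel-difference filter standing in for the filter described above, and frame averaging standing in for the multi-frame filter; the thresholds are illustrative:

```python
import numpy as np

def text_edge_map(gray_frame, edge_threshold=30):
    """Step 1 (simplified): a horizontal differential filter responds to the
    vertical strokes of superimposed characters; thresholding keeps the strong
    responses that clustered caption text tends to produce."""
    horizontal_diff = np.abs(np.diff(gray_frame.astype(np.int16), axis=1))
    return horizontal_diff > edge_threshold

def find_text_rows(edge_map, min_edge_fraction=0.05):
    """Mark rows whose density of edge pixels suggests a horizontal band of
    text; runs of consecutive marked rows approximate a caption's extent."""
    row_density = edge_map.mean(axis=1)
    return np.flatnonzero(row_density > min_edge_fraction)

def enhance_caption(caption_frames):
    """Step 2 (simplified): average the same caption region across the frames
    in which it appears, suppressing background motion and compression noise."""
    return np.mean(np.stack(caption_frames).astype(np.float32), axis=0)
```

The enhanced caption image would then be handed to commercial OCR software, as in step 3; the real system also applied the geometric cluster tests described above rather than a simple row-density check.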

Tests run with MPEG-1 videos of broadcast news showed that text regions were identified correctly 90 percent of the time, with 83.5 percent of characters subsequently recognized correctly and a word recognition rate of 70 percent.9 We could improve word accuracy further by basing corrections on dictionaries, thesauri, and the same "news of the day" information as was used to create interpolated language models for speech recognition.

Future challenges. Image processing has not yet reached the point where it can automatically characterize all objects and their interrelationships within video segments.7 Associated challenges include:

• Evolving from color-based image similarity matching to content-based image similarity matching. A long-term goal is to recognize both a close-up of a single soccer player and an aerial shot of the whole soccer field as similar shots based on content, despite differences in low-level image features like color, texture, and shape.
• Effectively segmenting general-purpose video into stories, and further breaking down these stories into meaningful shots. Segmentation will likely benefit from improved analysis of the video corpus, analysis of video structure, and application of cinematic rules of thumb.

We believe these challenges can be met by applying techniques across various modalities; for example, using language models together with transcript information to improve video segmentation. Correlating human faces in video images with human names in transcripts could facilitate finding appearances of individuals in video sequences. Other correlations could enable labeling of video segments with who, when, and where information.

THE INFORMEDIA INTERFACE
The Informedia interface was designed to provide users with quick access to relevant information in the digital video library. A basic element of the design was the provision of alternative browsing capabilities, or multimedia abstractions, to help users decide which video they wanted to see. In the Informedia Project, abstractions included headlines, thumbnails, filmstrips, and skims.

As Cox et al. state, "powerful browsing capabilities are essential in a multimedia information retrieval system"7 because the underlying speech, image, and language processing are imperfect and produce ambiguous, incomplete metadata. The search engine introduces additional ambiguity, while the users themselves contribute more error through poor query formulation or simply not knowing what to ask for. The results from a query to a digital video library are hence imprecise, with users left to filter out the useful information. Multimedia abstractions have been invaluable for assisting in this task.

The Informedia Project began with a conceptual prototype suggesting the utility of certain multimedia abstractions and interface features. The prototype was refined into an operational interface that was then used at the Winchester Thurston School, Pittsburgh, by high school students and teachers in computer laboratory and traditional library settings. The system was later delivered to the US Defense Advanced Research Projects Agency as a prototype application for "News-on-Demand" and also used at Carnegie Mellon University for demonstrations and controlled experiments.

We tracked usage data primarily by automatically logging mouse and keyboard input actions, supplemented with user interviews. We conducted formal empirical studies to measure the effectiveness of particular multimedia abstractions. During the fall of 1997, students in a graduate course in human-computer interfaces at Carnegie Mellon University conducted a number of evaluations on the system, including contextual inquiry, heuristic evaluation, cognitive walk-throughs, and think-aloud protocols. The Informedia system was also demonstrated at numerous multimedia, user interface, and digital library conferences.

Data from these instruments and forums collectively forms the basis for the discussion that follows.

Figure 1. Query input and result set interface for the Informedia digital video library.

Headlines
Figure 1 shows the Informedia interface following a query on the Northern Ireland peace treaty vote. The display at the bottom shows thumbnail images for video segments returned as matches for the query. When the user positions the mouse arrow over a thumbnail, the interface pops up a headline for the segment.

Early Informedia implementations created headlines for segments based on the "most significant" words in their text (for example, in the transcript and VOCR-produced text). These words were determined by term frequency-inverse document frequency (tf-idf) statistics. If the word occurred often in the text associated with the video segment and infrequently in the whole library's text, then that word became part of the headline. These headlines were deficient in the following ways:

• They did not read well, being an assemblage of words usually not occurring together naturally.
• They did not use semantics provided by the segment belonging to a certain class of video, such as network news or animal documentaries.
• They did not offer additional summary information, such as segment size or date.

The new headlines, shown in Figure 1, addressed these shortcomings:

• Phrases, instead of words, were weighted using tf-idf statistics and became the component pieces for the headlines (a simplified sketch of this weighting follows the list).
• Headlines for news segments used the heuristic that the most significant information is given immediately, with details and illustrations following it.
• Headlines included segment size and the copyright date. If the user sorted results by date, the date appeared first in the headline.
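A minimal illustration of the tf-idf weighting behind headline selection, using single words for simplicity; it reproduces the early word-based headlines whose shortcomings motivated the phrase-based version, and the scoring details are illustrative rather than the project's exact formulas:

```python
import math
from collections import Counter

def tfidf_scores(segment_text, library_texts):
    """Score each word of a segment: frequent within the segment and rare
    across the library's texts yields a high score (a headline candidate)."""
    words = segment_text.lower().split()
    tf = Counter(words)
    num_docs = len(library_texts)
    scores = {}
    for word, count in tf.items():
        docs_with_word = sum(word in doc.lower().split() for doc in library_texts)
        idf = math.log(num_docs / (1 + docs_with_word))
        scores[word] = (count / len(words)) * idf
    return scores

def headline(segment_text, library_texts, length=6):
    """Assemble a crude headline from the highest-scoring words; the result
    reads as an unnatural word assemblage, which is why phrases replaced it."""
    scores = tfidf_scores(segment_text, library_texts)
    return " ".join(sorted(scores, key=scores.get, reverse=True)[:length])
```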

Thumbnails
Figure 1 shows thumbnails for 12 segments. A formal experiment was conducted to test whether the results layout shown in Figure 1 had any benefits over presenting the results in a text menu, where the headline text would be displayed for each result. The experiment, conducted with high school scholarship students, showed that a version of the pictorial menu had significant benefits for both performance time and user satisfaction: Subjects found the desired information in 36 percent less time with certain pictorial menus than with text menus.

Most interestingly, the manner in which the thumbnail images were chosen was critical. If the thumbnail for a video segment was taken to be the key frame for the first shot of the segment, then the resulting pictorial menu of "key frames from first shots in segments" produced no benefit compared to the text menu. Only when the thumbnail was chosen based on usage context was there an improvement. When the thumbnail was chosen based on the query, by using the key frame for the shot producing the most matches for the query, then pictorial menus produced clear advantages over text-only menus.10

This empirical study validated the use of thumbnails for representing video segments in query result sets. It also provided evidence that leveraging multiple processing techniques improves digital video library interfaces. Thumbnails derived from image processing alone, such as choosing the image for the first shot in a segment, produced no improvements over the text menu. However, through speech recognition, natural language processing, and image processing, improvements can be realized.

• Via speech recognition, the spoken dialogue words are tightly aligned to the video imagery.
• Through natural language processing, the query is compared to the spoken dialogue, and matching words are identified in the transcript and scored.
• Via word alignment, each shot can be scored for a query, and the thumbnail can then be the key frame from the highest-scoring shot for the query.

Figure 2 illustrates these improvements. Result sets showing such query-based thumbnails, as in the pictorial menu shown at the bottom of Figure 1, do produce advantages over text-only result presentations.

Figure 2. Using speech, image, and language processing to derive a thumbnail best matching the user's query.
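The shot-scoring step behind query-based thumbnails can be sketched as follows, assuming word-level timings from the aligned transcript and shot boundaries from image processing; the data structures are hypothetical, not Informedia's actual ones:

```python
def query_based_thumbnail(query_words, word_times, shots, key_frames):
    """Pick the key frame of the shot containing the most query-word matches.

    word_times: list of (word, time_in_seconds) pairs from the aligned transcript.
    shots: list of (start_time, end_time) pairs, one per shot.
    key_frames: one representative frame (or frame index) per shot.
    """
    query = {w.lower() for w in query_words}
    scores = [0] * len(shots)
    for word, t in word_times:
        if word.lower() in query:
            for i, (start, end) in enumerate(shots):
                if start <= t < end:
                    scores[i] += 1
                    break
    # Falls back to the first shot's key frame when no words match.
    best_shot = max(range(len(shots)), key=lambda i: scores[i])
    return key_frames[best_shot]
```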

Filmstrips
Key frames from a segment's shots can be presented in sequential order as filmstrips, as shown at the top of Figure 3. The segment's filmstrip quickly shows that the segment contains more than a story matching the query "Mir collision," including an opening sequence and a weather report. The filmstrip helps identify key shots with bars color-coded to specific query words, in this case red for "Mir" and purple for "collision." When the user moves the mouse arrow over the match bars, a text window displays the matching word and other metadata.

Figure 3. Filmstrip and video playback windows for a result from the query for "Mir collision."

Because key frame selection can be weighted to use images of faces, filmstrips of news stories typically show the anchorperson at points in the video segment when the story is introduced, when transitional comments are made before cutting to field footage or a new interviewee, and when the story is summarized before moving on to a different topic. The count and location of anchorperson images enable the user to infer the length, percentage of field footage, and flow of a news video segment, all via the filmstrip abstraction.

By investigating the distribution of match bars on the filmstrip, the user can determine the relevancy of the returned result and the location of interest within the segment. The user can click on a match bar to jump directly to that point in the video segment. Hence, clicking the mouse as shown in Figure 3 would start playing the video at this mention of Mir with the shot of the sitting cosmonauts. Similarly, the system provided "seek to next match" and "seek to previous match" buttons in the video player, allowing the user to quickly jump from one match to the next.

In the example of Figure 3, these interface features allowed the user to bypass irrelevant video and seek directly to the Mir story. Users frequently accessed these and other features that revealed the workings of the search engine and explained why a result was returned, including details on the relative contributions of each query term.

Transcript text appears at the bottom of the video playback window. As the video plays, the text scrolls, with words highlighted as they are spoken. The interface worked well when showing a single type of tightly synchronized metadata: spoken transcripts. This interface feature might be useful to the hearing-impaired community and to those learning English.

As other metadata types were introduced, however, such as VOCR text, user notes, topic labels, and headlines, having one text box or even multiple text boxes stacked beneath the video for displaying the metadata started to be confusing. If one text box was used, the problem was how to show current scope: a type like a user note could have a duration of 20 seconds, but each transcript word had a typical duration of a second or less. If multiple text boxes were used, one per metadata type, then not only did screen resolution dictate a maximum number of metadata types possible, but the user's attention was also severely fragmented watching over the multiple scrolling text boxes. A possible solution is the DIVA approach,11 which shows temporal streams of metadata synchronized to the video as striped rectangular areas alongside the video presentation.

Skims
We developed a temporal multimedia abstraction, the video skim, which is played rather than viewed statically. A skim incorporates both video and audio information from a longer source so that a two-minute skim, for example, may represent a 20-minute original video. The MoCA Project developed automatic techniques to create video skims that act as movie trailers, short appetizers for a longer video intended to attract the viewer's attention.6 Our goal for video skims went beyond motivating a viewer to watch a full video segment; we sought to communicate the essential content of a video in an order of magnitude less time.

Initial progress was slow in the development of video skims with this capability. Formal empirical tests in early 1997 showed no advantages to skims produced through image and speech processing versus trivial subsampled skims, where sections of audio and video were extracted from the longer source video at regular intervals. These early skims suffered from selecting words rather than phrases, much like headlines, and from ignoring information in the audio channel pertaining to silences. By taking advantage of audio characteristics and utilizing larger component pieces based on phrases, later skims were less choppy and better synchronized with the imagery. These later skims demonstrated significant performance and user satisfaction benefits compared to simple subsampled skims in follow-up empirical studies.12 Further improvements can potentially be gained through:

• Heuristics appropriate for particular types of video and certain video uses. For example, in news broadcasts, the skim would emphasize the anchorperson's opening comments, which likely summarize the story.
• Creating skims on the fly, emphasizing video sections matching current context, such as a user's query.
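For reference, the trivial subsampled skim used as the experimental baseline can be expressed in a few lines; the chunk length is illustrative, and the improved skims instead selected phrase-aligned audio with matching imagery:

```python
def subsampled_skim(duration_seconds, target_seconds, chunk_seconds=5.0):
    """Return (start, end) times of chunks taken at regular intervals so that
    the concatenated chunks compress the source to roughly target_seconds,
    for example a 20-minute segment down to a 2-minute skim."""
    num_chunks = max(1, int(target_seconds // chunk_seconds))
    stride = duration_seconds / num_chunks
    return [(i * stride, min(i * stride + chunk_seconds, duration_seconds))
            for i in range(num_chunks)]

# 1,200 seconds of source video reduced to about 120 seconds:
# 24 five-second chunks, one taken every 50 seconds.
print(subsampled_skim(1200, 120))
```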

OTHER LESSONS LEARNED
The Informedia system attempted to be general purpose, serving a wide range of users accessing wide-ranging video data. In retrospect, this approach may be more limiting than liberating. Many processing techniques, such as video skim creation, work well if heuristics can be applied based on the video belonging to a particular subclass. Users don't want only to find a video segment; they then want to do something with it, for example, integrating video into lesson plans and homework assignments. A library system needs to be user-centric and not just technology-driven.

Anecdotal feedback from users has been very positive. All age groups seem to find the digital library fun to explore. Users quickly learned how to use the interface, conducting searches and retrieving videos. Initially users were not satisfied with a corpus of a few hundred hours, but recognized the value of a news corpus that grew daily and one that held over a thousand hours of video, especially if the video were produced by multiple sources to represent multiple perspectives. In addition to wanting more data, users requested:

• Queries on multiple modalities; for example, searches for the word France and castles matching an image.
• Interface support for more than just query access; for example, hierarchical browsing access or other views of the whole corpus or views across large sets of query results.

In dealing with these requests, we have put the corpus under the control of a commercial database and added information visualization so that users can manipulate a thousand video segments at once using query terms and filters for relevance, date, and size. We now allow users to add their own notes as synchronized, searchable metadata to video segments, and have added automatically generated topics as another means for navigating the library.

When the Informedia Project began in 1994, we anticipated a quick maturation of video-on-demand services via consolidation among the communications, entertainment, and computer industries for enabling digital video access in the home. However, these services and interactive television failed to develop in time for use by the project as a digital video delivery infrastructure. The World Wide Web experienced tremendous growth in the same time span, but without support for VHS-quality video delivery.

The consequence of these shortfalls is that the terabyte digital video library amassed at Carnegie Mellon University remains accessible at VHS-quality playback rates only to those connected to its local area network or those with a network connection capable of sustaining MPEG-1 transfer rates.

One hope for the future is that video streaming technologies will continue to mature and network bandwidth to increase so that digital video libraries are accessible by users in a variety of settings, from schools to offices to homes around the globe.

Intellectual property concerns will inhibit the growth of centralized digital video libraries. While the Informedia library contained some public domain video, the majority of the contributions had access restrictions and could not be published to the Web, even if the video delivery infrastructure were in place. Often, the supplying agency would have liked to agree to broader access but was not sure whether such rights could be granted, because documentation and precedents were lacking on the rights held by producers, directors, and other contributors to the archived video.

One approach to these concerns involves support of federated digital video libraries, in which metadata descriptors and indices are provided in a common repository but source video material remains under the control of the various contributors. Access to the metadata, such as titles, abstracts, filmstrips, and perhaps even video skims, enables users to locate segments of interest. The full video segments are provided via the contributors, perhaps from widely distributed video servers, with the contributors managing the property rights of their own collections.

The Informedia Project began with the goal of allowing complete access to information content within multimedia sources. We believe the focus was accurate: A tremendous amount of potentially useful information remained locked in the audio narrative, nonspeech audio, and video imagery. Through speech, image, and natural language processing, the Informedia Project has demonstrated that previously inaccessible data can be derived automatically and used to describe and index video segments. Much work remains, however, in refining this work, augmenting it with additional processing such as nonspeech analysis, improving segmentation, and tailoring techniques for increased accuracy on subclasses of video and for specific user communities. ❖

Acknowledgments
The Informedia Project was supported by the US National Science Foundation, the US Defense Advanced Research Projects Agency, and the National Aeronautics and Space Administration, under NSF Cooperative Agreement No. IRI-9411299. A complete list of Informedia Project sponsors is found at http://www.informedia.cs.cmu.edu. We thank them for their support and also acknowledge the efforts of the many contributors to this project, including CMU students, staff, faculty, and visiting scientists.

References
1. M.J. Witbrock and A.G. Hauptmann, "Artificial Intelligence Techniques in a Digital Video Library," J. Am. Soc. Information Science, May 1998, pp. 619-632.
2. G.J.F. Jones et al., "Retrieving Spoken Documents by Combining Multiple Index Sources," Proc. SIGIR 96, ACM Press, New York, 1996, pp. 30-38.
3. M.J. Witbrock and A.G. Hauptmann, "Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents," Proc. DL '97, ACM Press, New York, 1997, pp. 30-35.
4. A.G. Hauptmann et al., "Experiments in Information Retrieval from Spoken Documents," Proc. DARPA Workshop on Broadcast News Understanding Systems, Morgan Kaufmann, San Francisco, 1998, pp. 175-181.
5. H.-J. Zhang et al., "A Video Database System for Digital Libraries," Lecture Notes in Computer Science 916, Springer-Verlag, Berlin, 1995, pp. 253-264.
6. R. Lienhart, S. Pfeiffer, and W. Effelsberg, "Video Abstracting," Comm. ACM, Dec. 1997, pp. 55-62.
7. R.V. Cox et al., "Applications of Multimedia Processing to Communications," Proc. IEEE, May 1998, pp. 754-824.
8. Y.H. Gong, G. Proietti, and C. Faloutsos, "Image Indexing and Retrieval Based on Human Perceptual Color Clustering," Proc. Computer Vision and Pattern Recognition, IEEE CS Press, Los Alamitos, Calif., 1998, pp. 578-583.
9. T. Sato et al., "Video OCR for Digital News Archive," Proc. Workshop on Content-Based Access of Image and Video Databases, IEEE CS Press, Los Alamitos, Calif., 1998, pp. 52-60.
10. M. Christel, D. Winkler, and C.R. Taylor, "Improving Access to a Digital Video Library," Human-Computer Interaction: INTERACT97, Chapman & Hall, London, 1997, pp. 524-531.
11. W.E. Mackay and M. Beaudouin-Lafon, "DIVA: Exploratory Data Analysis with Multimedia Streams," Proc. CHI '98, ACM Press, New York, 1998, pp. 416-423.
12. M. Christel et al., "Evolving Video Skims into Useful Multimedia Abstractions," Proc. CHI '98, ACM Press, New York, 1998, pp. 171-178.

Howard D. Wactlar is the vice provost for research computing and associate dean of the School of Computer Science at Carnegie Mellon University. His research interests are multimedia, distributed systems, digital libraries, and performance measurement. Wactlar received a BS in physics from MIT and an MS in physics from the University of Maryland. He is a member of the IEEE.

Michael G. Christel is a senior systems scientist in the Computer Science Department at Carnegie Mellon University and a charter member of CMU's Human-Computer Interaction Institute. His current research interests include multimedia interfaces, information visualization, and digital libraries. He received a BS in math and computer science from Canisius College, Buffalo, N.Y., and a PhD in computer science from the Georgia Institute of Technology.

Yihong Gong is a project scientist in the Robotics Institute at Carnegie Mellon University. His research interests include image and video analysis, multimedia database systems, and artificial intelligence. He received his BS, MS, and PhD degrees in electronic engineering from the University of Tokyo.

Alexander G. Hauptmann is a senior systems scientist in the Computer Science Department at Carnegie Mellon University. His research interests include speech recognition, speech synthesis, speech interfaces, and language in general. He received a BA and an MA in psychology from Johns Hopkins University, a diploma in computer science from Technische Universitaet Berlin, and a PhD in computer science from Carnegie Mellon.

Contact Wactlar, Christel, Gong, and Hauptmann at {wactlar, christel, ygong, hauptmann}@cs.cmu.edu.
