informedia 03/12/97

Page 1: Informedia 03/12/97

Carnegie Mellon © Copyright 2000 Michael G. Christel and Alexander G. Hauptmann

Informedia 03/12/97

Informedia 11/30/2000

Multilingual Informedia:

Search and Retrieval of Broadcast News from Combined English Language and Foreign Media Sources

Carnegie Mellon University

Page 2: Informedia 03/12/97

Multilingual Informedia: Innovations

• Robust Indexing and Retrieval

– Spanish Speech Recognition

– Searchable User Annotations

– Data Extraction for Further Analysis

• Multilingual Document Access

– English or Spanish Queries

– English or Spanish Broadcast Video

Page 3: Informedia 03/12/97

Extending the Informedia Digital Video Library

Original Informedia Goal

– Full content search and retrieval from digital video, audio and text libraries

Technology

– Integrated speech, image and language processing for automated library creation (indexing, segmentation, abstraction, summarization)

Page 4: Informedia 03/12/97

Building on the Informedia Infrastructure

• Video and Audio Segmentation

– improved segmentation algorithms

– extend to multiple languages

• Presentation, Reuse and Interoperability

– abstractions and video summarization (skims)

– “cut and paste” for presentations and reports

– Annotations

  • Initially typed, later spoken

  • Incrementally indexed for immediate retrieval

Page 5: Informedia 03/12/97

Multilingual Integration

– Spanish News Broadcast

– Digitized from PAL to MPEG-1

– Speech Recognition/Alignment by Sphinx-III

– Simple Phrase-based Translation

– Processed Automatically into the Informedia Digital Video Library

Page 6: Informedia 03/12/97

Multilingual Demo

• Running prototype demo

• Demonstration of current technologies

Page 7: Informedia 03/12/97

Title Generation for Informedia News Stories

• Informedia, a multimedia digital library, stores television broadcast news stories.

• An extractive summary feature currently locates snippets in news-story transcripts to use as story titles.

• GOAL: An improved, non-extractive title-generation feature for Informedia.

Page 8: Informedia 03/12/97

KNN-based Topic Detection

• Build training index with pre-labeled topics

– 45000 Broadcast News stories

With new document:

• Search for top 10 related stories in training index

• Look up topics for related stories

• Re-weight topics by story relevance (select top 5)
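
The re-weighting step on this slide can be sketched as follows. This is a minimal illustration, not the Informedia implementation: the format of the search hits, the `story_topics` lookup table, and the function name `detect_topics` are hypothetical, and summing relevance per topic is only one assumed reading of "re-weight topics by story relevance".

```python
from collections import defaultdict


def detect_topics(hits, story_topics, k=10, top_n=5):
    """Rank topics for a new document from its nearest training stories.

    hits: list of (story_id, relevance) pairs returned by a search of the
          training index (hypothetical format).
    story_topics: dict mapping story_id -> list of pre-labeled topics.
    """
    # Keep only the k most relevant training stories.
    nearest = sorted(hits, key=lambda h: h[1], reverse=True)[:k]

    # Assumed weighting: each topic accumulates the relevance of every
    # neighbor that carries it.
    scores = defaultdict(float)
    for story_id, relevance in nearest:
        for topic in story_topics.get(story_id, []):
            scores[topic] += relevance

    # Highest-scoring topics first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:top_n]]
```

With the slide's settings, `k=10` and `top_n=5`, a topic shared by several highly relevant neighbors outranks a topic attached to a single weakly related story.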

Page 9: Informedia 03/12/97

Basic Idea for better Titles

• Train a statistical model on a corpus of documents with human-assigned titles.

• Compare title generation methods:

– Extractive Titles

– Naïve Bayes, EM

– KNN

Page 10: Informedia 03/12/97

Extractive Summarization

• MS Word 2000 AutoSummarize

• Extracts sentences/fragments as summaries

• Similar performance to the TF.IDF implementation at CMU

• Does not use our training corpus

Page 11: Informedia 03/12/97

Naïve Bayes

• Train a statistical model on a corpus of documents with human-assigned titles.

• Title need not be a snippet from the document (contrasts with extractive-summarization techniques).

• Suggested by Witbrock & Mittal, 1999.

• $P(w_{title} \mid w_{doc})$

– works better if $w_{title} = w_{doc}$
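
A minimal sketch of the restricted case where the title word equals the document word: estimate, for each word, how often it appears in a title given that it appears in the document. This is an illustration under stated assumptions, not the model of Witbrock & Mittal or the system evaluated here. The whitespace tokenizer and the names `train_title_model` and `generate_title` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()


def train_title_model(pairs):
    """Estimate P(word in title | word in document) from (document, title) pairs."""
    in_doc = Counter()   # number of documents containing the word
    in_both = Counter()  # number of those whose title also contains it
    for document, title in pairs:
        doc_words = set(tokenize(document))
        title_words = set(tokenize(title))
        in_doc.update(doc_words)
        in_both.update(doc_words & title_words)
    return {word: in_both[word] / in_doc[word] for word in in_doc}


def generate_title(model, document, length=6):
    """Pick up to `length` document words most likely to appear in a title.

    Word order is ignored, and words with zero estimated probability are
    dropped, so the result may be shorter than `length`.
    """
    candidates = set(tokenize(document))
    ranked = sorted(candidates, key=lambda w: (-model.get(w, 0.0), w))
    return [w for w in ranked[:length] if model.get(w, 0.0) > 0.0]
```

Because the estimate is a simple ratio of counts, common words that rarely appear in titles score close to zero without an explicit stop list.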

Page 12: Informedia 03/12/97

(K) Nearest Neighbor

• Index a corpus of documents with human-assigned titles.

• Find the document in the training corpus closest to the current document

• Use that title (k=1)
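
The k=1 procedure can be sketched as a similarity search over the titled corpus. The slide does not say how "closest" is measured, so the bag-of-words cosine similarity below is an assumption, and the names `bag_of_words`, `cosine`, and `nearest_title` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Assumed representation: lowercase whitespace tokens with raw counts.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_title(corpus, document):
    """Return the title of the training document most similar to `document`.

    corpus: non-empty list of (document_text, human_title) pairs.
    """
    query = bag_of_words(document)
    _, best_title = max(
        corpus, key=lambda pair: cosine(query, bag_of_words(pair[0]))
    )
    return best_title
```

The returned title is always a complete human-written title from the training corpus, which is why this method can produce fluent titles that share few words with the new story.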

Page 13: Informedia 03/12/97

Evaluation of Title Accuracy

• Apply to unseen documents

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

• Precision = Correct/Retrieved

• Recall = Correct/All Possible Correct

• Only measured word selection, not order

Should try String Edit Distance (DTW), or Maximal Substring
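
The word-selection F1 defined on this slide could be computed along these lines. This is a sketch under assumptions: titles are compared as sets of lowercase whitespace tokens, so repeated words and word order are ignored, and the function name `title_f1` is hypothetical.

```python
def title_f1(generated, reference):
    """Word-level precision, recall and F1 between two titles, ignoring order."""
    generated_words = set(generated.lower().split())
    reference_words = set(reference.lower().split())
    correct = len(generated_words & reference_words)

    # Precision = Correct / Retrieved; Recall = Correct / All Possible Correct.
    precision = correct / len(generated_words) if generated_words else 0.0
    recall = correct / len(reference_words) if reference_words else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a generated title "simpson civil trial verdict" scored against the reference "simpson civil trial" has three correct words out of four retrieved, giving precision 0.75, recall 1.0, and F1 of 6/7.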

Page 14: Informedia 03/12/97

Multi-Lingual Experiment

• 40000 TV news stories with titles from 1998 Broadcast News CD-ROM

• Tested on 1000 held-out stories, evaluated on titles

• Using SYSTRAN (Babelfish.altavista.com), translated English-French-English

Vocabulary overlap was about 70%

(need)

???

Page 15: Informedia 03/12/97

Example English-French-English

Title: CONTINUING COVERAGE OF O. J. SIMPSON CIVIL TRIAL: MORE PHOTOS OF SIMPSON IN MAGLI SHOES MAY SURFACE

Translation: AGAIN THE SHOES OF SIMPSON OF O J BECOME A FOCAL POINT IN SA CIVIL TEST. AND AGAIN A PHOTOGRAPH EAST A CRUCIAL PART OF THE IMAGE. FELDMAN OF CHARLES OF C N N EXPLAINS. THE SOURCES INDICATE C N N THAT THE LAWYERS FOR THE FAMILIES CONTINUING SIMPSON OF O J HAVE NOW ACCESS TO SEVERAL PHOTOGRAPHS ALLEGEDLY LATELY CLEARLY DISCOVERED TO SHOW SIMPSON CARRYING A PAIR OF SHOES OF BRUNO MAGLI OF SWEDEN. AN EXPERT AS REGARDS F B I A TESTIFIED WITH THE CIVIL TEST TO SIMPSON TO THAT SUCH A PAIR A LEFT TO THE COPIES TO SHOE BEHIND TO THE SCENE TO MURDER THE FORMER WIFE TO SIMPSON AND HIS GOLDMAN TO RON TO FRIEND. THE AGENT FOR THE FAMILIES OF VICTIMS A PRESENTED IN THE OBVIOUSNESS A PHOTOGRAPH TAKEN BY THE OAR OF HARRY OF PHOTOGRAPHER BY AND PUBLISHED INSIDE QUOTE THE QUOTATION MARK NATIONALS OF INVESTIGATOR. A TESTIFIED EXPERT A THAT PHOTO A SHOWN SIMPSON CARRYING THE SHOES...

Method Name | Original Machine Generated Title | Machine Generated Title after Translation

NBL | simpson civil trial simpson's estate news | murder investigation simpson civil search victims

NBF | continuing coverage simpson civil trial verdict | continuing coverage simpson civil trial president

KNN O. J. SIMPSON TRIAL RESUME MONDAY DAY BACK COURT JURORS SIMPSON CIVIL TRIAL

TF.IDF | simpson civil trial photo magli shoes | continuing coverage simpson civil trial president

Page 16: Informedia 03/12/97

Multilingual Results

[Bar chart: F1 (axis 0 to 0.2) by Title Generation Method (NBF, NBL, KNN, TFIDF); series: French, Portuguese, Original]

Fig. 1 F1 Comparison of Title Generation Methods

Page 17: Informedia 03/12/97

Effect of Word Order

[Bar chart: Avg. Words in Correct Order (axis 0 to 0.8) by Title Generation Method (NBF, NBL, KNN, TFIDF); series: French, Portuguese, Original]

Fig. 2 Comparison of Word Order in Title Generation Methods