Question Answering from Errorful Multimedia Streams

AQUAINT PI Meeting – June 2002

Howard D. Wactlar, Carnegie Mellon University, USA

Digital Video Library

Outline

• Goals for QA from multimedia

• Background

- Informedia

- Information extraction

• Determining answer information

• Presenting the answer and follow-up

Why is Multimedia Important

• TV and radio broadcasts record human events across the globe

• Broadcast interviews, analysis and opinions created globally provide varied interpretive perspectives and context

• Images of people, events, maps and charts provide additional content not conveyed orally

- May be correlated with the spoken words

• Some pictures are worth a thousand words

Annual Video and Audio Production

Commercial

• 4500 motion pictures -> 9,000 hours/year (4.5 TB)

• 33,000 TV stations x 4 hrs/day -> 48,000,000 hrs/yr (24,000 TB)

• 44,000 radio stations x 4 hrs/day -> 65,500,000 hrs/yr (3,275 TB)
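
The hour and terabyte figures above imply a storage rate per hour of content. The sketch below only divides the slide's own numbers; the decimal conversion of 1 TB = 1,000 GB is an assumption, as are the names `COMMERCIAL` and `implied_gb_per_hour`.

```python
# Figures copied from the slide: (hours per year, terabytes per year).
COMMERCIAL = {
    "motion pictures": (9_000, 4.5),
    "TV stations": (48_000_000, 24_000),
    "radio stations": (65_500_000, 3_275),
}

GB_PER_TB = 1_000  # assumption: decimal terabytes


def implied_gb_per_hour(hours: float, terabytes: float) -> float:
    """Storage rate implied by one row of the slide."""
    return terabytes * GB_PER_TB / hours


rates = {name: implied_gb_per_hour(h, tb) for name, (h, tb) in COMMERCIAL.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} GB/hour")
```

Under that assumption the two video rows work out to the same rate and the audio row to one tenth of it, which suggests the totals were derived from fixed per-hour rates.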

Personal

• Photographs: 80 billion images -> 410,000 TB/yr

• Home videos: 1.4 billion tapes -> 300,000 TB/yr

• X-rays: 2 billion -> 17,000 TB/yr

Surveillance

• Airports: 14,000 terminals x 140 cameras x 24 hrs/day -> 48 M hrs/day

Background

REQUIREMENTS:

- Automated process for information extraction from video

- Full-content search and retrieval from any spoken language and visual document

Establishment of large video libraries as a network searchable information resource

Mission: Enable Search and Discovery in the Video Medium

APPROACH: Integration of machine speech, image and natural language understanding for library creation and exploration

Exploit operational Informedia DVL infrastructure and technology

Informedia System Architecture

[Figure: Informedia system architecture. OFFLINE, Information Collection & Analysis: surveillance, broadcast TV and radio sources; digital encoding; processing (speech analysis, image analysis, entity extraction, face and OCR text recognition); indexing; indexed database (indexed segmented transcript, compressed audio/video and images); distribution to users. ONLINE, Information Exploration & Discovery: analyst; multimodal queries; relevant result set; browsing and query refinement; requested segment or summarization.]

Related Language Processing Work

• MUC, DUC, TREC (especially the QA track)

- Pronoun and anaphora resolution

- Part-of-speech tagging

- Fact extraction

- Summarization

- Question-answering

…Electronic text focus

Why is Multimedia Hard

• It’s a fundamentally linear, temporal medium

• Speech, image and language understanding are all errorful, ambiguous and incomplete

• Information must be time-synchronized and correlated across modalities for both produced and natural video

• Verbal content lacks:

- sentence boundaries,

- punctuation,

- capitalization

…that enable syntactic analysis

• Image recognition w/o known context is very limited

• Many errors from many sources!

Why We Think the Problems are Tractable

• Lots of data enables LEARNING systems

• Have shown complete or perfect information is not necessary

• Utilize multiple sources of information jointly:

- text, image, audio, web text and databases

Research Focus

• Determining the answer information

- Resolving co-references

- Discovering semantic relations

- Learning Information flow

- Hardening uncertain information

• Organizing and presenting the answer result

- Text summaries

- Augmenting contextual material

- Maps, charts and images to allow follow-up questions

- Explicit representation of uncertainty

Resolving Co-references

• When is the same person mentioned (or seen, or identified)

• Places referenced (in words, on signs, on maps)

• Organizations cited (verbally, on signage, in charts)

• Requires:

- Pronoun resolution

- Merge multiple spellings, abbreviations and contractions

- Merge across media (OCR, audio, text, faces)
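
A minimal sketch of the "merge multiple spellings, abbreviations" requirement, assuming mentions arrive as (source, surface form) pairs. The alias table, the sample mentions and the function names are hypothetical; pronoun resolution and face matching are not covered here.

```python
import re
from collections import defaultdict

# Hypothetical alias table; a real system would learn or curate this.
ALIASES = {"un": "united nations", "u n": "united nations"}


def normalize(mention: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, expand aliases."""
    key = re.sub(r"[^a-z0-9 ]", " ", mention.lower())
    key = " ".join(key.split())
    return ALIASES.get(key, key)


def merge_mentions(mentions):
    """Group (source, surface form) pairs that normalize to the same key."""
    groups = defaultdict(list)
    for source, surface in mentions:
        groups[normalize(surface)].append((source, surface))
    return dict(groups)


# Illustrative mentions from three modalities (made-up data).
mentions = [
    ("speech", "united nations"),
    ("ocr", "UNITED NATIONS"),
    ("text", "U.N."),
    ("text", "United Nations"),
    ("speech", "carnegie mellon"),
]
merged = merge_mentions(mentions)

for entity, found in merged.items():
    print(entity, "<-", found)
```

Exact-key matching is the simplest possible choice. Errorful speech and OCR output would likely need approximate matching as well, which this sketch leaves out.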

Mining Links and Learning Semantic Relations

• Visualize co-occurrence in documents, in location, in time

- Location can be variably sized regions

- Times can be arbitrary periods

• Finding semantic roles for related named entities

- Dr. X is CEO of company Y
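
Co-occurrence over an arbitrary time period can be sketched as pair counting inside a window. The segment records, the day indices and the name `cooccurrence` are illustrative assumptions, not the Informedia data model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (day index, named entities found in one video segment).
segments = [
    (1, {"Dr. X", "Company Y"}),
    (2, {"Dr. X", "Company Y", "City Z"}),
    (9, {"Dr. X", "City Z"}),
]


def cooccurrence(segments, start_day, end_day):
    """Count entity pairs appearing in the same segment within [start_day, end_day]."""
    counts = Counter()
    for day, entities in segments:
        if start_day <= day <= end_day:
            # Sort so each unordered pair always maps to the same key.
            for pair in combinations(sorted(entities), 2):
                counts[pair] += 1
    return counts


first_week = cooccurrence(segments, 1, 7)
all_days = cooccurrence(segments, 1, 10)
print(first_week.most_common())
```

Counts like these only indicate that two entities are related. Assigning a role such as "CEO of" would need a further extraction step that is not shown.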

Active Hardening of Evidence

• Extracted information is noisy

• Acquire new supporting or falsifying evidence from other sources (web)

- On-demand or

- Automatically when original evidence is weak

…Result is higher fidelity information
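
One way to picture hardening is as an odds update. Each new piece of evidence carries a likelihood ratio: above 1 it supports the fact, below 1 it counts against it. The sketch assumes the pieces of evidence are independent; the threshold value and the `fetch_evidence` callback are hypothetical, and this is a stand-in rather than the project's actual method.

```python
WEAK_THRESHOLD = 0.6  # assumption: below this, evidence counts as "weak"


def harden(confidence, likelihood_ratios):
    """Update a confidence in (0, 1) with independent likelihood ratios."""
    odds = confidence / (1.0 - confidence)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


def maybe_harden(confidence, fetch_evidence, threshold=WEAK_THRESHOLD):
    """Fetch extra evidence automatically, but only when the original is weak."""
    if confidence >= threshold:
        return confidence
    return harden(confidence, fetch_evidence())


# fetch_evidence stands in for a web lookup; here it returns fixed ratios.
supported = maybe_harden(0.5, lambda: [3.0])
weakened = maybe_harden(0.5, lambda: [0.5])
untouched = maybe_harden(0.9, lambda: [0.1])
print(supported, weakened, untouched)
```

The on-demand case corresponds to calling `harden` directly, whatever the current confidence.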

Learning Information Flow

[Figure: information-flow graph among news sources. Nodes: CNN, ABC, Radio Deutsche Welle (Germany), Radio Tehran (Iran), Wiretap 1 (Saudi Arabia), Hidden Sources 1-4. Edge types: tightly correlated; information flow; conditional information flow. Edge labels: 3-6 days (twice), 407 days. Topic labels: lifestyle news; news on Middle East.]

Learning Information Flow

• Where did a fact originate?

• Multiple sources report facts over time, with small changes

- E.g. Different newspapers get the same story from AP or Reuters source. Story ‘looks’ different.

- Imagery frequently is reused as well

• Columbia’s Newsblaster exploits this idea for summarization of the core story sentences
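
A rough sketch of how a reporting delay between two sources might be estimated: shift one source's daily story counts against the other's and keep the shift with the largest overlap. The counts are made-up data, and unnormalized cross-correlation is a stand-in technique chosen for illustration, not the method used in the project.

```python
def best_lag(leader, follower, max_lag):
    """Return the lag in days at which follower best overlaps the shifted leader."""
    if len(leader) != len(follower):
        raise ValueError("series must cover the same days")

    def score(lag):
        # Sum of products of leader at day t and follower at day t + lag.
        return sum(leader[t] * follower[t + lag] for t in range(len(leader) - lag))

    return max(range(max_lag + 1), key=score)


# Hypothetical daily counts of one story in two sources over ten days.
source_a = [0, 5, 2, 0, 0, 0, 0, 0, 0, 0]
source_b = [0, 0, 0, 0, 0, 4, 2, 0, 0, 0]

lag = best_lag(source_a, source_b, max_lag=6)
print(f"source_b trails source_a by {lag} days")
```

A consistent positive lag across many stories would hint that one source follows the other, though it cannot by itself show where a fact originated.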

Integrated Analysis Environment

• Summarize multimedia information visually and textually

• Allow explicit display of and control over acceptable level of uncertainty

• Show link structure of entities and relations

• Interactive visualization for drill-down and follow-up

Strategic Advantages of Multimedia Analysis and Response

• Collect Large Amounts of Data

• Learning Approaches

• Leverage across media types

• Perfection is not necessary (80% solution may be ok)

• User in the loop filters remaining errors

• Effective interfaces and visualizations
