IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data


Upload: impact-centre-of-competence

Post on 22-Nov-2014


TRANSCRIPT

Page 1: IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data

Indexing and searching of noisy data

Franciska de Jong

University of Twente, cluster Human Media Interaction, Enschede, The Netherlands

Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands

http://hmi.ewi.utwente.nl/~fdejong

IMPACT Closing Event - The Hague

Page 2

Overview

Part I: Noisy data analysis – other examples
Part II: Emerging scenarios of scholarly use
Part III: From noisy (meta)data towards metadata mining

Page 3

Noisy Channel for Spelling Correction

noise: limitations in spelling skills

[Figure: J&M Figure 5.23]

Page 4

Noisy Channel for Speech Recognition

noise: limitations in sound captured

[Figure: J&M Figure 9.2]

Page 5

Noisy Channel for Machine Translation

noise: loss of information through translation

[Figure: J&M Figure 25.15]

Page 6

Noisy Channel for OCR

noise: loss of information through typesetting/handwriting

[Figure: J&M Figure 5.23]
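The four noisy-channel slides share one decoding idea: given an observed noisy form o, choose the intended word w that maximizes P(w) · P(o | w). The following is a minimal sketch of that rule for OCR-style correction, not the method used in the talk. The tiny lexicon, its probabilities, and the channel model (a fixed penalty per character edit) are all illustrative assumptions.

```python
import math

# Hypothetical toy language model P(w); the values are illustrative only.
PRIOR = {"noise": 0.4, "nose": 0.3, "noisy": 0.2, "poise": 0.1}

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def channel_log_prob(observed: str, intended: str, per_edit: float = 0.05) -> float:
    """Assumed channel model: every character edit multiplies P(o | w) by per_edit."""
    return edit_distance(observed, intended) * math.log(per_edit)

def decode(observed: str) -> str:
    """Return argmax over w of log P(w) + log P(observed | w)."""
    return max(PRIOR, key=lambda w: math.log(PRIOR[w]) + channel_log_prob(observed, w))

print(decode("n0ise"))  # an OCR-style confusion of "o" and "0"
```

In a real system the prior would come from a language model trained on large text collections and the channel model from observed confusion statistics; only the argmax structure carries over from this sketch.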

Page 7

Decoding spoken audio

• Audio modelling: collect data on the ground truth for audio segments
• Language modelling: collect data on co-occurrences of words
• 100 hours of speech
• Text data (500 M words)

There is no data like more data
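To make "collect data on co-occurrences of words" concrete, here is a minimal sketch of a bigram language model estimated by counting. The three-sentence corpus and the function names are assumptions for illustration; a production model would be trained on hundreds of millions of words and would use smoothing.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows another; <s> marks the sentence start."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 for an unseen history."""
    total = sum(counts[prev].values()) if prev in counts else 0
    return counts[prev][nxt] / total if total else 0.0

corpus = ["the archive is large", "the archive is noisy", "the data is noisy"]
model = train_bigrams(corpus)
print(bigram_prob(model, "the", "archive"))
```

With so little text, most word pairs are never observed and receive probability zero, which illustrates why more training data matters.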

Page 8

After decoding

• multiple hypotheses with varying probabilities of being correct
• selection from n-best list: errors unavoidable
• post-editing can be an option, but never without extra costs
  – time (editors), money (editing platform)
  – complexity of workflow

Page 9

Impact of noise on access tasks

• Content/metadata with a certain amount of errors
• Search with reduced accuracy:
  – missed hits (false negatives)
  – incorrect hits (false positives; ‘noise’)
• Noisy data less suited for presentation layer
  – pdf versus ascii
  – original audio versus transcript; alternatives: word clouds, related content
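Missed hits and incorrect hits correspond to the standard retrieval measures recall and precision. A minimal sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Precision drops with incorrect hits, recall drops with missed hits."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"doc1", "doc2", "doc3", "doc4"}   # what the user should find
retrieved = {"doc1", "doc2", "doc7"}          # what search over noisy text returns
print(precision_recall(retrieved, relevant))
print("missed hits:", sorted(relevant - retrieved))
print("incorrect hits:", sorted(retrieved - relevant))
```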

Page 10

Access to interviews: transcript generation

[Diagram. Labels: multimedia interview archive; metadata; speech recognition; automatic speech transcription; speaker detection; speech/non-speech detection; transcripts with time stamps and semantic annotations; summarization; text mining; tagging; automatic metadata extraction; users: general public, archivists, researchers; query; search engine; result presentation]

Page 11

Optimization Strategies (1)

• Error correction: post-editing, better recognition
• Improved recognition
  – typically effective for core collections (word error rate, WER, below 20%)
  – less effective for the long tail

Case: interviews with Willem Frederik Hermans
• With models for news: 81% WER
• Aim: reduction to around 60%
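WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, using a word-level edit distance. A minimal sketch; the example sentences are invented for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # word deleted
                           cur[j - 1] + 1,            # word inserted
                           prev[j - 1] + (r != h)))   # word substituted or matched
        prev = cur
    return prev[-1] / len(ref)

# One substitution (writer -> rider) and one deletion (his) over six reference words.
print(word_error_rate("the writer spoke about his novel",
                      "the rider spoke about novel"))
```

Because insertions are counted too, WER can exceed 100% when the hypothesis is much longer than the reference.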

Page 12

Optimization Strategies (2)

• Dedicated/task-specific evaluation
  – for search applications, errors in function words are less critical than errors in e.g. names of persons and locations
• Dedicated weighting schemes for search tasks
  – assign confidence scores to fragments found and rerank search results accordingly
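One way to read the weighting idea is to let the recognizer's per-word confidence decide how much a match counts. The sketch below is an assumption about how such a scheme could look, not the scheme used in the projects described; the `Fragment` type, the scoring rule, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    fragment_id: str
    words: list         # recognized words
    confidences: list   # recognizer confidence per word, in [0, 1]

def score(fragment, query_terms):
    """Sum the confidences of recognized words that match a query term."""
    terms = {t.lower() for t in query_terms}
    return sum(c for w, c in zip(fragment.words, fragment.confidences)
               if w.lower() in terms)

def rerank(fragments, query_terms):
    """Return ids of matching fragments, most trustworthy match first."""
    scored = [(score(f, query_terms), f.fragment_id) for f in fragments]
    return [fid for s, fid in sorted(scored, key=lambda p: -p[0]) if s > 0]

fragments = [
    Fragment("a", ["hermans", "wrote", "novels"], [0.35, 0.90, 0.80]),
    Fragment("b", ["interview", "with", "hermans"], [0.90, 0.95, 0.92]),
    Fragment("c", ["the", "weather", "report"], [0.90, 0.90, 0.90]),
]
print(rerank(fragments, ["Hermans"]))
```

Fragment "a" still appears in the results, but below "b", because its match on the name was recognized with low confidence.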

Page 13

Access to interviews: support for users

[Diagram: same labels as the diagram on page 10, from multimedia interview archive to result presentation]

Page 14

Part II: Emerging scenarios of scholarly use

Page 15

DLs and knowledge discovery

• Focus of attention for analysis is no longer the document alone.
• Room for statistical methods to analyse entire collections, archives, libraries.
• Tools that automatically detect and capture various semantic layers and feed the patterns found back into the metadata structures.
• Discovery versus item finding: room for serendipity and data-driven content exploration.

Page 16

Paradigm evolution

Experimental work
  – Science examples: direct observation
  – Information studies examples: interpretation/decoding of texts

Theoretical modeling
  – Science examples: E = mc², a² + b² = c²
  – Information studies examples: S → NP VP, Principle of Compositionality

Computational modeling
  – Science examples: change simulation
  – Information studies examples: GIS for visualisation of mobility patterns

Data-intensive computing
  – Science examples: particle physics, astronomy
  – Information studies examples: rule-based parsing of large corpora (typology studies); text-mining: cross-document entity linking for cultural heritage libraries

Page 17

More than search: metadata extraction

• For large-scale digital (distributed) collections the potential added value of automatically generated metadata is becoming more and more apparent.
• Automatic content labeling:
  – not just a matter of speeding up the annotation process and enlarging the scope of analysis, also
  – starting point for generating annotation layers at collection level, and
  – basis for link structures for all kinds of semantic aspects of content, such as chronological trends, topic shifts, style and authenticity.
  – potentially noisy

Page 18

“Multi”-issues for DL metadata (1)

• Multi-layer
  – beyond tombstone: content description at fragment level (full text, full content, etc.)
  – free text annotation versus thesaurus-based labeling
• Multiple media formats
  – text, text, text
  – spoken audio, video, still images, music, scores, numerical data, sensor data, census data, etc.

Page 19

“Multi”-issues for DL metadata (2)

• Multiple perspectives
  – cover more than local context
  – cover more than one domain perspective
  – cover more than one language
• Multiple values due to uncertainty
  – multiple human annotators
  – automatic labeling extracted from potentially noisy data
  – dynamics in collection/context

Page 20

Scholarly use

• Comparative perspective
  – Quantitative and qualitative issues
• Need for enhanced content presentation:
  – Multiple layers
  – Links to context
  – Links to related content
• Emerging methodological shift
  – Enhanced collection exploration (think of Google n-grams)
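Collection exploration in the style of n-gram viewers comes down to tracking how often a term occurs per time slice. A minimal sketch for single words, assuming a collection of (year, text) pairs; the three example documents are invented:

```python
from collections import Counter

def yearly_relative_frequency(documents, term):
    """documents: iterable of (year, text) pairs.
    Returns {year: share of that year's tokens equal to term}."""
    hits, totals = Counter(), Counter()
    for year, text in documents:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

docs = [
    (1900, "the telegraph office opened"),
    (1900, "telegraph lines were extended"),
    (1950, "the telephone replaced the telegraph"),
]
print(yearly_relative_frequency(docs, "telegraph"))
```

On OCR'd or automatically transcribed text, misrecognized occurrences of the term go uncounted, so such curves should be read as lower bounds on actual usage.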

Page 21

Part III: From noisy data/metadata towards metadata mining

Page 22

Metadata mining: crucial steps

• Treat all annotation types (classical metadata, automatically extracted metadata, scholarly annotation, community tagging) as assets.
• Learn how to integrate the various types and layers to enhance accessibility and to be able to exploit the knowledge captured in metadata
  – Exploiting manual annotation for machine learning training
  – Detection of collection-level semantic features
  – Innovative interface and interaction design

Page 23

What can metadata mining bring?

• Quality added to metadata for increased accessibility of content:
  – structured search (full text + classification-based)
  – navigation across collections, rich presentation layers
• Increased insight into relations between data collections (across media types, languages, etc.)
• Increased understanding of knowledge production as captured by metadata and annotation processing
• Support for capturing the essence of association and analogy.

There is no data like metadata!

Page 24

Issues for metadata models

Old
• annotation interoperability (e.g., metadata integration for content annotated with coding tools such as thesauri and ontologies)

New
• how to capture fuzziness and uncertainty coming from multiple sources and/or statistical processing
• coding of change over time (e.g., metadata for the dynamics of temporal and geo-spatial details)
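One possible way to capture uncertainty from multiple sources is to store every metadata value as a separate assertion with its source and a confidence, rather than forcing a single value per field. The sketch below is an illustration only: the `Assertion` type, the source names, and the combination rule (1 minus the product of the individual miss probabilities, which assumes independent sources) are assumptions, not part of any metadata standard mentioned in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    field: str
    value: str
    source: str         # hypothetical labels, e.g. "archivist", "ocr-ner"
    confidence: float   # in [0, 1]

def ranked_values(assertions, field):
    """Combine confidences per value as 1 - prod(1 - c), assuming independent sources."""
    miss = defaultdict(lambda: 1.0)
    for a in assertions:
        if a.field == field:
            miss[a.value] *= 1.0 - a.confidence
    return sorted(((v, 1.0 - m) for v, m in miss.items()), key=lambda p: -p[1])

record = [
    Assertion("place", "Enschede", "archivist", 0.9),
    Assertion("place", "Enschede", "ocr-ner", 0.5),
    Assertion("place", "Eindhoven", "ocr-ner", 0.3),
]
print(ranked_values(record, "place"))
```

Keeping the individual assertions means competing values stay visible to the user, and the ranking can be recomputed when a source is later found to be unreliable.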

Page 25

Issues for scholarly users

Individual level
• Learn to deal with imperfection
• Understand the limitations of technological innovation

Community level
• Stay tuned with developers
• Organize methodology teaching
• Study emerging practices
• Share success stories

Page 26

Issues for developers

• Learn about scholarly practices
• Stay tuned with users during the entire process
• Organize structured feedback loops
• Study best practices
• Share responsibility for centers of expertise

Page 27

Issues for e-humanities

• e-humanities is e-research
• multiple media, multiple platforms
• keep connecting!

Page 28

Contact

• email: [email protected] or [email protected]
• url: http://hmi.ewi.utwente.nl/~fdejong