IMPACT final event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data
Indexing and searching of noisy data
Franciska de Jong
University of Twente, cluster Human Media Interaction, Enschede, The Netherlands
Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands
http://hmi.ewi.utwente.nl/~fdejong
IMPACT Closing Event - The Hague
Overview
Part I: Noisy data analysis – other examples
Part II: Emerging scenarios of scholarly use
Part III: From noisy (meta)data towards metadata mining
J&M Figure 5.23
Noisy Channel for Spelling Correction
noise: limitations in spelling skills
Noisy Channel for Speech Recognition
noise: limitations in sound captured
J&M Figure 9.2
Noisy Channel for Machine Translation
noise: loss of information through translation
J&M Figure 25.15
J&M Figure 5.23
Noisy Channel for OCR
noise: loss of information through typesetting/handwriting
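The four noisy channel settings above share one decoding rule. As a hedged reminder of the standard textbook formulation (the symbols W for the intended source and O for the noisy observation are chosen here for illustration and do not come from the slides):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \underbrace{P(O \mid W)}_{\text{channel model}} \; \underbrace{P(W)}_{\text{source model}}
```

For spelling correction O is the typed string, for speech recognition the audio, for machine translation the foreign sentence, and for OCR the scanned image; in each case W is the text one wants to recover.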
Decoding spoken audio
• Audio modelling: collect data on the ground truth for audio segments
– 100 hours of speech
• Language modelling: collect data on co-occurrences of words
– text data (500 M words)
There is no data like more data
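A minimal sketch of what "collecting data on co-occurrences of words" can look like in practice, assuming a plain bigram model without smoothing. The function names and the toy sentences are illustrative only and are not taken from the talk.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs over a list of whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def bigram_probability(counts, first, second):
    """Relative frequency estimate of P(second | first); 0.0 if `first` was never seen."""
    total = sum(c for (w1, _), c in counts.items() if w1 == first)
    return counts[(first, second)] / total if total else 0.0


# Toy corpus (hypothetical); a real language model would use hundreds of millions of words.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = bigram_counts(corpus)
p_cat_given_the = bigram_probability(counts, "the", "cat")
```

With so little text most word pairs are never observed, which is one way to read the slogan above: the estimates only become useful with much more data.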
After decoding
• multiple hypotheses with varying probabilities of being correct
• selection from n-best list: errors unavoidable
• post-editing can be an option, but never without extra costs
– time (editors), money (editing platform)
– complexity of workflow
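The n-best selection step can be sketched as follows, assuming each hypothesis comes with a log score from the decoder. The scores and example hypotheses are invented for illustration. The point is that even the winning hypothesis usually has a posterior below 1, so some errors remain.

```python
import math


def posteriors(log_scores):
    """Turn decoder log scores into probabilities that sum to 1 over the n-best list."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def select_best(nbest):
    """nbest: list of (text, log_score) pairs. Returns (best_text, its posterior)."""
    probs = posteriors([score for _, score in nbest])
    best_index = max(range(len(nbest)), key=probs.__getitem__)
    return nbest[best_index][0], probs[best_index]


# Hypothetical 2-best list for one audio fragment.
nbest = [("recognize speech", -1.0), ("wreck a nice beach", -1.5)]
best_text, best_prob = select_best(nbest)
```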
Impact of noise on access tasks
• Content/metadata with a certain amount of errors
• Search with reduced accuracy:
– missed hits (false negatives)
– incorrect hits (false positives; ‘noise’)
• Noisy data less suited for presentation layer
– pdf versus ascii
– original audio versus transcript; alternatives: word clouds, related content
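Missed and incorrect hits are usually quantified as recall and precision. A small sketch, with made-up document identifiers, of how the two measures follow from the set of retrieved items and the set of truly relevant items:

```python
def precision_recall(retrieved, relevant):
    """Precision drops with incorrect hits (false positives);
    recall drops with missed hits (false negatives)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result over a noisy index.
retrieved = {"d1", "d2", "d3", "d4"}   # d3 and d4 are incorrect hits
relevant = {"d1", "d2", "d5"}          # d5 is a missed hit
precision, recall = precision_recall(retrieved, relevant)
```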
Access to interviews: transcript generation
[Diagram labels: multimedia interview archive; metadata; speech recognition; automatic speech transcription; speaker detection; transcripts with time stamps and semantic annotations; summarization; speech/non-speech detection; text mining; tagging; automatic metadata extraction; users: general public, archivists, researchers; query; search engine; result presentation]
Optimization Strategies (1)
• Error correction: post-editing, better recognition
• Improved recognition
– typically effective for core collections (WER below 20%)
– less effective for the long tail
Case: interviews with Willem Frederik Hermans
• With models for news: 81% WER
• Aim: reduction to around 60%
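WER (word error rate) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation and equal cost for substitutions, insertions and deletions; the example sentences are invented:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100%; a WER of 81% means that the number of word errors is roughly four fifths of the reference length.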
Optimization Strategies (2)
• Dedicated/task-specific evaluation
– for search applications, errors in function words are less critical than errors in e.g. names of persons and locations
• Dedicated weighting schemes for search tasks
– assign confidence scores to fragments found and rerank search results accordingly
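One possible reading of the weighting idea, sketched under stated assumptions: each hit carries a relevance score from the search engine and a recognizer confidence for the matched fragment, both on a 0 to 1 scale, and the two are mixed linearly. The `Hit` fields, the linear mix and the default weight are illustrative choices, not the scheme used in the project.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    fragment_id: str
    relevance: float   # search engine score, assumed to lie in [0, 1]
    confidence: float  # recognizer confidence for the matched words, in [0, 1]


def rerank(hits, weight=0.5):
    """Sort hits by a linear mix of relevance and recognition confidence.

    weight=0 reproduces the original ranking by relevance alone;
    weight=1 ranks purely on confidence.
    """
    def combined(hit):
        return (1 - weight) * hit.relevance + weight * hit.confidence

    return sorted(hits, key=combined, reverse=True)


# Hypothetical hits: "a" matches well but on poorly recognized audio.
hits = [Hit("a", relevance=0.9, confidence=0.2), Hit("b", relevance=0.8, confidence=0.9)]
reranked = rerank(hits)
```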
Access to interviews: support for users
[Diagram labels: multimedia interview archive; metadata; speech recognition; automatic speech transcription; speaker detection; transcripts with time stamps and semantic annotations; summarization; speech/non-speech detection; text mining; tagging; automatic metadata extraction; users: general public, archivists, researchers; query; search engine; result presentation]
• Part II: Emerging scenarios of scholarly use
DLs and knowledge discovery
• Focus of attention for analysis is no longer the document alone.
• Room for statistical methods to analyse entire collections, archives, libraries.
• Tools that automatically detect and capture various semantic layers and feed the patterns found back into the metadata structures.
• Discovery versus item finding: room for serendipity and data-driven content exploration.
Paradigm evolution
Paradigm | Science examples | Information studies examples
Experimental work | direct observation | interpretation/decoding of texts
Theoretical modeling | E = mc²; a² + b² = c² | S → NP VP; Principle of Compositionality
Computational modeling | change simulation | GIS for visualisation of mobility patterns
Data-intensive computing | particle physics, astronomy | rule-based parsing of large corpora (typology studies); text-mining: cross-document entity linking for cultural heritage libraries
More than search: metadata extraction
• For large-scale digital (distributed) collections the potential added value of automatically generated metadata is becoming more and more apparent.
• Automatic content labeling:
– not just a matter of speeding up the annotation process and enlarging the scope of analysis, also
– starting point for generating annotation layers at collection level, and
– basis for link structures for all kinds of semantic aspects of content, such as chronological trends, topic shifts, style and authenticity.
– potentially noisy
“Multi”-issues for DL metadata (1)
• Multi-layer
– beyond tombstone: content description at fragment level (full text, full content, etc.)
– free text annotation versus thesaurus-based labeling
• Multiple media formats
– text, text, text
– spoken audio, video, still images, music, scores, numerical data, sensor data, census data, etc.
Multi-issues for DL metadata (2)
• Multiple perspectives
– cover more than local context
– cover more than one domain perspective
– cover more than one language
• Multiple values due to uncertainty
– multiple human annotators
– automatic labeling extracted from potentially noisy data
– dynamics in collection/context
Scholarly use
• Comparative perspective
– Quantitative and qualitative issues
• Need for enhanced content presentation:
– Multiple layers
– Links to context
– Links to related content
• Emerging methodological shift
– Enhanced collection exploration (think of Google n-grams)
Part III: From noisy data/metadata towards metadata mining
Metadata mining: crucial steps
• Treat all annotation types (classical metadata, automatically extracted metadata, scholarly annotation, community tagging) as assets.
• Learn how to integrate the various types and layers to enhance accessibility and to be able to exploit the knowledge captured in metadata
– Exploiting manual annotation for machine learning training
– Detection of collection-level semantic features
– Innovative interface and interaction design
What can metadata mining bring?
• Quality added to metadata for increased accessibility of content:
– structured search (full text + classification-based)
– navigation across collections, rich presentation layers
• Increased insight in relations between data collections (across media types, languages, etc.)
• Increased understanding of knowledge production as captured by metadata and annotation processing
• Support for capturing the essence of association and analogy.
There is no data like metadata!
Issues for metadata models
Old
• annotation interoperability (e.g., metadata integration for content annotated with coding tools such as thesauri and ontologies)
New
• how to capture fuzziness and uncertainty coming from multiple sources and/or statistical processing
• coding of change over time (e.g., metadata for the dynamics of temporal and geo-spatial details)
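To make the "new" issue concrete, here is one hypothetical way a metadata field could keep several competing values together with their source and a confidence, instead of a single authoritative value. The class names, the source labels and the 0 to 1 confidence scale are assumptions made for this sketch; they do not describe an existing metadata standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    value: str
    source: str        # e.g. "archivist", "ocr", "asr" (illustrative labels)
    confidence: float  # 0.0 (pure guess) .. 1.0 (certain)


@dataclass
class MetadataField:
    name: str
    assertions: list = field(default_factory=list)

    def add(self, value: str, source: str, confidence: float) -> None:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie between 0 and 1")
        self.assertions.append(Assertion(value, source, confidence))

    def best(self) -> Assertion:
        """Most confident assertion; raises ValueError if the field is empty."""
        return max(self.assertions, key=lambda a: a.confidence)

    def alternatives(self) -> list:
        """All assertions, most confident first, so no competing value is discarded."""
        return sorted(self.assertions, key=lambda a: a.confidence, reverse=True)


# Hypothetical example: a place name from a human annotator and from OCR output.
place = MetadataField("place")
place.add("Enschede", source="archivist", confidence=0.95)
place.add("Enschedc", source="ocr", confidence=0.40)
```

Keeping the lower-confidence values available, rather than overwriting them, is what lets later processing or a human reviewer revisit the decision.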
Issues for scholarly users
Individual level
• Learn to deal with imperfection
• Understand the limitations of technological innovation
Community level
• Stay tuned with developers
• Organize methodology teaching
• Study emerging practices
• Share success stories
Issues for developers
• Learn about scholarly practices
• Stay tuned with users during the entire process
• Organize structured feedback loops
• Study best practices
• Share responsibility for centers of expertise
Issues for e-humanities
• e-humanities is e-research
• multiple media, multiple platforms
• keep connecting!
Contact
• email: [email protected] or
• url: http://hmi.ewi.utwente.nl/~fdejong