approaches to automated metadata extraction : fixrep project
Post on 23-Jun-2015
A centre of expertise in digital information management
www.ukoln.ac.uk
UKOLN is supported by:
Approaches to automated metadata extraction : FixRep Project
Emma Tonkin
e.tonkin@ukoln.ac.uk
www.bath.ac.uk
Wouldn't it be nice if...
• ...computers could author our metadata for us, thus saving a lot of hassle?
• Mechanical metadata extraction vs manual metadata input
But...
• Automated tools are fallible
• There's never quite enough information available
• Templates change, different domains have different standards
• In short, computers are often wrong, and so are people
The 'half a loaf' hypothesis
• Hybrid approach:
– Get what metadata you can
– Ask the user to check and clean it if necessary
• Philosophy:
– If the computer gets it wrong, we can fix it later
Wouldn’t it be nice if…
• …computers could fix our metadata for us?
• Or, more realistically, help us do this work for ourselves.
• All about ‘fixing it later’, doing what we can with what we have
• Automated metadata extraction + metadata consistency assessment
• Metadata generation, evaluation, characterisation: enabling metadata triage
1)Challenges in automated metadata extraction
2)Manual metadata generation
3)Metadata extraction in brief
4)Practical use as part of a repository deposit workflow
5)A user study comparing manual and hybrid input
6)Towards metadata triage
Whatever can go wrong...
• PDFs can be:
– Encrypted
– Corrupted
– Oddly encoded
– An image file without embedded text
– Occurrence: ~3-6%
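The failure cases above could be flagged before extraction is attempted. The following is only a rough sketch: the function name `pdf_problems` and the byte-level checks are assumptions made for illustration, not part of the FixRep tooling, and the checks are heuristics that can misfire (for example, font resources stored inside compressed object streams will not be seen).

```python
def pdf_problems(data: bytes) -> list[str]:
    """Return heuristic warnings for the raw bytes of a candidate PDF.

    Hypothetical helper: a real workflow would hand the file to a PDF
    library; these byte searches only give a cheap first impression.
    """
    problems = []
    # A well-formed PDF starts with a "%PDF-" version header.
    if not data.lstrip().startswith(b"%PDF-"):
        problems.append("missing %PDF- header (corrupted or not a PDF)")
    # The end-of-file marker normally sits near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker (possibly truncated)")
    # Encrypted PDFs reference an /Encrypt dictionary in the trailer.
    if b"/Encrypt" in data:
        problems.append("encryption dictionary present")
    # No visible font resources hints at a scanned, image-only document.
    if b"/Font" not in data:
        problems.append("no font resources found (possibly image-only)")
    return problems


if __name__ == "__main__":
    sample = b"%PDF-1.4\n1 0 obj << /Font << >> >> endobj\n%%EOF\n"
    print(pdf_problems(sample))
```

Files that raise any warning could be routed to manual entry or OCR rather than to the text-based extractor.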
Character sets
• Ligatures
• Accents
• Symbols
These may not always be extractable from PDFs
Image © Daniel Ullrich
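When ligatures and accents do survive extraction, they often arrive as compatibility characters or as a base letter plus a combining mark. A minimal sketch of one common clean-up step, Unicode NFKC normalisation, follows; the function name is hypothetical, and this does nothing for characters that were lost during extraction.

```python
import unicodedata


def normalise_extracted_text(text: str) -> str:
    """Fold compatibility characters into their plain equivalents.

    NFKC replaces ligature code points such as U+FB01 with the separate
    letters "fi", and composes a letter followed by a combining accent
    into a single precomposed character.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(normalise_extracted_text("\ufb01rst-order"))
```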
Document formats/layouts
• Many possible formats
• Some formats not widely supported
• Document layouts vary widely, esp. by discipline
1)Challenges in metadata extraction
2)Manual metadata generation
3)Metadata extraction in brief
4)Practical use as part of a repository deposit workflow
5)A user study comparing manual and hybrid input
6)Towards metadata triage
Whatever can go wrong... (II)
• Function following form – interface
• Model adapted to suit unique user needs
• Data model incompletely supported
• Input validation issues
• Systematic error; typos; localisation; encoding; etc.
• Lots of past work in characterising manual input errors
1)Challenges in metadata extraction
2)Manual metadata generation
3)Metadata extraction in brief
4)Practical use as part of a repository deposit workflow
5)A user study comparing manual and hybrid input
Image segmentation, templating & OCR
Working from text
• There are a number of possible states (i.e. title, author, email, affiliation, abstract)
• Directed graph with probabilities
– Markov chain: for example,
Title → Author → Email → Affil.
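The directed graph with probabilities can be sketched as a transition table. All numbers below are invented for illustration (in practice they would be estimated from labelled documents), and the names `TRANSITIONS` and `sequence_probability` are hypothetical.

```python
# Hypothetical transition probabilities between document-header states.
# Each row sums to 1; the values are made up for this example.
TRANSITIONS = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"author": 0.5, "email": 0.3, "affiliation": 0.2},
    "email": {"email": 0.2, "author": 0.2, "affiliation": 0.6},
    "affiliation": {"affiliation": 0.6, "author": 0.1, "abstract": 0.3},
    "abstract": {"abstract": 1.0},
}


def sequence_probability(states, start="title"):
    """Probability of a state sequence under the chain, given its start state."""
    if not states or states[0] != start:
        return 0.0
    probability = 1.0
    # Multiply the probability of each consecutive transition.
    for previous, current in zip(states, states[1:]):
        probability *= TRANSITIONS[previous].get(current, 0.0)
    return probability


if __name__ == "__main__":
    print(sequence_probability(["title", "author", "email", "affiliation"]))
    print(sequence_probability(["title", "abstract"]))
```

Sequences that follow the usual header order score higher than unusual ones; transitions absent from the table score zero.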
Hidden Markov Model
• We cannot directly see these states – only the words
• But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented
• This may be expressed in terms of an HMM
• Bayesian statistics used across term appearance
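As an illustration of how hidden states might be guessed from the visible words, here is a toy Viterbi decoder. It is a sketch under stated assumptions, not the project's model: the three states, every probability, and the surface-feature emission model (contains "@", all upper case, anything else) are invented for the example, and the email address in the usage lines is hypothetical.

```python
import math

STATES = ("title", "author", "email")

# Hypothetical start and transition probabilities (each row sums to 1).
START = {"title": 0.8, "author": 0.15, "email": 0.05}
TRANS = {
    "title": {"title": 0.7, "author": 0.25, "email": 0.05},
    "author": {"title": 0.05, "author": 0.7, "email": 0.25},
    "email": {"title": 0.05, "author": 0.15, "email": 0.8},
}

# Hypothetical emission probabilities over three coarse token features.
EMIT = {
    "title": {"at": 0.01, "upper": 0.09, "other": 0.90},
    "author": {"at": 0.01, "upper": 0.80, "other": 0.19},
    "email": {"at": 0.95, "upper": 0.02, "other": 0.03},
}


def feature(token):
    """Reduce a token to a coarse observable feature."""
    if "@" in token:
        return "at"
    if token.isupper():
        return "upper"
    return "other"


def viterbi(tokens):
    """Return the most probable state for each token (log-space Viterbi)."""
    if not tokens:
        return []
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][feature(tokens[0])])
        for s in STATES
    }
    backpointers = []
    for token in tokens[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            best = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best]
                + math.log(TRANS[best][s])
                + math.log(EMIT[s][feature(token)])
            )
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    tokens = ["Confirmation-Guided", "Discovery", "of", "First-Order", "Rules",
              "PETER", "A.", "FLACH", "flach@example.org"]
    for token, state in zip(tokens, viterbi(tokens)):
        print(f"{state:7} {token}")
```

A trained model would learn the emission statistics from term frequencies in a labelled collection instead of relying on three hand-picked features.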
Example parse
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• ...
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection
1)Challenges in metadata extraction
2)Manual metadata generation
3)Metadata extraction in brief
4)Practical use as part of a repository deposit workflow
5)A user study comparing manual and hybrid input
6)Towards metadata triage
Aims
• Adaption of existing interfaces
• Enhancing rather than rewriting
• Cross-platform, accessible interface
• Simple reusable REST API, metadata as DC/XML
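To make "metadata as DC/XML" concrete, the sketch below serialises an extracted record as simple Dublin Core elements. The wrapper element name `metadata`, the function name, and the record layout are assumptions; the slides do not specify the actual response format of the service.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(record: dict[str, list[str]]) -> str:
    """Serialise {dc_element_name: [values]} as a small DC/XML document.

    The <metadata> wrapper is a placeholder chosen for this example.
    """
    root = ET.Element("metadata")
    for element_name, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_dc_xml({
        "title": ["Confirmation-Guided Discovery of First-Order Rules"],
        "creator": ["Peter A. Flach", "Nicolas Lachiche"],
    }))
```

A deposit interface could then pre-fill its form fields from such a response and leave correction to the user.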
Sample interfaces
Sample interfaces
Architecture
Using what we know...
1)Challenges in metadata extraction
2)Manual metadata generation
3)Metadata extraction in brief
4)Practical use as part of a repository deposit workflow
5)A user study comparing manual and hybrid input
6)Towards metadata triage
Question:
• “Do people accept ‘hybrid’ interfaces?”
• Here’s one we did earlier…
Hypotheses
• Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.
• The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.
• User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.
• Measured: speed and accuracy of entering information manually versus hybrid entry and, qualitatively, user satisfaction
Results: Timing
• Hybrid faster under both conditions
• (Summary of median times)
Results: Accuracy
• Tested against ground-truth
• Keyword accuracy: First keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top-5 cover 81% of all desired keywords.
• Manual metadata accuracy:
– Few users use cut and paste
– Capitalisation, punctuation frequently differs
– Synonyms are accidentally substituted
• Hybrid closer to ground-truth, and more complete, but results not clear-cut.
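Two of the measurements above can be sketched in a few lines: a comparison against ground truth that ignores capitalisation and punctuation differences, and the share of desired keywords covered by the top-k suggestions. This is an illustrative sketch only; the study's actual scoring procedure is not described in the slides, and the function names and the sample data in the usage lines are invented.

```python
import re
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lower-case, turn punctuation into spaces, and collapse whitespace."""
    value = re.sub(r"[^\w\s]", " ", value.lower())
    return " ".join(value.split())


def similarity(entered: str, truth: str) -> float:
    """Similarity in [0, 1] between an entered value and the ground truth."""
    return SequenceMatcher(None, normalise(entered), normalise(truth)).ratio()


def coverage_at_k(suggested: list[str], desired: list[str], k: int) -> float:
    """Fraction of desired keywords found among the first k suggestions."""
    if not desired:
        return 0.0
    top = {normalise(s) for s in suggested[:k]}
    wanted = {normalise(d) for d in desired}
    return len(top & wanted) / len(wanted)


if __name__ == "__main__":
    print(similarity("confirmation-guided discovery of first order rules.",
                     "Confirmation-Guided Discovery of First-Order Rules"))
    print(coverage_at_k(["Machine learning", "logic", "rules"],
                        ["machine learning", "rules", "induction", "ILP"], 2))
```

Normalising first means that differences in capitalisation and punctuation do not count as errors, while substituted synonyms still lower the score.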
Qualitative results
• Most users preferred the hybrid mode
• Most perceived it to be faster than manual data entry
• Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between hybrid and manual approach
• Both were of good quality
Discussion
• Results support hypotheses
• People prefer the hybrid interface, and found it more satisfying to use
• Accessibility issues exist, but can be overcome
• The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty – i.e. it did more than the subject wanted!
1)Challenges in metadata extraction
2)Manual metadata generation
3)Metadata extraction in brief
4)Practical use as part of a repository deposit workflow
5)A user study comparing manual and hybrid input
6)Towards metadata triage
MetRe prototype (2008)
• Characteristic classes of individual/systematic error highlighted
• Nb. local and general best practice. Uses: ranking, browsing, correcting systematic error
• Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
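One way to picture pattern ranking over harvested metadata is to group field values that differ only in case or punctuation and report the minority spellings. The sketch below is an assumption about how such a check could look, not a description of how MetRe works; the function name and sample values are invented.

```python
from collections import Counter, defaultdict


def variant_report(values):
    """Group values that differ only in case or punctuation.

    Returns a sorted list of (most_common_form, [other_forms]) for every
    group containing more than one distinct spelling.
    """
    groups = defaultdict(Counter)
    for value in values:
        # Crude grouping key: lower-case, drop full stops and commas.
        key = " ".join(value.lower().replace(".", " ").replace(",", " ").split())
        groups[key][value] += 1
    report = []
    for counts in groups.values():
        if len(counts) > 1:
            preferred, _ = counts.most_common(1)[0]
            others = sorted(v for v in counts if v != preferred)
            report.append((preferred, others))
    return sorted(report)


if __name__ == "__main__":
    harvested = ["University of Bath", "University of Bath",
                 "university of bath", "Univ. of Bath", "UKOLN"]
    print(variant_report(harvested))
```

Abbreviations such as "Univ." are not caught by this key; recognising them would need the kind of domain-specific knowledge that the Issues slide mentions.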
Issues
• Discipline/domain-specific issues
• Lots of information required to do this right (see metadata schema/terminology registry)
• Some APs present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)
Approach
• Generally dependent on heuristics over available data
• Powered by very specific functions (classifiers, validation, etc…)
• Potentially expensive, not always domain-independent
Future work
• More!
– Data
– Filters (input/output formats)
– Methods
– Evaluation
– Service availability (mail me for announcements!)
Conclusion
• Metadata creation can be supported through software
• Specific problem sets in metadata triage
• Work continues in the FixRep project
Conclusion (II)
• Formal Metadata Extraction/evaluation
• Metadata review process
• Accessibility metadata
• Entity extraction (named entities, geographical, temporal [k-int!])
• Repository integration
• Thanks!
• Comments/Questions?
• www.ukoln.ac.uk/projects/fixrep