Knowledge Organization Systems and Information Discovery
Douglas Tudhope, Inaugural Lecture
TRANSCRIPT
Acknowledgements
Research team members and collaborators
– Ceri Binding (University of Glamorgan)
– Andreas Vlachidis (University of Glamorgan)
– Keith May, English Heritage (EH)
– Stuart Jeffrey, Julian Richards, Archaeology Data Service (ADS), Archaeology Department, University of York
Collaborative acknowledgements
Harith Alani, Paul Beynon-Davies, Dorothee Block, Daniel Cunliffe, Emlyn Everitt, Kora Golub,
Rachel Heery, Chris Jones, Iolo Jones, Steve Harris, Traugott Koch, Marianne Lykke,
Brian Matthews, Stuart Lewis, Hugh Mackay, Jim Moon, Renato Souza, Carl Taylor
Information Discovery
• Literal string match (eg Google) is good for some kinds of searches:
specific concrete topics
where all we want are some relevant results
- we do not care how many we miss!
• Google less good at more conceptual (re)search topics
where it is important to be sure we have not missed anything
eg medical, legal, scholarly research
-------------
• Searching data and documents a recent general research focus
variously termed ... eScience, Digital Humanities, Cyberinfrastructure
- data.gov.uk a recent initiative for government data
Words are tricky!
"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean--neither more nor less." (Lewis Carroll)
• Various potential problems with literal string search
• Different words mean same thing
• Same word means different things
• Trivial spelling differences can affect results
or a particular choice of synonym
or a slightly different perspective in choice of concept
- How to address this issue?
This lecture
• Brief look at the history of work on this topic at Glamorgan
• Examples from recent AHRC funded research
on cross search of different archaeological datasets and reports
- try to give a general flavour
• Discuss some current research issues
Machine readable vs machine understandable
What we say to the machine:
<h1>The Cat in the Hat</h1>
<ul>
<li>ISBN: 0007158440</li>
<li>Author: Dr. Seuss</li>
<li>Publisher: Collins</li>
</ul>
What the machine understands:
<h1></h1>
<ul>
<li></li>
<li></li>
<li></li>
</ul>
(More) machine understandable
What we say to the machine:
<h1>Title: The Cat in the Hat</h1>
<ul>
<li>ISBN: 0007158440</li>
<li>Author: Dr. Seuss</li>
<li>Publisher: Collins</li>
</ul>
What the machine understands:
<h1></h1>
<ul>
<li></li>
<li></li>
<li></li>
</ul>
Conceptual structure (ontology): Book, with ID, Author and Publisher
Vocabularies for terminology and knowledge organization: Theodor Geisel
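As a rough illustration of the step from markup to meaning, the sketch below records the same book as explicit statements against a small conceptual structure, and uses a name vocabulary to link the pen name to Theodor Geisel. The property names, the `book:` identifier scheme and the `AUTHOR_NAMES` table are assumptions made for this example, not part of any published schema.

```python
# Minimal sketch: the book slide as explicit (subject, property, value) statements.
# Property names and identifiers are illustrative assumptions.

# Conceptual structure (ontology): the properties a Book may have.
BOOK_PROPERTIES = {"title", "id", "author", "publisher"}

# Vocabulary: variant names mapped to one preferred form.
AUTHOR_NAMES = {"Dr. Seuss": "Theodor Geisel"}


def describe_book(isbn, title, author, publisher):
    """Return (subject, property, value) statements for one book."""
    subject = f"book:{isbn}"
    preferred_author = AUTHOR_NAMES.get(author, author)
    return [
        (subject, "title", title),
        (subject, "id", isbn),
        (subject, "author", preferred_author),
        (subject, "publisher", publisher),
    ]


def values(statements, prop):
    """Look up every value recorded for a given property."""
    if prop not in BOOK_PROPERTIES:
        raise KeyError(f"unknown property: {prop}")
    return [v for (_, p, v) in statements if p == prop]


statements = describe_book("0007158440", "The Cat in the Hat", "Dr. Seuss", "Collins")
print(values(statements, "author"))
```

Because the author is now a named property with a controlled value, a search for either form of the name could be directed to the same record.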
Knowledge Organization Systems
• Knowledge Organization Systems
eg classifications, thesauri and ontologies
help semantic interoperability
• Reduce ambiguity by defining terms
and providing synonyms
• Organise concepts via semantic relationships
EH Monuments Type Thesaurus
Origins of research
Polytechnic of Wales Research Assistantship (collaborating with Paul Beynon-Davies, Chris Jones - Carl Taylor’s PhD)
Experimental museum exhibit
Extract of collections database - Pontypridd Historical and Cultural Centre
Hard to generalise and maintain if based on manual linking of information
dynamic implicit links
In this case based on Social History and Industrial Classification (SHIC) and indexing for place, time period
Semantic similarity measure
Source Context and Destination Context compared to give a Similarity Coefficient
Based on comparison of sets of SHIC concepts via a computed measure of semantic closeness
Hypermedia navigation tool (find similar) rather than a formal query
General Costume, Social Organisation, Entertainment
Mens-Costume, Sporting Organisation
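One way to picture a similarity coefficient between two sets of classification concepts: closeness between two concepts falls off with the number of hierarchy steps separating them, and each source concept is matched with its closest destination concept. This is a sketch only. The hierarchy fragment, the 1/(1+steps) weighting and the averaging are assumptions made for illustration, not the real SHIC structure or the measure used in the original system.

```python
# Hypothetical hierarchy fragment: child -> parent (not the real SHIC).
PARENT = {
    "Mens-Costume": "General Costume",
    "Sporting Organisation": "Social Organisation",
    "General Costume": "Personal Life",
    "Social Organisation": "Community Life",
    "Entertainment": "Community Life",
}


def ancestors(concept):
    """Return the concept followed by its chain of parents."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def steps_between(a, b):
    """Hierarchy steps from a to b via their nearest shared ancestor, or None."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for i, node in enumerate(chain_a):
        if node in chain_b:
            return i + chain_b.index(node)
    return None


def closeness(a, b):
    """1.0 for identical concepts, falling towards 0 as concepts move apart."""
    steps = steps_between(a, b)
    return 0.0 if steps is None else 1.0 / (1.0 + steps)


def similarity(source, destination):
    """Average, over source concepts, of the best closeness to any destination concept."""
    if not source or not destination:
        return 0.0
    best = [max(closeness(s, d) for d in destination) for s in source]
    return sum(best) / len(source)


source = ["General Costume", "Social Organisation", "Entertainment"]
destination = ["Mens-Costume", "Sporting Organisation"]
print(round(similarity(source, destination), 3))
```

A "find similar" tool could rank candidate destinations by this coefficient without the user writing a formal query.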
FACET - Faceted Access to Cultural hEritage Terminology
Subsequent EPSRC funded project
with Science Museum, National Railway Museum
and J. Paul Getty Trust - Art & Architecture Thesaurus (AAT)
Aims:
• Integration of thesaurus into user interface
• Semantic query expansion
FACET research question
“The major problem lies in developing a system whereby individual parts of subject headings containing multiple AAT terms are broken apart, individually exploded hierarchically, and then reintegrated to answer a query with relevance”
(Toni Petersen, AAT Director)
Example Query: mahogany, dark yellow, brocading, Edwardian, armchair
for National Railway Museum collection - eg royal carriage
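The idea in the quotation can be sketched as follows: each query term is expanded over its own hierarchy, with weights that fall with distance, and the per-term matches are then recombined into one relevance score. The miniature hierarchy, the halving of weight per step, the averaging and the catalogue items are all assumptions made for this example; they are not AAT data or the FACET implementation.

```python
# Hypothetical thesaurus fragment: term -> narrower terms (not real AAT data).
NARROWER = {
    "chair": ["armchair", "side chair"],
    "wood": ["mahogany", "oak"],
    "yellow": ["dark yellow", "pale yellow"],
}


def expand(term, max_steps=2, decay=0.5):
    """Return {term: weight} for a term plus its broader and narrower neighbours."""
    weights = {term: 1.0}
    frontier = [term]
    for step in range(1, max_steps + 1):
        next_frontier = []
        for t in frontier:
            neighbours = list(NARROWER.get(t, []))
            neighbours += [b for b, ns in NARROWER.items() if t in ns]
            for n in neighbours:
                if n not in weights:
                    weights[n] = decay ** step
                    next_frontier.append(n)
        frontier = next_frontier
    return weights


def score(query_terms, item_terms):
    """Expand each query term separately, take its best match, then average."""
    if not query_terms:
        return 0.0
    total = 0.0
    for q in query_terms:
        expanded = expand(q)
        total += max((expanded.get(t, 0.0) for t in item_terms), default=0.0)
    return total / len(query_terms)


query = ["mahogany", "dark yellow", "armchair"]
catalogue = {  # invented index terms for three invented items
    "item A": ["mahogany", "dark yellow", "armchair"],
    "item B": ["wood", "yellow", "chair"],
    "item C": ["oak", "side chair"],
}
ranked = sorted(catalogue, key=lambda k: score(query, catalogue[k]), reverse=True)
for name in ranked:
    print(name, round(score(query, catalogue[name]), 2))
```

Items indexed with broader or sibling terms still appear in the ranking, only lower down, which is the practical difference from an exact match on every term.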
FACET Web Demonstrator - how to generalise?
FACET - more sophisticated search but still a single database
How to generalise to multiple datasets and thesauri?
How to connect with text documents?
STAR Semantic Technologies for Archaeological Resources
• AHRC funded project(s) with English Heritage and the ADS
Generalise previous methods to:
• Different datasets with different structures
• Reports of excavations
ADS OASIS Grey Literature Library (unpublished reports)
OASIS: Online AccesS to the Index of archaeological investigationS
STAR Semantic Technologies for Archaeological Resources
• Currently excavation datasets are isolated, with different terminology systems
• Currently there is no connection with grey literature excavation reports
Aims
• Cross search archaeological datasets and associated grey literature at a conceptual level
STAR Semantic Technologies for Archaeological Resources
• Need for integrating conceptual framework
and terminology control via thesauri and glossaries
• EH (Keith May) designed an ontology describing the archaeological process
The archaeological process
• Events in the present and events in the past,
related by the place in which they occur
and the physical remains in that place
• Activities in the present investigate the remains of the past
(affecting them in the process)
Events in the present
Excavation // Drawing and Photography
Survey // Sampling
Treatments and Processing
Classification // Grouping and Phasing
Measuring including scientific dating
Recording of observations
Dissemination // Interpretation // Analysis
Events in the past have results in the present
• Events shaping natural environment
geological, environmental and biological processes
• Events concerned with object production, disposal or loss
(how ‘finds’ are produced and later deposited in an archaeological context)
• Construction, modification and destruction events relating to human buildings
Events in the past have results in the present
• Conceptual framework to model these archaeological events
(an EH extension of a standard cultural heritage ontology)
• Need to move beyond simple Who – What – Where – When model
typically used in state of the art cultural heritage databases
Typical ‘Advanced Search’ model - does not deal with events
Typical Who - What - Where - When advanced search user interface
[Search form: Who, What, Where and When fields, each with and / or options, searching Resources]
Typical ‘Advanced Search’ limitations
Typical Who - What - Where - When model - needs more semantics
Archaeological ‘find’ (eg coin)
Archaeological ‘context’ (eg hearth)
Typical ‘Advanced Search’ limitations
Need to define relationships between entities
and allow multiple connections
When photo was taken?
When ‘find’ originally made?
When ‘find’ deposited?
Typical ‘Advanced Search’ limitations
Assigning dates and classifying are important ‘events’ in the present
- outcomes of the archaeological process (interpretations can differ)
Who made dating judgment?
Broader conceptual framework (ontology)
Modeling multiple interpretations – linked to underlying data within the ontology
‘multivocality’ in archaeology
Broader conceptual framework (ontology)
EH extension of CIDOC Conceptual Reference Model (CRM)
Explicit modelling of archaeological events – complicated!
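A minimal sketch of why events help: instead of one 'When' field per resource, each date belongs to an event, and each dating judgment records who made it, so differing interpretations can sit side by side. The class and field names below are illustrative assumptions and do not reproduce the CIDOC CRM or the EH extension; the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    kind: str                          # e.g. "production", "deposition", "photography"
    when: str                          # date or period assigned to the event
    assigned_by: Optional[str] = None  # who made the dating judgment, if recorded


@dataclass
class Context:
    label: str                         # e.g. "hearth"


@dataclass
class Find:
    label: str                         # e.g. "coin"
    found_in: Context
    events: List[Event] = field(default_factory=list)

    def when(self, kind: str) -> List[Event]:
        """All recorded events of one kind; interpretations may differ."""
        return [e for e in self.events if e.kind == kind]


coin = Find("coin", Context("hearth"))
coin.events.append(Event("production", "period X", assigned_by="specialist A"))
coin.events.append(Event("production", "period Y", assigned_by="specialist B"))
coin.events.append(Event("deposition", "period Z", assigned_by="excavator"))
coin.events.append(Event("photography", "excavation season"))

for e in coin.when("production"):
    print(e.when, "according to", e.assigned_by)
```

The question "when?" now has to name an event, and the two production dates are both kept rather than one overwriting the other.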
STAR general architecture
STAR web services
EH Thesauri and CRM ontology
Archaeological Datasets (CRM)
• Windows applications
• Browser components
• Full text search
• Browse concept space
• Navigate via expansion
• Cross search archaeological datasets
STAR client applications
STAR datasets (expressed in terms of CRM)
Grey literature indexing (CRM)
Natural Language Processing (NLP) of archaeological grey literature
Extract key concepts in the same semantic representation as for data.
Allows unified searching of different datasets and grey literature
in terms of the same underlying conceptual structure
“ditch containing prehistoric pottery dating to the Late Bronze Age”
STAR Demonstrator – search for a conceptual pattern
An Internet Archaeology publication on one of the (Silchester Roman) datasets we used in STAR discusses the finding of a coin within a hearth.
- does the same thing occur in any of the grey literature reports?
Requires comparison of extracted data with NLP indexing in terms of the ontology.
STAR Demonstrator – search for a conceptual pattern
Research paper reports finding a coin in a hearth – does this exist elsewhere?
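A sketch of the pattern search, assuming both the excavation data and the NLP indexing of reports have already been reduced to statements in one shared form. The source identifiers, the `within` relation name and the sample statements are made up for illustration; STAR itself expresses such statements in terms of the CRM ontology.

```python
# Statements from two kinds of source, in one shared representation:
# (source id, subject concept, relation, object concept). All values invented.
STATEMENTS = [
    ("dataset:site-1", "coin", "within", "hearth"),
    ("dataset:site-1", "pottery", "within", "ditch"),
    ("report:grey-17", "coin", "within", "hearth"),
    ("report:grey-42", "coin", "within", "pit"),
]


def find_pattern(statements, subject, relation, obj):
    """Return the sources in which the conceptual pattern occurs."""
    return sorted(
        {src for (src, s, r, o) in statements if (s, r, o) == (subject, relation, obj)}
    )


matches = find_pattern(STATEMENTS, "coin", "within", "hearth")
print(matches)
```

The same query runs over dataset records and report indexing alike, because neither is searched as raw text.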
Current issues and goals
a) Apply research outcomes in practice (knowledge transfer)
semantic terminology services
‘rubbish example’ using the ADS Archaeology Image Bank
b) NLP challenges
negation!
Negative findings?
c) Multivocality in archaeology
broader picture of the research issues
STAR Semantic Terminology Services - concept expansion (as web service): midden
MIDDEN n dunghill, refuse heap
midden
dunghill, compost heap, refuse heap,
... muddle, mess
... dirty slovenly person
... midden mavis or midden raker --- searchers of refuse heaps
(Concise Scots dictionary - Mairi Robinson, Scottish National Dictionary Association)
ADS Archaeology Image Bank Example
No results when searching for rubbish or refuse – try midden!
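The image bank example can be sketched as a search that first expands the query through a terminology service. Here the service is stood in for by a small lookup table based loosely on the dictionary entry above, with "rubbish" and "refuse" added as assumed entry terms. The function names and the sample captions are assumptions for illustration, not the STAR web service interface or ADS data.

```python
# Stand-in for a terminology service: concept -> equivalent or related labels.
# "rubbish" and "refuse" are assumed entry terms added for this example.
CONCEPTS = {
    "midden": ["dunghill", "refuse heap", "compost heap", "refuse", "rubbish"],
}


def expand_query(term):
    """Return the query term plus every label sharing a concept with it."""
    term = term.lower()
    expanded = {term}
    for concept, labels in CONCEPTS.items():
        if term == concept or term in labels:
            expanded.add(concept)
            expanded.update(labels)
    return expanded


def search(captions, term, use_expansion=True):
    """Literal substring search, optionally over the expanded set of terms."""
    terms = expand_query(term) if use_expansion else {term.lower()}
    return [c for c in captions if any(t in c.lower() for t in terms)]


captions = ["Shell midden, section view", "Roman road surface"]  # invented captions
print(search(captions, "rubbish", use_expansion=False))
print(search(captions, "rubbish"))
```

The matching itself is still a literal string search; only the set of strings searched for has changed.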
Archaeologists have to plan for the future
“Research excavations, therefore, must be planned for posterity, eschewing the quick answer and setting up a framework of excavation and recording which can be handed over, extended, modified and improved over decades and in some cases, centuries.”
Techniques of Archaeological Excavation, Philip Barker (1993)
• Archaeology in particular lends itself to the reuse of (excavation) data
• Connect interpretations with the underlying data
• Revisit previous archaeological interpretations and findings
- excavations inevitably based on a limited sample
Archaeological Multivocality - more voices involved than just original project team?
• Expose (invisible) datasets for wider analysis and reuse
• Meta studies comparing different excavation projects
• Connect datasets and wider grey literature – look for wider patterns
• Open up a broader range of research questions that might be answered when we connect currently isolated excavation datasets
• Allow different communities to share data and expertise
Words are tricky!
We should have a great many fewer disputes in the world if words were taken for what they are,
the signs of our ideas only, and not for things themselves. (John Locke)
• Emergent classification? – an outcome of the archaeological process
- both constructing and constraining the world
• Map between different classifications and glossaries
rather than one imposed standard?
Words are tricky!
Words are not as satisfactory as we should like them to be, but, like our neighbours,
we have got to live with them and must make the best and not the worst of them. (Samuel Butler)
• Major issues remain
• but knowledge organization systems
offer some current assistance for moving beyond literal string search
and making the best of the words we have to use