Hans Uszkoreit
German Research Center for Artificial Intelligence and Saarland University
Semantic Annotation and Hyperlinking for Associative Digital Memories: Vision, Methods, Applications
SEMANTIC ANNOTATION AND HYPERLINKING • EUROLAN 2003© 2003 H. Uszkoreit
The Vision of the Semantic Web
A new development in web technology is aimed at structuring some of the rich knowledge contained in unstructured data. The envisaged result will be a growing layer of formalized knowledge above and associated with the wealth of unstructured data. A multitude of ontologies will provide the conceptual texture for annotating rich unstructured content. The result will be a semantically structured, densely associated web of knowledge.
Structure of the Semantic Web
The well-known layer cake of the Semantic Web proposed by Tim Berners-Lee employs...
• XML for markup,
• relational ontologies as the basis for describing information resources,
• RDF coded in XML as the language for such semantic descriptions,
• a logic language such as OWL coded in RDF as the format for further logical descriptions such as rules and constraints.
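As a rough illustration of the RDF layer, a semantic description can be thought of as a set of subject, predicate, object triples. The sketch below is not an RDF library: the `Triple` type, the `to_ntriples` helper and all example URIs are assumptions made for this illustration, and the output only imitates the N-Triples line format.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # either a URI or a plain literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as an N-Triples-like line (URIs in <>, literals quoted)."""
    def term(x: str) -> str:
        return f"<{x}>" if x.startswith("http") else f'"{x}"'
    return f"{term(t.subject)} {term(t.predicate)} {term(t.obj)} ."


# Hypothetical resource and properties, invented for this example.
description = [
    Triple("http://example.org/people/jane",
           "http://example.org/terms/homepage",
           "http://example.org/~jane"),
    Triple("http://example.org/people/jane",
           "http://example.org/terms/name",
           "Jane Doe"),
]
lines = [to_ntriples(t) for t in description]
```

Each line describes one relation between an information resource and a value, which is the kind of statement the upper layers (ontologies, rules, constraints) would then reason over.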
Semantic Web and Language Technology
I will first point out five central issues for language technology resulting from the vision of the Semantic Web.
I will then briefly argue that there are no feasible models for realizing the Semantic Web through creation or evolution.
Next, I will argue that there is a less ambitious stage of a semantically enriched web that can be realized gradually.
This vision is built on the notion of associative digital memories lying in between digital repositories and digital knowledge.
Then I will describe the language and web technologies needed for realizing such digital memories.
Semantic Web and Language Technology 1
The employment of language technology for the construction of useful ontologies:
One of the shortcomings of hand-crafted AI ontologies has been their artificial nature. Useful ontologies rarely meet the high aesthetic standards of philosophers or domain-specialized theoreticians. Can data-oriented language technology facilitate the detection of useful ontologies that reflect the needs and daily tasks of their users?
Semantic Web and Language Technology 2
The exploitation of Semantic Web ontologies for LT applications such as information extraction:
• Domain modelling is a serious bottleneck for many language technology applications. Can the Semantic Web movement help us by providing well-designed ontologies for a multitude of knowledge domains?
Semantic Web and Language Technology 3
The challenge of (partially) automating the detection and annotation of concepts:
• One of the major shortcomings of the original Semantic Web vision is its reliance on extensive hand annotation of large volumes of digital resources. As we know from daily experience, content developers (authors) do not even exploit the modest means for encoding meta-information that are provided by HTML. They do not have the time and patience to find and insert the most useful hyperlinks. How can one expect that the web will become semantified by human annotation?
Semantic Web and Language Technology 4
The utilization of the Semantic Web as a resource for machine learning in NLP:
• Supervised learning from hand-annotated texts plays a major role in language technology research and development. Will the Semantic Web movement create large volumes of annotated texts? Can these texts be used for machine learning techniques that improve topic detection, information extraction, question answering and other language technologies? Can systems for automatic annotation be trained in a bootstrapping fashion?
Is the vision realistic?
Authors make little use of the available means of annotation/markup such as
~ hyperlinks
~ metainformation
The enrichment of the available volumes of digital information is a huge task.
Semantic Web and Language Technology 5
The relationship between the Semantic Web and multilinguality:
• The planned dense semantic markup will facilitate cross-lingual navigation and information retrieval. Will the semantic web really contribute to overcoming language barriers by making information better accessible across languages? Will contents in all languages be annotated and crosslinked at the same time and in comparable proportions? What is the role of language technology in this process? Will the Semantic Web help to reduce the knowledge gap among or will this gap be widened?
outline
the concept of digital memory
automatic semantic hyperlinking
personal digital memories
collective digital memories
conclusions
more than metaphors?
digital libraries
digital archives
digital knowledge
digital memories
(associative) memory
stored information
associatively interconnected
immediately accessible by association
grounded in experience
Special form of memory: episodic memory
knowledge
stored information
strongly semantically interconnected
immediately accessible
suited for inferencing
grounded in more basic knowledge and perception
associations
neighborhood in a high-dimensional space
accessibility paths
connections in a graph
hyperlinks
the concept behind the success of the Internet
hypertext: associatively interconnected text
hypermedia: associatively interconnected medial representation of information
association is more than a reference: it is an access mechanism
THE ONE-CLICK APPROACH
New wireless voice technology introduced
Posted at 5:09 PM PT, Feb 8, 1999
By Stephen Lawson, InfoWorld Electric
NTT Labs on Monday brought Dick Tracy into the enterprise, introducing a wireless voice and data system that can use a wrist radio at the Demo 99 conference.
AirWave technology, demonstrated for the first time in the United States at this week's conference in Indian Wells, Calif., is based on a wireless PBX. Small, handheld phones -- and a wrist radio that looks like an oversized watch -- can be used to make voice calls and exchange data around a building or campus. The handheld phones can be switched to a public cellular mode to become conventional cell phones.
Company representatives touted the system as offering higher voice quality than a typical PBX. AirWave is based on NTT's Personal Handyphone System, which is currently deployed by more than 600 users in Japan, according to the company.
Modems built in to both devices allow users to plug in a notebook or portable device for dial-up data connections as fast as 64Kbps. Users can exchange files or e-mail, or access a LAN or the Internet. There is no airtime charge for AirWave communications in the building or campus. AirWave systems are scheduled to be available through distribution partners by the end of this year, priced as low as $400 per user.
NTT Labs, the research and development arm of NTT Corp., in Tokyo, can be reached at www.nttlabs.com.
Company Info | Homepage | Other News Products | Indicators | Contact Experts | Contacts Accounts
Language Technology
recognition of domain-relevant named entities with statistical and rule-based methods
tolerance with respect to morphological and syntactic variation
recognition of synonyms
exploitation of thesauri and ontologies with conceptual relations
recognition of syntactic functions and thematic roles for appropriate anchor specification
annotation of documents with hyperlink designators
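A minimal sketch of two of these steps: tolerance for surface variants, and annotation of a document with hyperlink designators. It covers neither statistical named-entity recognition nor syntactic functions. The gazetteer entries, the concept identifiers and the `[[concept|surface]]` designator syntax are all assumptions made for this illustration.

```python
import re

# Hypothetical gazetteer: concept id -> surface variants (spelling variants, plurals).
GAZETTEER = {
    "company:NTT_Labs": ["NTT Labs"],
    "product:AirWave": ["AirWave", "Airwave"],
    "concept:wireless_PBX": ["wireless PBX", "wireless PBXs"],
}


def annotate(text: str) -> str:
    """Wrap every known variant in a designator of the form [[concept|surface]]."""
    variants = {v: c for c, vs in GAZETTEER.items() for v in vs}
    # Longest variants first, so that "wireless PBXs" is tried before "wireless PBX".
    alternatives = "|".join(
        re.escape(v) for v in sorted(variants, key=len, reverse=True)
    )
    return re.sub(
        rf"\b(?:{alternatives})\b",
        lambda m: f"[[{variants[m.group(0)]}|{m.group(0)}]]",
        text,
    )


annotated = annotate("Airwave is based on a wireless PBX from NTT Labs.")
```

Mapping several variants onto one concept identifier is what lets a later stage attach the same set of typed links to every mention, whatever its spelling.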
Web Technology
Hyperlinks need to be:
relational
typed
external
possibly multidirectional
functional hyperlinks
today's hyperlinks are
functional
unidirectional
untyped
relational hyperlinks
Relational Hyperlink
{person, homepage
 person, email-address}
Relational Labelled Hyperlink
{person, "homepage", homepage
 person, "email", email-address}
Relational Typed Labelled Hyperlink
person: {person, "homepage", homepage
 person, "email", email-address}
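One way to picture the difference from today's functional links is as a small data structure: an anchor of a given type carries a set of labelled links rather than a single target. The class names `LabelledLink` and `TypedAnchor`, the `targets` method and the example person are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LabelledLink:
    label: str   # e.g. "homepage"
    target: str  # where the link leads


@dataclass
class TypedAnchor:
    anchor: str     # the linked entity, e.g. a person's name
    type_name: str  # the type that licenses the labels, e.g. "person"
    links: List[LabelledLink] = field(default_factory=list)

    def targets(self, label: str) -> List[str]:
        """A relation may hold several targets per label; a functional link holds one."""
        return [link.target for link in self.links if link.label == label]


# Hypothetical person and targets, invented for this example.
jane = TypedAnchor("Jane Doe", "person", [
    LabelledLink("homepage", "http://example.org/~jane"),
    LabelledLink("email", "mailto:jane@example.org"),
])
```

Because the links are stored apart from the text they annotate, the same structure also fits the requirement that hyperlinks be external.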
Relational Hyperlinks as Types
type:
  FunRef1  target1
  FunRef2  target2
  ...
  FunRefn  targetn
company:
  Homepage  homepages
  Stocks  get–from–NYSE
  News
    news:
      CNR–briefs  cnr–bulletin
      Paperball  paperball
      Reuters  get–reuters
      OlderNews  cnr–archive
ka–customer:
  Key Account  type–of–account
  MarketingContact  ka–manager
  AccountAccess  secure–connect–ka–DB
Link Ontologies
Link Ontologies and link DBs
Customized Ontologies
Ontologies can be customized by
Extension
Expansion
Overwriting
Merging
Recursion
A type can have an attribute a with the value
person :=
  name: title: string
        first_name: string
        other_given_names: string
        last_name: string
        aka.: string
  ...
  father: person
  ...
Recursion
The embedded type can be expanded:
person :=
  name: title: string
        first_name: string
        other_given_names: string
        last_name: string
        aka.: string
  ...
  father: name: name
          ...
          father: person
          ...
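The recursive type can be sketched with Python dataclasses. The attribute names follow the slide (with `aka.` written as `aka`); the `paternal_line` helper and the example names are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Name:
    first_name: str
    last_name: str
    title: str = ""
    other_given_names: str = ""
    aka: str = ""


@dataclass
class Person:
    name: Name
    father: Optional["Person"] = None  # recursive: the value is again a person


def paternal_line(person: Optional[Person]) -> List[str]:
    """Follow the father attribute until it is unspecified."""
    line = []
    while person is not None:
        line.append(person.name.first_name)
        person = person.father
    return line


# Hypothetical family, invented for this example.
grandfather = Person(Name("Karl", "Muster"))
father = Person(Name("Hans", "Muster"), father=grandfather)
child = Person(Name("Eva", "Muster"), father=father)
```

Leaving `father` unset corresponds to the unexpanded type; each further `Person` value is one step of expansion.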
Multiple Inheritance
location
palace
building
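Assuming the diagram on this slide shows palace as a subtype of both location and building, the idea can be sketched with Python's multiple inheritance. The attributes `address` and `architect` and their values are hypothetical; the slide names only the three types.

```python
class Location:
    def __init__(self, address: str = "", **kwargs):
        super().__init__(**kwargs)  # pass remaining attributes along the type hierarchy
        self.address = address


class Building:
    def __init__(self, architect: str = "", **kwargs):
        super().__init__(**kwargs)
        self.architect = architect


class Palace(Location, Building):
    """Inherits the attributes of both supertypes."""


palace = Palace(address="Potsdam", architect="unknown")
```

A palace can then be reached both through links defined for locations and through links defined for buildings.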
Equality
person :=
  name: title: string
        first_name: string
        other_given_names: string
        last_name: [1] string
        aka.: string
  ...
  father: name: last_name: [1] string
          ...
          father: person
          ...
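The index 1 attached to both last_name values states that they are one and the same value. A minimal sketch, assuming this is read as structure sharing: both attribute paths point to the same object instead of two equal copies. The names and values are invented for illustration.

```python
# One value object, reachable through two attribute paths.
shared_last_name = {"value": "Muster"}  # hypothetical value

person = {
    "name": {"first_name": "Eva", "last_name": shared_last_name},
    "father": {"name": {"first_name": "Hans", "last_name": shared_last_name}},
}

# Updating the value through one path is visible through the other.
person["father"]["name"]["last_name"]["value"] = "Mustermann"
own_last_name = person["name"]["last_name"]["value"]
```

This is stronger than two attributes that merely happen to hold equal strings: a correction made in one place can never leave the other place out of date.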
Extension
New attributes are added
examples:
• new attribute: restored
• new attribute for: citation in <Coleman et al.>
Expansion
An atomic type is expanded into an AVM
location: Berlin is expanded into an address
technique: oil_on_canvas is expanded into canvas:, paints:, layers:, etc.
Overwriting
The value of attributes can be overwritten
for corrections
alternative pointers to information sources
alternative representations
Merging
The attributes of concepts from two ontologies can be merged
ontologies from different disciplines
or from a discipline and a metadata initiative
equality can be employed to state identity between values
example: the value of "creator" in the Dublin Core can be set equal to "author" in BibTeX bibliography format
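The four customization operations (extension, expansion, overwriting, merging) can be sketched on attribute-value structures represented as plain dictionaries. This is an illustration under stated assumptions, not the mechanism used in the talk: the function names, the `equal` mapping and all attribute values are invented here.

```python
# A hypothetical description of an art work.
base = {"artist": "unknown", "location": "Berlin", "technique": "oil_on_canvas"}


def extend(avm, attr, value):
    """Extension: add a new attribute; existing ones are untouched."""
    if attr in avm:
        raise KeyError(f"{attr} already exists; use overwrite instead")
    return {**avm, attr: value}


def expand(avm, attr, sub_avm):
    """Expansion: replace an atomic value by an embedded structure."""
    return {**avm, attr: dict(sub_avm)}


def overwrite(avm, attr, value):
    """Overwriting: replace the value of an existing attribute."""
    if attr not in avm:
        raise KeyError(attr)
    return {**avm, attr: value}


def merge(a, b, equal):
    """Merging: union of attributes; `equal` maps attribute names of b onto names of a."""
    merged = dict(a)
    for attr, value in b.items():
        key = equal.get(attr, attr)
        if key in merged and merged[key] != value:
            raise ValueError(f"conflicting values for {key}")
        merged[key] = value
    return merged


work = extend(base, "restored", "1998")
work = expand(work, "location", {"city": "Berlin", "street": "unknown"})
work = overwrite(work, "artist", "Anonymous")

# "creator" (Dublin Core) is declared equal to "author" (BibTeX).
record = merge({"creator": "A. Author"},
               {"author": "A. Author", "year": "2001"},
               {"author": "creator"})
```

Every operation returns a new structure, so the original ontology entry stays available next to its customized version.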
Problem: Ambiguity
Im Jahr 1942 wurde von Essen in einer kleinen Stadt in Südschweden geboren.
In the year 1942 von Essen was born in a small town in the south of Sweden.
"Essen" may be
• the name of a city
• the plural of "Esse" meaning smokestack
• the word for food
• a family name
• the name of a bank, "Von Essen Bank"
Problem: Polysemy, Aspects, Views
Often a descriptor or designator can be used in different aspects of meaning.
One of the sources of this type of uncertainty is systematic polysemy.
Another one is the aspects or views associated with a context or a user type.
Polysemy
The assembly takes five minutes.
The assembly is in Building Five.
The iBook has a G3 processor and a DVD drive.
The PowerBook can be checked out but the iBook is currently in use by the project BABEL.
CNN has a special at 5 p.m.
Then he became Senior Vice President of CNN.
Aspects
The iBook has a G3 processor and a DVD drive.
The iBook is reduced by 15% in our clearance sale.
Peter Norman will answer your questions.
The new Department Chair is Peter Norman.
The Ultimate Information Management
Provide:
the right information
to the right people
at the right time
and in the right form
decision triggers
All kinds of forms requiring
• Approvals,
• Recommendations,
• Selections
Examples:
• Application for a Building Permit
• Credit application
• Request for a comment on a hiring decision
Good decision triggers contain information relevant for the decision or references to such pieces of information
Possible Targets
Short Pieces of Information (e.g., translation into Spanish)
Regular Hyperlink (e.g., homepage)
DB Access (e.g., lookup of account status)
Start of a Process (e.g., start a credit check)
Notification of a person (e.g., send query to expert)
Search out of context (e.g., search in inter-, intra-, extranet)
first step: densely hyperlinked texts
in the ideal case: every meaningful unit carries typed relational hyperlinks
words, names, symbols, pictures, elements of pictures,
Possible Targets
Short Pieces of Information (e.g., translation into Spanish)
Regular Hyperlink (e.g., homepage)
DB Access (e.g., lookup of account status)
Start of a Process (e.g., start a credit check)
Notification of a person (e.g., send query to expert)
Search in a context (e.g., search in inter-, intra-, extranet)
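Since a link target need not be a page, following a link amounts to dispatching on the kind of target. The sketch below assumes a simple table of handlers; the kind names and the handler behaviour (returning a description instead of performing the action) are invented for illustration.

```python
from typing import Callable, Dict

# Hypothetical handlers, one per kind of target listed above.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "info": lambda arg: f"show short information: {arg}",
    "link": lambda arg: f"open page: {arg}",
    "db": lambda arg: f"query database: {arg}",
    "process": lambda arg: f"start process: {arg}",
    "notify": lambda arg: f"notify person: {arg}",
    "search": lambda arg: f"search for: {arg}",
}


def follow(kind: str, arg: str) -> str:
    """Activate a link; the kind of the target decides what happens."""
    try:
        handler = HANDLERS[kind]
    except KeyError:
        raise ValueError(f"unknown target kind: {kind}") from None
    return handler(arg)


result = follow("process", "credit check")
```

New kinds of targets can be added to the table without touching the documents that carry the links.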
Applications
Enrichment of specialized Web Sites
example: SOG and Saarland Online
Enrichment of Portals
example: LT World
Email Processing
example: MailMinder Extension
Legacy Code
example: Dresdner Bank HyperCode
Information and Knowledge Management
no example yet
Associative Digital Memories
second step: grounding in experience
From densely associative hypertexts to episodically enriched memories
calendar, biography, timeline
media
situations
example 1: personal digital memory
calendar: 2000-12000 calendar entries
email: 20000-100000 messages
addresses: 100-2000 addresses
photographs: 1000-30000 pictures
written papers, reports, reviews: 100-1000 documents
music: 500-5000 titles
talks: 50-500 slide sets
read electronic papers: 200-2000 documents
visited web-pages: 20000-100000 pages
example 1: personal digital memory
establish a set of relevant entities and concepts
• dates
• persons
• themes
• locations
• functions
• sources
find connections among email, calendar, photographs, addresses, papers,
example 1: personal digital memory
entry points
• words
• names
• topics
• dates
•
dynamically adapt the links to new entities and concepts
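Finding connections among email, calendar, photographs and other items can be sketched as an inverted index from entities to the items that mention them. The item identifiers, the entity labels and the `associated` helper are assumptions made for this illustration; in practice the entities would come from automatic annotation.

```python
from collections import defaultdict

# Hypothetical items of a personal information space, each with the
# entities and concepts found in it.
items = [
    ("email:17", {"person:Anna", "date:2003-05-12", "theme:workshop"}),
    ("calendar:4", {"date:2003-05-12", "location:Berlin"}),
    ("photo:231", {"person:Anna", "location:Berlin"}),
]

index = defaultdict(set)  # entity -> items that mention it
for item_id, entities in items:
    for entity in entities:
        index[entity].add(item_id)


def associated(item_id: str) -> set:
    """All other items that share at least one entity with item_id."""
    entities = dict(items)[item_id]
    return set().union(*(index[e] for e in entities)) - {item_id}


neighbours = associated("email:17")
```

Adding a new item only adds entries to the index, which is one simple way to keep the links adapting to new entities and concepts.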
long term vision: third level
First Level: structure the personal information space, i.e., information that is already on your machine, e.g., texts, correspondence, direct messages (SMS etc.), calendar, graphics
Second Level: add personal archives by digitizing additional information: e.g., photographs, personal sound recordings, musical records, movie clips, etc.
Third Level: add episodic memory (life records), create extensive sound (and image) archives of selected episodes of your daily life, including meetings and pictures of people, sights, documents
Not manageable without dense associative hyperlinking
example 1: personal digital memory
related projects
• MyLifeBits (Microsoft)
• Haystack (MIT)
and relevant but less related
• LifeLog (DARPA)
example 2: collective/social memories
historical memories
memories of scientific developments
a combination of both
Humanities & Social Sciences
large digital archives
first attempts to annotate and interlink
linguistic data collection and annotation
projects Perseus and Archimedes
annotation by markup (XML stand-off markup)
Humanities & Social Sciences
from individual to collective research
individual research groups joint projects community research
working on shared interpreted data creates new collaboration communities
by e-science new paradigms of humanities research are emerging
Humanities & Social Sciences II
interpretation by multilayer stand-off annotation
interpretation of data does not destroy data
interpretations can refer to other interpretations
replicability of results by sharing of data and disclosure of interpretations
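A minimal sketch of multilayer stand-off annotation: the primary text is never modified, every annotation points into it by character offsets, and an annotation may refer to another annotation. The `Annotation` fields, the layer names and the labels are assumptions made for illustration; the projects mentioned earlier use XML stand-off markup rather than Python objects.

```python
from dataclasses import dataclass
from typing import Optional

TEXT = "In the year 1942 von Essen was born in a small town in the south of Sweden."


@dataclass(frozen=True)
class Annotation:
    id: str
    layer: str
    start: int  # character offsets into TEXT
    end: int
    label: str
    refers_to: Optional[str] = None  # id of another annotation


def span(surface: str, layer: str, id_: str, label: str,
         refers_to: Optional[str] = None) -> Annotation:
    """Annotate the first occurrence of surface in TEXT."""
    start = TEXT.index(surface)
    return Annotation(id_, layer, start, start + len(surface), label, refers_to)


layers = [
    span("von Essen", "entities", "a1", "person"),
    span("Sweden", "entities", "a2", "location"),
    # An interpretation that comments on another interpretation.
    span("von Essen", "commentary", "a3", "family name, not the city", refers_to="a1"),
]

covered = [TEXT[a.start:a.end] for a in layers]
```

Because the layers live outside the text, competing interpretations can coexist and be shared or disclosed independently of the data.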
Humanities & Social Sciences III
Disciplines in the humanities exhibit large overlaps in primary sources and data.
We expect that many digitized sources can be exploited by more than one discipline.
Moreover, we can expect that the scientific interpretations of the data may also be used in part across traditional disciplines.
Topics and data are largely culture-dependent and often language-specific.
Digital data collections can be anywhere
• Wittgenstein Archives in Bergen
• Corpus of historical English in Helsinki
• Roman Texts at Tufts University
• Japanese Grammar in Saarbruecken
By shared methodology, new forms of transcultural and multilingual research will become possible.
Humanities & Social Sciences IV
Natural Sciences vs. Humanities
Data in the natural sciences and engineering (NS&E) are often so-called structured data such as databases, time sequences of measurements or matrices of numeric values.
Scientific data in NS&E can also be very large two- to four-dimensional images (pictures, spatial images, videos, VR scenes), cases of so-called unstructured data.
Most data in the humanities are so-called unstructured data: texts, pictures, sound files, films...
For the researcher in the humanities, these unstructured data possess much more structure (on the conceptual level) than the structured data in databases.
Densely Associated Content
videte nunc quo adfectent iter apertius quam antea. nam superiore parte legis
quem ad modum Pompeium oppugnarent, a me indicati sunt;
nunc iam se ipsi indicabunt. iubent venire
agros Attalensium atque Olympenorum quos populo
Romano Servili, fortissimi viri, victoria adiunxit, deinde agros in Macedonia regios qui partim T.
Flaminini, partim L. Pauli qui Persen vicit virtute
parti sunt, deinde agrum optimum et fructuosissimum
Corinthium qui L. Mummi imperio ac felicitate ad vectigalia populi
Romani adiunctus est, post autem agros in Hispania apud
novam duorum Scipionum eximia virtute possessos;
tum vero ipsam veterem Carthaginem vendunt quam P.
Africanus nudatam tectis ac moenibus sive ad notandam
Carthaginiensium calamitatem, sive ad testificandam
nostram victoriam, sive oblata aliqua religione ad
aeternam hominum memoriam consecravit.
Attalia (Attaleia, Antalya)
Coastal city in Pamphylia
Map, Other References, History, Comments
My Comments
All meaningful units are associated via semantic links with related information distributed all over the digital global knowledge base.
CULTURAL MEMORY
Cultural Heritage can be preserved, shown,
Cultural Heritage
What you have inherited from your fathers, you must acquire in order to possess.
(Was Du ererbt von deinen Vätern hast, erwirb es, um es zu besitzen. J.W. v. Goethe)
Memory is much more than storage.
Memory is associative.
Associative memory is the basis for retrieval.
Conceptually structured associative memory is the basis for inferencing and learning.
Interdisciplinary Interpretation
interconnect documents from
literature
political history
history of arts
geography
sociology
etc.
linking of data and interpretation
linking of interpretation and meta-interpretation
creates a history of interpretation
add comments, criticism, approval, further evidence
link from and to your own work
The True Challenge
The main challenge for the humanities and social sciences is ...
• the semantic interlinking of their contents across languages, disciplines, cultures, media and formats,
• to prepare and share treasures of human thought and cultural heritage via the web as data for research and education,
• to help semantically structure the types of content exhibiting the most complex inherent conceptual structure,
• to accept an active role in the most exciting process in contemporary IT affecting human mind, culture and society.
Conclusion I
there is an important step in between digital information and digital knowledge
human memory is a scarce resource
this is true for personal and social memory
we do not need machines that think for us
but we could surely use some memory extensions
to be useful these have to be adapted to human ways of information storage and access
Conclusion II
Associative Memories are the natural next step beyond digital content
Digital Content → Digital Memories → Digital Knowledge
Building associative memories from large document collections cannot be done with today's web technology
Here we need the contributions of language and knowledge technologies
No wire in the head
SProUT Motivation
A platform for development of multilingual and domain-adaptive shallow text processing and information extraction systems
Trade-off between efficiency and expressiveness
Modularity (fine-grained modeling of linguistic components into clear-cut modules)
Portability and industrial standards
SProUT Components
Linguistic Processing Resources
Tokenizer (easily adaptable for Indo-European languages)
Gazetteer
Morphology Component (6 languages, Japanese under consideration)
Interface to MMorph
Core Tools
JTFS (Implementation without PET)
FSM Toolkit (Adaptation of the FSM Interface to the requirements of the Grammar Interpreter)
Tries for NLP processing
System Architecture
[Figure: system architecture diagram with two areas, Grammar Development Environment and Online Processing. Box labels: Finite-State Toolkit, Regular Compiler, Shallow Grammar, Shallow Grammar Interpreter, JTFS, Extended Optimized FST Representation, Lexical Resources, Linguistic Processing Resources, Input Data, Stream of Text Items, Structured Output Data.]
System Components
Linguistic Processing Resources
Tokenizer (easily adaptable for Indo-European languages)
Gazetteer
Morphology Component (8 languages)
Named Entity Recognition (6 languages)
Core Tools
JTFS (Implementation without PET)
FSM Toolkit
Regular Compiler
Shallow Grammar Interpreter
Tries for NLP processing
TFS and TFS-XML
TFS as data interchange format in SProUT
unification and subsumption check as basic operations for evaluation
compact XML encoding of typed feature structures (following TEI-SGML)
exchange format for linguistic resources:
• grammars
• feature structure tree banks
exchange format for visualization
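The two basic operations named above, unification and subsumption check, can be sketched in Python on a deliberately simplified model. This is not the JTFS implementation: feature structures are plain nested dicts, atomic values are strings that unify only when equal, and types, type hierarchies and coreferences are left out. The function names are assumptions made for this sketch.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None if they clash.

    Feature structures are nested dicts; atomic values are plain strings.
    The inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:
                    return None          # conflicting information
                result[feature] = merged
            else:
                result[feature] = b_value
        return result
    return a if a == b else None


def subsumes(general, specific):
    """True if `general` carries no information that `specific` lacks."""
    if isinstance(general, dict) and isinstance(specific, dict):
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


rule = {"PRED": "übernehmen", "AGENT": {"NAME": "Maria_Müller"}}
matched = {"AGENT": {"NAME": "Maria_Müller"}, "THEME": {"NOM": "Vorsitz"}}
unified = unify(rule, matched)
```

In this simplified model the unified structure contains the information of both inputs, so each input subsumes the result; two structures with conflicting atomic values, such as `{"NUM": "sg"}` and `{"NUM": "pl"}`, fail to unify.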
TFS-XML: Example
<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
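As a rough illustration of how such an encoding can be consumed, the following Python sketch reads the example with the standard library XML parser and converts it into nested dictionaries. The element and attribute names (`FS`, `F`, `type`, `name`, `coref`) are taken from the example above; the function name and the dictionary layout are assumptions made for this sketch, not part of SProUT.

```python
import xml.etree.ElementTree as ET

TFS_XML = """
<FS type="pred_argument">
  <F name="PRED"><FS type="übernehmen"/></F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"><FS type="Maria_Müller"/></F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"><FS type="Vorsitz"/></F>
    </FS>
  </F>
</FS>
"""


def fs_to_dict(fs):
    """Convert an <FS> element into a nested dict (layout is hypothetical)."""
    node = {"type": fs.get("type")}
    if fs.get("coref") is not None:
        node["coref"] = fs.get("coref")
    features = {}
    for f in fs.findall("F"):            # direct <F> children only
        value = f.find("FS")             # each feature holds one <FS> value here
        if value is not None:
            features[f.get("name")] = fs_to_dict(value)
    if features:
        node["features"] = features
    return node


parsed = fs_to_dict(ET.fromstring(TFS_XML.strip()))
```

The sketch only records the `coref` tags; resolving them into shared substructures would be a further step.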
Morphological Resources
Indo-European language resources
English 200,000 entries (Mmorph (Multext))
German 830,000 entries (Mmorph (Multext))
French 225,000 entries (Mmorph (Multext))
Spanish 570,000 entries (Mmorph (Multext))
Italian 330,000 entries (Mmorph (Multext))
Czech 600,000 entries (Institute of Formal and Applied Linguistics in Prague)
Asian language resources
Chinese Shanxi-Tokenizer
Japanese ChaSen
Architecture
MMorph fullform lexica are stored as a trie
External modules (Asian and Czech) are integrated via client/server
[Figure: architecture diagram. Box labels: Parser, MMorph, Tokenizer, Shanxi, ChaSen, Czech Morph.]
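Storing a full-form lexicon as a trie means that word forms with a common prefix share one path, and a lookup costs one step per character. The following Python sketch shows the idea only; the class name, the dict-based node layout and the `(lemma, part of speech)` reading format are assumptions, not the SProUT data structure.

```python
class TrieLexicon:
    """A full-form lexicon stored as a character trie (illustrative sketch)."""

    _END = None   # key under which the readings of a complete word form are stored

    def __init__(self):
        self.root = {}

    def add(self, wordform, reading):
        node = self.root
        for ch in wordform:
            node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
        node.setdefault(self._END, []).append(reading)

    def lookup(self, wordform):
        node = self.root
        for ch in wordform:
            node = node.get(ch)
            if node is None:
                return []                    # no entry with this prefix
        return list(node.get(self._END, []))


lexicon = TrieLexicon()
lexicon.add("evaluieren", ("evaluieren", "Verb"))
lexicon.add("evaluierten", ("evaluieren", "Adjective"))
lexicon.add("evaluierten", ("evaluieren", "Verb"))
```

A prefix such as "evalu" is present as a path in the trie but has no readings, so `lookup` returns an empty list for it.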
A Type-Driven Method for Compacting Mmorph
redundant and spurious ambiguous readings
German Mmorph: 5.8 readings per wordform in DNF
compacts Mmorph by
• deletion of redundant readings
• substitution of special readings through more general ones using type generalization and subsumption checking
• generation of a type hierarchy
the average number of readings in German is now 1.6
Compacting an MMorph Entry
"evaluierten" = "evaluieren" Adjective[ gender=masc number=singular case=gen|dat|acc ]
"evaluierten" = "evaluieren" Adjective[ gender=fem|neutrum number=singular case=gen|dat ]
"evaluierten" = "evaluieren" Adjective[ gender=masc|fem|neutrum number=singular case=nom|gen|dat|acc ]
"evaluierten" = "evaluieren" Adjective[ gender=masc|fem|neutrum number=plural case=nom|gen|dat|acc ]
compacting
"evaluierten" = "evaluieren" Adjective[ gender=fem_masc_neutrum number=singular_plural case=acc_dat_gen_nom ]
Type hierarchy for number:
plural_singular
plural singular
Type hierarchy for gender:
fem_masc_neutrum
fem_masc fem_neutrum masc_neutrum
fem masc neutrum
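The compaction of this entry can be reproduced with a small Python sketch. It is a reconstruction under assumptions, not the published algorithm: each reading is a dict from feature name to a set of values, readings subsumed by a more general reading are deleted, and two readings that differ in exactly one feature are replaced by one reading whose value set for that feature is the union (a step that loses no information). All function names are hypothetical.

```python
def subsumes(general, specific):
    """True if every value allowed by `specific` is also allowed by `general`."""
    return all(specific[f] <= general[f] for f in general)


def drop_subsumed(readings):
    """Delete redundant readings, i.e. those covered by a more general one."""
    kept = []
    for r in readings:
        if any(subsumes(k, r) for k in kept):
            continue
        kept = [k for k in kept if not subsumes(r, k)]
        kept.append(r)
    return kept


def merge_once(readings):
    """Merge the first pair of readings that differ in exactly one feature."""
    for i, a in enumerate(readings):
        for j in range(i + 1, len(readings)):
            b = readings[j]
            diff = [f for f in a if a[f] != b[f]]
            if len(diff) == 1:
                merged = dict(a)
                merged[diff[0]] = a[diff[0]] | b[diff[0]]
                return readings[:i] + readings[i + 1:j] + readings[j + 1:] + [merged]
    return None


def compact(readings):
    readings = drop_subsumed(readings)
    while True:
        merged = merge_once(readings)
        if merged is None:
            return readings
        readings = drop_subsumed(merged)


def type_name(values):
    """Name of the generalized type, e.g. {'masc', 'fem'} -> 'fem_masc'."""
    return "_".join(sorted(values))


def reading(gender, number, case):
    return {
        "gender": frozenset(gender.split("|")),
        "number": frozenset(number.split("|")),
        "case": frozenset(case.split("|")),
    }


evaluierten = [
    reading("masc", "singular", "gen|dat|acc"),
    reading("fem|neutrum", "singular", "gen|dat"),
    reading("masc|fem|neutrum", "singular", "nom|gen|dat|acc"),
    reading("masc|fem|neutrum", "plural", "nom|gen|dat|acc"),
]
compacted = compact(evaluierten)
compacted_types = {f: type_name(v) for f, v in compacted[0].items()}
```

With alphabetical naming the number type comes out as `plural_singular`, the spelling used in the type hierarchy on the slide.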
A SProUT Grammar Rule
Unification
[Figure panels: Matched Input Structure; Extended Rule Structure After Match; Fully Unified Structure]
COLLATE, Scientific Advisory Board Meeting, Saarland University, 22 November 2002
Multilingual Named Entity Grammars
Languages
• English, French, German, Spanish
• Chinese, Japanese
Grammar Style
• MUC-7 named entity classes with some variations
• NAMEX: person, location, organisation
• TIMEX: time point, time span (instead of date, time)
• NUMEX: percentage, money
• Named entity types with internal attribute-value structures, e.g.,
span := timex & [FROM point, TO point ].
Future Work
Extension of NE grammars to other languages, e.g., Czech, Polish
Grammar Evaluation with JTACO
Efficiency issues
• Experiments with different search strategies
• Grammar processing optimization
Extension of XTDL expressiveness
• Functional operators
• Seek operator
Thank you for your attention...
HANS USZKOREIT SAARBRÜCKEN 17.01.2003
DIGITAL EXTENSIONS OF PERSONAL AND COLLECTIVE MEMORY
OUTLINE
MOTIVATION
COMPONENTS OF MEMORY
FUNCTIONALITIES
SIDE EFFECTS
REALIZATION
MANMADE MEMORY EXTENSIONS
Extension of short-term memory
scratch pad, abacus
Extension of medium-term memory
note pad
Extension of long-term memory
written records, sound recordings, books, storage of data in databases and documents
MOTIVATION
HUMAN DESIRE TO REMEMBER
• DIARY
• MEMOIRS
• FAMILY COMMUNICATION
EXPLOSIVE GROWTH OF DIGITIZED PERSONAL INFORMATION
• CORRESPONDENCE (EMAIL, LETTERS)
• WRITINGS (REPORTS, PAPERS)
• VISUAL MEMORIES (PHOTOGRAPHS, VIDEOS)
• ORGANIZERS (CALENDARS, TO-DO LISTS, ADDRESSES)
• WEB MATERIAL (BOOKMARKS, VISITED PAGES)
COMPETITIVE ADVANTAGE
• SETTLING OF DISPUTES
• BRUSHING UP DETERIORATED KNOWLEDGE
• LEGAL ISSUES
MEMORY EXTENSION
Memory consists of
Recording, Preparation, Integration, Storage, Restructuring, Recall
From Storage to Memory to Knowledge
EXTENDED PERSONAL MEMORY
First Level: structure the personal information space, i.e., information that is already on your machine, e.g., texts, correspondence, direct messages (SMS etc.), calendar, graphics
Second Level: add personal archives by digitizing additional information: e.g., photographs, personal sound recordings, musical records, movie clips, etc.
Third Level: add episodic memory (life records), create extensive sound (and image) archives of selected episodes of your daily life, including meetings and pictures of people, sights, documents
Not manageable without dense associative hyperlinking
EXTENDED COLLECTIVE MEMORY
First level: intranet with databases
Second level: add archives that were kept on paper, films, or sound recordings.
Third level: integrate workflow and intranet, produce records of meetings, processes, transactions
Not manageable without dense associative hyperlinking
MEMORY EXTENSION
Memory consists of
Recording, Preparation, Integration, Storage, Restructuring, Recall
From Storage to Memory to Knowledge
RECORDING
TYPING IN (PAPERS, MAIL, PRESENTATIONS)
DOWNLOADING (FILES, MAIL, WEB PAGES)
COPYING (CDS, CD-ROMS, DVDS)
SCANNING IN OF IMAGES (PHOTOS, ART)
SCANNING IN AND OCR OF TEXTS
SOUND RECORDING (DICTATION, CONVERSATION)
VIDEO RECORDING (TV, SEEN, MEETINGS, SELF)
PREPARATION
SOURCE PROPERTIES
TOKENIZATION
POS TAGGING
CLASSIFICATION
NAMED ENTITY RECOGNITION
RELATION DETECTION
TIME MARKING
META DATA PROCESSING
INTEGRATION
META-DATA INDEXING
IR-TYPE INDEXING
CONCEPTUAL INDEXING
META-DATA INDEXING
ORIGIN (ADDRESS, URL)
KIND OF INFO (TYPE, FORMAT, SIZE)
TIME (CREATED, CHANGED, ACCESSED)
THEMATIC (KEYWORDS, AUTHOR, THEME)
IR-TYPE INDEXING
FULL-TEXT INDEXING
• STEMMING, MORPHOLOGY, THESAURI?
• MULTI-WORD TERMS
• TRANSLATION?
CLASSIFICATION
SUMMARIZATION
CONCEPTUAL INDEXING
CONSTRUCTION OF A DYNAMIC CONCEPTUAL INDEX
IDENTIFICATION OF MAJOR CLASSES
PEOPLE, TIMES, EVENTS, LOCATIONS, DOCUMENTS, THEMES, FUNCTIONS, SOURCES, ORGANIZATIONS
ONTOLOGICAL STRUCTURING
LINK DBs
NAMED ENTITY RECOGNITION
LINKING
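A very small Python sketch may help to picture conceptual indexing and linking as listed on this slide. It assumes a hand-made gazetteer and plain substring matching in place of real named entity recognition; the document ids, the gazetteer entries and the function names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_conceptual_index(documents, gazetteer):
    """Map each concept (class, canonical id) to the documents that mention it.

    documents: {doc_id: text}
    gazetteer: {surface name: (class, canonical id)}  (hypothetical format)
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for name, concept in gazetteer.items():
            if name.lower() in lowered:      # crude stand-in for NE recognition
                index[concept].add(doc_id)
    return index


def link_documents(index):
    """Associate every pair of documents that share at least one concept."""
    links = defaultdict(set)                 # (doc_a, doc_b) -> shared concepts
    for concept, doc_ids in index.items():
        for a, b in combinations(sorted(doc_ids), 2):
            links[(a, b)].add(concept)
    return links


documents = {
    "mail-17": "Meeting with Maria Müller in Saarbrücken",
    "photo-3": "Maria Müller at the conference",
    "note-9": "Buy a new note pad",
}
gazetteer = {
    "Maria Müller": ("PEOPLE", "maria_mueller"),
    "Saarbrücken": ("LOCATIONS", "saarbruecken"),
}
index = build_conceptual_index(documents, gazetteer)
links = link_documents(index)
```

Here the mail and the photo become associated because both mention the same person, while the note stays unlinked.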
ISSUES: PERSONAL MEMORY
GENERAL AND PERSONAL ONTOLOGIES
WEIGHT ASSIGNMENT AND RESTRUCTURING
EXTERNAL LINKS
THE ROLE OF THE TIMELINE
PREACTIVATION & PREFETCHING
SEARCH FROM A CONTEXT
PRIVACY & SECURITY
SAFETY & TRUST
INTEREST MODELS
LANGUAGE MODELS
POSSESSION, IMMORTALITY
REALIZATION PLAN
STARTING IN PARALLEL:
THEORETICAL CONSIDERATIONS
BRAIN STORMING AND SPECIFICATION
Discussion Group: Callmeier, Eisele, Erbach, Schäfer, Siegel, Uszkoreit
IMPLEMENTATION OF MOCK-UP AND SYSTEM
RECORDING AND PREPARATION OF DATA (Calendar, Email, Pictures, Papers, Web Pages)
RA Jobs
DEFINITION AND ACQUISITION OF PROJECTS
SFB preparations, EU project for cognition call