Semantic search: different meanings
• Definition 1: Semantic search as the problem of searching documents beyond the syntactic level of matching keywords
  – Hakia, Powerset, SearchMonkey
• Definition 2: Semantic search as the problem of searching large semantic web datasets
  – Watson, PowerAqua, Swoogle, Sindice, SWSE
Facing keyword-based search problems
• Relations between search terms:
  – “books about recommender systems” vs. “systems that recommend books”
• Polysemy:
  – “mouth” as part of the body vs. “mouth” as part of a stream
• Synonymy:
  – “movies” vs. “films”
• Documents about individuals where the query keywords do not appear:
  – “English banks”, individual “Abbey”
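The problems above can be reproduced with a few lines of code. The sketch below is only an illustration: it assumes a naive keyword matcher (lowercasing, whitespace splitting, Jaccard overlap), and the document strings are made up for the example.

```python
def tokenize(text):
    """Lowercase and split on whitespace; no stemming, no stop-word removal."""
    return text.lower().split()


def keyword_overlap(query, document):
    """Jaccard overlap between the keyword sets of a query and a document."""
    q, d = set(tokenize(query)), set(tokenize(document))
    return len(q & d) / len(q | d)


# Relations between terms are lost: two different intents share keywords.
q1 = "books about recommender systems"
q2 = "systems that recommend books"
shared = set(tokenize(q1)) & set(tokenize(q2))

# Synonymy: a relevant document shares no keyword with the query.
score_synonym = keyword_overlap("movies", "a list of classic films")

# Individuals: the document is about a bank but never says "English" or "banks".
score_individual = keyword_overlap("English banks", "Abbey opens new branches")

print(shared, score_synonym, score_individual)
```

Both scores are zero although both documents are relevant, and the two queries share "books" and "systems" although they ask for different things.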
Several attempts from the IR community
• Early 80s: elaboration of conceptual frameworks and their introduction in IR models
  – Taxonomies (categories + hierarchical relations), e.g., the ODP (Open Directory Project)
  – Thesauri (categories + fixed hierarchical & associative relations), e.g., WordNet (used by linguistic approaches)
  – Algebraic methods such as LSA
• Limitations: the level of conceptualization is often shallow (especially at the level of relations)
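A thesaurus-based approach can be sketched as query expansion. The example below assumes a tiny hand-made synonym table (`SYNONYMS` is hypothetical; a real system might draw on WordNet) and shows both what it fixes and where it stays shallow.

```python
# Hypothetical hand-made thesaurus, used only for this illustration.
SYNONYMS = {
    "movies": {"films"},
    "films": {"movies"},
}


def expand(query_terms):
    """Add the known synonyms of every query term."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def matches(query, document):
    """True if any expanded query term occurs in the document."""
    terms = expand(query.lower().split())
    return bool(terms & set(document.lower().split()))


found = matches("movies", "a list of classic films")
missed = matches("English banks", "Abbey opens new branches")
print(found, missed)
```

Synonymy is handled, but the "English banks" query still misses the document about Abbey: a fixed set of term-to-term relations says nothing about individuals and how they relate to categories.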
The emergence of the SW
• Late 90s: introduction of ontologies as conceptual frameworks (classes + instances (KBs) + arbitrary semantic relations + rules)
  – Semantic search: exploiting ontologies as richer conceptualizations & formal languages to enhance traditional keyword-based document retrieval
  – Semantic search: the need to search this emergent and continuously growing structured information space (the Web of Data)
    • DBLP, Geonames, DBpedia, BBC Music, ... (http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets)
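To make "classes + instances + arbitrary semantic relations" concrete, here is a minimal sketch of a knowledge base stored as subject-predicate-object triples. All `ex:` identifiers and the statements themselves are assumptions made for the example, not data from any of the datasets listed above.

```python
# Hypothetical triples: an instance, its class, and one semantic relation.
TRIPLES = {
    ("ex:Abbey", "rdf:type", "ex:Bank"),
    ("ex:Abbey", "ex:locatedIn", "ex:England"),
    ("ex:Bank", "rdfs:subClassOf", "ex:Organization"),
}


def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in TRIPLES
        if s in (None, ts) and p in (None, tp) and o in (None, to)
    ]


# "English banks": instances of ex:Bank that are located in ex:England.
banks = {s for s, _, _ in match(p="rdf:type", o="ex:Bank")}
in_england = {s for s, _, _ in match(p="ex:locatedIn", o="ex:England")}
english_banks = banks & in_england
print(english_banks)
```

Unlike keyword matching, the query is answered through the relations of the individual, so "Abbey" is found even though no string contains "English" or "banks".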
The Web of Data: 2007, 2008, 2009
Extracted from: Linked Data Tutorial (Florianópolis) http://www.slideshare.net/ocorcho/linked-data-tutorial-florianpolis
LOD cloud May 2007
Figure from [4]
Facts:
• Focal points:
  – DBpedia: RDFized version of Wikipedia; many incoming and outgoing links
  – Music-related datasets
• Big datasets include FOAF, US Census data
• Size approx. 1 billion triples, 250k links
LOD cloud September 2008
Facts:
• More than 35 datasets interlinked
• Commercial players joined the cloud, e.g., BBC
• Companies began to publish and host datasets, e.g., OpenLink, Talis, or Garlik
• Size approx. 2 billion triples, 3 million links
LOD cloud March 2009
Facts:
• Big part from the Linking Open Drug cloud and the BIO2RDF project
• Notable new datasets: Freebase, OpenCalais, ACM/IEEE
• Size > 10 billion triples
The LOD clouds
Commercial interest by publishers
Commercial interest by search engines
• 2007: Yahoo! presents SearchMonkey
• July 2008: Microsoft buys Powerset
• April 2010: Facebook announces the use of the Open Graph protocol
• May 2009: Google announces Rich Snippets and its official use of RDFa and Microformats
• July 2010: Google buys Metaweb (the company behind Freebase)
• November 2010: Google announces support of the GoodRelations vocabulary for Google Rich Snippets
Challenges
• Exploiting this new information space for semantic search purposes opens new research challenges:
  – Scalability
  – Heterogeneity
  – Uncertainty
Scalability
Effective exploitation of linked data requires infrastructure that scales to a large and ever-growing collection of interlinked data.
Heterogeneity
[Figure: the same researcher and related types as identified in different datasets. Labels: Dbpedia:Rudi_Studer, Dblp:Studer:Rudi.html, SW:/en/rudi_studer, Dblp:~ley/db/../author, SW:Person, Dbpedia:Professor. Schema level: align. Data level: reconcile, combine.]
Effective exploitation of the data web requires effective mechanisms for:
• finding the relevant data sources
• integrating data sources
• combining elements from different data sources
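The data-level "reconcile, combine" step can be sketched as follows. This is a minimal illustration, not a description of any particular system: it assumes that equivalence links between the three identifiers are already known (`SAME_AS` is hypothetical), groups them with a union-find structure, and merges the per-source statements (`FACTS` is also hypothetical).

```python
# Hypothetical equivalence links between identifiers from different sources.
SAME_AS = [
    ("Dbpedia:Rudi_Studer", "Dblp:Studer:Rudi.html"),
    ("Dblp:Studer:Rudi.html", "SW:/en/rudi_studer"),
]

# Hypothetical statements, one dictionary per source identifier.
FACTS = {
    "Dbpedia:Rudi_Studer": {"type": "Dbpedia:Professor"},
    "SW:/en/rudi_studer": {"type": "SW:Person"},
}


def find(parent, x):
    """Return the representative of x, registering x on first use."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def reconcile(links):
    """Group identifiers connected by equivalence links."""
    parent = {}
    for a, b in links:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    clusters = {}
    for ident in list(parent):
        clusters.setdefault(find(parent, ident), set()).add(ident)
    return list(clusters.values())


def combine(cluster, facts):
    """Merge the statements of all identifiers in one cluster."""
    merged = {}
    for ident in cluster:
        for key, value in facts.get(ident, {}).items():
            merged.setdefault(key, set()).add(value)
    return merged


clusters = reconcile(SAME_AS)
merged = combine(clusters[0], FACTS)
print(clusters, merged)
```

The merged record carries two types from two vocabularies; deciding how `SW:Person` and `Dbpedia:Professor` relate is the schema-level alignment problem, which this sketch does not address.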
Uncertainty
• Incomplete representation of user needs and content meanings
  – The user cannot completely specify the need
  – The semantic information in the search space is incomplete
Effective exploitation requires:
• matching user needs to data in an imprecise way
• ranking the results
• being flexible enough to adjust to changes in constraints
“Find action films directed by some Hong Kong film director and starring Chinese martial actors”
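One way to read this query under uncertainty is as a set of soft constraints. The sketch below is an assumption-laden illustration: the film records, the attribute names, and the half credit given to missing values are all invented for the example.

```python
# Hypothetical film records; film_b has no information about its cast.
FILMS = [
    {"id": "film_a", "genre": "action", "director_from": "Hong Kong",
     "star_is_martial_artist": True},
    {"id": "film_b", "genre": "action", "director_from": "Hong Kong"},
    {"id": "film_c", "genre": "drama", "director_from": "Hong Kong",
     "star_is_martial_artist": False},
]

# The query, rewritten as attribute constraints (an assumed encoding).
CONSTRAINTS = {
    "genre": "action",
    "director_from": "Hong Kong",
    "star_is_martial_artist": True,
}


def score(item, constraints):
    """Fraction of satisfied constraints; unknown values earn half credit."""
    total = 0.0
    for key, wanted in constraints.items():
        if key not in item:
            total += 0.5  # incomplete data: neither confirmed nor ruled out
        elif item[key] == wanted:
            total += 1.0
    return total / len(constraints)


ranked = sorted(FILMS, key=lambda film: score(film, CONSTRAINTS), reverse=True)
print([(film["id"], round(score(film, CONSTRAINTS), 2)) for film in ranked])
```

No film is discarded outright: the one with missing cast information ranks below the full match and above the film that violates two constraints. Dropping or adding a constraint only changes `CONSTRAINTS`, which is one simple way to stay flexible.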
The Search Space: different representations
• Unstructured search space
  – The Web of documents (textual and multimedia content)
• Structured search space
  – The Web of data (ontologies + Knowledge Bases)
• Hybrid search space
  – Unstructured content is enriched with metadata
    • Embedded annotations
    • Not embedded annotations
The unstructured search space
• The Web of human-understandable content.
• The Web of documents and links
  – <a href="http://creativecommons.org/licenses/by/3.0/">CC License</a>
[Diagram labels: Documents, Search space, Search engines]
The structured search space
• The Web of machine-understandable content.
• The Web of objects and relations
  – <a rel="license" href="http://creativecommons.org/licenses/by/3.0/"> Creative Commons License </a>
[Diagram labels: Objects, Search space, Search engines]
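The difference between the two anchor elements can be shown with Python's standard `html.parser` module. This is a minimal sketch: the class and function names are made up for the example, and only `rel` attributes on `<a>` elements are considered.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect (relation, target) pairs from <a> elements that carry rel."""

    def __init__(self):
        super().__init__()
        self.relations = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "rel" in attributes and "href" in attributes:
            self.relations.append((attributes["rel"], attributes["href"]))


def extract_relations(html):
    """Return the typed links found in an HTML fragment."""
    parser = RelLinkParser()
    parser.feed(html)
    return parser.relations


plain = '<a href="http://creativecommons.org/licenses/by/3.0/">CC License</a>'
typed = ('<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">'
         ' Creative Commons License </a>')

print(extract_relations(plain))
print(extract_relations(typed))
```

The plain link yields nothing a machine can interpret beyond "this page points to that page". The link with `rel="license"` yields a relation between the page and the license, which is the kind of statement the structured search space is built from.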
The hybrid search space
• Enriching documents with metadata
• How to interlink documents and data?
[Diagram labels: Objects, Documents, Search space]
Two ways of interlinking metadata and documents
• Information Extraction
• By relying on Web publishers
  – More in the section Data on the (Semantic) Web
[Diagram label: Search engines]
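The first route, information extraction, can be illustrated with a deliberately simple stand-in: a dictionary lookup that links text mentions to entity identifiers and stores the result as annotations kept outside the document (not embedded). Real information extraction is far more involved; the gazetteer entries and `ex:` identifiers below are hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> entity identifier.
GAZETTEER = {
    "abbey": "ex:Abbey",
    "hong kong": "ex:Hong_Kong",
}


def annotate(text, gazetteer):
    """Return stand-off annotations linking text spans to entities."""
    annotations = []
    for surface, entity in gazetteer.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for found in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                {"start": found.start(), "end": found.end(), "entity": entity}
            )
    return sorted(annotations, key=lambda a: a["start"])


document = "Abbey opens new branches"
annotations = annotate(document, GAZETTEER)
print(annotations)
```

Once the document carries a link to `ex:Abbey`, a query over the data side (for example, for banks in England) can lead back to this document even though the text never uses those words.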