
Semantic Search engines

Existing Solutions

Linked Data

How can I get my dataset into the diagram?

• There must be resolvable http:// (or https://) URIs.

• They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples).

• The dataset must contain at least 1000 triples. (Hence, your FOAF file most likely does not qualify.)

• The dataset must be connected via RDF links to a dataset that is already in the diagram. This means, either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links.

• Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
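The two numeric criteria above (at least 1000 triples, at least 50 RDF links) can be checked mechanically. The following is a minimal sketch, assuming triples are plain `(subject, predicate, object)` string tuples and that each dataset is identified by a URI prefix; the function names and the prefix convention are illustrative assumptions, not part of the diagram's rules.

```python
from urllib.parse import urlparse

MIN_TRIPLES = 1000  # threshold stated in the criteria above
MIN_LINKS = 50      # threshold stated in the criteria above


def is_http_uri(term):
    """Return True if the term is an http:// or https:// URI with a host."""
    parsed = urlparse(term)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def count_links(triples, own_prefix, other_prefix):
    """Count triples connecting a URI of one dataset to a URI of the other.

    Both directions count, matching the "or vice versa" rule.
    Assumption: a dataset is recognised by its URI prefix.
    """
    links = 0
    for s, _p, o in triples:
        if not (is_http_uri(s) and is_http_uri(o)):
            continue  # literals and non-HTTP identifiers are not links
        forward = s.startswith(own_prefix) and o.startswith(other_prefix)
        backward = s.startswith(other_prefix) and o.startswith(own_prefix)
        if forward or backward:
            links += 1
    return links


def qualifies(triples, own_prefix, other_prefix):
    """Check the size and link-count criteria (not resolvability or access)."""
    enough_triples = len(triples) >= MIN_TRIPLES
    enough_links = count_links(triples, own_prefix, other_prefix) >= MIN_LINKS
    return enough_triples and enough_links
```

Resolvability of the URIs and access via crawling, dump, or SPARQL endpoint need network checks and are deliberately left out of this sketch.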

Why Linked Data?

• Easier search for structured documents (use of URIs in RDF triples is similar to the use of URLs in classical links)

• Easier ontology matching – Central authorities providing URIs for other data sources (e.g. DBpedia)

Semantic Search Engines

Document-Centric Semantic Search Engines

Watson

• http://kmi-web05.open.ac.uk/WatsonWUI/

• Parsing: Jena

• Repository: Jena?

• Reasoning: NO

• Keyword based search, SPARQL endpoint


Watson - Schema

Swoogle

• http://swoogle.umbc.edu/

• Crawler: 3 Custom Crawlers

– Google Crawler (.rdf, .owl files)

– Focused Crawler

– Extracted URIs crawler

• Repository: Jena

• Index: Lucene

• Keyword based search

Swoogle Architecture

Data Analysis

• Classification of Semantic Web Documents

– Databases – make assertions about individuals

– Ontologies – define new terms

• Compute rank of SWDs

• Search ordering: Swoogle PR (PageRank), analogous to Google PageRank (GPR)
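To illustrate the PageRank analogy, here is a minimal sketch of plain PageRank by power iteration over a graph of documents. It is not Swoogle's actual ranking, whose details are not given here; the damping factor, the iteration count, and the uniform treatment of all link types are assumptions.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links: dict mapping each document to the list of documents it references.
    Returns a dict mapping each document to its rank (ranks sum to 1).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling document: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

A document referenced by many others (for example a widely imported ontology) ends up with a higher rank than a document nobody links to.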

Entity-Centric Semantic Search Engines

Falcons

• http://iws.seu.edu.cn/services/falcons/

• Reasoning/Ontology matching: Falcon-ao

• Search ordering: TF-IDF in combination with popularity of ontologies

• Classes recommendation: Ordering according to their popularity

• Keyword search: Based on the indexed texts extracted from Virtual Documents
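The combination of TF-IDF with popularity can be sketched as follows. How Falcons actually combines the two scores is not stated here, so the multiplicative log-weighting below, the `+ 1.0` IDF smoothing, and all function names are assumptions made only for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each virtual document against the query with TF-IDF.

    documents: dict mapping an entity name to its list of tokens.
    """
    n = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                idf = math.log(n / doc_freq[term]) + 1.0  # smoothed IDF
                score += (term_freq[term] / len(tokens)) * idf
        scores[name] = score
    return scores


def rank(query_terms, documents, popularity):
    """Order matching entities by text score weighted by popularity.

    popularity: dict mapping an entity name to a usage count (assumed input).
    """
    text = tf_idf_scores(query_terms, documents)
    combined = {
        name: text[name] * math.log(2 + popularity.get(name, 0))
        for name in documents
    }
    matches = [name for name in combined if combined[name] > 0]
    return sorted(matches, key=lambda name: combined[name], reverse=True)
```

With equal text scores, the more widely used term is listed first, which mirrors the popularity-based ordering of recommended classes.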

Falcons Screenshot

Falcon-ao

• Linguistic Matching for Ontologies

– Virtual Documents (names, labels, comments)

– Levenshtein edit distance

– Vector Space Model + cosine similarity of VDs

• Graph Matching for Ontologies

– Similarity of two entities comes from the accumulation of similarities of involved statements

– Similarity of two statements comes from the accumulation of similarities of involved entities
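The two linguistic measures named above can be sketched in a few lines. This is a generic illustration of Levenshtein edit distance and of cosine similarity between bag-of-words vectors, not Falcon-ao's implementation; the normalisation in `name_similarity` and the unweighted term counts are assumptions.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Map edit distance to a similarity in [0, 1] (assumed normalisation)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two virtual documents given as token lists."""
    vec_a, vec_b = Counter(doc_a), Counter(doc_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

Here a virtual document is simply the list of words collected from an entity's names, labels, and comments.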

SWSE

• http://swse.deri.org/

• Crawler: MultiCrawler

• Repository: YARS2 – storing quadruples (subject, predicate, object, context)

• Ontology matching: URIs, IFPs

• Reasoning: Future work (Scalable Authoritative OWL Reasoner, SAOR)

• Search ordering: ReConRank (PageRank for Linked Data)

• Keyword based search: Lucene

SWSE Architecture

• Consolidate – find synonymous identifiers

• Rank – links-based analysis, scores assignment
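The consolidation step can be illustrated with inverse functional properties (IFPs): two identifiers that share a value for an IFP, such as a mailbox, denote the same entity. The following is a minimal sketch using union-find; it is not SWSE's code, and the choice of the lexicographically smallest identifier as the canonical one is an assumption.

```python
def consolidate(triples, ifps):
    """Group synonymous identifiers that share an IFP value.

    triples: iterable of (subject, predicate, object) string tuples.
    ifps: set of predicates treated as inverse functional.
    Returns a dict mapping every subject to its canonical identifier.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the smaller string becomes the canonical identifier.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_subject = {}  # (predicate, value) -> first subject seen with it
    for s, p, o in triples:
        find(s)  # register the subject even if it has no IFP
        if p in ifps:
            key = (p, o)
            if key in first_subject:
                union(first_subject[key], s)
            else:
                first_subject[key] = s

    return {x: find(x) for x in parent}
```

After consolidation, all statements about the synonymous identifiers can be merged under the canonical one before ranking.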

Sindice.com

• http://www.sindice.com

• Crawler: SindiceBot

– robots.rdf – semantic site maps

– crawling pingthesemanticweb.com

• 3 Indexes:

– URI index

– IFP index

– Keyword index
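The three indexes can be pictured as three inverted maps from a lookup key to the documents that mention it. The sketch below is an assumption about their general shape, not Sindice's implementation; the class name, the tokenisation, and the rule for telling URIs from literals are all illustrative.

```python
import re
from collections import defaultdict

URI_PREFIXES = ("http://", "https://")


class MiniIndex:
    """Three inverted indexes over RDF documents (illustrative only)."""

    def __init__(self, ifps):
        self.ifps = set(ifps)
        self.uri_index = defaultdict(set)      # URI -> documents
        self.ifp_index = defaultdict(set)      # (IFP, value) -> documents
        self.keyword_index = defaultdict(set)  # word -> documents

    def add_document(self, doc_url, triples):
        """Index one document given its (subject, predicate, object) triples."""
        for s, p, o in triples:
            for term in (s, o):
                if term.startswith(URI_PREFIXES):
                    self.uri_index[term].add(doc_url)
            if p in self.ifps:
                self.ifp_index[(p, o)].add(doc_url)
            # Assumption: anything that is not a URI is a literal to tokenise.
            if not o.startswith(URI_PREFIXES + ("mailto:",)):
                for word in re.findall(r"\w+", o.lower()):
                    self.keyword_index[word].add(doc_url)
```

A lookup by URI, by IFP value, or by keyword then returns the set of documents worth fetching for that key.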

Sindice Architecture

• Crawler:

– Apache Nutch

– Hadoop

– MapReduce

• Reasoner: OWLIM Reasoner

• Keyword based search: Solr

• http://www.sig.ma

Sindice Architecture

Basic structure

Diagram components: Structured data crawler, Unstructured data crawler, Documents repository, Data extractor, Indexer, Entity repository, Other apps using API, Searcher, Sorter

Basic structure

Diagram components: Crawler, Documents repository (Cache), Data extractor (Parser), Indexer, Entity repository, Other apps using API, Searcher, Sorter, Ping, Scheduler

Basic structure

Diagram components: Crawler, Indexer, SeRQL, Searcher, Sorter, OWLIM, Ping, Scheduler, Flat Files?, Sesame

Crawling Problems

• Locating resources (not a big problem nowadays)

• Re-Crawl Timing

• Live data sources

• Automatically generated data sources
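One common way to approach re-crawl timing is to adapt the revisit interval to how often a source changes. The sketch below is a generic heuristic offered only as an illustration; the halving and doubling factors and the bounds are assumptions, not something taken from any of the engines above.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=24.0 * 30):
    """Return the next revisit interval in hours.

    A source that changed since the last visit is revisited sooner,
    an unchanged one later; the result stays within the given bounds.
    """
    factor = 0.5 if changed else 2.0
    return max(min_hours, min(max_hours, current_hours * factor))
```

Live or automatically generated sources would quickly settle at the minimum interval, which is exactly why they are a scheduling problem.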

Storage Problems

• Ontology matching – structural and linguistic methods are not 100 % accurate

• Reasoning

– Tradeoff quality vs. scalability

– Data sources credibility (spamming)

• Indexing – tradeoff quality vs. scalability

– Keyword search vs. SPARQL

Searching Problems

• Extent of some queries, e.g. SELECT ?s ?o WHERE { ?s rdf:type ?o }

– Stop words

– Top-k results

• Results ordering

– Application of Page Rank – prone to spamming

– Resources credibility
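Stop-word filtering and top-k selection, the two mitigations listed for overly broad queries, can be sketched as follows. The stop-word list is a tiny placeholder and the result format of `(identifier, score)` pairs is an assumption.

```python
import heapq

# Placeholder list; a real engine would use a much larger, tuned one.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(terms):
    """Drop query terms that match almost every document."""
    return [term for term in terms if term.lower() not in STOP_WORDS]


def top_k(scored_results, k):
    """Return the k best (identifier, score) pairs without a full sort."""
    return heapq.nlargest(k, scored_results, key=lambda pair: pair[1])
```

`heapq.nlargest` keeps only k candidates in memory at a time, which matters when a query such as the one above matches a large share of the store.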

Semantic Web Crawlers

• Slug

– Simple – starts from a given set of documents and follows extracted URIs

– Bugs

• MultiCrawler

– No downloadable version

– Description in a paper

• Apache Nutch based solution
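The simple strategy described for Slug (start from a given set of documents and follow extracted URIs) amounts to a breadth-first traversal. The following is a minimal sketch of that idea, not Slug's code; `fetch_links` is a hypothetical callback that downloads a document and returns the URIs found in it, injected so the sketch stays independent of any HTTP or RDF library.

```python
from collections import deque


def crawl(seeds, fetch_links, max_documents=100):
    """Breadth-first crawl from the seed documents.

    seeds: iterable of starting URIs.
    fetch_links: callable(uri) -> iterable of URIs extracted from that document.
    Returns the URIs in the order they were visited.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    visited = []

    while queue and len(visited) < max_documents:
        uri = queue.popleft()
        visited.append(uri)
        for link in fetch_links(uri):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `max_documents` limit is there because following every extracted URI on the open web never terminates on its own.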

Java Triplestores I

• YARS2 – not developed any more (http://sw.deri.org/2004/06/yars/)

• Jena (http://jena.sourceforge.net/)

– TDB storage (access via API)

– SDB storage (SPARQL endpoint)

• Sesame (http://www.openrdf.org/)

– Sesame Server

– SeRQL

• Virtuoso (http://virtuoso.openlinksw.com)

– Unified storage engine (XML, SQL, RDF, Free Text)

– Berlin Benchmark


Java Triplestores II

• JRDF

– 2008 triplestore across Hadoop

– Currently no support for OWL

• Mulgara

– SPARQL, TQL

– Connection API