Source: research.ivolasek.com/_media/other-texts/existing_solutions.pdf
Semantic Search Engines
Existing Solutions
Linked Data
How can I get my dataset into the diagram?
• There must be resolvable http:// (or https://) URIs.
• They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples).
• The dataset must contain at least 1000 triples. (Hence, your FOAF file most likely does not qualify.)
How can I get my dataset into the diagram?
• The dataset must be connected via RDF links to a dataset that is already in the diagram. This means, either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links.
• Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
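The inclusion criteria listed on these two slides can be sketched as a simple checklist. This is only an illustration: the `DatasetInfo` record, its field names, and the access-method labels are hypothetical, and the thresholds (1000 triples, 50 links) are the ones quoted above.

```python
from dataclasses import dataclass

MIN_TRIPLES = 1000   # "at least 1000 triples"
MIN_LINKS = 50       # "at least 50 links" to a dataset already in the diagram
ACCEPTED_ACCESS = {"crawling", "dump", "sparql"}  # hypothetical labels


@dataclass
class DatasetInfo:
    """Hypothetical summary of a dataset; not part of any real tool."""
    uris_resolve_to_rdf: bool
    triple_count: int
    links_to_diagram_datasets: int
    access_methods: frozenset


def unmet_criteria(ds):
    """Return the list of criteria the dataset fails (empty list = qualifies)."""
    problems = []
    if not ds.uris_resolve_to_rdf:
        problems.append("URIs do not resolve to RDF")
    if ds.triple_count < MIN_TRIPLES:
        problems.append("fewer than 1000 triples")
    if ds.links_to_diagram_datasets < MIN_LINKS:
        problems.append("fewer than 50 RDF links")
    if not (set(ds.access_methods) & ACCEPTED_ACCESS):
        problems.append("no crawl, dump, or SPARQL access")
    return problems
```

A small FOAF file would typically fail on the size and link thresholds, which matches the remark on the slide.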
Why Linked Data?
• Easier search for structured documents (use of URIs in RDF triples is similar to the use of URLs in classical links)
• Easier ontology matching
– Central authorities providing URIs for other data sources (e.g. DBpedia)
Semantic Search Engines
Document-Centric Semantic Search Engines
Watson
• http://kmi-web05.open.ac.uk/WatsonWUI/
• Parsing: Jena
• Repository: Jena?
• Reasoning: NO
• Keyword based search, SPARQL endpoint
Watson - Schema
Swoogle
• http://swoogle.umbc.edu/
• Crawler: 3 Custom Crawlers
– Google Crawler (.rdf, .owl files)
– Focused Crawler
– Extracted URIs crawler
• Repository: Jena
• Index: Lucene
• Keyword based search
Swoogle Architecture
Data Analysis
• Classification of Semantic Web Documents
– Databases – make assertions about individuals
– Ontologies – define new terms
• Compute rank of SWDs
• Search ordering: Swoogle PR (PageRank), analogous to GPR (Google PageRank)
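To make the PageRank analogy concrete, here is a minimal sketch of plain PageRank by power iteration over a graph of Semantic Web documents. It is not Swoogle's actual ranking, which the slides do not specify; the damping factor, iteration count, and example file names are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each document to the list of documents it refers to.
    Documents without outgoing links spread their rank evenly over all nodes.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank uniformly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Two data documents that both use terms from one ontology (hypothetical names).
example = {"a.rdf": ["onto.owl"], "b.rdf": ["onto.owl"], "onto.owl": []}
ranks = pagerank(example)
```

In this toy graph the ontology document ends up with the highest rank because both data documents point to it.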
Entity-Centric Semantic Search Engines
Falcons
• http://iws.seu.edu.cn/services/falcons/
• Reasoning/Ontology matching: Falcon-ao
• Search ordering: TF-IDF in combination with popularity of ontologies
• Classes recommendation: Ordering according to their popularity
• Keyword search: Based on the indexed texts extracted from Virtual Documents
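The slide does not say how Falcons combines TF-IDF with popularity, so the following is only a sketch under assumptions: relevance is a simple TF-IDF score over the text of each virtual document, and it is multiplied by a logarithmically damped popularity count. The function names, the smoothing in the IDF term, and the combination formula are all illustrative choices.

```python
import math
from collections import Counter


def tf_idf_scores(query, virtual_docs):
    """Score each entity's virtual document against the query terms."""
    tokenized = {e: Counter(text.lower().split()) for e, text in virtual_docs.items()}
    n = len(tokenized)
    scores = {}
    for entity, counts in tokenized.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for c in tokenized.values() if term in c)
            if df == 0 or total == 0:
                continue
            # Smoothed IDF (assumption) so terms found in every document still count.
            score += (counts[term] / total) * math.log(1.0 + n / df)
        scores[entity] = score
    return scores


def rank(query, virtual_docs, popularity):
    """Combine relevance with popularity (assumed multiplicative combination)."""
    relevance = tf_idf_scores(query, virtual_docs)
    combined = {
        entity: score * (1.0 + math.log1p(popularity.get(entity, 0)))
        for entity, score in relevance.items()
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

With this combination, an entity that does not match the query keeps a score of zero regardless of how popular it is.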
Falcons Screenshot
Falcon-ao
• Linguistic Matching for Ontologies
– Virtual Documents (names, labels, comments)
– Levenshtein edit distance
– Vector Space Model + cosine similarity of VDs
• Graph Matching for Ontologies
– Similarity of two entities comes from the accumulation of similarities of involved statements
– Similarity of two statements comes from the accumulation of similarities of involved entities
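The two linguistic measures named above can be sketched as follows. This is a generic illustration, not Falcon-ao's code: the whitespace tokenization and raw term-count vectors are assumptions, and a real matcher would likely weight terms and normalise names first.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two virtual documents as term-count vectors."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Edit distance compares short names character by character, while the vector space comparison works on the longer text gathered into each virtual document (names, labels, comments).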
SWSE
• http://swse.deri.org/
• Crawler: MultiCrawler
• Repository: YARS2 – storing quadruples (subject, predicate, object, context)
• Ontology matching: URIs, IFPs
• Reasoning: Future work (Scalable Authoritative OWL Reasoner - SAOR)
• Search ordering: ReConRank (Page Rank for Linked Data)
• Keyword based search: Lucene
SWSE Architecture
• Consolidate – find synonymous identifiers
• Rank – links-based analysis, scores assignment
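The consolidation step can be illustrated with inverse functional properties (IFPs): two subjects that share the same value for an IFP, such as an e-mail address, are treated as the same entity. The sketch below uses a union-find structure over quadruples; it is an assumption about how such a step could work, not SWSE's implementation, and the choice of the lexicographically smallest identifier as canonical is arbitrary.

```python
def consolidate(quads, ifps):
    """Map every subject to a canonical identifier.

    quads: iterable of (subject, predicate, object, context) tuples.
    ifps:  set of predicates treated as inverse functional properties.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller identifier as the canonical one (arbitrary choice).
            parent[max(ra, rb)] = min(ra, rb)

    first_subject = {}
    for s, p, o, _context in quads:
        find(s)  # register the subject
        if p in ifps:
            key = (p, o)
            if key in first_subject:
                union(first_subject[key], s)
            else:
                first_subject[key] = s
    return {s: find(s) for s in list(parent)}
```

Because the merge is transitive, a chain of shared IFP values across several sources collapses into a single canonical identifier.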
Sindice.com
• http://www.sindice.com
• Crawler: SindiceBot
– robots.rdf – semantic site maps
– crawling pingthesemanticweb.com
• 3 Indexes:
– URI index
– IFP index
– Keyword index
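As a rough picture of the three indexes, each can be seen as an inverted index from a key (a URI, an IFP/value pair, or a keyword) to the documents that mention it. The sketch below is illustrative only: the tuple layout with an explicit literal flag, the whitespace tokenization, and the example URLs are assumptions, and nothing here reflects Sindice's real storage.

```python
from collections import defaultdict


def build_indexes(documents, ifps):
    """Build URI, IFP, and keyword indexes.

    documents: {doc_url: [(subject, predicate, object, object_is_literal), ...]}
    ifps:      set of predicates treated as inverse functional properties.
    """
    uri_index = defaultdict(set)      # URI -> documents mentioning it
    ifp_index = defaultdict(set)      # (IFP, value) -> documents stating it
    keyword_index = defaultdict(set)  # word in a literal -> documents
    for doc_url, triples in documents.items():
        for s, p, o, is_literal in triples:
            uri_index[s].add(doc_url)
            if is_literal:
                for word in o.lower().split():
                    keyword_index[word].add(doc_url)
            else:
                uri_index[o].add(doc_url)
            if p in ifps:
                ifp_index[(p, o)].add(doc_url)
    return uri_index, ifp_index, keyword_index
```

A lookup then returns the set of source documents for a given URI, IFP value, or keyword, which is the kind of "where is this mentioned" answer a document-locating service needs.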
Sindice Architecture
• Crawler:
– Apache Nutch
– Hadoop
– MapReduce
• Reasoner: OWLIM Reasoner
• Keyword based search: Solr
• http://www.sig.ma
Sindice Architecture
Basic structure
Structured data crawler
Unstructured data crawler
Documents repository
Data extractor
Indexer
Entity repository
Other apps using API
Searcher
Sorter
Basic structure
Crawler
Documents repository
(Cache)
Data extractor (Parser)
Indexer
Entity repository
Other apps using API
Searcher
Sorter
Ping
Scheduler
Basic structure
Crawler
Indexer
SeRQL
Searcher
Sorter
OWLIM
Ping
Scheduler
Flat Files?
Sesame
Crawling Problems
• Locating resources (not such a big problem nowadays)
• Re-Crawl Timing
• Live data sources
• Automatically generated data sources
Storage Problems
• Ontology matching – structural and linguistic methods are not 100 % accurate
• Reasoning
– Tradeoff quality vs. scalability
– Data sources credibility (spamming)
• Indexing – tradeoff quality vs. scalability
– Keyword search vs. SPARQL
Searching Problems
• Extent of some queries
SELECT ?s ?o
WHERE { ?s rdf:type ?o }
– Stop words
– Top-k results
• Results ordering
– Application of Page Rank – prone to spamming
– Resources credibility
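One common way to keep very broad queries manageable is to return only the k best-scored results instead of the full result set. The sketch below shows that idea with a bounded selection; the `(resource, score)` pair layout is an assumption, and how scores are produced is left open.

```python
import heapq


def top_k(scored_results, k):
    """Return the k highest-scored (resource, score) pairs, best first.

    scored_results may be any iterable, including a generator, so the full
    result set never has to be sorted or held in sorted form.
    """
    return heapq.nlargest(k, scored_results, key=lambda pair: pair[1])
```

For a query such as the `rdf:type` example above, which matches almost every typed resource, only the top few entries would be materialised for the user.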
Semantic Web Crawler
• Slug
– Simple – starts from a given set of documents and follows extracted URIs
– Bugs
• MultiCrawler
– No downloadable version
– Description in a paper
• Apache Nutch based solution
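The simple strategy described for Slug (start from a given set of documents and follow the URIs extracted from them) amounts to a breadth-first traversal. The sketch below is a generic illustration, not Slug's code: `fetch_links` is a hypothetical callback standing in for "download the document and extract its URIs", and `max_documents` is an assumed safety limit.

```python
from collections import deque


def crawl(seeds, fetch_links, max_documents=100):
    """Breadth-first crawl from seed URLs.

    fetch_links(url) must return the URIs extracted from that document.
    Returns the list of visited URLs in visiting order.
    """
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same URI twice
                seen.add(link)
                queue.append(link)
    return visited


# Usage with an in-memory "web" instead of real HTTP requests (hypothetical data).
web = {"a.rdf": ["b.rdf", "c.rdf"], "b.rdf": ["a.rdf", "d.rdf"]}
order = crawl(["a.rdf"], lambda url: web.get(url, []))
```

Passing the fetch step in as a function keeps the traversal logic testable without network access; re-crawl timing and politeness rules would sit on top of this loop.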
Java Triplestores I
• YARS2 – no longer developed (http://sw.deri.org/2004/06/yars/)
• Jena (http://jena.sourceforge.net/)
– TDB storage (access via API)
– SDB storage (SPARQL endpoint)
• Sesame (http://www.openrdf.org/)
– Sesame Server
– SeRQL
• Virtuoso (http://virtuoso.openlinksw.com)
– Unified storage engine (XML, SQL, RDF, Free Text)
– Berlin Benchmark
Java Triplestores II
• JRDF
– 2008 triplestore across Hadoop
– Currently no support for OWL
• Mulgara
– SPARQL, TQL
– Connection API