Source: richard.cyganiak.de/2007/11/bristol/sindice-talk.pdf


Sindice.com:

A Semantic Web Search Engine

Giovanni Tummarello, Renaud Delbru, Eyal Oren, Richard Cyganiak

Digital Enterprise Research Institute
National University of Ireland, Galway

November 23, 2007

Richard Cyganiak Sindice.com 1 of 25

The Semantic Web is a reality

- Many gigabytes of RDF dumps
- 30+ public SPARQL endpoints
- Linked Data, 5+ different browsers
- RDFa


[Figure: diagram of interlinked datasets. Labels: SW Conference Corpus, DBpedia, RDF Book Mashup, DBLP Berlin, Revyu, Project Gutenberg, FOAF, Geonames, Musicbrainz, Magnatune, Jamendo, World Factbook, DBLP Hannover, SIOC, SemWebCentral, Eurostat, ECS Southampton, BBC Later + TOTP, Freshmeat, OpenGuides, GovTrack, US Census Data, W3C WordNet, flickr wrappr, Wikicompany, OpenCyc, lingvoj, Ontoworld. Several datasets, including lingvoj, are marked "NEW!".]


We don’t worry about running out of data


Sindice.com


Sindice API

- http://sindice.com/query/lookup?uri=...
- http://sindice.com/query/lookup?keyword=...
- http://sindice.com/query/lookup?property=...&object=...
- Ask for HTML, plain text, RDF/XML or JSON via content negotiation
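
A client could build these lookup requests roughly as follows. This is a minimal sketch, not the official client: the helper name `build_lookup_request`, the example argument values, and the media types sent in the `Accept` header are assumptions. The request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://sindice.com/query/lookup"  # endpoint shown on the slide

# Assumed media types for the four formats named on the slide.
ACCEPT = {
    "html": "text/html",
    "text": "text/plain",
    "rdfxml": "application/rdf+xml",
    "json": "application/json",
}


def build_lookup_request(fmt="json", **params):
    """Build (but do not send) a lookup request.

    params must be one of: uri=..., keyword=..., or property=... with object=...
    The response format is selected through the Accept header.
    """
    allowed = [{"uri"}, {"keyword"}, {"property", "object"}]
    if set(params) not in allowed:
        raise ValueError(f"unsupported parameter combination: {sorted(params)}")
    url = f"{BASE}?{urlencode(params)}"
    return Request(url, headers={"Accept": ACCEPT[fmt]})


req = build_lookup_request(uri="http://dbpedia.org/resource/Busan")
print(req.full_url)
print(req.get_header("Accept"))
```

Keeping the format choice in the `Accept` header rather than in the URL means the same lookup URL can serve browsers and RDF-aware clients alike.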


Scenario (1)

- Tom surfs to http://dbpedia.org/resource/Busan
- Tom wants more than just DBpedia’s information
- Tom’s Tabulator has a Sindice plugin
- Tom presses ‘lookup on Sindice’
- Tom gets a top-ten list of Busan sources
- Tom selects his two trustworthy sources
- Tom’s Tabulator downloads this data
- Tom continues his happy data-surfing


Scenario (2)

- Tom goes out to eat in Busan
- Tom likes the food and reviews the restaurant
- Tom’s review site pings Sindice with the update
- Within an hour, others can find this info
- Tom continues his happy fish-eating


Sindice: discover Semantic Web resources


Indexing approach

- IR viewpoint: the Semantic Web is a bunch of documents
- DB viewpoint: the Semantic Web is a bunch of triples
- We take the IR viewpoint: we index all identifiers and provide simple lookups; no RDF queries
- Clients can browse, download, and display RDF data themselves; we tell them where to find it


Sindice functionality (operators)

- index : url → ∅
- lookup : uri → {url}
- lookup : text → {url}
- lookup : ifp × value → {url}

Natural data structure: inverted index over documents
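
The four operators above map naturally onto a small in-memory inverted index. The sketch below is illustrative only: the class and method names are hypothetical, terms are assumed to be extracted before indexing, and the production system uses Solr rather than a Python dictionary.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps index terms (URIs, words, IFP/value pairs) to document URLs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, url, uris=(), text="", ifp_pairs=()):
        """index : url -> nothing. Terms are passed in already extracted."""
        for uri in uris:
            self._postings[("uri", uri)].add(url)
        for word in text.lower().split():
            self._postings[("text", word)].add(url)
        for ifp, value in ifp_pairs:
            self._postings[("ifp", ifp, value)].add(url)

    def lookup_uri(self, uri):
        """lookup : uri -> {url}"""
        return set(self._postings.get(("uri", uri), ()))

    def lookup_text(self, text):
        """lookup : text -> {url}. Returns documents containing every query word."""
        words = text.lower().split()
        if not words:
            return set()
        postings = [self._postings.get(("text", w), set()) for w in words]
        return set.intersection(*postings)

    def lookup_ifp(self, ifp, value):
        """lookup : ifp x value -> {url}"""
        return set(self._postings.get(("ifp", ifp, value), ()))


idx = InvertedIndex()
idx.index(
    "http://example.org/busan.rdf",
    uris=["http://dbpedia.org/resource/Busan"],
    text="Busan port city",
)
idx.index(
    "http://example.org/tom.rdf",
    uris=["http://example.org/tom"],
    text="Tom visits Busan",
    ifp_pairs=[("foaf:mbox", "mailto:tom@example.org")],
)
print(idx.lookup_text("busan"))
```

Every lookup returns document URLs only; fetching and interpreting the RDF is left to the client, in line with the IR viewpoint.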


Sindice architecture


Index lookup

- Index retrieval
- Ranking phase
- Result generation


Graph processing

1. Fetch RDF data

2. Extract and index full-text literals

3. Extract and index mentioned URIs

4. Extract graph metadata (size and length)

5. Graph expansion and inferencing

6. Extract labels

7. Extract and index mentioned IFP pairs
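
Steps 2, 3 and 7 can be sketched over an in-memory list of triples. Everything here is an assumption made for illustration: the function name `extract_terms`, the representation of objects as `(kind, value)` tuples, the decision to count predicates as mentioned URIs, and the example data. Fetching, metadata, inferencing and label extraction (steps 1 and 4 to 6) are left out.

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def extract_terms(triples, ifps):
    """Collect the index terms of one graph.

    triples: iterable of (subject, predicate, (kind, value)),
             where kind is "uri" or "literal".
    ifps:    set of predicate URIs known to be inverse functional.
    """
    literals = []      # step 2: full-text literals
    uris = set()       # step 3: mentioned URIs
    ifp_pairs = set()  # step 7: (IFP, value) pairs
    for subject, predicate, (kind, value) in triples:
        uris.update((subject, predicate))
        if kind == "uri":
            uris.add(value)
        else:
            literals.append(value)
        if predicate in ifps:
            ifp_pairs.add((predicate, value))
    return {"literals": literals, "uris": uris, "ifp_pairs": ifp_pairs}


triples = [
    ("http://example.org/tom", FOAF + "name", ("literal", "Tom")),
    ("http://example.org/tom", FOAF + "mbox", ("uri", "mailto:tom@example.org")),
]
terms = extract_terms(triples, ifps={FOAF + "mbox"})
print(terms["ifp_pairs"])
```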


IFP processing


Graph processing: IFP extraction

- OWL reasoning is needed to find inverse functional properties (IFPs), but it is computationally expensive
- Desirable: a reasoning cache to reuse computation
- Undesirable: global trust in all statements


Solution: quarantined reasoning cache

- Recursively fetch all mentioned schemas
- Compute the closure of the union of the schemas
- Query and store all properties that are IFPs
- {foaf:name, dc:title, foaf:homepage, foaf:mbox} → {foaf:mbox}
- For any document that uses the same properties, the set of possible IFPs is already known
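
The caching idea can be sketched as a memo table keyed by the set of properties a document uses. This is a minimal sketch under stated assumptions: `ReasoningCache` and `fake_reasoner` are hypothetical names, and the stand-in reasoner simply hard-codes the slide's example instead of fetching schemas and running OWL inference.

```python
class ReasoningCache:
    """Caches, per set of used properties, which of them are IFPs."""

    def __init__(self, compute_ifps):
        # compute_ifps stands for the expensive part: fetch the schemas,
        # compute their closure, and query for inverse functional properties.
        self._compute_ifps = compute_ifps
        self._cache = {}

    def ifps_for(self, properties):
        key = frozenset(properties)
        if key not in self._cache:
            self._cache[key] = frozenset(self._compute_ifps(key))
        return self._cache[key]


calls = []


def fake_reasoner(properties):
    """Stand-in for real reasoning; mirrors the example on the slide."""
    calls.append(properties)
    return properties & {"foaf:mbox"}


cache = ReasoningCache(fake_reasoner)
props = ["foaf:name", "dc:title", "foaf:homepage", "foaf:mbox"]
first = cache.ifps_for(props)
second = cache.ifps_for(reversed(props))  # same set, so no second reasoning run
print(sorted(first), len(calls))
```

Keying on the property set keeps each computation confined to the schemas that set mentions, so a statement in one document's schemas does not affect the IFPs inferred for unrelated documents.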


Sindice components

- Hadoop (parallel processing)
- HTable (document cache)
- Solr (document index)
- Sesame & OWLIM (reasoning)
- Ruby on Rails (frontend)
- pingthesemanticweb.com


Tool for data providers: Semantic Sitemap

- Sitemap protocol exposes “deep web” to crawlers
- Semantic sitemap adds Semantic Web data
- http://sw.deri.org/2007/07/sitemapextension/
- Used by: Geonames, DBLP, Uniprot, DBpedia, data.semanticweb.org


Semantic Sitemap: example

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.com</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.com/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpoint>http://example.com/sparql</sc:sparqlEndpoint>
    <sc:dataDumpLocation>http://example.com/all.rdf</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>
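
A crawler might read such a sitemap along the following lines. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `parse_datasets` and the keys of the returned dictionaries are assumptions, and only the elements shown in the example are handled.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.com</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.com/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpoint>http://example.com/sparql</sc:sparqlEndpoint>
    <sc:dataDumpLocation>http://example.com/all.rdf</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>"""


def parse_datasets(xml_text):
    """Return one dict per <sc:dataset> element in the sitemap."""
    root = ET.fromstring(xml_text)
    datasets = []
    for ds in root.findall("sc:dataset", NS):
        datasets.append({
            "label": ds.findtext("sc:datasetLabel", namespaces=NS),
            "linked_data_prefix": ds.findtext("sc:linkedDataPrefix", namespaces=NS),
            "sparql_endpoint": ds.findtext("sc:sparqlEndpoint", namespaces=NS),
            "data_dump": ds.findtext("sc:dataDumpLocation", namespaces=NS),
            # changefreq belongs to the plain sitemap namespace, not the extension
            "changefreq": ds.findtext("sm:changefreq", namespaces=NS),
        })
    return datasets


datasets = parse_datasets(SITEMAP)
print(datasets[0]["data_dump"])
```

With the dump location in hand, a crawler can download one file instead of requesting every URL under the linked data prefix.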


What about other search engines?

- We do not answer queries but refer to data sources
- We have IFP lookup using OWL reasoning
- We have semantic sitemap for data dumps
- We support linked data (input and output)
- We have fully open client APIs
- We have Hadoop infrastructure
- We have live, continuous updates
- Simplicity, efficiency, scalability


Credits

- Giovanni Tummarello
- Eyal Oren
- Michele Catasta
- Renaud Delbru
- Holger Stenzhorn
- Adam Westerski
- OpenLink Software


Upcoming as we speak...

- Validator API
- Trust assessment API
- SW Pipes and widgets platform
- Entity-based API (Okkam)
- Growing hardware cluster (possibly 100 nodes)


Summary

- Sindice: lookup service for Semantic Web resources
- Lookup: resources by URI, IFP, or keyword
- Architecture: based on Hadoop, Solr and OWLIM
- Data: DBLP, DBpedia, Uniprot, Geonames and more
- 20M+ documents, 80M+ URIs, 4M+ IFPs, 2B+ triples
