Sindice.com:
A Semantic Web Search Engine
Giovanni Tummarello, Renaud Delbru, Eyal Oren, Richard Cyganiak
Digital Enterprise Research Institute, National University of Ireland, Galway
November 23, 2007
Richard Cyganiak Sindice.com 1 of 25
The Semantic Web is a reality
- Many gigabytes of RDF dumps
- 30+ public SPARQL endpoints
- Linked Data, 5+ different browsers
- RDFa
The Semantic Web is a reality
[Figure: Linked Data cloud diagram. Datasets shown: SW Conference Corpus, DBpedia, RDF Book Mashup, DBLP Berlin, Revyu, Project Gutenberg, FOAF, Geonames, MusicBrainz, Magnatune, Jamendo, World Factbook, DBLP Hannover, SIOC, SemWebCentral, Eurostat, ECS Southampton, BBC Later + TOTP, Freshmeat, OpenGuides, GovTrack, US Census Data, W3C WordNet, flickr wrappr, Wikicompany, OpenCyc, lingvoj, Ontoworld. Some datasets are marked NEW! in the diagram.]
The Semantic Web is a reality
We don’t worry about running out of data
Sindice.com
Sindice API
- http://sindice.com/query/lookup?uri=...
- http://sindice.com/query/lookup?keyword=...
- http://sindice.com/query/lookup?property=...&object=...
- Ask for HTML, plain text, RDF/XML or JSON via content negotiation
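A minimal sketch of how a client might prepare these lookups, using only the Python standard library. The endpoint and the parameter names (uri, keyword, property, object) come from the slide; the media types in ACCEPT_TYPES, the helper name build_lookup_request, and the example values are assumptions made for illustration. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL as shown on the slide.
LOOKUP_ENDPOINT = "http://sindice.com/query/lookup"

# Assumed media types for the four formats named on the slide.
ACCEPT_TYPES = {
    "html": "text/html",
    "text": "text/plain",
    "rdfxml": "application/rdf+xml",
    "json": "application/json",
}


def build_lookup_request(params, fmt="json"):
    """Build (but do not send) a lookup request; fmt selects the Accept header."""
    url = LOOKUP_ENDPOINT + "?" + urlencode(params)
    return Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


# One request per lookup style; all values are illustrative.
uri_request = build_lookup_request({"uri": "http://dbpedia.org/resource/Busan"})
keyword_request = build_lookup_request({"keyword": "Busan"}, fmt="rdfxml")
ifp_request = build_lookup_request(
    {"property": "http://xmlns.com/foaf/0.1/mbox",
     "object": "mailto:tom@example.org"},
    fmt="text",
)
```

Passing one of these objects to urllib.request.urlopen would perform the actual HTTP call; the response format then depends on what the server makes of the Accept header.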
Scenario (1)
- Tom surfs to http://dbpedia.org/resource/Busan
- Tom wants more than just DBpedia’s information
- Tom’s Tabulator has a Sindice plugin
- Tom presses ‘lookup on Sindice’
- Tom gets a top-ten list of Busan sources
- Tom selects his two trustworthy sources
- Tom’s Tabulator downloads this data
- Tom continues his happy data-surfing
Scenario (2)
- Tom goes out to eat in Busan
- Tom likes the food and reviews the restaurant
- Tom’s review site pings Sindice with the update
- Within an hour, others can find this info
- Tom continues his happy fish-eating
Sindice: discover Semantic Web resources
Indexing approach
- IR viewpoint: the SW is a bunch of documents
- DB viewpoint: the SW is a bunch of triples
- We take the IR viewpoint: we index all identifiers and provide simple lookups; no RDF queries
- Clients can browse/download/display RDF data themselves; we tell them where to find it
Sindice functionality (operators)
- index: url → ∅
- lookup: uri → {url}
- lookup: text → {url}
- lookup: ifp × value → {url}
Natural data structure: inverted index over documents
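The four operators map naturally onto an in-memory inverted index. The following is a toy sketch, not Sindice's implementation: the class name DocumentIndex is hypothetical, and index() here receives already-extracted URIs, literals and IFP pairs instead of fetching the document behind the URL. Treating a multi-word text query as "all words must match" is also an assumption.

```python
from collections import defaultdict


class DocumentIndex:
    """Toy inverted index: every key maps to the set of document URLs mentioning it."""

    def __init__(self):
        self._by_uri = defaultdict(set)
        self._by_word = defaultdict(set)
        self._by_ifp = defaultdict(set)

    def index(self, url, uris=(), literals=(), ifp_pairs=()):
        """index: url -> nothing. Records what the document at url mentions."""
        for uri in uris:
            self._by_uri[uri].add(url)
        for literal in literals:
            for word in literal.lower().split():
                self._by_word[word].add(url)
        for prop, value in ifp_pairs:
            self._by_ifp[(prop, value)].add(url)

    def lookup_uri(self, uri):
        """lookup: uri -> {url}"""
        return set(self._by_uri.get(uri, ()))

    def lookup_text(self, text):
        """lookup: text -> {url}. Returns documents containing every query word."""
        words = text.lower().split()
        if not words:
            return set()
        postings = [self._by_word.get(word, set()) for word in words]
        return set.intersection(*postings)

    def lookup_ifp(self, prop, value):
        """lookup: ifp x value -> {url}"""
        return set(self._by_ifp.get((prop, value), ()))


idx = DocumentIndex()
idx.index(
    "http://example.org/doc1",
    uris=["http://dbpedia.org/resource/Busan"],
    literals=["Busan fish market"],
    ifp_pairs=[("foaf:mbox", "mailto:tom@example.org")],
)
idx.index(
    "http://example.org/doc2",
    uris=["http://dbpedia.org/resource/Busan"],
    literals=["Busan harbour"],
)
```

Every lookup returns only document URLs, in line with the IR viewpoint: the client fetches and interprets the RDF itself.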
Sindice architecture
Index lookup
- Index retrieval
- Ranking phase
- Result generation
Graph processing
1. Fetch RDF data
2. Extract and index full-text literals
3. Extract and index mentioned URIs
4. Extract graph metadata (size and length)
5. Graph expansion and inferencing
6. Extract labels
7. Extract and index mentioned IFP pairs
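Steps 2, 3 and 7 can be sketched over an already-fetched graph as follows. This is an illustration under several assumptions: triples are plain tuples of a hypothetical Term type, predicates count as "mentioned URIs", and the set of IFPs is passed in ready-made. Fetching, metadata, inferencing and label extraction are left out.

```python
from typing import NamedTuple


class Term(NamedTuple):
    kind: str   # "uri" or "literal"
    value: str


def uri(value):
    return Term("uri", value)


def lit(value):
    return Term("literal", value)


def extract_index_entries(triples, ifps):
    """Collect literals, mentioned URIs and (IFP, value) pairs from a graph."""
    literals, uris, ifp_pairs = [], set(), set()
    for s, p, o in triples:
        # Step 3: every URI in subject, predicate or object position is "mentioned".
        for term in (s, p, o):
            if term.kind == "uri":
                uris.add(term.value)
        # Step 2: literal objects go to the full-text index.
        if o.kind == "literal":
            literals.append(o.value)
        # Step 7: a triple whose predicate is a known IFP yields a lookup pair.
        if p.value in ifps:
            ifp_pairs.add((p.value, o.value))
    return {"literals": literals, "uris": uris, "ifp_pairs": ifp_pairs}


FOAF = "http://xmlns.com/foaf/0.1/"
tom = uri("http://example.org/people/tom")  # illustrative resource
triples = [
    (tom, uri(FOAF + "name"), lit("Tom")),
    (tom, uri(FOAF + "mbox"), uri("mailto:tom@example.org")),
    (tom, uri(FOAF + "based_near"), uri("http://dbpedia.org/resource/Busan")),
]
entries = extract_index_entries(triples, ifps={FOAF + "mbox"})
```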
IFP processing
Graph processing: IFP extraction
- OWL reasoning is needed to find IFPs, but it is computationally expensive
- Desirable: a reasoning cache to reuse computation
- Undesirable: global trust in all statements
Solution: quarantined reasoning cache
- Recursively fetch all mentioned schemas
- Compute the closure of the union of these schemas
- Query and store all properties that are an IFP
- {foaf:name, dc:title, foaf:homepage, foaf:mbox} → {foaf:mbox}
- For any document that uses the same properties, you know the set of possible IFPs
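The caching idea can be sketched as a map from the set of properties a document uses to the IFPs found for that set. Everything named here is hypothetical: ReasoningCache, its find_ifps callback (which stands in for schema fetching plus OWL reasoning), and the toy callback whose answer is hard-coded from the slide's example.

```python
class ReasoningCache:
    """Cache IFP results per set of properties, so reasoning runs once per set."""

    def __init__(self, find_ifps):
        self._find_ifps = find_ifps   # expensive step: fetch schemas, compute closure
        self._cache = {}
        self.misses = 0               # number of times the expensive step ran

    def ifps_for(self, properties):
        key = frozenset(properties)   # order and duplicates do not matter
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = frozenset(self._find_ifps(key))
        return self._cache[key]


def toy_find_ifps(properties):
    """Stand-in for real reasoning; the result mirrors the slide's example."""
    declared_ifps = {"foaf:mbox"}
    return properties & declared_ifps


cache = ReasoningCache(toy_find_ifps)
first = cache.ifps_for(["foaf:name", "dc:title", "foaf:homepage", "foaf:mbox"])
# Same properties in a different order: answered from the cache.
second = cache.ifps_for(["foaf:mbox", "foaf:homepage", "dc:title", "foaf:name"])
```

Because each entry is keyed by the exact property set, a result computed from one document's schemas is reused only for documents that draw on the same properties, which is one way to read "quarantined".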
Sindice components
- Hadoop (parallel processing)
- HTable (document cache)
- Solr (document index)
- Sesame & OWLIM (reasoning)
- Ruby on Rails (frontend)
- pingthesemanticweb.com
Tool for data providers: Semantic Sitemap
- Sitemap protocol exposes “deep web” to crawlers
- Semantic sitemap adds Semantic Web data
- http://sw.deri.org/2007/07/sitemapextension/
- Used by: Geonames, DBLP, Uniprot, DBpedia, data.semanticweb.org
Semantic Sitemap: example
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.com</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.com/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpoint>http://example.com/sparql</sc:sparqlEndpoint>
    <sc:dataDumpLocation>http://example.com/all.rdf</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>
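A crawler could read such a sitemap with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a well-formed copy of the example; the function name read_datasets and the keys of the returned dictionaries are illustrative choices, not part of the sitemap extension.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the example sitemap.
NAMESPACES = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

EXAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.com</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.com/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpoint>http://example.com/sparql</sc:sparqlEndpoint>
    <sc:dataDumpLocation>http://example.com/all.rdf</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>
"""


def read_datasets(xml_text):
    """Return one dict per sc:dataset element found in the sitemap."""
    root = ET.fromstring(xml_text)
    datasets = []
    for ds in root.findall("sc:dataset", NAMESPACES):
        datasets.append({
            "label": ds.findtext("sc:datasetLabel", namespaces=NAMESPACES),
            "linked_data_prefix": ds.findtext("sc:linkedDataPrefix", namespaces=NAMESPACES),
            "sparql_endpoint": ds.findtext("sc:sparqlEndpoint", namespaces=NAMESPACES),
            "data_dump": ds.findtext("sc:dataDumpLocation", namespaces=NAMESPACES),
            # changefreq lives in the plain sitemap namespace, not the extension.
            "changefreq": ds.findtext("sm:changefreq", namespaces=NAMESPACES),
        })
    return datasets


datasets = read_datasets(EXAMPLE_SITEMAP)
```

With the dump location in hand, an indexer can download one file instead of crawling every URL under the linked data prefix.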
What about other search engines?
- We do not answer queries but refer to data sources
- We have IFP lookup using OWL reasoning
- We have the semantic sitemap for data dumps
- We support linked data (input and output)
- We have fully open client APIs
- We have a Hadoop infrastructure
- We have live, continuous updates
- Simplicity, efficiency, scalability
Credits
- Giovanni Tummarello
- Eyal Oren
- Michele Catasta
- Renaud Delbru
- Holger Stenzhorn
- Adam Westerski
- OpenLink Software
Upcoming as we speak . . .
- Validator API
- Trust assessment API
- SW Pipes and widgets platform
- Entity-based API (Okkam)
- Growing hardware cluster (possibly 100 nodes)
Summary
- Sindice: lookup service for Semantic Web resources
- Lookup: resources by URI, IFP, or keyword
- Architecture: based on Hadoop, Solr and OWLIM
- Data: DBLP, DBpedia, Uniprot, Geonames and more
- 20M+ documents, 80M+ URIs, 4M+ IFPs, 2B+ triples