
Introduction to RDF, Jena, SparQL, and the “Semantic Web”

Michael Grobe

Pervasive Technology Institute

Indiana University

October 12, 2009


This presentation in perspective

This is actually one of a series of presentations on Linked Data Web and semantic technologies:

- Introduction to ontologies
- This one, on RDF, Jena, SparQL, and the “Semantic Web”
- Using inference and OWL

In general, these Semantic technology topics seem “deceptively simple,” but are fraught with complications, limitations, and qualifications…especially when the casual user attempts to compare them with relational data approaches to the same or similar problems.


Topics

Simple introduction to the semantic approach
- sentences as triples and graphs
- sentence components encoded using URIs
- serializing sentences using the Resource Description Framework (RDF)
- storing semantically encoded data in triplestores
- browsing information encoded in RDF

Accessing and querying semantic data
- Introduction to SparQL
- Free-standing query clients: Twinkle, RDF-gravity, Explorator
- Jena: software for manipulating triples

Preeminent semantic resources
- DBpedia
- Bio2RDF “semantic web atlas of postgenomic knowledge”
- Queries using Virtuoso SparQL and iSparQL endpoints

Ontologies: what are they and how are they used?

Discussion of the semantic approach


From raw data to sentences

Here is some information that might be useful to you:

Smith 21
Smith Jones

Do you get it?

Would it help to see the data tables? Perhaps you could guess what I’m trying to say if you look at column names.

What’s missing here: the “relationships” between the separate pieces of “data”.

In natural languages these relationships are established by using “predicates” to form sentences that connect these components,

. . . as in the sentences on the next slide:


Sentences

. . . some information in sentence form:

Smith has age 21.
Jones has age 45.
Blake has age 12.
George has age 21.
Smith has favorite friend Jones.
Jones has favorite friend Smith.
Blake has favorite friend Blake.
George has favorite friend Smith.

where each sentence has the form:

Subject Predicate Object

also known as

Entity Property Value

and these elements are known together as a “triple”.


A “Sentence base”

We can put these triples into one or more files to build a “sentence base” to hold these sentences.

To help with manipulation and searching, each grammatical component is stored and accessed separately, so that each sentence retains its triple form:

Subject   Predicate             Object
Smith     has age               21
Jones     has age               45
Blake     has age               12
George    has age               21
Smith     has favorite friend   Jones
Jones     has favorite friend   Smith
Blake     has favorite friend   Blake
George    has favorite friend   Smith


Query sentences

We can query such information with queries like:

“Someone has favorite friend Smith?”

where “Someone” acts like a “variable” and “resolves” as the list:

Jones
George

because the pattern “Someone has favorite friend Smith” matches both triples:

Jones has favorite friend Smith
George has favorite friend Smith
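The matching step can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular triplestore works: the tuple representation, the function name match, and the convention that a component starting with “?” is a variable are all assumptions made for this sketch.

```python
# Toy "sentence base": each sentence is a (subject, predicate, object) tuple.
TRIPLES = [
    ("Smith", "has favorite friend", "Jones"),
    ("Jones", "has favorite friend", "Smith"),
    ("Blake", "has favorite friend", "Blake"),
    ("George", "has favorite friend", "Smith"),
]


def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits the pattern.

    A pattern component starting with "?" is a variable (an assumption of
    this sketch); anything else must equal the triple component exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            # No component failed, so this triple matches the pattern.
            results.append(binding)
    return results


bindings = match(("?someone", "has favorite friend", "Smith"), TRIPLES)
someone = [b["?someone"] for b in bindings]
print(someone)
```

Each matching triple contributes one value for the variable, which is why the query resolves as a list rather than a single answer.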


Query sentences

We can interpret a more complicated query like:

“Someone has favorite friend Smith and has age 21?”

as a pair of requirements:

“Someone has favorite friend Smith?” and

“Someone has age 21?”

where we mean “that same someone” has both characteristics . . .

in which case Someone will resolve as “George”, since George is the only “Someone” who satisfies both requirements via the following triples:

George has age 21
George has favorite friend Smith

Note that in both examples we have used triple “patterns” to query the triple store.
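A two-pattern query of this kind can be sketched as a repeated matching step in which bindings made by the first pattern constrain the second. The code below is a minimal illustration under the same assumptions as before (tuples for triples, “?” for variables); the helper names extend and solve are invented for this sketch, and the ages are the ones given in the sentence list above.

```python
TRIPLES = [
    ("Smith", "has age", "21"),
    ("Jones", "has age", "45"),
    ("Blake", "has age", "12"),
    ("George", "has age", "21"),
    ("Smith", "has favorite friend", "Jones"),
    ("Jones", "has favorite friend", "Smith"),
    ("Blake", "has favorite friend", "Blake"),
    ("George", "has favorite friend", "Smith"),
]


def extend(binding, pattern, triple):
    """Fit one triple to one pattern, honoring bindings made so far.

    Returns the enlarged binding, or None if the triple does not fit.
    """
    new = dict(binding)
    for want, have in zip(pattern, triple):
        if want.startswith("?"):
            if new.setdefault(want, have) != have:
                return None
        elif want != have:
            return None
    return new


def solve(patterns, triples):
    """Find every binding that satisfies all of the patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                new = extend(binding, pattern, triple)
                if new is not None:
                    next_bindings.append(new)
        bindings = next_bindings
    return bindings


answers = solve(
    [("?someone", "has favorite friend", "Smith"),
     ("?someone", "has age", "21")],
    TRIPLES,
)
names = [a["?someone"] for a in answers]
print(names)
```

Because “?someone” is already bound when the second pattern is tried, Jones (age 45) drops out and only George remains.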


Using graphs to represent sentences

If we want to complicate things, we can also represent the same information in “graph form” as with these 2 graphs that represent the 2 kinds of information in the collection of sentences:

Graph #1: Person ages
Graph #2: Favorite Friends

Typically we don’t really want to complicate these issues, but the semantic web literature often “thinks” in graph terms and some applications display results as visual graphs.


Using graphs to represent sentences

Here the 2 graphs are combined using named edges to represent 2 kinds of information associated with the same 4 persons.

Graph #3: Person ages (:age) and favorite friends (:fav)

Each arc represents the “predicate” of a sentence, connecting a “subject” with an “object”. (Note that a subject may have >= 0 arcs of each type.)


Using URIs and URLs to identify predicates and metadata!

Now if it hadn’t already happened someone would come up with the idea to use URLs to point to Web documents that describe the “exact” meaning of each predicate, or “metadata”.

For example, “http://CelebrityMagazine.com/fav” could contain a definition of “favorite friend”, and other documents would define “BFF”, “long-time-friend”, “family-friend”, “friends with benefits”, etc,

And, in fact, these definitions could themselves refer to other definitions like some “superset” of relationships such as:

http://CelebrityMagazine.com/personal_relationships

or the personal_relationships file could include a collection of subset definitions that we might refer to like:

http://CelebrityMagazine.com/personal_relationships#fav

using the # (fragment identifier) convention for targeting a specific location within the document that a URL identifies.

Note that this form of metadata is not the only useful form of metadata, but it is clearly integrated with the data in a unique fashion.

The basic triple structure of each sentence provides another (implicit) form of metadata.


The sentences as a set of 8 triples (2 for each person)

|----------|--------------|---------|
| Subject  | Predicate    | Object  |
|==========|==============|=========|
| "Blake"  | example:fav  | "Blake" |
| "Blake"  | info:has_age | "12"    |
| "Jones"  | example:fav  | "Smith" |
| "Jones"  | info:has_age | "35"    |
| "George" | example:fav  | "Smith" |
| "George" | info:has_age | "21"    |
| "Smith"  | example:fav  | "Jones" |
| "Smith"  | info:has_age | "21"    |
|----------|--------------|---------|

Here the abbreviation “example:” stands for

http://CelebrityMagazine.com/personal_relationships#

and the abbreviation “info:” stands for some imaginary web page that defines age, let’s say

http://demographicstats.org/characteristics#
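Expanding an abbreviated predicate back into its full URI is plain string substitution. A minimal sketch, assuming a dictionary of prefixes and a helper named expand (both invented here for illustration):

```python
# The two abbreviations used in the table, mapped to the URIs they stand for.
PREFIXES = {
    "example": "http://CelebrityMagazine.com/personal_relationships#",
    "info": "http://demographicstats.org/characteristics#",
}


def expand(name, prefixes=PREFIXES):
    """Expand 'prefix:local' to a full URI; leave any other string unchanged."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name


print(expand("example:fav"))
print(expand("info:has_age"))
```

A string whose leading part is not a known prefix (a full http:// URI, for instance) passes through untouched.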


Representing sentence components using URIs

To specify exactly which person named “Blake”, “Smith”, etc. we are referring to, we can again use URIs.

|-------------------------------|--------------|-------------------------------|
| Subject                       | Predicate    | Object                        |
|===============================|==============|===============================|
| <http://fake.host.edu/blake>  | example:fav  | <http://fake.host.edu/blake>  |
| <http://fake.host.edu/blake>  | info:has_age | "12"                          |
| <http://fake.host.edu/jones>  | example:fav  | <http://fake.host.edu/smith>  |
| <http://fake.host.edu/jones>  | info:has_age | "35"                          |
| <http://fake.host.edu/george> | example:fav  | <http://fake.host.edu/smith>  |
| <http://fake.host.edu/george> | info:has_age | "21"                          |
| <http://fake.host.edu/smith>  | example:fav  | <http://fake.host.edu/jones>  |
| <http://fake.host.edu/smith>  | info:has_age | "21"                          |
|-------------------------------|--------------|-------------------------------|

Here the abbreviation “example:” stands for

http://CelebrityMagazine.com/personal_relationships#

and the abbreviation “info:” stands for some imaginary web page that defines age, let’s say

http://demographicstats.org/characteristics#


Triplestore summary and outrageous claims

Sentences are composed of subject, predicate, object “triples”.

Subjects and predicates are specified as URIs that may be dereferenceable, and predicate URLs may provide metadata describing the meaning of the predicate.

A collection of triples can be represented as a “graph”, and may be known as a “graph.”

Sentences are stored in “triplestores” or “quad stores” (when they are members of identifiable graphs whose names give the 4th component).

Triples will contain URIs that:

- identify and/or name “resources”: subjects and/or objects, and

- serve to identify and/or reference predicate definitions and object data types (as in “25”^^xsd:int).

One way to think about this is that triplestores do NOT contain “data”, but rather “sentences”, “information”, “assertions” (not necessarily true or correct assertions), “units of thought” (Mons), or maybe “little chunks o’ meaning”.

One might also say that the semantic approach transcends the data/meta-data dichotomy because the triple format provides implicit metadata, and because predicates can link to their definitions.


Triples may be serialized in various forms

There are several ways to convert such triples into a serialized, or text-based, form. Here is the simplest. It is written in “Turtle” (for “Terse RDF Triple Language”), a subset of N3 (for Notation 3), with each line holding the 3 components of one triple (URIs or literals) and ending with a “.”

@prefix example: <http://CelebrityMagazine.com/personal_relationships#> .
@prefix info: <http://demographicstats.org/characteristics#> .

<http://fake.host.edu/blake> example:fav <http://fake.host.edu/blake> .
<http://fake.host.edu/blake> info:has_age "12" .
<http://fake.host.edu/jones> example:fav <http://fake.host.edu/smith> .
<http://fake.host.edu/jones> info:has_age "35" .
<http://fake.host.edu/george> example:fav <http://fake.host.edu/smith> .
<http://fake.host.edu/george> info:has_age "21" .
<http://fake.host.edu/smith> example:fav <http://fake.host.edu/jones> .
<http://fake.host.edu/smith> info:has_age "21" .
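To make “serializing” concrete, here is a small Python sketch that writes the eight triples out in this line-per-triple style. It is a toy writer under stated assumptions, not a conforming Turtle serializer: the names HOST, term and to_turtle are invented, any string beginning with http:// is treated as a URI, and literals are quoted without any escaping.

```python
PREFIXES = {
    "example": "http://CelebrityMagazine.com/personal_relationships#",
    "info": "http://demographicstats.org/characteristics#",
}
HOST = "http://fake.host.edu/"

TRIPLES = [
    (HOST + "blake", "example:fav", HOST + "blake"),
    (HOST + "blake", "info:has_age", "12"),
    (HOST + "jones", "example:fav", HOST + "smith"),
    (HOST + "jones", "info:has_age", "35"),
    (HOST + "george", "example:fav", HOST + "smith"),
    (HOST + "george", "info:has_age", "21"),
    (HOST + "smith", "example:fav", HOST + "jones"),
    (HOST + "smith", "info:has_age", "21"),
]


def term(value):
    """Render one triple component in the style of the listing."""
    prefix, sep, _ = value.partition(":")
    if sep and prefix in PREFIXES:
        return value                  # prefixed name, e.g. example:fav
    if value.startswith("http://"):
        return "<" + value + ">"      # full URI in angle brackets
    return '"' + value + '"'          # plain literal (no escaping attempted)


def to_turtle(triples):
    """Return the prefix declarations followed by one line per triple."""
    lines = ["@prefix %s: <%s> ." % item for item in PREFIXES.items()]
    lines.append("")
    for s, p, o in triples:
        lines.append("%s %s %s ." % (term(s), term(p), term(o)))
    return "\n".join(lines)


turtle = to_turtle(TRIPLES)
print(turtle)
```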


Triples may be serialized in various forms:

Another serialization format is RDF/XML, the standard XML syntax for the Resource Description Framework (RDF), which is used in this encoding of the Smith information (with non-dereferenceable URIs):

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:example="http://fake.host.edu/example-schema#">

  <example:Person rdf:about="http://fake.host.edu/smith">
    <example:name>Smith</example:name>
    <example:age>21</example:age>
    <example:fav rdf:resource="http://fake.host.edu/jones"/>
  </example:Person>

</rdf:RDF>

Note: There exist other, “standard” schemas for encoding personal information, such as the Friend of a Friend (FOAF) schema.


Dereferenceable URI version of the Smith RDF triple

Here is the same information encoded with “dereferenceable” URIs, that is, URIs that can actually be accessed and from which content can be downloaded:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:example="http://fake.host.edu/example-schema#">

  <example:Person rdf:about="http://discern.uits.iu.edu:8421/smith">
    <example:name>Smith</example:name>
    <example:age>21</example:age>
    <example:fav rdf:resource="http://discern.uits.iu.edu:8421/jones"/>
  </example:Person>

</rdf:RDF>


Browsing RDF documents

Here is a view of the Smith RDF file from within Firefox using the Tabulator plug-in:

You can click on the jones.rdf link to see the Jones record, and browse from there, or choose the Person link to examine its definition (if it is dereferenceable).


The “Semantic Web”

In general, if URIs are dereferenceable they can link into a “Gigantic Global Graph”, usually known as the “Linked Data Web” or the “Semantic Web.”

“If HTML and the Web make all online documents look like one huge book, RDF, schema, and inference languages will make all the data (sic) in the world look like one huge database.” --TimBL


Documents in RDF format may be interrogated:

- by physical inspection (for anyone willing to read XML)

- by using an RDF browser (like the Tabulator plug-in, etc.)

- by writing programs (in Jena, for example) that read RDF files, construct the represented graphs internally, and then

  - access graph triples in sequential order,
  - select triples according to specified content, and/or
  - apply SparQL queries and access results in sequential order

- by using command-line tools that apply SparQL queries, and/or

- by using GUI interfaces accepting SparQL queries

  - written in text, or
  - represented graphically

- by using SparQL endpoints that accept queries embedded in URLs


A SparQL example

If http://discern.uits.iu.edu:8421/all-persons.rdf

contains all the triples listed earlier, then this SparQL query should find all the triples related to “smith”:

select $p $o
from <http://discern.uits.iu.edu:8421/all-persons.rdf>
where
{
  <http://discern.uits.iu.edu:8421/smith.rdf> $p $o .
}

Intuitively, this query asks “Smith has what relationship(s) to whom/what?” and should identify these 2 value pairs:

<http://fake.host.edu/example-schema#fav> <http://discern.uits.iu.edu:8421/jones.rdf>
<http://fake.host.edu/example-schema#age> "21"

$p and $o are variable names that were each assigned a value as the query was “satisfied.” Variable names may also start with “?”.


Another SparQL example

If http://discern.uits.iu.edu:8421/all-persons.rdf

contains all the triples listed earlier, then this SparQL query simply asks for a list of all those triple values:

select *
from <http://discern.uits.iu.edu:8421/all-persons.rdf>
where
{
  $sub $pred $obj .
}

Intuitively, this query asks “Who has what relationship to whom?”

$sub, $pred, and $obj will each be assigned one or more values as the query is satisfied and all three will be printed (*).

(Note that “$sub $pred $obj .” is a triple pattern in the Turtle/N3 format.)


Results of the single (unified) file SparQL query

|---------------------------|-----------------|---------------------------|
| sub                       | pred            | obj                       |
|===========================|=================|===========================|
| http://...8421/blake.rdf  | example:fav     | http://...8421/blake.rdf  |
| http://...8421/blake.rdf  | example:has_age | "12"                      |
| http://...8421/jones.rdf  | example:fav     | http://...8421/smith.rdf  |
| http://...8421/jones.rdf  | example:has_age | "35"                      |
| http://...8421/george.rdf | example:fav     | http://...8421/smith.rdf  |
| http://...8421/george.rdf | example:has_age | "21"                      |
| http://...8421/smith.rdf  | example:fav     | http://...8421/jones.rdf  |
| http://...8421/smith.rdf  | example:has_age | "21"                      |
|---------------------------|-----------------|---------------------------|

where “…” indicates “discern.uits.iu.edu:”.


A “distributed” SparQL query against 4 separate RDF files

The next query searches 4 dereferenceable files holding the same data broken into 4 files, one for each subject:

select *
from <http://discern.uits.iu.edu:8421/smith.rdf>
from <http://discern.uits.iu.edu:8421/jones.rdf>
from <http://discern.uits.iu.edu:8421/george.rdf>
from <http://discern.uits.iu.edu:8421/blake.rdf>
where
{
  $sub $pred $obj .
}

The results of this query will be the same as the results for the single file query (though order may vary due to remote URL access latency).
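One way to picture why the results agree is to treat each file as a set of triples and the multi-file query as running over their union. The sketch below assumes hypothetical file contents with shortened names; it does not fetch anything from the servers named above.

```python
# Hypothetical contents of the four per-person files, with shortened names.
FILES = {
    "smith.rdf": {("smith", "fav", "jones"), ("smith", "has_age", "21")},
    "jones.rdf": {("jones", "fav", "smith"), ("jones", "has_age", "35")},
    "george.rdf": {("george", "fav", "smith"), ("george", "has_age", "21")},
    "blake.rdf": {("blake", "fav", "blake"), ("blake", "has_age", "12")},
}


def merged_store(sources):
    """Combine several sources into one set of triples to query."""
    store = set()
    for name in sources:
        store |= FILES[name]
    return store


one_order = merged_store(["smith.rdf", "jones.rdf", "george.rdf", "blake.rdf"])
other_order = merged_store(["blake.rdf", "george.rdf", "jones.rdf", "smith.rdf"])
print(len(one_order), one_order == other_order)
```

A set has no inherent order, which matches the remark that the same rows come back but not necessarily in the same sequence.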


Use SparQL to find the predicates

This SparQL example query simply asks for a list of all the unique predicates that occur in all the triples:

select distinct $p
from <http://discern...8421/friend-network.rdf>
where
{
  $s $p $o .
}

If you don’t use “distinct” you will get multiple occurrences of the same predicate.

This can be very useful when you are trying to figure out what predicates are available to interrogate a triplestore that you don’t know much about.
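The effect of “distinct” can be illustrated on the eight friend triples in plain Python. This is a sketch of the idea only; the shortened subject names and the function name distinct_predicates are assumptions of the example.

```python
TRIPLES = [
    ("blake", "example:fav", "blake"),
    ("blake", "example:has_age", "12"),
    ("jones", "example:fav", "smith"),
    ("jones", "example:has_age", "35"),
    ("george", "example:fav", "smith"),
    ("george", "example:has_age", "21"),
    ("smith", "example:fav", "jones"),
    ("smith", "example:has_age", "21"),
]


def distinct_predicates(triples):
    """Collect each predicate once, sorted for a stable display order."""
    return sorted({p for _, p, _ in triples})


all_predicates = [p for _, p, _ in TRIPLES]   # what you get without "distinct"
unique_predicates = distinct_predicates(TRIPLES)
print(len(all_predicates), unique_predicates)
```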


SparQL (incomplete) basic syntax :

SELECT some_variable_list
FROM <some_RDF_source_URI>
WHERE {
  some_n3_triple_pattern .
  another_n3_triple_pattern .
}

Notes:

- the “<“ and “>” characters are required.

- other commands in place of SELECT are: CONSTRUCT, ASK and DESCRIBE.

- * is a valid variable list, specifying any variable included in a triple pattern, and may be preceded by DISTINCT, which will prevent duplicate result rows.

- there may be multiple FROM clauses, whose targets will be combined and treated as a single store.

- a “.” separating multiple triple patterns is intuitively similar to a natural language “and”, but actually behaves like an SQL natural join.

- the term WHERE is optional, and may be omitted.

SparQL reference: http://www.dajobe.org/2005/04-sparql/SPARQLreference-1.8.pdf


Optional clauses in SparQL queries

Clauses permitted within a “where” clause:

optional { triple_pattern }: identifies a triple that need not appear in an RDF target and whose absence will not prohibit a pattern match.

filter: restricts variable matches in the enclosing group of triple patterns to specified filter patterns, as in:

{ $s $p $date FILTER ( $date > "2005-01-01T00:00:00Z"^^xsd:dateTime ) }
or
{ $s $p $d FILTER ( xsd:dateTime( $d ) < xsd:dateTime( "2005-01-01T00:00:00Z" ) ) }
or
{ ?s ?p ?name FILTER regex( ?name, "^smi", "some_flag" ) }

union: “where” clauses may be constructed as

{ triple_pattern_1 } UNION { triple_pattern_2 }

and any RDF element matching either of these triples will be included in the resulting output.

Clauses permitted following the “where” clause:

order by [DESC|ASC| ] ( variable_list )
limit n: print up to n return values.
offset n: skip the first n return values before starting output.
group by: implemented by some SparQL implementations.
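What filter, order by, offset and limit do to a result set can be imitated on a list of rows in Python. This is a rough analogy, not SparQL semantics in full; the row layout is invented for the sketch, and the case-insensitive flag stands in for the regex flag “i”.

```python
import re

# Result rows as they might come back from a query over the friend data.
ROWS = [
    {"name": "Smith", "age": 21},
    {"name": "Jones", "age": 35},
    {"name": "George", "age": 21},
    {"name": "Blake", "age": 12},
]

# FILTER regex( ?name, "^smi", "i" ): keep names matching, ignoring case.
filtered = [row for row in ROWS if re.search("^smi", row["name"], re.IGNORECASE)]

# ORDER BY DESC( ?age ): sort rows by age, largest first (ties keep input order).
ordered = sorted(ROWS, key=lambda row: row["age"], reverse=True)

# OFFSET 1 LIMIT 2: skip one row, then return at most two.
offset, limit = 1, 2
page = ordered[offset:offset + limit]

print([row["name"] for row in filtered])
print([row["name"] for row in ordered])
print([row["name"] for row in page])
```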


Some useful SparQL pattern patterns

Display two property values of some entity (<some_URI>) on the same line:

select *
where {
  <some_URI> <some_predicate> ?o .
  <the_same_URI> <some_other_predicate> ?o1 .
}

Example using the friend information and PREFIX statements:

PREFIX example: <http://CelebrityMagazine.com/personal_relationships#>
PREFIX info: <http://demographicstats.org/characteristics#>

select *
where {
  <http://fake.host.edu/smith> example:fav ?favorite .
  <http://fake.host.edu/smith> info:has_age ?age .
}


Some more useful SparQL pattern patterns

Merge results of 2 pattern matches into a single output column:

select *
where {
  { <some_URI> <some_predicate> ?o . }
  UNION
  { <some_other_URI> <some_other_predicate> ?o . }
}

Example:

PREFIX example: <http://CelebrityMagazine.com/personal_relationships#>
PREFIX info: <http://demographicstats.org/characteristics#>

select *
where {
  { <http://fake.host.edu/smith> example:fav ?values . }
  UNION
  { <http://fake.host.edu/smith> info:has_age ?values . }
}


Some more useful SparQL pattern patterns

Slowly find all triples whose object components mention “hexokinase”:

select *
where {
  ?s ?p ?o .
  FILTER regex( ?o, "hexokinase" ) .
}

Quickly find all entries with object components mentioning hexokinase. This works only through a Virtuoso SparQL endpoint when applied to indexed graphs (and will return nothing when applied to a non-indexed graph):

select *
where {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "hexokinase" .
}


SparQL desktop client: Twinkle (version of the upward paths query)


SparQL desktop client: RDF-gravity (using the friend data)


SparQL desktop client: Explorator RDF explorer

The Explorator can download (extracts from) multiple RDF resources, and manipulate them in combination. Here with the Russian lakes example.

This approach provides an interface using a set algebra model of data manipulation. (See Araujo, et al. and http://139.82.71.60:3000/explorator)


Jena

The Java-based Jena package from HP Labs allows users to manipulate and query RDF graphs. You can write a program that uses Jena classes to

- retrieve and parse an RDF file containing a graph or a collection of graphs,
- store it in memory,
- examine each triple in turn, examine one component (say, the subject) of each triple in turn, or examine only triples that meet specified criteria, and
- write a serialized version of a graph to a file or STDOUT.

For example, one might examine each stored triple searching for a specific reference URI, or for a specific literal value, as with a search for triples containing a specific value, “21”^^xsd:int, in their object portions.

An RDF graph is stored in Jena as a “model”, and a Jena model is created by a factory, as in:

Model m = ModelFactory.createDefaultModel();

Once a model has been defined, Jena can populate it by reading data from files, backend databases, etc. in various formats, and once it has been populated, Jena can perform set operations on pairs of populated models and/or search models for specific values or combinations (patterns) of values.


Jena

For example, there are several methods for creating iterators over a model to access specific components. Iterators may be built by

- listing the components of each triple:

  - model.listSubjects();
  - model.listObjects();

- comparing a specific component with a specified value, as in:

  model.listSubjectsWithProperty( Property p, RDFNode object );

  which will get you a collection of subjects possessing property/predicate p and specific value object

- comparing all components against specific values in 2 steps:

  - construct a “selector” possessing specific values s, p and o:

    Selector selector = new SimpleSelector( subject, predicate, object );

  - and then build the statement list:

    model.listStatements( selector );
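The selection logic behind these iterators can be sketched without Jena at all. The Python below is a stand-in written for illustration, not the Jena API: the function names merely echo the Jena methods, the triples use shortened names, and None is assumed to play the wildcard role that a null component is understood to play in a Jena selector.

```python
TRIPLES = [
    ("smith", "fav", "jones"),
    ("smith", "has_age", "21"),
    ("jones", "fav", "smith"),
    ("jones", "has_age", "35"),
    ("george", "fav", "smith"),
    ("george", "has_age", "21"),
    ("blake", "fav", "blake"),
    ("blake", "has_age", "12"),
]


def list_statements(triples, subject=None, predicate=None, obj=None):
    """Return triples whose components equal every non-None argument."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def list_subjects_with_property(triples, predicate, obj):
    """Return each subject (once) that has the given predicate and object."""
    subjects = []
    for s, _, _ in list_statements(triples, predicate=predicate, obj=obj):
        if s not in subjects:
            subjects.append(s)
    return subjects


aged_21 = list_subjects_with_property(TRIPLES, "has_age", "21")
fans_of_smith = list_statements(TRIPLES, predicate="fav", obj="smith")
print(aged_21)
print(fans_of_smith)
```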


Preeminent Linked Data resources:

The DBpedia and Bio2RDF

The “DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web” (http://dbpedia.org/About)

DBpedia currently holds over 200 million triples, harvested by scraping Infoboxes included within the Wikipedia.

The DBpedia is currently housed in an OpenLink Virtuoso Universal Database, which can store relational, object, XML, and semantic information.

Details at: http://dbpedia.org/About


Bio2RDF: “Atlas of postgenomic knowledge”

Bio2RDF integrates (extracts from) some 40 biomedical information resources (such as GO, Uniprot, etc.) recoded in RDF (>2 Gtriples):

- currently runs over the Virtuoso Universal Database server at http://atlas.bio2rdf.org

but each resource has its own SparQL endpoint, in addition to the endpoint accessing the unified triplestore:

http://atlas.bio2rdf.org/sparql

- a list of included resources is at (http://www.freebase.com/view/user/bio2rdf/public/sparql)

and includes links to the SparQL endpoint for each resource, as well as descriptions of the resource contents and triple counts.

- raw text N3 formats for this data use around 1 TB, but install in much less space within Virtuoso (perhaps 100 GB).

- there is also a Bio2RDF proxy service that takes queries and relays them to multiple distributed servers (examples later).


Resources included in Bio2RDF

(downloadable from http://quebec.bio2rdf.org/download/n3/)

GO
KEGG
OMIM
HGNC
PubMed
INOH
GeneID
IProClass
UniProt
MGI
UniRef
CellMap
UniParc
BioPAX
Kegg Pathway
InterPro
CPATH
Pfam
Reactome
PROSITE
Biocyc
Protein
MeSH
SID
PDB
CID
CPD: Kegg Ligand for chemical compound
PubChem
GL: Kegg Ligand for carbohydrate structure
UniSTS
EC
Homologene
RN: Kegg Ligand for chemical reaction
DBpedia
DR: Kegg Ligand for drugs
OBO
CheBI
Taxonomy: NEWT
Affymetrix
PID
Biocarta


Bio2RDF resources

(Edge width is proportional to link density.)


SparQL endpoints

Triplestores like the Virtuoso Universal Database Server publish “SparQL endpoints” that will take SparQL queries through several interfaces.

For example you can query the DBpedia through a Virtuoso SparQL endpoint at

http://dbpedia.org/sparql

by sending SparQL queries:

- encoded in URLs addressed to the triplestore endpoint, like (shown here before URL-encoding)

http://dbpedia.org/sparql?query=SELECT distinct * WHERE { $s $p $o . $o bif:contains "Goethe_Johann_Wolfgang" . }

- entered into Web forms that present text areas into which one can enter queries, as on the next pages
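A query placed in a URL has to be URL-encoded, since it contains spaces, braces and quotes. The sketch below builds such a URL with Python's standard library and checks that the query survives a round trip; it does not contact the endpoint, and the helper name endpoint_url is invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://dbpedia.org/sparql"
QUERY = (
    "SELECT distinct * WHERE { $s $p $o . "
    '$o bif:contains "Goethe_Johann_Wolfgang" . }'
)


def endpoint_url(endpoint, query):
    """Attach the query to the endpoint as a URL-encoded 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = endpoint_url(ENDPOINT, QUERY)
print(url)

# Decoding the query string should give back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

Browsers often perform this encoding silently when a query is pasted into the address bar, but a program sending the request generally needs to do it explicitly.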


The SparQL interface to DBpedia


The iSparQL Advanced interface to DBpedia


The iSparQL QBE interface to DBpedia (close up)

Here is the same query in graphical form as constructed using the iSparQL QBE interface:


The iSparQL QBE interface to DBpedia


Results from the iSparQL text and/or QBE queries


Using SparQL to get RDF extracts

Suppose you want to build a local RDF triplestore from DBpedia containing only the Goethe entries, or import these entries into some other desktop client like the Explorator.

Documents returned by SparQL select queries are usually not RDF documents. They may not have triples, and they are usually structured for display or storage in HTML, Excel or some other format.

You can use the CONSTRUCT command (in place of SELECT) within a SparQL query to build a proper RDF formatted response:

construct { <http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe> $p $o }
where { <http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe> $p $o . }

The structure of the triple to be created is specified in the “construct” clause.

Note that construct queries like these can be embedded in URLs to SparQL endpoints.
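Conceptually, a construct clause is a template that is filled in once for every set of variable values the “where” clause finds. A minimal Python sketch of that step follows; the bindings are placeholders invented for illustration (they are not real DBpedia predicates or values), and the function name construct is an assumption of the sketch.

```python
GOETHE = "http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe"

# Placeholder bindings standing in for whatever the "where" clause matches.
# These predicate and object names are invented for illustration only.
BINDINGS = [
    {"$p": "ex:predicate1", "$o": "ex:object1"},
    {"$p": "ex:predicate2", "$o": "a literal value"},
]


def construct(template, bindings):
    """Fill the triple template once per binding, yielding new triples."""
    return [
        tuple(binding.get(part, part) for part in template)
        for binding in bindings
    ]


graph = construct((GOETHE, "$p", "$o"), BINDINGS)
for triple in graph:
    print(triple)
```

The output is again a list of triples, which is why the response can be serialized as RDF and loaded into another store.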


Ontologies

The term “ontology” is used in different ways by different people.

Pidcock writes that “People use the word to mean different things, e.g.: glossaries and data dictionaries, thesauri and taxonomies, schema and data models, and formal ontologies and inference.”

And Uschold writes “An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. . .This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.”


The DBpedia ontology

The DBpedia ontology is a “shallow, cross-domain” ontology. In

http://www4.wiwiss.fu-berlin.de/dbpedia/dev/ontology.html

it appears as a tree with maximum depth of 4.

The top-level class is “Thing”, and the first sublevel classes are: Person, Organization, Anatomical structure, Place, Species, etc.

At the next level, the subclasses of Person are Scientist, College Coach, Monarch, Politician, etc.

Some classes are also assigned “properties”. For example, a species may have Order and Family properties (even though an organism’s Order and Family could be inferred from its position in the evolutionary tree, which is itself an ontology).


Wikipedia Infoboxes

DBpedia gets its information from Wikipedia “Infoboxes”, such as this one for Johann Wolfgang von Goethe that appears on his Wikipedia page.

Infobox contents are mapped to DBpedia ontology classes and properties, which are used as RDF predicates.

Here the Goethe “resource” is:

http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe

and you know how to find all the predicates and objects by now?


The DBpedia ontology

Here is a query to find all the “Places” known to DBpedia:

select distinct *
where {
  $s a <http://dbpedia.org/ontology/Place>
}
limit 1000

And a query to find every “person’s” birth info:

select $s $o
where {
  $s a <http://dbpedia.org/ontology/Person> .
  $s <http://dbpedia.org/property/birth> $o
}
limit 1000

where the predicate “a” is a short form of

http://www.w3.org/1999/02/22-rdf-syntax-ns#type


The DBpedia “faceted browser”

DBpedia ontology and property components are displayed in the left column and can be used to define filters for viewing content.


GO: An example biomedical ontology (or 3)

Consider the Gene Ontology, very widely used in, and probably crucial for, bioinformatics and biological research.

The Gene Ontology actually has 3 major components, or separate sections for defining terms related to

- Biological Process,

- Cellular Component (physical structures or locations within biological cells), and

- Molecular Function,

each of which defines several thousand terms.


A small portion of the Molecular Function portion of the Gene Ontology Directed Acyclic Graph (DAG)


Find parents of GO:0004003 in the example GO DAG using a SparQL query

select *
where {
  <http://bio2rdf.org/go:0004003> <http://bio2rdf.org/ns/go#is_a> $parent .
}

Result:

-----------------------------------
| parent                          |
===================================
| <http://bio2rdf.org/go:0008094> |
| <http://bio2rdf.org/go:0008026> |
| <http://bio2rdf.org/go:0003678> |
-----------------------------------


Find all 3-element paths up from GO:0004003

PREFIX go: <http://bio2rdf.org/ns/go#>

select *
where {
  <http://bio2rdf.org/go:0004003> go:is_a $a .
  $a go:is_a $b .
  $b go:is_a $c .
}

Note the use of PREFIX to define the abbreviation “go:”, which stands for <http://bio2rdf.org/ns/go#> wherever it appears in the query.
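To make the substitution concrete, here is a small Python sketch of how a prefixed name might be expanded to a full URI. The function name expand and the PREFIXES table are hypothetical, introduced only for illustration; the sketch also handles the “a” keyword, which stands for the rdf:type URI. Real SparQL processors do this internally.

```python
# Prefix table mirroring the PREFIX declaration in the query above.
PREFIXES = {"go": "http://bio2rdf.org/ns/go#"}

# The URI that the SparQL keyword "a" stands for.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand(term, prefixes=PREFIXES):
    """Hypothetical helper: expand a term such as 'go:is_a' into a full URI."""
    if term == "a":
        return RDF_TYPE
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]  # already a full URI; just drop the angle brackets
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    raise ValueError(f"unknown prefix in {term!r}")


print(expand("go:is_a"))
print(expand("a"))
print(expand("<http://bio2rdf.org/go:0004003>"))
```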


Find all 3-element paths up from GO:0004003 using the example GO DAG

a                              b                              c
http://bio2rdf.org/go:0008026  http://bio2rdf.org/go:0004386  http://bio2rdf.org/go:0008047
http://bio2rdf.org/go:0008026  http://bio2rdf.org/go:0016887  http://bio2rdf.org/go:0008047
http://bio2rdf.org/go:0003678  http://bio2rdf.org/go:0004386  http://bio2rdf.org/go:0008047
http://bio2rdf.org/go:0008094  http://bio2rdf.org/go:0016887  http://bio2rdf.org/go:0008047
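The rows above can be reproduced with a small in-memory sketch of how a resolver might evaluate the three triple patterns. This is an illustration, not how any particular triplestore is implemented: the edge list is only what the result rows imply about the example DAG, and the helper names match and solve are hypothetical.

```python
GO = "http://bio2rdf.org/go:"
IS_A = "http://bio2rdf.org/ns/go#is_a"

# (child, parent) is_a edges implied by the result rows; an assumption about the example DAG.
EDGES = [
    ("0004003", "0008026"), ("0004003", "0003678"), ("0004003", "0008094"),
    ("0008026", "0004386"), ("0008026", "0016887"),
    ("0003678", "0004386"), ("0008094", "0016887"),
    ("0004386", "0008047"), ("0016887", "0008047"),
]
TRIPLES = [(GO + child, IS_A, GO + parent) for child, parent in EDGES]


def match(pattern, triple, binding):
    """Unify one triple pattern with one triple; return the extended binding or None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("$"):                     # a variable
            if new.setdefault(term, value) != value:
                return None                          # already bound to something else
        elif term != value:                          # a constant that does not match
            return None
    return new


def solve(patterns, triples):
    """Evaluate the patterns as a conjunctive query using nested-loop joins."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


query = [
    (GO + "0004003", IS_A, "$a"),
    ("$a", IS_A, "$b"),
    ("$b", IS_A, "$c"),
]
rows = sorted((r["$a"], r["$b"], r["$c"]) for r in solve(query, TRIPLES))
for row in rows:
    print(*row)
```

Because $a and $b each appear in two patterns, a binding made by one pattern restricts which triples the next pattern can match; that shared variable is what performs the join.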


Find all 3-element paths up from GO:0004003 using SQL

select a.parent_id, b.parent_id, c.parent_id
from GO.molecular_function_DAG a
join GO.molecular_function_DAG b on a.parent_id = b.child_id
join GO.molecular_function_DAG c on b.parent_id = c.child_id
where a.child_id like 'GO:0004003'

This query is posed as a series of joins on the GO.molecular_function_DAG table, just as the SparQL version uses structures like:

$a go:is_a $b .
$b go:is_a $c .

where go:is_a is analogous to the DAG table, the “.” specifies a “join”, and $b, appearing on two separate lines, implicitly specifies an equality requirement.
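For readers who want to try the SQL version, the sketch below runs the same three-way self-join in SQLite from Python. The table contents are only the edges implied by the SparQL result rows shown earlier, and attaching an in-memory database under the name GO is simply a device to make the schema-qualified table name resolve; neither is taken from a real GO installation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named GO so that "GO.molecular_function_DAG" resolves.
conn.execute("ATTACH DATABASE ':memory:' AS GO")
conn.execute("CREATE TABLE GO.molecular_function_DAG (child_id TEXT, parent_id TEXT)")

# (child, parent) rows implied by the example results; an assumption about the example DAG.
edges = [
    ("GO:0004003", "GO:0008026"), ("GO:0004003", "GO:0003678"), ("GO:0004003", "GO:0008094"),
    ("GO:0008026", "GO:0004386"), ("GO:0008026", "GO:0016887"),
    ("GO:0003678", "GO:0004386"), ("GO:0008094", "GO:0016887"),
    ("GO:0004386", "GO:0008047"), ("GO:0016887", "GO:0008047"),
]
conn.executemany("INSERT INTO GO.molecular_function_DAG VALUES (?, ?)", edges)

sql = """
select a.parent_id, b.parent_id, c.parent_id
from GO.molecular_function_DAG a
join GO.molecular_function_DAG b on a.parent_id = b.child_id
join GO.molecular_function_DAG c on b.parent_id = c.child_id
where a.child_id like 'GO:0004003'
"""
sql_rows = sorted(conn.execute(sql).fetchall())
for row in sql_rows:
    print(*row)
```

Each additional step up the DAG costs one more explicit join in SQL, where the SparQL version adds one more triple pattern.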


Auer and Lehmann asked: “What DO Innsbruck and Leipzig have in common?”

. . .or to be more exact:

What query will reveal what properties 2 entities have in common?

select *
where {
  < . . . Innsbruck> ?p ?o .
  < . . . Leipzig> ?p ?o .
}

will direct the resolver to find every characteristic of each city and report those that are shared by both cities.

This doesn't have a direct equivalent in SQL because you can't treat table and column names as variables in SQL.

(You can of course get around this by using system tables, or by storing all your data “normalized” as a single table containing 3 columns, which might not be a bad idea in some unusual circumstances.)
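As a sketch of the three-column workaround, the Python below stores a handful of triples in a single SQLite table and finds the (predicate, object) pairs two subjects share with one self-join. The table name triples, the column names s, p, o, the ex: names, and the toy rows are all assumptions made for illustration, not data from DBpedia.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table, three columns: every fact is a (subject, predicate, object) row.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    # Toy data with made-up ex: names, for illustration only.
    ("ex:Innsbruck", "rdf:type", "ex:City"),
    ("ex:Innsbruck", "ex:country", "ex:Austria"),
    ("ex:Innsbruck", "ex:hasUniversity", "true"),
    ("ex:Leipzig", "rdf:type", "ex:City"),
    ("ex:Leipzig", "ex:country", "ex:Germany"),
    ("ex:Leipzig", "ex:hasUniversity", "true"),
])

# Because predicates are now data rather than column names, they can be joined on.
common = conn.execute("""
    SELECT x.p, x.o
    FROM triples x
    JOIN triples y ON x.p = y.p AND x.o = y.o
    WHERE x.s = ? AND y.s = ?
    ORDER BY x.p, x.o
""", ("ex:Innsbruck", "ex:Leipzig")).fetchall()

for predicate, obj in common:
    print(predicate, obj)
```

With this layout the SQL query has the same shape as the SparQL one: two patterns over the same store, joined on the shared predicate and object.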


Auer and Lehmann asked:

“What DO Innsbruck and Leipzig have in common?”

. . .or to extend this train of thought:

What query will reveal what properties Innsbruck and Leipzig do NOT have in common?

And can these ideas be extended to notions of “semantic similarity” or “semantic distance” between resources?

Or extended to a notion of “semantic clustering”?

We might want to ask questions like:

Which cities are most like Innsbruck?
Which cities are most unlike Innsbruck?
Which cities are more like Innsbruck than any other city?
How can we cluster cities into functional groups?

If we can sell gadgets in Innsbruck, in what other cities might we market the same gadgets?

If we have a Pubmed article of interest, what other articles should we read?
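One simple way to start exploring these questions, offered only as an illustrative assumption and not as a method proposed by Auer and Lehmann, is to treat each resource as the set of (predicate, object) pairs asserted about it. Shared and unshared properties then fall out of set operations, and the Jaccard index gives a crude similarity score. The toy triples and the helper names description and jaccard below are hypothetical.

```python
# Toy triples with made-up ex: names, for illustration only.
TOY = [
    ("ex:Innsbruck", "rdf:type", "ex:City"),
    ("ex:Innsbruck", "ex:country", "ex:Austria"),
    ("ex:Innsbruck", "ex:hasUniversity", "true"),
    ("ex:Leipzig", "rdf:type", "ex:City"),
    ("ex:Leipzig", "ex:country", "ex:Germany"),
    ("ex:Leipzig", "ex:hasUniversity", "true"),
    ("ex:Salzburg", "rdf:type", "ex:City"),
    ("ex:Salzburg", "ex:country", "ex:Austria"),
    ("ex:Salzburg", "ex:hasUniversity", "true"),
]


def description(triples, subject):
    """Return the set of (predicate, object) pairs asserted about a subject."""
    return {(p, o) for s, p, o in triples if s == subject}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


innsbruck = description(TOY, "ex:Innsbruck")
leipzig = description(TOY, "ex:Leipzig")

shared = innsbruck & leipzig        # what the two have in common
not_shared = innsbruck ^ leipzig    # what they do NOT have in common

# Rank the other cities by similarity to Innsbruck, most similar first.
others = ["ex:Leipzig", "ex:Salzburg"]
ranking = sorted(
    ((city, jaccard(innsbruck, description(TOY, city))) for city in others),
    key=lambda pair: pair[1],
    reverse=True,
)
print(sorted(shared))
print(sorted(not_shared))
print(ranking)
```

A score like this weights every property equally, so a real application would probably need to weight or filter predicates; that choice is left open here.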


What do go:0004145 and go:0004059 have in common?

select *
where {
  <http://bio2rdf.org/go:0004145> $predicate ?object .
  <http://bio2rdf.org/go:0004059> $predicate ?object .
}

-------------------------------------------------------------------------------------------------
| predicate                                       | object                                      |
-------------------------------------------------------------------------------------------------
| http://bio2rdf.org/ns/go#is_a                   | http://bio2rdf.org/go:0008080               |
| http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://bio2rdf.org/ns/go#Term               |
| http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://bio2rdf.org/ns/go#molecular_function |
-------------------------------------------------------------------------------------------------

So, this query reveals that both classes are subclasses of go:0008080 and members of the Molecular function component of GO.


Evaluating the semantic approach?

The semantic approach is complicated, often produces ugly-looking and slow results, and new tools emerge like Topsy . . .

. . . but it allows users to do some things more easily than they can be done using the relational approach:

- Information stored in sentences is easier for (some) users to understand, and relevant portions are easier to extract.

- Data merged with metadata makes metadata easy to find.

- Being sentence-based, SparQL may be more intuitive (and more declarative?) than SQL, and may more easily support the use of ontologies and inference.

- Distributed information can be more easily utilized; users can access multiple RDF documents in a single SparQL query, and even browse distributed RDF sources as part of the LDW or GGG.

- Information resources can often be more easily integrated. Since no unified storage schema is required, RDF versions of multiple resources can be manipulated within the same triplestore, and ontologies may be exploited in a more natural fashion.

- Some types of queries are much more easily composed than they could be in SQL (Leipzig and Innsbruck).


Conclusions?

The usefulness of the semantic approach is difficult to evaluate, but it is safe to say the relational model is not going away.

Use/value depends on who’s doing what, using what information, over what platforms, and how usage patterns vary over time (and they do!).

The semantic approach appears to be especially useful for integrating information resources and for finding connections/relationships, but integrating resources is not straightforward (see Sahoo, et al. and Antezana, et al. for examples), nor is quantifying connectivity.

You may need to differentiate between the semantic approach itself, and the distributed capabilities of the Semantic Web. (Do the RDF warehouses contradict the underlying intent to support distributed information resources?)

Where and how should metadata/semantics be injected into the “data stack”? (caBIG does it differently.)

Where and how should ontologies be applied in information management? (Using ontologies is mostly orthogonal to RDF proper, but see Renear, et al.)

What kinds of relational/semantic technology integrations are possible? Which will prove synergistic?

If we have an RDF version of Wikipedia, can we have an ontology-enabled, RDF version of Pubmed? (Consider Enju and TexFlame.)


A long term role for the semantic approach?

This from the Oracle Semantic Technologies Center (circa 2001!):

“By the end of this short paper, the reader should understand the overall superiority of Semantic Web technologies and be able to describe why it is very likely that they will be embedded in the fabric of nearly all data-intensive software within several years.” -- Jeff Pollock, Oracle Corporation

http://www.w3.org/2001/sw/sweo/public/BusinessCase/BusinessCase.pdf


A long term role in scholarly communication?

The Concept Web, according to their Web site: “a dynamic, interactive fabric of concepts and their relationships. The Concept Web is constructed from, inter alia, research literature, Internet databases and other web sites together with off-line resources.”

Mission of the Alliance: “To enable an open collaborative environment to jointly address the challenges associated with high volume scholarly and professional data production, storage, interoperability and analyses for knowledge discovery.”

Specific goals: “The development and refinement of ways to capture information in Semantically Rich Triples,” and to store, manage, and query such information.

“The big issue we have here is this perverse situation in publishing, or in formal scholarly communication, where researchers take data, convert it into narrative form, and then employ really complex text-mining tools based on complex natural language processing . . . to try and turn this stuff into data again.” (Bilder, 2009)


For more information, see:

• Antezana, Erik, et al., “BioGateway: a semantic systems biology tool for the life sciences”, BMC Bioinformatics, 2009. http://www.biomedcentral.com/1471-2105/10/S10/S11/abstract

• Auer, Soren and Jens Lehmann, “What do Innsbruck and Leipzig have in common? Extracting Semantics from Wiki Content”, European Semantic Web Conference (ESWC), 2007.

• Bilder, Geoffrey, Conceptweblog Conference Podcast, Concept Web Alliance Inaugural Meeting, May 2009. http://conceptweblog.wordpress.com/conferences/

• Bizer, Christian, Tom Heath, Tim Berners-Lee, “Linked Data--The story so far.” http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf

• Herper, Matthew, Forbes Magazine, November 10, 2008. http://www.forbes.com/forbes/2008/1110/090.html

• Grobe, Michael, “RDF, Jena, SparQL, and the ‘Semantic Web’”, SIGUCCS, 2009.

• Araujo, S., Schwabe, D., Barbosa, S., “Experimenting with Explorator: a Direct Manipulation Generic RDF Browser and Querying Tool”, Visual Interfaces to the Social and the Semantic Web (VISSW 2009), Sanibel Island, Florida, February 2009. http://smart-ui.org/events/vissw2009/papers/VISSW2009-Araujo.pdf

• Nolin, Marc-Alexandre, et al., “Bio2RDF Network of Linked Data”, SWC submission, 2008. http://www.cs.vu.nl/~pmika/swc-2008/Bio2RDF-Bio2RDF_submission.pdf

• Sahoo, SS, et al., “An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence”, J Biomed Inform. 2008 Oct;41(5):752-65.

• Renear, Allen H, et al., “Strategic Reading, Ontologies, and the Future of Scientific Publishing”, Science, 2009.