semantic publishing with nanopublications

Semantic Publishing with Nanopublications Tobias Kuhn http://www.tkuhn.ch @txkuhn ETH Zurich Research Colloquium Stanford Center for Biomedical Informatics Research 12 March 2015

Upload: tobias-kuhn

Post on 17-Jul-2015

TRANSCRIPT

Page 1: Semantic Publishing with Nanopublications

Semantic Publishing with Nanopublications

Tobias Kuhn

http://www.tkuhn.ch

@txkuhn

ETH Zurich

Research Colloquium, Stanford Center for Biomedical Informatics Research

12 March 2015

Page 2: Semantic Publishing with Nanopublications

Information Overload: >1M new Scientific Articles Per Year

• How can we help scientists to stay up-to-date?

• How can we efficiently retrieve and aggregate published scientific results?

• How should we design the system for the future of scientific communication?

Image from: Kuhn et al. Inheritance patterns in citation networks reveal scientific memes. Physical Review X 4. 2014.

Tobias Kuhn, ETH Zurich Semantic Publishing with Nanopublications 2 / 30

Page 3: Semantic Publishing with Nanopublications

Nanopublications: Provenance-Aware Semantic Publishing

assertion

provenance

publication info

nanopublication

http://nanopub.org / @nanopub_org

Page 4: Semantic Publishing with Nanopublications

Nanopublications: Provenance-Aware Semantic Publishing

Nanopub0001

Assertion:

ns1:mosquito ns3:transmission ns2:malaria

Provenance:

opm:wasDerivedFrom d:DataSourceX

Publication Information:

dc:created “2013-01-01”

pav:createdBy p:Isabelle_Dubois

http://nanopub.org / @nanopub_org

Page 5: Semantic Publishing with Nanopublications

Example: Nanopublication Converted from BEL

sub:assertion {
  sub:_3 a rdf:Statement ;
    rdf:subject schem:Adenosine%20triphosphate ;
    rdf:predicate belv:decreases ;
    rdf:object sub:_1 ;
    occursIn: obo:UBERON_0001134 , species:9606 .
  sub:_1 a go:0003824 ;
    hasAgent: sub:_2 .
  sub:_2 a Protein: ;
    geneProductOf: hgnc:12517 .
  sub:assertion rdfs:label "a(SCHEM:\"Adenosine triphosphate\") -| cat(p(HGNC:UCP1))" .
}

sub:provenance {
  sub:assertion prov:hadPrimarySource pubmed:9703368 ;
    prov:wasDerivedFrom beldoc: , sub:_4 .
  beldoc: dce:description "Approximately 61,000 statements." ;
    dce:rights "Copyright (c) 2011-2012, Selventa. All rights reserved." ;
    dce:title "BEL Framework Large Corpus Document" ;
    pav:authoredBy sub:_5 ;
    pav:version "20131211" .
  sub:_4 prov:value "UCP1 contains six potential transmembrane a-helices (72) and acts under the form of a homodimer (73). Its uncoupling activity is increased by FFA (7477) and by long chain fatty acyl CoA esters (78, 79), and decreased by purine nucleotide di- or tri-phosphates (12, 74)." ;
    prov:wasQuotedFrom pubmed:9703368 .
  sub:_5 rdfs:label "Selventa" .
}

sub:pubinfo {
  this: dct:created "2014-07-03T14:34:13.226+02:00"^^xsd:dateTime ;
    pav:createdBy orcid:0000-0001-6818-334X , orcid:0000-0002-1267-0234 .
}

https://github.com/tkuhn/bel2nanopub

Page 6: Semantic Publishing with Nanopublications

Vision: Changing Scholarly Communication

Now: Narrative articles at the center

Future: Nanopublications at the center

Images from Mons et al. The value of data. Nature genetics, 43(4):281–283, 2011

Page 7: Semantic Publishing with Nanopublications

Open Problems

Three important open problems for nanopublications are:

• Identifiability: How can we ensure the immutability of nanopublications and provide verifiable identifiers?

• Persistence: How can we ensure that published nanopublications can be reliably retrieved by others and remain permanently available?

• Representation: How can we deal with results that have no straightforward RDF representation?

Page 8: Semantic Publishing with Nanopublications

Problem: Replication and Re-Use of Research Results

Exemplary Situation: Sue publishes a script that should allow everybody to replicate her scientific analysis:

# Download data:

wget http://some-third-party.org/dataset/1.4

# Analyze data:

...

Problems: What if the third party silently changes that version of the dataset? What if the resource becomes unavailable at this location? What if the web site later gets hacked and the data manipulated?

Page 9: Semantic Publishing with Nanopublications

Identifiability Problem

http://some-third-party.org/dataset/1.4

Given a URI for a digital artifact, there is no reliable standard procedure for checking whether a retrieved file really represents the correct and original state of that artifact.

Page 10: Semantic Publishing with Nanopublications

Trusty URIs

Basic idea: Use of cryptographic hash values calculated on digital artifacts.

Requirements:

• To allow for the verification of entire reference trees, the hash should be part of the reference (i.e. the URI)

• To allow for meta-data, digital artifacts should be allowed to contain self-references (i.e. their own URI)

• Format-independent hash for different kinds of content

• The complete approach should be decentralized and open

• We want to use them right away

Example: http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70
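
The basic idea can be illustrated with a minimal Python sketch. It is a simplification and not the trusty URI specification: it hashes raw bytes with SHA-256, whereas the actual approach defines format-specific modules and supports self-references, neither of which is reproduced here. The function names and the two-character tag "XX" are placeholders chosen for this illustration only.

```python
import base64
import hashlib


def hash_suffix(content: bytes, tag: str = "XX") -> str:
    """Return a placeholder tag followed by the base64url-encoded SHA-256 digest."""
    digest = hashlib.sha256(content).digest()
    # 32 digest bytes encode to 44 base64 characters, one of them '=' padding.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return tag + encoded


def make_hash_uri(base_uri: str, content: bytes) -> str:
    """Append the hash suffix to a base URI, separated by a dot."""
    return base_uri + "." + hash_suffix(content)


def matches(uri: str, content: bytes) -> bool:
    """Check whether retrieved content is the content the URI was minted for."""
    return uri.endswith("." + hash_suffix(content))


if __name__ == "__main__":
    uri = make_hash_uri("http://example.org/r1", b"some dataset, version 1.4")
    print(uri)
    print(matches(uri, b"some dataset, version 1.4"))  # True
    print(matches(uri, b"silently changed dataset"))   # False
```

Because the hash travels inside the identifier, anyone who holds the URI can run the check without trusting the server that delivered the file.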

Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.

Page 11: Semantic Publishing with Nanopublications

Trusty URIs: Range of Verifiability

With the hash as a part of the URI, the “range of verifiability” extends to referenced artifacts (if they also use trusty URIs):

[Figure: artifacts that reference each other through trusty URIs (http://...RAcbjcRI..., http://...RAQozo2w..., http://...RABMq4Wc..., http://...RAUx3Pqu..., http://...RARz0AX-...) and through plain URIs (http://.../resource23, http://.../resource55), with the range of verifiability marked.]
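
The range of verifiability can be sketched as a traversal: starting from one hash-bearing URI, every artifact reached through further hash-bearing URIs is checked, and the traversal stops at plain URIs. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: artifacts are plain text, the hash is SHA-256 over the raw bytes, and the tag "XX", the `fetch` callback, and all function names are hypothetical.

```python
import base64
import hashlib
import re
from typing import Callable, Optional, Set

TAG = "XX"  # placeholder tag, an assumption of this sketch
URI_PATTERN = re.compile(r"https?://\S+")
HASH_URI_PATTERN = re.compile(r"\." + TAG + r"[A-Za-z0-9_-]{43}$")


def hash_suffix(content: bytes) -> str:
    """Placeholder tag plus the unpadded base64url SHA-256 digest of the content."""
    digest = hashlib.sha256(content).digest()
    return TAG + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def verifiable_range(start_uri: str,
                     fetch: Callable[[str], Optional[bytes]]) -> Set[str]:
    """Verify start_uri and every artifact reachable from it via hash URIs."""
    verified: Set[str] = set()
    todo = [start_uri]
    while todo:
        uri = todo.pop()
        if uri in verified:
            continue
        content = fetch(uri)
        if content is None or not uri.endswith("." + hash_suffix(content)):
            raise ValueError("verification failed for " + uri)
        verified.add(uri)
        for ref in URI_PATTERN.findall(content.decode("utf-8")):
            # Plain URIs carry no hash, so the range of verifiability ends there.
            if HASH_URI_PATTERN.search(ref):
                todo.append(ref)
    return verified
```

A caller would pass any retrieval function as `fetch`, for example the `get` method of a dictionary that maps URIs to cached content.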

Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.

Page 12: Semantic Publishing with Nanopublications

Verifiable — Immutable — Permanent

Whether or not a given resource is the one a given trusty URI is supposed to represent can be verified with perfect confidence.

(assuming that the trusty URI for the required artifact is known, e.g. because another artifact contains it as a link)

Kuhn, Dumontier. Making Digital Artifacts on the Web Verifiable and Reliable. IEEE TKDE. To appear. / Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.

Page 13: Semantic Publishing with Nanopublications

Verifiable — Immutable — Permanent

A trusty URI artifact is immutable, as any change in the content also changes its URI, thereby making it a new artifact.

(as soon as your trusty URI has been picked up by third parties, e.g. cached or linked from other resources, every change will be noticed)

Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.

Page 14: Semantic Publishing with Nanopublications

Verifiable — Immutable — Permanent

Trusty URI artifacts are permanent, as they can be retrieved from the cache of third-party websites if otherwise no longer available.

(if there are search engines and web archives regularly crawling and caching the artifacts on the web)

Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.

Page 15: Semantic Publishing with Nanopublications

Calculating Trusty URIs is Fast and Reliable

Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.

Page 16: Semantic Publishing with Nanopublications

Nanopublication Server Network

Nanopublications with Trusty URIs

Publication

Retrieval

Propagation / Archiving

http://npmonitor.inn.ac

Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.

Page 17: Semantic Publishing with Nanopublications

Decentralized — Open — Real-time

• No central authority: Everybody can set up a server and join the network

• No restrictions on publication: Everybody can upload nanopublications

• No delay between submission and publication: Nanopublications are made public immediately

• No updates: If a nanopublication is modified, that makes it a new nanopublication (enforced by trusty URIs)

• No queries: Only simple identifier-based lookup
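
The properties above (no updates, identifier-based lookup only, propagation between servers) can be sketched with a toy in-memory store. This is not the nanopublication server software or its protocol; `ToyServer`, `artifact_code`, and `pull_from` are hypothetical names, and the identifier is simply an unpadded base64url SHA-256 digest of the raw bytes.

```python
import base64
import hashlib
from typing import Dict, Optional


def artifact_code(content: bytes) -> str:
    """Identifier derived from the content itself (simplified)."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


class ToyServer:
    """Append-only store that only supports lookup by identifier."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def publish(self, content: bytes) -> str:
        # No updates: modified content gets a different code, so existing
        # entries are never overwritten.
        code = artifact_code(content)
        self._store.setdefault(code, content)
        return code

    def lookup(self, code: str) -> Optional[bytes]:
        # No queries: the identifier is the only access path.
        return self._store.get(code)

    def pull_from(self, other: "ToyServer") -> int:
        """Copy entries from another server, re-checking each hash first."""
        added = 0
        for code, content in other._store.items():
            if code not in self._store and artifact_code(content) == code:
                self._store[code] = content
                added += 1
        return added
```

Since every entry can be re-verified on arrival, a server in this sketch does not need to trust the peer it copies from.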

Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.

Page 18: Semantic Publishing with Nanopublications

High Performance and High Availability

Page 19: Semantic Publishing with Nanopublications

Defining Datasets with Nanopublication Indexes (which are themselves Nanopublications)

[Figure: panels (a) to (f) showing nanopublication indexes connected by the relations "has element", "has sub-index", and "appends".]
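
One way to read the three relations is as a small data structure that can be flattened into a set of nanopublications. The sketch below assumes that an index contains its own elements, everything in its sub-indexes, and everything in the index it appends to; that reading, the `Index` class, and the identifiers in the usage note are assumptions made for illustration, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Index:
    elements: Tuple[str, ...] = ()     # "has element"
    sub_indexes: Tuple[str, ...] = ()  # "has sub-index"
    appends: Optional[str] = None      # "appends"


def resolve(index_id: str, indexes: Dict[str, Index]) -> Set[str]:
    """Collect all element identifiers reachable from the given index."""
    result: Set[str] = set()
    seen: Set[str] = set()
    todo = [index_id]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        index = indexes[current]
        result.update(index.elements)
        todo.extend(index.sub_indexes)
        if index.appends is not None:
            todo.append(index.appends)
    return result
```

For example, with a hypothetical index "idx-b" that appends "idx-a", resolving "idx-b" yields the elements of both.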

Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.

Page 20: Semantic Publishing with Nanopublications

Using Nanopublication Datasets

Once published in the network, nanopublication indexes can be cited:

[7] Nanopubs converted from OpenBEL’s Small and Large Corpus 20131211. Nanopublication index, 4 March 2014, http://np.inn.ac/RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw

Researchers can then fetch and reuse the data in a reliable and perfectly reproducible manner:

# Download data:

GetNanopub.sh -c RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw

# Analyze data:

...

Existing data can be recombined in new indexes; and researchers can unambiguously refer to the used datasets for new results:

this: prov:wasDerivedFrom nps:RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw

Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.

Page 21: Semantic Publishing with Nanopublications

Real-Time and Replicable Data Mining

[Figure: workflow diagram with numbered steps 1 to 6 connecting authors, users, curators, bots, structured data sources, unstructured data sources, the "ocean of nanopublications", nanopublication portals, in silico experiments, network analyses, aggregated views, and hypothesis generation.]

Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.

Page 22: Semantic Publishing with Nanopublications

How can we Represent Actual Scientific Results in RDF?

Example:

[...] the risk of developing neurodegenerative disease in idiopathic REM sleep behavior disorder is substantial, with the majority of patients developing Parkinson disease and Lewy body dementia. [PubMed 19109537]

This is difficult to represent in RDF, but it would correspond to two nanopublication assertions:

• “The risk of developing neurodegenerative disease in idiopathic REM sleep behavior disorder is substantial.”

• “The majority of patients with idiopathic REM sleep behavior disorder who develop a neurodegenerative disease develop Parkinson disease and Lewy body dementia.”

Nanopublication provenance and metadata, on the other hand, would be relatively easy to produce.

Page 23: Semantic Publishing with Nanopublications

English Sentences as Anchors to Structure Scientific Discourse

[Figure: the sentences "Malaria is transmitted by mosquitoes.", "Mosquitoes transmit malaria.", "Malaria is transmitted by female mosquitoes.", and the misspelled "Malaria is transmitted by moscitos." connected by the relations "same meaning", "more specific meaning of", and "corrected version of"; studies A, B, C, and D are attached through "provides evidence for" and "provides counter-evidence against" links.]

Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.

Page 24: Semantic Publishing with Nanopublications

AIDA Sentences: Criteria

Single English sentences that are:

• Atomic: a sentence describing one thought that cannot be further broken down in a practical way

• Independent: a sentence that can stand on its own, without external references like “this effect” or “we”

• Declarative: a complete sentence ending with a full stop that could in theory be either true or false

• Absolute: a sentence describing the core of a claim ignoring the uncertainty about its truth and how it was discovered (no “probably” or “evaluation showed”); typically in present tense
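
Some of these criteria have surface cues that a program can flag. The sketch below is a rough heuristic written for illustration, not a tool from the talk: it only looks for the cues named in the list (a final full stop, "we" or "this effect", "probably" or "evaluation showed") and cannot judge atomicity or meaning. The function name `aida_warnings` is hypothetical.

```python
import re
from typing import List

# Cue phrases taken from the criteria above; the lists are deliberately tiny.
EXTERNAL_REFERENCES = re.compile(r"\b(we|this effect)\b", re.IGNORECASE)
UNCERTAINTY_MARKERS = re.compile(r"\b(probably|evaluation showed)\b", re.IGNORECASE)


def aida_warnings(sentence: str) -> List[str]:
    """Return warnings for surface cues that suggest a sentence is not AIDA."""
    warnings: List[str] = []
    text = sentence.strip()
    if not text.endswith("."):
        warnings.append("not declarative: does not end with a full stop")
    if EXTERNAL_REFERENCES.search(text):
        warnings.append("not independent: contains an external reference")
    if UNCERTAINTY_MARKERS.search(text):
        warnings.append("not absolute: mentions uncertainty or method")
    return warnings
```

An empty result does not mean the sentence is a valid AIDA sentence, only that none of these cues were found.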

Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.

Page 25: Semantic Publishing with Nanopublications

Controlled Natural Language (CNL)

AIDA is a kind of Controlled Natural Language.

More about that in my talk next week (20 March 2014) in the Protege Seminar.

Page 26: Semantic Publishing with Nanopublications

AIDA Nanopublications

Nanopublications can be combined with AIDA sentences to allow for informal and underspecified assertions:

[Figure: nanopublications whose assertions range from the plain AIDA sentence "Malaria is transmitted by mosquitoes." through partially formalized variants to the formal terms ns1:mosquito, ns3:transmission, and ns2:malaria.]

They can be linked regardless of their degree of formality:

[Figure: the AIDA sentence "Malaria is transmitted by mosquitoes." linked to an assertion using ns1:mosquito, ns3:transmission, and ns2:malaria.]

Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.

Page 27: Semantic Publishing with Nanopublications

AIDA Sentences can be Efficiently Created

Manual creation of AIDA sentences from existing abstracts:

  total        163 sentences  100%
  perfect      114 sentences   70%
  typo etc.      7 sentences    4%
  inaccurate    10 sentences    6%
  not AIDA      32 sentences   20%

Automatic extraction of AIDA sentences from existing GeneRIF dataset:

  total        250 sentences  100%
  perfect      177 sentences   71%
  typo etc.      8 sentences    3%
  not AIDA      65 sentences   26%

Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.

Page 28: Semantic Publishing with Nanopublications

Examples: AIDA Sentences for Claims by Famous Computer Scientists

• Computers are more flexible and more powerful if data and program instructions are both stored side by side in the same memory. [John von Neumann]

• The problem of determining the provability of arbitrary propositions in first-order logic is undecidable. [Alonzo Church / Alan Turing]

• There is no general algorithm that can decide for all possible programs and inputs whether the program will terminate or run forever. [Alan Turing]

• Automated reasoning can be suitably implemented as a search tree with the hypothesis at the root and by applying heuristics to trim branches. [Herbert Simon and Allen Newell]

• Single-layer perceptrons cannot learn the XOR relation. [Marvin Minsky and Seymour Papert]

• The use of GOTO statements in programming is harmful to program structure. [Edsger Dijkstra]

• Hypertext representations are more flexible, more general, and more natural than classical text structure for computer-supported information systems. [Ted Nelson]

https://github.com/tkuhn/aida/

Page 29: Semantic Publishing with Nanopublications

Summary

• With Trusty URIs, digital artifacts on the Web can be made immutable and verifiable

• With a decentralized network of nanopublication servers, we can build a reliable and trustworthy infrastructure for data publishing

• With AIDA sentences, we can formally represent provenance and metadata of informal or underspecified assertions

Page 30: Semantic Publishing with Nanopublications

Thank you for your attention!

Questions?
