rdf4pmc, rdfizing! pubmedcentral! - bioontology.org · 2013-02-06 · rdf4pmc, rdfizing!...

RDF4PMC, RDFizing PubMed Central

Alexander Garcia1, Leyla Jael García Castro2, Casey McLaughlin1 1Florida State University 2Universitat Jaume I

2/6/13

Biotea, R

DF4P

MC

1

Outline •  The Biotea project •  Why SemanNc Web Technologies? •  RDF4PMC in a nutshell •  Architecture •  RDFizaNon process

•  PMC RDFizaNon •  Content enrichment •  Some numbers for RDF4PMC •  Architecture

•  Using the data •  SPARQL •  Bio2RDF integraNon •  Web services •  A first prototype

•  Challenges and Lessons •  Currently working on… •  Future Work •  Conclusions •  Acknowledgments

2/6/13

Biotea, R

DF4P

MC

2

Biotea

•  Methodologies, methods and techniques supporNng semanNc enrichment of scholarly communicaNon

•  Once enriched, then how is this changing our user experience?

2/6/13

Biotea, R

DF4P

MC

3

Scholarly data and documents are of most value when they are interconnected rather than

independent ChrisNne L. Borgman

Biotea

•  How are publicaNons connected to each other? •  Pu\ng together explicit asserNons from different papers to form new implicit asserNons

•  SemanNc Web Techno logy supporNng scho lar ly communicaNon, Literature Based Discovery and the Search-‐Retrieval-‐and-‐InteracNng-‐with-‐the-‐Document (SRID) processes

2/6/13

Biotea, R

DF4P

MC

4

Scholarly data and documents are of most value when they are interconnected rather than independent

ChrisNne L. Borgman

Why SWT for research documents •  Generates an adaptable open approach, the data becomes the plaaorm

•  The SW delivers an integraNve plaaorm •  Makes it easier for the community to build over the plaaorm •  Simplifies programmaNc access to informaNon

•  Retrieve all papers that have a component X (CHEBI) and the cellular locaNon in GO terms •  As simple as relaNng terminologies

•  Delivers Social Network ready content

2/6/13

Biotea, R

DF4P

MC

5

RDF4PMC in a nutshell

• Delivers an interoperable, interlinked, and self-‐describing document model in the biomedical domain.

• A network of interconnected documents •  SemanNc infrastructure for PMC • An interface to the Web of Data • A knowledge model for biomedical literature – easily extendible

2/6/13

Biotea, R

DF4P

MC

6

RDF4PMC in a nutshell

•  RDFizing biomedical literature by orchestraNng ontologies such as •  DoCO, BIBO, DC, FOAF, W3CPROV, and others

•  Datasets are available •  RDF for metadata and content •  RDF for annotaNons from text-‐mining

•  RDFizator will be available •  Adding other ontologies and annotators is possible • Working with XML from other sources is possible

2/6/13

Biotea, R

DF4P

MC

7

RDFReactor

PMC RDFization

PMC NXML

RDF GeneraNon BIBO

References Enrichment

Metadata+ Content + References

2/6/13

Biotea, R

DF4P

MC

8

2/6/13

Biotea, R

DF4P

MC

9

AutomaNc AnnotaNon

RDF GeneraNon

Annotations: Content Enrichment


Web service Web service

Enriched RDF

2/6/13

Biotea, R

DF4P

MC

10

2/6/13

Biotea, R

DF4P

MC

11

RDF4PMC, some numbers

2/6/13

Biotea, R

DF4P

MC

12

RDF4PMC Server Architecture

Master Server

Web & SPARQL Server

(development)

Web & SPARQL Server

(produc<on)

RDF DB Master

RDF DB Slave

RDF DB Slave

PMC RDFiza<on

Import scripts + RDF files

Consuming the data: SPARQL

2/6/13

Biotea, R

DF4P

MC

14

SPARQL query Query expressed in natural

language

SELECT ?pmid ?<tle ?secTitle ?text

WHERE {

?ar<cle a bibo:Document ;

bibo:pmid ?pmid ;

dcterms:<tle ?<tle .

?sec<on a doco:Sec<on ;

dcterms:isPartOf ?ar<cle ;

dcterms:<tle ?secTitle .

FILTER (regex(str(?secTitle), "introduc<on", "i")).

?para a doco:Paragraph ;

dcterms:isPartOf ?sec<on ;

cnt:chars ?text .

FILTER (regex(str(?text), "cancer", "i")).

} LIMIT 50

à

Retrieving PubMed

idenNfier, arNcle Ntle,

secNon Ntle, and

paragraphs for those

arNcles containing the

term “cancer” in any

secNon whose Ntle

includes “introducNon”

Consuming the data: SPARQL

2/6/13

Biotea, R

DF4P

MC

15

SPARQL query Query expressed in natural

language

SELECT dis<nct ?pmid

WHERE {

?ar<cle a bibo:AcademicAr<cle ;

bibo:pmid ?pmid .

?annota<on a aot:ExactQualifier ;

ao:annotatesResource ?ar<cle ;

ao:hasTopic <h[p://purl.obolibrary.org/obo/CHEBI_60004> .

}

à

Retrieving PubMed idenNfier

for those arNcles that have

been semanNcally annotated

with the biological enNty

CHEBI:60004. The semanNc

annotaNon comes from the

occurrence of the term

“mixture” in any paragraph

of the retrieved arNcles.

CHEBI:60004 A mixture is a chemical substance composed of mulNple molecules, at least two of which are of a different kind

Bio2RDF Integration

2/6/13

Biotea, R

DF4P

MC

16

BIBO

Metadata & Referen

ces

Conten

t An

notaNo

ns

Consuming the data: Web services

2/6/13

Biotea, R

DF4P

MC

17

Retrieval Service

A list of terms and their related topics hqp://biotea.idiginfo.org/api/terms

A list of topics and their related vocabularies hqp://biotea.idiginfo.org/api/topics

All topics related to a term e.g., hqp://biotea.idiginfo.org/api/topics?term=cancer

All vocabularies related to a term e.g., hqp://biotea.idiginfo.org/api/vocabularies?term=cancer

All terms that start with a specific string (for autocomple<on) e.g.,hqp://biotea.idiginfo.org/api/terms?prefix=canc

All topics related to a vocabulary e.g., hqp://biotea.idiginfo.org/api/topics?vocabulary=po

RDF of ar<cles that include a term e.g., hqp://biotea.idiginfo.org/api/arNcles?term=cancer

Count of RDF of ar<cles that include a term e.g., hqp://biotea.idiginfo.org/api/arNcles?term=cancer&count=true

A list of vocabularies and their prefixes hqp://biotea.idiginfo.org/vocabularies

RDF of ar<cles that include a vocabulary e.g., hqp://biotea.idiginfo.org/api/arNcles?vocabulary=po


AutomaNcally Annotated RDF

Consuming the data: a dashboard for semantic bio-‐publications

Catalase

SPARQL

SemanNcally enriched publicaNon

2/6/13

Biotea, R

DF4P

MC

18

Consuming the data: Lirst prototype

2/6/13

Biotea, R

DF4P

MC

19

Links

Title & authors

Cloud of Bio-‐annota<ons (term + # of bio-‐en<<es)

Abstract

Paragraphs containing the annota<on selected

by the user

Graphical tools

Consuming the data: A Lirst prototype

2/6/13

Biotea, R

DF4P

MC

20

Challenges and Lessons •  Content

•  Tables and images à Links •  Inline tables à Format is lost •  Supplementary material •  Most of them follow one DTD but …

•  References •  At least 4 different styles •  Some Nmes are just plain text

•  Annotators •  Not always available •  Stop words are tricky

2/6/13

Biotea, R

DF4P

MC

21

Challenges and Lessons •  Where are the facts? How to validate the facts?

•  Delivering the expressivity of the data set to the end user is a complex issue

•  AnnotaNon is context dependent

•  Maintaining the triplet store has a learning curve of its own •  Building SW infrastructure is H A R D

2/6/13

Biotea, R

DF4P

MC

22

Currently working on: Literature Discovery Process

•  Search •  Usually string-‐based search mechanisms •  Li[le cogni<ve support

•  Retrieval •  Simple list of DB entries •  Liqle cogniNve support

•  Interac<ng with the document •  Straight into the PDF •  Zero cogniNve support •  Data availability


•  Search •  Usually string-‐based search mechanisms •  Liqle cogniNve support

•  Retrieval •  Simple list of DB entries •  Li[le cogni<ve support •  How, why and where are a set of documents similar?

•  Interac<ng with the document •  Straight into the PDF •  Zero cogniNve support


•  Search •  Usually string-‐based search mechanisms •  Liqle cogniNve support

•  Retrieval •  Simple list of DB entries •  Liqle cogniNve support

•  Interac<ng with the document •  Straight into the PDF •  Zero cogni<ve support

Future Work •  RDF

•  URI standardizaNon following similar paqerns to idenNfiers.org and Bio2RDF

•  IntegraNon into Bio2RDF •  Dataset idenNficaNon and summary (void) •  Improve data for references

•  User Experience •  Web services for data analysis •  RDF browser •  More visualizaNon tools •  SupporNng and taking advantage of the structure of the document

•  CollaboraNve element

2/6/13

Biotea, R

DF4P

MC

29

Future Work • ApplicaNon in Clinical Psychology, the MSRC case

•  From PDF to XML to RDF to Enriched Metadata for the PDF

•  The PDF is gently introduced in the WoD • Once the metadata has been enriched then

•  Rich interacNon supporNng: SEARCH-‐RETRIEVAL-‐INTERACTION WITH THE DOCUMENT (PDF)

2/6/13

Biotea, R

DF4P

MC

30

Conclusions •  We provide

•  the transformaNon into RDF from the original PMC files •  the annotaNon of the RDF •  an API which makes that data available.

•  New vocabularies as well as annotators can easily be plugged in •  Our approach is useful for both open and non-‐open access datasets •  Publishers may decide what to expose via RDF and what content to make available

•  Our approach is also applicable for PDF-‐only environments

2/6/13

Biotea, R

DF4P

MC

31

Acknowledgments •  The MSRC consorNum •  Greg Riccardi, FSU •  Oscar Corcho, UPM •  Olga Giraldo, UPM •  Bob Morris, Harvard University •  Michel DumonNer, Carleton University •  Dietrich Rebholz-‐Schuhmann, University of Zurich •  Diane Leiva, FSU •  US DoD Grant MOMRP Grant w81xwh-‐10-‐2-‐0181 •  All of those who gave us feedback about the RDFizaNon and the quality of our RDF datasets

2/6/13

Biotea, R

DF4P

MC

32

Thanks for you attention

Contacts •  Alexander García: [email protected] •  L. Jael García Castro: [email protected]

2/6/13

Biotea, R

DF4P

MC

33

rdf4pmc, rdfizing! pubmedcentral! - bioontology.org · 2013-02-06 · rdf4pmc, rdfizing!...

Documents