rdf4pmc, rdfizing! pubmedcentral! - bioontology.org · 2013-02-06 · rdf4pmc, rdfizing!...
TRANSCRIPT
RDF4PMC, RDFizing PubMed Central
Alexander Garcia (1), Leyla Jael García Castro (2), Casey McLaughlin (1)
(1) Florida State University, (2) Universitat Jaume I
2/6/13
Biotea, RDF4PMC
Outline
• The Biotea project • Why Semantic Web Technologies? • RDF4PMC in a nutshell • Architecture • RDFization process
• PMC RDFization • Content enrichment • Some numbers for RDF4PMC • Architecture
• Using the data • SPARQL • Bio2RDF integration • Web services • A first prototype
• Challenges and Lessons • Currently working on… • Future Work • Conclusions • Acknowledgments
Biotea
• Methodologies, methods and techniques supporting semantic enrichment of scholarly communication
• Once content is enriched, how does this change the user experience?
"Scholarly data and documents are of most value when they are interconnected rather than independent." Christine L. Borgman
Biotea
• How are publications connected to each other? • Putting together explicit assertions from different papers to form new implicit assertions
• Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the Search-Retrieval-and-Interacting-with-the-Document (SRID) processes
Why SWT for research documents
• Generates an adaptable open approach; the data becomes the platform
• The SW delivers an integrative platform • Makes it easier for the community to build on the platform • Simplifies programmatic access to information
• Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms • As simple as relating terminologies
• Delivers Social Network ready content
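The "component X (CHEBI) plus cellular location (GO)" retrieval can be pictured as one query that requires two annotations on the same article. The sketch below is only an illustration: it reuses the `ao:annotatesResource` and `ao:hasTopic` properties shown in the SPARQL slides, leaves the PREFIX declarations to the caller, and assumes that GO terms are identified by OBO PURLs in the dataset (the GO term used here, GO:0005634 "nucleus", is an illustrative choice, not taken from the slides).

```python
def articles_with_topics(topic_iris):
    """Build a SPARQL query for articles annotated with every given topic IRI.

    PREFIX declarations for bibo: and ao: are not included; the caller
    must prepend the ones used by the target endpoint.
    """
    if not topic_iris:
        raise ValueError("at least one topic IRI is required")
    lines = [
        "SELECT DISTINCT ?pmid",
        "WHERE {",
        "  ?article a bibo:AcademicArticle ;",
        "    bibo:pmid ?pmid .",
    ]
    for i, iri in enumerate(topic_iris):
        if any(c in iri for c in '<>" '):
            raise ValueError(f"not a safe IRI: {iri!r}")
        # One annotation node per topic, all pointing at the same article.
        lines.append(f"  ?annotation{i} ao:annotatesResource ?article ;")
        lines.append(f"    ao:hasTopic <{iri}> .")
    lines.append("}")
    return "\n".join(lines)


CHEBI_TOPIC = "http://purl.obolibrary.org/obo/CHEBI_60004"  # from the slides
GO_TOPIC = "http://purl.obolibrary.org/obo/GO_0005634"      # assumed, illustrative

if __name__ == "__main__":
    print(articles_with_topics([CHEBI_TOPIC, GO_TOPIC]))
```

Because both annotation patterns share `?article`, only articles carrying both topics match, which is the sense in which the query is "as simple as relating terminologies".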
RDF4PMC in a nutshell
• Delivers an interoperable, interlinked, and self-describing document model in the biomedical domain.
• A network of interconnected documents • Semantic infrastructure for PMC • An interface to the Web of Data • A knowledge model for biomedical literature, easily extensible
RDF4PMC in a nutshell
• RDFizing biomedical literature by orchestrating ontologies such as DoCO, BIBO, DC, FOAF, W3C PROV, and others
• Datasets are available • RDF for metadata and content • RDF for annotations from text-mining
• RDFizator will be available • Adding other ontologies and annotators is possible • Working with XML from other sources is possible
PMC RDFization
[Diagram labels: RDFReactor; PMC NXML; RDF Generation; BIBO; References Enrichment; Metadata + Content + References]
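To make the NXML-to-RDF step concrete, here is a minimal sketch in Python. It is not the project's RDFizer (the slides name RDFReactor, a Java library); it only shows the kind of mapping involved, using the `bibo:pmid` and `dcterms:title` properties that appear in the SPARQL slides. The base IRI `http://example.org/pmc/` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/pmc/"  # hypothetical base IRI, not the project's

# Tiny hand-written sample in the shape of PMC NXML front matter.
SAMPLE_NXML = """<article>
  <front><article-meta>
    <article-id pub-id-type="pmid">12345</article-id>
    <title-group><article-title>A sample title</article-title></title-group>
  </article-meta></front>
</article>"""


def literal(text):
    """Serialize a plain string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def nxml_to_ntriples(nxml):
    """Map the PMID and article title of one NXML document to N-Triples lines."""
    root = ET.fromstring(nxml)
    meta = root.find("./front/article-meta")
    pmid = meta.findtext("article-id[@pub-id-type='pmid']")
    # itertext() also collects text nested in inline markup such as <italic>.
    title = "".join(meta.find("title-group/article-title").itertext()).strip()
    subject = f"<{BASE}{pmid}>"
    return [
        f"{subject} <{RDF_TYPE}> <{BIBO}AcademicArticle> .",
        f"{subject} <{BIBO}pmid> {literal(pmid)} .",
        f"{subject} <{DCTERMS}title> {literal(title)} .",
    ]


if __name__ == "__main__":
    print("\n".join(nxml_to_ntriples(SAMPLE_NXML)))
```

Sections, paragraphs and references would follow the same pattern with further vocabulary terms (for example DoCO), which this sketch leaves out.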
Annotations: Content Enrichment
[Diagram labels: Metadata + Content + References; Automatic Annotation; Web service (×2); RDF Generation; Enriched RDF]
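The slides indicate that annotations come from external annotation web services. As a stand-in, the sketch below uses a local dictionary matcher to show what the output of the enrichment step looks like: pairs of a matched term and the ontology topic it maps to. The dictionary is hypothetical apart from the single "mixture" to CHEBI:60004 pair, which is the example given in the SPARQL slides.

```python
import re

# Hypothetical term dictionary; only this one pair comes from the slides.
TERM_TO_TOPIC = {
    "mixture": "http://purl.obolibrary.org/obo/CHEBI_60004",
}


def annotate(paragraph):
    """Return sorted (term, topic IRI) pairs for dictionary terms in a paragraph.

    Matching is case-insensitive and on whole words only; each term is
    reported once per paragraph, however often it occurs.
    """
    tokens = set(re.findall(r"[a-z]+", paragraph.lower()))
    return sorted((t, TERM_TO_TOPIC[t]) for t in tokens if t in TERM_TO_TOPIC)


if __name__ == "__main__":
    print(annotate("The mixture was heated, and the Mixture cooled."))
```

Each returned pair would then become one annotation resource linking the article to the topic, in the way the `ao:annotatesResource` and `ao:hasTopic` properties are used in the SPARQL slides.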
RDF4PMC, some numbers
RDF4PMC Server Architecture
[Diagram labels: Master Server; Web & SPARQL Server (development); Web & SPARQL Server (production); RDF DB Master; RDF DB Slave (×2); PMC RDFization; Import scripts + RDF files]
Consuming the data: SPARQL
SPARQL query:
SELECT ?pmid ?title ?secTitle ?text
WHERE {
  ?article a bibo:Document ;
    bibo:pmid ?pmid ;
    dcterms:title ?title .
  ?section a doco:Section ;
    dcterms:isPartOf ?article ;
    dcterms:title ?secTitle .
  FILTER (regex(str(?secTitle), "introduction", "i")) .
  ?para a doco:Paragraph ;
    dcterms:isPartOf ?section ;
    cnt:chars ?text .
  FILTER (regex(str(?text), "cancer", "i")) .
} LIMIT 50
→ Query expressed in natural language:
Retrieve the PubMed identifier, article title, section title, and paragraphs for those articles containing the term "cancer" in any section whose title includes "introduction".
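The same query can be generated for other keywords and section titles. The following is a small sketch of such a builder, assuming the caller prepends the PREFIX declarations for `bibo:`, `dcterms:`, `doco:` and `cnt:`; the function names are illustrative only.

```python
def sparql_string(value):
    """Quote a Python string as a SPARQL string literal (escapes backslash and quote)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def paragraphs_query(keyword, section_title, limit=50):
    """Build the slide's query for a given keyword and section-title pattern.

    Both arguments are passed to regex(), as in the slide, so regex
    metacharacters in them keep their special meaning.
    """
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return "\n".join([
        "SELECT ?pmid ?title ?secTitle ?text",
        "WHERE {",
        "  ?article a bibo:Document ;",
        "    bibo:pmid ?pmid ;",
        "    dcterms:title ?title .",
        "  ?section a doco:Section ;",
        "    dcterms:isPartOf ?article ;",
        "    dcterms:title ?secTitle .",
        f'  FILTER (regex(str(?secTitle), {sparql_string(section_title)}, "i")) .',
        "  ?para a doco:Paragraph ;",
        "    dcterms:isPartOf ?section ;",
        "    cnt:chars ?text .",
        f'  FILTER (regex(str(?text), {sparql_string(keyword)}, "i")) .',
        f"}} LIMIT {limit}",
    ])


if __name__ == "__main__":
    print(paragraphs_query("cancer", "introduction"))
```

Calling `paragraphs_query("cancer", "introduction")` reproduces the query body shown on the slide.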
Consuming the data: SPARQL
SPARQL query:
SELECT DISTINCT ?pmid
WHERE {
  ?article a bibo:AcademicArticle ;
    bibo:pmid ?pmid .
  ?annotation a aot:ExactQualifier ;
    ao:annotatesResource ?article ;
    ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> .
}
→ Query expressed in natural language:
Retrieve the PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term "mixture" in any paragraph of the retrieved articles.
CHEBI:60004: A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind.
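The topic in the query is the OBO PURL form of the compact identifier CHEBI:60004. A minimal sketch of that conversion, assuming the usual OBO convention of replacing the colon with an underscore and a purely numeric local identifier (the function name is illustrative):

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_iri(curie):
    """Convert a compact identifier such as 'CHEBI:60004' to its OBO PURL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix.isalnum() or not local_id.isdigit():
        raise ValueError(f"not a PREFIX:digits identifier: {curie!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


if __name__ == "__main__":
    print(obo_iri("CHEBI:60004"))
```

The result can be placed between angle brackets as the object of `ao:hasTopic` in the query above.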
Bio2RDF Integration
[Diagram labels: BIBO; Metadata & References; Content; Annotations]
Consuming the data: Web services
Retrieval Service
A list of terms and their related topics: http://biotea.idiginfo.org/api/terms
A list of topics and their related vocabularies: http://biotea.idiginfo.org/api/topics
All topics related to a term, e.g., http://biotea.idiginfo.org/api/topics?term=cancer
All vocabularies related to a term, e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer
All terms that start with a specific string (for autocompletion), e.g., http://biotea.idiginfo.org/api/terms?prefix=canc
All topics related to a vocabulary, e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po
RDF of articles that include a term, e.g., http://biotea.idiginfo.org/api/articles?term=cancer
Count of RDF of articles that include a term, e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true
A list of vocabularies and their prefixes: http://biotea.idiginfo.org/vocabularies
RDF of articles that include a vocabulary, e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po
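The endpoints above share one pattern: a resource name under `/api` plus optional query parameters. The sketch below only builds such URLs with the standard library; it does not contact the service, whose current availability and response format are not stated in the slides. The helper name `api_url` is illustrative.

```python
from urllib.parse import urlencode

API_BASE = "http://biotea.idiginfo.org/api"  # base path as shown on the slide


def api_url(resource, **params):
    """Build a retrieval-service URL for a resource such as 'terms' or 'articles'."""
    query = urlencode(params)  # percent-encodes values; keeps keyword order
    return f"{API_BASE}/{resource}" + (f"?{query}" if query else "")


if __name__ == "__main__":
    print(api_url("terms"))
    print(api_url("terms", prefix="canc"))
    print(api_url("articles", term="cancer", count="true"))
```

Using `urlencode` rather than string concatenation means multi-word terms (for example "breast cancer") are encoded correctly.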
[Diagram labels: Metadata + Content + References; Automatically Annotated RDF]
Consuming the data: a dashboard for semantic bio-publications
[Diagram labels: Catalase; SPARQL; Semantically enriched publication]
Consuming the data: first prototype
[Screenshot labels: Links; Title & authors; Cloud of bio-annotations (term + # of bio-entities); Abstract; Paragraphs containing the annotation selected by the user; Graphical tools]
Consuming the data: A first prototype
Challenges and Lessons
• Content • Tables and images → Links • Inline tables → Format is lost • Supplementary material • Most of them follow one DTD but …
• References • At least 4 different styles • Sometimes they are just plain text
• Annotators • Not always available • Stop words are tricky
Challenges and Lessons
• Where are the facts? How to validate the facts?
• Delivering the expressivity of the data set to the end user is a complex issue
• Annotation is context dependent
• Maintaining the triple store has a learning curve of its own • Building SW infrastructure is HARD
Currently working on: Literature Discovery Process
• Search • Usually string-based search mechanisms • Little cognitive support
• Retrieval • Simple list of DB entries • Little cognitive support
• Interacting with the document • Straight into the PDF • Zero cognitive support • Data availability
Currently working on: Literature Discovery Process
• Search • Usually string-based search mechanisms • Little cognitive support
• Retrieval • Simple list of DB entries • Little cognitive support • How, why and where are a set of documents similar?
• Interacting with the document • Straight into the PDF • Zero cognitive support
Future Work
• RDF • URI standardization following similar patterns to identifiers.org and Bio2RDF • Integration into Bio2RDF • Dataset identification and summary (VoID) • Improve data for references
• User Experience • Web services for data analysis • RDF browser • More visualization tools • Supporting and taking advantage of the structure of the document
• Collaborative element
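As an illustration of what URI standardization along these lines could look like, the sketch below builds identifiers in the two patterns named above. The templates are assumptions based on the publicly documented URI schemes of identifiers.org (`namespace/id`) and Bio2RDF (`namespace:id`); the slides do not say which namespaces or patterns the project will adopt, and the PubMed identifier used is made up.

```python
def identifiers_org_uri(namespace, local_id):
    """identifiers.org-style URI, e.g. http://identifiers.org/pubmed/12345."""
    return f"http://identifiers.org/{namespace}/{local_id}"


def bio2rdf_uri(namespace, local_id):
    """Bio2RDF-style URI, e.g. http://bio2rdf.org/pubmed:12345."""
    return f"http://bio2rdf.org/{namespace}:{local_id}"


if __name__ == "__main__":
    pmid = "12345"  # made-up PubMed identifier
    print(identifiers_org_uri("pubmed", pmid))
    print(bio2rdf_uri("pubmed", pmid))
```

Minting article URIs from a shared namespace and identifier in this way is what lets other datasets link to the same article without a lookup step.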
Future Work
• Application in Clinical Psychology, the MSRC case
• From PDF to XML to RDF to Enriched Metadata for the PDF
• The PDF is gently introduced in the WoD • Once the metadata has been enriched then
• Rich interaction supporting: SEARCH-RETRIEVAL-INTERACTION WITH THE DOCUMENT (PDF)
Conclusions
• We provide • the transformation into RDF from the original PMC files • the annotation of the RDF • an API which makes that data available
• New vocabularies as well as annotators can easily be plugged in • Our approach is useful for both open and non-open access datasets • Publishers may decide what to expose via RDF and what content to make available
• Our approach is also applicable for PDF-only environments
Acknowledgments
• The MSRC consortium • Greg Riccardi, FSU • Oscar Corcho, UPM • Olga Giraldo, UPM • Bob Morris, Harvard University • Michel Dumontier, Carleton University • Dietrich Rebholz-Schuhmann, University of Zurich • Diane Leiva, FSU • US DoD Grant MOMRP Grant W81XWH-10-2-0181 • All of those who gave us feedback about the RDFization and the quality of our RDF datasets
Thanks for your attention
Contacts • Alexander García: [email protected] • L. Jael García Castro: [email protected]