The Progress on Sagace and Data Integration Maori Ito 1


Post on 22-May-2015


DESCRIPTION

This slide deck explains the progress of Sagace (a cross search engine for biomedical data and resources in Japan) and data integration using RDF.

TRANSCRIPT

Page 1: The Progress on Sagace and Data Integration

The Progress on Sagace and Data Integration

Maori Ito

Page 2: The Progress on Sagace and Data Integration

Two main topics

• Sagace: Cross Search

• RDF: Data Integration

Page 3: The Progress on Sagace and Data Integration

Sagace

• Search for Biomedical Data & Resources in Japan

Page 4: The Progress on Sagace and Data Integration

Features

• Focus on biomedical databases

• Manual / semi-automated ranking

• Refining search results with facets

• More informative search results with metadata

Page 5: The Progress on Sagace and Data Integration

Mechanism of Search Engine

1. Crawling

2. Indexing

3. Query Processing

4. Scoring

Page 6: The Progress on Sagace and Data Integration


Crawling

Databases

Crawling Program

Page 7: The Progress on Sagace and Data Integration

Indexing

• Split the data into conveniently sized pieces and store them on our own server

Internal Server

Indexing Data

Page 8: The Progress on Sagace and Data Integration

Query Processing and Scoring
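The four steps named on the "Mechanism of Search Engine" slide can be illustrated with a toy inverted index. This is only a minimal sketch, not Sagace's actual implementation: the crawled documents, their ids, and the count-based scoring are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for crawled pages: {document id: text}.
CRAWLED = {
    "db1/entry1": "acetaminophen drug target prostaglandin synthase",
    "db2/entry7": "glycan structure database entry",
    "db3/entry3": "acetaminophen acetaminophen toxicity liver",
}


def build_index(docs):
    """Indexing: map each term to the documents containing it, with counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Query processing and scoring: sum term counts per document."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(CRAWLED)
results = search(index, "acetaminophen liver")
```

Here `results` lists the matching document ids with their scores, best match first. A real engine would replace the raw term counts with a proper relevance score.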

Page 9: The Progress on Sagace and Data Integration

NIBIO

MEDALS

JCGGDB

NBDC / DBCLS

AgriTogo

Collaborate by using P2P architecture

Search System


Page 10: The Progress on Sagace and Data Integration

What is the most important thing in cross search?

Speed and accuracy!

Page 11: The Progress on Sagace and Data Integration

Features

• Focus on biomedical databases

• Semi-automated Ranking

Page 12: The Progress on Sagace and Data Integration

Log analysis, reflected in search results

• The members of the top 8 databases are almost the same.

– Patents

– KEGG MEDICUS

– Medicine and pharmaceutical proceedings

– Drug emergency call

– Ingredients information of health food

– Merck Manual

– Medical Information Network Distribution Service

– The Encyclopedia of Psychoactive Drugs

Page 13: The Progress on Sagace and Data Integration


Comparison of databases

• Popular databases are medical or pharmaceutical databases that are rich in “literal” (textual) information.

• The top databases take almost all of the clicks!

• More than half of the databases have never been clicked!

Page 14: The Progress on Sagace and Data Integration

Log data has been reflected in ranking.

• Original score -> A: 12,000, B: 8,000

• Gather click data

• Eliminate duplicate clicks on the same database within the same day and keep the lowest denotative rank.

– If the database score is lower than 12,400, add 200.

– Otherwise, basically add 100. But if the database's denotative rank is lower than 10, add 200.

• The Patents score is fixed at 8,100.

• The maximum score is 30,000.
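The update rules on this slide can be sketched in a few lines. The slide leaves several details open, so this is one possible reading rather than the actual Sagace code. It assumes that "denotative rank" is the position at which the database was displayed (1 = top), that "lowest" and "lower than 10" mean a position number greater than 10, and that each deduplicated click applies one bonus. The click log format, the dates, and database "C" are hypothetical.

```python
MAX_SCORE = 30_000
PATENTS_SCORE = 8_100
LOW_SCORE_THRESHOLD = 12_400


def update_scores(scores, clicks):
    """Return new scores after applying one batch of click-log entries.

    scores: {database name: current score}
    clicks: iterable of (day, database name, displayed rank), rank 1 = top.
    """
    # Keep one entry per (day, database): the lowest displayed rank,
    # read here as the largest position number (an assumption).
    lowest_rank = {}
    for day, db, rank in clicks:
        key = (day, db)
        lowest_rank[key] = max(rank, lowest_rank.get(key, rank))

    updated = dict(scores)
    for (day, db), rank in sorted(lowest_rank.items()):
        if db == "Patents":
            continue  # fixed score, handled below
        current = updated[db]
        if current < LOW_SCORE_THRESHOLD or rank > 10:
            bonus = 200
        else:
            bonus = 100
        updated[db] = min(current + bonus, MAX_SCORE)

    if "Patents" in updated:
        updated["Patents"] = PATENTS_SCORE
    return updated


# Hypothetical starting scores and click log.
scores = {"A": 12_000, "B": 8_000, "C": 29_950, "Patents": 8_100}
clicks = [
    ("2013-04-01", "A", 3),
    ("2013-04-01", "A", 12),  # same day, same database: one entry kept
    ("2013-04-02", "A", 1),
    ("2013-04-01", "C", 2),
    ("2013-04-01", "Patents", 1),
]
new_scores = update_scores(scores, clicks)
```

In this run, database A is credited once per day, C is capped at the maximum, B is untouched, and Patents stays at its fixed score.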

Page 15: The Progress on Sagace and Data Integration


Unpopular databases

• Sagace started its service in March 2012.

• Some databases have never been clicked since then.

• Eliminate these databases.

• Databases

– 272 DB -> 122 DB

Page 16: The Progress on Sagace and Data Integration

Results

• Accuracy for users must have improved.

• Reducing the number of databases also sped up the search.

Page 17: The Progress on Sagace and Data Integration

Specific databases in life science

• Some life science databases lack “literal information”.

• A cross search engine is well suited to showing literal information.

• Metadata will help these databases.

Page 18: The Progress on Sagace and Data Integration


Metadata

• If the developers mark up data with metadata…

Page 19: The Progress on Sagace and Data Integration

Metadata

• Literal information can be added to search results!

Results Image

Page 20: The Progress on Sagace and Data Integration

How to mark up and reflect the results?

【 HTML 】

<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="dateModified">2012-10-24</span>
</div>

【 Result 】

Declare the scope (itemscope) and itemtype on a normal HTML tag.

Select the property (itemprop); the element content is its value.
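To illustrate how a crawler might read this markup, here is a minimal sketch using Python's standard `html.parser`. It is not the Sagace crawler: the `MicrodataParser` class is hypothetical, and it handles only flat items like the one on the slide (no nested scopes, no values carried in attributes).

```python
from html.parser import HTMLParser

# The markup from the slide.
PAGE = """
<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="dateModified">2012-10-24</span>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collect itemtype values and itemprop -> text pairs (flat, no nesting)."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        # Remember which property the following text belongs to.
        self._current_prop = attrs.get("itemprop")

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()

    def handle_endtag(self, tag):
        self._current_prop = None


parser = MicrodataParser()
parser.feed(PAGE)
parser.close()
```

After parsing, `parser.item_types` holds the declared type and `parser.properties` maps `dateModified` to its text, which a search engine could then show next to the hit.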

Page 21: The Progress on Sagace and Data Integration

Win Win Win!

• Database developers can showcase rich database information.

• Users can find valuable information easily.

• The crawler program can find this metadata properly.

Page 22: The Progress on Sagace and Data Integration

What is schema.org?

• “Schema.org is a set of extensible schemas that enables webmasters to embed structured data on their web pages for use by search engines and other applications.”

• “Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages.”

(http://schema.org/)

Page 23: The Progress on Sagace and Data Integration

Microdata

“You use the schema.org vocabulary, along with the microdata format, to add information to your HTML content.”

(http://schema.org/docs/gs.html)

• Finalizing the proposal of a schema.org extension is a requirement for showing “rich” results in major search engines.

Page 24: The Progress on Sagace and Data Integration

Current Situation

• We defined our own properties (entryID, isEntryOf, taxon, seeAlso, reference).

• Please refer to:

– http://sagace.nibio.go.jp/press/metadata/markup/

Page 25: The Progress on Sagace and Data Integration

6 DBs, 1 catalog and 1 DB archive have applied microdata!

• DoBISCUIT (Database Of BIoSynthesis clusters CUrated and InTegrated)

• JCRB Cell Bank

• Functional Glycomics with KO mice database

• Glyco-Disease Genes Database

• JCGGDB Report

• MEDALS

• Integbio Database Catalog

• Life Science Database Archive

Page 26: The Progress on Sagace and Data Integration

To add biological database vocabularies into schema.org,

• “Need more people who think it is a good idea.” (by organizers @ schema.org)

– [email protected] (<- mailing list. Let’s join!)

• We need more databases and web pages that are marked up with microdata.

• I want your opinion on microdata.

• Let's talk!

Page 27: The Progress on Sagace and Data Integration

http://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/

Data Integration with RDF

http://www.cytoscape.org/what_is_cytoscape.html

Page 28: The Progress on Sagace and Data Integration

What is RDF?

• Resource Description Framework

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
drugbank:DB00316 rdfs:label "Acetaminophen" .

RDF

Page 29: The Progress on Sagace and Data Integration

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .

drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .

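RDF

[Figure: the triples above drawn as a graph of subject, predicate and object. drugbank:DB00316 (subject) has the predicate rdfs:label with object "Acetaminophen", and the predicate drugbank_vocab:target with object drugbank_target:290. drugbank_target:290 is both that object and the subject of rdfs:label with object "Prostaglandin G/H synthase 2".]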

Page 30: The Progress on Sagace and Data Integration

SPARQL(SPARQL Protocol and RDF Query Language)

• “SPARQL (pronounced "sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format.”

(http://en.wikipedia.org/wiki/SPARQL)

Page 31: The Progress on Sagace and Data Integration

How to use?

What is the target of “Acetaminophen”?

RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .

SPARQL

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
select distinct ?v where {
  # distinct means duplicates are excluded
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}

Results!

"Prostaglandin G/H synthase 2"
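The query on this slide is essentially a join over three triple patterns. As a rough illustration of what a SPARQL engine does with it, the sketch below stores the slide's triples as plain Python tuples and matches the patterns by hand. The `match` and `target_labels` helpers are hypothetical, and a real triple store works on full IRIs and indexes rather than a list scan.

```python
# The slide's triples as (subject, predicate, object) tuples.
TRIPLES = [
    ("drugbank:DB00316", "rdfs:label", "Acetaminophen"),
    ("drugbank:DB00316", "drugbank_vocab:target", "drugbank_target:290"),
    ("drugbank_target:290", "rdfs:label", "Prostaglandin G/H synthase 2"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def target_labels(triples, drug_label):
    """Mimic the query: ?s label drug_label ; target ?t . ?t label ?v ."""
    labels = set()  # a set plays the role of DISTINCT
    for s, _, _ in match(triples, p="rdfs:label", o=drug_label):
        for _, _, t in match(triples, s=s, p="drugbank_vocab:target"):
            for _, _, v in match(triples, s=t, p="rdfs:label"):
                labels.add(v)
    return sorted(labels)


answer = target_labels(TRIPLES, "Acetaminophen")
```

Each nested loop corresponds to one pattern in the `where` clause, and the shared variables (`?s`, `?t`) are what link the patterns together.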

Page 32: The Progress on Sagace and Data Integration


SPARQL Endpoint

What is the target of “Acetaminophen”?

e.g. http://drugbank.bio2rdf.org/sparql

Page 33: The Progress on Sagace and Data Integration


Results

• You can get results from the endpoint.
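Besides the web form, an endpoint can be queried from a program through the standard SPARQL Protocol: an HTTP GET with the query in a `query` parameter. The sketch below uses only Python's standard library. It assumes the endpoint URL from the slide is still reachable and that the drug label in the live data is exactly "Acetaminophen"; neither is guaranteed. The `build_request` and `run_query` helpers are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://drugbank.bio2rdf.org/sparql"  # URL as given on the slide

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?v WHERE {
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query, timeout=30):
    """Send the request and return the values bound to ?v (needs network)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        data = json.load(resp)
    return [row["v"]["value"] for row in data["results"]["bindings"]]


# Building the request does not touch the network; run_query() does.
request = build_request(ENDPOINT, QUERY)
```

Calling `run_query(ENDPOINT, QUERY)` would send the request and return the target labels, if the endpoint is online and supports the JSON results format.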

Page 34: The Progress on Sagace and Data Integration

RDFization in life science

• Much data has already been RDFized.

• Affymetrix, DrugBank, GO, OMIM, KEGG, PDB, UniProt, PubMed...

Page 35: The Progress on Sagace and Data Integration


Let’s try!

• Bio2RDF

– http://bio2rdf.org/

• EBI RDF Platform

– http://www.ebi.ac.uk/rdf/

• SPARQL endpoint

– e.g. http://drugbank.bio2rdf.org/sparql

• How to learn?

– Learning SPARQL

Page 36: The Progress on Sagace and Data Integration

Pros of RDF

• Excellent with life science data

• Comparison to RDB

– Can be expanded easily

– RDB RDF

• Excellent with NoSQL too

– key value

Page 37: The Progress on Sagace and Data Integration

Cons of RDF

• A bit hard to create RDF

• A bit hard to set up development environments

• Speed of SPARQL

Page 38: The Progress on Sagace and Data Integration

Current situation in NIBIO

• Toxygates

– Johan-san and Igarashi-san have been developing it.

• Orphan Drug Data

Page 39: The Progress on Sagace and Data Integration

Toxygates

• RDFization of Open TG-Gates data.

– microarray data, pathological data (kidney, liver, grade, ...)

• Linked to other databases by using RDF

– KEGG pathway

– GO terms

– CHEMBL

– DrugBank

Page 40: The Progress on Sagace and Data Integration


http://toxygates.nibio.go.jp/

Page 41: The Progress on Sagace and Data Integration

Orphan Drug

• RDFize orphan drug information in NIBIO.

<http://www.nibio.go.jp/orphanDrugTarget#80>
    drgn:designationFiscalYear "1996" ;
    drgn:designationDate "1996/4/1" ;
    drgn:number "(8yaku A) No. 81" ;
    drgb:name "Imiglucerase" ;
    dc:description "Improvement of symptoms of anaemia, thrombocytopenia, hepatosplenomegaly, bone symptoms, etc. in patients with Gaucher's disease" ;
    drgn:designationApplicant "Genzyme Japan K.K." ;
    drgb:pharmacology "Improvement of symptoms of anaemia, thrombocytopenia, hepatosplenomegaly, bone symptoms, etc. in patients with Gaucher's disease" ;
    drgb:manufacturer "Genzyme Japan K.K." ;
    eob:approvalDate "1998/3/6" ;
    drgb:product "Cerezyme injection 200U" ;
    drgb:brand "CEREZYME_ injection" ;
    drgn:approvedName "Imiglucerase (Genetical Recombination)" ;
    drgn:status "Approved" .

Page 42: The Progress on Sagace and Data Integration

Let’s try it, and give me your ideas!

• RDF will spread to many kinds of data in life science.

• NBDC encouraged this movement.

Page 43: The Progress on Sagace and Data Integration

Future Perspective

• RDFize other databases in NIBIO

– E.g. bioresource

• Examine the benefit

• Spread RDF to many scientists

• Make useful environments for those who are not familiar with computers