future biology is semantic information...

53
Future biology is Semantic information science Tetsuro Toyoda, Ph.D. Director of Bioinformatics And Systems Engineering Division, RIKEN, Japan 1 Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. www.base.riken.jp

Upload: others

Post on 12-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Future biology is Semantic

information science

Tetsuro Toyoda, Ph.D.

Director of Bioinformatics And Systems Engineering Division, RIKEN,

Japan 1

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. www.base.riken.jp

Nature of Life=Information

Nature of Replication

(Self-replication)

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Life evolving

through replications and selections

Replication of genome information (DNA replication)

Various replications (Fertilized eggs)

Various groups of individuals

Natural Selection

Evolution Cycle (About 3.8billion years)

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Genome is blueprint of Life (Genetic Information )

Genomic information is recorded by combined DNA blocks. DNA blocks are structured by only 4 kinds of blocks: A, T, G, and C. To map sequences of DNA blocks is Genome sequencing. Sequenced genomic information is represented by character information as shown below. ATGCTAGTCGCGTGCTAGTCGATCTCGTGCA・・・・・・ Human’s genome represented by about 3 billion letters.

Image of DNA blocks

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Biological diversity resulting from diversity of Genomic information

Sequences of DNA blocks evolve ↓means Information is evolved (Biological diversity)

DNA of human and microorganisms are consisted of the same 4 kinds of DNA blocks (G,A,C, and T). ↓ Any living organism hasn’t evolved as a material. (Common among living organisms)

Informatics viewpoint

Material viewpoint

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Phylogenetic Tree of Life

No matter the Information, Copy and Selection is the Driving Force of evolution

Phylogenetic Tree of

Open-Source Software

Information

Evolution of Open-source

programs undergo the same

Copy and Selection

data evolution process

as the Genome.

Previously, Digital and Genomic Information were separate

Recording media for information Expression of information (phenotype)

Digital In

form

ation

G

eno

mic In

form

ation

Cell

DNA

Copied DNA (Copied information)

CD、DVD

Copied CD (Copied information)

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Digital and Genomic Information merged with introduction of machine decoding of genomic

information from DNA

Digital

Info

rmatio

n

Gen

om

ic In

form

ation

DNA

CD, DVD

DNA Sequencer

DNA Synthesizer

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

“Intelligent” Cloud with DNA information

DNA Sequencer

DNA Synthesizer

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Could be a New Realm of Evolution

Natural Form of Evolution

SciNetS

Infrastructure for Collaboration

Part 2

Bioinformatics And Systems Engineering Division, RIKEN,

Japan

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. scinets.org

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

SciNetS Purpose, Methods and System

SciNetS: Science for the Planet and Society

SciNetS: Science, Innovation, Education, Technology, Synergy

SciNetS: Scientist's Networking System

What is SciNetS?

What is SciNetS? Cloud Computing

and Collaboration

sinets.org

As an integrated database, SciNetS provides a versatile database container compatible with international standards. Using the semantic web, SciNetS provides a „total incubation infrastructure system‟ for life science-related Virtual laboratories.

The Cloud as a Center for Virtual Laboratory Collaboration

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. sinets.org

1 10 100 1,000

10,000

100,000

1,000,000

Number of databases to be integrated

Number of databases hosted

by RIKEN SciNeS

10,000

Hosting databases separately

has a limitation

Number of databases (Virtual Labs) hosted

is larger than the number of servers in the

system.

Databases > Server computers > System

The many database projects provided by SciNetS

→ SciNetS is expected to function as a new academic

medium for publishing individual databases as entire

sets of research results.

Available at

http://database.riken.jp

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Single system hosts numerous databases

Scale out

sinets.org

Total file size: 4.5 Terabytes

Total data files: 120 Million

Public databases: 192

incl. DBs for mammals 23

DBs for plants 29

DBs for protein 17

Private databases: 190

Data records (instances): 7.5 Million

Semantic relationships: 26 Million

SciNetS Data Cloud

sinets.org

Databases contain meta data, a Semantic Web

archives

“Meta data” (RDF)

includes semantic

links among data

items and

resources

“Data resources”

include

experimental

results or statistical

results

Mouse Phenotype Database

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. sinets.org

Upper Ontology

Gene Allele System Phenotype

ENU

Allele

BRC

Mouse

Cell

Collaborating on Ontology Basis for Classification of Databases in the Cloud

sinets.org Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Illustration by

Hiroshi Masiya

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

A Semantic web of Mouse databases

sinets.org/db/mammal

Visualizing the Semantic Web

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. sinets.org/db/mammal

InterPhenome: International Data Sharing of Mouse Phenotypes

http://www.interphenome.org/

SciNetS

for Life Sciences

Part 3

Bioinformatics And Systems Engineering Division, RIKEN,

Japan

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

SciNetS Purpose, Methods and System

for Life Sciences

SciNetS: Science, Innovation, Education, Technology, Synergy

SciNetS: Scientist's Networking System

“Superbrain” solving biological problems by simulating artificial brain understanding the knowledge of biology

Insight

RIKEN Super Petaflop Computer

Superbrain

Collective

Intelligence

Limit of general search through database

Proposition: “My job is a jail”

job jail

Boolean understands that job ≠ jail

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Open-endedness in the brain

Proposition: “My job is a jail”

job jail

Restriction?

Brain tries to understand beyond the

framework of general concept Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Developing a brain type database(Superbrain)

↑ All calculations take within 1 second →Realization of high-speed web research

Research Result

Display the hit

document by

rounding up concept

units such as

ontology class,

candidate gene, etc.

Gene

1

Gene

3

Gene

5

Keyword Search by user

keywords (Diabetes)

Conditions (chromosomal region)

hit

hit

hit

Gene

1

Gene

4

Momentary

inference !

Documents including

Omics information,

literature, etc.

~10 million Documentron

Concepts Unit

~100 thousands

neuron

Document file for browse

~10 million

Documentron

Inference layer

(attractor)

Concept layer

(statistical test) Expression

layer (Document

summary )

Input layer (text search)

Out put Layer (web Service)

Concepts Unit

~100 thousands

neuron

Large scale system which includes over 10 million documents and concepts for each neuron.

New high speed technology calculates tens of thousands of spreadsheets

↑based on this table, statistical inferences such as calculation of statistic, Fisher’s exact test, can be done.

Gene X

hit

hit

hit

The technology, creating 3-dimension spreadsheets

for every gene (X=1,2,,,several tens of thousands of )

at once.

各種概念とキーワードとの関連性を網羅的かつ、高速・統計的に評価できる

Including

search word

Excluding

search word

Include

Gene X

“A “units of

Documents

“B “units of

Documents

Exclude

Gene X

“C “units of

Documents

“D “units of

Documents

Documentation files including literatures

(~10 millions)

Gene name , concept units, etc.

(~100 thousands)

Correspondence relationship

which curators associated

genes with documents.

Classifications by types of

document database

(Medline, OMIM, PPI,

Bio resources, etc.)

Keyword search by

users.

4 ↑ All calculations within 1 second →Realization of high-speed web research

(Calculating signal processing among neurons on brain-type database)

Ranking results of

candidate genes↓ Keyword : type 2 diabetes

Genome Condition: gene of no.6 chromosome

Interval

Inference of candidate gene

Rough mapping based on crossing

experiment

(~10Mbp wide interval)

Too many genes

Searching for candidate genes by

positional cloning.

Identification of causative gene variation

So far contributed to 65 achievements of searching researches on disease-causing genes.

Conditions of

compound retrieval

Search example: Inference of causative gene by positional cloning.

http://omicspace.riken.jp/PosMed/

PosMed: “Positional MEDLINE” assists your positional-cloning studies

SciNetS

for the Planet and Society

Part 4

Bioinformatics And Systems Engineering Division, RIKEN,

Japan

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. genocon.org

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

SciNetS Purpose, Methods and System

for the Planet and Society

SciNetS: Science for the Planet and Society

SciNetS: Science, Innovation, Education, Technology, Synergy

SciNetS: Scientist's Networking System

Fossil Resources Biomass

Petrochemistry

process

Energy Chemical

Products Materials

Distillation

He

avy O

il

Gas o

il

Kero

sene

Ga

so

line

Nap

hth

a

Bioprocess

Syngas

Oil, e

tc

Sugar

Development

of

Innovation

Technology

21st Century - Sustainable Society

~ Carbon-neutral ~ 20th Century - Consumptive Society

Creation of

New Industry

CO2

Gly

co

syla

tion

CO

2

CO2

CO2

Bio

Materials Chemical

Products

Crude

oil

Energy

Applications

of

Renewable

Resource

Plants

↓ Search for resources Development of oilfield (Fossil resources)

↓ Search for resources Development of genome (Biological Resource)

Sustainable society supported by Green Biotechnology

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Toward an age of Rational Design using genomic data.

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Creating bio resources from information

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

genocon.org

GenoCon International Rational Genome

Design Contest offers contestants the chance to compete in technologies for rational genome design using a browser-based programming environment provided at http://genocon.org

GenoCon Open platform for rational Genome design

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Optimize genome design for plants that absorb and resolve formaldehyde to prevent sick house syndrome.

• Introduce two enzymes, extracted

from bacteria, into Calvin cycle of

plants

• Enhancement of formaldehyde-

resistance by shepherd's-purse and

Tobacco.

sh

ep

he

rd„s

-purs

e

To

ba

cco

AB: Gene introduction strain

CK: Control

PHI

HPS Hexulose 6-phosphate synthase

Hexulose 6-phosphate isomerase

Introduced

enzymes Patent applied by Prof. Izui,

Kyoto/Kinki University

Formaldehyde tolerance designed by Prof. Izui is the inaugural challenge in GenoCon

34

34 genocon.org

SciNetS provides a web-based programming interface for a user to process each data item in our database cloud, where all the programs and files the user has created are stored.

Programming on a web browser

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

genocon.org

Creating Bio Resources from Information Resources

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. genocon.org

SciNetS

Semantic JSON

PowerTool

Part 5

Bioinformatics And Systems Engineering Division, RIKEN,

Japan

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. semantic-json.org

Short URL services such as http://bit.ly are often used in social media such as twitter. These offer similar domain mapping services by mapping external URLs under various domain names to internal URLs of a uniform domain with a short internal ID that corresponds to the external one in the repository.

During semantic inference, a user gains access to a Semantic-JSON repository by invoking Semantic-JSON URIs, consisting of an internal ID and a command name, along with step-by-step receipt of a fragment of Semantic Web data in the JSON format.

Semantic-JSON: Uniform Domain Mapping

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Semantic-JSON upgrades the mapping service to have semantic relationships among the internal IDs that can be used to accelerate semantic communications in social media of the future.

domain/json/command/internalID

bit.ly/internalID

semantic-json.org

Sample Operation in Mathematica

Load the Semantic-JSON package

Set the server’s URL

Obtain labels of two individuals

Obtain the shortest paths between two individuals

Similarly, Semantic-JSON is available in popular languages including Java, Ruby, Perl, Python.

Semantic-JSON supports most languages used by Bioinformaticians

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. semantic-json.org

provided by Semantic-JSON for user to process each data item in our database cloud where the user stored all the programs and files they created

Web-based programming interface

semantic-json.org

Semantic JSON Application Snapshot of the Plant Ontology tree viewer embedded in SciNetS.org.

When the user opens the web page the Tree structure is programmably drawn using Semantic-JSON and published into human readable content.

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

Ontology

Browser semantic-json.org

Semantic-JSON: API to access SemanticWeb data with JSON

Please Visit http://semantic-json.org

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. semantic-json.org

SemanticTable

PowerTool

Part 6

Bioinformatics And Systems Engineering Division, RIKEN,

Japan

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. SematicTable.org

Current Cloud Table Applications don’t support Semantic Web, so can not be used for Life-Sciences

• Current Table applications only use simple Semantic Data, so do not harness the power of the Semantic web + Ontology

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

http://dabbledb.com

http://socrata.com

http://factual.com

http://google.com/fusiontables

Introduction >> Biological data formats used in life science research .

• Biologists like to use the table format for their research results

• Since most biological research results are essentially structured kinds of data, however, it is appropriate to represent these research results as Semantic Web graphs consisting of subject-property-object triples.

>> Characteristics of table data and semantic data .

SematicTable.org

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. SematicTable.org

What is the Semantic Table

• SciNetS SemanticTable is a data container with advantages of both tables and Semantic Web graphs: 1. It provides a table view for biological data convertible to TSV format (tab-separated-value) text table data popular among biologists for computer processing.. 2. It is convertible to explicit Semantic Web graph such as RDF that a computer can read and process intelligently.

What is the SciNetS Semantic Table?

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. SematicTable.org

Explicit Correspondence Semantic

Table and Semantic Web Graph

This table presents implicitly the

following semantic meanings, which

may be understood by people, but not

by computers

Yellow row - Table Header

(Class or property name of data)

Blue Row - Table Data

Green Column - Subjects

Red column - Objects

- The first row means:

“The Person named Michael is 23 years

of age and his gender is Male”

- The second row means: “ The Person

named Jennifer is 27 years of age and

her gender is Female”

For Intelligent processing by a computer,

the table data needs to be presented

with explicit semantic data as follows in

computer- readable Semantic Web

graph format.

Correspondence between Semantic Table and Semantic Web Graph

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. SematicTable.org

SemanticTable uses Ontology to represent Semantic Web data about the considered subject

Resource

B1

Resource

B2

Resource

B3

Resource

B4

Predicate

Predicate

Predicate

Predicate

Resource

C1

Resource

C4

Predicate

Predicate

Resource

C2 Predicate

Resource

C4 Predicate

Resource

A1

Resource

A2

Table about Class B

Table about Class C

Table about Class A

Class A Class B Class C

Resource

A3

SematicTable.org Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

SemanticTable for Life Sciences

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

By adding Ontology on a huge scale, we can apply SemanticTable to the life sciences

SematicTable.org

About SemanticTable.org • A "Table" is the most common type of data format used in biological research. The Semantic Web, on the other hand, where knowledge is represented by statements comprising a subject, predicate and object, is also becoming a major container for data accessible by automatic computational reasoners; i.e. computers. Here we present "SemanticTable.org", a data-mart server providing a huge amount of biological semantic-web data as simple table-style data files that are reversely convertible to the semantic-web format.

The SemanticTable.org server contains 644 tables and 7.5 million rows of SemanticTable data as of December 2010. Conversion from table data files to semantic-web data files and vice versa are supported by our web site (http://semantictable.org) and the World Wide Web Consortium (W3C) web site RDFa Distiller and Parser page at (http://www.w3.org/2007/08/pyRdfa/).

• We propose SemanticTable, based on the standard RDFa specification, as the most understandable way for biologists not familiar with semantic-web specifications to utilize semantic-web data.

About Semantic Table

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. SematicTable.org

Using Semantic data could allow

automatic drag and drop field

rearrangement by a program or

browser running on a PC.

Semantic Table Merging

SemanticTable Merge Program

Features: ・Merging Multiple Class Semantic Tables

・Subject Column Swapping

・Dumping to Text

SemanticTable

RDF

SemanticTable

RDF

SemanticTable

join

Merging SciNetS Semantic Tables

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. SematicTable.org

Join the SemanticTable Consortium

• Call for developers to join the consortium to create and author such programs;

the fundamental semantic groundwork is ready to use.

• Open collaboration for the standard will jumpstart the new platform.

• Open source programs for SemanticTable run on cloud-computing technology,

as do the programs for spreadsheet tables.

• We hope the SemanticTable format will be distributed broadly and freely for

general use, as well as for scientific use.

Bioinformatics And Systems Engineering (BASE) division,

RIKEN Yokohama Institute of Chemical and Physical Science

SciNetS.org Semantictable.org

[email protected]

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011.

End

Thank you very much!

Bioinformatics And Systems Engineering Division, RIKEN,

Japan

Copyright © Tetsuro Toyoda, March 4, 2011. Presentation at Semantic Web Conference 2011. www.base.riken.jp