ChemSpider – A Platform to Gather, Host and Integrate Structure Based Data Across the Web

Upload: antony-williams-chemconnector-orcid-0000-0002-2668-4821

Post on 27-May-2015

DESCRIPTION

ChemSpider was developed to aggregate and index available sources of chemical structures and their associated information into a single searchable repository, and to make it available to everybody at no charge. There are many tens of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and more, and no single way to search across them. Despite the diversity of databases available online, their inherent quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of well over 20 million chemical substances integrated with over 300 disparate data sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry and to provide access to a set of online tools and services to support access to these data. I will also discuss how ChemSpider is being used to enhance Semantic Publishing in Chemistry at RSC.

TRANSCRIPT

Page 1

ChemSpider – A Platform to Gather, Host and Integrate Structure Based Data Across the Web

Page 2

Declaration

ChemSpider does NOT do toxicity prediction, yet
We are building a content database for you to use

What ChemSpider does can be invaluable to those who do toxicity prediction:
Find “correct” chemical structures
Find associated data (experimental/predicted)
Link out to rich sources of information online
Engage the community in sharing data

Page 3

A Pragmatic Vision in 2006

“Build a Structure Centric Community”

December 2006 – A project initiated to connect chemistry on the web

Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data

Page 4

Three Years of Experience

Internet-based chemistry is a mess!

Most public compound databases on the web are contaminated. Including ours!

The annotation/curation of data online is difficult

Most database hosts are non-responsive to feedback – “We are a host/repository of data”

Who cares?

Page 5

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 6

What is the Structure of Vitamin K?

Page 7

MeSH – Medical Subject Headings

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

Page 8

What is the Structure of Vitamin K1?

Page 9

What is the Structure of Vitamin K1?

Page 10

Chemical Abstracts “Common Chemistry” Database

Page 11

Wikipedia

Page 12
Page 13

Incorrect Structures

Page 14

Wow!

Page 15

Lack of Stereochemistry

Page 16

Does stereochemistry matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Page 17
Page 18
Page 19

Comparative Toxicogenomics Database

Page 20
Page 21

PubChem

Page 22
Page 23

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
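The name variants above differ only in how much stereochemistry they specify. As a rough illustration (not something the slides describe), the Python sketch below pulls the stereodescriptor prefix out of each truncated name fragment and tallies the distinct levels of specification. The regular expression and the helper names `stereo_descriptors` and `summarize` are assumptions made for this example, and the name fragments are used exactly as truncated on the slide.

```python
import re
from collections import Counter

# Name fragments exactly as they appear (truncated) on the slide.
NAMES = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Assumed pattern: a stereodescriptor block looks like "[(E,7R,11R)".
STEREO_BLOCK = re.compile(r"\[\(([^)]*)\)")


def stereo_descriptors(name):
    """Return the stereodescriptors in a name as a tuple, or () if none are given."""
    match = STEREO_BLOCK.search(name)
    return tuple(match.group(1).split(",")) if match else ()


def summarize(names):
    """Count how many names carry each distinct set of stereodescriptors."""
    return Counter(stereo_descriptors(name) for name in names)


if __name__ == "__main__":
    for descriptors, count in summarize(NAMES).items():
        label = ",".join(descriptors) if descriptors else "(no stereochemistry)"
        print(f"{label}: {count}")
```

Run on the fragments above, this separates fully specified, partially specified and unspecified variants, which is the distinction the slide is drawing.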

Page 24

ChEBI – Manual Curation

Page 25
Page 26
Page 27

What’s Methane?

Page 28

What’s Methane?

Page 29

What ELSE is Methane???

Page 30

The EXPERTS must get it right?!

Page 31

Wikipedia, C&E News, PubChem C&E News (from ACS)

Page 32

Online Datasets

Page 33

Online Datasets

Page 34

Online Datasets

Page 35

Online Datasets

Page 36

Online Datasets

Page 37

Online Datasets

Page 38

What Sources Do You Trust?

Page 39

QSAR World

Page 40

Online Datasets

The dataset for QSAR appears to have been generated with Name-to-Structure algorithms

Many systematic errors in the data – non-curated?

Using such data for modeling is risky

Page 41

Online Datasets

Page 42

Internet-Based Chemistry is a Mess

Algorithms can get you only so far in data cleaning

Human curation is necessary

Only the crowds can help with big data…

But, if we DID have a highly curated dataset…
Reference database/dictionary of chemicals
High quality data for modeling
Centralized repository for models/data?

Page 43

www.chemspider.com

Page 44

We Answer Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Aspirin?
What is the NMR spectrum of Benzoic Acid?
What are the safety handling issues for toluene?

Page 45

Search for a Chemical…by name

Page 46

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 47

Available Information…

Page 48

Search for chemicals

Page 49

ChemSpider Today

24.8 million structures
400 data sources
Grows daily
Community annotation and curation

We curate, edit, change, enhance data daily

Page 50

Search “Vitamin H”

Page 51

Search “Vitamin H”

Page 52

“Curate” Identifiers

Page 53

“Curate” Identifiers

Page 54

“Curate” Identifiers

Page 55

“Curate” Identifiers

General curation activities:
Remove incorrect names
Correct spellings
Add multilingual names
Add alternative names

In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually

130 people have participated in validation or annotation. “Crowds” can be quite small!
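The slides do not say what the robotic validation checks consist of. Purely as an illustration of the kind of identifier check that can be automated, the Python sketch below verifies the check digit of a CAS Registry Number, a widely used chemical identifier. The function name `is_valid_cas` is an assumption for this example, and nothing here is claimed to be ChemSpider's own validation procedure.

```python
import re

# A CAS Registry Number has the form NNNNNNN-NN-C: two to seven digits,
# then two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one,
    # sum the products, and compare the total modulo 10 to the check digit.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


if __name__ == "__main__":
    for candidate in ["50-78-2", "7732-18-5", "50-78-3", "aspirin"]:
        print(candidate, is_valid_cas(candidate))
```

A check like this only catches malformed or mistyped numbers. It cannot tell whether a valid number has been attached to the wrong structure, which is where manual curation is still needed.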

Page 56

Crowdsourced “Annotations”

Registered Users can add
Descriptions/Syntheses/Commentaries
Links to articles, blogs, wikis etc
Add spectral data
Add photos
Add MP3 files
Add Videos

Page 57

Data Validation – ONE Cymarin

Question Quality in Big Databases

Page 58

Data Validation – Cortisol

Page 59

Data Validation in Databases

InChIKey first block    Count    Name (where given)
ADNPLDHMAVUMIW          509
WQZGKKKJIJFFOK          119
RUDATBOHQWOJDD          118      Ursodeoxycholic acid
GUBGYTABKSRVRQ          89       Lactose
BHQCQFFYRZLCQQ          80       Cholic acid
RCINICONZNJXQF          76       Taxol
KXGVEGMKQFWNSR          73       Deoxycholic acid
PXGPLTODNUVGFL          71
HVYWMOMLDIMFJA          69
QGXBDMJGAMFCBF          63
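The 14-letter codes on this slide have the form of the first block of an InChIKey, which encodes a molecule's connectivity without its stereochemistry. The slide does not say what the counts mean. One plausible reading is that each count is the number of distinct records sharing the same skeleton. Under that assumption, the Python sketch below shows how such a tally could be produced. The record list, the placeholder keys and the function names `skeleton` and `count_by_skeleton` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (record id, InChIKey) pairs; the keys are placeholders,
# not real InChIKeys.
RECORDS = [
    (1, "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    (2, "AAAAAAAAAAAAAA-CCCCCCCCCC-N"),
    (3, "AAAAAAAAAAAAAA-CCCCCCCCCC-N"),  # exact duplicate of record 2's key
    (4, "DDDDDDDDDDDDDD-BBBBBBBBBB-N"),
]


def skeleton(inchikey):
    """Return the first block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def count_by_skeleton(records):
    """Count the distinct full keys that share each first block.

    Returns (first block, count) pairs sorted by descending count,
    then alphabetically by first block.
    """
    groups = defaultdict(set)
    for _record_id, key in records:
        groups[skeleton(key)].add(key)
    counts = [(block, len(keys)) for block, keys in groups.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for block, count in count_by_skeleton(RECORDS):
        print(block, count)
```

A high count for one skeleton flags a compound whose stereochemistry is represented in many different ways across sources, which makes it a natural candidate for curation.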

Page 60

First request to Database Hosts!

Every public compound database host should add ONE feature – “Leave Comments”

Page 61

Second request to Database Hosts!

Show Comments

Page 62

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

Page 63

What is a compound?

Page 64

The InChI Identifier

Page 65

Linking and Modeling Bad Data

What is the value of linking bad data?

How can we model suspect data efficiently?

Commonly data are incorrect:
Measured data are suspect
Structures associated with data are not correct
Identifiers are incorrectly associated

Page 66

Properties on the Database

Page 67

Properties on the Database

Page 68

Properties on the Database

Page 69

Linked Out to Resources

Page 70

Properties Linked Off the Database

Page 71

SimBioSys LASSO

LASSO
uses 23 kinds of Interactive Surface Point Descriptors and is conformation independent
screens at 1 million structures/min
is proven to enrich screened databases
provides scaffold hopping

Hbond Donors (5 kinds)
Acceptors (5 kinds)
Ambivalent H donor/acceptor
Aromatic Pi-stacking (5 kinds)
Hydrophobic (3 kinds)
Metal ions
Misc (Sulfur, Halogens)

http://dx.doi.org/10.1007/s10822-007-9164-5

Page 72

SimBioSys LASSO

Page 73

LASSO Linked Out

Page 74

Present Activities

Enhancing data model to manage more experimental properties – data available for download and modeling

Developing relationships with other software vendors and model developers for integration

Curating QSARWorld datasets for deposition

Page 75

ChemSpider Tomorrow

6 months: >1.2M compounds/month
6 months: >800,000 new uniques
6 months: >60 new data sources added

Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry – RDF

Page 76

Future Activities – Data Management

Page 77

Future Activities – Data Management

Aggregating and managing data from publications

Specifically aggregating:
Data from MedChemComm
Reaction Data (SyntheticPages)
Spectral Data

Page 78

Access Data Through Web Services

Page 79

Mobile Data Access

Page 80

The Future of Linked Chemistry on the Internet?

Public compound databases federate to build a truly linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
RDF is everywhere

Page 81

ChemSpider & Toxicity Prediction

Continue the curation effort and keep cleaning
Web services allow integration and data download
Presently collaborating with groups to provide access to data for modeling
Intention is to provide the highest quality online database with associated data

Page 82

Community Contribution and Innovation

“Community contribution” best practice award

i-Expo Innovation Award: June 2010
ALPSP Innovation Award: September 2010

Page 83

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams