Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data? Antony Williams, Wolfram Summit, September 2010

Upload: antony-williams-chemconnector-orcid-0000-0002-2668-4821

Post on 26-Jun-2015


DESCRIPTION

This presentation was given at the Wolfram Data Summit in Washington, DC, on September 9, 2010, as part of a panel series of presentations and discussions on crowdsourcing approaches for data. It was my rant about the quality of the chemistry data that is online, and it asked "who cares?"

TRANSCRIPT

Page 1: Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data?

Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data?

Antony Williams
Wolfram Summit, September 2010

Page 2:

A Pragmatic Vision

“Build a Structure Centric Community”

Integrate chemistry across the internet based on “chemical structure”

A “structure-based hub” to information and data
Let chemists contribute their own data
Allow the community to curate/correct data

Page 3:

www.chemspider.com

Page 4:

We Answer Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Aspirin?
What is the NMR spectrum of Benzoic Acid?
What are the safety handling issues for toluene?

Page 5:

Search for a Chemical…by name

Page 6:

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 7:

Available Information…

Page 8:

Search for chemicals

Page 9:

ChemSpider Today

24.8 million structures
400 data sources
Grows daily
Community annotation and curation

We curate, edit, change, enhance data daily

Page 10:

Linked Data on the Web

Page 11:

Three Years of Experience

Internet-based chemistry is a mess!

Most public compound databases on the web are contaminated. Including ours!

The annotation/curation of data online is difficult

Most database hosts are non-responsive to feedback – “We are a host/repository of data”

Who cares?

Page 12:

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 13:

What is the Structure of Vitamin K?

Page 14:

MeSH – Medical Subject Headings

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

Page 15:

What is the Structure of Vitamin K1?

Page 16:

What is the Structure of Vitamin K1?

Page 17:

Chemical Abstracts “Common Chemistry” Database

Page 18:

Wikipedia

Page 19:
Page 20:

Incorrect Structures

Page 21:

Wow!

Page 22:

Lack of Stereochemistry

Page 23:

Does stereochemistry matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Page 24:
Page 25:
Page 26:
Page 27:

PubChem

Page 28:
Page 29:

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
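
The first four name variants above differ only in the R/S labels at positions 7 and 11. As a rough illustration of why one compound can spawn so many records, the sketch below enumerates every fully specified label combination for a given set of stereocentres. The function name and its parameters are hypothetical, and the double bond is assumed to be fixed as E.

```python
from itertools import product


def stereo_label_variants(centres, double_bond="E"):
    """Return every fully specified descriptor for the given stereocentres.

    `centres` is a list of locants (assumed here: 7 and 11); each centre
    can be labelled R or S, so n centres give 2**n descriptors.
    """
    variants = []
    for labels in product("RS", repeat=len(centres)):
        parts = [double_bond]
        parts += [f"{locant}{label}" for locant, label in zip(centres, labels)]
        variants.append("(" + ",".join(parts) + ")")
    return variants


variants = stereo_label_variants([7, 11])
for descriptor in variants:
    print(descriptor)
```

Two centres already give four fully specified descriptors. The partially specified and unspecified names in the list come on top of those, which is how a single search term can end up with many distinct entries.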

Page 30:

ChEBI – Manual Curation

Page 31:
Page 32:
Page 33:
Page 34:

What’s Methane?

Page 35:

What’s Methane?

Page 36:

What ELSE is Methane???

Page 37:

The EXPERTS must get it right?!

Page 38:

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Page 39:

Internet-Based Chemistry is a Mess

Algorithms can only get you so far

Human curation is necessary

Only the crowds can help with big data…

ChemSpider is approaching 25 million compounds

Page 40:

Search “Vitamin H”

Page 41:

Search “Vitamin H”

Page 42:

“Curate” Identifiers

Page 43:

“Curate” Identifiers

Page 44:

“Curate” Identifiers

Page 45:

“Curate” Identifiers

General curation activities
Remove incorrect names
Correct spellings
Add multilingual names
Add alternative names

In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually

130 people have participated in validation or annotation. “Crowds” can be quite small!
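
The curation activities listed above can be pictured as operations on a per-compound synonym list. The sketch below is a minimal illustration only: `SynonymList`, `normalize_name`, and the matching rule (ignore case and extra whitespace) are assumptions made for the example, not a description of how ChemSpider stores or validates identifiers.

```python
from dataclasses import dataclass, field


def normalize_name(name: str) -> str:
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(name.split()).casefold()


@dataclass
class SynonymList:
    """Hypothetical synonym list for one compound record."""
    names: dict = field(default_factory=dict)  # normalized key -> display form

    def add(self, name: str) -> bool:
        """Add a name; return False if an equivalent name is already present."""
        key = normalize_name(name)
        if key in self.names:
            return False
        self.names[key] = " ".join(name.split())
        return True

    def remove(self, name: str) -> bool:
        """Remove a name judged incorrect; return True if it was present."""
        return self.names.pop(normalize_name(name), None) is not None


# Example record for aspirin
synonyms = SynonymList()
synonyms.add("Aspirin")
synonyms.add("Salicylic acid")                     # a different compound, so incorrect here
duplicate_added = synonyms.add("  aspirin ")       # same name, different spacing and case
synonyms.add("Acide acétylsalicylique")            # multilingual name (French)
removed = synonyms.remove("salicylic ACID")        # curator removes the incorrect name

print(sorted(synonyms.names.values()))
```

An automated pass like this only catches trivial duplicates. Deciding that "Salicylic acid" does not belong on the aspirin record still takes a person who knows the chemistry.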

Page 46:

Crowdsourced “Annotations”

Registered users can add:
Descriptions/Syntheses/Commentaries
Links to articles, blogs, wikis, etc.
Spectral data
Photos
MP3 files
Videos

Page 47:

Data Validation – Not Vitamin K1

Page 48:

Data Validation – Not Beclomethasone Dipropionate

DailyMed Article

Page 49:

Data Validation – NOT Cholesterol

Page 50:

Data Validation – ONE Cymarin

Question Quality in Big Databases

Page 51:

First request to Database Hosts!

Every public compound database host should add ONE feature – “Leave Comments”

Page 52:

Second request to Database Hosts!

Show Comments
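
The two requests above ask for very little. As a rough sketch of how small the feature could be, the example below attaches comments to a compound record and lists them back. All class, field, and method names are hypothetical, and the record ID is a placeholder, not a real database identifier.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Comment:
    """One piece of reader feedback on a record (hypothetical schema)."""
    author: str
    text: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class CompoundRecord:
    """Minimal stand-in for a database entry that accepts feedback."""
    record_id: str
    name: str
    comments: list = field(default_factory=list)

    def leave_comment(self, author: str, text: str) -> Comment:
        """First request: let any reader attach a comment to the record."""
        text = text.strip()
        if not text:
            raise ValueError("comment text must not be empty")
        comment = Comment(author=author, text=text)
        self.comments.append(comment)
        return comment

    def show_comments(self) -> list:
        """Second request: display the comments alongside the record."""
        return [f"{c.author}: {c.text}" for c in self.comments]


record = CompoundRecord(record_id="example-1", name="Vitamin K1")
record.leave_comment("curator1", "Depicted structure lacks stereochemistry.")
print(record.show_comments())
```

Even without any moderation or correction workflow, visible comments warn the next reader that a record has been questioned.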

Page 53:

Always Question Online Chemistry

Page 54:

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams