The Expansive Reach of ChemSpider as a Resource for the Chemistry Community
DESCRIPTION
Our access to scientific information has changed in ways that were hardly imagined even by the early pioneers of the internet. The immense quantities of data and the array of tools available to search and analyze online content continue to expand, and the pace of change does not appear to be slowing. ChemSpider is one of the chemistry community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. It also serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of the ChemSpider platform and the nature of the solutions that it helps to enable. We will also discuss the possibilities it offers in the domain of crowdsourcing and open data sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community and facilitated collaboration, and will ultimately accelerate scientific progress.
TRANSCRIPT
The Expansive Reach of ChemSpider as a Resource for the Chemistry Community
Antony Williams
University of Oregon, April 24th 2013
The World of Online Chemistry
• Property databases
• Compound aggregators
• Screening assay results
• Scientific publications
• Encyclopedic articles (Wikipedia)
• Metabolic pathway databases
• ADME/Tox data – eTOX for example
• Blogs/Wikis and Open Notebook Science
We Have … Too Much Data!!!
e-Science and Primary Data
• How much data generated in a lab, that COULD go public, is lost forever?
TotallySynthetic.com
e-Science and Primary Data
• How much data generated in a lab, that COULD go public, is lost forever?
• Public Domain reference databases of value?
– Syntheses
– Properties
– Spectra
– CIFs
– Images
Collaborative Knowledge Management
e-Science and Primary Data
• How much data generated in a lab, that COULD go public, is lost forever?
• Public Domain reference databases of value?
– Syntheses
– Properties
– Spectra
– CIFs
– Images
• Much of chemistry is chemical structure-based – where and how could we host these data?
RSC’s ChemSpider
Crowdsourced “Annotations”
• Users can add
– Descriptions/Syntheses/Commentaries
– Links to PubMed articles
– Links to articles via DOIs
– Spectral data
– Crystallographic Information Files
– Photos
– MP3 files
– Videos
Spectra
Chemistry Data online is messy
• We have inherited errors
• All public compound databases, including ours, have errors
• “Incorrect” structures – assertions, timelines etc
• “Incorrect” names associated with structures
• Properties
• Links
• Publications
• ENORMOUS CHALLENGE
The Structure of Vitamin K?
MeSH
• A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
The Structure of Vitamin K1?
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
• Variants of systematic names on PubChem
– 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
– 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
– 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
– 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
– 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
– 2-methyl-3-[(E)-3,7,11,15-tetramethyl
– 2-methyl-3-(3,7,11,15-tetramethyl
– 2-methyl-3-[(E)-3,7,11,15-tetramethyl
Question Everything online: www.dhmo.org
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
With Great Fanfare…
NPC Browser http://tripod.nih.gov/npc/
Public Domain Databases
• Our databases are a mess…
• Non-curated databases are proliferating errors
• We source and deposit data between databases
• Original sources of errors hard to determine
• Curation is time-consuming and challenging
Stop Whining – Fix it
Crowdsourced Curation
• Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
Standards : Structure Standardization
The InChI Identifier
Multiple Layers
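The layered structure of an InChI string can be illustrated with a short sketch. This is not the official InChI software: it only splits an existing string on "/" and relies on the fact that every layer after the formula starts with a lowercase prefix letter. The function name split_inchi_layers is hypothetical, and the result is a list of pairs because some prefixes can recur in later parts of a string.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into (label, value) pairs, in order.

    Illustrative only: real InChI strings are produced and interpreted
    by the InChI software; this sketch just exposes the layering.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = [p for p in inchi[len(prefix):].split("/") if p]
    if not parts:
        raise ValueError("empty InChI string")
    layers = [("version", parts[0])]
    rest = parts[1:]
    # The formula layer is the only one without a lowercase prefix letter.
    if rest and not rest[0][0].islower():
        layers.append(("formula", rest[0]))
        rest = rest[1:]
    for part in rest:
        layers.append((part[0], part[1:]))
    return layers


if __name__ == "__main__":
    # Ethanol: version, formula, connectivity (c) and hydrogen (h) layers.
    for label, value in split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"):
        print(label, value)
```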
InChIStrings Hash to InChIKeys
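An InChIKey is a fixed-length, 27-character hash of the InChI string. The sketch below does not compute the hash (that is the job of the InChI software); it only splits an existing key into its blocks, assuming the standard layout: a 14-letter block derived from the connectivity, an 8-letter block derived from the remaining layers, a standard/non-standard flag, a version letter and a protonation letter. The function name split_inchikey is hypothetical.

```python
import re

# 14 letters, hyphen, 8 letters + flag (S or N) + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its named blocks (no hashing is performed)."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of the connectivity
        "detail": detail,            # hash of the remaining layers
        "standard": flag == "S",     # S = standard InChI, N = non-standard
        "version": version,
        "protonation": protonation,
    }


if __name__ == "__main__":
    # InChIKey of methane (InChI=1S/CH4/h1H4).
    print(split_inchikey("VNWKTOKETHGBQD-UHFFFAOYSA-N"))
```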
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
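The difference between the two searches can be sketched with InChIKeys: a skeleton search compares only the first block of the key, while a full-molecule search compares the whole key. The records and keys below are made-up placeholders, not real compounds, and the function search_by_inchikey is a hypothetical local helper rather than any search engine's or ChemSpider's actual API.

```python
def skeleton_block(inchikey):
    """Return the first block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def search_by_inchikey(records, query_key, mode="full"):
    """Filter records by full key or by skeleton block only."""
    if mode == "full":
        return [r for r in records if r["inchikey"] == query_key]
    if mode == "skeleton":
        wanted = skeleton_block(query_key)
        return [r for r in records if skeleton_block(r["inchikey"]) == wanted]
    raise ValueError("mode must be 'full' or 'skeleton'")


# Hypothetical records: the first three share a skeleton but differ in the
# second block (for example, different or missing stereochemistry).
RECORDS = [
    {"page": "page-1", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"page": "page-2", "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N"},
    {"page": "page-3", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
    {"page": "page-4", "inchikey": "DDDDDDDDDDDDDD-UHFFFAOYSA-N"},
]

if __name__ == "__main__":
    query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
    print(len(search_by_inchikey(RECORDS, query, mode="skeleton")), "skeleton hits")
    print(len(search_by_inchikey(RECORDS, query, mode="full")), "full-molecule hits")
```

A skeleton search therefore returns a superset of the full-molecule hits, which fits the larger hit count reported on the slide.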
Validated Name-Structure Dictionaries
• Chemical name dictionaries are used for:
• Text-mining (publications, patents)
– Used to index PubMed and link to Google Patents
• Linking to other databases – think Biology!
– When structures are not available drug names link
• Searching the web
– Names link to structures link to InChIs
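Dictionary-based text mining of the kind listed above can be sketched in a few lines. The dictionary here is a tiny hand-written example that maps synonyms to a preferred name; a validated dictionary would link each name to a structure and its InChI instead. The names SYNONYMS and find_chemical_names are hypothetical.

```python
import re

# Tiny illustrative dictionary: lowercase synonym -> preferred name.
SYNONYMS = {
    "vitamin h": "biotin",
    "biotin": "biotin",
    "marsh gas": "methane",
    "methane": "methane",
}


def find_chemical_names(text, synonyms):
    """Return (matched text, preferred name) for each dictionary hit."""
    # Longest names first so multi-word names win over shorter ones.
    names = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), synonyms[m.group(0).lower()]) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(find_chemical_names("Vitamin H is sold as biotin.", SYNONYMS))
```

The quality of the output depends entirely on the name-to-structure mapping, which is why the dictionary itself needs validation.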
I want to know about “Vincristine”
If all algorithms work then everything on the page is correct by default except the name-structure relationship!
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources Linked by Structure
Vincristine: Patents Linked by Name
Vincristine: Articles Linked by Name
ChemSpider Resources for Chemistry
Micropublishing Syntheses
ChemSpider SyntheticPages
Olympicene
So you Want a Profile???
Interactive Data
PharmaSea
• Dereplication via ChemSpider
• Segregation of natural products datasets
• Analytical data algorithms & integration
– Mass spec searching – predicted fragmentation
– NMR feature searching – NMR prediction
– Computer-assisted structure elucidation
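The simplest form of mass-based dereplication can be sketched as a lookup of an observed monoisotopic mass against known compounds within a ppm tolerance. This is a local illustration only, not the ChemSpider or PharmaSea implementation. The three-compound library, the 5 ppm default and the function names are assumptions, and the observed value is taken to be a neutral mass with any adduct correction already applied.

```python
import re

# Monoisotopic masses of the most abundant isotopes (unified atomic mass units).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def candidates_by_mass(observed, library, ppm=5.0):
    """Return (name, formula, error in ppm) for library entries within tolerance."""
    hits = []
    for name, formula in library.items():
        mass = monoisotopic_mass(formula)
        error_ppm = (observed - mass) / mass * 1e6
        if abs(error_ppm) <= ppm:
            hits.append((name, formula, round(error_ppm, 2)))
    return hits


# Small illustrative library of known compounds.
LIBRARY = {
    "caffeine": "C8H10N4O2",
    "theobromine": "C7H8N4O2",
    "glucose": "C6H12O6",
}

if __name__ == "__main__":
    print(candidates_by_mass(194.0804, LIBRARY))
```

A mass match alone only narrows the candidates, which is where the predicted fragmentation and NMR items in the list above come in.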
It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using semantic web technologies
• Open source code, open data and open standards
• Academics, Pharma companies, Publishers….
ChemSpider Contributions
• The host of the chemistry services
– Supplier of “standardized” chemical data files
– Chemistry searching (structure, substructure etc)
– Provider of data in RDF format
– Curator and data quality checking
• Now building the Open PHACTS chemical registration system
ChemSpider Contributions
• Supplier of chemistry UI components
• “Quality Police” for data checking
• Chemical Validation and Standardization Platform
• Nanopublications from RSC publications
Integrate to instruments and software
• Integration to analytical instrumentation vendors already in place
– Agilent, Bruker, Thermo, Waters
• Also, Cheminformatics vendors link to ChemSpider
– Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…
Natural Products Updates
• Names hard, Structures “Obvious”
• New content based on monthly updates of the database
• Click through to the Natural Products Updates entry
National Chemical Database Service
Chemical Database Service
• National Chemical Database Service for UK Academics
• Integrating Commercial Databases and Services
• Chemicals, analytical data, prediction algorithms
• Development of data repository
Publications - a summary of work
• Scientific publications are a summary of work
– Is all work reported?
– How much science is lost to pruning?
– What of value sits in notebooks and is lost?
• How much data is lost?
– How many compounds never reported?
– How many syntheses fail or succeed?
– How many characterization measurements?
Community Repository for Data
• Funding agencies encourage sharing of data
• Increasing availability of “Open Data”
• Institutional repositories – no specific domain support
• Develop a community repository for chemistry data – private, public, embargoed
• Provides data to develop models/algorithms
Community Repository for Data
• Automated depositions of data
• DOI’ed data objects for citation purposes
• A database of reference data, but validated by the community
• National services feeding the repository – crystallography, mass spectrometry
• Integrate to blogging tools for chemistry
• Integrate to Electronic Lab Notebooks as feeds
Model Building with Community Data
• Community data as a basis of model building
– Consume data from available databases, community data, new publications and build predictive algorithms for the community
– How many algorithms are reported and lost? How much repeat work is done in the domain of algorithmic development?
Pulling Data from our Archive
• Our contribution to the world of chemistry data
• DERA – digitally enabling the RSC archive
– Text mining
• Find chemicals, reactions, analytical data, properties
– Algorithmic checking
• Validate algorithmically what we can - robots
– “Web 2.0 interfaces” for curating and validating
What if we could capture it all?
Digitally Enhancing the RSC Archive
Data Validation and Curation Required
Encouraging Participation with Rewards and RECOGNITION
Manual Curation
• Integrated commenting, curating and validation platform across ALL eScience and publishing platforms
• All integrated to a central RSC profile and feeding the AltMetrics tools
Structure Review
Maybe Hybrid Man-Machine
Where we are now…
Rewards and Recognition
Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.
The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
Future Recognition in AltMetrics?
ChemSpider
Internet Data
The Future
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
The Future of Chemistry on the Web?
• Public compound databases federate & build a linked environment of validated data!
• Data validation needs are not ignored
• Publishers layer on information to make publications discoverable
• Public-Private databases can be linked
• Open Data proliferate
• The “Semantic Web” in action
Acknowledgments
• Valery Tkachenko and the eScience team
• Our data providers, depositors, collaborators and curators
• Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
Thank you
Email: [email protected]
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams