The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms
DESCRIPTION
The Royal Society of Chemistry hosts one of the largest online chemistry databases containing almost 30 million unique chemical structures. The database, ChemSpider, provides the underpinning for a series of eScience projects allowing for the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it specifically in the ChemSpider project to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a Global Chemistry Network encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.
TRANSCRIPT
![Page 1: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/1.jpg)
The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at RSC
Antony Williams
Bio-IT,
Boston, April 27th 2014
![Page 2: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/2.jpg)
Without the InChI…
• ChemSpider is unlikely to have been built
• It would not have grown into one of the domain's primary online chemistry resources
• The Royal Society of Chemistry would not have it as an online database, would not have a large cheminformatics team and would not be involved in a number of large-scale funded projects around chemistry data
![Page 3: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/3.jpg)
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
![Page 4: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/4.jpg)
ChemSpider
![Page 5: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/5.jpg)
ChemSpider
![Page 6: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/6.jpg)
Experimental/Predicted Properties
![Page 7: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/7.jpg)
Literature references
![Page 8: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/8.jpg)
Patents references
![Page 9: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/9.jpg)
So what is Yohimbine?
![Page 10: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/10.jpg)
Of course it is out there…
Drugbox: 3001/5080 with InChIs
Chembox: 5436/7690 with InChIs
![Page 11: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/11.jpg)
Tell me more…
• Where can I find the molfile for Yohimbine?
• Papers/Patents about Yohimbine?
• What are the side effects of Yohimbine?
• Where can I order Yohimbine?
• What are the physicochemical properties?
• Metabolic pathways?
• Different synonyms of Yohimbine?
• Synthesis of Yohimbine?
• Side effects of Yohimbine?
• Etc.
![Page 12: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/12.jpg)
Quantity!
![Page 13: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/13.jpg)
Yohimbine on ChemSpider
![Page 14: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/14.jpg)
Downsides of Overall Approach
• Meshing data together based on InChIs worked for simple molecules
• 2D layout errors inherited or limited by algorithm
• Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
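The stereocenter problem in the last bullet can be seen directly in the identifier strings. The following is a minimal sketch, assuming standard InChI strings in which stereo information sits in the `/b`, `/t`, `/m` and `/s` layers. The helper name `strip_stereo` is hypothetical, and the two alanine strings are used for illustration only; this is not the ChemSpider deduplication code.

```python
# Layer prefixes that carry stereo information in a standard InChI:
# b = double bond, t = tetrahedral, m = inversion flag, s = stereo type.
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi: str) -> str:
    """Return the InChI without its stereo layers (illustrative helper).

    Simplification: every layer starting with b, t, m or s is dropped,
    wherever it occurs in the string.
    """
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


# Two enantiomers of alanine (illustrative strings).
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# Exact InChI matching treats them as different compounds...
same_record = L_ALANINE == D_ALANINE
# ...while the stereo-free strings agree, flagging a possible near-duplicate.
same_skeleton = strip_stereo(L_ALANINE) == strip_stereo(D_ALANINE)
```

Grouping records on the stereo-free string is one way to surface candidates for manual curation: records that share a name and a skeleton but were kept apart because a single stereo layer differs.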
![Page 15: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/15.jpg)
Yohimbine on ChemSpider… Quality?
![Page 16: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/16.jpg)
So where can we travel???
![Page 17: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/17.jpg)
So where can we travel???
![Page 18: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/18.jpg)
![Page 19: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/19.jpg)
InChI String Search via Google
Give me InChIKeys…
![Page 20: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/20.jpg)
And where can we travel???
![Page 21: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/21.jpg)
ChemSpider
BRENDA
Wikipedia
ChEMBL
ChEBI
DrugBank
![Page 22: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/22.jpg)
Aggregator
Enzymes
Encyclopedia
Pharmacology
Curated Chemicals
Drug-Drug Target
![Page 23: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/23.jpg)
How do we build it?
• We deal in Molfiles or SDF files – with coordinates
• Deposit anything that has an InChI – we support what InChI can handle, good and bad
• Standardization based on “InChI standardization”
• InChIs aggregate (certain) tautomers
• We link out to external sites using their IDs
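The linking idea in the bullets above can be sketched as a table keyed on the InChI string, where each depositing source contributes its own identifier. This is a minimal illustration under stated assumptions: the function name `build_link_table`, the source names and the external IDs are all hypothetical, and the real system works from Molfiles and a database rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_link_table(deposits):
    """Group (source, external_id, inchi) deposits by InChI string.

    Returns {inchi: {source: external_id}} so one structure can link
    out to every site that deposited it.
    """
    table = defaultdict(dict)
    for source, external_id, inchi in deposits:
        table[inchi][source] = external_id
    return dict(table)


ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
WATER = "InChI=1S/H2O/h1H2"

# Hypothetical deposits from two made-up sources.
deposits = [
    ("SourceA", "A-0001", ASPIRIN),
    ("SourceB", "B-42", ASPIRIN),
    ("SourceA", "A-0002", WATER),
]

links = build_link_table(deposits)
```

Here `links[ASPIRIN]` holds one entry per source, which is the information needed to render "link out" buttons on a compound page.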
![Page 24: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/24.jpg)
Downsides of InChI
• InChI was a moving target (multiple versions) but overall worked as planned.
• Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”
• InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
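The "first in wins" behaviour described in the last bullet can be sketched in a few lines. This is an illustration only, assuming a registry keyed on the InChI string; the function name `deposit` and the placeholder molfile texts are hypothetical.

```python
def deposit(registry, inchi, molfile):
    """Store molfile under inchi unless that InChI is already registered.

    Returns True if the record was stored, False if it was treated as a
    duplicate of an earlier deposit.
    """
    if inchi in registry:
        return False
    registry[inchi] = molfile
    return True


WATER = "InChI=1S/H2O/h1H2"
registry = {}

# The first deposit becomes THE structure for this InChI...
first_stored = deposit(registry, WATER, "molfile with a poor 2D layout")
# ...and a later, better-drawn version of the same compound is discarded.
second_stored = deposit(registry, WATER, "molfile with a clean 2D layout")
```

After both calls the registry still holds the poorly drawn molfile, which is how layout errors from an early source can persist even when a cleaner depiction arrives later.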
![Page 25: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/25.jpg)
Side Effects of InChI Usage
![Page 26: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/26.jpg)
SMILES by comparison…
![Page 27: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/27.jpg)
Side Effects of InChI Usage
![Page 28: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/28.jpg)
Standardization Issues
Depiction based on molfile
![Page 29: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/29.jpg)
Standardize
Use the SRS as a guidance document for standardization
Adjust as necessary to our needs
![Page 30: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/30.jpg)
Nitro groups
![Page 31: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/31.jpg)
Salt and Ionic Bonds
![Page 32: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/32.jpg)
Ammonium salts
![Page 33: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/33.jpg)
CVSP
![Page 34: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/34.jpg)
NPC Browser Set
![Page 35: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/35.jpg)
Checking included InChIs
• Many SDF files contain InChIs and SMILES – comparing the structure contained within the file with the associated InChI is useful – turned up a number of errors in checking online databases
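A check of this kind can be sketched as follows. This is a hedged illustration, not the CVSP implementation: the data-field tag `INCHI` is an assumption (files use many different tag names), and `inchi_from_molblock` is a caller-supplied function, for example a wrapper around a cheminformatics toolkit. The demo uses a lookup-table stub in its place so the sketch stays self-contained.

```python
def sdf_records(sdf_text):
    """Yield (molblock, data_fields) for each record in an SDF string."""
    molblock, fields, tag, in_data = [], {}, None, False
    for line in sdf_text.splitlines():
        if line.startswith("$$$$"):            # record terminator
            yield "\n".join(molblock), fields
            molblock, fields, tag, in_data = [], {}, None, False
        elif line.startswith(">") and "<" in line:   # data header, e.g. "> <INCHI>"
            tag = line[line.index("<") + 1 : line.rindex(">")]
            fields[tag] = ""
            in_data = True
        elif in_data:
            if not line.strip():               # blank line ends the current item
                tag = None
            elif tag is not None:
                fields[tag] = f"{fields[tag]}\n{line}" if fields[tag] else line
        else:
            molblock.append(line)


def find_mismatches(sdf_text, inchi_from_molblock, tag="INCHI"):
    """Return indices of records whose stated InChI differs from the computed one."""
    mismatches = []
    for index, (molblock, fields) in enumerate(sdf_records(sdf_text)):
        stated = fields.get(tag)
        if stated is not None and stated != inchi_from_molblock(molblock):
            mismatches.append(index)
    return mismatches


# Demo with placeholder mol blocks; the second record carries the wrong InChI.
demo_sdf = (
    "water\n\n\nM  END\n> <INCHI>\nInChI=1S/H2O/h1H2\n\n$$$$\n"
    "methane\n\n\nM  END\n> <INCHI>\nInChI=1S/H2O/h1H2\n\n$$$$\n"
)
known = {"water": "InChI=1S/H2O/h1H2", "methane": "InChI=1S/CH4/h1H4"}


def stub_generator(molblock):
    """Stand-in for a real InChI generator: looks the name line up in a table."""
    return known[molblock.splitlines()[0]]


bad_records = find_mismatches(demo_sdf, stub_generator)
```

Passing the generator in as a parameter keeps the comparison logic independent of whichever toolkit actually computes the InChI.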
![Page 36: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/36.jpg)
So, I’m writing an article…
![Page 37: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/37.jpg)
With these…I will lose data
![Page 38: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/38.jpg)
But linking with InChI …
![Page 39: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/39.jpg)
Structure Searching the Web
![Page 40: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/40.jpg)
Data in Publications
• This is not new, you know the story…
• So much data of value is contained within a publication and delivered in a PDF form
• PDF files, and unclear licensing/copyright, limit access to data so I can rework, reuse, repurpose, text mine etc.
• “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
![Page 41: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/41.jpg)
“Data enable” publications?
• We would LOVE to bring data out of our archive
• What could we do?
  • Find chemical names and generate structures
  • Find chemical images and generate structures
  • Find reactions – and make a database!
  • Find data (MP, BP, LogP) and host. Build models!
  • Find figures and database them
  • Find spectra (and link to structures)
  • Validate the data algorithmically
![Page 42: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/42.jpg)
RSC Archive – since 1841
![Page 43: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/43.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
![Page 44: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/44.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
![Page 45: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/45.jpg)
But names = structures
• Systematic names can be generated FROM chemical structures algorithmically
![Page 46: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/46.jpg)
But names = structures
• …and structures from systematic names
![Page 47: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/47.jpg)
But what of trivial names?
• What about trivial names, trade names, CAS numbers, multilingual names etc.?
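Names that cannot be converted algorithmically need a dictionary lookup. The following is a minimal sketch of such a lookup, assuming a synonym list keyed on InChIKey; the function names `normalize` and `build_name_index` are hypothetical, and the single aspirin entry stands in for a curated dictionary of millions of names.

```python
def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so lookups are forgiving."""
    return " ".join(name.casefold().split())


def build_name_index(entries):
    """Invert {inchikey: [names]} into {normalized name: {inchikeys}}.

    A set is used because one trivial name can point to several structures.
    """
    index = {}
    for inchikey, names in entries.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(inchikey)
    return index


ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
entries = {
    ASPIRIN_KEY: [
        "Aspirin",                 # trivial name
        "Acetylsalicylic acid",    # common name
        "2-Acetoxybenzoic acid",   # systematic-style name
        "50-78-2",                 # CAS number stored as just another synonym
    ],
}

index = build_name_index(entries)
hits = index.get(normalize("  ACETYLSALICYLIC   acid "), set())
cas_hits = index.get(normalize("50-78-2"), set())
```

Keeping a set of keys per name makes ambiguity visible: a name that maps to more than one InChIKey is a candidate for curation rather than a silent wrong answer.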
![Page 48: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/48.jpg)
Searching that lipid in patents
![Page 49: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/49.jpg)
Aspirin on ChemSpider
![Page 50: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/50.jpg)
Work in Progress
![Page 51: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/51.jpg)
Work in Progress
![Page 52: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/52.jpg)
Work in Progress
![Page 53: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/53.jpg)
Work in Progress
![Page 54: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/54.jpg)
But Context Gives Reactions
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
![Page 55: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/55.jpg)
ChemSpider Reactions
![Page 56: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/56.jpg)
ChemSpider as a Foundation
• >30 million chemicals (and growing)
• ChemSpider is free to access for everyone – and the API means people program against it
• What projects can we benefit?
![Page 57: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/57.jpg)
Support grant-based services
• Multiple European consortium-based grants
  • PharmaSea (FP7 funded)
  • Open PHACTS (IMI funded)
• UK National Chemical Database Service (http://cds.rsc.org) – developing data repository for lab data, integrate Electronic Lab Notebooks
• Open Drug Discovery projects
![Page 58: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/58.jpg)
![Page 59: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/59.jpg)
PharmaSea
![Page 60: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/60.jpg)
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using semantic web technologies
• Open code, open data, open standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…
![Page 61: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/61.jpg)
Open PHACTS
![Page 62: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/62.jpg)
All Databases We Generate…
• All databases and systems we build now include generated InChIs
• InChIs are facilitating discoverability via searching on Google (see Chris’ talk) but also for querying and linking
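Because an InChIKey is a fixed-length string of uppercase letters and hyphens, it works well as a web search term. The following is a minimal sketch of that idea, assuming the standard 27-character key layout (14 letters, hyphen, 10 letters, hyphen, 1 letter). The function names are hypothetical and the search URL pattern is an assumption about the search engine's public query format, not part of any RSC service.

```python
import re
from urllib.parse import quote_plus

# Shape of an InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey_shaped(text: str) -> bool:
    """Check the shape only; this does not prove the key belongs to a real compound."""
    return INCHIKEY_RE.fullmatch(text) is not None


def web_search_url(inchikey: str) -> str:
    """Build an exact-phrase web search URL for an InChIKey (assumed URL pattern)."""
    if not is_inchikey_shaped(inchikey):
        raise ValueError(f"not an InChIKey-shaped string: {inchikey!r}")
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
url = web_search_url(ASPIRIN_KEY)
```

Quoting the key asks for an exact-phrase match, which matters because the hyphens would otherwise let a search engine split the key into separate terms.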
![Page 63: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/63.jpg)
But we are still VERY LIMITED
• RSC deals with way more than organics, inorganics, organometallics – we are building a data repository to include materials, polymers, ambiguous materials etc.
• There are many plans for InChI moving forward – Markush, polymers, organometallics etc.
![Page 64: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/64.jpg)
The great promise should be obvious
• InChIs are here to stay
• They will evolve, they will encompass, we will adopt and adapt
• Public and private databases will federate & build a linked environment of validated data!
• Data validation and standardization is needed
• Open Data will continue to proliferate
• InChIs are in the “Semantic Web” already
![Page 65: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/65.jpg)
If InChI never existed …
• ChemSpider would never have been built
• Database linking would suffer dramatically
• The web would not be “structure searchable”
• Cheminformatics tools would likely not be linking to public domain databases in the same way
![Page 66: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.vdocuments.net/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/66.jpg)
Thank you
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams