Great promise of navigating the internet using InChIs
DESCRIPTION
The InChI, the International Chemical Identifier, has been the basis of both indexing and deduplication of the ChemSpider database since the inception of the platform. When the InChI was adopted we envisaged a future in which the identifier would proliferate across journals, databases and the internet in general, providing us a basis for “structure searching the internet”. This presentation will provide an overview of how the InChI has facilitated the integration of ChemSpider with chemistry on the internet, describe some of the surprising findings that have resulted from this work, and extrapolate the influence of InChIs into the future for a chemically enabled web.
TRANSCRIPT
Great promise of navigating the internet using InChIs
Antony J Williams
ACS San Diego, March 2012
Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
Warning…
This talk is not about Quality…it’s about quantity
Drugbank was here
Data quality is a known issue
We ALL have issues!!!
It’s about what’s out there…
How to Link it…
And getting out of overwhelm…
So what is Yohimbine?
Of course it is out there…
Drugbox: 3001/5080 with InChIs
Chembox: 5436/7690 with InChIs
Tell me more…
Where can I find the molfile for Yohimbine?
Papers/Patents about Yohimbine?
What are the side effects of Yohimbine?
Where can I order Yohimbine?
What are the physicochemical properties?
Metabolic pathways?
Different synonyms of Yohimbine?
Synthesis of Yohimbine?
Side effects of Yohimbine?
Etc….
Quantity!
Yohimbine on ChemSpider..Quality?
How do we build it?
We deal in Molfiles or SDF files – with coordinates
Deposit anything that has an InChI – we support what InChI can handle, good and bad
Standardization based on “InChI standardization”
InChIs aggregate (certain) tautomers
We link out to external sites using their IDs
Downsides of InChI
InChI was a moving target (multiple versions) but overall worked as planned.
Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”
InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
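The "first version in becomes THE structure" behaviour can be pictured with a minimal sketch. This is not ChemSpider's actual code: the source names and the `deduplicate` helper are hypothetical, and the only assumption is that each deposition arrives with an InChI string that serves as the lookup key.

```python
from typing import Dict, Iterable, List, Tuple

# Each deposition is (source name, InChI string). Source names are hypothetical.
Deposition = Tuple[str, str]


def deduplicate(
    depositions: Iterable[Deposition],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """First-in-wins deduplication keyed on the full InChI string."""
    primary: Dict[str, str] = {}        # InChI -> source that deposited it first
    merged: Dict[str, List[str]] = {}   # InChI -> later sources folded into that record
    for source, inchi in depositions:
        if inchi not in primary:
            # The first deposition defines the record everything else is matched against.
            primary[inchi] = source
            merged[inchi] = []
        else:
            # Same InChI seen again: link the source, do not create a new record.
            merged[inchi].append(source)
    return primary, merged


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"

primary, merged = deduplicate([
    ("vendor_a", ETHANOL),
    ("vendor_b", WATER),
    ("vendor_c", ETHANOL),
])
print(primary[ETHANOL], merged[ETHANOL])
```

The downside named on the slide follows directly: whatever molfile came with the first deposition is the one that gets depicted, even if a later deposition had a better drawing.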
Side Effects of InChI Usage
SMILES by comparison…
Side Effects of InChI Usage
Standardization Issues
Depiction based on molfile
Downsides of Overall Approach
Meshing data together based on InChIs worked for simple molecules
2D layout errors inherited or limited by algorithm
Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
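One way to see why records differing by a single stereocenter do not merge is to compare InChIs with and without their stereo layers. The sketch below is an illustration only: `strip_stereo` is a hypothetical helper that drops the `/b`, `/t`, `/m` and `/s` layers by plain string handling, and it assumes a standard InChI whose first two `/`-separated parts are the version prefix and the formula.

```python
# Layer prefixes that carry stereochemistry in an InChI string:
# /b double-bond, /t tetrahedral, /m and /s stereo type flags.
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi: str) -> str:
    """Return the InChI with its stereo layers removed (string-level sketch)."""
    parts = inchi.split("/")
    # parts[0] is the version prefix and parts[1] the formula; always keep both.
    kept = parts[:2] + [p for p in parts[2:] if not p.startswith(STEREO_PREFIXES)]
    return "/".join(kept)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
FLAT_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

# Full strings differ, so a full-InChI deduplicator keeps three records.
print(len({L_ALANINE, D_ALANINE, FLAT_ALANINE}))
# With the stereo layers removed, all three collapse onto one skeleton.
print(len({strip_stereo(i) for i in (L_ALANINE, D_ALANINE, FLAT_ALANINE)}))
```

Grouping on the stripped string does not say which record is right; it only surfaces candidates that a curator may want to look at together.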
Yohimbine on ChemSpider..Quality?
So where can we travel???
InChI String Search via Google
Give me InChIKeys…
And where can we travel???
ChemSpider: Aggregator
BRENDA: Enzymes
Wikipedia: Encyclopedia
ChEMBL: Pharmacology
ChEBI: Curated Chemicals
DrugBank: Drug-Drug Target
Recognizing Compound Dilution
So much chemistry on the web….
And so much dilution – “structural uniqueness” versus “accidental ambiguity”
InChI as an easy skeleton search
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search
All aggregators suffer dilution!
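The difference between a "full molecule" and a "skeleton" search can be sketched with InChIKeys. A standard InChIKey is 27 characters in three hyphen-separated blocks, and the first 14-letter block is a hash of the connectivity part of the InChI, so searching on that block alone retrieves every stereo or isotopic variant of the same skeleton. The keys below are made-up placeholders, not real compounds, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Placeholder keys with the right shape (14-10-1 characters); NOT real compounds.
KEY_1 = "A" * 14 + "-" + "B" * 8 + "SA" + "-N"
KEY_2 = "A" * 14 + "-" + "C" * 8 + "SA" + "-N"   # same skeleton, different second block
KEY_3 = "D" * 14 + "-" + "UHFFFAOYSA" + "-N"     # different skeleton


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    blocks = inchikey.split("-")
    if len(inchikey) != 27 or len(blocks) != 3 or len(blocks[0]) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return blocks[0]


def group_by_skeleton(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group full InChIKeys under their shared first block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        groups[skeleton_block(key)].append(key)
    return dict(groups)


def search_queries(inchikey: str) -> Dict[str, str]:
    """Quoted strings one might paste into a web search engine."""
    return {
        "full": f'"{inchikey}"',
        "skeleton": f'"{skeleton_block(inchikey)}"',
    }


groups = group_by_skeleton([KEY_1, KEY_2, KEY_3])
print(groups)
print(search_queries(KEY_1))
```

A skeleton group holding many full keys is one rough indicator of the "dilution" described above, although it cannot by itself separate genuine stereoisomers from accidental ambiguity.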
Many Problems Can be Solved…
Clean up databases – structure validation, structure standardization
Warn about valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…)
Standardize: agree community rules to “Standardize”
Structure Validation
Structure Validation - Fixed
What needs to happen?
If we could validate
Catch errors in databases (and clean)
Proactively catch errors in publications/patents
Reduce junk in the ether – improve QUALITY!
If we standardized
Interlinking should improve
NPC Browser Set
Download, Deposit, Reprocess
Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane           34          5                   8                    21                           0
Gon-4-ene        55          12                  3                    33                           7
Gon-1,4-diene    60          17                  10                   23                           10
Structure-Name Validation
[Figure: chemical structure depictions labeled Choladine, Taxol, Chlotrimazole and Cholane]
Standardize
Use the SRS as a guidance document for standardization
Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
Ammonium salts
Millions of structures? Lots of Issues
ChemSpider Standardization
Entire ChemSpider database will be standardized using modified FDA rule set
Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated
Standardization procedures automatically applied to all future depositions
Identifier Dictionaries
Reciprocal curation processes…share curation with each other.
If a database already has a compound, use InChIKeys to match “suggested” validation against the compound.
A series of “added” and “removed” synonyms against InChIKeys for matching.
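A shared identifier dictionary of this kind might be exchanged as a list of "added" and "removed" synonyms per InChIKey. The sketch below assumes a hypothetical feed format (dicts with `inchikey`, `added` and `removed` fields) and a hypothetical `suggest_synonym_changes` helper; it only produces suggestions for compounds the receiving database already holds and leaves the decision to a curator.

```python
from typing import Dict, List, Set

ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"          # InChIKey of ethanol
UNKNOWN_KEY = "Z" * 14 + "-" + "UHFFFAOYSA" + "-N"   # placeholder, not a real compound


def suggest_synonym_changes(
    database: Dict[str, Set[str]],
    feed: List[dict],
) -> Dict[str, Dict[str, Set[str]]]:
    """Match a curation feed against local records by InChIKey."""
    suggestions: Dict[str, Dict[str, Set[str]]] = {}
    for entry in feed:
        key = entry["inchikey"]
        current = database.get(key)
        if current is None:
            continue  # compound not in this database: nothing to validate against
        to_add = set(entry.get("added", ())) - current       # new to us
        to_remove = set(entry.get("removed", ())) & current  # we still carry it
        if to_add or to_remove:
            suggestions[key] = {"add": to_add, "remove": to_remove}
    return suggestions


local_db = {ETHANOL_KEY: {"ethanol", "methanol"}}  # "methanol" is a deliberate error
partner_feed = [
    {"inchikey": ETHANOL_KEY, "added": ["ethyl alcohol"], "removed": ["methanol"]},
    {"inchikey": UNKNOWN_KEY, "added": ["something else"]},
]

suggestions = suggest_synonym_changes(local_db, partner_feed)
print(suggestions)
```

Keeping the output as suggestions rather than applying it directly matches the "reciprocal curation" idea: each side can accept or reject what the other proposes.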
Proof of Concept Data Curation Sharing
Who wants to work with us?
Structure Validation using feed
Look for approved synonyms
Compare feed InChIKey with database InChIKey
If different, flag for inspection
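The three steps above can be sketched as a small comparison routine. Everything here is an assumption for illustration: the local database is modelled as a mapping from approved synonym to InChIKey, the feed as (name, InChIKey) pairs, and the keys are made-up placeholders rather than real compounds.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# Placeholder InChIKeys with the right shape; NOT real compounds.
KEY_DB = "A" * 14 + "-" + "B" * 8 + "SA" + "-N"
KEY_STEREO = "A" * 14 + "-" + "C" * 8 + "SA" + "-N"   # same first block, different second
KEY_OTHER = "D" * 14 + "-" + "UHFFFAOYSA" + "-N"


class Flag(NamedTuple):
    name: str
    feed_key: str
    database_key: str
    same_skeleton: bool  # True when only the blocks after the first differ


def validate_feed(
    approved: Dict[str, str],
    feed: Iterable[Tuple[str, str]],
) -> List[Flag]:
    """Flag feed entries whose InChIKey disagrees with the approved synonym's key."""
    flags: List[Flag] = []
    for name, feed_key in feed:
        db_key = approved.get(name.strip().lower())  # step 1: look for approved synonym
        if db_key is None:
            continue                                 # no approved synonym to check against
        if db_key != feed_key:                       # step 2: compare the InChIKeys
            same = db_key.split("-")[0] == feed_key.split("-")[0]
            flags.append(Flag(name, feed_key, db_key, same))  # step 3: flag for inspection
    return flags


approved_synonyms = {"compound x": KEY_DB, "compound y": KEY_DB}
incoming = [
    ("Compound X", KEY_DB),       # agrees: no flag
    ("Compound Y", KEY_STEREO),   # differs, but same skeleton block
    ("compound x ", KEY_OTHER),   # differs entirely
    ("Compound Z", KEY_OTHER),    # unknown name: skipped
]

flags = validate_feed(approved_synonyms, incoming)
for flag in flags:
    print(flag)
```

The `same_skeleton` field is an extra beyond the slide: it lets a reviewer separate likely stereochemistry disagreements from completely different structures.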
It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
Open PHACTS Project
Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36 months project
Guiding principle is open access, open usage, open source – key to standards adoption
Chemistry in Open PHACTS
Selected data slices of ChemSpider carrying pharmacological links into the “linked data cache”
ChemSpiderIDs and InChIs/InChIKeys will be in Open PHACTS and available for linking
A structure ID standard to enable further linking across the semantic web of science
Internet Data
ChemSpider and InChI
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
The great promise should be obvious
InChIs are here to stay
They will evolve, they will encompass, we will adopt and adapt
Public and private databases will federate & build a linked environment of validated data!
Data validation and standardization is needed
Open Data will continue to proliferate
InChIs are in the “Semantic Web” already
If InChI never existed or went away..
ChemSpider would never have been built
Database linking would suffer dramatically
The web would not be “structure searchable”
Cheminformatics tools would likely not be linking to public domain databases in the same way
And we would not have the pleasure of today…
Acknowledgments
The inspiration of the InChI Masters – Steve H., Steve S., Alan, Dmitrii, Igor
IUPAC, NIST, all adopters, supporters, challengers and users
The InChI Trust and its supporters for funding continued development
Al Gore – enabling us to search InChIs on the web
Steve Heller
Thank you
Email: [email protected]
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams