![Page 1: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/1.jpg)
Importance of data standards for large scale data integration in chemistry
Antony Williams, Valery Tkachenko, Alexey Pshenichnov, Ken Karapetyan, Stuart Chalk, Daniel Lowe and Carlos Cobas
ACS Denver, March 2015
![Page 2: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/2.jpg)
Free and Easy
• To make it easy to “take notes” these slides will be available at:
www.slideshare.net/AntonyWilliams/
![Page 3: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/3.jpg)
Charles Holland Duell
![Page 4: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/4.jpg)
Charles Holland Duell
• 1898-1901: US Commissioner of Patents
• "Everything that can be invented has been invented."
![Page 5: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/5.jpg)
Antony John Williams (et al)
![Page 6: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/6.jpg)
Antony John Williams (et al)
• “We don’t need more standards!”
• “Of COURSE we can build a spectral database!”
• “The standards we have are good enough”
![Page 7: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/7.jpg)
A Pragmatic View to Progress
• Let’s consider progressing an NMR Spectral database for the community!
• MUST HAVES – spectra (1D/2D), associated structures, assignments
• WANTS – predict NMR spectra, spectral searching, privacy/embargos
• What would we need in terms of standards?
• Molfiles and JCAMP
![Page 8: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/8.jpg)
Standards without adoption..
![Page 9: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/9.jpg)
Standards
![Page 10: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/10.jpg)
2D NMR
![Page 11: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/11.jpg)
Progress in standards
![Page 12: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/12.jpg)
Progress in standards
![Page 13: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/13.jpg)
Standards without adoption are limited in value
• If the instrument vendors don’t support or adopt the standards success is limited
• YESTERDAY discussion about publishing NMR – JCAMP
• But what is already available will work – Jeol, Bruker, Thermo, Anasazi, Agilent/Varian - imperfect but useful
![Page 15: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/15.jpg)
9400 Spectra and growing
http://www.chemspider.com/spectra.aspx
![Page 16: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/16.jpg)
JCAMP NMR Spectra
![Page 17: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/17.jpg)
Data on ChemSpider
![Page 18: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/18.jpg)
JCAMP file downloads
• When NMR spectra are stored as JCAMP then downloads into offline packages are feasible – MestreLabs, ACD/Labs etc
• Open Data – download versus view
• Store spectra locally and reuse
• Java is increasingly a pain!
• Need to move to HTML5 viewing on ChemSpider, especially for Mobile Viewing
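To illustrate why a downloadable JCAMP file is reusable offline, here is a minimal sketch of reading one in Python. It assumes an uncompressed `(X++(Y..Y))` table with whitespace-separated numbers and evenly spaced X values; compressed forms (PAC, SQZ, DIF) are not handled. The sample file content and the function name `parse_jcamp` are hypothetical, not taken from ChemSpider.

```python
import re

# Hypothetical, tiny JCAMP-DX file used only for illustration.
SAMPLE_JCAMP = """\
##TITLE=Hypothetical 1H NMR spectrum
##JCAMP-DX=5.01
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##XFACTOR=1.0
##YFACTOR=0.5
##FIRSTX=0.0
##LASTX=5.0
##NPOINTS=6
##XYDATA=(X++(Y..Y))
0.0 2 4 6
3.0 8 10 12
##END=
"""


def parse_jcamp(text):
    """Return (header dict, list of (x, y) points) for a simple JCAMP-DX file."""
    header, raw_y, in_table = {}, [], False
    for raw_line in text.splitlines():
        line = raw_line.split("$$")[0].strip()  # "$$" starts an inline comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            if label == "END":
                break
            header[label] = value.strip()
            in_table = label == "XYDATA"
        elif in_table:
            numbers = [float(tok) for tok in re.split(r"[\s,]+", line) if tok]
            raw_y.extend(numbers[1:])  # first number on each row is the X checkpoint
    y_factor = float(header.get("YFACTOR", 1.0))
    first_x = float(header["FIRSTX"])
    last_x = float(header["LASTX"])
    n = len(raw_y)
    step = (last_x - first_x) / (n - 1) if n > 1 else 0.0
    points = [(first_x + i * step, y * y_factor) for i, y in enumerate(raw_y)]
    return header, points


header, points = parse_jcamp(SAMPLE_JCAMP)
print(header["TITLE"], len(points))
```

Because the file is plain labeled text, the same header fields can be indexed or re-plotted by any third-party tool without the original vendor software.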
![Page 19: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/19.jpg)
Challenges with Spectra
• JCAMP is good for a lot of spectral data – IR, Raman, 1D NMR
• MS data is rarely made available in JCAMP
• We would love a ratified JCAMP 6.0 for 2D data exchange – allows third parties to build support for download
• ASSIGNED JCAMP spectra supported
![Page 20: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/20.jpg)
Proper Verification
Advanced Chemistry Development, Inc. (ACD/Labs), 03/25/15
![Page 21: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/21.jpg)
Jmol - JSpecView
![Page 22: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/22.jpg)
ChemDoodle Components
![Page 23: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/23.jpg)
Spectral Display in the hand
![Page 24: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/24.jpg)
New Repository Architecture
doi: 10.1007/s10822-014-9784-5
![Page 25: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/25.jpg)
Compounds
![Page 26: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/26.jpg)
Reactions
![Page 27: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/27.jpg)
Analytical data
![Page 28: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/28.jpg)
Deposition of Data
![Page 29: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/29.jpg)
1,000,000 Spectra Online?
![Page 30: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/30.jpg)
ESI – Text Spectra
![Page 31: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/31.jpg)
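Developing Proof-of-Concept
• Extract from 1976-2014 USPTO applications

| Nucleus | Count |
|---|---|
| H | 975543 |
| C | 56536 |
| unknown* | 44306 |
| F | 9429 |
| P | 3241 |
| B | 91 |
| Si | 62 |
| Sn | 22 |
| Se | 11 |
| N | 8 |

*unknown – starts off with NMR: peak list (no nucleus)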
![Page 32: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/32.jpg)
We want to find text spectra?
• We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
• What would be better are spectral figures – and include assignments where possible!
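As a rough illustration of how a text spectrum like the one above could be indexed, the sketch below pulls out the nucleus, solvent, frequency and shift list with regular expressions. It is not the extraction pipeline used in the talk; the pattern and the function name `parse_text_spectrum` are assumptions, and it only suits 13C-style lists where every decimal number outside parentheses is a shift (1H lists with J couplings need more care).

```python
import re

# Assumed layout: "<nucleus> NMR (<solvent>, <freq> MHz): δ = <peak list>"
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)\s+NMR\s*"
    r"\((?P<solvent>[^,()]+),\s*(?P<mhz>\d+(?:\.\d+)?)\s*MHz\)\s*:\s*"
    r"δ\s*=\s*(?P<body>.+)",
    re.DOTALL,
)

TEXT_13C = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and shifts, or None if no match."""
    match = HEADER.search(text)
    if match is None:
        return None
    # Drop parenthetical assignments such as "(CH3)" before reading numbers.
    body = re.sub(r"\([^()]*\)", "", match.group("body"))
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": match.group("nucleus"),
        "solvent": match.group("solvent").strip(),
        "mhz": float(match.group("mhz")),
        "shifts": shifts,
    }


spectrum = parse_text_spectrum(TEXT_13C)
print(spectrum["nucleus"], spectrum["solvent"], len(spectrum["shifts"]))
```

Note that the assignments in parentheses are discarded here, which is exactly the information the slide argues should be preserved.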
![Page 33: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/33.jpg)
MestreLabs Mnova NMR
![Page 34: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/34.jpg)
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
![Page 35: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/35.jpg)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
![Page 36: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/36.jpg)
ESI Data also contains figures
![Page 37: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/37.jpg)
Publications & “Real Spectra”
• We are turning text into spectra
• We are turning figures into spectra
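One plausible way to turn a text peak list into a displayable spectrum is to place a Lorentzian line at each reported shift. The sketch below is an assumption about how this could be done, not the method used by the authors; the line width, the point count and the function names are all hypothetical choices.

```python
def lorentzian(x, center, width):
    """Lorentzian line with unit height at its center; width is the full width at half height."""
    half = width / 2.0
    return half * half / ((x - center) ** 2 + half * half)


def synthesize(shifts, start, stop, n_points, width=0.5):
    """Sum one unit-height Lorentzian per shift on an evenly spaced ppm axis."""
    step = (stop - start) / (n_points - 1)
    xs = [start + i * step for i in range(n_points)]
    ys = [sum(lorentzian(x, shift, width) for shift in shifts) for x in xs]
    return xs, ys


# Three shifts taken from the 13C example in this talk.
xs, ys = synthesize([14.12, 66.12, 154.88], 0.0, 200.0, 20001)
print(len(xs), max(ys))
```

A synthesized trace carries no real intensities or line shapes, so it is a visual stand-in rather than a substitute for the measured data.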
![Page 38: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/38.jpg)
Early Test Experiments
Input: 74 supplementary data documents, 3444 pages
Output: Plot2Txt extracted content from 1069 pages, 1151 spectra in total; >80% of peaks extracted to within 1-2 decimal places (ppm)
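A recovery figure like the one above can be computed by matching extracted peaks against reference peaks within a tolerance. The sketch below shows one way to do that; the 0.05 ppm tolerance, the sample values and the function name `recovered_fraction` are assumptions for illustration, not the criteria used in the Plot2Txt tests.

```python
def recovered_fraction(reference, extracted, tolerance=0.05):
    """Fraction of reference peaks that have an extracted peak within `tolerance` ppm."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for ref in reference
        if any(abs(ref - ext) <= tolerance for ext in extracted)
    )
    return hits / len(reference)


# Hypothetical comparison: one of five extracted peaks is badly off.
reference = [14.12, 30.11, 66.12, 117.72, 154.88]
extracted = [14.1, 30.13, 66.5, 117.7, 154.9]
print(recovered_fraction(reference, extracted))
```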
![Page 39: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/39.jpg)
“Where is the real data please?”
FIGURE
DATA
![Page 40: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/40.jpg)
Manual Curation Layer
• ALL SPECTRA WILL BE STORED AS JCAMP
• ChemSpider has had a manual curation layer for >8 years
• Users can annotate data on ChemSpider
• We do receive useful feedback from the community on the data and are optimistic!
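For extracted peak lists, storing as JCAMP could look roughly like the sketch below, which writes a JCAMP-DX peak table. The header set is a plausible minimal subset and should be checked against the JCAMP-DX specification before use; the function name, the origin and owner strings, and the unit intensities are hypothetical.

```python
def peaks_to_jcamp(title, nucleus, mhz, peaks):
    """Serialize (shift_ppm, intensity) pairs as a JCAMP-DX peak table string."""
    lines = [
        f"##TITLE={title}",
        "##JCAMP-DX=5.01",
        "##DATA TYPE=NMR PEAK TABLE",
        "##DATA CLASS=PEAK TABLE",
        "##ORIGIN=hypothetical extraction pipeline",  # placeholder value
        "##OWNER=unknown",                            # placeholder value
        f"##.OBSERVE FREQUENCY={mhz}",
        f"##.OBSERVE NUCLEUS=^{nucleus}",
        "##XUNITS=PPM",
        "##YUNITS=ARBITRARY UNITS",
        f"##NPOINTS={len(peaks)}",
        "##PEAK TABLE=(XY..XY)",
    ]
    lines += [f"{x:.2f}, {y:.2f}" for x, y in peaks]
    lines.append("##END=")
    return "\n".join(lines) + "\n"


text = peaks_to_jcamp("Hypothetical compound", "13C", 100.0,
                      [(14.12, 1.0), (66.12, 1.0)])
print(text)
```

Text-derived peaks have no measured intensity, so a constant value is written here; a curation layer would need to flag such records as derived rather than acquired.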
![Page 41: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/41.jpg)
Extraction is the WRONG WAY
• We should NOT mine data out – digital form!
• Structures should be submitted “correctly”
• Spectra should be digital spectral formats, not images
• ESI should be RICH and interactive
• Data should be open, available, with meta data and provenance
![Page 42: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/42.jpg)
We can solve for Authors here
Will it be used though??? YES!
![Page 43: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/43.jpg)
Supplementary Info Data now..
![Page 44: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/44.jpg)
Data mining – it’s MINE!!!
![Page 45: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/45.jpg)
What should we be doing?
• Settle on a short-term format – JCAMP-JMOL?
![Page 46: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/46.jpg)
But there ARE solutions!
![Page 47: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/47.jpg)
But there ARE solutions!
![Page 48: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/48.jpg)
What should we be doing?
• Settle on a short-term format – JCAMP-JMOL?
• Convince the instrument vendors to export in this format
• Push button depositions into “containers” – ChemSpider, NMRShiftDB, Institutional Repositories
• Encourage format support in software (read and write) – Mestre, ACD/Labs, Bruker TopSpin, etc.
![Page 49: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/49.jpg)
NMRShiftDB anyone?
![Page 50: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/50.jpg)
Standards in Large Scale Data Integration
• ALL of these are imperfect standards
• Molfiles
• SDF
• InChI
• JCAMP
• But what can be done with them?
![Page 51: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/51.jpg)
Compound Data
• The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI
• We primarily depend on molfiles and SDF files for data deposition and interchange
• We use InChI a lot – especially for integrated searching across the web
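To show what deposition via SDF involves in practice, here is a minimal sketch of reading records from an SD file. It assumes V2000 molfile blocks with fixed-width counts and atom lines; the sample record, the program line and the function names are hypothetical, and real-world files often deviate from the fixed columns.

```python
import re

# Hypothetical single-record SD file (methanol heavy atoms only).
SAMPLE_SDF = "\n".join([
    "methanol",
    "  hypothetical-program",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "> <NAME>",
    "methanol",
    "",
    "$$$$",
])


def parse_record(lines):
    """Parse one V2000 molfile block plus its SDF data items."""
    counts = lines[3]                      # counts line: aaabbb...
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    data, i = {}, 4 + n_atoms + n_bonds
    while i < len(lines):
        tag = re.match(r">\s*<([^>]+)>", lines[i])
        i += 1
        if tag:
            values = []
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            data[tag.group(1)] = "\n".join(values)
    return {"name": lines[0].strip(), "atoms": atoms,
            "n_bonds": n_bonds, "data": data}


def read_sdf(text):
    """Split an SD file on '$$$$' delimiters and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


print(read_sdf(SAMPLE_SDF))
```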
![Page 52: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/52.jpg)
Searching the Entire Web?
![Page 53: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/53.jpg)
Searching Internet by Structure
![Page 54: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/54.jpg)
Compound Data
• The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI
• We primarily depend on molfiles and SDF files for data deposition and interchange
• We use InChI a lot – especially for integrated searching across the web
• There ARE data interchange problems associated with structures….
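Searching the web by structure works because an InChIKey is a fixed-length text string that ordinary search engines can index. The sketch below checks the InChIKey layout (14, 10 and 1 uppercase letters separated by hyphens) and builds a quoted search URL. The search engine URL is an assumption for illustration, and the layout check does not prove the key corresponds to a real structure.

```python
import re
from urllib.parse import quote_plus

INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text):
    """True if `text` has the 14-10-1 uppercase layout of an InChIKey."""
    return bool(INCHIKEY.match(text))


def web_search_url(inchikey, engine="https://www.google.com/search?q="):
    """Build an exact-phrase web search URL for an InChIKey."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return engine + quote_plus(f'"{inchikey}"')


# InChIKey of aspirin.
print(web_search_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```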
![Page 55: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/55.jpg)
USE and TEACH Standards
• Too few people are aware of the existing standards and their capabilities
• Part of the CINF mission activities should be to teach standards and this is being done
• Still too few people have heard of InChI and JCAMP for example
• Still little known about the importance of correct structure representations – kudos to people like Leah et al who TEACH THIS!
![Page 56: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/56.jpg)
USE and TEACH Standards!
![Page 57: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/57.jpg)
USE and TEACH Standards!
![Page 58: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/58.jpg)
CVSP: Validate and Standardize
![Page 59: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/59.jpg)
CVSP Rules Sets
![Page 60: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/60.jpg)
CVSP Filtering of DrugBank
![Page 61: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/61.jpg)
Compounds
![Page 62: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/62.jpg)
Reactions
![Page 63: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/63.jpg)
Use Ontologies
![Page 64: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/64.jpg)
![Page 65: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/65.jpg)
Contribute to PUBLIC Ontologies
• Yes there are “company” ontologies – but for the good of the community contribute to public ontologies and standards
• For data interchange and meshing this is soooooo beneficial!
![Page 66: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/66.jpg)
ChAMP – Stuart Chalk
![Page 67: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/67.jpg)
Use standards in APIs, endpoints and widgets
![Page 68: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/68.jpg)
Semanticize content : RDF
![Page 69: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/69.jpg)
Actions
• Support and encourage new standards
• In the meantime, reawaken and modernize the JCAMP standard
• Show up and listen to Bob Hanson today
• Encourage scientists to provide data
![Page 70: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/70.jpg)
Charles Holland Duell in 1902
“…all previous advances in the various lines of invention will appear totally insignificant when compared with those which the present century will witness.
I almost wish that I might live my life over again to see the wonders which are at the threshold”
![Page 71: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/71.jpg)
“Git-r-Done”
![Page 72: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/72.jpg)
Acknowledgments
• Daniel Lowe – NextMove, Reactions and Spectra
• Bill Brouwer – Plot2Txt Development
• Carlos Cobas and Stan Sykora – MestreLabs
• The ChemSpider team – led by Richard Kidd
• The RSC Data Repository team
![Page 73: Importance of data standards for large scale data integration in chemistry](https://reader034.vdocuments.net/reader034/viewer/2022052700/55a60a8a1a28abd17b8b45e7/html5/thumbnails/73.jpg)
Thank you
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams