250th ACS National Meeting, Boston MA, USA 17th August 2015
Unlocking chemical information from tables and legacy articles
Daniel Lowe and Roger Sayle NextMove Software
Aileen Day and Antony Williams Royal Society of Chemistry
Topics
• Chemical property extraction
• Application of chemical property extraction to tables
• RSC back-archive mining
Chemical property extraction
• Melting points
• Boiling points
• Mass spectrum
• Textual NMR spectra
• Specific rotation
• Chromatography retention times
• IR/UV spectra
• Activity data e.g. IC50, EC50, Ki
• Etc.
Simple grammar and corresponding state machine
Isotope: ‘1H’ | ‘13C’ | ‘19F’
Nmr: ‘-NMR’
NmrPrelog: Isotope Nmr
[Figure: state machine accepting “1H-NMR”, “13C-NMR” and “19F-NMR” one character at a time]
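As a rough illustration of the grammar-to-state-machine idea, the three rules can be expanded into their accepted strings and compiled into a character-level machine. This is only a sketch: `build_state_machine` and `accepts` are hypothetical names, and the trie built here does not share the common “-NMR” suffix the way a minimised machine would.

```python
def build_state_machine(strings):
    """Build a trie-shaped DFA: transitions[state] maps a character to the next state."""
    transitions = [{}]          # state 0 is the start state
    accepting = set()
    for s in strings:
        state = 0
        for ch in s:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)
    return transitions, accepting


def accepts(machine, text):
    """Walk the machine one character at a time; reject on a missing transition."""
    transitions, accepting = machine
    state = 0
    for ch in text:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in accepting


ISOTOPES = ["1H", "13C", "19F"]     # Isotope: '1H' | '13C' | '19F'
NMR = "-NMR"                        # Nmr: '-NMR'
NMR_PRELOG = build_state_machine([iso + NMR for iso in ISOTOPES])  # NmrPrelog: Isotope Nmr
```

Calling `accepts(NMR_PRELOG, "13C-NMR")` follows the path 1, 3, C, -, N, M, R through the machine, while an input such as "13H-NMR" is rejected at the third character.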
Melting point recognition
| Term | Examples of text matched |
| --- | --- |
| FromLiterature | “lit.” |
| MeltingPoint | “mpt”, “melting point”, “m.p.” |
| Qualifier | “>”; “approximately” |
| Value | “75° C”, “200° F”, “one hundred degrees Celsius” |
| Range | “184-186° C”, “191.5 to 192.4° C” |
| MeasurementError | “50±° C” |
| OutcomeQualifier | “decomp.”, “with decomposition”, “subl.” |

FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
M.p.: 230°C (dec.)
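The rule above can be approximated with a single regular expression. The sketch below is a simplification rather than the grammar used in the talk: it handles only numeric values and ranges in °C or °F (no spelled-out numbers or MeasurementError), and the names `MELTING_POINT` and `parse_melting_point` are hypothetical.

```python
import re

MELTING_POINT = re.compile(
    r"""
    (?P<from_literature>lit\.\s*)?                       # FromLiterature?
    (?P<quantity>m\.\s?p\.|mpt|melting\ point)\s*:?\s*   # MeltingPoint
    (?P<qualifier>>|approximately)?\s*                   # Qualifier?
    (?P<low>\d+(?:\.\d+)?)                               # Value, or low end of a Range
    (?:\s*(?:-|to)\s*(?P<high>\d+(?:\.\d+)?))?           # optional high end of a Range
    \s*°\s?(?P<unit>[CF])                                # unit
    (?:\s*\(?(?P<outcome>dec(?:omp)?\.|with\ decomposition|subl\.)\)?)?  # OutcomeQualifier?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_melting_point(text):
    """Return the parts of the first melting point found in text, or None."""
    m = MELTING_POINT.search(text)
    if m is None:
        return None
    low = float(m.group("low"))
    high = float(m.group("high")) if m.group("high") else low
    return {
        "from_literature": m.group("from_literature") is not None,
        "qualifier": m.group("qualifier"),
        "low": low,
        "high": high,
        "unit": "°" + m.group("unit").upper(),
        "outcome": m.group("outcome"),
    }
```

For the slide's example, `parse_melting_point("M.p.: 230°C (dec.)")` yields a single value of 230 °C with the outcome qualifier "dec.".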
NMR recognition
| Term | Examples of text matched |
| --- | --- |
| Isotope | “1H”, “13C”, “19F” |
| NMR | “NMR”, “RMN” |
| NmrMethod | “400 MHz, CDCl3” |
| Peak | “3.7” |
| PeakAnnotation | “s, 3H” |

Isotope NMR NmrMethod? Peak PeakAnnotation? (Delimiter Peak PeakAnnotation?)*

1H NMR (300 MHz, DMSO): 7.5-7.8 (m, 5H), 7.9 (d, J=8Hz, 2H), 8.33 (d, J=5Hz, 2H)
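A minimal sketch of how the NMR rule might be applied to the example line, using two regular expressions in place of the real grammar. The names `HEADER`, `PEAK` and `parse_nmr` are assumptions for illustration, and the patterns cover only the constructs listed in the table.

```python
import re

# Isotope NMR NmrMethod?
HEADER = re.compile(
    r"(?P<isotope>1H|13C|19F)[\s-]*(?:NMR|RMN)\s*(?:\((?P<method>[^)]*)\))?\s*:?",
    re.IGNORECASE,
)
# Peak PeakAnnotation?  (a peak may be a single shift or a shift range)
PEAK = re.compile(
    r"(?P<shift>\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)\s*(?:\((?P<annotation>[^)]*)\))?"
)


def parse_nmr(text):
    """Split a textual NMR spectrum into isotope, method and annotated peaks."""
    header = HEADER.search(text)
    if header is None:
        return None
    rest = text[header.end():]
    # Annotations are consumed together with their peak, so numbers inside
    # the parentheses (e.g. "J=8Hz") are not mistaken for peaks.
    peaks = [(m.group("shift"), m.group("annotation")) for m in PEAK.finditer(rest)]
    return {
        "isotope": header.group("isotope"),
        "method": header.group("method"),
        "peaks": peaks,
    }
```

Applied to the example above, this gives three peaks, the first being the range "7.5-7.8" annotated "m, 5H".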
Recognition and parsing
• Grammar distinguishes parts of an entity of interest e.g. 25°C → 25 (value) °C (unit)
• Can group constructs together e.g. 25 to 30 (range)
Example parse tree serialised to XML
Mp: 131.9-132.6 °C
<parse>
<quantityType quantityType="MeltingPoint">Mp</quantityType>
<measurement>
<range>
<valueOptUnit>
<decimalValue>131.9</decimalValue>
</valueOptUnit>
<rangeDelimiter>-</rangeDelimiter>
<valueOptUnit>
<decimalValue>132.6</decimalValue>
<unitContainer>
<unit unitType="Temperature" normalizationFactor="1">°C</unit>
</unitContainer>
</valueOptUnit>
</range>
</measurement>
</parse>
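A parse tree serialised this way can be read back with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree`; `read_range` is a hypothetical helper, and applying the single unit to both ends of the range is an assumption about how the tree should be interpreted.

```python
import xml.etree.ElementTree as ET

PARSE_XML = """<parse>
  <quantityType quantityType="MeltingPoint">Mp</quantityType>
  <measurement>
    <range>
      <valueOptUnit>
        <decimalValue>131.9</decimalValue>
      </valueOptUnit>
      <rangeDelimiter>-</rangeDelimiter>
      <valueOptUnit>
        <decimalValue>132.6</decimalValue>
        <unitContainer>
          <unit unitType="Temperature" normalizationFactor="1">°C</unit>
        </unitContainer>
      </valueOptUnit>
    </range>
  </measurement>
</parse>"""


def read_range(xml_text):
    """Extract quantity type, range bounds and unit from a serialised parse tree."""
    root = ET.fromstring(xml_text)
    quantity = root.find("quantityType").get("quantityType")
    values = [float(v.text) for v in root.iter("decimalValue")]
    unit = root.find(".//unit")
    # Assumption: the unit attached to the last value applies to the whole range.
    return {
        "quantity": quantity,
        "low": min(values),
        "high": max(values),
        "unit": unit.text if unit is not None else None,
    }
```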
Recognition and parsing
• Grammar distinguishes parts of an entity of interest e.g. 25°C → 25 (value) °C (unit)
• Can group constructs together e.g. 25 to 30 (range)
• However, this introduces non-determinism: e.g. after seeing “25”, both the possibility of being in a range and of not being in a range need to be considered
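The non-determinism can be made concrete by tracking every state the parser could be in after each token. The toy machine below is an assumption built for illustration (a measurement is either `NUM UNIT` or `NUM to NUM UNIT`); it is not the representation used in the talk.

```python
# (state, token kind) -> set of possible next states
NFA = {
    ("start", "NUM"): {"value_num", "range_low"},   # "25" could be a value or a range start
    ("value_num", "UNIT"): {"accept_value"},
    ("range_low", "TO"): {"range_to"},
    ("range_to", "NUM"): {"range_high"},
    ("range_high", "UNIT"): {"accept_range"},
}


def kind(token):
    """Classify a token as TO, NUM or UNIT (anything else is treated as a unit)."""
    if token == "to":
        return "TO"
    if token.replace(".", "", 1).isdigit():
        return "NUM"
    return "UNIT"


def run(tokens):
    """Return the sorted set of live states after each token."""
    states = {"start"}
    trace = []
    for token in tokens:
        states = set().union(*(NFA.get((s, kind(token)), set()) for s in states))
        trace.append(sorted(states))
    return trace
```

After the first token of `run(["25", "to", "30", "°C"])` both `range_low` and `value_num` are live; the following "to" eliminates the single-value reading.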
• Same grammar can be used to generate:
  – Single state machine representation
    • Parts of entity not distinguished
    • Extremely fast recognition
    • Allows spelling correction of input that is close to being a match
  – Multi state machine parser representation
    • Slower… but only needs to be run on a small amount of text
    • Distinguishes parts of entity
    • Can group parts into a parse tree
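To illustrate what “close to being a match” could mean, the sketch below corrects a token to the nearest accepted term by Levenshtein edit distance. This swaps in a simple dictionary comparison for the state-machine-based correction mentioned on the slide; `VOCABULARY`, `edit_distance` and `correct` are hypothetical names, and the distance threshold of 2 is an arbitrary choice.

```python
VOCABULARY = ["m.p.", "mpt", "melting point", "decomp.", "subl."]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def correct(token, vocabulary=VOCABULARY, max_distance=2):
    """Return the closest accepted term, or None if nothing is close enough."""
    best = min(vocabulary, key=lambda term: edit_distance(token, term))
    return best if edit_distance(token, best) <= max_distance else None
```

An OCR-damaged token such as "decornp." is two edits away from "decomp." and would be corrected, while an unrelated word returns `None`.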
Grammar implementation details
Table Extraction
Melting point table
NMR table
More difficult… Against what?
Needs to be looked up elsewhere in the document. Could be in text, might be in images
Even More difficult…
Tables in USPTO patents
…and the XML provided
<row>
<entry>1</entry>
<entry>N<sup>1</sup>-hydroxy-N<sup>2</sup>-{[4-(phenyloxy)phenyl]sulfonyl}-</entry>
<entry>H-NMR; δ (CD3OD): 7.79 (d, 2H),</entry>
</row>
<row>
<entry/>
<entry>D-lysinamide</entry>
<entry>7.42 (t, 2H), 7.22 (t, 1H), 7.09 (d, 2H),</entry>
</row>
<row>
<entry/>
<entry/>
<entry>7.05 (d, 2H), 3.63 (t, 1H), 2.87 (t, 2H),</entry>
</row>
<row>
<entry/>
<entry/>
<entry>1.57-1.68 (m, 4H), 1.44 (m, 1H),</entry>
</row>
<row>
<entry/>
<entry/>
<entry>1.37 (m, 1H)</entry>
</row>
<row>
Naïve interpretation (Google patents)
Green: chemical substituent Purple: chemical molecule Blue: NMR
SureChEMBL
After heuristically detecting which rows are the same row
Purple: chemical molecule Blue: NMR
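One plausible heuristic for detecting which physical rows belong to the same logical row is to treat any row whose first cell is empty as a continuation of the row above. The sketch below applies that assumption to the USPTO fragment shown earlier; the `<tbody>` wrapper, the function names and the joining rule (no space after a trailing hyphen) are all illustrative choices, not the method described in the talk.

```python
import xml.etree.ElementTree as ET

TABLE_XML = """<tbody>
<row><entry>1</entry><entry>N<sup>1</sup>-hydroxy-N<sup>2</sup>-{[4-(phenyloxy)phenyl]sulfonyl}-</entry><entry>H-NMR; δ (CD3OD): 7.79 (d, 2H),</entry></row>
<row><entry/><entry>D-lysinamide</entry><entry>7.42 (t, 2H), 7.22 (t, 1H), 7.09 (d, 2H),</entry></row>
<row><entry/><entry/><entry>7.05 (d, 2H), 3.63 (t, 1H), 2.87 (t, 2H),</entry></row>
<row><entry/><entry/><entry>1.57-1.68 (m, 4H), 1.44 (m, 1H),</entry></row>
<row><entry/><entry/><entry>1.37 (m, 1H)</entry></row>
</tbody>"""


def cell_text(entry):
    """Flatten a cell to plain text (superscript markup is dropped)."""
    return "".join(entry.itertext()).strip()


def join(previous, continuation):
    """Append continuation text; a trailing hyphen means the name was split mid-word."""
    if not continuation:
        return previous
    if not previous or previous.endswith("-"):
        return previous + continuation
    return previous + " " + continuation


def merge_rows(xml_text):
    """Merge rows with an empty first cell into the preceding row."""
    merged = []
    for row in ET.fromstring(xml_text).iter("row"):
        cells = [cell_text(e) for e in row.findall("entry")]
        if merged and cells and not cells[0]:
            merged[-1] = [join(p, c) for p, c in zip(merged[-1], cells)]
        else:
            merged.append(cells)
    return merged
```

With this rule the five physical rows collapse into one logical row whose second cell is the full compound name and whose third cell is the complete NMR description.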
What could be extracted?
[Bar chart: number of name/identifier to property relationships (y-axis, ticks from 100,000 to 600,000); the per-bar values and category labels did not survive extraction]
Compound number determination
<heading level="2" id="h-0055">EXAMPLE-14 </heading>
<heading level="2" id="h-0056">2-(2,4-difluorophenoxy)-5...</heading>
<parse>
<referenceType type="Example">EXAMPLE</referenceType>
<referenceId>14</referenceId>
</parse>
<heading level="2" id="h-0008">3. (4aS,8aR)-2-(1-Acetyl-pipe..</heading>
2-Chloro-5-iodo-1H-benzo[d]imidazole (1)
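The three heading styles above (a keyword plus number, a leading number, and a trailing number in parentheses) could each be recognised by a small pattern. The sketch below is an assumption about how such detection might look; the keyword list and the names `PATTERNS` and `compound_reference` are hypothetical.

```python
import re

PATTERNS = [
    # Keyword style: "EXAMPLE-14", "Example 14"
    ("keyword", re.compile(
        r"^\s*(?P<type>example|compound|intermediate)[\s-]*(?P<id>\d+[a-z]?)\s*$",
        re.IGNORECASE)),
    # Prefix style: "3. (4aS,8aR)-2-(1-Acetyl-pipe.."
    ("prefix", re.compile(r"^\s*(?P<id>\d+[a-z]?)\.\s+\S")),
    # Suffix style: "2-Chloro-5-iodo-1H-benzo[d]imidazole (1)"
    ("suffix", re.compile(r"\S\s+\((?P<id>\d+[a-z]?)\)\s*$")),
]


def compound_reference(heading):
    """Return the reference style, type and identifier found in a heading, or None."""
    for style, pattern in PATTERNS:
        m = pattern.search(heading)
        if m:
            ref_type = m.groupdict().get("type")
            return {
                "style": style,
                "type": ref_type.capitalize() if ref_type else None,
                "id": m.group("id"),
            }
    return None
```

A heading that is only a chemical name, such as the truncated "2-(2,4-difluorophenoxy)-5...", matches none of the patterns, which is why its identifier has to come from the preceding "EXAMPLE-14" heading.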
RSC back-archive mining
RSC back archive
• 1841-1999, 211k articles (available as XML derived from OCR and PDF)
• 2000-, 230k articles (available as born digital XML and PDF)
• Also over 150k electronic supporting information files (mostly PDF, but also Word docs, Excel files, videos etc.)
Legacy document handling
• Chemical properties are often implicitly associated with a compound by being in the same experimental section
Legacy document handling
• Chemical properties are often implicitly associated with a compound by being in the same experimental section
• This requires section detection e.g. a heading and/or a paragraph where a compound is being synthesised
• In the XML for pre-2000 papers all sections on a page run together (including page numbers!), and the text position information is lost.
• …so back to the source PDF
Heading/Paragraph detection workflow
Results (Melting points)

| | 1841-1999 RSC journal articles | 2000-2015 RSC journal articles | 2001-2015 USPTO patent applications |
| --- | --- | --- | --- |
| Compound-value associations | 2,155 | 29,996 | 172,886 |
| Suspicious values (typically mistake in the document) | 70 (3.2%) | 39 (0.13%) | 426 (0.25%) |
| Unique compounds (StdInChI) | 1,830 (84.9%) | 27,956 (93.2%) | 95,140 (55.0%) |
SDF output
[Figure: chemical structure drawing; surviving atom labels: F, B–, O, NH, H3C, O+, H3C, F]
[Figure: chemical structure drawing; surviving atom labels: Cl, Te, Te, F, F, Cl, F, F]
Results (NMR)

| | 1841-1999 RSC journal articles | 2000-2015 RSC journal articles | 2001-2015 USPTO patent applications |
| --- | --- | --- | --- |
| Compound-value associations | 4,972 | 94,610 | 1,295,325 |
| Suspicious values (typically mistake in the document) | 561 (11.3%) | 2,001 (2.11%) | 29,775 (2.30%) |
| Unique compounds (StdInChI) | 2,899 | 48,137 | 655,295 |
Legacy text issues
• OCR errors in important compound names or data
  – chemical names in italics are problematic… key compounds are often in italics!
  – ° is more often than not misinterpreted, e.g. as ' or o
• Tools prefer experimental sections where one compound is being synthesised; qualitatively, older documents are less formalised
4-ChZoro-6-hydroxy-2-methyZamino~yrimidine.-4-Chloro-6-methoxy-2-methylaminopyrim-idine (10g.) was heated on the steam-bath for 30 min. with concentrated hydrochloric acid (60 c.c.). The hydvoxy-cmfiound which separated on cooling was collected and purified by dis- solution in alkali,etc. as above and had m. p. 266" (decornp.) (6.6 g.) (Found c 38.3 ; H 4.1; N 26-2. C,H,0N3C1 requires C 37.6; H 3.8; N 26.3%).
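The misread degree sign in this sample (`m. p. 266"`) suggests a context-sensitive repair: a stray quote mark or letter "o" directly after a number that follows a melting point keyword is probably a degree sign. The sketch below encodes that guess; the pattern and the name `repair_degree` are hypothetical, and the rule is deliberately narrow to avoid touching ordinary words.

```python
import re

# Melting point keyword, a value or range, then a character OCR commonly
# produces in place of the degree sign. The lookahead requires the suspect
# character to end the token, so "120 or" is left alone.
MISREAD_DEGREE = re.compile(
    r"""(m\.\s?p\.\s*)(\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)\s?["'o](?=[\s(.,;)]|$)""",
    re.IGNORECASE,
)


def repair_degree(text):
    """Replace a likely misread degree sign after a melting point value."""
    return MISREAD_DEGREE.sub(r"\g<1>\g<2>°", text)
```

On the sample above this turns `m. p. 266" (decornp.)` into `m. p. 266° (decornp.)`, after which the remaining OCR damage in "decornp." still has to be handled separately.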
Conclusions
• Grammars facilitate rapid extraction and interpretation of chemical properties
• Table extraction is vital to extracting large quantities of certain data e.g. activity data
• Large amounts of high quality data can be extracted from journal articles
• …but extraction from older documents remains very challenging, and over time represents a smaller and smaller percentage of the scientific literature
Acknowledgements
• Igor Tetko (Melting point quality feedback)
• Carlos Cobas (NMR quality feedback)
Funding provided by:
Sci-Mix 8:00pm – 10:00pm Today
Hall C – Boston Convention & Exhibition Center
Chemistry Enabling Chinese, Japanese and Korean Patents
6-aminopyrimidine-2,4,5-triol
Chinese (Hanzi used for each morpheme): 6-氨基嘧啶-2,4,5-三醇 (ammonia radical pyrimidine three alcohol)
Japanese (Phonetic translation to Katakana): 6-アミノピリミジン-2,4,5-トリオール (amino pyrimidine tri ol)
Korean (Phonetic translation to Hangul): 6-아미노피리미딘-2,4,5-트리올 (amino pyrimidine tri ol)
Thank you for your time!
http://nextmovesoftware.com http://nextmovesoftware.com/blog