Leveraging IP data for its scientific content: The future of IP in the era of machine learning / cognitive computing / AI. Computer curation & finding dark data. Stephen Boyer, Ph.D.

Post on 24-Apr-2020


Page 1: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Leveraging IP data for its scientific content

• The future of IP in the era of machine learning / cognitive computing / AI

• Computer curation & finding dark data

Stephen Boyer

Page 2

To be accomplished in the next hour

An appreciation of:

• the utility of IP data for advancing science

• emerging technologies [machine learning]

• the potential to be realized

• the challenges & opportunities

Page 3

“There’s gold in ‘them’ documents”

What’s in them ?

What is it good for ?

What are people doing to mine the information ?

How are they using it ?

Page 4

A bit of history

How we got to now !

What technologies do we have to work with ?

Page 5

Evolving technologies relevant to patents

The Past (1990 - 2005), Recent Past (2006 - 2009), Present / Future (2010 - 2018)

• Easy Web Access
• Text Searching
• Keyword
• Boolean
• BRS (open text)
• Verity
• Lucene / Solr

• Image Downloads
• TIFF --> PDF

• Text Analytics
• IBM UIMA

• Natural Language Processing
• NLP

• Entity Identification
• Co-occurrence Analysis
• Visualization Tools
• Citation Mapping
• W3C Standards
• Federated Search
• Unique Entity IDs: InChI, GeneIDs, other

• Data Availability
• Integration of open source
• Google Patents

• Contextual analysis
• Semantic search
• Network graphs
• Relationship detection
• Advanced grammar analysis

• Machine Learning
• Google Patents
• Neural Networks
• Image Analysis
• OSRA / CLiDE
• Automated analysis
• Machine translation

Page 6

Evolving Analytics, Visualization & Knowledge

Availability of bulk machine-readable data

Understanding the content of the documents

Why bother ?

Making documents “machine-readable”

Understanding the format of the documents

• Sections
• Tables
• Citations
• Data types
• Etc.

Page 7

3.1 million patent applications worldwide in 2016

Source = Francis Gurry, WIPO, Ambassador Briefing 2018

How many patent documents are there ?

Page 8

Distribution of Global Patenting has Shifted in Recent Decades

Source = Francis Gurry, WIPO, Ambassador Briefing 2018

Page 9

Machine learning to analyze & interpret the components of a document

Example: Work done by Peter Starr, IBM Zurich labs

Step 1: Making documents “machine-readable”

Page 10

PDF parser PDF interpretation Semantic representation

PDF is the pervasive language of the enterprise

Step 1: Making documents “machine-readable”

Page 11

Cross-mapping citation data between publications and patents

PatentsO-References

Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.

journal articles cited in patents

Page 12

PatentsO-References

Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.

~6,000,000 / 40,000,000 patent citations map to journal articles other than patents

Citation mapping

In ~10,500 unique journals

~175,000,000 patent citations map to other patents

Cross-mapping citation data between publications and patents

Page 13

PatentsO-References

Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.

~6,000,000 / 40,000,000 patent citations map to journal articles other than patents

Citation mapping

In ~10,500 unique journals

~175,000,000 patent citations map to other patents

?

Cross-mapping citation data between publications and patents

Page 14

PatentsO-References

Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.

Citation mapping

In ~10,500 unique journals

?

Cross-mapping citation data between publications and patents

Funding• NIH• NSF • EU

$

Page 15

Using computers to understand what’s in the documents

Annotating the documents – NLP, entity identification

Visualizing the content

A brief review of work done at IBM & by a host of others

Step 2 : Computer Curation of Content

Page 16

Evolving Analytics, Visualization & Knowledge

Availability of bulk machine-readable data

Understanding the content of the documents

Why bother ?

Making documents machine-readable

Understanding the format of the documents

• Sections
• Tables
• Citations
• Data types
• Etc.

Analysis of the content

• NLP
• Entity identification
• Contextual Analysis
• Table data
• Relationships
• Normalization

Page 17

Early Text Mining Technologies

entity identification

a) (2R/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d, 1H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H), 1.49 (s, 9H).

b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1H-NMR (CD3OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H).

EXAMPLE 24

(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester

0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by

What is this compound ??

[Figure: 2D chemical structure drawing]

Page 18

Summary of overall text analysis operations for chemistry (HMM, CRF, CFG)

Paper --> Words --> Chemical Names --> Name=Structure --> SMILES String --> 2D Structure --> 3D structure compute --> 300 properties per molecule

Chemical Names: dictionary of the English language, minus the dictionary of desired entities

Example: toluene (methyl benzene), SMILES String [CC1=CC=CC=C1]

Computational Resources: Blue Gene-enabled
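The dictionary-subtraction idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list and the name-to-SMILES table below are tiny hypothetical stand-ins, and a production system would use full dictionaries plus the statistical and grammar models (HMM, CRF, CFG) named on the slide.

```python
import re

# Hypothetical mini-dictionaries for illustration only.
ENGLISH_WORDS = {"the", "residue", "was", "dissolved", "in", "and",
                 "stirred", "for", "one", "hour"}
NAME_TO_SMILES = {
    "toluene": "CC1=CC=CC=C1",         # SMILES string shown on the slide
    "methyl benzene": "CC1=CC=CC=C1",  # synonym for the same structure
}

def candidate_entities(text):
    """Keep the tokens that are NOT ordinary English words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in ENGLISH_WORDS]

def resolve(names):
    """Map candidate names to SMILES where the lookup table knows them."""
    return {n: NAME_TO_SMILES[n.lower()]
            for n in names if n.lower() in NAME_TO_SMILES}

sentence = "The residue was dissolved in toluene and stirred for one hour."
candidates = candidate_entities(sentence)
structures = resolve(candidates)
print(candidates, structures)
```

Whatever survives the subtraction is only a candidate; the name-to-structure step is what confirms that a token is a chemical.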

Page 19

Patent document’s chemical report molecular timeline & chemical name-to-structure mouse-over

Page 20

• Chemical yields

• Quantities

• Physical attributes: melting points, boiling points

• Solvents and temperatures

• Spectral data

• NMR data
• IR data
• Mass spec data
• Assay data

Text-mining technologies identify in-document properties

Source courtesy of Dr. Roger Sayles

Page 21

Example of text mining from patent & scientific literature

Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK

> 175K Compound-value associations

Page 22

Signals / Triggers for identifying specific entities

Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.

Example of extracting NMR & MS data from US patents

What about BP?

Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
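As a rough illustration of how such signals or triggers can work, the sketch below uses regular expressions keyed on "m.p.", "mg, N%" and "m/z" to pull the melting point, isolated yield and mass-spec value out of the tail of the Step 7 passage. The patterns are assumptions written for this one passage, not the extraction rules used by the groups cited on the slide.

```python
import re

# Tail of the Step 7 passage quoted on this slide (US patent text).
text = ("(100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; "
        "m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), "
        "7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), "
        "7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); "
        "MS (ESI) m/z 409.15 [C25H20N4O2+H]+.")

NUM = r"(\d+(?:\.\d+)?)"
MP = re.compile(r"m\.p\.\s*" + NUM + r"\s*-\s*" + NUM + r"\s*°\s*C")
YIELD = re.compile(r"\(" + NUM + r"\s*mg,\s*" + NUM + r"%")
MS = re.compile(r"m/z\s*" + NUM + r"\s*\[([A-Za-z0-9]+)\+H\]\+")

def extract_properties(passage):
    """Return the properties whose trigger pattern is found in the passage."""
    props = {}
    if (m := MP.search(passage)):
        props["mp_low_c"], props["mp_high_c"] = float(m[1]), float(m[2])
    if (m := YIELD.search(passage)):
        props["product_mg"], props["yield_percent"] = float(m[1]), float(m[2])
    if (m := MS.search(passage)):
        props["ms_mz"], props["ms_formula"] = float(m[1]), m[2]
    return props

props = extract_properties(text)
print(props)
```

A boiling point (the "What about BP?" question on the slide) would need its own trigger, since this passage reports none.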

Page 23

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

Re-creating spectral data from text data

text input

spectral output

Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
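A first step toward re-creating a spectrum from text is turning the peak listing into structured data. The following is a minimal sketch, assuming the listing format used on this slide (shift or shift range, then multiplicity and proton count in parentheses); the regular expression and field names are illustrative, and plotting line shapes from the parsed peaks is left out.

```python
import re

# 1H NMR peak listing quoted on this slide.
nmr_text = ("2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), "
            "4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), "
            "4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
            "7.18–7.94 (m, 11H, ArH)")

# shift, optional range end, multiplicity, number of protons
PEAK = re.compile(r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?\s*\((\w+),\s*(\d+)H")

def parse_peaks(text):
    """Return one dict per peak; a range is reduced to its midpoint."""
    peaks = []
    for low, high, mult, n_h in PEAK.findall(text):
        start, end = float(low), float(high or low)
        peaks.append({
            "shift_ppm": round((start + end) / 2, 2),
            "multiplicity": mult,
            "protons": int(n_h),
        })
    return peaks

peaks = parse_peaks(nmr_text)
total_protons = sum(p["protons"] for p in peaks)
print(len(peaks), total_protons)
```

Coupling constants such as "J = 4.8 Hz" are deliberately ignored here because they are not followed by a parenthesised multiplicity.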

Page 24

NMR data extracted by year of publication

[Figure: cumulative distinct NMR extracted (y-axis, 0 to 2,500,000) versus year of publication (x-axis, 1976 to 2014); series: USPTO grants, USPTO applications]

Documenting the increase in data with time

From 1976-2014 USPTO NMR data

H 975543
C 56536
unknown 44306
F 9429
P 3241
B 91
Si 62
Sn 22
Se 11
N 8

Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK

Page 25

Tracking technology improvement with time: example of NMR

[Figure: 1H-NMR frequency (y-axis, 0 MHz to 450 MHz) versus year of patent filing (x-axis, 1976 to 2014)]

Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK

Page 26

Extracting chemical reactions from text

Results from Drs. Roger Sayles & Daniel Lowe, NextMove

Page 27

Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.

US20150038506

Reaction Extraction System

Making dark data useful example: extracting chemical reactions from text

Results from Drs. Roger Sayles & Daniel Lowe, NextMove

Page 28

Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.

US20150038506

Results from Drs. Roger Sayles & Daniel Lowe, NextMove

Making dark data useful

Page 29

The Growing Number of Chemical Reactions Derived from the Patent Literature

https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/

Millions of reaction SMILES made publicly available

thanks to Daniel Lowe & Roger Sayles

[Figure: number of chemical reactions (y-axis) versus year of patent filing (x-axis)]

Source = Roger Sayles & Daniel Lowe, Next Move
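For readers new to the notation, a reaction SMILES has the general form reactants>agents>products, with individual molecules separated by dots. The sketch below splits such a string into its three parts. The example reaction (acetic acid plus ethanol giving ethyl acetate and water) is an illustration chosen here, not a record from the published data set, and the released files may carry extra fields that this sketch does not handle.

```python
def parse_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")

    def molecules(part):
        # Molecules within one part are dot-separated; skip empty parts.
        return [m for m in part.split(".") if m]

    reactants, agents, products = (molecules(p) for p in parts)
    return {"reactants": reactants, "agents": agents, "products": products}

# Illustrative esterification, with no agents listed.
rxn = "CC(=O)O.OCC>>CC(=O)OCC.O"
parsed = parse_reaction_smiles(rxn)
print(parsed)
```

Keeping reactants, agents and products separate is what makes later steps, such as classifying the reaction type or relating yield to scale, straightforward to compute.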

Page 30

Computer curation: classifying patents from their technical content

What does this enable that could not be done before ?

Page 31

Categorization of chemical reactions from patents

Results from Drs. Roger Sayles & Daniel Lowe, NextMove

10 most frequent reactions

Classifying patents via their scientific content

https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/

Page 32

What does this enable that could not be done before ? Analyze scale-vs-yield

[Figure: % yield (y-axis, ticks at 20%, 40%, 60%, 80%) versus mass of product in grams (x-axis)]

Reactions of greatest interest for manufacturing: high yield, large scale

Results from Drs. Roger Sayles & Daniel Lowe, NextMove

16,355 Suzuki coupling reactions extracted from 2001 – 2013 US Applications
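A scale-vs-yield analysis needs a product mass and a yield for each reaction, and both can be cross-checked from the extracted quantities. As a hedged worked example, the sketch below recomputes the yield of the Step 7 reaction quoted earlier in the deck (100 mg of product C25H20N4O2 from 0.88 mmol of the starting acid, reported as 27%). The atomic masses are standard average values, and treating the acid as the limiting reagent is an assumption based on the quoted amounts.

```python
import re

# Standard average atomic masses (g/mol) for the elements needed here.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def molar_mass(formula):
    """Molar mass of a simple formula such as 'C25H20N4O2'."""
    return sum(ATOMIC_MASS[element] * int(count or 1)
               for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula))

def percent_yield(product_mg, product_formula, limiting_mmol):
    """Yield = mmol of product obtained / mmol of limiting reagent."""
    product_mmol = product_mg / molar_mass(product_formula)
    return 100.0 * product_mmol / limiting_mmol

mass = molar_mass("C25H20N4O2")
computed = percent_yield(100.0, "C25H20N4O2", 0.88)
print(round(mass, 2), round(computed, 1))
```

The computed value lands within about one percentage point of the reported 27%, which is the kind of consistency check that can flag mis-extracted quantities before they enter a plot.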

Page 33

Results from Drs. Roger Sayles & Daniel Lowe, NextMove

16,355 Suzuki coupling reactions extracted from 2001 – 2013 US Applications

What does this enable that could not be done before ? Analyzing frequency-vs-time

[Figure: Suzuki couplings as a percentage of reactions per year (y-axis) versus year of patent filing (x-axis)]

Page 34

Relationships

Page 35

Entity types identified that were associated with structures derived from patents

Source = Roger Sayles – NextMove

Page 36

The Number of Biological Activities Derived from Patents vs the Scientific Literature

Source = Roger Sayles

Page 37

Evolving Analytics, Visualization & Knowledge

Availability of bulk machine-readable data

Understanding the Documents

Understanding what’s in the documents

Why bother ?

The format of the document

• Sections
• Tables
• Citations
• Data types
• Etc.

Analysis of the content

• NLP
• Entity identification
• Contextual Analysis
• Table data
• Relationships
• Normalization

• Integration with Other Data
• Development of feature spaces
• Seeing the unobvious
• Learning
• Predicting

Page 38

Patent data alone is insufficient

Page 39

• PubChem: CID, SID, AID, InChIKey, CAS, Synonyms, PubMed, Patents
• NLM MeSH: Chemical: Synonyms, MeSH DUI; Disease: MeSH DUI
• FDA SRS: Drug: FDA SPL, FDA NDC; Ingredient: UNII, InChIKey
• NCBI: Protein, Gene, CDD, Taxonomy, PubMed, BioSystems
• NLM HSDB: Pharmacology, Toxicity, Metabolism, Properties, Manufacture
• VA NDF-RT
• NLM RxNorm
• FDA/NLM DailyMed
• NCI Metathesaurus
• Disease Ontology
• Protein Ontology
• Gene Ontology
• DrugBank: Drug: PubChem, ATC; Target: Uniprot, GeneCard
• KEGG: Drug: PubChem, ATC; Target: Gene; Disease: OMIM
• ChEMBL: Drug: ATC, ChEBI, TradeName; Compound: Pharmacology
• ChEBI: Source: IntEnz, KEGG, PDBeChem, ChEMBL
• IUPHAR-DB: Drug: Classification; Target: Nomenclature, Pharmacology
• IBM: Patent, PubMed

Legend: Terminology/Ontology; Public Database; Database + Terminology

Integration with Open Source Data

Drs. Evan Bolton & Gang Fu, NIH
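Integration of this kind depends on shared unique identifiers such as the InChIKey, which appears in more than one of the sources on this slide. The sketch below shows the basic join, assuming compounds extracted from patents already carry an InChIKey. The patent labels are hypothetical placeholders and the public record is a one-entry stand-in for a real database (the aspirin InChIKey and PubChem CID are used as a familiar example).

```python
# Hypothetical patent-side records; only the InChIKey matters for the join.
patent_mentions = [
    {"patent": "HYPOTHETICAL-PATENT-1",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"patent": "HYPOTHETICAL-PATENT-2",
     "inchikey": "NOT-IN-PUBLIC-DATABASE"},
]

# One-entry stand-in for a public database keyed by InChIKey.
public_db = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"name": "aspirin", "pubchem_cid": 2244},
}

def integrate(mentions, database):
    """Attach public annotations to patent mentions that share an InChIKey."""
    linked, unlinked = [], []
    for mention in mentions:
        record = database.get(mention["inchikey"])
        if record is None:
            unlinked.append(mention)
        else:
            linked.append({**mention, **record})
    return linked, unlinked

linked, unlinked = integrate(patent_mentions, public_db)
print(linked, unlinked)
```

Compounds left in the unlinked list are arguably the interesting ones: structures disclosed in patents that no public database yet describes.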

Page 40

NIH PubChem RDF – Triple & Entity Counts

https://pubchem.ncbi.nlm.nih.gov/rdf/ Drs. Evan Bolton & Gang Fu, NIH

Integration with Open Source Data

Page 41

What are “Cognitive Technologies” ?

Page 42

“Big Data, (Machine Learning, Neural Networks, Cognitive Computing, AI) is like teenage sex:

Everyone talks about it, nobody really knows how to do it,

Everyone thinks everyone else is doing it.

So everyone claims they are doing it….”

Source: Dan Ariely , Duke University

Machine Learning, Neural Networks, Cognitive Computing, AI

Page 43

Google “A mostly complete chart of Neural Networks”

A mostly complete chart of Neural Networks

Page 45

Accessible Information

Usefulness starts with access to the information.

Transformation Apply business logic,

human curation, and/or machine learning

Useful Information

Solving user problems

Making IP Data Accessible and Useful: What Google is doing

Slide courtesy of Ian Wetherbee, Google

The critical first step in making patent information useful is open access to machine-readable bulk data

Page 46

• Machine Classification
• Document Similarity
• Machine Translation
• ….

http://media.epo.org/play/gsgoogle2017

Page 47

Google’s machine translation

[Figure: translation quality (y-axis, 0 to 6, where 6 = perfect translation) by translation model (human, neural (GNMT), phrase-based (PBMT)) for English>Spanish, English>French, English>Chinese, Spanish>English, French>English, Chinese>English]

Slide courtesy of Ian Wetherbee, Google

47,710,923 patents full-text translated

Page 48

Accessible Information

Usefulness starts with access to the information.

The advantages of making patents accessible & useful

Slide courtesy of Ian Wetherbee, Google

Enables the private sector to transform and improve information, benefitting the patent system

Improves the transparency into patent quality and the patent system

Improves transparency into legal rights

Empowers the public to obtain the full benefits of the disclosure

“Open machine-readable data is the critical first step in making patent information useful” *

Page 49

An Example

Finding compounds that might fight cancer

What are people doing with this data ?

Page 50

Pharma asks

1. What genes regulate xyz condition ?
2. What compounds regulate those xyz genes ?

An approach to answering these questions : chemical ontologies

Other approaches include
• Computational chemical modeling
• Similarity Ensemble Approach (SEA)
• Literature-based discovery
• Experimental high-throughput screening

Page 51

Chemical Ontologies

But first some chemistry

Work done in collaboration with:

University of Alberta: Prof David Wishart & Yannick Feunang

Ontochem: Prof Lutz Weber

Page 52

Physical • Examples: Molecular Weight, Melting point, Boiling Point

Molecular• Examples: Steroid, Prostaglandin, Amino Acid, Alkene, Imidazole

Functional • Examples: Anti-Inflammatory, Explosive, Refrigerant, Pesticide

Legal attributes • Patented for a purpose

Molecules have different types of attributes

Page 53

Example of a chemical ontology

Consider this molecule

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 54

Carboxylic acid

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 55

Carboxylic acid

Benzoic acid

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 56

Carboxylic acid

Benzoic acid

Phenol

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 57

Carboxylic acid

Benzoic acid

Phenol

Hydroxy group

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 58

Carboxylic acid

Benzoic acid

Phenol

Hydroxy group

Azo

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 59: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Carboxylic acidBenzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 60: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Carboxylic acidBenzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 61: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Carboxylic acidBenzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Sulfone

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 62: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Carboxylic acidBenzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Sulfone

Sulfonamide

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 63: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Carboxylic acidBenzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Sulfone

Sulfonamide

Azobenzene

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 64: Leveraging IP data for its scientific content...2018/04/03  · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing

Carboxylic acidBenzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Sulfone

Sulfonamide

Azobenzene

Benzene

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 65:

Carboxylic acid

Benzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Sulfone

Sulfonamide

Azobenzene

Benzene

Molecular attributes (labels)
• Is a Benzoic acid
• Is a Carboxylic acid
• Is a Carbonyl cpd
• Is a Phenol
• Is an Azobenzene
• Is an Azo compound
• Is a Sulfone
• Is a Sulfonamide
• Is a Pyridine
• Is a Benzene
• Is a Hydroxy

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Page 66:

Carboxylic acid

Benzoic acid

Phenol

Hydroxy group

Azo

Pyridine

Sulfone

Sulfonamide

Azobenzene

Benzene

Functional attributes
• Is used for the treatment of Crohn's disease
• Is used for the treatment of rheumatoid arthritis
• Is used for the treatment of ulcerative colitis

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

Molecular attributes (labels)
• Is a Benzoic acid
• Is a Carboxylic acid
• Is a Carbonyl cpd
• Is a Phenol
• Is an Azobenzene
• Is an Azo compound
• Is a Sulfone
• Is a Sulfonamide
• Is a Pyridine
• Is a Benzene
• Is a Hydroxy

Page 67:

[H][C@@]12C[C@@]3([H])[C@]4([H])C[C@]([H])(F)C5=CC(=O)C=C[C@]5(C)[C@@]4(F)[C@@H](O)C[C@]3(C)[C@@]1(OC(C)(C)O2)C(=O)COC(C)=O

SMILES String

ClassyFire OntoChem

ClassyFire: Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); 1,3-dioxolanes (9); 11-beta-hydroxysteroids (9); Dioxolanes (9); 3-oxo delta-1,4-steroids (10); Alpha-acyloxy ketones (10); Delta-1,4-steroids (10); 11-hydroxysteroids (12); Gluco/mineralocorticoids, progestogens and derivatives (13); Pregnane steroids (13); 20-oxosteroids (15); Acetate salts (22); 3-oxosteroids (26); Oxosteroids (27); Carboxylic acid salts (30); Hydroxysteroids (32); Cyclic ketones (45); Alpha amino acid amides (73); Pyrrolidines (80); D-alpha-amino acids (85); Cyclic ketones (45); Acetals (50); Steroids and steroid derivatives (51); Alkyl fluorides (53); Alkyl halides (67); Cyclic alcohols and derivatives (86); Ketones (101); Organofluorides (128); Carboxylic acid esters (139); Secondary alcohols (187); Oxacyclic compounds (192); Lipids and lipid-like molecules (209); Organohalogen compounds (272); Ethers (393); Alcohols and polyols (395); Carboxylic acid derivatives (423); Carboxylic acids and derivatives (548); Carbonyl compounds (598); Organic acids and derivatives (633); Organoheterocyclic compounds (651); Organooxygen compounds (856); Organic compounds (978); Chemical entities (989); Hydrocarbon derivatives (995);

OntoChem: 17-deoxy-prednisolones (6); halohydrins (6); prednisolones (6); ethanoic acid esters (20); methyl esters (20); acetals (37); alkyl fluorides (56); cyclic ketones (61); natural product derivatives (92); fluorine compounds (126); alkene derivatives (172); polycyclic compounds (184); oxacyclic compounds (190); secondary alcohols (202); carboxylic acids (249); formic acid derivatives (559); lipophilic molecules (642); lipinski molecules (785); bioavailable molecules (867); oxygen compounds (891); small molecules (949); carbon compounds (974); hetero compounds (978);

Generating molecular attributes via SMILES
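The label lists on this slide are plain "name (number)" pairs separated by semicolons, so they can be turned into data with very little code. The following is a minimal sketch, assuming that format holds for every entry. The function name `parse_labels` is hypothetical, and the number is kept as an opaque integer because the slide does not say what it counts.

```python
import re

# Matches one "label name (123)" entry; the name may contain commas or digits.
ENTRY = re.compile(r"^(?P<name>.+?)\s*\((?P<num>\d+)\)$")

def parse_labels(text):
    """Split a 'name (n); name (n); ...' string into a {name: n} dict."""
    labels = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        match = ENTRY.match(chunk)
        if match is None:
            raise ValueError(f"unrecognised entry: {chunk!r}")
        labels[match.group("name")] = int(match.group("num"))
    return labels

# Short excerpts in the same format as the slide's two label lists.
ontochem = parse_labels("halohydrins (6); prednisolones (6); acetals (37);")
classyfire = parse_labels("Halohydrins (7); 1,3-dioxolanes(9); Acetals (50);")

# Case-insensitive comparison of the label names produced by the two tools.
shared = {n.lower() for n in ontochem} & {n.lower() for n in classyfire}
print(sorted(shared))
```

Comparing names this way only catches labels that the two ontologies spell identically; labels that mean the same thing under different names would need a mapping that the slides do not provide.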

Page 68:

Obtain a database of chemical compounds & their SAR

• ChEMBL dB of 1.4 M compounds and their bioactivity towards targets (this database was provided by EBI)

• SMILES strings

• UOA ClassyFire (CF) SW produces the CF labels (this processing was provided by U of Alberta)

• OntoChem (OC) SW produces the OC labels (this processing was provided by OntoChem)

• ChEMBL dB of 1.4 M compounds and their bioactivity towards targets, including CF + OC chemical labels (this processing was provided by IBM Research)

We call this the CHEMBL ontology dB

Page 69:

MDM2 raw output

Out of 1.4 M molecules, ~558 had activity towards MDM2, but only 27 had activity less than 30 nM
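The filtering step on this slide can be sketched as a single pass over activity records: keep the records for the target, then keep those below the potency cutoff. The record layout (`compound_id`, `target`, `value_nm`) and the sample values are invented for illustration and are not taken from ChEMBL.

```python
from typing import NamedTuple

class Activity(NamedTuple):
    compound_id: str   # hypothetical identifier, not a real ChEMBL ID
    target: str
    value_nm: float    # measured potency in nanomolar (lower = more potent)

def active_against(records, target, cutoff_nm):
    """Return IDs of compounds with at least one measurement below cutoff_nm."""
    hits = set()
    for rec in records:
        if rec.target == target and rec.value_nm < cutoff_nm:
            hits.add(rec.compound_id)
    return hits

records = [
    Activity("C1", "MDM2", 12.0),
    Activity("C2", "MDM2", 450.0),
    Activity("C3", "JAK3", 5.0),
    Activity("C1", "MDM2", 80.0),   # a second, weaker measurement for C1
]

training_set = active_against(records, "MDM2", cutoff_nm=30)
print(sorted(training_set))  # ['C1']
```

Whether a compound with conflicting measurements should count as active is a design choice; this sketch accepts it if any single measurement passes the cutoff.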

Page 70:

Scoring of molecular labels for MDM2-produced training set of 27 compounds [label cutoff = 20, activity cutoff = 30, corpus count cutoff = 200K]

Score = (observed count - expected count)² / expected count
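The score has the form of a single chi-squared term. A minimal sketch follows, assuming that the expected count for a label is the training-set size multiplied by the label's frequency in the whole corpus. The slide does not define the expected count, so that choice, the function name `label_score`, and the numbers are illustrative only.

```python
def label_score(observed, training_size, corpus_count, corpus_size):
    """(observed - expected)^2 / expected for one label.

    expected is ASSUMED to be training_size * corpus_count / corpus_size,
    i.e. how often the label would appear in the training set by chance.
    """
    expected = training_size * corpus_count / corpus_size
    if expected == 0:
        raise ValueError("label never occurs in the corpus")
    return (observed - expected) ** 2 / expected

# Illustrative numbers: a label carried by 20 of the 27 training compounds
# and by 140,000 of 1,400,000 corpus compounds.
score = label_score(observed=20, training_size=27,
                    corpus_count=140_000, corpus_size=1_400_000)
print(round(score, 2))
```

Because the difference is squared, a label that is rarer than expected also scores highly, so a pipeline built on this score would probably also check the sign of observed minus expected.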

Page 71:

MDM2 Raw output

Classyfire (CF) OntoChem (OC)

Page 72:

Comparison of the 100 compounds identified by CF with the 100 compounds identified by OC for MDM2 with label cut off = 10 labels & assay minimum = 30 & corpus count cut off = 300K

57 of the predicted compounds are in common

Overlap based on ChEMBL IDs of predicted compounds
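Comparing two prediction lists by ID is a set intersection. The sketch below uses placeholder IDs rather than the actual predictions, and the helper name `overlap` is hypothetical; it also reports the Jaccard index as one possible summary of agreement.

```python
def overlap(ids_a, ids_b):
    """Return the shared IDs and the Jaccard index of two ID collections."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Placeholder IDs, not real predictions.
cf_hits = ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL_D"]
oc_hits = ["CHEMBL_C", "CHEMBL_D", "CHEMBL_E", "CHEMBL_F"]

shared, jaccard = overlap(cf_hits, oc_hits)
print(sorted(shared), round(jaccard, 3))
```

For the numbers on this slide, 57 shared compounds out of two lists of 100 give a Jaccard index of 57 / (100 + 100 - 57) = 57 / 143, or about 0.40.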

Page 73:

A sample from the 57 compounds identified to have potential MDM2 Activity by both CF & OC

Page 74:
Page 75:

IC50 value for Mdm2/P53 binding was calculated by sigmoid fitting using Prism (GraphPad Software). The results are shown below.

US 2009/0312310 A1: this [240 page] patent application had 26 compounds with reported assay data for MDM2

Page 76:

IC50 value for Mdm2/P53 binding was calculated by sigmoid fitting using Prism (GraphPad Software). The results are shown below.

US 2009/0312310 A1

• Example 18
• Example 39
• Example 93
• Example 97
• Example 111
• Example 126
• Example 155
• Example 180
• Example 220

Page 77:

A sample from the 57 compounds identified to have potential MDM2 Activity by both CF & OC

Page 78:

[Figure: compounds and targets represented as feature vectors]

Compound Attributes: compound 1, compound 2, compound 3, … each have a Feature Vector over attributes A B C D E … Y Z (example values: 1 2 0 0 4 5 2 0 7 3 0 1 1 2)

• Physical Related Attributes: Structure (Mol File / SMILES), pKa, Log P, …

• Functional Attributes: Target-Assay Pairs (Target 1, Target 2; MDM2, JAK3, SGLT2) with EC50 and Ki, from Primary Assays and Secondary Assays

• Other Attributes: LD50, Anti-target-Assay Pairs

Target Attributes: Target 1, Target 2, Target 3, … each have a Feature Vector over attributes A B C D E … Y Z
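A feature vector of this kind can be built by giving every attribute a fixed column and filling in one row per compound. The sketch below is a simplification under stated assumptions: it uses only label membership (1 or 0), whereas the values on the slide look like counts or measurements, and the compound names, labels, and helper names are invented for illustration.

```python
def build_vocabulary(label_sets):
    """Assign each distinct label a fixed column index (sorted for stability)."""
    all_labels = set()
    for labels in label_sets:
        all_labels.update(labels)
    return {label: i for i, label in enumerate(sorted(all_labels))}

def to_vector(labels, vocabulary):
    """Binary vector: 1 where the compound carries the label, else 0."""
    vector = [0] * len(vocabulary)
    for label in labels:
        vector[vocabulary[label]] = 1
    return vector

# Hypothetical compounds and labels, for illustration only.
compounds = {
    "compound 1": {"phenol", "sulfonamide", "pyridine"},
    "compound 2": {"phenol", "carboxylic acid"},
    "compound 3": {"steroid"},
}

vocab = build_vocabulary(compounds.values())
vectors = {name: to_vector(labels, vocab) for name, labels in compounds.items()}
for name, vec in vectors.items():
    print(name, vec)
```

The same two helpers would apply to targets: replace the compound labels with target attributes and each target gets its own row over a shared set of columns.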

Page 79:

Goal oriented learning

Cost / reward

Act

Predict an action which will reduce cost and/or increase reward

Page 80:

BIG ISSUES

1) Obfuscation

2) Access to & integration of worldwide data
• Open access to bulk machine-readable data

3) Incentives & quotas

4) Algorithms and Bias

Page 81:

WHAT IS THIS?

= a soccer ball

= a spherical recreational device

BIG ISSUES OBFUSCATION

Page 82:

Source = Dr. George Papadatos, EMBL-EBI

European Molecular Biology Laboratory EMBL & EBI

Markush structures are daunting and the situation is getting worse

Page 83:

BIG ISSUES

Access to and integration of WW Data

Page 84:

• Chinese, Japanese and Korean (CJK) patents now account for over half of all national patent filings and hence are of increasing importance to patent informatics.

To demonstrate the importance of this …

• 1,740,040 distinct compounds were extracted from ~63,000 Korean patent applications, spanning 1990 to March 2015

• Of these, ~230,770 compounds were novel to Korean patents when compared to compounds derived from US data (spanning 1976 to March 2015)

• In the period 2006-2014, 46% of compounds appeared in a KIPO filing before a USPTO filing.
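Both statistics in these bullets reduce to simple operations on per-office records: a set difference for novelty and a date comparison for priority. The sketch below assumes each office's data is a mapping from a compound key to its earliest filing date, and that the percentage is taken over compounds seen in both offices. The slide does not state the denominator, so that is an assumption, and all keys and dates are invented.

```python
from datetime import date

def first_filed_in(earliest_a, earliest_b):
    """Fraction of compounds seen in both offices that appeared in A first."""
    both = earliest_a.keys() & earliest_b.keys()
    if not both:
        return 0.0
    a_first = sum(1 for c in both if earliest_a[c] < earliest_b[c])
    return a_first / len(both)

# Hypothetical compound keys mapped to their earliest filing date per office.
kipo = {
    "cmpd1": date(2008, 3, 1),
    "cmpd2": date(2010, 6, 15),
    "cmpd3": date(2012, 1, 9),
}
uspto = {
    "cmpd1": date(2009, 1, 20),
    "cmpd2": date(2010, 2, 2),
}

novel_to_kipo = kipo.keys() - uspto.keys()   # never seen in the US data
share_kipo_first = first_filed_in(kipo, uspto)

print(sorted(novel_to_kipo), share_kipo_first)
```

In practice the compound key would need to be a canonical structure identifier so that the same molecule drawn or named differently in two offices still matches.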

The Importance of Foreign Patent Filings

Notes from Drs. D. Lowe & R. Sayle

Page 85:

An Example of extracting chemical entities from CJK patents

Notes from Drs. D. Lowe & R. Sayle

Page 86:

Chemicals from Chinese patents

Attempts to process Chinese patent documents

Extracting chemical structures from Chinese patents…

Work done in collaboration with Dr. R. Sayle

Page 87:

Final Thoughts

Thanks to

• everyone in this room

• the scientific community

• especially those whose data was presented

• society in general

for providing us with these important

“Adjacent Possibilities”

Page 88:

Source – J Kreulen

IBM Almaden Research Center, San Jose, California