Mining Drug Targets, Structures and Activity Data Using Open Full-Text Patent Sources and Web Tools

Christopher Southan

ChrisDS Consulting, Göteborg, Sweden

Prepared for BioIT, Boston, April 2012

Track 11, Open Source Solutions, Wednesday, 13:45

Introduction

Key Relationships Extractable from Patents and Papers

Document Assay Result Compound Target

MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGAPLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGYYVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQRQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASVGGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQDLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKAASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLMGEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSSTGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRTAAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICALFMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK

2011 PMID 21569515

Important "bag of targets" exceptions (e.g. bacterial/parasite whole cells)

2010 doi:10.1007/978-3-642-15120-0_9

The Good News: Patent Mining Utility

• Novel bioactive chemical structures related to drug discovery exceed those in journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in vivo data.
• ~ 70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~ 1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels.
• IP exploitable for Neglected Tropical Disease research becoming "open".

The Bad News: Patent Mining Can be Tough

• High-specificity retrieval of relevant documents is difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
– Qualitative or binned assay results
– Structure-to-data links non-obvious, patchy or absent
– Less than 50% of titles include target names
– The "hiding the lead and core structures" game
– Blunderbuss disease and use exemplifications
– Tense ambiguity (i.e. "could be" vs. "was" done)
• Quality judgments are difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise

Reasons for Rolling-your-own Patent Chemistry and Data Extraction

• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL, journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways complementary to commercial databases

Open Sources and Tools Overview

• Searching metadata, abstracts and text
– Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
– Open full-text: FreePatentsOnline, Google Patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up

(those below not included in this presentation but relevant)
• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye (Murray-Rust group), organic-reaction.com (Laconde et al), SCRIPDB (Jurisica group)

(n.b. Google should give URLs for all these source and tool names)

So What’s in PubChem?

PubChem Patent-derived Content ~6 million

• ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI pharmaceutical patents plus some journal extractions

• ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~ 3.5 million of these are Lipinski-ROF compliant
• ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~ 70% of these are Lipinski-ROF compliant
• ~ 90% of these have assay data
• ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
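The ROF compliance counts above rest on a simple property filter. A minimal sketch follows, assuming the standard rule-of-five thresholds (molecular weight ≤ 500, logP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10). The `Properties` class, the function names and the zero-violation default are illustrative assumptions, not PubChem's own filter definition.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Pre-computed descriptors for one structure (hypothetical container)."""
    mol_weight: float       # Daltons
    logp: float             # calculated octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rof_violations(p: Properties) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    checks = [
        p.mol_weight > 500,
        p.logp > 5,
        p.h_bond_donors > 5,
        p.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_rof_compliant(p: Properties, max_violations: int = 0) -> bool:
    """Strict by default; pass max_violations=1 for the more lenient reading."""
    return rof_violations(p) <= max_violations


lead_like = Properties(mol_weight=350.0, logp=3.2, h_bond_donors=2, h_bond_acceptors=5)
oversized = Properties(mol_weight=650.0, logp=6.1, h_bond_donors=2, h_bond_acceptors=5)
```

The descriptor values would normally come from a cheminformatics toolkit or from the computed properties listed on a PubChem compound page.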

Chemistry > Patents in PubChem

You found a CID, what are the Patent and Journal links?

PubChem > BioAssay/ChEMBL > CiteExplore/PubMed/ > PDB

Patent Links from SLING and IBM

PubChem > SureChem > Patent > Structure > Data > Target

Target-Centric Patent Searching

Synonym Recall

• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract "Beta secretase" = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin = 1383
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
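The synonym-expanded queries above can be assembled programmatically when several targets need the same treatment. The sketch below is a hypothetical helper, not part of any portal's API. It quotes multi-word terms and wraps the OR group in parentheses, which is an assumption about how a given search portal parses operator precedence (the slide query is written without parentheses).

```python
from typing import Iterable, Optional


def build_query(synonyms: Iterable[str], required: Optional[str] = None) -> str:
    """Join target synonyms with OR, optionally AND-ing one required term."""

    def quote(term: str) -> str:
        # Multi-word names must be quoted to be searched as a phrase.
        return f'"{term}"' if " " in term else term

    seen = set()
    unique = []
    for term in synonyms:
        key = term.lower()
        if key not in seen:          # drop case-insensitive duplicates
            seen.add(key)
            unique.append(term)

    ored = " OR ".join(quote(t) for t in unique)
    if required:
        return f"({ored}) AND {quote(required)}"
    return ored


bace_synonyms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
query = build_query(bace_synonyms, required="inhibitors")
```

The resulting string is pasted into the title + abstract field of the portal being searched.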

Target Query > Patent Retrieval from Espacenet

Linking Examples to Data in the Patent

Extracting Chemical Structures

IUPAC-to-structure: OPSIN

Result: the Example 31 structure is a 24 nM BACE1 inhibitor

Installable application

Also chemical dictionary conversions

Image-to-structure: OSRA

• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs

Follow-up Searching

Structure Search in PubChem

SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)

Often see stereo differences to the Derwent entry in PubChem
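The slides run this search through the PubChem web interface. For repeated lookups, the same SMILES could be sent to PubChem's PUG REST service instead. The sketch below only builds the request URLs. The endpoint paths (`compound/smiles/.../cids/JSON`, `fastsimilarity_2d` with a `Threshold` option) are written from memory of the PUG REST documentation and should be treated as assumptions to verify before use; the function names are hypothetical.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str) -> str:
    """URL asking PubChem for CIDs that match a SMILES string."""
    # Percent-encode everything so '=', '#', '(' etc. survive in the URL path.
    # SMILES containing '/' or '\' stereo bonds may need a POST request instead.
    return f"{PUG_REST}/compound/smiles/{quote(smiles, safe='')}/cids/JSON"


def similarity_search_url(smiles: str, threshold: int = 95) -> str:
    """URL for a 2D similarity search at a Tanimoto threshold (percent)."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    encoded = quote(smiles, safe="")
    return f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{encoded}/cids/JSON?Threshold={threshold}"


exact_url = identity_search_url("CCO")
similar_url = similarity_search_url("CCO", threshold=95)
```

Running both searches on the same SMILES is one way to see whether a patent example has an exact CID or only near neighbours.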

PubChem Similarity ”Walking”

• 2D and 3D give different results
• Can do multiple steps
• Can "read" CID history
• Possible to "walk" between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.

Direct Patent <> Chemistry

SureChemOpen: Patent Retrieval

• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not bulk export)

SureChemOpen, WIPO, OPSIN and PubChem

Result: 1 nM (?) BACE2 inhibitor with assay and synthesis details.

SureChemOpen: Structure > Patent

Direct answers to: "which patents contain compounds similar to my query" and "show me all the compounds in these patents"

Non-target Activity Data and Bulk Chemistry Extraction

Malaria Query: CiteExplore > WIPO

Example 60, sub-200 nM potency, with solubility and clearance data

Espacenet EP2391601 > ChemAxon Chemicalize.org

• Description URL from Espacenet pasted into Chemicalize.org

• Most of 74 examples converted

• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no exact match

• Claims section was Markush description so no relevant structures converted
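The 95% Tanimoto threshold used for the analogue search is a ratio of shared to total fingerprint features. A minimal sketch of the coefficient follows, using plain Python sets of "on" bit positions as stand-ins for real fingerprints; the toy fingerprints are invented for illustration and are not PubChem fingerprints.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared features / features present in either set."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Toy fingerprints: 20 features in the query, 19 of them kept in the analogue.
query_bits = set(range(20))
analogue_bits = set(range(19))
score = tanimoto(query_bits, analogue_bits)   # 19 shared / 20 in the union
```

At a 95% cut-off, an analogue may differ from the query by only a small fraction of its features, which is why close analogues can be found when no exact match exists.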

EP2391601 > Chemicalize > PubChem

• EP2391601 description text > Chemicalize SDF download > PubChem Structure Search upload = 311 structures

• Of these, 206 have PubChem exact matches
• Of these, 176 have Thomson Pharma matches
• The example cluster from the Thomson/Derwent extraction is ~15
• The example cluster from Chemicalize is ~90
• Ipso facto, Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin
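The counts above can be cross-checked with simple arithmetic. The sketch below assumes that the 176 Thomson Pharma matches are a subset of the 206 PubChem exact matches, and that both cluster sizes are approximate; the function and key names are illustrative.

```python
def summarize(extracted, exact, thomson, cluster_thomson, cluster_chemicalize):
    """Derive the differences implied by the extraction counts."""
    return {
        # Extracted structures with no exact PubChem match
        "no_exact_match": extracted - exact,
        # PubChem matches deposited by sources other than Thomson Pharma
        "matched_outside_thomson": exact - thomson,
        # Share of extracted structures already in PubChem
        "exact_match_fraction": exact / extracted,
        # Extra example-cluster members recovered by Chemicalize
        "extra_cluster_members": cluster_chemicalize - cluster_thomson,
    }


summary = summarize(
    extracted=311,
    exact=206,
    thomson=176,
    cluster_thomson=15,       # approximate, from the slide
    cluster_chemicalize=90,   # approximate, from the slide
)
```

On these figures, 105 structures have no exact PubChem match and the Chemicalize cluster holds about 75 more members than the Thomson/Derwent one, consistent with the "at least 70" estimate.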

Chemicalize Similarity listing PubChem Tanimoto sub-cluster

Tips and Tricks

Tables and Recalcitrant IUPACs

Find tables

Snip image

Online OCR

Word Pad

Chemicalize

• iterative fixing of OCR errors (e.g. 1 vs l)

• cross-check Mw in the document
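The iterative fixing of OCR errors can be partly automated by enumerating the plausible readings of a noisy IUPAC string and testing each one. The sketch below covers only the 1 vs l confusion named above. The `is_valid` callback is hypothetical: in practice it might wrap a name-to-structure converter plus a check that the computed Mw agrees with the value printed in the document.

```python
from itertools import product
from typing import Callable, List, Optional

# Characters that OCR commonly confuses; extend the table as needed.
CONFUSIONS = {"1": "1l", "l": "1l"}


def ocr_variants(text: str, limit: int = 1024) -> List[str]:
    """Return every reading of `text` with ambiguous characters swapped."""
    options = [CONFUSIONS.get(ch, ch) for ch in text]
    total = 1
    for choices in options:
        total *= len(choices)
    if total > limit:
        raise ValueError(f"{total} variants exceed limit {limit}; split the name first")
    return ["".join(chars) for chars in product(*options)]


def first_valid(text: str, is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the first variant accepted by the caller's validity check."""
    for candidate in ocr_variants(text):
        if is_valid(candidate):
            return candidate
    return None


variants = ocr_variants("1-methy1")
```

The number of variants doubles with each ambiguous character, so long names are best corrected one fragment at a time.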

Utopia Mark-up of Patent Introduction

Bioentity mark-up (green) via EMBL Reflect with rich call-out options

Tips for Joining Everything up

• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~ 0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things (e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki pages, blog posts and MeSH IUPACs).
• Check PubChem "same connectivity" for tautomer forms in different CIDs.
• Check PubChem "similar" compounds for analogues even if you cannot track back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF.
• You can "walk" between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.

Conclusions

• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity are reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• Can cherry-pick examples by potency or collate whole series.
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space and SAR data than papers.
• You can "walk" between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data from patents will expedite medicinal chemistry, tropical disease research and chemical biology.

Questions Welcome

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIn: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan
