ICIC 2014 From SureChem to SureChEMBL

John P. Overington (EMBL-EBI), Nicko Goncharoff (Digital Science). SureChEMBL: Open patent chemistry data

Post on 11-Jun-2015

DESCRIPTION

The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted, searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. The product was launched and licensed by a user community under a freemium business model. Later, user feedback and market shifts indicated a need to link biological data to patents as well (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with a remit of data dissemination and sharing and deep experience of biodata, including the large ChEMBL database of structure-activity relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.

TRANSCRIPT

Page 1: ICIC 2014 From SureChem to SureChEMBL

John P. Overington - EMBL-EBI

Nicko Goncharoff – Digital Science

SureChEMBL: Open patent chemistry data

Page 2: ICIC 2014 From SureChem to SureChEMBL

EMBL-EBI’s Mission

• Provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress

• Contribute to the advancement of biology through basic investigator-driven research in bioinformatics

• Provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators

• Help disseminate cutting-edge technologies to industry

• Coordinate biological data provision throughout Europe

Page 3: ICIC 2014 From SureChem to SureChEMBL

EMBL Member States

Austria, Belgium, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom

Associate member states: Australia, Argentina

Page 4: ICIC 2014 From SureChem to SureChEMBL

ChEMBL

• The world’s largest primary public database of medicinal chemistry data

• https://www.ebi.ac.uk/chembl

• >1.4 million compounds, >9,000 targets, >12 million bioactivities

• Truly Open Data

• CC-BY-SA license

• Many download/access formats

• myChEMBL – Linux VM, PostgreSQL, RDKit, KNIME…

• Semantic Web

• RDF download, SPARQL endpoint at http://rdf.ebi.ac.uk/chembl

Page 5: ICIC 2014 From SureChem to SureChEMBL

SAR Data

Compound

Assay

Ki=4.5 nM

>Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSY

EEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRS

RYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEG

SSVNLSPPLEQCVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGD

EEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEAD

CGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVL

TAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLK

KPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPIVERPVC

KDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFY

THVFRLKKWIQKVIDQFGE

ED2=230 nM

Inhibition of human Thrombin

PTT (partial thromboplastin time)

ChEMBL
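
Potency values such as the Ki on this slide are commonly compared on a negative log10 molar scale (the idea behind pKi). The sketch below is plain arithmetic for that conversion; the function name `p_value` and the unit table are illustrative assumptions, not part of ChEMBL.

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def p_value(value: float, unit: str = "nM") -> float:
    """Return -log10 of a concentration after converting it to mol/L."""
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


# The two measurements shown on the slide.
print(round(p_value(4.5, "nM"), 2))   # Ki = 4.5 nM
print(round(p_value(230, "nM"), 2))   # ED2 = 230 nM
```

On this scale a tenfold improvement in potency adds one unit, which makes measurements from different assays easier to place side by side.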

Page 6: ICIC 2014 From SureChem to SureChEMBL

SureChem = SureChEMBL

• December 2013 EMBL-EBI ‘acquired’ SureChem

• Existing SureChem user base

• Free (SureChemOpen)

• Paying (SureChemPro + API)

• EMBL-EBI supported existing licensees during transition

• EMBL-EBI provides an ongoing, free and open resource to entire community

• Private, Secure, and Free

• No login system

• Rebranded as SureChEMBL

• https://www.surechembl.org

6 PDG Biotech Meeting

Page 7: ICIC 2014 From SureChem to SureChEMBL

Rebranding Complete!

Page 8: ICIC 2014 From SureChem to SureChEMBL

https://www.surechembl.org/

Page 9: ICIC 2014 From SureChem to SureChEMBL

EMBL-EBI Chemistry Resources

RDF and REST API interfaces

REST API Interface

• Atlas: Ligand induced transcript response (750)

• PDBe: Ligand structures from structurally defined protein complexes (15K)

• ChEBI: Nomenclature of primary and secondary metabolites. Chemical Ontology (24K)

• SureChEMBL: Chemical structures from patent literature (16M)

• ChEMBL: Bioactivity data from literature and depositions (1.5M)

• UniChem: InChI-based chemical resolver (full + relaxed ‘lenses’) (>70M)

• 3rd Party Data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … (~55M)
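
The slide describes UniChem as an InChI-based resolver with full and relaxed ‘lenses’. One way to picture the difference is with standard InChIKeys, whose first 14-character block hashes the connectivity and whose second block covers the remaining layers such as stereochemistry. The sketch below assumes the relaxed lens corresponds to matching on the first block only; that mapping and the function names are assumptions for illustration, not UniChem's documented behaviour.

```python
from typing import Tuple


def split_inchikey(key: str) -> Tuple[str, str, str]:
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    parts = key.strip().upper().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts[0], parts[1], parts[2]


def full_match(a: str, b: str) -> bool:
    """Strict comparison: every block must agree."""
    return split_inchikey(a) == split_inchikey(b)


def relaxed_match(a: str, b: str) -> bool:
    """Relaxed comparison: only the connectivity block must agree."""
    return split_inchikey(a)[0] == split_inchikey(b)[0]


# InChIKeys of L-alanine and D-alanine: same skeleton, different stereo.
L_ALA = "QNAYBMKLOCPYGJ-REOHZZYMSA-N"
D_ALA = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"

print(full_match(L_ALA, D_ALA))
print(relaxed_match(L_ALA, D_ALA))
```

A relaxed comparison of this kind is useful for patent-derived structures, where stereochemistry is often lost or drawn ambiguously.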

Page 10: ICIC 2014 From SureChem to SureChEMBL

SureChEMBL Data Pipeline

Patent Offices: WO; EP applications & granted; US applications & granted; JP abstracts

Processed patents (IFI Claims)

OCR; Entity Recognition; Name to Structure (five methods); Image to Structure (one method)

Entity Recognition example: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine

SureChEMBL System: Chemistry Database; Database; Application Server; API; Patent PDFs (service)

Users
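
The description speaks of a "tournament of methods", and this slide lists five name-to-structure methods. The slides do not say how a winner is chosen, so the sketch below assumes the simplest possible rule: each converter casts a vote for the structure identifier it produces, and the most common answer wins if it has enough support. The converters here are stand-in stubs, not the real tools.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A converter maps a chemical name to a structure identifier, or None on failure.
Converter = Callable[[str], Optional[str]]


def tournament(name: str, converters: Iterable[Converter],
               min_votes: int = 2) -> Optional[str]:
    """Return the identifier most converters agree on, or None if support is weak."""
    votes: Counter = Counter()
    for convert in converters:
        try:
            result = convert(name)
        except Exception:
            result = None  # a crashing tool simply casts no vote
        if result:
            votes[result] += 1
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count >= min_votes else None


# Hypothetical stubs standing in for five name-to-structure tools.
def tool_a(name: str) -> Optional[str]:
    return "STRUCT-1"

def tool_b(name: str) -> Optional[str]:
    return "STRUCT-1"

def tool_c(name: str) -> Optional[str]:
    return "STRUCT-2"

def tool_d(name: str) -> Optional[str]:
    return None  # could not parse the name

def tool_e(name: str) -> Optional[str]:
    raise ValueError("parser crashed")


TOOLS = [tool_a, tool_b, tool_c, tool_d, tool_e]
print(tournament("some chemical name", TOOLS))
```

In practice the votes would be cast on a canonical identifier such as an InChIKey, so that differently drawn but identical structures count as agreement.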

Page 11: ICIC 2014 From SureChem to SureChEMBL

SureChEMBL data coverage

Data | Description & Languages | Years

EP applications | Bib. data: DocDB + Original; Full text: Original (EN, DE, FR) | From 1978

EP granted | Bib. data: DocDB + Original; Full text: Original (EN, DE, FR) | From 1980

WO applications | Bib. data: DocDB + Original; Full text: Original (EN, DE, FR, ES, RU) | Bib. data from 1978; full text from 1978

US applications | Bib. data: DocDB + Original; Full text: Original (EN) | Bib. data from 2001; full text from 2001

US granted | Bib. data: DocDB + Original; Full text: Original (EN) | Bib. data from 1920; full text from 1976

JP applications | Bib. data: DocDB; PAJ - English abstracts/titles | Bib. data from 1973; PAJ from 1976

JP granted | Bib. data: DocDB | From 1994

90+ countries | Bib. data: DocDB | From 1920

Page 12: ICIC 2014 From SureChem to SureChEMBL

SureChEMBL Chemistry Data Coverage

• Structures from text: 1976 onwards

• Title, abstract, claims, description

• SureChem Chemical Entity Recognition - proprietary algorithms

• ACD/Labs, ChemAxon, OpenEye, OPSIN, PerkinElmer name-structure conversion

• Structures from images: 2007 onwards

• CLiDE image-structure conversion

• Will extend image processing backwards using AWS Spot Pricing compute

• USPTO offers ‘Complex Work Units’ since 2001

• CWU file types include MOL and CDX

• CWUs processed as part of pipeline: 2007 onwards

Page 13: ICIC 2014 From SureChem to SureChEMBL

Chemical Entity Extraction

Page 14: ICIC 2014 From SureChem to SureChEMBL

SureChEMBL Content (September 2014)

• 15,668,225 compounds

• 12,888,125 patents

• ~80,000 new compounds extracted from ~50,000 patents monthly

• 1–7 days for published patent to become searchable in SureChEMBL

• System provides search access to all patents (not just chemistry)

Page 15: ICIC 2014 From SureChem to SureChEMBL

Current System Capabilities

• Searching capabilities

• Free text keywords and Lucene fields

• Patent IDs & bibliographic information

• Patent authority & date

• Chemical structure

• Retrieval capabilities

• Retrieve chemistry (with additional filters)

• Retrieve patent family information

• Retrieve annotated full patent text

• Retrieve patent document as PDF

Page 16: ICIC 2014 From SureChem to SureChEMBL

https://www.surechembl.org/

Page 17: ICIC 2014 From SureChem to SureChEMBL

Page 18: ICIC 2014 From SureChem to SureChEMBL

Page 19: ICIC 2014 From SureChem to SureChEMBL

Compound Report Page

https://www.surechembl.org/chemical/SCHEMBL1895

Page 20: ICIC 2014 From SureChem to SureChEMBL

UniChem Integration

On-the-fly integration with 71M structures from 25 data sources

Page 21: ICIC 2014 From SureChem to SureChEMBL

SureChEMBL Data Access

• UniChem

• https://www.ebi.ac.uk/unichem

• Weekly updates

• Private, secure, live integration with >25 chemistry resources

• UniChem will soon be the world’s largest chemical structure integration resource…

• FTP Site

• ftp://ebi.ac.uk/public

• Quarterly updates

• All SureChEMBL compounds in SDF and CSV format

• Raw data – not filtered for ‘funnies’

• Further downloads planned in future

• Blog for announcements – https://chembl.blogspot.com

Page 22: ICIC 2014 From SureChem to SureChEMBL

OCR Errors

• Small, poor quality images

• OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors

-> ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-

3- vDbenzamide’

• Reliability is better for US patents due to the inclusion of MOL files
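
The garbled name above loses a closing brace to OCR. A cheap sanity check that can flag such names before name-to-structure conversion is to test whether the brackets balance. The sketch below is an illustration of that idea only; it is not the OCR correction step used in the SureChEMBL pipeline, and the function name is an assumption.

```python
# Map each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(name: str) -> bool:
    """Return True if (), [] and {} in a chemical name are properly nested."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


# The OCR-damaged name quoted on this slide (line break joined).
ocr_name = ("2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-"
            "3- vDbenzamide")

# The clean systematic name shown on the data pipeline slide.
clean_name = ("1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-"
              "pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine")

print(brackets_balanced(ocr_name))
print(brackets_balanced(clean_name))
```

A check like this only catches one class of damage; character substitutions such as "methvn" would pass it and need other handling.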

Page 23: ICIC 2014 From SureChem to SureChEMBL

Name Conversion Errors

Pentyl Thiol

2-(2-((3-chloro-6-methyl-5,5-dioxido-6,11-dihydrodibenzo[c,f][1,2]thiazepin-11-yl)amino)ethoxy)acetic acid

Page 24: ICIC 2014 From SureChem to SureChEMBL

ChEMBL – SureChEMBL Overlap

• InChI based comparison using filtered parent compounds

• SureChEMBL: 12.2M; ChEMBL: 1.3M; overlap: 235K (18.4%)

Filters

• MW between 100 and 1200

• #Atoms between 6 and 70

• ALogP between -10 and 10

• #C > 0

• #Rings > 0

• #C != #Atoms

• RTB <= 20

(ChEMBL 18)
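
The filter list above can be read as a single predicate over precomputed descriptors. The sketch below assumes inclusive bounds for the "between" rules and that "#Atoms" means heavy atoms; both are assumptions, as are the class and field names. Descriptor values would normally come from a cheminformatics toolkit and are hard-coded here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float      # MW
    num_atoms: int         # #Atoms (heavy atoms assumed)
    alogp: float           # ALogP
    num_carbons: int       # #C
    num_rings: int         # #Rings
    rotatable_bonds: int   # RTB


def passes_filters(d: Descriptors) -> bool:
    """Apply the seven filters from the slide; all must hold."""
    return (
        100 <= d.mol_weight <= 1200
        and 6 <= d.num_atoms <= 70
        and -10 <= d.alogp <= 10
        and d.num_carbons > 0
        and d.num_rings > 0
        and d.num_carbons != d.num_atoms   # excludes pure-carbon skeletons
        and d.rotatable_bonds <= 20
    )


# Illustrative values: a small drug-like molecule and an acyclic hydrocarbon.
drug_like = Descriptors(180.2, 13, 1.3, 9, 1, 3)
hydrocarbon = Descriptors(86.2, 6, 3.5, 6, 0, 3)

print(passes_filters(drug_like))
print(passes_filters(hydrocarbon))
```

The hydrocarbon fails on several counts at once (weight, no rings, carbon-only), which shows how the rules overlap in removing solvents and fragments.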

Page 25: ICIC 2014 From SureChem to SureChEMBL

Future Entity Extraction and Indexing

• Identify new entity types e.g. proteins, diseases and cell lines

• Extend using ChEMBL dictionaries + others

• Ontology/synonym mapping - semantic tagging

• Target-relevance assessment

• Protein/biotherapeutic sequence extraction

• Sequence-based patent searches

• Enhanced cross-referencing

• Tag up all commonly used identifiers (Company codes, CAS, ChEBI, ChEMBL, PubChem, ENSEMBL, RefSeq, UniProt,…)

Page 26: ICIC 2014 From SureChem to SureChEMBL

EFO – http://www.ebi.ac.uk/efo

Page 27: ICIC 2014 From SureChem to SureChEMBL

Far Future - Bioactivity Data Extraction?

Target/Assay

Bioactivity

Page 28: ICIC 2014 From SureChem to SureChEMBL

Far Future – Markush Extraction?

-alkyl

-aryl

-heteroaryl

-heterocyclyl

-cycloalkyl

….

Page 29: ICIC 2014 From SureChem to SureChEMBL

Acknowledgements

• ChEMBL team

• John Overington

• Jon Chambers

• George Papadatos

• Mark Davies

• Nathan Dedman

• Anna Gaulton

• Digital Science

• Nicko Goncharoff

• James Siddle

• Richard Koks

Funding:

• Wellcome Trust Strategic Award for ChEMBL database (WT086151/Z/08/Z & WT104104/Z/14/Z)

• Open PHACTS - Innovative Medicines Initiative Joint Undertaking (grant no. 115191)

• European Molecular Biology Laboratory

• BioMedBridges - European Commission FP7 Capacities Specific Programme (grant no. 284209)

• Technology Partners: