Webinar: Wed 13th May 2015
UniChem
Jon Chambers and Anne Hersey,
ChEMBL group,
The European Bioinformatics Institute, part of the European Molecular Biology Laboratory (EMBL-EBI).
An Introduction to UniChem: EMBL-EBI’s mapping tool for small molecule database identifiers.
UniChem Webinar: 13th May 2015
• What is UniChem ?
• Basic Use of UniChem (web service and web page).
• Background …
• Why was UniChem developed ? What problem does it solve ?
• Requirements and Features…
• Schema, Data Normalization, Loading Rules, etc
• Current Content …
• Sources, Downloads, Stats, Analyses.
• Connectivity Search.
• Q and A
A ChEMBL Compound Report Card
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL12
Compound Cross-references on a Compound Report Card…
Cross-references to the same molecule in other resources.
Automatically maintained via UniChem web services.
Other resources can make use of this same functionality.
REST Web services.
https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL12/1
https://www.ebi.ac.uk/unichem/
UniChem query results.
LR = Last Release when Assignment was current.
UCI = UniChem Identifier
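The lookup URL shown on the slide follows the pattern rest/src_compound_id/<identifier>/<source id>. A minimal Python sketch of building that URL and grouping a response by source is given here. The helper names are hypothetical, the JSON shape (a list of objects with 'src_id' and 'src_compound_id' keys) is an assumption rather than something stated in the slides, and the endpoint is the one shown in this 2015 webinar, so it may have changed since.

```python
import json
from urllib.parse import quote

BASE = "https://www.ebi.ac.uk/unichem/rest"  # base URL as shown on the slide


def src_compound_id_url(compound_id, src_id):
    """Build the cross-reference lookup URL for one identifier in one source."""
    return f"{BASE}/src_compound_id/{quote(str(compound_id), safe='')}/{int(src_id)}"


def parse_xrefs(payload):
    """Group cross-references by source id.

    Assumes the payload is a JSON list of objects with 'src_id' and
    'src_compound_id' keys (an assumption, not taken from the slides).
    """
    by_source = {}
    for row in json.loads(payload):
        by_source.setdefault(str(row["src_id"]), []).append(row["src_compound_id"])
    return by_source


# Hypothetical response body, for illustration only.
sample = (
    '[{"src_id": "3", "src_compound_id": "DZP"},'
    ' {"src_id": "15", "src_compound_id": "SCHEMBL21442"}]'
)

print(src_compound_id_url("CHEMBL12", 1))
print(parse_xrefs(sample))
```

Fetching the URL (for example with urllib.request.urlopen) is left out so that the sketch runs without network access.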
EBI Resources containing small molecule data.
‘CHEMBL12’
‘49575’
‘DZP’
‘ECBD..??’
‘diazepam’
‘SCHEMBL21442’
- Many resources, each with very different user-bases.
- New resources predicted to be developed/adopted in future.
- How can chemistry-centric users make use of all these data ?
- Links between resources allow each resource to evolve independently.
- But, maintenance is manual/time consuming, and a duplication of effort.
Advantages of the UniChem model.
- All EBI DBs share the maintenance overhead of creating links to each other.
- All EBI DBs share the benefits of maintained links to external resources.
- The ‘mapping service’ could be opened for use by external users.
Essential requirements for UniChem.
• Create cross-referencing of chemical structures and their identifiers between databases.
• Fast (ie: capable of producing mappings ‘on the fly’ during a web page load, via a web service call.)
• Low maintenance.
• Up to date.
• Archive and track changes to ‘id-to-structure’ assignments over time.
Standard InChI used as the normalizing mechanism.
InChIs (International Chemical Identifier).
• Non-proprietary, free.
• Not a registry system.
• Designed for printed and electronic data sources.
• Hashed representation aids ‘private’ querying.
InChIKey… 27 characters long… MGDTEJBDJOHWYU-UHTGSUKQAC-N
[‘connectivity block’ aka ‘First InChIKey Hash Block’ (FIKHB): the first 14 characters, MGDTEJBDJOHWYU (shown in blue on the slide)]
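As a small illustration of the key layout described above, the following Python sketch splits a 27-character InChIKey into its three hyphen-separated blocks and compares two keys on the first block (the FIKHB). The function names are hypothetical; the block lengths (14, 10 and 1) are those of the InChIKey format.

```python
import re

# Three hyphen-separated blocks of upper-case letters: 14 + 10 + 1,
# which with the two hyphens gives 27 characters in total.
INCHIKEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{10})-([A-Z])")


def split_inchikey(key):
    """Return the three blocks of an InChIKey, or raise ValueError."""
    match = INCHIKEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    return match.groups()


def fikhb(key):
    """First InChIKey Hash Block: the 14-character connectivity block."""
    return split_inchikey(key)[0]


def same_connectivity(key_a, key_b):
    """True when two keys share the same connectivity block."""
    return fikhb(key_a) == fikhb(key_b)


print(len("MGDTEJBDJOHWYU-UHTGSUKQAC-N"))
print(fikhb("MGDTEJBDJOHWYU-UHTGSUKQAC-N"))
```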
UniChem Schema
UC_STRUCTURE (entries here are immutable)
  UCI (PK), STANDARDINCHI, STANDARDINCHIKEY
UC_XREF
  UCI (FK, PK), SRC_ID (FK, PK), SRC_COMPOUND_ID (PK; eg: CHEMBL12), ASSIGNMENT (1 or 0), LAST_REL_CURRENT
UC_SOURCE
  SRC_ID (PK), NAME, DESCRIPTION, CURRENT_RELEASE_U, etc
UC_RELEASE
  SRC_ID (PK), RELEASE_U (PK), SRC_RELEASE_NUMBER, SRC_RELEASE_DATE, etc
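To make the schema slide concrete, here is a rough sketch of the four tables in an in-memory SQLite database, with one query that maps an identifier from one source to another through the shared UCI. Table and column names follow the slide; the column types, the constraints, the inserted rows and the map_id helper are assumptions for illustration, and the slides do not describe the production database in this detail.

```python
import sqlite3

SCHEMA = """
CREATE TABLE uc_structure (
    uci INTEGER PRIMARY KEY,
    standardinchi TEXT NOT NULL,
    standardinchikey TEXT NOT NULL
);
CREATE TABLE uc_source (
    src_id INTEGER PRIMARY KEY,
    name TEXT,
    description TEXT,
    current_release_u INTEGER
);
CREATE TABLE uc_release (
    src_id INTEGER REFERENCES uc_source(src_id),
    release_u INTEGER,
    src_release_number TEXT,
    src_release_date TEXT,
    PRIMARY KEY (src_id, release_u)
);
CREATE TABLE uc_xref (
    uci INTEGER REFERENCES uc_structure(uci),
    src_id INTEGER REFERENCES uc_source(src_id),
    src_compound_id TEXT,
    assignment INTEGER CHECK (assignment IN (0, 1)),
    last_rel_current INTEGER,
    PRIMARY KEY (uci, src_id, src_compound_id)
);
"""


def map_id(conn, compound_id, from_src, to_src):
    """Return ids in to_src currently assigned to the same UCI as compound_id."""
    rows = conn.execute(
        """
        SELECT b.src_compound_id
        FROM uc_xref a JOIN uc_xref b ON a.uci = b.uci
        WHERE a.src_id = ? AND a.src_compound_id = ? AND b.src_id = ?
          AND a.assignment = 1 AND b.assignment = 1
        ORDER BY b.src_compound_id
        """,
        (from_src, compound_id, to_src),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative rows only; the InChI and key are placeholders, not real data.
conn.executemany("INSERT INTO uc_source (src_id, name) VALUES (?, ?)",
                 [(1, "source 1"), (3, "source 3")])
conn.execute("INSERT INTO uc_structure VALUES (1, 'InChI=placeholder', 'PLACEHOLDER')")
conn.executemany("INSERT INTO uc_xref VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, "CHEMBL12", 1, 1), (1, 3, "DZP", 1, 1)])

print(map_id(conn, "CHEMBL12", 1, 3))
```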
UniChem Tracks Historical Assignments…
Data Release No1 from Source ‘S’: cpd123 InChiX
Data Release No2 from Source ‘S’: cpd123 InChiY
Data Release No3 from Source ‘S’ (latest): cpd123 InChiZ
UniChem will record that in this particular source, the id ‘cpd123’…
• … was last assigned to InChiX on Release No.1, but is not currently assigned to this structure.
• … was last assigned to InChiY on Release No.2, but is not currently assigned to this structure.
• … is currently assigned to InChiZ.
ie: UniChem keeps a record of current AND obsolete assignments.
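The three-release example can be replayed in a few lines of Python. This is only a sketch of the bookkeeping the slide describes: the load_release function, the dictionary layout and the exact way 'assignment' and 'last_rel_current' are updated are assumptions, chosen so that the outcome matches the statements about cpd123.

```python
def load_release(xrefs, release_no, records):
    """Apply one release from a single source to a dict of assignments.

    xrefs maps (src_compound_id, inchi) to a dict holding
    'assignment' (1 = current, 0 = obsolete) and 'last_rel_current'.
    records is the collection of (src_compound_id, inchi) pairs in the release.
    """
    records = set(records)
    # Anything current before this release but absent from it becomes obsolete;
    # its last_rel_current keeps pointing at the last release that contained it.
    for key, row in xrefs.items():
        if key not in records and row["assignment"] == 1:
            row["assignment"] = 0
    # Everything in this release is current as of this release.
    for key in records:
        xrefs[key] = {"assignment": 1, "last_rel_current": release_no}
    return xrefs


xrefs = {}
load_release(xrefs, 1, [("cpd123", "InChiX")])
load_release(xrefs, 2, [("cpd123", "InChiY")])
load_release(xrefs, 3, [("cpd123", "InChiZ")])

for (compound_id, inchi), row in sorted(xrefs.items()):
    print(compound_id, inchi, row)
```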
UniChem deals with ‘Multiple Assignments’…
Multiple ids from a particular source assigned to a single InChI…
cpd123, cpd456, cpd789 → InChiX
…and…
Single id from a particular source assigned to multiple InChIs…
cpd123 → InChiX, InChiY, InChiZ
Loading Rules
Records are not loaded if…
• There is a mismatch between the InChI and the InChIKey, ie: where the InChIKey calculated by UniChem from the InChI provided by the source does not exactly match the InChIKey provided by the source.
• The Standard InChI supplied is greater than 2000 characters long.
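The two loading rules can be expressed as a small check, sketched here in Python. Calculating an InChIKey from an InChI needs InChI software, so the sketch takes that step as a caller-supplied function; the function names and the stub used for the demonstration are hypothetical, and only the two rules themselves come from the slide.

```python
MAX_INCHI_LENGTH = 2000  # limit stated on the slide


def rejection_reason(inchi, supplied_key, compute_key):
    """Return why a record would not be loaded, or None if it passes both rules.

    compute_key is a caller-supplied function that derives an InChIKey from an
    InChI, for example a wrapper around a cheminformatics toolkit.
    """
    if len(inchi) > MAX_INCHI_LENGTH:
        return "Standard InChI longer than 2000 characters"
    if compute_key(inchi) != supplied_key:
        return "InChIKey does not match the InChI"
    return None


def stub_compute_key(inchi):
    """Stand-in for real key calculation: a one-entry lookup (hydrogen chloride)."""
    return {"InChI=1S/ClH/h1H": "VEXZGXHMUGYJMC-UHFFFAOYSA-N"}.get(inchi)


print(rejection_reason("InChI=1S/ClH/h1H", "VEXZGXHMUGYJMC-UHFFFAOYSA-N", stub_compute_key))
print(rejection_reason("InChI=1S/ClH/h1H", "BLGXFZZNTVWLAY-SCYLSFHTSA-N", stub_compute_key))
```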
Automated Loading and Release.
Source specific downloaders and parsers
Common Format
Single loader
Overall process controlled by crontab (timings optimized for each DB to capture latest releases asap).
Weekly release process
Production
Release (incl. Downloads + Mapping files)
… etc …
Top Level stats.
Stats.
https://www.ebi.ac.uk/unichem/ucquery/stats
Sources.
Downloads.
ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/
Downloads on the UniChem ftp site …
Oracle Dumps on the UniChem ftp site …
Release number == UDRI
Contents of a single Release directory…
Whole Source Mapping Downloads – Files containing all id mappings between two sources.
An Example of a Whole source mapping file.
eg: src3src15.txt [PDBe and SureChEMBL]
From src:'3'   To src:'15'
SX2   SCHEMBL3396223
0DU   SCHEMBL6234813
FM9   SCHEMBL12263874
HHH   SCHEMBL1957930
2DC   SCHEMBL1746175
28Y   SCHEMBL232090
0X5   SCHEMBL3515230
PU7   SCHEMBL1964201
1LP   SCHEMBL111850
ACK   SCHEMBL4066485
... (8719 records)
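A whole source mapping file of this kind can be read into a lookup table with a few lines of Python. The sketch below assumes a tab-delimited file with one header line; that delimiter is an assumption, and the rows in the sample are made-up placeholders rather than real mappings.

```python
import csv
import io


def read_mapping(handle):
    """Read a two-column mapping file into (header, {from_id: [to_ids]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # e.g. ["From src:'3'", "To src:'15'"]
    mapping = {}
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        mapping.setdefault(row[0], []).append(row[1])
    return header, mapping


# Placeholder content for illustration; not taken from a real download.
sample = (
    "From src:'3'\tTo src:'15'\n"
    "AAA\tSCHEMBL0000001\n"
    "AAA\tSCHEMBL0000002\n"
    "BBB\tSCHEMBL0000003\n"
)

header, mapping = read_mapping(io.StringIO(sample))
print(header)
print(mapping)
```

Keeping a list per identifier matters because one id in the first source can map to several ids in the second.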
Analyses.
Various analyses run on the current UniChem content, using ‘Structural Identity’ defined in one of 3 ways…
FULIK = The Full InChIKey.
FIKHB = First InChIKey Hash Block (commonly called 'the connectivity layer' of the InChIKey).
SCFIB = Separated Single Components of FIKHB.
Structures by Source
Numbers of ‘structures’ contributed by each source, and of these, how many are unique to the source…
Overlaps between Sources
Numbers of ‘structures’ which ‘overlap’ between pairs of sources…
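The per-source and overlap counts can be sketched in Python for two of the three definitions of structural identity (FULIK and FIKHB). SCFIB is left out because it needs the InChI of each mixture to be split into components first. The function names and the two source names are hypothetical; the keys are ones that appear elsewhere in these slides, grouped arbitrarily for the example.

```python
from itertools import combinations


def identity(key, mode="FULIK"):
    """Reduce an InChIKey to the chosen notion of structural identity."""
    if mode == "FULIK":
        return key  # the full InChIKey
    if mode == "FIKHB":
        return key.split("-")[0]  # first hash block only
    raise ValueError(f"unsupported mode: {mode}")


def source_stats(sources, mode="FULIK"):
    """sources maps a source name to an iterable of InChIKeys.

    Returns (totals, unique, overlaps): structures per source, structures
    found in no other source, and shared structures per pair of sources.
    """
    reduced = {name: {identity(k, mode) for k in keys} for name, keys in sources.items()}
    totals = {name: len(ids) for name, ids in reduced.items()}
    unique = {}
    for name, ids in reduced.items():
        others = set().union(*(o for n, o in reduced.items() if n != name))
        unique[name] = len(ids - others)
    overlaps = {
        (a, b): len(reduced[a] & reduced[b])
        for a, b in combinations(sorted(reduced), 2)
    }
    return totals, unique, overlaps


sources = {
    "source_a": ["BLGXFZZNTVWLAY-SCYLSFHTSA-N", "AHOUBRCZNHFOSL-WMLDXEAASA-N"],
    "source_b": ["BLGXFZZNTVWLAY-XDGRAVGFSA-N", "AHOUBRCZNHFOSL-YOEHRIQHSA-N"],
}
print(source_stats(sources, "FULIK"))
print(source_stats(sources, "FIKHB"))
```

With these keys the two sources share nothing on the full InChIKey but share both structures on the FIKHB, which is the effect the choice of definition has on the counts.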
UniChem Connectivity Search
An advanced use of UniChem which permits searching across UniChem data sources for molecules with the same molecular skeleton as the query, but which may exist in …
• Different stereochemical and isotopic forms
• Different salt forms or mixtures
Funded by FP7 Capacities Specific Programme, grant agreement no. 284209
Connectivity Based Searching in UniChem
Standard UniChem links created only on the basis of identical InChIKeys.
Aim: Create links on the basis of common connectivity (but differing elsewhere; stereochemistry, isotopic composition, etc).
Requirements…
• Fast (has to be created dynamically).
• Identify ‘relationships’ between molecules (eg: “has same connectivity …and is isotopic variant of”).
• Link between cpds with common connectivity within mixtures/salts.
• Generic / Flexible / Customizable.
Alternative views of molecular equivalence.
Molecules that many scientists would consider equivalent in the context of their particular field (e.g. pharmacology, docking, etc.) are quite often depicted differently across different resources.
Frequently, these depictions have different Standard InChIs and so cannot be integrated by simply matching on Standard InChIKey.
Examples…
Isotopic Differences
PubChem CID 71450958
CHEMBL441225
DTQNEFOKTXXQKV-XRLBDJASSA-N
DTQNEFOKTXXQKV-HKUYNNGSSA-N
CP-99994, an NK1 antagonist…
NB: First InChIKey Hash Block (FIKHB) in blue.
Example of Stereochemical differences
AHOUBRCZNHFOSL-WMLDXEAASA-N
AHOUBRCZNHFOSL-YOEHRIQHSA-N
Paroxetine in two different sources ….
Incorrectly drawn, or Valid stereoisomeric forms ?
NB: First InChIKey Hash Block (FIKHB) in blue.
Links between mixtures / salts ?
Yohimbine HCl (Antagonil in ‘Selleck’): PIPZGJSEDRMUAW-VJDCAHTMSA-N
Yohimbine (CHEMBL15245 in ChEMBL): BLGXFZZNTVWLAY-SCYLSFHTSA-N
QJVHTELASVOWBE-AGNWQMPPSA-N
Amoxicillin
Clavulanic acid
Co_Amoxiclav
InChI=1S/C16H19N3O5S.C8H9NO5/c1-16(2)11(15(23)24)19-13(22)10(14(19)25-16)18-12(21)9(17)7-3-5-8(20)6-4-7;10-2-1-4-7(8(12)13)9-5(11)3-6(9)14-4/h3-6,9-11,14,20H,17H2,1-2H3,(H,18,21)(H,23,24);1,6-7,10H,2-3H2,(H,12,13)/b;4-1-/t9-,10-,11+,14-;6-,7-/m11/s1
Yohimbine: BLGXFZZNTVWLAY-SCYLSFHTSA-N
Hydrochloride: VEXZGXHMUGYJMC-UHFFFAOYSA-N
…Yes, but parsing of the InChI required first...
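As a first step of the parsing mentioned above, the components of a mixture or salt can be read off the formula layer of its Standard InChI, where they are separated by dots. The Python sketch below does only that; the function name is hypothetical. Producing a separate InChI and FIKHB for each component also requires splitting the remaining layers and recalculating the keys with InChI software, which is not shown.

```python
import re


def component_formulas(inchi):
    """Return (count, formula) pairs from the formula layer of a Standard InChI.

    Components are dot-separated; a leading integer is a multiplier,
    so a part written as '2ClH' is returned as (2, 'ClH').
    """
    if not inchi.startswith("InChI=1S/"):
        raise ValueError("expected a Standard InChI")
    formula_layer = inchi.split("/")[1]
    components = []
    for part in formula_layer.split("."):
        match = re.fullmatch(r"(\d*)(.+)", part)
        count = int(match.group(1)) if match.group(1) else 1
        components.append((count, match.group(2)))
    return components


# The Co_Amoxiclav InChI shown on the earlier slide.
co_amoxiclav = (
    "InChI=1S/C16H19N3O5S.C8H9NO5/c1-16(2)11(15(23)24)19-13(22)10(14(19)25-16)"
    "18-12(21)9(17)7-3-5-8(20)6-4-7;10-2-1-4-7(8(12)13)9-5(11)3-6(9)14-4/"
    "h3-6,9-11,14,20H,17H2,1-2H3,(H,18,21)(H,23,24);1,6-7,10H,2-3H2,(H,12,13)/"
    "b;4-1-/t9-,10-,11+,14-;6-,7-/m11/s1"
)

print(component_formulas(co_amoxiclav))
```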
UniChem Schema
UC_STRUCTURE
  UCI (PK), STANDARDINCHI, STANDARDINCHIKEY, FIKHB
UC_XREF
  UCI (FK, PK), SRC_ID (FK, PK), SRC_COMPOUND_ID (PK; eg: CHEMBL12), ASSIGNMENT (1 or 0), LAST_REL_CURRENT
UC_SOURCE
  SRC_ID (PK), NAME, DESCRIPTION, CURRENT_RELEASE_U, etc
UC_RELEASE
  SRC_ID (PK), RELEASE_U (PK), SRC_RELEASE_NUMBER, SRC_RELEASE_DATE, etc
UC_FIKHB_HIERARCHY
  PARENT, CHILD
Additions to schema for ‘Connectivity Search’ (shown in green on the slide): the FIKHB column of UC_STRUCTURE and the UC_FIKHB_HIERARCHY table.
Links between combinations of stereoisomers, isotopic variants, in mixtures / salts …
Yohimbine (CHEMBL15245 in ChEMBL): BLGXFZZNTVWLAY-SCYLSFHTSA-N
…is a component of…
Yohimbine HCl: PIPZGJSEDRMUAW-VJDCAHTMSA-N
…is a component of… AND …is stereoisomer of…
Rauwolscine Oxalate: XIIDGINYXKOJGX-ZKKXXTDSSA-N
Rauwolscine HCl: PIPZGJSEDRMUAW-ZKKXXTDSSA-N
…is isotopic variant of… AND …is stereoisomer of…
tritiated Rauwolscine: BLGXFZZNTVWLAY-XDGRAVGFSA-N
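A coarse version of these relationships can be worked out from InChIKeys plus a parent/child table of FIKHBs, as in the Python sketch below. It assumes that PARENT holds the FIKHB of a mixture or salt and CHILD the FIKHB of one of its components, which the slides do not state, and the function names are hypothetical. Keys alone cannot tell a stereoisomer from an isotopic variant, so the sketch reports only 'has same connectivity as'; separating the two cases would need the InChI layers.

```python
def fikhb(key):
    """First InChIKey Hash Block (the connectivity block)."""
    return key.split("-")[0]


# Assumed direction: parent = mixture or salt, children = its components.
HIERARCHY = {
    "PIPZGJSEDRMUAW": {"BLGXFZZNTVWLAY", "VEXZGXHMUGYJMC"},
}


def relationships(query_key, candidate_key, hierarchy=HIERARCHY):
    """Return coarse relationships of the query structure to a candidate."""
    query_block, candidate_block = fikhb(query_key), fikhb(candidate_key)
    if query_key == candidate_key:
        return ["is identical to"]
    if query_block == candidate_block:
        return ["has same connectivity as"]
    if query_block in hierarchy.get(candidate_block, set()):
        return ["is a component of"]
    return []


yohimbine = "BLGXFZZNTVWLAY-SCYLSFHTSA-N"
print(relationships(yohimbine, "PIPZGJSEDRMUAW-VJDCAHTMSA-N"))  # Yohimbine HCl
print(relationships(yohimbine, "BLGXFZZNTVWLAY-XDGRAVGFSA-N"))  # tritiated Rauwolscine
print(relationships(yohimbine, "PIPZGJSEDRMUAW-ZKKXXTDSSA-N"))  # Rauwolscine HCl
```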
Refining ‘Connectivity Search’ to show salts and mixtures.
Select radio button ‘4’ of Option C.
Connectivity Search Results Page.
Connectivity Search Web Services
Connectivity Search Web service query results
https://www.ebi.ac.uk/unichem/rest/cpd_search/CHEMBL15245/1/0/0/4
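The query URL above can be assembled from its parts, as in this short Python sketch. Reading the path as identifier, source id, then options A, B and C is an assumption: the slides only say that value ‘4’ of Option C shows salts and mixtures, and the function and parameter names are hypothetical. The endpoint is the one shown in this 2015 webinar and may have changed since.

```python
BASE = "https://www.ebi.ac.uk/unichem/rest"  # base URL as shown on the slide


def cpd_search_url(compound_id, src_id, option_a=0, option_b=0, option_c=0):
    """Build a connectivity search URL from an id, its source and three options."""
    parts = [str(compound_id)] + [
        str(int(value)) for value in (src_id, option_a, option_b, option_c)
    ]
    return f"{BASE}/cpd_search/" + "/".join(parts)


print(cpd_search_url("CHEMBL15245", 1, option_c=4))
```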
Connectivity Search in ChEMBL
Train Online
http://www.ebi.ac.uk/training/online/course/unichem-quick-tour-0
Acknowledgements
ChEMBL
John Overington
Anne Hersey
Anna Gaulton
Mark Davies
Louisa Bellis
George Papadatos
Shaun McGlinchey
Jon Chambers
ChEBI
Chris Steinbeck
Janna Hastings
PDBe
Sameer Velankar
Atlas
Robert Petryszak
Training
Tom Hancocks
Richard Grandison
Future webinars:
• 20th May - ChEMBL walkthrough
• 27th May - Sequence searching (*3pm UK time)
• 3rd June - UniProt - accessing protein data programmatically
• 10th June - MyChEMBL walkthrough
• 17th June - ChEMBL Web Services
All webinars @ 4:00pm UK time unless stated
For details see: http://www.ebi.ac.uk/training/online/embl-ebi-training-webinar-series-2015
Mapping imprecision
Example of multiple ids from a source assigned to a single Standard InChI…
InChI=1S/C10H6N4O2/c15-9-7-8(13-10 …
alloxazine, isoalloxazine
37325, 37327
mappings generated…
ChEMBL -> ChEBI
CHEMBL68500 -> 37325
CHEMBL68500 -> 37327
ChEBI -> ChEMBL
37325 -> CHEMBL68500
37327 -> CHEMBL68500
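The way one shared Standard InChI fans out into several mappings can be reproduced with a short Python sketch. The generate_mappings function and its input layout are hypothetical; the identifiers are those on the slide, and the InChI is used only as an opaque grouping key, kept truncated exactly as it appears there.

```python
from collections import defaultdict


def generate_mappings(assignments, from_src, to_src):
    """Pair every id in from_src with every id in to_src sharing an InChI.

    assignments is an iterable of (source, src_compound_id, inchi) tuples.
    """
    by_inchi = defaultdict(lambda: defaultdict(set))
    for source, compound_id, inchi in assignments:
        by_inchi[inchi][source].add(compound_id)
    pairs = set()
    for per_source in by_inchi.values():
        for left in per_source.get(from_src, ()):
            for right in per_source.get(to_src, ()):
                pairs.add((left, right))
    return sorted(pairs)


SHARED_INCHI = "InChI=1S/C10H6N4O2/c15-9-7-8(13-10 …"  # truncated on the slide
assignments = [
    ("ChEMBL", "CHEMBL68500", SHARED_INCHI),
    ("ChEBI", "37325", SHARED_INCHI),
    ("ChEBI", "37327", SHARED_INCHI),
]

print(generate_mappings(assignments, "ChEMBL", "ChEBI"))
print(generate_mappings(assignments, "ChEBI", "ChEMBL"))
```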