the rsc chemical validation and standardization platform, a potential path to quality-conscious...

Chemistry Validation and Standardization Platform Modularization and “Hadoop”ization Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams ACS New Orleans April 2013

Upload: orcid-0000-0002-2668-4821

Post on 11-May-2015

DESCRIPTION

High-quality chemical databases are struggling to protect their data from the flow of wild machine-generated chemistry and lower-quality data. The period of primarily human curation prior to deposition in a database is over, and quality-conscious databases need to rely heavily on automated validation checks. An automated chemical validation system is being developed by the cheminformatics team at the Royal Society of Chemistry to be the “quality gatekeeper” of databases at the point of deposition. ChemSpider is leading a community-wide standardization approach starting with our support of the Open PHACTS semantic web project, an Innovative Medicines Initiative. The Chemical Validation and Standardization Platform (CVSP) is being designed as an open, flexible platform that validates and standardizes chemical records. This presentation will review the existing beta version of the system and work in progress.

TRANSCRIPT

Page 1: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Chemistry Validation and Standardization Platform

Modularization and “Hadoop”ization

Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams

ACS New Orleans April 2013

Page 2

Overview

• Motivation
• What we support
• Modularization
• Parallelization
• Examples

Page 3

Motivation: validation

Open and free chemical validation system for:

• Structure validation
  – Warn on query atoms, pseudo atoms, polymers, etc.
  – Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
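As an illustration of the "warn on query atoms, pseudo atoms" check, here is a minimal sketch that scans the atom block of a V2000 molfile. The symbol sets, the function name `check_atoms`, and the warning labels are assumptions made for this example; the slides do not describe how CVSP implements the check.

```python
# Hypothetical symbol sets for illustration; CVSP's actual rules are not given.
QUERY_ATOMS = {"A", "Q", "*", "L"}   # generic query atoms and atom lists
PSEUDO_ATOMS = {"R", "R#"}           # R-group placeholders


def check_atoms(molblock):
    """Return (atom index, symbol, warning) tuples for a V2000 mol block."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])     # counts line: first 3 columns = atom count
    warnings = []
    for idx, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip() # atom symbol occupies columns 32-34
        if symbol in QUERY_ATOMS:
            warnings.append((idx, symbol, "query atom"))
        elif symbol in PSEUDO_ATOMS:
            warnings.append((idx, symbol, "pseudo atom"))
    return warnings


# Usage: a two-atom record where the second atom is an R-group placeholder.
example = "\n".join([
    "example",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.0000    0.0000    0.0000 R#  0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])
print(check_atoms(example))
```

A production check would also need to handle V3000 records and atom-list blocks, which this sketch ignores.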

Page 4

Motivation: standardization

Allows users to use CVSP default standardization workflow (or FDA, Open PHACTS and so on)

Allows users to put together their own workflow using modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
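The idea of assembling a workflow from modules can be sketched as a list of functions applied in order. Everything below is an assumption for illustration: the function names are hypothetical, the "biggest organic fragment" step is a crude string-level approximation that counts atom tokens in dot-separated SMILES, and CVSP itself is written in C# with real cheminformatics toolkits.

```python
import re

# Matches bracket atoms and organic-subset atoms in a SMILES string.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment):
    """Approximate atom count: one per atom token (bracket hydrogens included)."""
    return len(ATOM_TOKEN.findall(fragment))


def has_carbon(fragment):
    """True if any token is a carbon atom (not Cl, Ca, Cu, ...)."""
    for token in ATOM_TOKEN.findall(fragment):
        if token in ("C", "c") or re.match(r"\[\d*[Cc](?![a-z])", token):
            return True
    return False


def biggest_organic_fragment(smiles):
    """Keep the largest carbon-containing component of a dot-separated SMILES."""
    fragments = smiles.split(".")
    organic = [f for f in fragments if has_carbon(f)]
    return max(organic or fragments, key=heavy_atom_count)


def strip_whitespace(smiles):
    return smiles.strip()


def run_workflow(smiles, modules):
    """Apply each module in order; a workflow is just an ordered list."""
    for module in modules:
        smiles = module(smiles)
    return smiles


# Usage: a user-assembled workflow of two modules.
my_workflow = [strip_whitespace, biggest_organic_fragment]
print(run_workflow(" CC(=O)[O-].[Na+] ", my_workflow))
```

Representing a workflow as an ordered list keeps each module independently testable and lets a default workflow and a user-defined one share the same runner.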

Page 5

What we support

• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
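Accepting zipped and gzipped uploads alongside plain files implies some way of telling them apart. One common approach, shown here as a sketch rather than as CVSP's method, is to inspect the first bytes of the upload. The function name `sniff_container` and its return labels are assumptions for this example.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"     # first two bytes of every gzip stream
ZIP_MAGIC = b"PK\x03\x04"    # local file header of a non-empty zip archive


def sniff_container(head: bytes) -> str:
    """Classify an upload by its leading bytes: 'gzip', 'zip' or 'plain'."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "plain"


# Usage: build small in-memory samples and classify them.
gz_bytes = gzip.compress(b"CCO\tethanol\n")
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("names.txt", "CCO\tethanol\n")
zip_bytes = zip_buffer.getvalue()

print(sniff_container(gz_bytes[:4]))
print(sniff_container(zip_bytes[:4]))
print(sniff_container(b"CCO\tethanol\n"[:4]))
```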

Page 6

CVSP: modularization

Page 7

Reusable workflows

Page 8

SMIRKS-based rules

Page 9
Page 10
Page 11
Page 12

“Hadoop”ization

Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.

CVSP is written in C#. To run it on Linux machines we use Mono (a cross-platform .NET runtime environment).

Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space

Processor-intensive tasks
• Tautomerization

Page 13

• Input file
• Deposit ID in database
• Upload to farm for processing on Hadoop
• Hadoop processing
• Download results
• Upload results to database for user preview
• Convert to SD format
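Once a submission is in SD format, it is a sequence of records each terminated by a `$$$$` line, which makes the record a natural unit to hand to parallel workers. The sketch below splits SD text into records; the function name `iter_sd_records` is an assumption, and the slides do not say how CVSP actually partitions input for Hadoop.

```python
def iter_sd_records(text):
    """Yield each SD record (without its '$$$$' terminator) from SD file text."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(record)
            record = []
        else:
            record.append(line)
    # Tolerate a final record that lacks the closing delimiter.
    if any(part.strip() for part in record):
        yield "\n".join(record)


# Usage: two placeholder records (mol blocks abbreviated to their last line).
sd_text = "mol1\nM  END\n$$$$\nmol2\nM  END\n$$$$\n"
records = list(iter_sd_records(sd_text))
print(len(records))
```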

Page 14

Hadoop queues

Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions:

• “Small” submission queue for submissions under 500 records
• Large submissions queue
• Internal queue
  – For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization

All records have to be processed on Hadoop before the user can see the results (no partial preview).
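The routing rule described on this slide can be sketched in a few lines. The 500-record threshold comes from the slide; the queue names and the `internal` flag are hypothetical labels chosen for this example, not the names used on the RSC cluster.

```python
SMALL_LIMIT = 500  # from the slide: "small" means under 500 records


def choose_queue(n_records, internal=False):
    """Pick a queue name for a submission (names are illustrative only)."""
    if internal:
        return "internal"
    if n_records < SMALL_LIMIT:
        return "small"
    return "large"


# Usage
print(choose_queue(120))
print(choose_queue(6516))
print(choose_queue(1_000_000, internal=True))
```

Keeping small submissions in their own queue means a short job does not have to wait behind a long-running large or internal one.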

Page 15

Examples

DrugBank
• ~6500 records, approximately 2 records per second

PubMed
• ~100 000 records, about 9 h
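The two figures are quoted in different units (a rate for one, a total time for the other), so a short calculation helps compare them. This only rearranges the numbers on the slide; the helper names are made up for the example.

```python
def seconds_needed(records, records_per_second):
    """Total processing time implied by a record count and a rate."""
    return records / records_per_second


def rate_achieved(records, hours):
    """Average records per second implied by a record count and a duration."""
    return records / (hours * 3600)


# ~6500 records at ~2 records/s is a little under an hour.
drugbank_minutes = seconds_needed(6500, 2.0) / 60
# ~100 000 records in ~9 h is roughly 3 records/s.
second_set_rate = rate_achieved(100_000, 9)

print(f"{drugbank_minutes:.0f} min, {second_set_rate:.1f} records/s")
```

Both data sets therefore ran at a few records per second on average, keeping in mind that the slide's figures are approximate.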

Page 16

Rate-limiting step?

Canonical tautomerization

This molecule took 45 min to canonicalize.

Page 17

DrugBank dataset (6516 records)

Errors
• 2 records with query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with a metal coordinated inside, where one of the metal-nitrogen bonds is stereogenic
• Unusual valence: ~20

Warnings
• InChI not matching structure (100+)
• SMILES not matching structure (100+)

Page 18

DrugBank ID: DB00755

InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+

DrugBank ID: DB00614
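When a depositor-provided InChI does not match the structure, it helps to know which layer differs (formula, connectivity, hydrogens, double-bond stereo, and so on). The sketch below splits a simple standard InChI, such as the one shown for DB00755, into layers and compares two strings. The helper names and the "computed" InChI with its stereo layer dropped are hypothetical; InChIs with repeated layers (for example in an isotopic section) would need more careful handling than this.

```python
def inchi_layers(inchi):
    """Split a simple standard InChI into a dict keyed by layer prefix."""
    prefix, _, rest = inchi.partition("/")
    parts = rest.split("/")
    layers = {"version": prefix.replace("InChI=", ""), "formula": parts[0]}
    for part in parts[1:]:
        layers[part[0]] = part[1:]   # e.g. 'c', 'h', 'b', 't', 'm', 's'
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted layer keys whose contents differ between two InChIs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(key for key in set(a) | set(b) if a.get(key) != b.get(key))


# Depositor-provided InChI from the slide (DB00755).
deposited = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-"
    "20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
    "/b9-6+,12-11+,15-8+,16-14+"
)
# Hypothetical InChI computed from a drawing that lacks double-bond stereo.
computed = deposited.rsplit("/b", 1)[0]

print(inchi_layers(deposited)["formula"])
print(differing_layers(deposited, computed))
```

Reporting the differing layer turns a generic "InChI not matching structure" warning into something a depositor can act on, for instance by checking double-bond geometry when only the `b` layer differs.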

Page 19

Stereo issues

DB08128 DB06287

J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277

Page 20

Thank you

E-mail: [email protected], [email protected]

Please try CVSP at http://cv.beta.rsc-us.org