Open Source Cheminformatics in
KNIME with the RDKit Nodes
Manuel Schwarze, NIBR IT
Novartis Institutes for BioMedical Research, Basel
6th KNIME Users Group Meeting
Zurich, March 6-8, 2013
Frozen Tinguely Fountain in Basel
RDKit: What Is It?
Python (2.x), Java, C++ toolkit for cheminformatics
• Core data structures and algorithms in C++
• Heavy use of Boost libraries
• Python wrapper generated using Boost.Python
Functionality:
• 2D and 3D molecular operations
• Descriptor generation for machine learning
• Database cartridge for substructure and similarity searching
• Supports Mac/Windows/Linux
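As a rough illustration of what similarity searching means, the sketch below ranks a small library against a query using the Tanimoto coefficient on fingerprints. It is plain Python and does not call the RDKit. The fingerprints are made-up sets of "on" bit positions, the names `tanimoto` and `similarity_search` are hypothetical, and returning 0.0 for two empty fingerprints is an arbitrary convention chosen here.

```python
from typing import Dict, FrozenSet, List, Tuple

# A fingerprint is modelled as the set of bit positions that are switched on.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed for this sketch only
    return len(a & b) / union


def similarity_search(
    query: Fingerprint,
    library: Dict[str, Fingerprint],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up fingerprints, for illustration only.
library = {
    "mol_a": frozenset({1, 4, 7, 9}),
    "mol_b": frozenset({1, 4, 8}),
    "mol_c": frozenset({2, 3}),
}
query = frozenset({1, 4, 7})
hits = similarity_search(query, library, threshold=0.5)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

In practice the fingerprints would come from the toolkit itself rather than being written by hand, and the threshold is a tuning choice.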
History:
• 2000-2006: Developed and used at Rational Discovery for building predictive models for ADME, Tox, biological activity
• June 2006: Open-source (BSD license) release of the software; Rational Discovery shuts down
• to present: Open-source development continues, use within Novartis, contributions from Novartis back to open-source version
KNIME Integration [1]
Out of the box KNIME is strong on data processing and mining, but weak on chemistry
Goal: Develop a set of open-source RDKit-based nodes for KNIME that provide basic cheminformatics functionality
[1] Work done together with knime.com
Distributed as KNIME community nodes
Binaries available as KNIME plug-in (no RDKit build/installation required)
Complete refactoring released in 2012:
• GUI alignment
• Improved processing speed
RDKit Node Wizard released in 2012
Work in progress:
• More nodes being added
• Existing nodes being improved
What’s There? (Legend: New, Updated, Changed)
NEW: RDKit to InChI
• Conversion is based on the official IUPAC reference library that was recently integrated into the RDKit
• Many «switches» are available for experts under the «Advanced» tab to influence conversion results
NEW: InChI to RDKit
NEW: IUPAC to RDKit
• Conversion is based on the OPSIN Java library developed at Cambridge University, UK
NEW: RDKit Highlighting Atoms
• Output is an SVG image of the structure with highlighted atoms
NEW: RDKit Interactive Table
• To be used like the KNIME Interactive Table, with all the features that it offers, e.g. hiliting
• Additionally, it can show extra information in column headers; currently only structures based on SMILES values
• Output table = Input table
• Also used as the direct view in some RDKit nodes, e.g. in RDKit Substructure Counter
NEW: RDKit SMILES Headers
• Possible operations on SMILES values in columns: setting, replacing, deleting, retrieving
• First input and output port: Data table
• Optional second input table: SMILES definitions and column assignments
• Second output table: SMILES definitions in the data table after execution
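The four operations can be pictured with a small plain-Python model. This is only a sketch of the idea, not the node's implementation or any KNIME API. It assumes each column carries at most one optional SMILES value; the function names, the column names, and the rule that assignments for unknown columns are ignored are all hypothetical.

```python
from typing import Dict, Optional

# Column name -> SMILES attached to that column, or None if it has none.
Headers = Dict[str, Optional[str]]


def apply_smiles_headers(headers: Headers, assignments: Headers) -> Headers:
    """Set, replace, or delete SMILES values; a None assignment deletes."""
    updated = dict(headers)  # leave the input mapping untouched
    for column, smiles in assignments.items():
        if column not in updated:
            continue  # assumption: unknown columns are silently skipped
        updated[column] = smiles
    return updated


def retrieve_smiles_headers(headers: Headers) -> Dict[str, str]:
    """Return only the columns that currently carry a SMILES value."""
    return {col: smi for col, smi in headers.items() if smi is not None}


# Data table columns with their current SMILES values (made-up example).
headers = {
    "Phenyl count": None,
    "Pyridine count": "c1ccccc1",
    "Obsolete": "CO",
}
# Plays the role of the optional second input table.
assignments = {
    "Phenyl count": "c1ccccc1",    # setting
    "Pyridine count": "c1ccncc1",  # replacing
    "Obsolete": None,              # deleting
}
updated = apply_smiles_headers(headers, assignments)
definitions = retrieve_smiles_headers(updated)  # plays the second output table
print(definitions)
```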
UPDATE: Molecule to RDKit
UPDATE: RDKit to Molecule
UPDATE: RDKit Substructure Counter
What Is Coming Next?
Integration of new KNIME Molecule Type into RDKit nodes
Suggestions for new features and improvements are welcome
Please post to the KNIME Community RDKit Forum: http://tech.knime.org/forum/rdkit
Acknowledgements
Novartis:
• Greg Landrum (NIBR IT)
• David Nick (NIBR IT)
• Eddie Cao (NIBR IT)
• Marc Litherland (NIBR IT)
• Dan Karavakis (NIBR IT)
Rational Discovery:
• Santosh Putta (currently at Nodality)
• Julie Penzotti
knime.com:
• Michael Berthold
• Thorsten Meinl
• Bernd Wiswedel
• Thomas Gabriel
• Peter Ohl
RDKit open-source community
KNIME forum members:
• Simon Richards
• James Davidson
• Steve Roughley
• Ed1
• many others ...