Open Chemistry: Realizing Open Data, Open Standards, and Open Source


Upload: marcus-hanwell

Post on 07-Jul-2015


DESCRIPTION

The Blue Obelisk has brought together the computational chemistry community and those who are passionate about Open Chemistry and about realizing the promise of Open Data, Open Standards, and Open Source (ODOSOS), the three pillars the group promotes. We will present work from the past five years that is inspired by these pillars, along with plans for future work. The group is actively engaged in multiple open-source projects that rely on and promote open standards and open data, including Avogadro (a powerful 3D molecular editor), OpenQube (a library for quantum mechanics), ChemData (a tool for large-scale chemical data analysis and visualization), Chemkit (a library for cheminformatics), MoleQueue (an HPC queue manager), and VTK (a library for scientific data visualization). The Open Chemistry project benefits greatly from the activities of the Blue Obelisk and makes use of several prominent open-source projects, including Qt and MongoDB.

TRANSCRIPT


Open Chemistry: Realizing Open Data, Open Standards and Open Source

Marcus D. Hanwell, Kyle Lutz, David Lonie, Chris Harris, and David Cole

Website: http://openchemistry.org/

Scientific Computing, Kitware, Inc, 28 Corporate Drive, Clifton Park, NY 12065.

Avogadro

The Avogadro project is a cross-platform, open-source approach to building chemical structures. It uses external simulation packages in addition to integrated analysis and visualization routines. The work presented here illustrates a workflow for quantum mechanical calculations, allowing the preparation of chemical structures, rough optimization, and subsequent calculation of electron density isosurfaces, molecular orbitals, etc.

Figure 1: Avogadro application (left), ray-traced molecule (center) and the periodic table widget (right).

Avogadro allows the user to prepare jobs for quantum packages, such as NWChem, GAMESS, Gaussian and Q-Chem. Due to the plugin-based nature of the Avogadro project, many specialized functions can be added for a large range of applications, such as molecular docking, surface modeling and electronic structure.
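As a rough illustration of what preparing a job for a quantum package involves, the sketch below writes a minimal NWChem-style input deck from a list of atoms. It is not Avogadro's input generator: the function name nwchem_input, the default basis set, the default task and the water coordinates are all assumptions chosen for this example.

```python
def nwchem_input(name, atoms, basis="6-31G*", task="scf energy"):
    """Return NWChem-style input text.

    atoms is a list of (symbol, x, y, z) tuples with coordinates in angstroms.
    The basis and task defaults are illustrative choices, not Avogadro's.
    """
    lines = [
        f"start {name}",
        f'title "{name}"',
        "geometry units angstroms",
    ]
    for symbol, x, y, z in atoms:
        lines.append(f"  {symbol} {x:.6f} {y:.6f} {z:.6f}")
    lines += [
        "end",
        "basis",
        f"  * library {basis}",  # same library basis set on every atom
        "end",
        f"task {task}",
    ]
    return "\n".join(lines) + "\n"


# Approximate water geometry, used only as sample data.
water = [
    ("O", 0.0, 0.0, 0.117),
    ("H", 0.0, 0.757, -0.469),
    ("H", 0.0, -0.757, -0.469),
]
text = nwchem_input("water", water)
print(text)
```

A graphical editor adds value on top of this by building and pre-optimizing the geometry interactively before the deck is written.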

MoleQueue

The MoleQueue application provides a graphical interface that integrates high-performance computing (HPC) resources on the desktop. It offers a seamless integration layer for applications, such as Avogadro, to submit jobs to local and remote computational resources. Job lifetime is managed by MoleQueue, and results can be opened in any external program.

Figure 2: The MoleQueue program configuration dialog for a PBS remote system.

• Graphical configuration of queues and programs

• Support for Sun Grid Engine, PBS and running calculations locally

• JSON-RPC protocol for interprocess communication over local sockets or ZeroMQ

• C++ and Python client libraries
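To make the JSON-RPC bullet concrete, here is a minimal sketch of how a client might frame a job-submission request and decode the reply using JSON-RPC 2.0. The envelope fields (jsonrpc, method, params, id, result, error) come from the JSON-RPC 2.0 specification. The method name "submitJob", the parameter keys and the reply contents are hypothetical stand-ins rather than MoleQueue's documented API, and no socket or ZeroMQ transport is shown.

```python
import itertools
import json

_request_ids = itertools.count(1)  # each request gets a fresh id


def make_request(method, params):
    """Build a JSON-RPC 2.0 request object."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_request_ids),
    }


def parse_response(raw):
    """Decode a JSON-RPC 2.0 response and return (id, result)."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        error = message["error"]
        raise RuntimeError(f"{error['code']}: {error['message']}")
    return message["id"], message["result"]


# Hypothetical method name and parameter keys, for illustration only.
request = make_request(
    "submitJob",
    {
        "queue": "Remote PBS",
        "program": "NWChem",
        "description": "water scf",
        "inputFile": "water.nw",
    },
)
wire = json.dumps(request)  # the text that would travel over the transport

# Hand-written stand-in for a server reply; a real reply may look different.
reply = json.dumps({"jsonrpc": "2.0", "result": {"jobId": 17}, "id": request["id"]})
reply_id, result = parse_response(reply)
print(wire)
print(reply_id, result)
```

Matching the reply id against the request id is what lets one connection carry several outstanding requests at once.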

Chemical Data Explorer

The Chemical Data Explorer is a cross-platform, open-source application that builds on the capabilities of the Visualization Toolkit, Qt and MongoDB. It can connect to a local or remote database, ingest new data from various sources and make that data semantically rich. It can apply informatics techniques to the data it contains to search for structures with particular properties. Work is ongoing to more tightly integrate computational job storage and search.

Figure 3: The user interface showing a query and structures (top-left), a scatter plot matrix (top-right), scatter plot with tooltip (bottom-left), and K-means clustering (bottom-right).
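A search for structures with particular properties can be pictured as a range query over per-molecule documents. The sketch below uses a MongoDB-style query document with the $gte and $lte operators, evaluated by a tiny in-memory matcher so that it runs without a database. The field names (name, formula, mass) and the records are assumptions for this example and do not reflect the application's actual schema.

```python
SUPPORTED_OPERATORS = {"$gte", "$lte"}


def matches(doc, query):
    """Return True if doc satisfies every condition in a MongoDB-style query.

    Only exact matches and the $gte / $lte range operators are handled here.
    """
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            unknown = set(condition) - SUPPORTED_OPERATORS
            if unknown:
                raise ValueError(f"unsupported operators: {sorted(unknown)}")
            if value is None:
                return False
            if "$gte" in condition and value < condition["$gte"]:
                return False
            if "$lte" in condition and value > condition["$lte"]:
                return False
        elif value != condition:
            return False
    return True


# Sample records with hypothetical field names; masses are in g/mol.
molecules = [
    {"name": "water", "formula": "H2O", "mass": 18.015},
    {"name": "ethanol", "formula": "C2H6O", "mass": 46.069},
    {"name": "benzene", "formula": "C6H6", "mass": 78.114},
    {"name": "caffeine", "formula": "C8H10N4O2", "mass": 194.194},
]

query = {"mass": {"$gte": 40.0, "$lte": 100.0}}
hits = [m["name"] for m in molecules if matches(m, query)]
print(hits)
```

With a real database the same query document would be handed to the database driver instead of the local matches function.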

Visualization Toolkit and ParaView

The Visualization Toolkit (VTK) is an open-source, C++ toolkit for 2D and 3D graphics, volume rendering, image processing, visualization and modeling. Development began in 1993, and it now has a large community of developers distributed around the world in a diverse set of fields. VTK processes data using a data flow graph (pipeline) in which each algorithm takes zero or more inputs and produces zero or more outputs. VTK is scalable to large data because it has distributed algorithms that use MPI to execute on large computing clusters.
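The data flow idea can be sketched without VTK itself. The toy below is not the VTK API: the Algorithm class and its update method are names invented for this example, and each stage produces a single output for simplicity. It shows a source with zero inputs, filters with one input, and a merge stage with two inputs, where requesting the final output pulls each upstream stage exactly once.

```python
class Algorithm:
    """One pipeline stage: a function plus the upstream stages that feed it."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self.executions = 0
        self._output = None

    def update(self):
        """Bring upstream stages up to date, then run this stage once."""
        if self.executions == 0:
            upstream = [stage.update() for stage in self.inputs]
            self._output = self.func(*upstream)
            self.executions += 1
        return self._output


source = Algorithm(lambda: [1, 5, 9, 3])                                 # zero inputs
threshold = Algorithm(lambda data: [v for v in data if v >= 3], source)  # one input
doubled = Algorithm(lambda data: [2 * v for v in data], threshold)       # one input
merged = Algorithm(lambda a, b: a + b, threshold, doubled)               # two inputs

result = merged.update()
print(result)
```

Although threshold feeds two downstream stages, it runs only once because its output is cached, which is the property that makes shared branches in a pipeline graph inexpensive.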

Figure 4: Volume rendered molecular orbital with sliced contour (left), and library dependency graph (right).

ParaView is an open-source, cross-platform data analysis and visualization application. It is one of the flagship open-source projects developed by Kitware, building on VTK and Qt to provide a client-server application that allows users to quickly build visualizations to analyze their data. ParaView was developed to analyze extremely large data sets using distributed memory computing resources. It can be used interactively with the cross-platform GUI, or scripted from Python. VTK and ParaView are being augmented with additional functionality for chemistry through projects such as the Google Summer of Code and Open Chemistry.

Open Chemistry

The Open Chemistry project is developing a suite of applications and support libraries to improve the workflow in computational chemistry, biology, materials science and related areas. The goal is a set of open, connected components that can tackle small problems on the desktop as well as big research projects requiring significant time on the world’s top supercomputers.

[Workflow diagram; labels: Local, Cloud, Supercomputer, Simulation, Job Submission, HPC integration, Input File, Log File, Results, Informatics]

Figure 5: The workflow that the Open Chemistry components are being developed for.

OpenQube

OpenQube is a small, open-source C++ library that reads key quantum data from calculations produced by codes such as NWChem, GAMESS and Gaussian. It can read in basis sets, eigenvectors and density matrices, and calculate the magnitude of the molecular orbitals and electron density on regularly-spaced grids. The data produced can be used for further analysis and visualization of electronic structure.
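The grid calculation described above can be sketched in a few lines. This is not OpenQube's API: the function names, the two-center basis, the exponents and the coefficients are assumptions made for illustration, and only normalized s-type Gaussian primitives are handled. A molecular orbital is evaluated as a weighted sum of basis functions at every point of a regular grid, and the density of one doubly occupied orbital is taken as twice the squared orbital value.

```python
import math


def s_gaussian(alpha, center, point):
    """Normalized s-type Gaussian primitive with exponent alpha at point."""
    r2 = sum((p - c) ** 2 for p, c in zip(point, center))
    norm = (2.0 * alpha / math.pi) ** 0.75
    return norm * math.exp(-alpha * r2)


def orbital_value(coeffs, basis, point):
    """Molecular orbital value: coefficient-weighted sum of basis functions."""
    return sum(c * s_gaussian(alpha, center, point)
               for c, (alpha, center) in zip(coeffs, basis))


def regular_grid(origin, spacing, shape):
    """Yield the points of a regularly spaced 3D grid."""
    ox, oy, oz = origin
    nx, ny, nz = shape
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                yield (ox + i * spacing, oy + j * spacing, oz + k * spacing)


# Two s functions on the z axis; exponents and coefficients are made up.
basis = [(0.5, (0.0, 0.0, -0.7)), (0.5, (0.0, 0.0, 0.7))]
bonding = [0.6, 0.6]

points = list(regular_grid((-2.0, -2.0, -2.0), 1.0, (5, 5, 5)))
psi = [orbital_value(bonding, basis, p) for p in points]
density = [2.0 * value * value for value in psi]  # one doubly occupied orbital
print(len(points), max(density))
```

Values sampled this way are what an isosurface or volume renderer consumes, so the grid spacing trades accuracy of the surface against memory and evaluation time.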

Chemkit

Chemkit is an open-source, C++ library for molecular modeling, cheminformatics, and molecular visualization. It features a modular, plugin-based architecture and includes over 40 plugins that implement 15 file formats, 6 line formats, 4 force-fields, 2 partial charge models, 2 aromaticity models, 8 atom typers and 30 molecular descriptors. In addition, Chemkit includes an integrated visualization library built on OpenGL/Qt, with Python bindings for easy scripting.

Figure 6: Cartoon rendering of protein (left), surface rendering (center), and molecule rendering (right).

Software Process

These projects are open-source, targeting multiple platforms and architectures. A quality-inducing software process is employed using best-of-breed technologies such as Git for distributed version control, Gerrit for code review, CMake for cross-platform building, CTest for unit/regression testing and CDash for software quality feedback. Most code is BSD licensed, and designed with reuse in mind.