
From Sequence Analysis to Simulations: Applications of HPC in Modern Biology

R. Sankararamakrishnan

Department of Biological Sciences & Bioengineering

IIT-Kanpur

IIT-K REACH Symposium 2010

Oct 9th 2010

Computers and Computing in Biology

Bioinformatics

Computational Biology

Mathematical Biology

Biostatistics

Biomathematics

Quantitative Biology

Biophysics

What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

- NIH Definition http://www.bisti.nih.gov/

Definitions


Explosive growth of biological data

HPC Applications: Three examples

Evolutionary relationship among a given set of protein or DNA sequences

Drug Discovery and Design

Structure-function relationship of large biomolecular assemblies

I. HPC in Phylogenetics

Phylogeny and Phylogenetic tree

Study of evolutionary relationships (sequences/species)

Relationships between organisms with common ancestor

Phylogenetic tree is a graph representing evolutionary history of sequences/species

Phylogenetic trees can be represented in two different ways

Rooted tree: has a unique root node and shows the direction of evolution

Unrooted tree: makes no assumption about common ancestry

[Figure: rooted and unrooted trees relating Human, Chimpanzee, Gorilla and Orangutan]

Molecular phylogeny in a criminal investigation

Maximum Likelihood Method – An Introduction

David Mount (2002)

For each unrooted tree, there will be many possible rooted trees

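Number of rooted trees for n species: $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

Number of unrooted trees for n species: $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$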

Species   Number of Rooted Trees               Number of Unrooted Trees

2         1                                    1

3         3                                    1

4         15                                   3

5         105                                  15

10        34,459,425                           2,027,025

15        213,458,046,676,875                  7,905,853,580,625

20        8,200,794,532,637,891,559,375        221,643,095,476,699,771,875

Number of possible unrooted and rooted trees
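The growth in the number of candidate trees can be checked directly from the two counting formulas. The following is a minimal sketch; the function names are illustrative and are not taken from the slides or from any phylogenetics package.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n species:
    (2n-3)! / (2^(n-2) * (n-2)!), valid for n >= 2."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n species:
    (2n-5)! / (2^(n-3) * (n-3)!), valid for n >= 3.
    For n = 2 there is a single tree (one branch joining the two species)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, rooted_trees(n), unrooted_trees(n))
```

For 5 species this gives 105 rooted and 15 unrooted trees. Python integers have arbitrary precision, so the same functions remain exact for larger n, where the counts quickly exceed what exhaustive search can cover.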

Maximum likelihood phylogeny problem is NP-hard

Very CPU intensive

For trees containing more than 20 to 25 sequences, the problem can no longer be solved exactly by exhaustive search

Efficient heuristic tree search algorithms are required to reduce the size of the search space

Recently developed algorithms:

IQPNNI, PHYML, GARLI, RAxML

None of these algorithms is guaranteed to find the ML tree; they only yield the best-known ML tree

Computing phylogenetic trees using ML method

Parallelization strategy

Ott et al. (2008)

RAxML performance in some HPC platforms

Ott et al. (2008)

212 sequences, 566,470 base pairs

One of the largest datasets analyzed under ML

IBM BlueGene/L; 1024 CPUs

7 distinct tree searches in 14 hours

Phylogenetic analysis of plant channel proteins identified new subfamily

Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007)

Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)

II. HPC in Drug Discovery & Drug Design

“Is there really a case where a drug that is on the market was designed by a computer?”

“The reality is that the use of computers and computer methods permeates all aspects of drug discovery today”

Jorgensen (2004)

Roles of Computation in Drug Discovery

“Drug discovery is complex: Successful teams and companies need to be congratulated, whereas search for one individual or computer program is counterproductive. There is not going to be a voila moment at the computer terminal. Instead, there is systematic use of wide-ranging computational tools to facilitate and enhance the drug discovery process”

Computation in Drug Discovery

Jorgensen (2004)

Structure-based Drug Design – An Introduction

http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html

http://www.biocryst.com/our_science


Wim Hol, www.bmsc.washington.edu/WimHol/sbdd3.JPG

Lead Generation

Lead optimization

De novo design

Virtual screening

Bleicher et al. (2003)

All drugs presently on the market are estimated to target fewer than 500 biomolecules

Docking & Scoring

Drug targets and Drug discovery: Issues

Issues: Scoring function, solvent effect and protein flexibility

Four proteins: trypsin, HIV PR, CDK2 and AChE

Test set for each protein: 10,000 randomly selected compounds

6000 docking poses were selected for the top 1000 compounds

They served as initial conformations for MD simulations

Combination of docking and MD showed a higher and more stable enrichment performance than docking method used alone

A special purpose computer, MDGRAPE-3, was used for MD simulations

It is a cluster of personal computers

Each is equipped with 24 MDGRAPE-3 chips and has a peak speed of approximately 2 Tflops

50 such computers were used

Average computational time for a single protein-ligand complex is 2.5 h

For 6,000 protein-ligand conformations, calculations were completed in a week
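The "enrichment performance" compared above can be illustrated with a small calculation. The slides do not state which enrichment metric was used, so the sketch below assumes the commonly used enrichment factor: the fraction of actives found in the top-ranked part of the list divided by the fraction of actives in the whole set. The function name and the toy ranking are hypothetical.

```python
def enrichment_factor(ranked_is_active, top_fraction):
    """Enrichment factor for a ranked screening list.

    ranked_is_active: list of booleans, best-scored compound first,
                      True where the compound is a known active.
    top_fraction:     fraction of the list inspected (0 < top_fraction <= 1).

    Assumed definition: (actives in top / size of top) divided by
    (total actives / total compounds).
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    n_top = max(1, round(n_total * top_fraction))
    hits_in_top = sum(ranked_is_active[:n_top])
    return (hits_in_top / n_top) / (n_actives / n_total)


if __name__ == "__main__":
    # Toy ranking of 10 compounds, 3 of them active (made-up data).
    ranking = [True, True, False, False, True,
               False, False, False, False, False]
    print(enrichment_factor(ranking, 0.2))
```

A value of 1 corresponds to random selection; a rescoring step such as MD is useful only if it pushes this value higher than docking alone for the same top fraction.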

Steered Molecular Dynamics to compute the force required to extract the inhibitors from enzymes

A small spring is connected to the ligand in the complex

The free end of this spring is pulled at constant velocity into the surrounding water

Force is determined from the extension of the spring and recorded as a function of time

Strongly bound inhibitors: higher peak forces

Weaker inhibitors: flatter profiles
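The pulling protocol can be written down in a few lines. This is a minimal sketch that assumes a harmonic spring of stiffness k whose free end moves at constant velocity v along the pulling direction, so the force at time t is F(t) = k (v t - x(t)), where x(t) is the ligand displacement along that direction. The function names, units and sample numbers are illustrative assumptions and are not taken from the cited work.

```python
def smd_forces(k, v, times, displacements):
    """Spring force at each recorded time, F = k * (v*t - x).

    k:             spring constant (assumed harmonic spring)
    v:             constant pulling velocity of the spring's free end
    times:         sequence of time points
    displacements: ligand displacement along the pulling direction
                   at each time point
    """
    if len(times) != len(displacements):
        raise ValueError("times and displacements must have equal length")
    return [k * (v * t - x) for t, x in zip(times, displacements)]


def peak_force(forces):
    """Largest force in the profile, used to compare inhibitors."""
    return max(forces)


if __name__ == "__main__":
    # Made-up trajectory: the ligand lags behind the spring, then releases.
    times = [0.0, 1.0, 2.0, 3.0]
    displacements = [0.0, 0.5, 1.0, 5.5]
    forces = smd_forces(k=1.0, v=2.0, times=times, displacements=displacements)
    print(forces, peak_force(forces))
```

In this toy profile the force builds up while the ligand stays in the pocket and drops once it releases; the peak of that curve is the quantity compared between strongly and weakly bound inhibitors.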

Steered MD in Drug Discovery

Jorgensen, 2010

Protein-protein interactions in programmed cell death

Lama and Sankararamakrishnan, Proteins (2008)

Lama and Sankararamakrishnan, Biochemistry (2010)

Bcl-2 family complex structures

Total number of atoms: ~50,000 to ~75,000

Simulation period: 50 ns

III. Large Biomolecular Assemblies

First Biomolecular simulation was performed in 1977

GlpF: 81,006 atoms

AQP1: 75,057 atoms

PfAQP: 81,503 atoms

A 30 ns production run was performed for each of the three systems.

Each simulation takes ~40 days CPU time (Total CPU time ~ 120 days).
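The cost figures above reduce to simple arithmetic. The sketch below is only a back-of-the-envelope helper; it treats the quoted ~40 days as the time needed for one 30 ns run and assumes cost scales linearly with simulated time, which is an assumption rather than a statement from the slides.

```python
def ns_per_day(simulated_ns, days):
    """Simulation throughput in nanoseconds per day."""
    if days <= 0:
        raise ValueError("days must be positive")
    return simulated_ns / days


def total_days(days_per_run, n_runs):
    """Total time if the runs are executed one after another."""
    return days_per_run * n_runs


if __name__ == "__main__":
    rate = ns_per_day(30.0, 40.0)      # one 30 ns run in ~40 days
    print(rate)                        # throughput per system
    print(total_days(40.0, 3))         # three channel systems
    print(50.0 / rate)                 # days for a 50 ns run at the same rate
```

At this rate each system advances by well under 1 ns per day, which is why longer time scales and larger assemblies call for parallel HPC resources.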

MD simulations of channel proteins in bilayers

Alok Jain, Ravi Verma and R. Sankararamakrishnan, Manuscript in preparation

Complete virus: 1 million atoms (Freddolino et al., 2006)

Arrays of light-harvesting proteins – 1 million atoms (Chandler et al., 2008)

Simulations reaching the million-atom mark

BAR domain proteins – 2.3 million atoms (Yin et al., 2009)

The flagellum – 2.4 million atoms (Kitao et al., 2006)

Minimization and equilibration

Cluster of 48 AMD Athlon 2600+ processors

Simulation

256 Altix nodes at NCSA @UIUC

1.1 ns/day

Complete virus: 1 million atoms

(Freddolino et al., 2006)

Functions of large molecular machines

30S ribosome

Fungal fatty acid synthase

Gumbart et al. (2009)

2.7 million atoms

50 ns simulation

MD of protein-conducting channel bound to ribosome

Largest system simulated to date

Bacterial ribosomes are important targets for antibiotics

Phylogenetic analysis

Large Biomolecular systems

Drug Design & Discovery

HPC

HPC Platforms for Biology Applications

FPGA boards: Field-programmable gate arrays are ICs that can be programmed. FPGA boards with commonly used bioinformatics algorithms are available

Graphics-Processing Unit (GPU): All bioinformatics applications

Grid Computing: Many applications

Distributed Computing: Protein folding, Drug docking

Cloud Computing:

Acknowledgements

Anjali Bansal

Dilraj Lama

Alok Jain

Tuhin Kumar Pal

Priyanka Srivastava

Vivek Modi

Ravi Kumar Verma

Krishna Deepak

Phani Deep

DST, DBT, CSIR, MHRD
