A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Post on 11-Jan-2016

Page 1: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

A rapid tour of

Bioinformatics

Saurabh Sinha, Lenny Pitt

Page 2: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Bioinformatics, or Computational Biology ?

• sometimes used interchangeably

• latter sometimes includes former

• often, latter means molecular modeling to investigate properties and behaviors of molecules via computer simulation

• often, former refers to application of databases, algorithms, computational and statistical techniques to solve problems arising from the management and analysis of biological data.

Page 3: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Computational Biology

• Example: protein folding

• http://www.youtube.com/watch?v=lijQ3a8yUYQ

• http://fold.it/portal/

Page 4: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Molecular Biology 101

Page 5: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Cells

• Cells are the fundamental units of living organisms

• Cells are born, do their jobs, and die

• Study of life = study of cells

Page 6: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Proteins

• Many of the processes (chemical reactions) inside cells are carried out by proteins

Image: iwrwww1.fzk.de/biostruct/ Assets/1a00x500.jpg

Page 7: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

DNA

• DNA carries the information on which proteins to produce in a cell, and how

SOURCE: http://www.microbe.org/espanol/news/human_genome.asp

[Figure labels: Chromosome; DNA]

Page 8: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

• DNA is a string written in the alphabet {A,C,G,T}

• Human DNA is a string with 3 billion characters !

adenine, cytosine, guanine (DNA and RNA); thymine (DNA); uracil (RNA)

Page 9: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

DNA and Proteins

www.ornl.gov/.../slides/ images/01-0037low.jpg

Page 10: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Genes

• Genes are “substrings” (~1000 bp) of DNA

• A gene is used as a template for producing a protein

• Each protein comes from a different gene

• ~25,000 genes in the human DNA

• The process of making a protein from a gene can be regulated in the cell: GENE REGULATION

Page 11: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

The initial successes of bioinformatics

Page 12: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Some problems & successes

1. Sequence alignment

2. Comparative genomics

3. Sequencing the genome

4. Gene search

5. Evolutionary biology & phylogenetic trees

Page 13: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

1. Sequence Alignment (fundamental question)

• Is this string equal to that one?

• Does this string contain a copy of that one?

• Is this string “like” that one? How much alike?

• Is this string “like” a portion of that one?

Page 14: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Sequence alignment

• Could you have done this task, for two strings of length 1 million characters, by hand ?

• Sequence analysis algorithms are the bread and butter of bioinformaticians.

Page 15: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

CS has already studied these!

• Is this string equal to that one?
– compare two files

• Is this string equal to a portion of that one?
– find a word in a document

• Is this string “like” a portion of that one?
– find suggested spelling corrections

Page 16: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Edit Distance

• how much alike are two strings?

• CATTGAGCT

• CTTAGCCTA

CATTGAGCT–

C–TTAGCCTA

• Is this the best possible?

CATTGAGC–T–

C–TT–AGCCTA

Page 17: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

• Charge one for each mismatch, each insertion, each deletion.

• Problem: find the least cost alignment

• Extensions: charge different amounts for A/C mismatch, for insertion, etc., reflecting (un)likelihood of certain genetic mutations.

• There are reasonably efficient algorithms for all of these problems
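
The unit-cost scheme above (one per mismatch, insertion, or deletion) can be sketched as a standard dynamic program. This is a minimal illustration, not the presenters' code; the function names `alignment_cost` and `edit_distance`, the ASCII `-` gap symbol, and the `mismatch`/`gap` parameters are assumptions made for the example. Under unit costs, the first alignment on the Edit Distance slide costs 5 and the second costs 4, which the dynamic program confirms is the minimum for these two strings.

```python
def alignment_cost(top, bottom):
    """Score a given alignment: every column whose symbols differ
    (a mismatch, or a letter against the gap symbol '-') costs one."""
    return sum(1 for a, b in zip(top, bottom) if a != b)


def edit_distance(s, t, mismatch=1, gap=1):
    """Least-cost alignment of s and t, computed row by row.

    prev[j] holds the cheapest cost of aligning the first i-1
    characters of s with the first j characters of t.
    """
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            substitute = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            delete = prev[j] + gap        # s[i-1] aligned to a gap
            insert = curr[j - 1] + gap    # t[j-1] aligned to a gap
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


print(alignment_cost("CATTGAGCT-", "C-TTAGCCTA"))    # first slide alignment
print(alignment_cost("CATTGAGC-T-", "C-TT-AGCCTA"))  # second slide alignment
print(edit_distance("CATTGAGCT", "CTTAGCCTA"))       # optimal cost
```

The running time is proportional to the product of the two string lengths, which is what makes million-character comparisons feasible by machine. Changing `mismatch` and `gap` gives a simple version of the "charge different amounts" extension; a per-letter-pair cost table would be the next step.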

Page 18: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

2. Comparative genomics

• Human and mouse share the genetic “toolkit” for development

• Compare the two genomes and find the conserved features

• These are likely to be of functional importance

• How to compare two genomes ?
– Sequence alignment

Page 19: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

2. Comparative genomics

http://genome.ucsc.edu/cgi-bin/hgGateway

Page 20: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

3. Sequencing the Genome: The Human Genome Project

• Human genome: a “string” of length 3,000,000,000 characters !

• Starting with a human cell, how can we obtain this sequence ?
– The problem of sequencing
– 2001: draft human genome sequence published

Page 21: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Shotgun Sequencing

• Lab technology: can sequence a snippet of 1000-2000 nucleotides.

• Idea: “shotgun” apart multiple copies of the whole genome, sequence all snippets, reconstruct. http://en.wikipedia.org/wiki/Shotgun_sequencing

• 3 billion / 1000 = 3 million snippets.

• Want multiple copies divided in different spots, so many snippets overlap

• From overlap, we can tell how things go together.

• Need about 7-fold replication to make near-complete coverage very likely
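
The arithmetic on this slide can be checked with a few lines of Python. This is a rough sketch under an idealized assumption that is not in the slides: snippets start at uniformly random positions, so the chance that a given base is missed at c-fold coverage is about e^(-c) (the classic Lander-Waterman style estimate). The function name `shotgun_stats` is made up for the example.

```python
import math


def shotgun_stats(genome_len, read_len, coverage):
    """Return (number of snippets needed, expected fraction of bases missed).

    Assumes snippets land uniformly at random along the genome, so the
    probability that a particular base is covered by no snippet is
    approximately exp(-coverage).
    """
    n_reads = math.ceil(coverage * genome_len / read_len)
    frac_uncovered = math.exp(-coverage)
    return n_reads, frac_uncovered


n_reads, missed = shotgun_stats(3_000_000_000, 1000, 7)
print(n_reads)                   # snippets needed for 7-fold coverage
print(missed)                    # expected fraction of bases never read
print(missed * 3_000_000_000)    # expected number of bases never read
```

Under this model, 7-fold coverage needs 21 million snippets of length 1000 and still leaves roughly 0.09% of positions unread on average, so high coverage makes gaps rare rather than impossible.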

Page 22: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

How is the genome sequenced ?

http://www.wiley.com/legacy/college/boyer/0470003790/cutting_edge/shotgun_seq/computer.gif

Page 23: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Assembly Methods

• Greedy approach

• Graph approaches:
– Hamiltonian path
– Traveling Salesman (TSP) in k-mer graph
– Eulerian path in k-mer graph

READ: http://www.cbcb.umd.edu/research/assembly_primer.shtml

Page 24: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Greedy Approach

• Merge two snippets with greatest overlap

• Repeat

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Problem: may merge repeated segments (more than 50% of the human genome consists of repeats)
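
The greedy idea can be sketched in a few lines. This is a toy illustration with hypothetical helper names (`overlap`, `greedy_assemble`) and made-up error-free snippets; real assemblers must also handle sequencing errors, both DNA strands, and the repeat problem noted above.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair of snippets with the greatest overlap.

    Expects a non-empty list of strings. Ties are broken by scan order.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue the two snippets together, writing the shared part once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0]


print(greedy_assemble(["CATCC", "TCCAG", "CAGTCA"]))
```

Because each step commits to the locally best merge, two snippets that overlap only because they come from different copies of a repeat can be glued together incorrectly, which is the failure mode the slide warns about.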

Page 25: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Hamiltonian Path

• Create graph
– vertices = snippets
– edges = overlap

http://www.cbcb.umd.edu/research/assembly_primer.shtml

red edges correspond to repeated segments

Find a path that visits each vertex exactly once

Page 26: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Other graph approaches

• Unknown sequence.

• Challenge: here are the “3-mers”: CAG, ATC, GTC, CCA, CAT, AGT, TCC, TCA

• Max TSP approach
– 3-mers are vertices

• Eulerian Path approach
– 3-mers are edges

Page 27: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Solution

CATCCAGTCA

Page 28: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Max TSP approach

3-mers sequenced: { ATC, CCA, CAG, TCC, AGT }

3-mers extracted from unknown sequence: ATCCAGT

[Figure: graph with one vertex per 3-mer (ATC, TCC, CCA, CAG, AGT); each edge is weighted by the overlap between its two 3-mers (0, 1, or 2)]

Find max-weight tour visiting all vertices
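
For a handful of 3-mers the max-weight search can be done by brute force. The sketch below is an assumption-laden illustration: it looks for an open path rather than a closed tour, tries every ordering (so it is only usable for tiny inputs), and uses made-up names (`overlap`, `max_overlap_path`).

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def max_overlap_path(kmers):
    """Try every ordering of the k-mers (vertices) and keep the one whose
    consecutive overlaps (edge weights) sum to the largest total."""
    best_weight, best_order = -1, None
    for order in permutations(kmers):
        weight = sum(overlap(a, b) for a, b in zip(order, order[1:]))
        if weight > best_weight:
            best_weight, best_order = weight, order
    # Spell out the sequence by appending the non-overlapping tail of each k-mer.
    seq = best_order[0]
    for a, b in zip(best_order, best_order[1:]):
        seq += b[overlap(a, b):]
    return seq, best_weight


print(max_overlap_path(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

The number of orderings grows factorially with the number of vertices, which is why this formulation does not scale to millions of snippets and why the Eulerian formulation is attractive.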

Page 30: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Eulerian paths and k-mers

• get sequence of all k-mers (including multiplicities)

• edges are k-mers

• vertices are the (k-1)-bp prefix and suffix of each k-mer

• find Eulerian path (traverses each edge exactly once)
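
The construction above can be sketched with Hierholzer's algorithm for Eulerian paths. This is a minimal illustration, not the presenters' code: the function name `eulerian_reconstruct` is made up, the input is assumed to be a list of error-free k-mers with multiplicities, and when several Eulerian paths exist the function simply returns one of them.

```python
from collections import defaultdict


def eulerian_reconstruct(kmers):
    """Rebuild a sequence from its k-mers via an Eulerian path.

    Each k-mer is a directed edge from its (k-1)-prefix to its (k-1)-suffix.
    """
    graph = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    # Start where out-degree exceeds in-degree by one; if the graph is
    # balanced (an Eulerian circuit), any vertex with an edge will do.
    nodes = set(out_deg) | set(in_deg)
    start = next((n for n in nodes if out_deg[n] - in_deg[n] == 1), None)
    if start is None:
        start = kmers[0][:-1]

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("no Eulerian path uses every k-mer")
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_reconstruct(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

Unlike the Hamiltonian and TSP formulations, finding an Eulerian path takes time linear in the number of edges, which is the main appeal of treating k-mers as edges rather than vertices.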

Page 31: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

3-mers sequenced: { ATC, CCA, CAG, TCC, AGT }

[Figure: graph with vertices AT, TC, CC, CA, AG, GT and directed edges labeled by the 3-mers ATC, TCC, CCA, CAG, AGT]

Find tour using all edges: ATCCAGT

Page 32: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Exercise

• Length-9 DNA sequence was deconstructed

• 3-mers = {GTT, TCG, CGT, TTA, ACG, TTC, TAC}

• Draw graph with directed edges labeled by these 3-mers, and vertices labeled with the corresponding 2-mers

• Find a directed path through this graph that crosses each edge exactly once, and write down a possible original length-9 sequence that can be reconstructed from the path

Page 33: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

4. Gene Search

• Find out where the genes are located in this long string

• Genes cover ~2% of human genome

• Finding them using computer algorithms and statistics

http://www.broad.mit.edu/annotation/argo/help/usecase/index_files/image012.jpg

Page 34: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

4. Gene Search

• Comparative genomics: regions similar to known genes in other organisms likely have similar function

• Similarity to gene-like patterns

• Reverse engineering from expressed proteins

• (http://en.wikipedia.org/wiki/Gene_prediction)
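
One very simple "gene-like pattern" is an open reading frame: a start codon (ATG) followed, in the same reading frame, by a stop codon (TAA, TAG, or TGA). The sketch below is a toy under stated assumptions, not how production gene finders work: it scans only the forward strand, ignores introns, and the name `find_orfs` and the `min_codons` threshold are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch found in
    any of the three forward reading frames. `end` is exclusive and
    includes the stop codon; `min_codons` counts codons before the stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

Statistical gene finders go well beyond this, for example by modeling codon usage and by comparing against known genes, as the bullets above suggest.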

Page 35: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

5. Evolutionary Biology and Phylogenetic Trees

• See presentation by Jana Sperschneider

Page 36: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

21st century biology: bioinformatics drives the revolution

Page 37: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Special issue of journal Science, July 1, 2005.

Page 38: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

>What Is the Universe Made Of?
>What is the Biological Basis of Consciousness?
>Why Do Humans Have So Few Genes?
>To What Extent Are Genetic Variation and Personal Health Linked?
>Can the Laws of Physics Be Unified?
>How Much Can Human Life Span Be Extended?
>What Controls Organ Regeneration?
>How Can a Skin Cell Become a Nerve Cell?
>How Does a Single Somatic Cell Become a Whole Plant?
>How Does Earth's Interior Work?
>Are We Alone in the Universe?
>How and Where Did Life on Earth Arise?
>What Determines Species Diversity?
>What Genetic Changes Made Us Uniquely Human?
>How Are Memories Stored and Retrieved?
>How Did Cooperative Behavior Evolve?
>How Will Big Pictures Emerge from a Sea of Biological Data?
>How Far Can We Push Chemical Self-Assembly?
>What Are the Limits of Conventional Computing?
>Can We Selectively Shut Off Immune Responses?
>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
>Is an Effective HIV Vaccine Feasible?
>How Hot Will the Greenhouse World Be?
>What Can Replace Cheap Oil -- and When?
>Will Malthus Continue to Be Wrong?

Page 40: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

A simple organism

[Figure labels: GENE; Raw materials; Environmental signal; Response (protein)]

Page 41: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

A simple organism

[Figure labels: GENE1, GENE2, GENE3; Environmental signal; Raw materials]

Page 42: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

A simple organism

[Figure labels: GENE1 through GENE10]

Page 43: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

A complex organism

[Figure labels: GENE1 through GENE10]

Complex circuit of interactions

Do not need more genes; additional complexity comes from more interconnections among genes

Page 44: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Regulatory networks

• Genes are switches, transcription factors are input signals, proteins are outputs

• Proteins (outputs) are the signals for other genes (switches)

• This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity)

• Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information

Page 45: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Decoding the regulatory network

• Find patterns (“binding sites”) in DNA sequence

• Analyze high throughput measurements of gene activity levels (“microarrays”)

• Analyze measurements of protein-DNA interaction (“ChIP-on-chip”)

• Integration of heterogeneous sources of data
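
As a rough illustration of the first step, finding candidate binding sites can be approximated by sliding a short consensus pattern along the DNA and allowing a few mismatches. Everything specific here is an assumption for the example: the function name `scan_motif`, the consensus `TATAAA`, the toy input sequence, and the mismatch threshold. Practical tools typically score windows with position weight matrices rather than a single consensus string.

```python
def scan_motif(dna, consensus, max_mismatches=1):
    """Return (position, window, mismatches) for every window of the DNA
    that differs from the consensus in at most max_mismatches letters."""
    hits = []
    width = len(consensus)
    for i in range(len(dna) - width + 1):
        window = dna[i:i + width]
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


print(scan_motif("GGTATAAAGGCTATATAC", "TATAAA", max_mismatches=1))
```

Hits from a scan like this are only candidates; combining them with microarray and ChIP-on-chip measurements is what the "integration" bullet refers to.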

Page 46: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

REGULATORY NETWORK DISCOVERY

http://www.chiponchip.org/Images/scheme_800x600_crop.jpg

Microarrays

ChIP-on-chip

Patterns in DNA sequence

Page 47: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

“How does a single somatic cell become a whole plant ?”

Page 48: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Developmental biology

• The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult

• A paradox: all cells in the adult body have the same DNA, so why are different cells different ?

Page 49: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

How does a single cell lead to this ? …

… and to this ?

Drosophila (fruit fly)

Page 50: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Answer: Regulatory networks (Again !)

• Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo

• Hidden Markov models, a popular technique in signal processing, used to detect such regions

• Multiple species comparison aids discovery

Page 51: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

“How did cooperative behavior evolve?”

Page 52: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Cooperative social behavior

• What is the genetic (molecular) basis of social behavior ?

• Social behavior in honey bees

• Young worker bees are nurses in the hive; older ones go out to forage

• This behavioral pattern is determined by needs of colony
– How do the bees know ?

Page 53: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Bioinformatics of social behavior

• UIUC team scanned the honeybee genome to understand this

• Regulatory network of social behavior

• Statistical tools, machine learning, sequence analysis used for this project

Page 54: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

“How will big pictures emerge from a sea of biological data?”

Page 55: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

The sea

• Genomes: 3 x 10^9 bp of human genome

• Similar numbers for other genomes: mouse, rat, dog, chicken, chimp etc.

• Microarray: snapshots of 1000s of genes’ activities at one time and condition. Thousands of microarrays.

• ChIP-on-chip data: measurements of a transcription factor’s binding affinity for 1000s of genes (promoters).

Page 56: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Segal et al. Nature Genetics 2005.

Big pictures

A compendium of cancer genes and their regulation

Page 57: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

The sea of biological data

• Biological literature, capturing decades of painstaking experimental work on genetics and molecular biology

• Can we glean useful information from this vast body of knowledge ?

• Biological literature mining.
– Natural language processing
– Text Information Retrieval (statistical approaches)

Page 58: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Some other challenges

• Protein structure prediction

• Can we predict the 3-D structure of a protein from its sequence ?
– Why ?
– One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function
– We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design !!

• Neural networks, a popular technique in computer science, applied to this problem

Page 59: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Some other challenges

• “Metagenomics”

• Most studies to date are on genomes of one species

• A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these ?

• Bioinformatics is indispensable !!

• New type of data, new types of algorithms

Page 60: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Many more challenges

• New types of data arise from technological breakthroughs in biology

• High throughput data carries an unprecedented amount of information

• Too much noise

• Bioinformatics helps separate the signal from the noise

Page 61: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Bioinformatics

• Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.)

• Is about a family of very different problems, all related to biology, all related to each other

• How can computers help solve any of this family of problems ?

Page 62: A rapid tour of Bioinformatics Saurabh Sinha, Lenny Pitt

Bioinformatics and You

• You can learn the tools of bioinformatics

• These tools owe their origin to computer science, information theory, probability theory, statistics, etc.

• You can learn the language of biology, enough to understand what the problems are

• You can apply the tools to these problems and contribute to science