
Introduction to Algorithms in Computational Biology

Lecture 1

Background Readings: The first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001.

This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger.


Course Information

Meetings:

Lecture, by Dan Geiger: Mondays 16:30-18:30, Taub 4.
Tutorial, by Ydo Wexler: Tuesdays 10:30-11:30, Taub 2.

Grade:

20% from five question sets. These question sets are obligatory. Each contains 4-6 theoretical problems. Submit in pairs within two weeks.
80% from the test. You must score above 55 on the test for the homework grade to count.

Information and handouts:

www.cs.technion.ac.il/~cs236522

A brochure with photocopied material is available at the Taub library.


Course Prerequisites

Computer Science and Probability Background

Data Structures 1 (cs234218)
Algorithms 1 (cs234247)
Probability (any course)

Some Biology Background

Formally: None, to allow CS students to take this course.
Recommended: Biology 1 (especially for those in the Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).

Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.


Relations to Some Other Courses

Intro to Bioinformatics (cs236523). This course covers practical aspects and hands-on experience with web-based bioinformatics software. Although it is not a formal requirement, it is recommended that you visit the web site http://webcourse.technion.ac.il/234523/ and examine the relevant software.

Algorithms in Computational Biology (cs236522). This is the current course which focuses on modeling some bioinformatics problems and presents algorithms for their solution.

Bioinformatics project (cs5236524). Developing bioinformatics tools under close guidance.


First Homework Assignment

Read carefully the first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001.

Solve two of the questions for Chapter 2 and two of the questions for Chapter 3.

Due: during the third tutorial class, or earlier in the teaching assistant’s mail slot. Remember to submit in pairs.


Computational Biology

Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.

Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting the field to molecular Biology only.


Examples of Areas of Interest

• Building evolutionary trees from molecular (and other) data
• Efficiently assembling genomes of various organisms
• Understanding the structure of genomes (SNP, SSR, Genes)
• Understanding function of genes in the cell cycle and disease
• Deciphering structure and function of proteins


Exponential growth of biological information: growth of sequences, structures, and literature.


Four Aspects

Biological
What is the task?

Algorithmic
How to perform the task at hand efficiently?

Learning
How to adapt/estimate/learn parameters and models describing the task from examples?

Statistics
How to differentiate true phenomena from artifacts?


Example: Sequence Comparison

Biological
Evolution preserves sequences, thus similar genes might have similar function.

Algorithmic
Consider all ways to “align” one sequence against another.

Learning
How do we define “similar” sequences? Use examples to define similarity.

Statistics
When we compare to ~10^6 sequences, what is a random match and what is a true one?
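To make the algorithmic aspect concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch recurrence), which finds the best score over all ways to align two sequences without enumerating them. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score over all global alignments of a and b.

    The scoring parameters are illustrative assumptions, not course values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] against b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[-1][-1]


print(global_alignment_score("AAG", "AAA"))
```

The table has (len(a)+1) x (len(b)+1) cells, so the running time is proportional to the product of the two sequence lengths, even though the number of possible alignments grows exponentially.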


Course Goals

Learning about computational tools for (primarily) molecular biology.

We will cover computational tasks that are posed by modern molecular biology.
We will discuss the biological motivation and setup for these tasks.
We will understand the kinds of solutions that exist and what principles justify them.


Topics I

Dealing with DNA/Protein sequences:

Finding similar sequences
Models of sequences: Hidden Markov Models
Gene finding
Genome projects and how sequences are found


Topics II

Models of genetic change:

Long term: evolutionary changes among species
    Reconstructing evolutionary trees from sequences
Short term: genetic variations in a population
    Finding genes by linkage and association


Topics III (One class, if time allows)

Protein World:

How proteins fold: secondary and tertiary structure
How to predict protein folds from sequence data
How to analyze protein changes from raw experimental measurements (MassSpec)


Human Genome

Most human cells contain 46 chromosomes:

2 sex chromosomes (X, Y):
XY in males.
XX in females.

22 pairs of chromosomes named autosomes.


DNA Organization

[Figure. Source: Alberts et al.]


The Double Helix

[Figure. Source: Alberts et al.]


DNA Components

Four nucleotide types:

Adenine
Guanine
Cytosine
Thymine

Hydrogen bonds (electrostatic connection):

A-T
C-G
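Because A pairs only with T and C only with G, one strand of the double helix determines the other. The sketch below computes the partner strand; it relies on the standard fact (not stated on this slide) that the two strands run in opposite directions, so the complement is read in reverse. The names `COMPLEMENT` and `reverse_complement` are illustrative.

```python
# Base-pairing rules from the slide: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ATGC"))
```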


Genome Sizes

E. coli (bacteria)            4.6 x 10^6 bases
Yeast (simple fungi)          15 x 10^6 bases
Smallest human chromosome     50 x 10^6 bases
Entire human genome           3 x 10^9 bases


Genetic Information

Gene: the basic unit of genetic information. Genes determine the inherited characters.
Genome: the collection of genetic information.
Chromosomes: storage units of genes.


Genes

The DNA strings include:

Coding regions (“genes”)
    E. coli has ~4,000 genes
    Yeast has ~6,000 genes
    C. elegans has ~13,000 genes
    Humans have ~32,000 genes

Control regions
    These are typically adjacent to the genes
    They determine when a gene should be expressed

“Junk” DNA (unknown function)


The Cell

All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.


Example: Tissues in Stomach

How is this variety encoded and expressed?


Central Dogma

Gene --(Transcription)--> mRNA --(Translation)--> Protein

Cells express different subsets of the genes in different tissues and under different conditions.


Transcription

Coding sequences can be transcribed to RNA.

RNA nucleotides:
Similar to DNA, with a slightly different backbone
Uracil (U) instead of Thymine (T)

Source: Mathews & van Holde
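At the level of letters, transcription can be sketched as a simple substitution. The function below assumes its input is the coding (sense) strand written 5' to 3', so the mRNA has the same sequence with U in place of T; that convention and the function name are assumptions made for illustration.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same letters, U instead of T."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCCTAA"))
```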


Transcription: RNA Editing

1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists

Exons hold information and are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules then diffuse through the nuclear membrane into the cytoplasm.


RNA Roles

Messenger RNA (mRNA)
Encodes protein sequences. Every three nucleotides translate to one amino acid (the protein building block).

Transfer RNA (tRNA)
Decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.

Ribosomal RNA (rRNA)
Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created.

...


Translation (outside the nucleus)

Translation is mediated by the ribosome.
The ribosome is a complex of protein and rRNA molecules.
The ribosome attaches to the mRNA at a translation initiation site.
The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (a polypeptide), which is released and folds into a protein.


Genetic Code

There are 20 amino acids from which proteins are built.
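The genetic code maps each of the 64 nucleotide triplets (codons) to one of the 20 amino acids or to a stop signal. The sketch below builds the standard codon table compactly and translates an mRNA string; the one-letter amino acid codes, the `*` marker for stop codons, and the choice to start reading at position 0 are conventions assumed for this example rather than anything the slide specifies.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, codons enumerated in
# UCAG order (UUU, UUC, UUA, UUG, UCU, ...). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))
```

Since 64 codons encode only 20 amino acids plus stop, several codons map to the same amino acid; the table above makes that redundancy easy to inspect.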


Protein Structure

Proteins are polypeptides of 70-3000 amino acids.

The structure of a protein is (mostly) determined by the sequence of amino acids that make it up.


Protein Structure


Evolution

Related organisms have similar DNA
    Similarity in sequences of proteins
    Similarity in organization of genes along the chromosomes

Evolution plays a major role in biology
    Many mechanisms are shared across a wide range of organisms
    During the course of evolution, existing components are adapted for new functions


Evolution

Evolution of new organisms is driven by:

Diversity
Different individuals carry different variants of the same basic blueprint.

Mutations
The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.

Selection bias


The Tree of Life

[Figure. Source: Alberts et al.]


Example for Phylogenetic Analysis

Input: four nucleotide sequences, AAG, AAA, GGA, and AGA, taken from four species.

Question: Which evolutionary tree best explains these sequences?

One Answer (the parsimony principle): Pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (also called a phylogenetic tree).

[Figure: a tree with root AAA and two internal nodes, both labeled AAA. One internal node has the leaves AGA (1 substitution) and AAA (0 substitutions); the other has the leaves GGA (2 substitutions) and AAG (1 substitution).]

Total #substitutions = 4


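Example Continued

There are many trees possible. For example:

[Figure, left tree: root AAA with internal nodes AGA and AAA. Internal node AGA has the leaves AGA (0 substitutions) and GGA (1); internal node AAA has the leaves AAA (0) and AAG (1); the edge from the root to internal node AGA costs 1.]

Total #substitutions = 3

[Figure, right tree: root AAA with two internal nodes, both labeled AAA. One internal node has the leaves GGA (2 substitutions) and AAA (0); the other has the leaves AGA (1) and AAG (1).]

Total #substitutions = 4

The left tree is “better” than the right tree.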

Questions:

Does this principle yield realistic phylogenetic trees? (Evolution)
How can we compute the best tree efficiently? (Computer Science)
What is the probability of substitutions given the data? (Learning)
Is the best tree found significantly better than others? (Statistics)
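For a fixed tree shape, the minimum number of substitutions can be computed without guessing the internal labels. The sketch below uses Fitch's algorithm, one standard method for this; the slides do not say which method the course uses. Trees are written as nested tuples whose leaves are the sequences themselves, and the function names are illustrative. The three trees scored are the three possible ways to group the four example sequences into two pairs.

```python
def fitch_site(tree, site):
    """Fitch's rule for one sequence position.

    Returns (candidate states at this node, substitutions in this subtree).
    A tree is either a leaf sequence (str) or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        # Children can agree on a state: no extra substitution needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and pay one substitution.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Minimum total substitutions over all positions for a fixed tree shape."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


trees = [
    (("AGA", "GGA"), ("AAA", "AAG")),
    (("AGA", "AAA"), ("GGA", "AAG")),
    (("GGA", "AAA"), ("AGA", "AAG")),
]
for tree in trees:
    print(tree, parsimony_score(tree, 3))
```

Each position is handled independently and each node is visited once per position, so scoring one tree is cheap; the hard part of the problem is the search over tree shapes, whose number grows very quickly with the number of species.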


Werner’s Syndrome

A successful application of genetic linkage analysis


The Disease

First references in the 1960s
Causes premature ageing
Linkage studies from 1992
WRN gene cloned in 1996
Subsequent discovery of mechanisms involved in wild-type and mutant proteins


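A Sample Input

[Figure: a pedigree of five individuals, numbered 1 to 5. Each individual is labeled H or D together with a marker genotype; the genotype labels in the figure are A1/A1, A2/A2, A1/A2, A1/A2, and A2/A2. The figure also shows haplotype labels (D D A1 A2; H D A1 A2; H | D A2 | A2; D D A2 A2), with the annotations “Recombinant” and “Phase inferred”.]

The study used 13 markers; here we see only one.
The study used 14 families; here we see only one.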


Genehunter Output

position    LOD_score     information
0.00        -1.254417     0.224384
1.52        2.836135      0.226379
...[data skipped]...
18.58       13.688599     0.384088
19.92       14.238474     0.401992
21.26       14.718037     0.426818
22.60       15.159389     0.462284
22.92       15.056713     0.462510
23.24       14.928614     0.463208
23.56       14.754848     0.464387
...[data skipped]...
81.84       1.939215      0.059748
90.60       -11.930449    0.087869

Annotations in the figure:

Marker’s name: D8S339, D8S131, D8S259.
LOD_score: log likelihood of placing the disease gene at the given distance, relative to it being unlinked.
position: distance between markers in centimorgans.
Maximum log likelihood score: 15.159389, at position 22.60, the most ‘likely’ position.
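Reading the most likely position off this output amounts to taking the row with the largest LOD score. The sketch below does that for the rows shown on the slide (the skipped rows are simply absent). It also converts the score to an odds ratio under the standard convention that a LOD score is a base-10 logarithm, which the slide does not state explicitly; the names `ROWS` and `best_position` are illustrative.

```python
# (position, LOD_score, information) rows copied from the slide;
# rows marked "[data skipped]" on the slide are not included.
ROWS = [
    (0.00, -1.254417, 0.224384),
    (1.52, 2.836135, 0.226379),
    (18.58, 13.688599, 0.384088),
    (19.92, 14.238474, 0.401992),
    (21.26, 14.718037, 0.426818),
    (22.60, 15.159389, 0.462284),
    (22.92, 15.056713, 0.462510),
    (23.24, 14.928614, 0.463208),
    (23.56, 14.754848, 0.464387),
    (81.84, 1.939215, 0.059748),
    (90.60, -11.930449, 0.087869),
]


def best_position(rows):
    """Return (position, LOD) for the row with the highest LOD score."""
    position, lod, _information = max(rows, key=lambda row: row[1])
    return position, lod


position, lod = best_position(ROWS)
# Assuming LOD = log10(likelihood if linked here / likelihood if unlinked).
odds = 10 ** lod
print(f"most likely position: {position}, LOD {lod}, odds about {odds:.2e} to 1")
```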


Final Location

[Figure: a map showing marker D8S131, marker D8S259, the location of marker D8S339, and the final location of the WRN gene.]

Error in location by genetic linkage of about 1.25M base pairs.