Page 1: Introduction to Algorithms in Computational Biology

Introduction to Algorithms in Computational Biology

Lecture 1

Background Readings: The first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001.

This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger.

Page 2: Introduction to Algorithms in Computational Biology

Course Information

Meetings:
Lecture, by Dan Geiger: Mondays 16:30-18:30, Taub 4.
Tutorial, by Ydo Wexler: Tuesdays 10:30-11:30, Taub 2.

Grade:
20% from five question sets. These question sets are obligatory. Each contains 4-6 theoretical problems. Submit in pairs within two weeks.
80% test. The test grade must exceed 55 for the homework grade to count.

Information and handouts:
www.cs.technion.ac.il/~cs236522
A brochure with photocopied material at the Taub library

Page 3: Introduction to Algorithms in Computational Biology

Course Prerequisites

Computer Science and Probability Background
Data Structures 1 (cs234218)
Algorithms 1 (cs234247)
Probability (any course)

Some Biology Background
Formally: None, to allow CS students to take this course.
Recommended: Biology 1 (especially for those in the Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).

Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.

Page 4: Introduction to Algorithms in Computational Biology

Relations to Some Other Courses

Intro to Bioinformatics (cs236523). This course covers practical aspects and hands-on experience with web-based bioinformatics software. Although it is not a formal requirement, you are encouraged to visit the web site http://webcourse.technion.ac.il/234523/ and examine the relevant software.

Algorithms in Computational Biology (cs236522). This is the current course, which focuses on modeling some bioinformatics problems and presents algorithms for their solution.

Bioinformatics project (cs236524). Developing bioinformatics tools under close guidance.

Page 5: Introduction to Algorithms in Computational Biology

First Homework Assignment

Read carefully the first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001.

Solve two of the questions for Chapter 2 and two of the questions for Chapter 3.

Due: during the third tutorial class, or earlier in the teaching assistant’s mail slot. Remember to submit in pairs.

Page 6: Introduction to Algorithms in Computational Biology

Computational Biology

Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.

Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly by restricting the field to molecular Biology only.

Page 7: Introduction to Algorithms in Computational Biology

Examples of Areas of Interest

• Building evolutionary trees from molecular (and other) data
• Efficiently assembling genomes of various organisms
• Understanding the structure of genomes (SNP, SSR, genes)
• Understanding the function of genes in the cell cycle and disease
• Deciphering the structure and function of proteins

Page 8: Introduction to Algorithms in Computational Biology

Exponential growth of biological information: growth of sequences, structures, and literature.

Page 9: Introduction to Algorithms in Computational Biology

Four Aspects

Biological
What is the task?

Algorithmic
How to perform the task at hand efficiently?

Learning
How to adapt/estimate/learn parameters and models describing the task from examples?

Statistics
How to differentiate true phenomena from artifacts?

Page 10: Introduction to Algorithms in Computational Biology

Example: Sequence Comparison

Biological
Evolution preserves sequences, thus similar genes might have similar functions.

Algorithmic
Consider all ways to “align” one sequence against another.

Learning
How do we define “similar” sequences? Use examples to define similarity.

Statistics
When we compare against ~10^6 sequences, which matches are random and which are true?
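
The algorithmic aspect can be illustrated with a small dynamic program that scores the best global alignment of two sequences without enumerating every alignment explicitly. This is only a sketch under stated assumptions: the slide names no algorithm or scoring scheme, so the Needleman-Wunsch-style recurrence, the function name `global_alignment_score`, and the scores (match +1, mismatch -1, gap -1) are all illustrative choices.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b.

    The scoring values are assumptions chosen for illustration only.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] aligned with a gap
                curr[j - 1] + gap,   # b[j-1] aligned with a gap
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("AAG", "AAG"))  # 3
print(global_alignment_score("AAG", "AGA"))  # 0
```

The table has (len(a)+1) x (len(b)+1) cells, so the running time is proportional to the product of the two sequence lengths, even though the number of possible alignments grows exponentially.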

Page 11: Introduction to Algorithms in Computational Biology

Course Goals

Learning about computational tools for (primarily) molecular biology.
We will cover computational tasks that are posed by modern molecular biology.
We will discuss the biological motivation and setup for these tasks.
We will understand the kinds of solutions that exist and what principles justify them.

Page 12: Introduction to Algorithms in Computational Biology

Topics I

Dealing with DNA/Protein sequences:
Finding similar sequences
Models of sequences: Hidden Markov Models
Gene finding
Genome projects and how sequences are found

Page 13: Introduction to Algorithms in Computational Biology

Topics II

Models of genetic change:
Long term: evolutionary changes among species
Reconstructing evolutionary trees from sequences
Short term: genetic variations in a population
Finding genes by linkage and association

Page 14: Introduction to Algorithms in Computational Biology

Topics III (One class, if time allows)

Protein World:
How proteins fold: secondary and tertiary structure
How to predict protein folds from sequence data
How to analyze protein changes from raw experimental measurements (MassSpec)

Page 15: Introduction to Algorithms in Computational Biology

Human Genome

Most human cells contain 46 chromosomes:

2 sex chromosomes (X, Y):
XY in males.
XX in females.

22 pairs of chromosomes named autosomes.

Page 16: Introduction to Algorithms in Computational Biology

DNA Organization

[Figure: DNA organization. Source: Alberts et al.]

Page 17: Introduction to Algorithms in Computational Biology

The Double Helix

[Figure: the double helix. Source: Alberts et al.]

Page 18: Introduction to Algorithms in Computational Biology

DNA Components

Four nucleotide types:
Adenine
Guanine
Cytosine
Thymine

Hydrogen bonds (electrostatic connection):
A-T
C-G
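
The pairing rule (A with T, C with G) means one strand determines the other. The sketch below shows this with two hypothetical helper functions; the names are illustrative, and reading the partner strand in reverse reflects the standard fact that the two strands of the helix run in opposite directions, which the slide itself does not state.

```python
# Base-pairing partners as listed on the slide: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base pairing partner of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand read in its own direction (reversed)."""
    return complement(strand)[::-1]


print(complement("AAG"))          # TTC
print(reverse_complement("AAG"))  # CTT
```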

Page 19: Introduction to Algorithms in Computational Biology

Genome Sizes

E. coli (bacteria): 4.6 x 10^6 bases
Yeast (simple fungi): 15 x 10^6 bases
Smallest human chromosome: 50 x 10^6 bases
Entire human genome: 3 x 10^9 bases

Page 20: Introduction to Algorithms in Computational Biology

Genetic Information

Gene: the basic unit of genetic information. Genes determine the inherited characters.
Genome: the collection of genetic information.
Chromosomes: storage units of genes.

Page 21: Introduction to Algorithms in Computational Biology

Genes

The DNA strings include:

Coding regions (“genes”)
E. coli has ~4,000 genes
Yeast has ~6,000 genes
C. elegans has ~13,000 genes
Humans have ~32,000 genes

Control regions
These are typically adjacent to the genes
They determine when a gene should be expressed

“Junk” DNA (unknown function)

Page 22: Introduction to Algorithms in Computational Biology

The Cell

All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.

Page 23: Introduction to Algorithms in Computational Biology

Example: Tissues in Stomach

How is this variety encoded and expressed?

Page 24: Introduction to Algorithms in Computational Biology

Central Dogma

Gene -> (Transcription) -> mRNA -> (Translation) -> Protein

Cells express different subsets of the genes in different tissues and under different conditions.

Page 25: Introduction to Algorithms in Computational Biology

Transcription

Coding sequences can be transcribed to RNA.

RNA nucleotides:
Similar to DNA, with a slightly different backbone
Uracil (U) instead of Thymine (T)

[Figure source: Mathews & van Holde]
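
At the level of sequence text, the U-for-T substitution can be sketched as follows. This is a simplification under stated assumptions: the function name `transcribe` is hypothetical, and the input is assumed to be the coding strand, so the RNA has the same letters with U in place of T.

```python
def transcribe(coding_strand):
    """Return the RNA sequence matching a DNA coding strand, with U in place of T."""
    dna = coding_strand.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA sequence: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCT"))  # AUGGCU
```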

Page 26: Introduction to Algorithms in Computational Biology

Transcription: RNA Editing

1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists

Exons hold information; they are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules diffuse through the nuclear membrane into the cytoplasm.
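
Steps 2 and 3 can be pictured as cutting a string at known exon boundaries and joining the pieces. The sketch below assumes the exon coordinates are already given as half-open (start, end) index pairs; the function name `splice` and the toy sequences are made up for illustration, and finding the boundaries is a separate problem not addressed here.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of pre_mrna, dropping everything between them.

    exons is a list of (start, end) half-open index pairs, in order and
    non-overlapping. Passing a subset of the exons mimics alternative splicing.
    """
    pieces = []
    last_end = 0
    for start, end in exons:
        if not (last_end <= start < end <= len(pre_mrna)):
            raise ValueError(f"bad exon interval: {(start, end)}")
        pieces.append(pre_mrna[start:end])
        last_end = end
    return "".join(pieces)


# Toy transcript (made-up sequences): exon, intron, exon.
exon1, intron, exon2 = "AUGGCU", "GUAAGUCAG", "AAAUAA"
pre_mrna = exon1 + intron + exon2
exon_coords = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

print(splice(pre_mrna, exon_coords))      # AUGGCUAAAUAA
print(splice(pre_mrna, exon_coords[:1]))  # AUGGCU
```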

Page 27: Introduction to Algorithms in Computational Biology

RNA Roles

Messenger RNA (mRNA)
Encodes protein sequences. Each group of three nucleotides translates to an amino acid (the protein building block).

Transfer RNA (tRNA)
Decodes the mRNA molecules into amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.

Ribosomal RNA (rRNA)
Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created.

...

Page 28: Introduction to Algorithms in Computational Biology

Translation (Outside the Nucleus)

Translation is mediated by the ribosome.
The ribosome is a complex of protein and rRNA molecules.
The ribosome attaches to the mRNA at a translation initiation site.
The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (a polypeptide), which is released and folds into a protein.

Page 29: Introduction to Algorithms in Computational Biology

Genetic Code

There are 20 amino acids from which proteins are built.
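
The mapping from nucleotide triplets to amino acids can be sketched in code. The table below is the standard genetic code written as one-letter amino acid symbols, with `*` marking a stop codon; the slide's own table did not survive extraction, so treat this as an assumed stand-in. The function name `translate` is hypothetical, and translation is assumed to start at the first base rather than at a detected initiation site.

```python
BASES = "UCAG"
# Standard genetic code, ordered by first, second, then third base in BASES order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCUAAAUAA"))           # MAK
print(len(set(AMINO_ACIDS) - {"*"}))       # 20
```

The 64 codons map onto only 20 amino acids plus stop, so several codons encode the same amino acid.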

Page 30: Introduction to Algorithms in Computational Biology

Protein Structure

Proteins are polypeptides of 70-3000 amino acids.

This structure is (mostly) determined by the sequence of amino acids that make up the protein.

Page 31: Introduction to Algorithms in Computational Biology

Protein Structure

Page 32: Introduction to Algorithms in Computational Biology

Evolution

Related organisms have similar DNA
Similarity in sequences of proteins
Similarity in organization of genes along the chromosomes

Evolution plays a major role in biology
Many mechanisms are shared across a wide range of organisms
During the course of evolution existing components are adapted for new functions

Page 33: Introduction to Algorithms in Computational Biology

Evolution

Evolution of new organisms is driven by:

Diversity
Different individuals carry different variants of the same basic blueprint.

Mutations
The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.

Selection bias

Page 34: Introduction to Algorithms in Computational Biology

The Tree of Life

[Figure: the tree of life. Source: Alberts et al.]

Page 35: Introduction to Algorithms in Computational Biology

Example for Phylogenetic Analysis

Input: four nucleotide sequences, AAG, AAA, GGA, and AGA, taken from four species.

Question: Which evolutionary tree best explains these sequences?

[Figure: an example tree with leaves AGA, AAA, GGA, and AAG, two internal nodes labeled AAA, and root AAA. Marked edge substitutions: 2, 1, 1. Total #substitutions = 4.]

One answer (the parsimony principle): pick a tree that has the minimum total number of symbol substitutions between species and their originators in the evolutionary tree (also called a phylogenetic tree).

Page 36: Introduction to Algorithms in Computational Biology

Example Continued

There are many trees possible. For example:

[Figure, left tree: leaves AGA, GGA, AAA, and AAG; internal nodes AAA and AGA; root AAA. Marked edge substitutions: 1, 1, 1. Total #substitutions = 3.]

[Figure, right tree: leaves GGA, AAA, AGA, and AAG; two internal nodes labeled AAA; root AAA. Marked edge substitutions: 1, 1, 2. Total #substitutions = 4.]

The left tree is “better” than the right tree.

Questions:
Does this principle yield realistic phylogenetic trees? (Evolution)
How can we compute the best tree efficiently? (Computer Science)
What is the probability of substitutions given the data? (Learning)
Is the best tree found significantly better than others? (Statistics)
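
The totals quoted for the example trees can be checked by summing, over every edge, the number of positions at which parent and child differ. The sketch below is a minimal illustration and not the course's algorithm: the function names are hypothetical, and the grouping of leaves under each internal node is an assumption, because the tree drawings did not survive extraction. When every ancestor is labeled AAA the total is 4 for any grouping, since it equals the sum of each leaf's distance from AAA.

```python
def substitutions(parent, child):
    """Number of positions at which two equal-length sequences differ."""
    if len(parent) != len(child):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(parent, child))


def total_substitutions(edges):
    """Parsimony cost of a fully labeled tree given as (parent, child) edges."""
    return sum(substitutions(parent, child) for parent, child in edges)


# Root AAA, both internal nodes AAA (leaf grouping assumed).
tree_all_aaa = [
    ("AAA", "AAA"), ("AAA", "AAA"),  # root -> internal nodes
    ("AAA", "AGA"), ("AAA", "GGA"),  # first internal node -> leaves
    ("AAA", "AAA"), ("AAA", "AAG"),  # second internal node -> leaves
]

# Root AAA, internal nodes AAA and AGA (leaf grouping assumed).
tree_with_aga = [
    ("AAA", "AAA"), ("AAA", "AGA"),  # root -> internal nodes
    ("AAA", "AAA"), ("AAA", "AAG"),  # internal AAA -> leaves
    ("AGA", "AGA"), ("AGA", "GGA"),  # internal AGA -> leaves
]

print(total_substitutions(tree_all_aaa))   # 4
print(total_substitutions(tree_with_aga))  # 3
```

Scoring one labeled tree is cheap; the hard part raised in the questions above is searching over the many possible trees and ancestor labelings.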

Page 37: Introduction to Algorithms in Computational Biology

Werner’s Syndrome

A successful application of genetic linkage analysis

Page 38: Introduction to Algorithms in Computational Biology

The Disease

First references in 1960s
Causes premature ageing
Linkage studies from 1992
WRN gene cloned in 1996
Subsequent discovery of mechanisms involved in wild-type and mutant proteins

Page 39: Introduction to Algorithms in Computational Biology

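A Sample Input

[Figure: pedigree of one family with five individuals, numbered 1-5. Status and marker genotype labels shown in the figure: H A1/A1, D A2/A2, H A1/A2, D A1/A2, H A2/A2. Phased labels shown in the figure: D D over A1 A2; H D over A1 A2; H | D over A2 | A2; D D over A2 A2. Annotations: "Recombinant" and "Phase inferred".]

The study used 13 markers; here we see only one.
The study used 14 families; here we see only one.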

Page 40: Introduction to Algorithms in Computational Biology

Genehunter Output

position   LOD_score    information
 0.00      -1.254417    0.224384
 1.52       2.836135    0.226379
...[data skipped]...
18.58      13.688599    0.384088
19.92      14.238474    0.401992
21.26      14.718037    0.426818
22.60      15.159389    0.462284
22.92      15.056713    0.462510
23.24      14.928614    0.463208
23.56      14.754848    0.464387
...[data skipped]...
81.84       1.939215    0.059748
90.60     -11.930449    0.087869

Figure annotations:
Marker names: D8S339, D8S131, D8S259
position: distance between markers in centimorgans
LOD_score: log likelihood of placing the disease gene at that distance, relative to it being unlinked
Maximum log likelihood score
Most ‘likely’ position
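
One way to read such output is to scan for the row with the highest LOD score. The sketch below does this over only the rows printed on the slide, so anything in the skipped ranges is not considered; the helper names are hypothetical and the code is not part of Genehunter.

```python
# Rows copied from the slide: position, LOD_score, information.
ROWS = """\
0.00 -1.254417 0.224384
1.52 2.836135 0.226379
18.58 13.688599 0.384088
19.92 14.238474 0.401992
21.26 14.718037 0.426818
22.60 15.159389 0.462284
22.92 15.056713 0.462510
23.24 14.928614 0.463208
23.56 14.754848 0.464387
81.84 1.939215 0.059748
90.60 -11.930449 0.087869
"""


def parse_rows(text):
    """Parse whitespace-separated rows into (position, lod, information) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        position, lod, information = (float(field) for field in line.split())
        rows.append((position, lod, information))
    return rows


def best_position(rows):
    """Return the row with the highest LOD score."""
    return max(rows, key=lambda row: row[1])


position, lod, _ = best_position(parse_rows(ROWS))
print(position, lod)  # 22.6 15.159389
```

Among the printed rows, the score rises up to position 22.60 and falls afterwards.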

Page 41: Introduction to Algorithms in Computational Biology

Final Location

[Figure: map showing the locations of markers D8S339, D8S131, and D8S259 and the final location of the WRN gene.]

The error in the location found by genetic linkage is about 1.25M base pairs.