Page 1: David Evans  CS200: Computers, Programs and Computing University of Virginia Computer Science Class 39: Meaning of Life

David Evans
http://www.cs.virginia.edu/evans

CS200: Computers, Programs and Computing
University of Virginia
Computer Science

Class 39: Meaning of Life

DNA Helix Photomosaic from cover of Nature, 15 Feb 2001 (made by Eric Lander)

23 April 2004 CS 200 Spring 2004 2

If P ≠ NP:

[Diagram of problem classes: P, NP, NP-Complete, Decidable, Undecidable]

Sorting: Θ(n log n)

Simulating Universe: O(n^3)

3SAT: O(2^n) and Ω(n)

Fill tape with 2^n *s: Θ(2^n)

Halts: ()

If P = NP:

[Diagram of problem classes: P, Decidable, Undecidable]

Sorting: Θ(n log n)

Simulating Universe: O(n^3)

3SAT: if shown to be O(n^k) (not yet known if this is true)

Fill tape with 2^n *s: Θ(2^n)

Halts: ()

Is it ever useful to be confident that a problem is hard?

Factoring Problem

– Input: an n-digit number

– Output: two prime factors whose product is the input number

• Easy to multiply to check factors are correct

• Not proven to be NP-Complete (and generally believed not to be), but no efficient classical algorithm is known

• Most used public key cryptosystem (RSA) depends on this being hard
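
The gap between checking and finding factors can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and trial division is a naive stand-in for searching, not how real factoring attacks work.

```python
def check_factors(n: int, p: int, q: int) -> bool:
    """Verifying a claimed factorization takes a single multiplication."""
    return p > 1 and q > 1 and p * q == n


def trial_division(n: int):
    """Naive search for a nontrivial factor pair (p, q) with p <= q.

    Tries every candidate up to sqrt(n), so the work grows
    exponentially in the number of digits of n.
    """
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    return None  # no nontrivial factors found


n = 10403  # toy example: 101 * 103
print(check_factors(n, 101, 103))  # True
print(trial_division(n))           # (101, 103)
```

Doubling the number of digits of n barely changes the cost of check_factors, while the loop in trial_division gets vastly longer.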

Speculations

• Must study math for 15+ years before understanding an (important) open problem
– Was ~10 until Andrew Wiles proved Fermat’s Last Theorem

• Must study physics for ~6 years before understanding an open problem

• Must study computer science for 1 semester before understanding the most important open problem
– Unless you’re a 6-year old at Cracker Barrel

• But, every 5 year-old understands the most important open problems in biology!

Biology’s Open Problem

Which came first, the chicken or the egg?

How can a (relatively) simple, single cell turn into a chicken?

Brief History of Biology

Eras (timeline marks: 1850, 1950, 2000):

• Life is about magic. (“vitalism”)
• Life is about chemistry.
• Life is about information.
• Life is about computation.

Milestones:

• Most biologists work on Classification
• Aristotle (~300BC): genera and species
• Descartes (1641): explain life mechanically
• Schrödinger (1944): life is information, crack the information code
• Watson and Crick (1953): DNA stores the information

DNA

• Sequence of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T)

• Two strands, A must attach to T and G must attach to C
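
The pairing rule means one strand determines the other. A minimal Python sketch (the names PAIR and complement are illustrative, not from the slides):

```python
# Base-pairing rule from the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the bases that attach to each position of the given strand."""
    return "".join(PAIR[base] for base in strand)


print(complement("AGTC"))  # TCAG
```

Applying complement twice gives back the original strand, which is why either strand alone carries all the information.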

[Figure: base pairing between the two strands (A, G, T, C)]

Central Dogma of Biology

• RNA makes copies of DNA segments

• RNA describes sequences of amino acids

• Chains of amino acids make proteins

DNA → (Transcription) → RNA → (Translation) → Protein

Image from http://www.umich.edu/~protein/

Encoding Proteins

• There are 4 nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T) (replaced with uracil (U) in RNA)

• There are 20 different amino acids, and a stop marker (to separate proteins)

• How many nucleotides are needed to encode one amino acid?

with 2, could encode 16 things: 4 * 4
with 3, could encode 64 things: 4 * 4 * 4
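
The counting argument can be checked directly. The sketch below (function name is illustrative) finds the smallest sequence length that can distinguish 21 things (20 amino acids plus the stop marker) with a 4-letter alphabet:

```python
def nucleotides_needed(things: int, alphabet: int = 4) -> int:
    """Smallest k such that alphabet**k distinct sequences cover `things`."""
    k = 0
    while alphabet ** k < things:
        k += 1
    return k


print(nucleotides_needed(21))  # 3, since 4*4 = 16 < 21 <= 64 = 4*4*4
```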

Codons

• Three nucleotides encode an amino acid

• But, there are only 20 amino acids, so there may be several different ways to encode the same one

From http://web.mit.edu/esgbio/www/dogma/dogma.html
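
To make the codon idea concrete, here is a small Python sketch of transcription and translation. It is a toy: the table holds only a handful of entries from the standard genetic code, the function names are made up, and transcription is simplified to replacing T with U on the coding strand.

```python
# A small subset of the standard codon table (RNA codons).
# Note that several codons map to the same amino acid.
CODONS = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UGG": "Trp",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def transcribe(dna: str) -> str:
    """Simplified transcription: copy the coding strand, T becomes U."""
    return dna.replace("T", "U")


def translate(rna: str) -> list:
    """Read the RNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODONS[rna[i:i + 3]]
        if amino == "STOP":
            break
        protein.append(amino)
    return protein


rna = transcribe("ATGTTTGGCTGGTAA")
print(rna)             # AUGUUUGGCUGGUAA
print(translate(rna))  # ['Met', 'Phe', 'Gly', 'Trp']
```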

Shortest (Known) Life Program

• Nanoarchaeum equitans
– 490,885 bases (522 genes) = 490,885 * ¼ * 21/64 = 40,268 bytes
– Parasite: no metabolic capacity, must steal from host
– Complete components for information processing: transcription, replication, enzymes for DNA repair

• Size of compiled C++ “Hello World”:
– Windows (bcc32): 112,640 bytes
– Linux (g++): 11,358 bytes

http://www.mediscover.net/Extremophiles.cfm
KO Stetter and Dr Rachel Reinhard

How Big is the Make-a-Human Program?

• 3 Billion Base Pairs
– Each nucleotide is 2 bits (4 possibilities)
– 3 B pairs * 1 byte/4 pairs = 750 MB

• Every sequence of 3 base pairs encodes one of 20 amino acids (or stop codon)
– 21 possible meanings, but 4^3 = 64 possible codons
– So, really only 750MB * (21/64) ~ 250 MB

• Most of it (> 95%) is probably junk
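
The arithmetic on this slide, and the earlier Nanoarchaeum equitans figure, can be reproduced with a short Python sketch. The function name is made up, and the 21/64 scaling is the slide's own estimate, reproduced here as is.

```python
def slide_estimate_bytes(bases: int) -> float:
    """Genome size estimate as computed on the slides."""
    raw_bytes = bases * 2 / 8     # 2 bits per base, 8 bits per byte
    return raw_bytes * 21 / 64    # 21 meanings out of 64 possible codons


human = slide_estimate_bytes(3_000_000_000)
nano = slide_estimate_bytes(490_885)

print(round(human / 1e6))  # 246 (MB), which the slide rounds to ~250 MB
print(round(nano))         # 40268 (bytes)
```

A stricter information-theoretic count would scale each 6-bit codon by log2(21)/6, roughly 0.73, rather than 21/64; the code follows the slide's simpler estimate.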

1 CD ~ 650 MB

People are almost all the Same

• Genetic code for 2 humans differs in only 2.1 million bases
– 4 million bits = 0.5 MB

• How big is 0.5MB?
– 1/3 of a floppy disk
– ~22 times the size of the PS6 adventure game code

Is DNA Really a Programming Language?

Stuff Programming Languages are Made Of

• Primitives
– codons (sequence of 3 nucleotides that encodes an amino acid)

• Means of Combination
– ?? Morphogenesis? Not well understood (by anyone).

• Means of Abstraction
– DNA itself – separate proteins from their encoding
– Genes – group DNA by function (sort of)
– Chromosomes – package Genes together
– Organisms – packages for reproducing Genes

This is where most of the expressiveness comes from!

Jacob and Monod, 1959

• Not so simple: cells in an organism have the same DNA, but do different things
– Structural genes: make proteins that make us
– Regulator genes: control rate of transcription of other genes

The genome contains not only a series of blue-prints, but a coordinated program of protein synthesis and the means for controlling its execution.
François Jacob and Jacques Monod, 1961

Complexity

Molecular map of colon cancer cell from http://www.gnsbiotech.com/applications.shtml

Split Genes

• Richard Roberts and Phillip Sharp, 1977

• Not so simple – genome is spaghetti code (exons) with lots of noops/comments (introns)

• Exons can be spliced together in different ways after transcription (before translation)

• Possible to produce 100s of different proteins from one gene
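
A toy model shows how a few exons can yield many products. The Python sketch below assumes, purely for illustration, that any non-empty subset of exons may be kept as long as their order is preserved; real splicing is regulated and far more constrained. The function name and exon strings are made up.

```python
from itertools import combinations


def splice_variants(exons):
    """All ways to keep a non-empty subset of exons in their original order."""
    variants = []
    for count in range(1, len(exons) + 1):
        # combinations() preserves the input order of the exons
        for chosen in combinations(exons, count):
            variants.append("".join(chosen))
    return variants


variants = splice_variants(["AUG", "GGC", "UUU"])
print(len(variants))  # 7, i.e. 2**3 - 1
print(variants[3])    # AUGGGC
```

Under this assumption a gene with 8 exons already allows 255 variants, which is in line with the "100s of different proteins" on the slide.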

Most Important Science/Technology Races

1930-40s: Decryption, Nazis vs. British
Winner: British
Reason: Bletchley Park had computers (and Alan Turing), Nazis didn’t

1940s: Atomic Bomb, Nazis vs. US
Winner: US
Reason: Heisenberg miscalculated, US had better physicists, computers, resources

1960s: Moon Landing, Soviet Union vs. US
Winner: US
Reason: Many, better computing was a big one

1990s-2001: Sequencing Human Genome

Human Genome Race

Francis Collins (Director of public National Center for Human Genome Research) (Picture from UVa Graduation 2001)
• UVa CLAS 1970
• Yale PhD
• Tenured Professor at U. Michigan

vs.

Craig Venter (President of Celera Genomics)
• San Mateo College
• Court-martialed
• Denied tenure at SUNY Buffalo

Reading the Genome

Whitehead Institute, MIT

Gene Reading Machines

• One read: about 700 base pairs

• But…don’t know where they are on the chromosome

Actual Genome: AGGCATACCAGAATACCCGTGATCCAGAATAAGC

Read 1: ACCAGAATACC

Read 2: TCCAGAATAA

Read 3: TACCCGTGATCCA
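
Each read is a substring of the genome at an offset the machine does not report. If the genome were already known, recovering the offsets would be trivial, as this Python sketch shows (variable names are illustrative). Assembly is hard precisely because the genome is what we are trying to find.

```python
genome = "AGGCATACCAGAATACCCGTGATCCAGAATAAGC"
reads = {
    "Read 1": "ACCAGAATACC",
    "Read 2": "TCCAGAATAA",
    "Read 3": "TACCCGTGATCCA",
}

# str.find returns the first index where the read occurs, or -1 if absent.
positions = {name: genome.find(read) for name, read in reads.items()}
print(positions)  # {'Read 1': 6, 'Read 2': 22, 'Read 3': 13}
```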

Genome Assembly

Input: Genome fragments (but without knowing where they are from)

Output: The full genome

Read 1: ACCAGAATACC

Read 2: TCCAGAATAA

Read 3: TACCCGTGATCCA

Genome Assembly

Input: Genome fragments (but without knowing where they are from)

Output: The smallest genome sequence such that all the fragments are substrings.

Read 1: ACCAGAATACC

Read 2: TCCAGAATAA

Read 3: TACCCGTGATCCA

Common Superstring

Input: A set of n substrings and a maximum length k.

Output: A string that contains all the substrings with total length at most k, or no if no such string exists.

ACCAGAATACC

TCCAGAATAA

TACCCGTGATCCA

k = 26:

ACCAGAATACC
       TACCCGTGATCCA
                TCCAGAATAA
ACCAGAATACCCGTGATCCAGAATAA

Common Superstring

Input: A set of n substrings and a maximum length k.

Output: A string that contains all the substrings with total length at most k, or no if no such string exists.

ACCAGAATACC

TCCAGAATAA

TACCCGTGATCCA

k = 25: Not possible
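
For three fragments the problem can be settled by brute force. The Python sketch below is an illustration, not the method used in the lecture: it tries every ordering of the pieces and merges neighbours with their largest overlap, which takes n! orderings and is only feasible for tiny inputs. All function names are made up, and None stands in for the "no" answer.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def merge_in_order(pieces) -> str:
    """Concatenate the pieces left to right, sharing overlapping ends."""
    result = pieces[0]
    for piece in pieces[1:]:
        result += piece[overlap(result, piece):]
    return result


def common_superstring(pieces, k):
    """Return a superstring of length <= k, or None if none is found."""
    pieces = list(dict.fromkeys(pieces))  # drop duplicates, keep order
    # A piece contained in another piece never needs its own slot.
    pieces = [p for p in pieces if not any(p != q and p in q for q in pieces)]
    if not pieces:
        return ""
    for order in permutations(pieces):
        candidate = merge_in_order(order)
        if len(candidate) <= k:
            return candidate
    return None


reads = ["ACCAGAATACC", "TCCAGAATAA", "TACCCGTGATCCA"]
print(common_superstring(reads, 26))  # ACCAGAATACCCGTGATCCAGAATAA
print(common_superstring(reads, 25))  # None
```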

Common Superstring

• In NP:
– Easy to verify a “yes” solution: just check the letters match up, and count the superstring length

• NP-Complete
– Similar to Smiley Puzzle!
– Could transform 3SAT into Common Superstring problem

Shortest Common Superstring

Input: A set of n substrings

Output: The shortest string that contains all the substrings.

ACCAGAATACC

TCCAGAATAA

TACCCGTGATCCA

Shortest superstring:

ACCAGAATACC
       TACCCGTGATCCA
                TCCAGAATAA
ACCAGAATACCCGTGATCCAGAATAA

Shortest Common Superstring

• Also is NP-Complete:

function scsuperstring (pieces)
    maxlen = sum of lengths of all pieces
    for k = 1 to maxlen step 1 do
        result = commonSuperstring (pieces, k)
        if (result is not "no")
            return result
    end for
end function

Human Genome

• 3 Billion base pairs

• 600-700 bases per read

• ~8X coverage required

• So, n ≈ 37 Million sequence fragments

• Celera used 27.2 Million reads (but could get more than 700 bases per read)

> (/ (* 8 (* 3 1000 1000 1000)) 650)
36923076 12/13
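
The same exact-fraction calculation can be written in Python with the standard fractions module. Variable names are illustrative, and 650 is taken as the midpoint of the 600-700 bases per read quoted above.

```python
from fractions import Fraction

genome_bases = 3 * 1000 * 1000 * 1000
coverage = 8
bases_per_read = 650  # midpoint of 600-700 bases per read

reads = Fraction(coverage * genome_bases, bases_per_read)
whole, remainder = divmod(reads.numerator, reads.denominator)
print(whole, Fraction(remainder, reads.denominator))  # 36923076 12/13
```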

Give up?

No way to solve an NP-Complete problem (best known solutions being O(2^n) for n ≈ 20 Million)

[Chart: 2^n for n = 2 to 128 on a log scale from 1 to 1E+30, compared with time since “Big Bang”]
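
A rough Python calculation shows how quickly 2^n outruns the age of the universe. The constants are assumptions for illustration only: a commonly cited age of about 13.8 billion years and a machine doing one step per nanosecond.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # assumption: commonly cited estimate
STEPS_PER_SECOND = 1e9          # assumption: one step per nanosecond


def largest_feasible_n() -> int:
    """Largest n such that 2**n steps fit in the time since the Big Bang."""
    budget = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR * STEPS_PER_SECOND
    n = 0
    while 2 ** (n + 1) <= budget:
        n += 1
    return n


print(largest_feasible_n())  # 88
```

Under these assumptions an O(2^n) algorithm runs out of time before n reaches 100, let alone tens of millions of fragments.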

Approaches

• Human Genome Project (Collins)
– Start by producing a genome map (using biology, chemistry, etc) to have a framework for knowing where the fragments should go

• Celera Solution (Venter)
– Approximate: we can’t guarantee finding the shortest possible, but we can develop clever algorithms that get close most of the time in O(n log n)
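
One well-known approximation idea is greedy merging: repeatedly join the two pieces with the largest overlap. The Python sketch below illustrates that idea only. It is not Celera's actual assembler, the function names are made up, and this naive version does quadratic work per merge rather than O(n log n).

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_superstring(pieces) -> str:
    """Merge the best-overlapping pair until one string remains."""
    pieces = list(dict.fromkeys(pieces))  # drop duplicates, keep order
    pieces = [p for p in pieces if not any(p != q and p in q for q in pieces)]
    while len(pieces) > 1:
        best = (-1, 0, 1)  # (overlap size, index of left, index of right)
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    size = overlap(a, b)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for idx, p in enumerate(pieces) if idx not in (i, j)]
        pieces.append(merged)
    return pieces[0] if pieces else ""


reads = ["ACCAGAATACC", "TCCAGAATAA", "TACCCGTGATCCA"]
result = greedy_superstring(reads)
print(result, len(result))  # ACCAGAATACCCGTGATCCAGAATAA 26
```

On the three example reads the greedy result happens to match the shortest superstring; in general it always returns a valid superstring but may not return the shortest one.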

Result: Draw

President Clinton announces Human Genome Sequence essentially complete (with Venter and Collins), June 26, 2000

But, Human Genome Project mostly adopted Venter’s approach.

So Why Haven’t We Cured Cancer Yet?

[Bar chart: 2004 Projected Budgets for NIH, NSF and DARPA, in $ Billions (axis from 0 to 30)]

Why Biologists Haven’t Done Much Useful with the Human Genome Yet

They are trying to debug highly concurrent, asynchronous, type-unsafe, multiple entry/exit, self-modifying programs that create programs that create programs running on an undocumented, unstable, environmentally-sensitive OS by looking at the bits (and just figuring out the shape of a protein is an NP-hard problem)

Charge: PS8 Due Monday

• Before 10:55am Monday:
– Submit a zip file of all your code using a form linked from the CS200 web site
– If you want to use a few PowerPoint slides in your presentation, you may submit those also

• You only have 3 or 5 minutes: use them wisely
– Figure out beforehand what you will do
– Recommend: one team member drive web browser, one (or two) talk
– Talk about what users should know about your website, not about how you built it (unless there is something especially interesting)