problems on strings and strands

16
Genome Revolution: COMPSCI 004G 4.1 Problems on Strings and Strands Find CGTC…CGTC, i.e., repeated sequence separated by specified number of nucleotides What about any four letter sequence repeated? What about any four letters repeated How do we do this? Find palindromic motif/sequence CTAGGATC or CTAGATC

Upload: adelle

Post on 20-Jan-2016

25 views

Category:

Documents


0 download

DESCRIPTION

Problems on Strings and Strands. Find CGTC…CGTC, i.e ., repeated sequence separated by specified number of nucleotides What about any four letter sequence repeated? What about any four letters repeated How do we do this? Find palindromic motif/sequence CTAGGATC or CTAGATC. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.1

Problems on Strings and Strands

Find CGTC…CGTC, i.e., repeated sequence separated by specified number of nucleotides What about any four letter sequence repeated?

What about any four letters repeated How do we do this?

Find palindromic motif/sequence CTAGGATC or CTAGATC

Page 2: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.2

What's a regular expression?

A way to specify a pattern of text A "machine" will recognize the pattern by matching it to a string

The matching process is very efficient, basically look at every character once (can we do better?)

Perl is heavily based on regex's, Java has a regular expression library (java.util.regex) We'll use it to do APT like things Used in English to find "words"

Page 3: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.3

Syntax for regular expressions

[aeiou]

a.*b.*c.*

[a]+

^a.*z$

(…)(\1)

Page 4: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.4

Tagged regexp

We identify part of a regexp by enclosing with parentheses (….)(xxxx)

We can refer back to the tagged part by an escaped number, e.g., \1 or \2 Entire regexp is usually represented by \0

“Efficient” querying mechanism Who pays for efficiency?

Page 5: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.5

Storing and Retrieving Values

Read words from a file and store them somewhere Dictionary of words to analyze in English All dimers in a DNA sequence All occurrences of TATA in … All words that have 'e' before 'I'

We don't want to print, we want to store, why? If we print… If we store …

Page 6: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.6

Collections of String, int, Sequences,…

We can create an array of String values, but we must know how big to make the array

String[] list = new String[20]; // store how many? for(int k=0; k < list.length; k++){ list[k] = "hello"; }

What if we're not sure how many objects to store? Create "big" array, and don't fill it, problems?

Create one array, when fill, create another

Page 7: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.7

Arrays that grow

// read file and store wordsint count = 0;String[] list = new String[20];while (sc.hasNext()){ String word = sc.next(); if (count < list.length){ list[count] = word; count++; } else { String[] copy = new String[count*2]; for(int k=0; k < count; k++){ copy[k] = list[k]; } list = copy; }}

Page 8: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.8

Ask not what you can do…

The ArrayList class is a growable array Add things to the end, it grows when needed

Useful when we don't know how big to make things

What's the cost of doubling? If it costs $0.01 to create a place for a string and $0.01 to copy a string, how much does it cost to start with an array of size 1 and then read and store 500 strings?

Page 9: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.9

What is Information? http://dictionary.com

Knowledge derived from study, experience, or instruction.

Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news.

A collection of facts or data: statistical information. The act of informing or the condition of being

informed; communication of knowledge: Safety instructions are provided for the information of our passengers.

Computer Science. Processed, stored, or transmitted data.

A numerical measure of the uncertainty of an experimental outcome.

Law. A formal accusation of a crime made by a public officer rather than by grand jury indictment.

Page 10: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.10

Data on Audio CD, Human Genome (see howstuffworks.com)

Sample 44.1 KHz, sampling 16 bits (2 bytes), and in stereo: 44,100*2*2 = 176,400 bytes/second We need some overhead on a CD, so we get approx.

• 747 Megabytes on one CD or 10Mb/minute, uncompressed

Human Genome is approx. 3.2 Billion base-pairs 1BP is either C, G, A, T, we need 2 bits for this 6.4 Billion bits is about 0.8 Billion bytes, call

it 1GByte• No annotations which are definitely needed

Lessons from this comparison?

Page 11: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.11

Craig Venter http://www.sfgate.com (September 19, 2004)

Venter is one of the most creative minds in biology, and one of the most entertaining. Supremely immodest, he surges forth like a force of nature. The possibility that this man might push over the brink into Frankenstein territory is part of the fascination.

"Is it possible that you don't know what you're doing?" I ask.

"Sure, it's possible, but people have said that before about me. And they've been wrong."

Page 12: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.12

The Scientist http://www.the-scientist.com/yr1995/july/tigr_950724.html

Noting your continuing interest in ethics and policy, what are your thoughts on the relationship between scientific and bioethical concerns surrounding genomic research?

First, I think the worst thing for science and society is to leave the, quote, ethical issues up to the professional ethicist, because while there are some very good ones that have their feet on the ground and things based in reality, a lot of discussions that I've heard at ethical meetings are totally hypothetical situations that have nothing to do with facts or reality or even probability.

I feel that the scientists doing the work - and I seem to have somewhat of a unique philosophy in this - and leading the discoveries should be some of the people most involved in the ethical discussions, not the ones sitting on the sideline, leaving it up to the professional ethicist.

Page 13: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.13

Sir John Sulston (Nobel 2002)

… I think the important purpose of science is to explore, discover and understand. I’m glad if I’ve been able to contribute a little to that process, and hugely grateful to all my colleagues, both here and elsewhere, for their achievements and for the fun. And I hope that we can apply our ever increasing knowledge wisely, for the good of all.

Page 14: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.14

From Wired Interview, June 2003

If The Common Thread can be said to have a villain, it's Craig Venter.

SULSTON: … If there's a villain, it's an economic system that allows individuals to control information that belongs to everyone. The genome sequence is a discovery, not an invention.

Could genetic patents be a necessary evil? Venter has called them the best way to get drug company researchers to tackle certain diseases.

Diseases are not neglected for lack of patients, but for a lack of markets. Malaria has a thicket of patients but no market, because the victims are mostly poor. Thus 90 percent of the world's disease burden receives 10 percent of the research effort. Commercial incentives offer no way of redressing this; global funds from governments and charities are necessary to do so.

Page 15: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.15

Craig Venter at Woods Hole http://www.mblwhoilibrary.

org/services/lecture_series/venter/index.html What is Woods Hole? Who is David Baltimore?

What is the role of computing and Computer Science?

What is the difference between shotgun techniques

What is the reference to sailboats about? Is this important? What kinds of jokes do scientists laugh at? What does it mean to interpret genetic code? What is cDNA, what is RNA?, what is DNA? ESTs?

Page 16: Problems on Strings and Strands

Genome Revolution: COMPSCI 004G 4.16

PNAS http://www.pnas.org/cgi/content/full/99/6/3712