bioinformaticsdouglas/classes/cs521/bioinformatics/bioinformatics2005.pdf–gene regulation –dna...

1

Bioinformatics

Joshua Gilkerson, Albert Kalim, Ka-him Leung, David Owen


2

What is Bioinformatics?

Bioinformatics: “The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied in molecular genetics and genomics.” (Dictionary.com)

Molecular genetics: “The branch of genetics that deals with the expression of genes by studying the DNA sequences of chromosomes.” (Dictionary.com)

3

What is Bioinformatics? (cont.)

Another definition of molecular genetics: “The branch of genetics that deals with hereditary transmission and variation on the molecular level.” (Dictionary.com)

Genomics: “A branch of biotechnology concerned with applying the techniques of genetics and molecular biology to the genetic mapping and DNA sequencing of sets of genes or the complete genomes of selected organisms using high-speed methods, with organizing the results in databases, and with applications of the data (as in medicine or biology).” (Dictionary.com)

4

How old is the discipline?

The answer to this one depends on which source you choose to read.

From T K Attwood and D J Parry-Smith's "Introduction to Bioinformatics", Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]: "The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data."

5

How old is the discipline? (cont.)

From Mark S. Boguski's article in the "Trends Guide to Bioinformatics", Elsevier, Trends Supplement 1998, p1: "The term "bioinformatics" is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing..."

6

Bioinformatic Research up to 2005

DNA sequence
Gene expression
Protein expression
Protein structure
Genome mapping
Metabolic networks
Regulatory networks
Trait mapping
Gene function analysis
Scientific literature

7

What remains to be done?

Comparative genomics
Description of mRNAs and proteins (identity and structure)
Functional analyses
Detailed understanding of development, regulation, and variation

8

The Human Genetic Code

9

Bioinformatics Activity: Where Is Bioinformatics Done?

The biggest and best source of bioinformatics links is the GenomeWeb at the Rosalind Franklin Centre for Genomics Research at the Genome Campus near Cambridge, United Kingdom.

Others: Research Centers, Sequencing Centers, and "Virtual" Centers (for example, consortia and communities).

10

Research Centers

Centro Nacional de Biotecnologia (CNB), Madrid, Spain.
Computational Biology and Informatics Laboratory at the University of Pennsylvania, Philadelphia, USA.
CIRB: Centro Interdipartimentale di Ricerche Biotecnologiche, Bologna, Italy.
Cold Spring Harbor Labs, New York, USA.
European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
Généthon, France.
GIRI: Genetic Information Research Institute, California, USA.
MRC Human Genetics Unit, Edinburgh, United Kingdom.
MRC Rosalind Franklin Centre for Genomics Research (RFCGR), Hinxton, United Kingdom.

11

Sequencing Centers

The Department of Genome Analysis at the Institute of Molecular Biotechnology, Jena, Germany.
The Australian Genome Research Facility, Australia.
Baylor College of Medicine, USA.
Michael Smith Genome Sciences Centre, Canada.

12

Virtual Centers

International Center for Cooperation in Bioinformatics network (ICCBnet): http://www.iccbnet.org/

Belgian EMBnet node: http://www.be.embnet.org/

13

Online Resources: What Bioinformatics Websites Are There?

Blogs
Information
Directories
Portals
Societies
Tools
Tutorials

14

Blogs

Bioinformatics.Org is a bioinformatics blog.

The Bio-Web (http://cellbiol.com/) links to online resources for molecular and cell biologists and covers current news in various biological and computational fields.

Genehack (http://genehack.org/) is one of the first bioinformatics blogs.

15

Information

The Australian National Genomic Information Service (ANGIS) is operated by the Australian Genomic Information Centre (http://www.angis.org.au/new/about/generalinfo.html#AGIC, currently at the University of Sydney) to offer software, databases, documentation, training and support for biologists.

"The University of Maryland AgNIC gateway (http://agnic.umd.edu/) is a guide to quality agricultural biotechnology information on the Internet."

16

Directories

Christy Hightower, Engineering Librarian at the Science and Engineering Library, University of California Santa Cruz, has already done this better than me.

Visit her excellent article (http://www.istl.org/istl/02-winter/internet.html) about bioinformatics Net resources in Issues in Science and Technology Librarianship.

17

Societies

Humberto Ortiz Zuazaga kindly introduced The International Society for Computational Biology (http://www.iscb.org/), which he points out "has links to programs of study and online courses in computational biology and to job postings".

18

Collection of Tools

Bioinformatics.Org hosts a collection of bioinformatics tools.

The Rosalind Franklin Centre's "GenomeWeb" (http://www.rfcgr.mrc.ac.uk/GenomeWeb/).

Now of historical interest only is the legendary "Pedro's Molecular Biology Search and Analysis Tools" (http://www.public.iastate.edu/~pedro/research_tools.html), which provides a collection of WWW links to information and services useful to molecular biologists.

19

Portals

Bioinformatics.Org is an international organization which promotes freedom and openness in the field of bioinformatics and is the root domain of a damned fine Website.

CCP11 (Collaborative Computational Project 11, http://www.rfcgr.mrc.ac.uk/CCP11/index.jsp) is another product of the UK's Genome Campus. CCP11 is funded by the BBSRC and is hosted at the MRC Rosalind Franklin Centre for Genomics Research (RFCGR), located on the Wellcome Trust Genome Campus, Cambridge.

Jennifer Steinbachs runs compbiology.org, which is a general computational biology site as well as a portal to her own work.

BioPlanet (http://www.bioplanet.com/index.php) is well worth visiting. It describes itself as "a not-for-profit site, funded with our resources, for [its users'] benefit."

ColorBasePair (http://www.colorbasepair.com/) is a densely packed portal with lots of bioinformatics links.

20

Genome Project

Ka-Him Leung

21

Genomics

Genome
– complete set of genetic instructions for making an organism

Genomics
– attempts to analyze or compare the entire genetic complement of a species

22

Genomic Issues

Genomic DNA is a linear sequence of 4 nucleotides (A, C, G, T).

DNA forms the double helix by pairing with its reverse complement (A-T, G-C).

Genomic DNA contains many genes, each of which is formed from one or more exons (stretches of genomic DNA), separated by introns.

A gene is copied into complementary RNA in a process called transcription (U substitutes for T).
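The base-pairing and transcription rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and `transcribe` assumes the common convention of writing the RNA as the coding strand with U in place of T.

```python
# Illustrative helpers; names are assumptions, not taken from the slides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # A pairs with T, G pairs with C


def reverse_complement(dna: str) -> str:
    """Return the strand that pairs with `dna`, read in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA matching the coding strand, with U substituted for T."""
    return coding_strand.replace("T", "U")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```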

23

Genomic Issues (cont.)

DNA sequencing is the process of determining the exact order of the 3 billion chemical building blocks (called bases and abbreviated A, T, C, and G) that make up the DNA of the 24 different human chromosomes.

In the human genome, about 3 billion bases are arranged along the chromosomes in a particular order for each unique individual.

One million bases (called a megabase and abbreviated Mb) of DNA sequence data is roughly equivalent to 1 megabyte of computer data storage space, at one byte per base. Since the human genome is 3 billion base pairs long, 3 gigabytes of computer data storage space are needed to store the entire genome.
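The storage estimate above is simple arithmetic, sketched here under the slide's assumption of one byte per base. The 2-bit figure is an added comparison, not from the slides: it assumes only the four letters A, C, G, T need encoding.

```python
# Back-of-the-envelope storage estimate for the human genome.
GENOME_BASES = 3_000_000_000   # about 3 billion bases, as stated above
BYTES_PER_GB = 1_000_000_000   # decimal gigabyte

# One text character (one byte) per base, as on the slide.
one_byte_per_base = GENOME_BASES * 1

# Assumption for comparison: 4 letters fit in 2 bits, so 4 bases per byte.
two_bits_per_base = GENOME_BASES * 2 // 8

print(one_byte_per_base / BYTES_PER_GB)   # 3.0
print(two_bits_per_base / BYTES_PER_GB)   # 0.75
```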

24

Different Genomics

Comparative Genomics: the management and analysis of the millions of data points that result from genomics.

Functional Genomics: ways of identifying gene functions and associations.

Structural Genomics: emphasizes high-throughput, whole-genome analysis.

25

History of Genome

1977
– First complete genome sequence for an organism is published
• ΦX174: 5,386 base pairs coding nine proteins (~5 kb)
1995
– First bacterial genome (Haemophilus influenzae) sequenced (1.8 Mb)
1996
– Saccharomyces cerevisiae genome sequenced (baker's yeast, 12.1 Mb)
1997
– E. coli genome sequenced (4.7 Mbp)
1999
– Sequence of first human chromosome (chromosome 22) completed
2000
– A. thaliana genome (flowering plant) (100 Mb)
– D. melanogaster genome (fruit fly) (180 Mb)
2001
– 10,000 full-length human cDNAs sequenced
2003
– Human genome sequence completed

26

Human Genome Project

The U.S. Human Genome Project was a 13-year effort coordinated by the Department of Energy and the National Institutes of Health.

It started in 1990, with the aim of completing the mapping and understanding of all the genes of human beings.

In June 2000, scientists completed the first working draft of the human genome.

A high-quality, "finished" full sequence was completed in April 2003.

27

Goals of HGP

– identify all the approximately 20,000-25,000 genes in human DNA,
– determine the sequences of the 3 billion chemical base pairs that make up human DNA,
– store this information in databases,
– improve tools for data analysis,
– transfer related technologies to the private sector, and
– address the ethical, legal, and social issues (ELSI) that may arise from the project.

28

DNA Sequencing Process

Mapping
– Identify set of clones that span region of genome to be sequenced
Library Creation
– Make sets of smaller clones from mapped clones
Template Preparation
– Purify DNA from smaller clones
– Set up and perform sequencing chemistries
Gel Electrophoresis
– Determine sequences from smaller clones
Pre-finishing and Finishing
– Specialty techniques to produce high-quality sequences
Data Editing and Annotation
– Quality assurance; verification; biological annotation
– Submission to public database

30

Future of HGP

HGP is the first step in understanding humans at the molecular level. Work is still ongoing to determine the function of many of the human genes.

What still needs to be done:
– Gene number, exact locations, and functions
– Gene regulation
– DNA sequence organization
– Chromosomal structure and organization
– Noncoding DNA types, amount, distribution, information content, and functions
– Coordination of gene expression, protein synthesis, and post-translational events
– Interaction of proteins in complex molecular machines
– Predicted vs. experimentally determined gene function
– Evolutionary conservation among organisms
– Protein conservation (structure and function)
– Proteomes (total protein content and function) in organisms

32

Sequence Alignment

Joshua Gilkerson

33

Sequence Alignment

In genomics, many situations arise in which sequences need to be compared or searched for similar sub-sequences.

Both of these tasks are aided by aligning the sequences to one another.

The two sequences are called the subject and the query.

34

Local vs. Global

Global alignment aligns the entire query to the entire subject.
Local alignment aligns a piece of one sequence to a piece of the other.
Which is used depends on the application.
Surprisingly, these are computationally equivalent.
Sometimes a mix of local and global is used, aligning the entire query sequence against any one part of the subject.

35

Example Alignments

Global Alignment

AGCTCGA--GATTGCTGGACATGCTGCTGCT
|  ||||  ||||||     |||| ||||||
A--TCGAGCGATTGC-----ATGCAGCTGCT

Local Alignment
– Same subject as above
– Query Sequence: GAGAT

AGCTCGAGATTGCTGGACATGCTGCTGCT
|| | |||||     || ||
AGAT GAGAT     GAGAT

36

Model for Alignment

The best alignment is the one, among all possible alignments, that minimizes the score.

Scoring is done pairwise at each position along the alignment.

Introducing a gap is more expensive than extending one already introduced (affine gap penalty).

37

Model for Alignment

Score = ∑ gap penalties + ∑ similarity weights
Gap penalty = open penalty + size × size penalty
Open penalty and size penalty are constants >= 0.
Similarity weight is zero for the same base, >= 0 for disparate bases.
BLOSUM similarity weights are most commonly used (for protein sequences).

38

Scoring Example

Same example as earlier, using:
– Gap opening penalty of 1
– Gap size penalty of 1
– Similarity weights of 1 for every mismatch (0 for a match)

AGCTCGA--GATTGCTGGACATGCTGCTGCT
|  ||||  ||||||     |||| ||||||
A--TCGAGCGATTGC-----ATGCAGCTGCT
0210000210000002111100001000000 = 13
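The scoring model can be checked mechanically. The sketch below is a minimal illustration, not a reference implementation: `alignment_score` and its parameter names are hypothetical, and it assumes a flat mismatch weight rather than a full similarity matrix. Each gap column costs the size penalty, and the first column of each gap additionally costs the open penalty.

```python
def alignment_score(top: str, bottom: str,
                    gap_open: int = 1, gap_size: int = 1,
                    mismatch: int = 1) -> int:
    """Score a finished alignment under the affine-gap model (lower is better)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    previous_gap_row = None  # which row held the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != previous_gap_row:
                score += gap_open      # a new gap starts here
            score += gap_size          # every gap column pays the size penalty
            previous_gap_row = gap_row
        else:
            if a != b:
                score += mismatch      # disparate bases
            previous_gap_row = None
    return score


top = "AGCTCGA--GATTGCTGGACATGCTGCTGCT"
bottom = "A--TCGAGCGATTGC-----ATGCAGCTGCT"
print(alignment_score(top, bottom))  # 13
```

The three gaps cost 3, 3, and 6, and the single T/A mismatch costs 1, which gives the total of 13 shown in the example.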

39

Needleman-Wunsch Algorithm

Sequences Q and S
Scoring matrix M, (len(Q)+1) x (len(S)+1), so that empty prefixes are included
Similarity matrix s
Gap length penalty g; opening penalty 0
M(i,j): score for the best alignment of the first i elements of Q and the first j elements of S
M(0,0) = 0, M(i,0) = i*g, M(0,j) = j*g
M(i,j) = minimum of
– M(i-1,j)+g,
– M(i,j-1)+g,
– M(i-1,j-1)+s(Q(i),S(j))

40

Needleman-Wunsch Example

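CAT vs TAG, g = 1

s (similarity weights; symmetric, lower triangle shown):

     G  T  C  A
  G  0
  T  0  0
  C  1  3  0
  A  1  1  1  0

M (first row and column initialized; origin at top right, rows labeled on the right):

  T  A  C
  3  2  1  0
  .  .  .  1  T
  .  .  .  2  A
  .  .  .  3  G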

41


CAT vs TAG, g = 1

M (completed; origin at top right, rows labeled on the right):

  T  A  C
  3  2  1  0
  2  2  2  1  T
  3  2  2  2  A
  2  3  3  3  G

42



43


Two equally good alignments:

-CAT        C-AT
  |    and    |
T-AG        -TAG
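The CAT vs TAG example can be reproduced with a short sketch of the recurrence. The function and variable names here are hypothetical. The weights are the values as read from the example's s table, so treat them as an assumption. Rows of `M` follow prefixes of TAG and columns follow prefixes of CAT, with the origin at `M[0][0]`.

```python
# Similarity weights as read from the example's s table (assumption).
WEIGHTS = {
    frozenset("TG"): 0, frozenset("CG"): 1, frozenset("CT"): 3,
    frozenset("AG"): 1, frozenset("AT"): 1, frozenset("AC"): 1,
}


def slide_weight(a: str, b: str) -> int:
    """Zero for identical bases, otherwise the tabulated weight."""
    return 0 if a == b else WEIGHTS[frozenset((a, b))]


def needleman_wunsch(q: str, s: str, weight=slide_weight, g: int = 1):
    """Fill the global-alignment matrix; M[i][j] scores s[:i] against q[:j]."""
    rows, cols = len(s) + 1, len(q) + 1
    M = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        M[0][j] = j * g                  # q prefix aligned entirely to gaps
    for i in range(1, rows):
        M[i][0] = i * g                  # s prefix aligned entirely to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            M[i][j] = min(
                M[i - 1][j] + g,         # gap in q
                M[i][j - 1] + g,         # gap in s
                M[i - 1][j - 1] + weight(q[j - 1], s[i - 1]),
            )
    return M


M = needleman_wunsch("CAT", "TAG")
for row in M:
    print(row)
print("best score:", M[-1][-1])  # best score: 2
```

The final cell holds 2, the cost of either alignment above: two gap columns at 1 each, an A/A match at 0, and a T/G pair weighted 0 in this table.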

44

Needleman-Wunsch

Runs in O(n^2) time.

Easily generalized to allow a gap opening penalty by using 3 copies of M: one for prefixes ending with a match, and one each for prefixes ending with a gap in each sequence.

Easily generalized to local alignment by letting M(i,j) be the best score for an alignment of some suffix of the first i elements of Q with some suffix of the first j elements of S. In practice, this means:
– The first row and column are filled with all zeroes instead of just the top-left-most position.
– The end of the alignment is at the globally minimal position, not the lower-left corner.
– The beginning is at the location where backtracking cannot continue.
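Under the minimizing scores used in these slides, the mixed local-global case (the entire query against any one part of the subject) is the easiest variant to sketch. The code below adapts the bullets above rather than implementing them literally: only the row for the empty query is zeroed, and the end is taken as the minimum of the last row. The name `fit_query` and the flat gap and mismatch costs are assumptions for illustration.

```python
def fit_query(query: str, subject: str, gap: int = 1, mismatch: int = 1):
    """Align all of `query` to the best-matching part of `subject`.

    Returns (best_cost, end), where `end` is the 1-based subject position
    at which the best placement of the query finishes.
    """
    # Empty query: starting anywhere in the subject is free.
    prev = [0] * (len(subject) + 1)
    for i, qc in enumerate(query, start=1):
        cur = [i * gap]                  # query prefix against an empty subject
        for j, sc in enumerate(subject, start=1):
            diagonal = prev[j - 1] + (0 if qc == sc else mismatch)
            cur.append(min(prev[j] + gap, cur[j - 1] + gap, diagonal))
        prev = cur
    # Ending anywhere in the subject is free: take the minimum of the last row.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


cost, end = fit_query("GAGAT", "AGCTCGAGATTGCTGGACATGCTGCTGCT")
print(cost, end)  # 0 10
```

For the earlier example, the query GAGAT matches the subject exactly at positions 6 to 10, so the best cost is 0 and the placement ends at position 10.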

45

Other Alignment Tools

The Basic Local Alignment Search Tool (BLAST) is probably the most widely used tool in genomics.
– Finds local alignments.
– Used on very large sequences (entire genomes).
Smith-Waterman Algorithm: adaptation of Needleman-Wunsch for local alignments.
FASTA package

46

The Importance ofBioinformatics and Summary

David Owen

47

The importance of bioinformatics

Traditionally, molecular biology research was done entirely in a laboratory.

But the genome projects have increased the amount of data enormously. Researchers therefore need computers to make sense of the vast amount of data.

48

Challenges

Intelligent and efficient storage of the massive data.
Easy and reliable access to the data.
Development of tools which allow the extraction of meaningful information.

The developer of the tool must also consider the following:

The user (biologist) might not be an expert with computers.
The tool must be able to provide access across the internet.

49

Processes

Three main processes a bioinformatics tool must handle:
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function

The information obtained from these processes allows us to better understand the biology of organisms.
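The first of these steps, DNA sequence to protein sequence, is mechanical enough to sketch. The example below assumes the standard genetic code and a coding-strand DNA input read in frame from its first base. `translate` and `CODON_TABLE` are illustrative names, and stop codons are marked `*`.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate coding-strand DNA codon by codon, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":            # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # MAW
```

The later steps (sequence to structure, structure to function) have no comparably simple rule, which is part of why they appear among the research areas.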

50

Computer Scientist vs. Biologist

Computer scientist:
– Logic
– Problem-solving
– Process-oriented
– Algorithmic
– Optimizing

Biologist:
– Knowledge gathering
– Experimentally focused
– Exceptions are as common as rules
– Describe work as a story
– Develop conclusions and models

Hence the need for communication between computer scientists and biologists.

51

Research Areas

Further research areas include:
Sequence alignment
Protein structure prediction
Prediction of gene expression
Protein-protein interactions
Modeling of evolution

52

Future of Bioinformatics

- Integration of a wide variety of data sources. E.g. combining GIS data (maps) and weather systems with crop health and genotype data allows us to predict successful outcomes of agricultural experiments.

- Large-scale comparative genomics. E.g. the development of tools that can do 10-way comparisons of genomes.

- Modeling and visualization of full networks of complex systems.

53

Ultimate Goal

Obtain a better understanding of the biology of organisms through the examination of biological information hidden in the vast amount of data we have.

This knowledge will allow us to improve our standard of life.

54

References

http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml
http://www.genome.gov/
http://bioinfo.mbb.yale.edu/course/projects/final-4/
http://www.dictionary.com
http://www.ebi.ac.uk/2can/bioinformatics/index.html
http://bioinformatics.ca/workshop_pages/bioinformatics/day1-files/1.0_intro_bffo_2005.pdf

55

References (cont.)

http://elegans.uky.edu/520/Lecture/index.html

http://bioinformatics.org/
