1
Bioinformatics
Joshua Gilkerson, Albert Kalim, Ka-him Leung, David Owen
2
What is Bioinformatics?
Bioinformatics: “The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied in molecular genetics and genomics.” (Dictionary.com)
Molecular genetics: “The branch of genetics that deals with the expression of genes by studying the DNA sequences of chromosomes.” (Dictionary.com)
3
What is Bioinformatics? (cont.)
Another definition of molecular genetics: “The branch of genetics that deals with hereditary transmission and variation on the molecular level.” (Dictionary.com)
Genomics: “A branch of biotechnology concerned with applying the techniques of genetics and molecular biology to the genetic mapping and DNA sequencing of sets of genes or the complete genomes of selected organisms using high-speed methods, with organizing the results in databases, and with applications of the data (as in medicine or biology).” (Dictionary.com)
4
How old is the discipline?
The answer to this one depends on which source you choose to read.
From T K Attwood and D J Parry-Smith's "Introduction to Bioinformatics", Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]: "The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data."
5
How old is the discipline? (cont.)
From Mark S. Boguski's article in the "Trends Guide to Bioinformatics", Elsevier, Trends Supplement 1998 p1: "The term "bioinformatics" is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing..."
6
Bioinformatic Research up to 2005
DNA sequence
Gene expression
Protein expression
Protein structure
Genome mapping
Metabolic networks
Regulatory networks
Trait mapping
Gene function analysis
Scientific literature
7
What remains to be done?
Comparative genomics
Description of mRNAs, proteins (identity and structure)
Functional analyses
Detailed understanding of development, regulation, variation
8
The Human Genetic Code
9
Bioinformatics Activity: Where Is Bioinformatics Done?
The biggest and best source of bioinformatics links is the GenomeWeb at the Rosalind Franklin Centre for Genomics Research at the Genome Campus near Cambridge, United Kingdom.
Others: Research Centers, Sequencing Centers, and "Virtual" Centers (for example consortia and communities).
10
Research Centers
Centro Nacional de Biotecnologia (CNB), Madrid, Spain.
Computational Biology and Informatics Laboratory at the University of Pennsylvania, Philadelphia, USA.
CIRB: Centro Interdipartimentale di Ricerche Biotecnologiche, Bologna, Italy.
Cold Spring Harbor Labs, New York, USA.
European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
Généthon, France.
GIRI: Genetic Information Research Institute, California, USA.
MRC Human Genetics Unit, Edinburgh, United Kingdom.
MRC Rosalind Franklin Centre for Genomics Research (RFCGR), Hinxton, United Kingdom.
11
Sequencing Centers
The Department of Genome Analysis at the Institute of Molecular Biotechnology, Jena, Germany.
The Australian Genome Research Facility, Australia.
Baylor College of Medicine, USA.
Michael Smith Genome Sciences Centre, Canada.
12
Virtual Centers
International Center for Cooperation in Bioinformatics network (ICCBnet): http://www.iccbnet.org/
Belgian EMBnet node: http://www.be.embnet.org/
13
Online Resources: What Bioinformatics Websites Are There?
Blogs
Information
Directories
Portals
Societies
Tools
Tutorials
14
Blogs
Bioinformatics.Org is a bioinformatics blog.
The Bio-Web (http://cellbiol.com/) links to resources online for molecular and cell biologists and covers current news in various biological/computational fields.
Genehack (http://genehack.org/) is one of the first bioinformatics blogs.
15
Information
The Australian National Genomic Information Service (ANGIS) is operated by the Australian Genomic Information Centre (http://www.angis.org.au/new/about/generalinfo.html#AGIC, currently at the University of Sydney) to offer software, databases, documentation, training and support for biologists.
"The University of Maryland AgNIC gateway (http://agnic.umd.edu/) is a guide to quality agricultural biotechnology information on the Internet."
16
Directories
Christy Hightower, Engineering Librarian at the Science and Engineering Library, University of California Santa Cruz, has already done this better than me.
Visit her excellent article (http://www.istl.org/istl/02-winter/internet.html) about bioinformatics Net resources in Issues in Science and Technology Librarianship.
17
Societies
Humberto Ortiz Zuazaga kindly introduced The International Society for Computational Biology (http://www.iscb.org/) which he points out "has links to programs of study and online courses in computational biology and to job postings".
18
Collection of Tools
Bioinformatics.Org hosts a collection of bioinformatics tools.
The Rosalind Franklin Center's "GenomeWeb" (http://www.rfcgr.mrc.ac.uk/GenomeWeb/).
Of historical interest only now is the legendary "Pedro's Molecular Biology Search and Analysis Tools" (http://www.public.iastate.edu/~pedro/research_tools.html), which provides a collection of WWW links to information and services useful to molecular biologists.
19
Portals
Bioinformatics.Org is an international organization which promotes freedom and openness in the field of bioinformatics and is the root domain of a damned fine Website.
CCP11 (Collaborative Computational Project 11, http://www.rfcgr.mrc.ac.uk/CCP11/index.jsp) is another product of the UK's Genome Campus. "CCP11 is funded by the BBSRC and is hosted at the MRC Rosalind Franklin Center for Genomics Research (RFCGR) located on the Wellcome Trust Genome Campus, Cambridge."
Jennifer Steinbachs runs compbiology.org, which is a general computational biology site as well as being a portal to her own work.
BioPlanet (http://www.bioplanet.com/index.php) is well worth visiting. It describes itself as "a not-for-profit site, funded with our resources, for [its users'] benefit."
ColorBasePair (http://www.colorbasepair.com/) is a densely packed portal with lots of bioinformatics links.
20
Genome Project
Ka-Him Leung
21
Genomics
Genome
– complete set of genetic instructions for making an organism
Genomics
– attempts to analyze or compare the entire genetic complement of a species
22
Genomic Issues
Genomic DNA is a linear sequence of 4 nucleotides (A, C, G, T)
DNA forms the double helix by pairing with its reverse complement (A-T, G-C)
Genomic DNA contains many genes, each of which is formed from one or more exons (stretches of genomic DNA), separated by introns
A gene is copied into complementary RNA in a process called transcription (U substitutes for T)
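The pairing and transcription rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `reverse_complement` and `transcribe` are hypothetical, and the input is assumed to contain only the uppercase letters A, C, G, T.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the strand that pairs with `dna`, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(template: str) -> str:
    """Copy a DNA template strand into complementary RNA (U replaces T)."""
    return reverse_complement(template).replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(transcribe("ATGC"))
```

Taking the reverse complement twice returns the original strand, which is a quick sanity check on the pairing table.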
23
Genomic Issues (cont.)
DNA sequencing is the process of determining the exact order of the 3 billion chemical building blocks (called bases and abbreviated A, T, C, and G) that make up the DNA of the 24 different human chromosomes.
In the human genome, about 3 billion bases are arranged along the chromosomes in a particular order for each unique individual.
One million bases (called a megabase and abbreviated Mb) of DNA sequence data is roughly equivalent to 1 megabyte of computer data storage space. Since the human genome is 3 billion base pairs long, about 3 gigabytes of computer data storage space are needed to store the entire genome.
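The storage estimate above assumes one byte (one text character) per base. A small sketch of the arithmetic follows; the helper `storage_bytes` is a hypothetical name, and the 2-bit packing is an added assumption (four letters fit in two bits), not something stated on the slide.

```python
GENOME_BASES = 3_000_000_000  # approximate human genome length quoted on the slide


def storage_bytes(n_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to store n_bases at the given bits per base, rounded up."""
    return (n_bases * bits_per_base + 7) // 8


# One 8-bit character per base, as in the slide's estimate.
one_char_per_base = storage_bytes(GENOME_BASES)
# Assumption: A, C, G, T packed into 2 bits each.
two_bits_per_base = storage_bytes(GENOME_BASES, bits_per_base=2)

if __name__ == "__main__":
    print(one_char_per_base / 1e9, "GB at one byte per base")
    print(two_bits_per_base / 1e9, "GB at two bits per base")
```

Under the packing assumption the same genome needs a quarter of the space, which is why the 3 gigabyte figure is best read as an upper estimate for plain-text storage.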
24
Different Genomics
Comparative Genomics: the management and analysis of the millions of data points that result from Genomics
Functional Genomics: ways of identifying gene functions and associations
Structural Genomics: emphasizes high-throughput, whole-genome analysis.
25
History of Genome
1980
– First complete genome sequence for an organism is published
• φX174 - 5,386 base pairs coding nine proteins (~5 Kb)
1995
– First bacterial genome (Haemophilus influenzae) sequenced (1.8 Mb)
1996
– Saccharomyces cerevisiae genome sequenced (baker's yeast, 12.1 Mb)
1997
– E. coli genome sequenced (4.7 Mbp)
1998
– Sequence of first human chromosome completed
2000
– A. thaliana genome (flower) (100 Mb)
– D. melanogaster genome (fruit fly) (180 Mb)
2001
– 10,000 full-length human cDNAs sequenced
2003
– Human genome sequence completed
26
Human Genome Project
The U.S. Human Genome Project was a 13-year effort coordinated by the Department of Energy and the National Institutes of Health.
It started in 1990, with the aim of completing the mapping and understanding of all the genes of human beings.
In June 2000, scientists completed the first working draft of the human genome.
A high-quality, "finished" full sequence was completed in April 2003.
27
Goals of HGP
– identify all the approximately 20,000-25,000 genes in human DNA,
– determine the sequences of the 3 billion chemical base pairs that make up human DNA,
– store this information in databases,
– improve tools for data analysis,
– transfer related technologies to the private sector, and
– address the ethical, legal, and social issues (ELSI) that may arise from the project.
28
DNA Sequencing Process
Mapping
– Identify set of clones that span region of genome to be sequenced
Library Creation
– Make sets of smaller clones from mapped clones
Template Preparation
– Purify DNA from smaller clones
– Set up and perform sequencing chemistries
Gel Electrophoresis
– Determine sequences from smaller clones
Pre-finishing and Finishing
– Specialty techniques to produce high quality sequences
Data editing and Annotation
– Quality assurance; Verification; Biological annotation
– Submission to public database
29
30
Future of HGP
HGP is the first step in understanding humans at the molecular level. Work is still ongoing to determine the function of many of the human genes.
What still needs to be done:
– Gene number, exact locations, and functions
– Gene regulation
– DNA sequence organization
– Chromosomal structure and organization
– Noncoding DNA types, amount, distribution, information content, and functions
– Coordination of gene expression, protein synthesis, and post-translational events
– Interaction of proteins in complex molecular machines
– Predicted vs. experimentally determined gene function
– Evolutionary conservation among organisms
– Protein conservation (structure and function)
– Proteomes (total protein content and function) in organisms
31
32
Sequence Alignment
Joshua Gilkerson
33
Sequence Alignment
In genomics, many situations arise when sequences need to be compared or searched for similar sub-sequences.
Both of these tasks are aided by aligning the sequences to one another.
The two sequences are called the subject and the query.
34
Local vs. Global
Global alignment aligns the entire query to the entire subject.
Local alignment aligns a piece of one sequence to a piece of the other.
Which is used depends on the application.
Surprisingly, these are computationally equivalent.
Sometimes a mix of local and global alignment is used, aligning the entire query sequence against any one part of the subject.
35
Example Alignments
Global Alignment
AGCTCGA--GATTGCTGGACATGCTGCTGCT
|  ||||  ||||||     |||| ||||||
A--TCGAGCGATTGC-----ATGCAGCTGCT
Local Alignment
– Same subject as above
– Query Sequence: GAGAT
AGCTCGAGATTGCTGGACATGCTGCTGCT
|| | |||||     || ||
AGAT GAGAT     GAGAT
36
Model for Alignment
The best alignment is the one chosen from all possible alignments that minimizes the score.
Scoring is done pairwise at each position along the alignment.
Introducing a gap is more expensive than extending one already introduced (affine gap penalty).
37
Model for Alignment
Score = ∑ gap penalties + ∑ similarity weights
Gap penalty = open penalty + size * size penalty
Open penalty and size penalty are constants >= 0.
Similarity weight is zero for the same base, >= 0 for disparate bases.
BLOSUM similarity weights are most commonly used.
38
Scoring Example
Same example as earlier
Using:
– Gap opening penalty of 1
– Gap size penalty of 1
– Similarity scores all 1
AGCTCGA--GATTGCTGGACATGCTGCTGCT
|  ||||  ||||||     |||| ||||||
A--TCGAGCGATTGC-----ATGCAGCTGCT
0210000210000002111100001000000 = 13
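The per-column scoring in this example can be reproduced with a short Python sketch. The function name `alignment_score` and its parameter names are hypothetical; the penalties default to the values used on the slide, and gaps are assumed to be written as "-".

```python
def alignment_score(top: str, bottom: str, gap_open: int = 1,
                    gap_size: int = 1, mismatch: int = 1) -> int:
    """Score a gapped pairwise alignment column by column (lower is better)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != prev_gap_row:
                score += gap_open  # charged once, when a new gap starts
            score += gap_size      # charged for every gap column
            prev_gap_row = gap_row
        else:
            if a != b:
                score += mismatch  # identical bases cost nothing
            prev_gap_row = None
    return score


if __name__ == "__main__":
    top = "AGCTCGA--GATTGCTGGACATGCTGCTGCT"
    bottom = "A--TCGAGCGATTGC-----ATGCAGCTGCT"
    print(alignment_score(top, bottom))
```

The three gaps (lengths 2, 2 and 5) contribute 3 + 3 + 6 and the single mismatch contributes 1, giving the total of 13 shown in the score row.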
39
Needleman-Wunsch Algorithm
Sequences Q and S
Scoring matrix M, of size (len(Q)+1) x (len(S)+1), including a row and column for the empty prefix
Similarity matrix s
Gap length penalty: g; gap opening penalty: 0
M(i,j): score for the best alignment of the first i elements of Q and the first j elements of S.
M(i,j) = minimum of
– M(i-1,j)+g,
– M(i,j-1)+g,
– M(i-1,j-1)+s(Q(i),S(j))
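The recurrence above translates almost directly into code. The following is a minimal sketch, not a reference implementation: `needleman_wunsch` and `unit_mismatch` are hypothetical names, the similarity function `s` is passed in as a callable, and the unit mismatch weight used in the demo is an assumption chosen for simplicity.

```python
from typing import Callable, List


def needleman_wunsch(q: str, s_seq: str, g: int,
                     s: Callable[[str, str], int]) -> List[List[int]]:
    """Fill M, where M[i][j] is the lowest score aligning q[:i] with s_seq[:j]."""
    rows, cols = len(q) + 1, len(s_seq) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):   # q[:i] aligned against nothing: i gap columns
        M[i][0] = i * g
    for j in range(1, cols):   # s_seq[:j] aligned against nothing: j gap columns
        M[0][j] = j * g
    for i in range(1, rows):
        for j in range(1, cols):
            M[i][j] = min(
                M[i - 1][j] + g,                               # gap in s_seq
                M[i][j - 1] + g,                               # gap in q
                M[i - 1][j - 1] + s(q[i - 1], s_seq[j - 1]),   # aligned pair
            )
    return M


def unit_mismatch(a: str, b: str) -> int:
    """Assumed similarity weights: 0 for identical bases, 1 otherwise."""
    return 0 if a == b else 1


if __name__ == "__main__":
    M = needleman_wunsch("CAT", "TAG", g=1, s=unit_mismatch)
    print(M[-1][-1])  # score of the best global alignment
```

The best global score is read from the last cell, and the alignment itself is recovered by backtracking from that cell through whichever of the three terms produced each minimum.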
40
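Needleman-Wunsch Example
CAT vs TAG, g=1

Similarity matrix s:
     G  T  C  A
  G  0
  T  0  0
  C  1  3  0
  A  1  1  1  0

Scoring matrix M (rows: prefixes of CAT; columns: prefixes of TAG), with the first row and column filled in:
     -  T  A  G
  -  0  1  2  3
  C  1
  A  2
  T  3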
41
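Needleman-Wunsch Example
CAT vs TAG, g=1

Similarity matrix s:
     G  T  C  A
  G  0
  T  0  0
  C  1  3  0
  A  1  1  1  0

Scoring matrix M (rows: prefixes of CAT; columns: prefixes of TAG), completely filled in:
     -  T  A  G
  -  0  1  2  3
  C  1  2  2  3
  A  2  2  2  3
  T  3  2  3  2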
42
43
Needleman-Wunsch Example
Two equally good alignments:
-CAT         C-AT
  |    and     |
T-AG         -TAG
44
Needleman-Wunsch
Runs in O(n^2) time.
Easily generalized to allow a gap opening penalty by using 3 copies of M: one for prefixes ending with a match, and one for prefixes ending with a gap in each sequence.
Easily generalized to local alignment by saying M(i,j) is the best score for an alignment of some suffix of the sequences ending at i and j. In practice, this means:
– The first row and column are filled with all zeroes instead of just the top-left-most position.
– The end of the alignment is at the globally minimal position, not the lower-right corner.
– The beginning is at the location where backtracking cannot continue.
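The "3 copies of M" idea for a gap opening penalty can be sketched as follows. This is one common way to arrange the three matrices and is offered only as an illustration under the minimizing score model used in these slides; the names `affine_global_score`, `X` and `Y` are hypothetical, and a single mismatch weight stands in for the similarity matrix s.

```python
import math


def affine_global_score(q: str, s_seq: str, gap_open: int, gap_size: int,
                        mismatch: int = 1):
    """Lowest global alignment score with gap cost gap_open + length * gap_size."""
    n, m = len(q), len(s_seq)
    INF = math.inf
    # M: alignments ending with an aligned pair
    # X: alignments ending with a gap in s_seq (a q base over '-')
    # Y: alignments ending with a gap in q ('-' over an s_seq base)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_size
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_size
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if q[i - 1] == s_seq[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap skips the opening penalty.
            X[i][j] = gap_size + min(M[i - 1][j] + gap_open, X[i - 1][j],
                                     Y[i - 1][j] + gap_open)
            Y[i][j] = gap_size + min(M[i][j - 1] + gap_open, Y[i][j - 1],
                                     X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(affine_global_score("ACGT", "AT", gap_open=1, gap_size=1))
```

With `gap_open=0` the three matrices collapse to the single-matrix recurrence, since opening and extending a gap then cost the same.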
45
Other Alignment Tools
The Basic Local Alignment Search Tool (BLAST) is probably the most widely used tool in genomics.
– Finds local alignments.
– Used on very large sequences (entire genomes)
Smith-Waterman Algorithm - Adaptation of Needleman-Wunsch for local alignments.
FASTA package
46
The Importance of Bioinformatics and Summary
David Owen
47
The importance of bioinformatics
Traditionally, molecular biology research was done entirely in a laboratory.
But the genome projects have increased the amount of data enormously. Researchers therefore need computers to make sense of this vast amount of data.
48
Challenges
Intelligent and efficient storage of the massive data.
Easy and reliable access to the data.
Development of tools which allow the extraction of meaningful information.
The developer of the tool must also consider the following:
The user (biologist) might not be an expert with computers.
The tool must be able to provide access across the internet.
49
Processes
Three main processes a bioinformatics tool must handle:
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function
The information obtained from these processes allows us to better understand the biology of organisms.
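The first of these processes, DNA sequence determining protein sequence, is the only one simple enough to sketch in a few lines. The example below uses the standard genetic code; `translate` and `CODON_TABLE` are hypothetical names, the input is assumed to be the coding strand read in frame from its first base, and "*" marks a stop codon.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_dna: str) -> str:
    """Translate coding-strand DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))
```

The later two steps, from protein sequence to structure and from structure to function, have no comparable lookup table, which is part of why they appear among the research areas listed for the field.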
50
Computer Scientist vs. Biologist
Computer scientist:
– Logic
– Problem-solving
– Process-oriented
– Algorithmic
– Optimizing
Biologist:
– Knowledge gathering
– Experimentally-focused
– Exceptions are as common as rules
– Describe work as a story
– Develop conclusions and models
Hence the need for communication between computer scientists and biologists.
51
Research Areas
Further research areas include:
Sequence alignment
Protein structure prediction
Prediction of gene expression
Protein-protein interactions
Modeling of evolution
52
Future of Bioinformatics
- Integration of a wide variety of data sources. E.g. combining GIS data (maps) and weather systems with crop health and genotype data allows us to predict successful outcomes of agricultural experiments.
- Large-scale comparative genomics. E.g. the development of tools that can do 10-way comparisons of genomes.
- Modeling and visualization of full networks of complex systems.
53
Ultimate Goal
Obtain a better understanding of the biology of organisms through the examination of biological information hidden in the vast amount of data we have.
This knowledge will allow us to improve our standard of life.
54
References
http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml
http://www.genome.gov/
http://bioinfo.mbb.yale.edu/course/projects/final-4/
http://www.dictionary.com
http://www.ebi.ac.uk/2can/bioinformatics/index.html
http://bioinformatics.ca/workshop_pages/bioinformatics/day1-files/1.0_intro_bffo_2005.pdf
55
References (cont.)
http://elegans.uky.edu/520/Lecture/index.html
http://bioinformatics.org/