Biology 162: Computational Genetics
Fall 2004
Todd Vision, Assistant Professor
Department of Biology, UNC Chapel Hill
Bioinformatics vs computational genetics
• Bioinformatics: The application of computing technology to molecular biology
• Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics
Course emphasis
• Data analysis in molecular genetics
• We will not cover
  – Developments in IT hardware
  – Analysis of protein structure
  – Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e. systems biology)
Prerequisites
• Bio 50: Molecular Biology and Genetics
  – Gene/protein structure and expression
  – Principles of inheritance
• Comp Sci 14: Introduction to Programming
  – Algorithms and their design
  – Fundamental programming skills
• Stat 31: Introduction to Statistics
  – Probability and distributions
  – Hypothesis testing and parameter estimation
Related courses at UNC
• Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio)
• Summer courses in
  – Computer Science
• Graduate courses in
  – Bioinformatics and Computational Biology
  – Biostatistics
  – School of Pharmacy
Readings
• Gibson and Muse, A Primer of Genome Science, Sinauer Associates.
  – Available in Student Bookstore
  – Primarily covers genomic technologies
  – Brief on computational/statistical aspects
• Supplemental papers
  – Handed out in class or posted on Blackboard
  – Includes
    • More detail on computational/statistical aspects
    • Papers which you will review for class assignments
https://blackboard.unc.edu
Computer labs / Problem sets
• Thursdays 3:30-4:30 in Wilson 132
• Assignments are due the following Tuesday
• Purpose:
  – Familiarity with genomic databases and tools
    • Functional and evolutionary sequence analysis
    • Gene expression analysis
    • Mapping of genomes and complex traits
  – Comfort with command-line tools and computing
  – Exercise of scientific reasoning and biological judgement
  – No programming required (but learn Perl anyway!)
Research paper
• Critical review of the computational challenges involved in assembly of the human genome
• Based on opposing articles from the main players in the drama
• Paper will be judged on
  – Understanding of content
  – Critical and synthetic reasoning
  – Clarity of scientific writing
Late policy
• Assignments are due at beginning of class on the due date
• Late assignments receive half-credit
• Exceptions can be made but require more than 24 hours' notice
Group work
• You are encouraged to work together on most assignments (some exceptions)
• What you turn in should be your own
  – Show your work
  – Be able to defend your answers
• Know and love the UNC Honor Code
  – http://honor.unc.edu
Exams
• Two midterms
• Final exam will be cumulative
• May include material from labs/problem sets, readings and lectures
• Most questions will be similar to those on lab/problem sets
• You will receive a study guide in advance
Grading
• 10 Labs/problem sets - 50% (5% each)
• Review paper - 10%
• Midterms - 20% (10% each)
• Final exam - 20%
• Final grades
  – No curve, point divisions at discretion of instructor
  – Different divisions for undergraduate/graduate students
Computer lab server: Biolinux
• All necessary analysis software is installed
• Dell PowerEdge server
  – Red Hat Linux operating system
  – 2 Xeon processors
  – 2 GB RAM
  – 60 GB disk space
• Requires an ONYEN for login
• Uses AFS file space
Connecting to Biolinux
• biolinux.bio.unc.edu (IP 152.2.66.25)
• Windows
  – Zip archive contains necessary connection software
• MacOSX
  – X11 for graphical sessions
  – Fugu for secure ftp
• Linux/Solaris/etc.
  – Should work as is
https://onyen.unc.edu
http://cilantro.bio.unc.edu/biolinux
Cretaceous Park?
• In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil.
• DNA was extracted
  – Care was taken to prevent contamination
• Specific regions were amplified
  – 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB
  – How would you design primers for dinosaur DNA?
  – All yielded products in mammals, birds and reptiles
  – Only one cytB pair yielded a product from the fossil
  – Negative controls did not reveal contamination
Cretaceous Park?
• One cytB fragment amplified
• 9 sequences obtained from two bone samples
  – Variability was present within and between the two samples; none were identical
• Consensus sequences used to search for homologs
  – Genbank (215,000 sequences)
  – BLAST
• Measured percent identity
• Closest matches were ~70% identical
  – Equidistant to mammals, birds, and reptiles
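As an illustration of the "percent identity" measure mentioned above, here is a minimal Python sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity only over columns where neither sequence has a gap; other conventions for the denominator exist. The function name and the toy sequences are made up for illustration and are not data from the study.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0  # columns where both sequences have a residue
    matches = 0   # columns where the residues are identical
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns (one possible convention)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical toy alignment, not real cytB data.
fossil = "ATGACC-AACATTCG"
reference = "ATGGCCTAACGTTCG"
print(round(percent_identity(fossil, reference), 1))
```

A single percent-identity number says nothing about which lineage a sequence belongs to, which is one reason the reanalyses described on the next slide used multiple alignment and phylogenetic methods instead.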
Cretaceous Park?
• One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians
• Other authors reanalyzed the data
  – Multiple alignment
  – Protein sequence scoring matrix
  – Phylogenetic analysis
• All concluded that the DNA was clearly mammalian, possibly human
• One group showed that similar sequences could be amplified from human nuclear DNA
Cretaceous Park?
• Three possibilities
  – Preparation of human nuclear DNA could have been contaminated by dinosaur DNA
  – Dinosaurs and humans might have hybridized during the Cretaceous
  – Dinosaur extracts were contaminated by human DNA
• Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs
• Lesson learned: naïve computational analysis can lead to very misguided conclusions!
Discussion question
• You are given the sequence of a new gene and asked to determine its function.
• How would you begin?
  – What ‘wet lab’ approaches are possible?
  – What ‘in silico’ approaches are possible?
  – What approaches might require both wet lab and in silico components?
Biological topics
• Sequence alignment and assembly
• Sequence homology searching
• Sequence evolution and phylogenetics
• Finding genes and other features
• Patterns of gene expression
• Genetic mapping
• Dissecting genetic diseases and quantitative traits
Computational topics
• Dynamic programming
• Regular expressions and suffix trees
• Markov chains
• Hidden Markov models and machine learning
• Techniques for clustering and classification
• Maximum likelihood and Bayesian statistics
• Graph traversal
Some informatics tools
• Genbank, Uniprot, and major sequence repositories
• InterPro and protein signature dBs
• Gene Ontology
• Model organism genome databases (SGD, FlyBase, Ensembl)
• A sampling of software programs
  – Chosen primarily for pedagogical utility
Genomics
• Genetics on lots of genes?
• Hypothesis-free science?
• Some technologies
• Enabled by
  – Robotics
  – Computers
Genome database examples
• Primary databases
  – Genbank/EMBL/DDBJ
• Secondary databases
  – Pfam (protein domains)
• Organism-specific
  – SGD (yeast genomics)
• Specialized dBs
  – OMIM (human genetic disorders)
• Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/
Growth of Genbank
http://www.expasy.org/cgi-bin/show_thumbnails.pl?2
First bacterial genome: 1995
• Haemophilus influenzae (TIGR)
  – 1.8 x 10^6 bp shotgun assembly
  – Required 9 months of computer time
• Now there are hundreds
  – 160 Bacterial
  – 19 Archaeal
  – 32 Eukaryotic
• Over a thousand projects ongoing
• And a bacterial genome takes only days to sequence and assemble
Tree of life
More protein families await
Other types of genomic data
• Spatiotemporal gene expression
• Alternative transcription
• Genetic knockout/overexpression phenotypes
• Genetic variability
  – Molecular polymorphism
• Phenotypic variation / disease
• Comparative data / molecular evolution
• Protein
  – Structure, including modifications
  – Interactions with other molecules
• Metabolic profiling, etc., etc.
Algorithmic/statistical innovations
• The most fundamental and heavily used application in the field is pairwise alignment
  – Smith-Waterman algorithm (1981)
    • Still too slow for general database search
  – BLAST (1990)
    • Made database search of 10^7-10^8 sequences feasible
    • Statistical ranking of each alignment
• Statistical methods in molecular evolution <25 yrs old
• Modern genetic mapping methods ~15 yrs old
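To make the pairwise alignment idea concrete, the following is a minimal Python sketch of Smith-Waterman local alignment by dynamic programming. It is a simplification: it uses a linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-2) are arbitrary assumptions chosen for illustration rather than values from the course or from any standard scoring matrix. The fill step visits every cell of a len(a) x len(b) table, which hints at why the exact method is too slow for searching a whole database.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (made up): the shared core "ACGT" should be recovered.
score, top, bottom = smith_waterman("TTTACGTAAA", "GGACGTCC")
print(score)
print(top)
print(bottom)
```

Heuristic tools such as BLAST avoid filling the full table for every database sequence, trading guaranteed optimality for speed.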
Things to review
• Chemical differences among amino acids
• Prokaryotic and eukaryotic gene structure
• The central dogma
• Anatomy of a typical protein
Reading for Thursday
• Gibson and Muse, Ch. 1, Genome Projects, pp. 1-58.