biology 162: computational genetics fall 2004 todd vision assistant professor department of biology,...

Biology 162: Computational Genetics

Fall 2004

Todd Vision, Assistant Professor

Department of Biology, UNC Chapel Hill

Bioinformatics vs computational genetics

• Bioinformatics: The application of computing technology to molecular biology

• Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics

Course emphasis

• Data analysis in molecular genetics

• We will not cover
  – Developments in IT hardware
  – Analysis of protein structure
  – Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e., systems biology)

Prerequisites

• Bio 50: Molecular Biology and Genetics
  – Gene/protein structure and expression
  – Principles of inheritance
• Comp Sci 14: Introduction to Programming
  – Algorithms and their design
  – Fundamental programming skills
• Stat 31: Introduction to Statistics
  – Probability and distributions
  – Hypothesis testing and parameter estimation

Related courses at UNC

• Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio)

• Summer courses in
  – Computer Science
• Graduate courses in
  – Bioinformatics and Computational Biology
  – Biostatistics
  – School of Pharmacy

Readings

• Gibson and Muse, A Primer of Genome Science, Sinauer Associates.
  – Available in Student Bookstore
  – Primarily covers genomic technologies
  – Brief on computational/statistical aspects
• Supplemental papers
  – Handed out in class or posted on Blackboard
  – Includes
    • More detail on computational/statistical aspects
    • Papers which you will review for class assignments

https://blackboard.unc.edu


Computer labs / Problem sets

• Thursdays 3:30-4:30 in Wilson 132
• Assignments are due following Tuesday
• Purpose:
  – Familiarity with genomic databases and tools
    • Functional and evolutionary sequence analysis
    • Gene expression analysis
    • Mapping of genomes and complex traits
  – Comfort with command-line tools and computing
  – Exercise of scientific reasoning and biological judgement
  – No programming required (but learn Perl anyway!)

Research paper

• Critical review of the computational challenges involved in assembly of the human genome

• Based on opposing articles from the main players in the drama

• Paper will be judged on
  – Understanding of content
  – Critical and synthetic reasoning
  – Clarity of scientific writing

Late policy

• Assignments are due at beginning of class on the due date

• Late assignments receive half-credit

• Exceptions can be made but require more than 24 hours' notice

Group work

• You are encouraged to work together on most assignments (some exceptions)

• What you turn in should be your own
  – Show your work
  – Be able to defend your answers
• Know and love the UNC Honor Code
  – http://honor.unc.edu

Exams

• Two midterms
• Final exam will be cumulative
• May include material from labs/problem sets, readings and lectures
• Most questions will be similar to those on lab/problem sets
• You will receive a study guide in advance

Grading

• 10 Labs/problem sets - 50% (5% each)
• Review paper - 10%
• Midterms - 20% (10% each)
• Final exam - 20%
• Final grades
  – No curve, point divisions at discretion of instructor
  – Different divisions for undergraduate/graduate students

Computer lab server: Biolinux

• All necessary analysis software is installed

• Dell PowerEdge server
  – Red Hat Linux operating system
  – 2 Xeon processors
  – 2 GB RAM
  – 60 GB disk space
• Requires an ONYEN for login
• Uses AFS file space

Connecting to Biolinux

• biolinux.bio.unc.edu (IP 152.2.66.25)
• Windows
  – Zip archive contains necessary connection software
• MacOSX
  – X11 for graphical sessions
  – Fugu for secure ftp
• Linux/Solaris/etc.
  – Should work as is

https://onyen.unc.edu



http://cilantro.bio.unc.edu/biolinux



Cretaceous Park?

• In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil.

• DNA was extracted
  – Care was taken to prevent contamination
• Specific regions were amplified
  – 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB
  – How would you design primers for dinosaur DNA?
  – All yielded products in mammals, birds and reptiles
  – Only one cytB pair yielded a product from the fossil
  – Negative controls did not reveal contamination

Cretaceous Park?

• One cytB fragment amplified
• 9 sequences obtained from two bone samples
  – Variability was present within and between the two samples; no two sequences were identical
• Consensus sequences used to search for homologs
  – Genbank (215,000 sequences)
  – BLAST
• Measured percent identity
• Closest matches were ~70% identical
  – Equidistant to mammals, birds, and reptiles
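The percent identity measure mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the original study: the function name `percent_identity` is hypothetical, the inputs are assumed to be two rows of an existing alignment (equal length, "-" for gaps), and the denominator is taken to be all aligned columns, which is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumptions (illustrative only): inputs are already aligned, '-' marks a gap,
    and columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # Keep every column except those that are gaps in both sequences.
    columns = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if not (x == "-" and y == "-")
    ]
    if not columns:
        return 0.0

    # A column counts as identical only if the residues match and are not gaps.
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Two made-up 10-column fragments that differ at two positions.
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAA"))  # 80.0
```

A single number like this says nothing about which lineage a sequence belongs to; a query that is about 70% identical to mammals, birds, and reptiles alike is exactly the situation where a raw identity score is uninformative.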

Cretaceous Park?

• One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians

• Other authors reanalyzed the data
  – Multiple alignment
  – Protein sequence scoring matrix
  – Phylogenetic analysis

• All concluded that the DNA was clearly mammalian, possibly human

• One group showed that similar sequences could be amplified from human nuclear DNA

Cretaceous Park?

• Three possibilities
  – Preparation of human nuclear DNA could have been contaminated by dinosaur DNA
  – Dinosaurs and humans might have hybridized during the Cretaceous
  – Dinosaur extracts were contaminated by human DNA

• Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs

• Lesson learned: naïve computational analysis can lead to very misguided conclusions!

Discussion question

• You are given the sequence of a new gene and asked to determine its function.

• How would you begin?
  – What ‘wet lab’ approaches are possible?
  – What ‘in silico’ approaches are possible?
  – What approaches might require both wet lab and in silico components?

Biological topics

• Sequence alignment and assembly
• Sequence homology searching
• Sequence evolution and phylogenetics
• Finding genes and other features
• Patterns of gene expression
• Genetic mapping
• Dissecting genetic diseases and quantitative traits

Computational topics

• Dynamic programming
• Regular expressions and suffix trees
• Markov chains
• Hidden Markov models and machine learning
• Techniques for clustering and classification
• Maximum likelihood and Bayesian statistics
• Graph traversal
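As a small taste of one topic on this list, the sketch below uses a regular expression to pick out ORF-like stretches in a DNA string: a start codon, any number of whole codons, then the first in-frame stop codon. It is an illustrative toy, not a gene finder; the function name `find_orfs` and the example sequence are made up, and only the forward strand is scanned.

```python
import re

# ATG, then as few whole codons as possible, then an in-frame stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_orfs(sequence: str) -> list[tuple[int, str]]:
    """Return (start_index, matched_text) for each non-overlapping ORF-like match.

    Assumption (illustrative only): forward strand only; matches nested inside
    an earlier match are not reported.
    """
    return [(m.start(), m.group()) for m in ORF_PATTERN.finditer(sequence.upper())]


if __name__ == "__main__":
    for start, orf in find_orfs("CCATGAAATGACC"):
        print(start, orf)  # 2 ATGAAATGA
```

The non-greedy `*?` matters here: it makes the match stop at the first in-frame stop codon rather than running on to the last one in the sequence.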

Some informatics tools

• Genbank, Uniprot, and major sequence repositories

• InterPro and protein signature dBs
• Gene Ontology
• Model organism genome databases (SGD, FlyBase, Ensembl)
• A sampling of software programs
  – Chosen primarily for pedagogical utility

Genomics

• Genetics on lots of genes?
• Hypothesis-free science?
• Some technologies
• Enabled by
  – Robotics
  – Computers

Genome database examples

• Primary databases
  – Genbank/EMBL/DDBJ
• Secondary databases
  – Pfam (protein domains)
• Organism-specific
  – SGD (yeast genomics)
• Specialized dBs
  – OMIM (human genetic disorders)
• Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/

Growth of Genbank





http://www.expasy.org/cgi-bin/show_thumbnails.pl?2

First bacterial genome: 1995

• Haemophilus influenzae (TIGR)
  – 1.8 x 10^6 bp shotgun assembly
  – Required 9 months of computer time
• Now there are hundreds
  – 160 Bacterial
  – 19 Archaeal
  – 32 Eukaryotic
• Over a thousand projects ongoing
• And a bacterial genome takes only days to sequence and assemble

Tree of life



More protein families await



Other types of genomic data

• Spatiotemporal gene expression
• Alternative transcription
• Genetic knockout/overexpression phenotypes
• Genetic variability
  – Molecular polymorphism
• Phenotypic variation / disease
• Comparative data / molecular evolution
• Protein
  – Structure, including modifications
  – Interactions with other molecules
• Metabolic profiling, etc., etc.

Algorithmic/statistical innovations

• The most fundamental and heavily used application in the field is pairwise alignment
  – Smith-Waterman algorithm (1981)
    • Still too slow for general database search
  – BLAST (1990)
    • Made database search of 10^7-10^8 sequences feasible
    • Statistical ranking of each alignment

• Statistical methods in molecular evolution <25 yrs old

• Modern genetic mapping methods ~15 yrs old
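To make the pairwise alignment idea concrete, here is a minimal sketch of Smith-Waterman local alignment scoring by dynamic programming. It is a simplification under stated assumptions: the function name `smith_waterman_score` is hypothetical, the match/mismatch/gap values are arbitrary illustrative choices, gaps use a simple linear penalty, and only the best score is returned (no traceback of the alignment itself).

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Assumption (illustrative only): scoring parameters are arbitrary defaults,
    not values taken from the course or from any particular tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The 0 lets an alignment restart anywhere, which is what makes it local.
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])

    return best


if __name__ == "__main__":
    # The shared core "ACG" gives 3 matches x 2 = 6.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

The table has len(a) x len(b) cells, so running this against every sequence in a large database costs time proportional to the query length times the total database size. That cost is why heuristic search tools became necessary.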

Things to review

• Chemical differences among amino acids

• Prokaryotic and eukaryotic gene structure

• The central dogma
• Anatomy of a typical protein

Reading for Thursday

• Gibson and Muse, Ch. 1, Genome Projects, pp. 1-58.
