CSCE555 Bioinformatics, Lecture 2. Meeting: MW 4:00PM-5:15PM, SWGN2A21. Instructor: Dr. Jianjun Hu. Course page: http://www.scigen.org/csce555. University of South Carolina, Department of Computer Science and Engineering, 2008. www.cse.sc.edu

Upload: britney-green

Post on 01-Jan-2016



Roadmap

DNA, Chromosomes, Genomes

Genome Sequencing and whole genomes

DNA Sequence Representation, Models

Sequence Retrieval, Manipulation

Basic Analysis and Questions of Genomes

Summary

Tools to Learn Concepts Quickly

Wikipedia.org

◦ Searching “Genome” brings up a lot of related information

◦ In Google, type “keywords wiki”

Google search tips

◦ Find info from university websites: Genome, site:edu

◦ Find info as PowerPoint files: Genome, tutorial, filetype:ppt

DNA

Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms.

Backbone: sugars and phosphate groups

DNA is a long polymer of simple units called nucleotides

Bases: A: adenine, C: cytosine, G: guanine, T: thymine

Microbial Genome: Clostridium sp. OhILAs

CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG
AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT
TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA
GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT
ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT
ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT
AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT
TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT
TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG
AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT
CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC
GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT
TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT
AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC
AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT
TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC
TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC
AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT

Complementary base pairing: A-T, C-G

Write a program to export the complementary sequence?
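One possible answer to the exercise, as a minimal sketch rather than the course's reference solution. It applies the A-T and C-G pairing rule base by base. The function names `complement` and `reverse_complement` are illustrative choices, and the reverse complement is included only because the opposite strand is usually reported in the reverse direction.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
# Lowercase letters are mapped too, so mixed-case input keeps its case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Return the complementary strand, read in the same direction."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    fragment = "CTGCTGTACTAGGATGCTGG"  # first 20 bases of the sequence above
    print(complement(fragment))
    print(reverse_complement(fragment))
```

Characters other than A, C, G and T are left unchanged by `str.translate`, which may or may not be what an assignment expects.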

Genome of organisms

The genome of an organism is a complete DNA sequence of one set of chromosomes

Sequencing: Basic Ideas

Current lab techniques can sequence small (say 700 base pairs) DNA pieces.

◦ Use restriction enzymes to cut DNA pieces

◦ Sort pieces of different sizes using gel electrophoresis and use the sorting to read them

Mapping and Walking

◦ Sequence one piece, get 700 letters, make a primer that allows you to read the next 700, and work sequentially down the clone

◦ Estimate for human genome sequencing using this method: 100 years

Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes

◦ Obtain random sequence reads from a genome

◦ Assemble them into contigs on the basis of sequence overlaps

Straightforward for simple genomes (with no or few repeat sequences): merge reads containing overlapping sequence

Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches

How Sequencing Works

Beckman CEQ 8000

Sequencing small DNA pieces

Use DNA cloning or PCR to make multiple copies.

Put them in 4 test tubes marked G, A, T and C.

In test tube G, use a base-specific reaction that cuts or terminates the copies at each G.

Do the same for the other test tubes.

Run gel electrophoresis separately on the content of each test tube.

The result is the gel table below.

Reading the table we get: G has lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9, 18; and C has lengths 3, 10, 17.

This gives us the sequence.

Gel table (lanes G, A, T, C): one band per fragment length, read from the shortest fragment (1) to the longest (19): G A C T T A G A T C A G G A A A C T G
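The reading step can be sketched in a few lines. This is an illustration under the slide's simplified model, where each lane reports the fragment lengths that end in that base and every length from 1 to N appears in exactly one lane. The function name `read_gel` is a hypothetical helper, not part of any library.

```python
def read_gel(fragment_lengths: dict[str, list[int]]) -> str:
    """Rebuild a sequence from the fragment lengths seen in each lane.

    A fragment of length n in lane X means position n of the sequence is X.
    """
    base_at = {}
    for base, lengths in fragment_lengths.items():
        for n in lengths:
            if n in base_at:
                raise ValueError(f"length {n} appears in two lanes")
            base_at[n] = base
    total = max(base_at)
    missing = [n for n in range(1, total + 1) if n not in base_at]
    if missing:
        raise ValueError(f"no band for lengths {missing}")
    return "".join(base_at[n] for n in range(1, total + 1))


# Fragment lengths exactly as listed on the slide.
lanes = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}
print(read_gel(lanes))  # GACTTAGATCAGGAAACTG
```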

Methods for very large scale sequencing

A hierarchical approach

◦ Map on a large scale (physical mapping), sequence specific clones whose position in the genome is known

Shotgun sequencing

◦ “Tear up” the genome and sequence random fragments until it is done

Sequence tagged connectors (STC)

◦ Sequence the ends of many clones and use this info to pick overlapping clones

“Shotgun” sequencing

Clone to sequence

Copy

Sub-clone

Sequence and “assemble”

….GTCTACCTGTACTGATCTAGC...

…. CCTGTACTGATCTAGCATTA...

…. GTACTGATCTAGCATTACG...
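The "assemble" step can be illustrated with the three reads shown on the slide. This is a deliberately simplified sketch: it assumes the reads are already listed left to right and error-free, and it uses only the bases visible on the slide (the reads there are truncated by ellipses). Real assemblers must also find the read order and cope with repeats and sequencing errors. The names `overlap_length` and `merge_in_order` are illustrative.

```python
def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads: list[str], min_len: int = 3) -> str:
    """Merge reads that are already ordered left to right into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read, min_len)
        if k == 0:
            raise ValueError(f"no overlap of at least {min_len} bases with {read!r}")
        contig += read[k:]  # append only the part that extends the contig
    return contig


# The visible parts of the three reads from the slide.
reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
print(merge_in_order(reads))  # GTCTACCTGTACTGATCTAGCATTACG
```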

Emerging Sequence Methods

Sequencing by Hybridization (SBH)

Mass Spectrometric Sequencing

Direct Visualization of Single DNA Molecules by Atomic Force Microscopy (AFM)

Single Molecule Sequencing Techniques

Single Nucleotide Cutting

Nanopore sequencing

Readout of Cellular Gene Expression

Whole Genomes of Species

Bacterial Genomes

Eukaryotic Genomes

Human Genome Project

Other Animal and Plant Genomes

Model Genomes

The genomes of more than 180 organisms have been sequenced since 1995

http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml

Sizes of Genomes

You will learn to download all these genomes onto your computer’s hard drive

Refer to Table 1.1 Page 2 of Intro to Comp Genomics book.

Roadmap

DNA, Chromosomes, Genomes

Genome Sequencing and whole genomes

DNA Sequence Representation, Models

Sequence Retrieval, Manipulation

Basic Analysis and Questions of Genomes

Summary

DNA Sequence Representation

DNA sequence: a string of letters with alphabet {A, C, G, T}

Protein sequence: a string of amino acids with alphabet {ARNDCEQGHILKMFPSTWYV}

◦ 20 standard amino acids

Genetic code:

Genetic Code: Codon

DNA (ATCG)

RNA (AUCG)

Three bases of DNA encode an amino acid
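To make the three-bases-per-amino-acid rule concrete, here is a small translation sketch. It assumes the standard genetic code, written in the conventional T, C, A, G codon order, and reads the coding strand of DNA directly (T in place of U). Stop codons are shown as `*`. The name `translate` and the use of `X` for unrecognized codons are choices made for this example.

```python
from itertools import product

# Standard genetic code; codons are enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...), with '*' marking the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, three bases at a time.

    Trailing bases that do not fill a codon are ignored; codons with
    unexpected characters are reported as 'X'.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))  # MA*
```

Because 64 codons map onto 20 amino acids plus stop, several codons share one amino acid; for example `CODON_TABLE["TTA"]` and `CODON_TABLE["CTG"]` are both `L`.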

Genetic Code with Degeneracy

Representation of Sequences

Single DNA sequence

◦ ATCCTTAAGGAAA

Multiple sequences with similarity

◦ Regular expression

◦ ATAAA

◦ ACAAAA

◦ ATAAAAAA

◦ A[TC]A+
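The pattern `A[TC]A+` reads as: an A, then either T or C, then one or more A. A quick check with Python's standard `re` module confirms that it covers the three example sequences. The extra non-matching strings are made up for contrast.

```python
import re

# A, then T or C, then one or more A.
MOTIF = re.compile(r"A[TC]A+")

examples = ["ATAAA", "ACAAAA", "ATAAAAAA"]  # sequences from the slide
counter_examples = ["AGAAA", "AT", "TTAAA"]  # hypothetical non-members

for seq in examples + counter_examples:
    verdict = "matches" if MOTIF.fullmatch(seq) else "does not match"
    print(f"{seq}: {verdict}")
```

`fullmatch` requires the whole string to fit the pattern; `MOTIF.finditer(genome)` would instead locate every occurrence inside a longer sequence.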

Representation of Sequences

Probabilistic model: position-specific scoring matrices (PSSM)
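The slide only names the model, so the following is one common way to build a PSSM, offered as a sketch and not necessarily the construction used in the lecture. It counts bases per column of a set of aligned sequences, adds a pseudocount, and stores log2 odds against a uniform background. The aligned sequences, the pseudocount of 1 and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Log-odds PSSM from equal-length aligned sequences (uniform background)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum of per-position log-odds scores for a sequence of matching length."""
    return sum(column[base] for column, base in zip(pssm, seq))


# Hypothetical aligned sites, loosely modelled on the A[TC]A+ examples.
sites = ["ATAAA", "ACAAA", "ATAAA", "ATAAA"]
pssm = build_pssm(sites)
print(score(pssm, "ATAAA"), score(pssm, "GGGGG"))
```

Unlike a regular expression, the matrix gives every candidate a graded score, so a sequence with C in the second position scores lower than one with T but is not rejected outright.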

Representation of Sequence: FASTA format

Text-based format for representing either nucleic acid sequences or peptide sequences

Allows for sequence names and comments to precede the sequences
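A FASTA record is a header line starting with `>` followed by one or more lines of sequence. The reader below is a minimal sketch of that layout. The record names in the sample text are invented, and `parse_fasta` is an illustrative helper rather than a library function.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}.

    The header is everything after '>' on the name line; sequence lines
    are joined until the next header.
    """
    parts: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {header: "".join(chunks) for header, chunks in parts.items()}


# Hypothetical two-record file; the sequences are taken from earlier slides.
sample = """>example_1 first bases of the Clostridium fragment
CTGCTGTACTAGG
ATGCTGG
>example_2
ATCCTTAAGGAAA
"""
for header, sequence in parse_fasta(sample).items():
    print(header, len(sequence))
```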

Roadmap

DNA, Chromosomes, Genomes

Genome Sequencing and whole genomes

DNA Sequence Representation, Models

Sequence Retrieval, Manipulation

Basic Analysis and Questions of Genomes

Summary

Sequence Retrieval, Manipulation

Where to download genome/sequence data

◦ Online databases: EMBL, GenBank

◦ Entrez cross-database search (life science search engine)

◦ Google

Example: Download H. influenzae Genome

First bacterial genome: H. influenzae, 1830 Kb

http://www.ncbi.nlm.nih.gov/sites/entrez

NC_007146

Haemophilus influenzae 86-028NP, complete genome

DNA; circular; Length: 1,914,490 nt

Replicon Type: chromosome

Created: 2005/06/27

Genome Information of H. influenzae

Download the Complete Genome Sequence in FASTA Format
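The slides do this through the Entrez web page. A scripted alternative, sketched here as an assumption rather than the lecture's method, is NCBI's E-utilities `efetch` endpoint, which returns a nucleotide record as FASTA text. The accession NC_007146 comes from the slide; the output file name and function names are made up for the example, and NCBI's current usage guidelines should be checked before running bulk downloads.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str) -> str:
    """Build an efetch URL that asks for a nucleotide record as FASTA text."""
    query = urlencode({
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def download_fasta(accession: str, path: str) -> None:
    """Fetch the record over the network and save it to `path`."""
    with urlopen(efetch_fasta_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    print(efetch_fasta_url("NC_007146"))
    # Uncomment to download (needs network access; file name is hypothetical):
    # download_fasta("NC_007146", "h_influenzae_86-028NP.fasta")
```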

Roadmap

DNA, Chromosomes, Genomes

Genome Sequencing and whole genomes

DNA Sequence Representation, Models

Sequence Retrieval, Manipulation

Basic Analysis and Questions of Genomes

Summary

Simple Questions and Analysis of Genome Sequence

Frequencies of bases A/C/G/T by simple counting

Sliding windows to check local density

K-mers: frequent/unusual words

◦ 2-mers: AT AG AC TA TG TC etc.

◦ 3-mers
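Base frequencies and k-mer counts both come down to counting. The sketch below uses `collections.Counter`; the helper names are illustrative, and k-mers are counted with overlap, so a sequence of length n contributes n - k + 1 words.

```python
from collections import Counter


def base_frequencies(seq: str) -> dict[str, float]:
    """Fraction of A, C, G and T among the A/C/G/T characters of `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in "ACGT"}


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


fragment = "CTGCTGTACTAGGATGCTGG"  # first 20 bases of the Clostridium fragment
print(base_frequencies(fragment))
print(kmer_counts(fragment, 2).most_common(3))
print(kmer_counts(fragment, 3).most_common(3))
```

Comparing observed k-mer counts with what the base frequencies alone would predict is one way to spot unusually frequent or rare words.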

Genomic landscape: GC content analysis

Page 627

The overall GC content of the human genome is 41%.

A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right.

Fig. 17.15, Page 628. Source: IHGSC (2001)

GC content of the human genome: mean 41%

Genomic landscape: CpG islands

Dinucleotides of CpG are under-represented in genomic DNA, occurring at one fifth the expected frequency.

CpG dinucleotides are often methylated on cytosine (and subsequently may be deaminated to thymine).

Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions.

Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression.

They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
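"One fifth the expected frequency" can be checked with an observed/expected ratio. A widely used definition, assumed here because the slides do not give a formula, takes the expected CpG count as (number of C × number of G) / sequence length. A ratio near 0.2 then corresponds to the under-representation described above, while CpG islands score much higher. The function name is illustrative.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected from C and G content.

    expected = (#C * #G) / length. Returns 0.0 when the sequence has no C
    or no G, since the expectation is then zero.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    expected = c_count * g_count / len(seq)
    return observed / expected


print(cpg_observed_expected("CCGG"))  # 1.0
print(cpg_observed_expected("CGCG"))  # 2.0
```

Applied in sliding windows along a chromosome, this ratio, together with GC content, is a common starting point for locating candidate CpG islands.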

Broad genomic landscape: CpG islands

Findings:

◦ 50,267 CpG islands in human genome

◦ 28,890 after masking repeats with RepeatMasker

◦ 5-15 CpG islands per megabase

◦ (about <40 genes per megabase)

Summary

DNA, Chromosome, Genome

Sequence models

Sequence database, retrieval

Whole genome sequence analysis

Slides Credits

Slides in this presentation are partially based on slides from the Internet.