csce555 bioinformatics lecture 2 meeting: mw 4:00pm-5:15pm swgn2a21 instructor: dr. jianjun hu...

Post on 01-Jan-2016


CSCE555 Bioinformatics

Lecture 2
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555

University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu

Roadmap

DNA, Chromosomes, Genomes

Genome Sequencing and whole genomes

DNA Sequence Representation, Models

Sequence Retrieval, Manipulation

Basic Analysis and Questions of Genomes

Summary

04/20/23 2

Tools to Learn Concepts Quickly

Wikipedia.org

◦ Searching for "Genome" brings up a lot of related information

◦ In Google, type "keywords wiki"

Google search tips

◦ Find information on university websites: Genome site:edu

◦ Find information as PowerPoint files: Genome tutorial filetype:ppt

DNA

Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms.

Backbone: sugars and phosphate groups

DNA is a long polymer of simple units called nucleotides

Bases: A: adenine, C: cytosine, G: guanine, T: thymine

Microbial Genome: Clostridium sp. OhILAs

CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG
AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT
TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA
GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT
ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT
ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT
AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT
TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT
TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG
AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT
CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC
GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT
TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT
AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC
AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT
TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC
TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC
AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT

Complementary base pairing: A pairs with T, and C pairs with G.

Exercise: write a program to output the complementary sequence.
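One possible answer to the exercise, as a minimal Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not part of the lecture; the reverse complement is included because the paired strand is conventionally read in the opposite direction.

```python
# Translation table for Watson-Crick pairing: A<->T, C<->G (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    fragment = "CTGCTGTACT"  # first 10 bases of the microbial genome slide
    print(complement(fragment))          # GACGACATGA
    print(reverse_complement(fragment))
```

Characters outside A, C, G, T (for example N) pass through unchanged in this sketch.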

Genome of organisms

The genome of an organism is a complete DNA sequence of one set of chromosomes

Sequencing: Basic Ideas

Current lab techniques can sequence small (say 700 base pairs) DNA pieces.

◦ Use restriction enzymes to cut DNA pieces

◦ Sort pieces of different sizes using gel electrophoresis and use the sorting to read them

Mapping and Walking

◦ Sequence one piece, get 700 letters, make a primer that allows you to read the next 700, and work sequentially down the clone

◦ Estimate for human genome sequencing using this method: 100 years

Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes

◦ Obtain random sequence reads from a genome

◦ Assemble them into contigs on the basis of sequence overlaps

Straightforward for simple genomes (with no or few repeat sequences): merge reads containing overlapping sequence

Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches

How Sequencing Works

Beckman CEQ 8000

Sequencing small DNA pieces

Use DNA cloning or PCR to make multiple copies.

Put the copies in 4 test tubes marked G, A, T and C.

In test tube G, use restriction enzymes that cut at G.

Do the same for the other test tubes.

Run gel electrophoresis separately on the contents of each test tube.

The results are shown in the table below.

Reading the table, G has fragment lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9, 18; and C has lengths 3, 10, 17.

Each length marks a position of that base, so sorting the lengths gives us the sequence.

Length  G  A  T  C
1       G
2          A
3                C
4             T
5             T
6          A
7       G
8          A
9             T
10               C
11         A
12      G
13      G
14         A
15         A
16         A
17               C
18            T
19      G
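The read-off from fragment lengths to sequence can be sketched in a few lines of Python. The function name `read_gel` and the dictionary layout are assumptions made for illustration; the lengths are the ones listed on the slide.

```python
def read_gel(fragment_lengths):
    """Rebuild a sequence from {base: [fragment lengths]} read off the gel.

    A fragment of length p in lane X means position p of the sequence is X.
    """
    position_to_base = {}
    for base, lengths in fragment_lengths.items():
        for length in lengths:
            if length in position_to_base:
                raise ValueError(f"position {length} assigned twice")
            position_to_base[length] = base
    n = max(position_to_base)
    missing = [p for p in range(1, n + 1) if p not in position_to_base]
    if missing:
        raise ValueError(f"no band for positions {missing}")
    return "".join(position_to_base[p] for p in range(1, n + 1))


# Lane contents as given on the slide.
lanes = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}
sequence = read_gel(lanes)
print(sequence)  # GACTTAGATCAGGAAACTG
```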

Methods for very large scale sequencing

A hierarchical approach

◦ Map on a large scale (physical mapping), sequence specific clones whose position in the genome is known

Shotgun sequencing

◦ "Tear up" the genome and sequence random fragments until it is done

Sequence tagged connectors (STC)

◦ Sequence the ends of many clones and use this info to pick overlapping clones

"Shotgun" sequencing

Clone to sequence

Copy

Sub-clone

Sequence and "assemble"

….GTCTACCTGTACTGATCTAGC...

…. CCTGTACTGATCTAGCATTA...

…. GTACTGATCTAGCATTACG...
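The "assemble" step can be illustrated with a toy greedy overlap merger in Python. This is a sketch under simplifying assumptions (error-free reads, same strand, no repeats), not how production assemblers work; the names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical. The three reads are the ones shown above with the leading and trailing dots removed.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
contigs = greedy_assemble(reads)
print(contigs)  # ['GTCTACCTGTACTGATCTAGCATTACG']
```

Repeat-rich genomes break this greedy strategy, because a repeated region produces equally good overlaps between reads that do not belong together.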

Emerging Sequence Methods

Sequencing by Hybridization (SBH)

Mass Spectrophotometric Sequencing

Direct Visualization of Single DNA Molecules by Atomic Force Microscopy (AFM)

Single Molecule Sequencing Techniques

Single Nucleotide Cutting

Nanopore Sequencing

Readout of Cellular Gene Expression

Whole Genomes of Species

Bacterial Genomes

Eukaryotic Genomes

Human Genome Project

Other Animal and Plant Genomes

Model Genomes

The genomes of more than 180 organisms have been sequenced since 1995

http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml

Sizes of Genomes

You will learn to download all these genomes onto your computer's hard drive

Refer to Table 1.1 Page 2 of Intro to Comp Genomics book.


DNA Sequence Representation

DNA sequence: a string of letters with alphabet {A, C, G, T}

Protein sequence: a string of amino acids with alphabet {ARNDCEQGHILKMFPSTWYV}

◦ 20 standard amino acids

Genetic code:

Genetic Code: Codon

DNA (ATCG)

RNA (AUCG)

Three bases of DNA (a codon) encode an amino acid

Genetic Code with Degeneracy
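Degeneracy means that several codons map to the same amino acid. The Python sketch below builds the standard genetic code from the usual compact encoding (bases in the order T, C, A, G) and counts codons per amino acid. The names `CODON_TABLE`, `translate`, and `degeneracy` are illustrative; `*` denotes a stop codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA (or RNA) string codon by codon; trailing bases are ignored."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Number of codons that encode each amino acid (or stop).
degeneracy = Counter(CODON_TABLE.values())

print(translate("ATGGCCTGGTAA"))  # MAW*
print(degeneracy["L"], degeneracy["M"], degeneracy["*"])  # 6 1 3
```

Leucine has six codons while methionine has only one, which is what degeneracy looks like in practice.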

Representation of Sequences

Single DNA sequence

◦ ATCCTTAAGGAAA

Multiple sequences with similarity

◦ Regular expression

◦ ATAAA

◦ ACAAAA

◦ ATAAAAAA

◦ A[TC]A+
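The pattern A[TC]A+ reads as: an A, then T or C, then one or more A. A short Python check with the standard `re` module confirms that it covers all three example sequences; the helper name `find_motif` and the test string are made up for illustration.

```python
import re

MOTIF = re.compile(r"A[TC]A+")

examples = ["ATAAA", "ACAAAA", "ATAAAAAA"]
# fullmatch requires the whole string to fit the pattern.
matches = [bool(MOTIF.fullmatch(s)) for s in examples]
print(matches)  # [True, True, True]


def find_motif(seq):
    """Return (start, matched text) for each non-overlapping hit in seq."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(seq)]


print(find_motif("GGATAAACCACAAG"))  # [(2, 'ATAAA'), (9, 'ACAA')]
```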

Representation of Sequences

Probabilistic model: position-specific scoring matrices (PSSM)
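A PSSM stores one score per base per position, typically the log-odds of the observed frequency against a background frequency. The Python sketch below is one common construction, assuming gap-free aligned sequences of equal length, a uniform background of 0.25, and a pseudocount of 1; the function names and the toy alignment are hypothetical.

```python
import math


def build_pssm(seqs, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Return a list of {base: log2-odds score}, one dict per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in seqs]
        scores = {}
        for base in alphabet:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, seq):
    """Sum the per-position scores of seq (must be as long as the PSSM)."""
    return sum(column[base] for column, base in zip(pssm, seq))


# Toy alignment, invented for illustration.
aligned = ["ATAAA", "ACAAA", "ATAAA", "ATAAA"]
pssm = build_pssm(aligned)
print(round(score(pssm, "ATAAA"), 2), round(score(pssm, "GGGGG"), 2))
```

Unlike a regular expression, which only accepts or rejects, the PSSM ranks candidate sites by how closely they resemble the alignment.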

Representation of Sequence: FASTA format

A text-based format for representing either nucleic acid sequences or peptide sequences; it allows for sequence names and comments to precede the sequences.
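In practice, a FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal Python parser might look like the sketch below; the function name `parse_fasta` and the sample records are invented for illustration, and libraries such as Biopython offer more robust readers.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header without the leading '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


sample = """>seq1 example record
ATGC
GGTA
>seq2
TTAA
"""
parsed = parse_fasta(sample)
print(parsed)  # {'seq1 example record': 'ATGCGGTA', 'seq2': 'TTAA'}
```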


Sequence Retrieval, Manipulation

Where to download genome/sequence data

◦ Online databases: EMBL, GenBank

◦ Entrez cross-database search (life science search engine)

◦ Google

Example: Download H. influenzae Genome

First bacterial genome: H. influenzae, 1830 Kb

http://www.ncbi.nlm.nih.gov/sites/entrez

NC_007146

Haemophilus influenzae 86-028NP, complete genome
DNA; circular; Length: 1,914,490 nt
Replicon Type: chromosome
Created: 2005/06/27

Genome Information of H. influenzae

Download the Complete Genome Sequence in FASTA Format
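The slides download the genome through the Entrez web pages. As an alternative, the same record can be requested programmatically through NCBI's E-utilities `efetch` endpoint. The Python sketch below is an assumption about how one might script it, not the procedure shown in the lecture; the function names and the output path are hypothetical, and the download itself needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="nuccore"):
    """Build an efetch URL that returns the record as plain-text FASTA."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def download_fasta(accession, path):
    """Fetch the FASTA record and write it to path (requires network access)."""
    with urlopen(fasta_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    print(fasta_url("NC_007146"))
    # download_fasta("NC_007146", "NC_007146.fasta")  # uncomment to download
```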


Simple Questions and Analysis of Genome Sequence

Frequencies of bases A/C/G/T by simple counting

Sliding windows to check local density

K-mers: frequent/unusual words

◦ 2-mers: AT AG AC TA TG TC etc.

◦ 3-mers
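Both base frequencies and k-mer counts come down to counting substrings. A minimal Python sketch using `collections.Counter` follows; the function names are illustrative and the input strings are toy examples.

```python
from collections import Counter


def base_frequencies(seq):
    """Fraction of A, C, G, T among the unambiguous bases of seq."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: counts[b] / total for b in "ACGT"}


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


print(base_frequencies("AACG"))   # {'A': 0.5, 'C': 0.25, 'G': 0.25, 'T': 0.0}
print(kmer_counts("ATATAT", 2))   # Counter({'AT': 3, 'TA': 2})
```

Comparing observed k-mer counts with what the single-base frequencies would predict is one way to spot unusually frequent or rare words.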

Page 627

Genomic landscape: GC content analysis

The overall GC content of the human genome is 41%.

A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right.

Fig. 17.15, Page 628. Source: IHGSC (2001)

GC content of the human genome: mean 41%
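A windowed GC profile like the one described can be computed as in the following Python sketch. The function names are assumptions; on a real genome the window would be 20,000 bases, while the toy call below uses a window of 4 to keep the output readable.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=None):
    """GC content of each full window; windows do not overlap unless step < window."""
    step = step or window
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("GGCCAATT"))         # 0.5
print(windowed_gc("GGCCAATT", 4))     # [1.0, 0.0]
print(windowed_gc("GGCCAATT", 4, 2))  # [1.0, 0.5, 0.0]
```

A histogram of the values returned by `windowed_gc` gives the distribution of GC content across windows.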

Genomic landscape: CpG islands

CpG dinucleotides are under-represented in genomic DNA, occurring at one fifth the expected frequency.

CpG dinucleotides are often methylated on cytosine (and may subsequently be deaminated to thymine).

Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions.

Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression.

They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
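Under-representation can be quantified with the observed/expected CpG ratio: the number of CG dinucleotides times the sequence length, divided by the product of the C count and the G count. The Python sketch below uses this widely used formula as an assumption, since the slides do not state how the expected frequency was computed; the function name is illustrative.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG possible without both bases
    return seq.count("CG") * n / (c_count * g_count)


print(cpg_observed_expected("CGCGCG"))  # 2.0
print(cpg_observed_expected("CCGG"))    # 1.0
print(cpg_observed_expected("CCCGGGAATT"))
```

A ratio near 0.2 over bulk genomic DNA would correspond to the "one fifth" figure quoted above, while CpG islands show a much higher ratio.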

Broad genomic landscape: CpG islands

Findings:

◦ 50,267 CpG islands in human genome

◦ 28,890 after masking repeats with RepeatMasker

◦ 5-15 CpG islands per megabase

◦ (about <40 genes per megabase)

Summary

DNA, Chromosome, Genome

Sequence models

Sequence database, retrieval

Whole genome sequence analysis

Slides Credits

Slides in this presentation are partially based on slides from the Internet.
