CSCE555 Bioinformatics
Lecture 2
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering, 2008. www.cse.sc.edu
Roadmap
DNA, Chromosomes, Genomes
Genome Sequencing and whole genomes
DNA Sequence Representation, Models
Sequence Retrieval, Manipulation
Basic Analysis and Questions of Genomes
Summary
Tools to Learn Concepts Quickly
Wikipedia.org
◦ Searching for “Genome” brings up a lot of related information
◦ In Google, type “keywords wiki”
Google search tips
◦ Find info from university websites: Genome site:edu
◦ Find info as PowerPoint files: Genome tutorial filetype:ppt
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms.
Backbone: sugars and phosphate groups
DNA is a long polymer of simple units called nucleotides
Bases
A: adenine, C: cytosine, G: guanine, T: thymine
Microbial Genome: Clostridium sp. OhILAs
CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT
Complementary Base Pairing: A-T, C-G
Exercise: write a program that outputs the complementary sequence.
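One possible answer to the exercise, as a minimal Python sketch. It assumes the input contains only the letters A, C, G and T (upper or lower case). The function names `complement` and `reverse_complement` are illustrative and do not come from the slides.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(seq)[::-1]

# First 12 bases of the Clostridium sequence shown earlier.
print(complement("CTGCTGTACTAG"))          # GACGACATGATC
print(reverse_complement("CTGCTGTACTAG"))  # CTAGTACAGCAG
```

The reverse complement is usually the more useful of the two, because the two strands of DNA run in opposite directions.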
Genome of organisms
The genome of an organism is a complete DNA sequence of one set of chromosomes
Sequencing: Basic Ideas
Current lab techniques can sequence small (say 700 base pairs) DNA pieces.
◦ Use restriction enzymes to cut DNA pieces
◦ Sort pieces of different sizes using gel electrophoresis and use the sorting to read them
Mapping and Walking
◦ Sequence one piece, get 700 letters, make a primer that allows you to read the next 700, and work sequentially down the clone
◦ Estimate for human genome sequencing using this method: 100 years
Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes
◦ Obtain random sequence reads from a genome
◦ Assemble them into contigs on the basis of sequence overlaps
Straightforward for simple genomes (with no or few repeat sequences)
◦ Merge reads containing overlapping sequence
Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches
How Sequencing Works
Beckman CEQ 8000
Sequencing small DNA pieces
Use DNA cloning or PCR to make multiple copies.
Put the copies in 4 test tubes marked G, A, T and C.
In test tube G, use a restriction enzyme that cuts at G.
Do the same for the other test tubes.
Run gel electrophoresis separately on the contents of each test tube.
The resulting bands are shown in the table below.
Reading the table, G has lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9, 18; and C has lengths 3, 10, 17.
Reading the bases in order of fragment length gives us the sequence: GACTTAGATCAGGAAACTG.
Length  G   A   T   C   Base
 1      --              G
 2          --          A
 3                  --  C
 4              --      T
 5              --      T
 6          --          A
 7      --              G
 8          --          A
 9              --      T
10                  --  C
11          --          A
12      --              G
13      --              G
14          --          A
15          --          A
16          --          A
17                  --  C
18              --      T
19      --              G
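The gel-reading step can be sketched in a few lines of Python. The fragment lengths per lane are the ones listed on the slide. The `read_gel` helper and the dictionary layout are illustrative assumptions, not part of the lecture.

```python
# Fragment lengths observed in each lane, as listed on the slide.
lanes = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}

def read_gel(lanes):
    """Order all bands by fragment length and read off the lane of each band."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[n] for n in sorted(by_length))

sequence = read_gel(lanes)
print(sequence)  # GACTTAGATCAGGAAACTG
```

Each fragment length appears in exactly one lane, so sorting the lengths and reading off the lane labels reproduces the sequence position by position.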
Methods for very large scale sequencing
A hierarchical approach
◦ Map on a large scale (physical mapping), sequence specific clones whose position in the genome is known
Shotgun sequencing
◦ “Tear up” the genome and sequence random fragments until it is done
Sequence tagged connectors (STC)
◦ Sequence the ends of many clones and use this info to pick overlapping clones
“Shotgun” sequencing
Clone to sequence
Copy
Sub-clone
Sequence and “assemble”
….GTCTACCTGTACTGATCTAGC...
…. CCTGTACTGATCTAGCATTA...
…. GTACTGATCTAGCATTACG...
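To make the “assemble” step concrete, here is a toy greedy overlap merger in Python. It is only a sketch: it treats the three visible read fragments above as complete, error-free reads from the same strand, and it uses a hypothetical minimum overlap of 3 bases. Real assemblers must also handle sequencing errors, both strands, and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs

reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
print(greedy_assemble(reads))  # ['GTCTACCTGTACTGATCTAGCATTACG']
```

Greedy merging works here because the overlaps are long and unambiguous. In repeat-rich genomes the largest overlap is often a false join, which is why shotgun assembly of complex genomes is harder.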
Emerging Sequencing Methods
Sequencing by Hybridization (SBH)
Mass spectrometric sequencing
Direct visualization of single DNA molecules by atomic force microscopy (AFM)
Single-molecule sequencing techniques
Single-nucleotide cutting
Nanopore sequencing
Readout of cellular gene expression
Whole Genomes of Species
Bacterial Genomes
Eukaryotic Genomes
Human Genome Project
Other Animal and Plant Genomes
Model Genomes
The genomes of more than 180 organisms have been sequenced since 1995
http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml
Sizes of Genomes
You will learn to download all these genomes to your computer’s hard drive
Refer to Table 1.1, page 2 of the Intro to Comp Genomics book.
DNA Sequence Representation
DNA sequence: a string of letters with alphabet {A, C, G, T}
Protein sequence: a string of amino acids with alphabet {ARNDCEQGHILKMFPSTWYV}
◦ 20 standard amino acids
Genetic code:
Genetic Code: Codon
DNA (ATCG)
RNA (AUCG)
Three bases of DNA encode an amino acid
Genetic Code with Degeneracy
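A short Python sketch of the standard genetic code shows both the codon-to-amino-acid mapping and its degeneracy. The 64-letter string lists the amino acids in the usual T, C, A, G codon order of the standard code, with `*` marking stop codons. The helper names `translate` and `codons_for` are illustrative assumptions.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order (DNA alphabet).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

def translate(dna):
    """Translate a DNA coding sequence codon by codon ('*' marks a stop)."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def codons_for(amino_acid):
    """List every codon that encodes the given one-letter amino acid."""
    by_aa = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        by_aa[aa].append(codon)
    return by_aa[amino_acid]

print(translate("ATGAAATAA"))  # MK*
print(codons_for("L"))         # six codons: leucine is highly degenerate
print(codons_for("W"))         # ['TGG']: tryptophan has a single codon
```

With 64 codons and only 20 amino acids plus stop, most amino acids are encoded by more than one codon, which is what degeneracy means here.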
Representation of Sequences
Single DNA sequence
◦ ATCCTTAAGGAAA
Multiple sequences with similarity
◦ Regular expression
◦ ATAAA
◦ ACAAAA
◦ ATAAAAAA
◦ A[TC]A+
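The regular expression from the slide can be checked directly with Python's `re` module. This sketch uses `fullmatch` so that the whole sequence must fit the pattern. The two non-matching strings are made-up counterexamples.

```python
import re

# A, then T or C, then one or more A's.
pattern = re.compile(r"A[TC]A+")

examples = ["ATAAA", "ACAAAA", "ATAAAAAA"]   # sequences from the slide
counterexamples = ["AGAAA", "AT"]            # hypothetical non-members

matches = [bool(pattern.fullmatch(s)) for s in examples]
non_matches = [bool(pattern.fullmatch(s)) for s in counterexamples]
print(matches)      # [True, True, True]
print(non_matches)  # [False, False]
```

A regular expression gives a yes/no answer for each sequence. It cannot say that one matching sequence is more typical than another, which motivates the probabilistic models that follow.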
Representation of Sequences
Probabilistic model: position-specific scoring matrices (PSSM)
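As a rough illustration of a PSSM, the Python sketch below builds a log-odds matrix from a handful of equal-length aligned sequences and scores new sequences against it. The training sequences, the pseudocount of 1, and the uniform background of 0.25 per base are all assumptions made for this example; the slides do not specify them.

```python
import math

def build_pssm(seqs, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Log2-odds score for each base at each column of an ungapped alignment."""
    pssm = []
    total = len(seqs) + pseudocount * len(alphabet)
    for pos in range(len(seqs[0])):
        column = [s[pos] for s in seqs]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in alphabet
        })
    return pssm

def score(pssm, seq):
    """Sum the per-position scores of seq (same length as the matrix)."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical aligned training set: mostly ATAAA, one ACAAA.
training = ["ATAAA", "ACAAA", "ATAAA", "ATAAA"]
pssm = build_pssm(training)

for candidate in ["ATAAA", "ACAAA", "GGGGG"]:
    print(candidate, round(score(pssm, candidate), 2))
```

Unlike a regular expression, the matrix ranks sequences: the common variant scores highest, the rarer variant scores lower, and an unrelated sequence scores negative.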
Representation of Sequence: FASTA format
A text-based format for representing either nucleic acid sequences or peptide sequences
It allows sequence names and comments to precede the sequences.
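A minimal FASTA reader in Python illustrates the format: a line starting with `>` carries the name and comments, and the following lines hold the sequence until the next `>`. The two records below are made-up examples, and `parse_fasta` is an illustrative helper rather than a library function.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record file held in memory.
example = io.StringIO(
    ">seq1 hypothetical example record\n"
    "CTGCTGTACTAGGATG\n"
    "CTGGTGGAGAGAGCTG\n"
    ">seq2 another hypothetical record\n"
    "ATCCTTAAGGAAA\n"
)
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

The same function works on a real file by passing `open("genome.fasta")` in place of the in-memory example, where the file name is a placeholder.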
Sequence Retrieval, Manipulation
Where to download genome/sequence data
◦ Online databases: EMBL, GenBank
◦ Entrez cross-database search (life science search engine)
◦ Google
Example: Download H. influenzae Genome
First bacterial genome: H. influenzae, 1830 Kb
http://www.ncbi.nlm.nih.gov/sites/entrez
NC_007146
Haemophilus influenzae 86-028NP, complete genome
DNA; circular; Length: 1,914,490 nt
Replicon Type: chromosome
Created: 2005/06/27
Genome Information of H. influenzae
Download the Complete Genome Sequence in FASTA Format
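The slides download the genome through the Entrez web page. As an alternative, the same record can in principle be fetched with a script through NCBI's E-utilities `efetch` endpoint. The Python sketch below is an assumption-laden illustration: it presumes network access, that accession NC_007146 still resolves in the `nuccore` database, and the helper names and output path are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accession):
    """Build an EFetch URL that returns the record as plain-text FASTA."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def download_fasta(accession, path):
    """Fetch the record over the network and save it to path."""
    with urlopen(efetch_fasta_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())

url = efetch_fasta_url("NC_007146")
print(url)
# To actually download (requires network access):
# download_fasta("NC_007146", "h_influenzae.fasta")
```

NCBI asks scripted clients to keep request rates low, so a loop over many accessions should pause between requests.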
Simple Questions and Analysis of Genome Sequence
Frequencies of bases A/C/G/T by simple counting
Sliding windows to check local density
K-mers: frequent/unusual words
◦ 2-mers: AT AG AC TA TG TC etc.
◦ 3-mers
Page 627
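Base frequencies and k-mer counts both come down to simple counting, as this Python sketch shows. It uses the short example sequence ATCCTTAAGGAAA from the earlier “Representation of Sequences” slide. The function names are illustrative.

```python
from collections import Counter

def base_frequencies(seq):
    """Fraction of each base A/C/G/T in the sequence."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}

def kmer_counts(seq, k):
    """Count every overlapping word of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "ATCCTTAAGGAAA"
freqs = base_frequencies(seq)
dimers = kmer_counts(seq, 2)

print(freqs)                  # A is the most frequent base (6 of 13)
print(dimers.most_common(3))  # 'AA' is the most frequent 2-mer (3 times)
```

On a whole genome the same counters reveal over- and under-represented words, such as the CpG depletion discussed later.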
Genomic landscape: GC content analysis
The overall GC content of the human genome is 41%.
A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right.
Fig. 17.15, page 628. Source: IHGSC (2001)
GC content of the human genome: mean 41%
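A sliding-window GC calculation can be sketched as follows in Python. The slide uses 20 kb windows on the human genome; the toy call below uses a window of 4 on a made-up 8-base string purely to keep the output readable. The `sliding_gc` name and its `step` parameter are assumptions for this example.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def sliding_gc(seq, window, step=None):
    """GC content of each full window; windows do not overlap by default."""
    step = step or window
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

profile = sliding_gc("GGCCAATT", window=4)
print(profile)  # [(0, 1.0), (4, 0.0)]
```

For a genome-scale profile, calling `sliding_gc(genome, window=20000)` and plotting a histogram of the second column would approximate the kind of plot the slide describes.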
Genomic landscape: CpG islands
CpG dinucleotides are under-represented in genomic DNA, occurring at one fifth the expected frequency.
CpG dinucleotides are often methylated on cytosine (which may subsequently be deaminated to thymine).
Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions.
Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression.
They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
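The “one fifth the expected frequency” statement refers to an observed/expected ratio. A common way to compute it, assumed here because the slides do not give a formula, is (number of CpG × sequence length) / (number of C × number of G). A small Python sketch:

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG is possible without both bases
    return seq.count("CG") * len(seq) / (c * g)

print(cpg_obs_exp("CGCG"))  # 2.0: CpG-rich toy sequence
print(cpg_obs_exp("GGCC"))  # 0.0: C and G present, but no CpG
```

A ratio near 1 means CpG occurs about as often as the base composition predicts, and a ratio near 0.2 corresponds to the genome-wide depletion described above. CpG island finders typically look for windows where this ratio and the GC content are both high, with thresholds that vary between tools.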
Broad genomic landscape: CpG islands
Findings:
◦ 50,267 CpG islands in human genome
◦ 28,890 after masking repeats with RepeatMasker
◦ 5-15 CpG islands per megabase
◦ (about <40 genes per megabase)
Summary
DNA, Chromosome, Genome
Sequence models
Sequence database, retrieval
Whole genome sequence analysis
Slides Credits
Slides in this presentation are partially based on slides from the Internet.