dna sequence analysis

DNA sequence analysis School B&I TCD Bioinformatics May 2010

Upload: kail

Post on 05-Jan-2016

Page 1: DNA sequence analysis

DNA sequence analysis

School B&I TCD Bioinformatics

May 2010

Page 2: DNA sequence analysis

A, T/U, C, G

• Simple code, lots of sequence

• Sequence analysis
– Computer intensive
  • BLAST homology searching
  • Gene/exon prediction
  • Multiple sequence alignment
  • Alignments in general
– “Trivial”

Page 3: DNA sequence analysis

Trivial

• Could be done by hand
– Computers
  • Quicker
  • More reliable

• Examples
– Translate DNA
– Restriction sites
– Synonymous codon usage
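Counting synonymous codon usage is a good example of a task that is easy by hand but tedious. A minimal sketch, assuming the input is a coding sequence already in frame; the function name and the example sequence are hypothetical:

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in a coding sequence read in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

# Hypothetical example: four alanine codons between a start and a stop codon.
counts = codon_counts("ATGGCTGCCGCTGCATAA")
for codon, n in sorted(counts.items()):
    print(codon, n)
```

Here GCT is used twice while GCC and GCA are each used once, which is the kind of bias a codon usage table summarises over a whole genome.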

Page 4: DNA sequence analysis

Sequence formats

• Fasta Format

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX

• Phylip Format

4 131
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT

• CLUSTAL W(1.4) multiple sequence alignment

IXI_234   TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235   TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236   TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237   TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT

• Interconvert: http://thr.cit.nih.gov/molbio/readseq/
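Of these formats, Fasta is the simplest to read programmatically: a ">" header line followed by one or more sequence lines. A minimal sketch of a reader, assuming well-formed input; the function name and the two example records are hypothetical, and a real project would more likely use an existing tool such as readseq or EMBOSS:

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Hypothetical two-record input.
example = """>seq1 hypothetical example record
ATGCCCGCA
TTTGAATAA
>seq2
GGATCC
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```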

Page 5: DNA sequence analysis

DNA sequence analysis

• Google EMBOSS
– A suite of programs with the same look & feel
– Does pretty much everything you need
– Can be installed locally

Page 6: DNA sequence analysis

Translation

• DNA is anti-parallel.
– One strand read 5’–3’ pairs with the complementary strand read 3’–5’
– Transcription and translation always proceed 5’–3’

• Six possible translations, three on each strand. The three forward frames:
– Frame 1: ATG CCC GCA TTT GAA [TAA]
– Frame 2: A TGC CCG CAT TTG AAT AA
– Frame 3: AT GCC CGC ATT [TGA] ATA A
• Stop codons are shown in brackets

Frameshift errors
Frameshift mutations

Page 7: DNA sequence analysis

Genetic code

The “Universal” Genetic Code.

Phe UUU   Ser UCU   Tyr UAU   Cys UGU
    UUC       UCC       UAC       UGC
Leu UUA       UCA   ter UAA   ter UGA
    UUG       UCG   ter UAG   Trp UGG

Leu CUU   Pro CCU   His CAU   Arg CGU
    CUC       CCC       CAC       CGC
    CUA       CCA   Gln CAA       CGA
    CUG       CCG       CAG       CGG

Ile AUU   Thr ACU   Asn AAU   Ser AGU
    AUC       ACC       AAC       AGC
    AUA       ACA   Lys AAA   Arg AGA
Met AUG       ACG       AAG       AGG

Val GUU   Ala GCU   Asp GAU   Gly GGU
    GUC       GCC       GAC       GGC
    GUA       GCA   Glu GAA       GGA
    GUG       GCG       GAG       GGG
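A minimal sketch of six-frame translation with the standard code tabulated above, written with T in place of U. The function names and the compact way of building the table are illustrative assumptions, not something from the slides; the test sequence is the one used on the Translation slide.

```python
# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna, frame=0):
    """Translate one frame (0, 1 or 2); '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def six_frames(dna):
    """Translate the three frames of each strand."""
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames

frames = six_frames("ATGCCCGCATTTGAATAA")
for name in sorted(frames):
    print(name, frames[name])
```

Frame +1 gives MPAFE* and frame +3 gives ARI*I, matching the two in-frame stop codons in the example sequence.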

 

Page 8: DNA sequence analysis

Exceptions to the code

• #1: Yeast Mitochondrial Code: CUN=T AUA=M UGA=W
• #2: Mitochondrial Code of Vertebrates: AGR=* AUA=M UGA=W
• #3: Mitochondrial Code of Filamentous fungi: UGA=W
• #4: Mitochondrial Code of Insects and platyhelminths: AUA=M UGA=W AGR=S
• #5: Nuclear Code of Candida cylindracea: CUG=S (*)
• #6: Nuclear Code of Ciliata: UAR=Q
• #7: Nuclear Code of Euplotes: UGA=C
• #8: Mitochondrial Code of Echinoderms: UGA=W AGR=S AAA=N
• #9: Mitochondrial Code of Ascidiacea: UGA=W AGR=G AUA=M
• #10: Mitochondrial Code of Platyhelminthes: UGA=W AGR=S UAA=Y AAA=N
• #11: Nuclear Code of Blepharisma: UAG=Q

(*) see Nature 341:164
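Each exception can be treated as a small set of overrides on the standard table. A minimal sketch for exception #2 (vertebrate mitochondria), with codons written using T in place of U and AGR expanded to AGA and AGG; the helper names and the test sequence are hypothetical:

```python
# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Exception #2 from the list above: AGR=* AUA=M UGA=W.
VERTEBRATE_MITO = {"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}

def make_code(overrides):
    """Copy the standard table and apply the codon reassignments."""
    table = dict(STANDARD)
    table.update(overrides)
    return table

def translate(dna, table):
    """Translate frame 1; accepts RNA by converting U to T first."""
    dna = dna.upper().replace("U", "T")
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

seq = "ATGATATGAAGA"  # hypothetical: ATG ATA TGA AGA
standard_protein = translate(seq, STANDARD)
mito_protein = translate(seq, make_code(VERTEBRATE_MITO))
print(standard_protein, mito_protein)
```

The same DNA reads as MI*R under the standard code and MMW* under the vertebrate mitochondrial code, which is why choosing the wrong table can truncate or extend a predicted protein.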

Page 9: DNA sequence analysis

Start codons

• ATG the “universal” start codon … but

• 10% of E. coli genes start with GTG

• 1% start with TTG

• Bioinformaticians only make predictions

• Molecular biologists verify
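Allowing alternative start codons changes which open reading frames a program reports. A minimal sketch of a forward-strand ORF scan, assuming the three start codons named above and the standard stop codons; the function name, the `min_codons` threshold and the test sequence are hypothetical, and the output is only a prediction:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, starts=("ATG", "GTG", "TTG"), min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    Indices are 0-based and half-open; the stop codon is included.
    Each ORF begins at the first start codon seen after the previous stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

seq = "CCGTGAAACCCTAGCC"  # hypothetical: contains GTG AAA CCC TAG in frame 3
with_alternatives = find_orfs(seq)
atg_only = find_orfs(seq, starts=("ATG",))
print(with_alternatives, atg_only)
```

With ATG as the only start the scan reports nothing for this sequence; allowing GTG reveals one candidate ORF, which would still need experimental verification.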

Page 10: DNA sequence analysis

Restriction sites

• Essential for the construction of plasmids

• A key tool for molecular biology

• Hundreds available commercially
– Need to decide which to order
– Costs range from $3.80 to $500 per 1000 units

• http://tools.neb.com/NEBcutter2/index.php

• Usually need an enzyme that cuts once

AluI (blunt end)
5' AG’CT 3'
3' TC’GA 5'

EcoRI
5' G’AATTC 3'
3' CTTAA’G 5'

BamHI
5' G’GATCC 3'
3' CCTAG’G 5'
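Finding enzymes that cut exactly once is a simple string search. A minimal sketch using the three recognition sites shown above; because each site is palindromic (it equals its own reverse complement), searching one strand is enough. The function names and the test sequence are hypothetical, and the sequence is treated as linear:

```python
SITES = {"AluI": "AGCT", "EcoRI": "GAATTC", "BamHI": "GGATCC"}

def find_sites(dna, site):
    """Return 0-based start positions of every match, overlapping ones included."""
    dna = dna.upper()
    positions = []
    pos = dna.find(site)
    while pos != -1:
        positions.append(pos)
        pos = dna.find(site, pos + 1)
    return positions

def single_cutters(dna, sites=SITES):
    """Return the names of enzymes whose site occurs exactly once."""
    return sorted(name for name, site in sites.items()
                  if len(find_sites(dna, site)) == 1)

# Hypothetical fragment with one EcoRI site, two AluI sites and one BamHI site.
fragment = "TTGAATTCAAAGCTCCAGCTGGATCCTT"
for name, site in SITES.items():
    print(name, find_sites(fragment, site))
print(single_cutters(fragment))
```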

Page 11: DNA sequence analysis

Promoter Prediction

• To find the start of the transcript (97% of the human genome is non-coding)

• False positive rate too high
– Predicted promoters: 1 per kb; gene density: 1 per 100 kb

• RNA pol II transcribes DNA to RNA
– Needs general transcription factors (GTFs)

• Also specific (species, tissue, developmental stage) TFs
• TF binding sites are short and “fuzzy”
• 7% of vertebrate genes encode TFs

Page 12: DNA sequence analysis

Promoters 2

NF-AT4 matrix (3 known sites) and consensus:

A  0 0 3 3 3 0 0 1
C  1 2 0 0 0 0 0 2
G  0 0 0 0 0 1 1 0
T  2 1 0 0 0 2 2 0
   T C A A A T T C

Consensus YYAAAKKM = [CT](2)AAA[GT](2)[AC]

Predicts five sites in 3 kb upstream of human IL-11:
Bp 007 TTAAAGGC
Bp 248 ACAAATTC
Bp1959 GAGTTTGA
Bp2154 TCAAAGGA
Bp2181 GACTTTTA

Ask whether a TF site relevant to your cell type is present.
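The consensus can be derived mechanically from the count matrix: at each position, take every base seen at least once and write the matching IUPAC code. A minimal sketch that rebuilds YYAAAKKM from the counts above and turns it into a regular expression; the function names are hypothetical, and strict same-strand matching is an assumption, since the slides do not say how the five predicted sites were scored:

```python
import re

# Count matrix from the slide: rows are bases, columns are site positions.
COUNTS = {
    "A": [0, 0, 3, 3, 3, 0, 0, 1],
    "C": [1, 2, 0, 0, 0, 0, 0, 2],
    "G": [0, 0, 0, 0, 0, 1, 1, 0],
    "T": [2, 1, 0, 0, 0, 2, 2, 0],
}

# IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
         "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N"}

def iupac_consensus(counts):
    """Write one IUPAC code per column, covering every base observed there."""
    length = len(counts["A"])
    return "".join(
        IUPAC["".join(b for b in "ACGT" if counts[b][i] > 0)]
        for i in range(length))

def consensus_to_regex(consensus):
    """Expand ambiguity codes into character classes."""
    expand = {code: bases for bases, code in IUPAC.items()}
    return "".join(expand[c] if len(expand[c]) == 1 else "[" + expand[c] + "]"
                   for c in consensus)

consensus = iupac_consensus(COUNTS)
pattern = consensus_to_regex(consensus)
print(consensus, pattern)
for site in ("TTAAAGGC", "TCAAAGGA", "ACAAATTC"):
    print(site, bool(re.fullmatch(pattern, site)))
```

Strict matching accepts TTAAAGGC and TCAAAGGA but not ACAAATTC, so a scan that reports all five listed sites must tolerate mismatches or search both strands. That flexibility is also what drives the false positive rate up.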

Page 13: DNA sequence analysis

Primer design

• You will be asked to design primers for sequencing, PCR etc.

• Manual pages cover this

• Computationally trivial, so there is a wide choice of websites that do it
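As an illustration of how little computation is involved, two quantities commonly checked for a candidate primer are GC content and melting temperature. A minimal sketch, assuming the Wallace rule (2 °C per A or T, 4 °C per G or C), which is a rough estimate for short oligos and is not taken from the slides; the function names and the example primer are hypothetical:

```python
def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primer = "ATGCCCGCATTTGAATAA"  # hypothetical 18-mer
tm = wallace_tm(primer)
gc = gc_fraction(primer)
print(f"Tm = {tm} C, GC = {gc:.0%}")
```

Dedicated primer design tools use more detailed thermodynamic models and also check for hairpins and primer dimers, so this is only a first filter.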

Page 14: DNA sequence analysis

Not-trivial

• Nucleic acid (NA) secondary structure
– EMBOSS einverted for short palindromes
– mFOLD

• Huge database of 16S rRNA structures

• miRNA sites

Page 15: DNA sequence analysis

Secondary Structure

• DNA (and RNA) can form base-pairs.

• Not all of these are with complementary strands.

Bioinformatic view = a cartoon

Closer to reality

Page 16: DNA sequence analysis

16S rRNA

Gram -ve

Gram +ve

Evolutionary consequences? Coordinated/dependent mutational change

Page 17: DNA sequence analysis

RDP

• Ribosomal Database Project-II Release 9 Notes

• RDP Release 9.42 (Release 9, update 42) consists of 262,030 aligned and annotated 16S rRNA sequences, along with five online analysis tools.