Computational Analysis of Transcript Identification Using GenBank Slides by Terry Clark

Post on 22-Dec-2015



Differentiation of hematopoietic cells

Pluripotent stem cell
• Myeloid: Erythrocyte, Platelet, Monocyte, Neutrophil, Eosinophil, Basophil
• Lymphoid: B cell, T cell

Genome-wide gene expression

level of expression      number of expressed genes
>500 mRNA / cell         100
5-50 mRNA / cell         900
<5 mRNA / cell           9,000

SAGE (Serial Analysis of Gene Expression)

• mRNA/cDNA (poly-A tailed transcripts)
• isolate SAGE tags
• link tags together & sequencing
• gene identification

Jes Stollberg et al. Genome Res. 2000; 10: 1241-1248

Figure 1 Schematic illustration of the SAGE process
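The "isolate SAGE tags" and "collecting tags" steps can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline. It assumes the common SAGE convention that the tag is the 10 bases immediately 3' of the 3'-most anchoring-enzyme site (CATG for NlaIII); the slides do not state which enzyme was used, and the function names `extract_sage_tag` and `count_tags` are hypothetical.

```python
from collections import Counter


def extract_sage_tag(cdna, anchor="CATG", tag_length=10):
    """Return the tag that follows the 3'-most anchor site, or None.

    Assumption (not from the slides): anchor="CATG" is the NlaIII site
    used in the standard SAGE protocol.
    """
    seq = cdna.upper()
    pos = seq.rfind(anchor)          # 3'-most occurrence of the anchor site
    if pos == -1:
        return None                  # transcript has no anchor site
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    # Reject transcripts with fewer than tag_length bases after the site.
    return tag if len(tag) == tag_length else None


def count_tags(cdnas):
    """Count how often each tag is observed across a collection of cDNAs."""
    tags = (extract_sage_tag(c) for c in cdnas)
    return Counter(t for t in tags if t is not None)
```

In this sketch the tag counts stand in for expression levels: a transcript present in many copies contributes its tag many times.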

SAGE & GLGI Overview

SAGE
• identify most of the expressed genes
• quantitative analysis of expressed genes by collecting tags

GLGI
• extend tags into longer 3' cDNAs

Gene identification (GenBank)
• multi-match
• single-match
• no match

[Remaining figure labels: SPGI; mRNA; collect cDNA clones; match; match]

What is the chance of duplicate tags?

• As a first model, assume tags are drawn uniformly at random from the set of all sequences of the given tag length over the 4-letter alphabet (A, C, G, T)

• This is the same problem as having unique overlaps in the contig matching problem for shotgun sequencing
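Under the uniform random model, the duplicate question is the classic birthday problem over 4^L possible tags of length L. The sketch below is only an illustration of that model; the function names are hypothetical, and the 32,851 figure is the gene-derived tag count quoted later in these slides.

```python
import math


def tag_space(tag_length, alphabet_size=4):
    """Number of distinct tags of the given length over a 4-letter alphabet."""
    return alphabet_size ** tag_length


def prob_some_duplicate(num_tags, tag_length):
    """P(at least two of num_tags uniformly random tags are identical)."""
    space = tag_space(tag_length)
    if num_tags > space:
        return 1.0  # pigeonhole: more tags than possible sequences
    # log P(all distinct) = sum_{i=0}^{n-1} log(1 - i/space)
    log_all_distinct = sum(math.log1p(-i / space) for i in range(num_tags))
    return -math.expm1(log_all_distinct)


def expected_distinct_tags(num_tags, tag_length):
    """Expected number of different sequences seen among num_tags random tags."""
    space = tag_space(tag_length)
    return space * -math.expm1(num_tags * math.log1p(-1.0 / space))


if __name__ == "__main__":
    n = 32_851  # gene-derived tag count quoted in the slides
    print(tag_space(10))
    print(prob_some_duplicate(n, 10))
    print(expected_distinct_tags(n, 10))
```

With 10-base tags there are 1,048,576 possible sequences, so even under this idealized model some duplicate among tens of thousands of tags is essentially certain, while the expected number of colliding tags stays a small fraction of the total.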

Random Model

The random model does not reflect the biological process

• Genes evolve by duplication as well as point mutation
• Many motifs are repeated
• Function widgets at work?
• The result is a strong bias in observed biological sequences, not the uniform distribution the simple model hopes for.
• Here are some numbers ...

SAGE tags match to many genes
(Tags from Hashimoto S, et al. Blood 94:837, 1999)

Tag          Matched gene number   Matched genes (only show up to 10)
CCTGTAATCC   405   Hs.267557, Hs.240615, Hs.231705, Hs.283045, Hs.236713, Hs.232277, Hs.181553, Hs.262716, Hs.181392, Hs.220696
GTGAAACCCC   305   Hs.282868, Hs.170225, Hs.184220, Hs.194021, Hs.231625, Hs.171830, Hs.270571, Hs.270572, Hs.272193, Hs.283921
CCACTGCACT   174   Hs.118778, Hs.256868, Hs.96023, Hs.31575, Hs.47517, Hs.200451, Hs.271222, Hs.253240, Hs.270018, Hs.270415
ACTTTTTCAA   44    Hs.16426, Hs.10669, Hs.75155, Hs.28166, Hs.13975, Hs.79136, Hs.111334, Hs.133430, Hs.79356, Hs.239100
TTGGGGTTTC   9     Hs.231375, Hs.273127, Hs.275603, Hs.175173, Hs.276612, Hs.224773, Hs.62954, Hs.182771, Hs.276326
TGCACGTTTT   8     Hs.199160, Hs.279943, Hs.36927, Hs.5338, Hs.169793, Hs.83450, Hs.173902, Hs.183506
TGTGTTGAGA   5     Hs.284136, Hs.275865, Hs.275221, Hs.274466, Hs.181165
CCCGTCCGGA   5     Hs.276353, Hs.277498, Hs.277573, Hs.276350, Hs.180842
TTGGTCCTCT   4     Hs.12328, Hs.108124, Hs.9739, Hs.112845
CTGACCTGTG   3     Hs.277477, Hs.181244, Hs.77961
TACCTGCAGA   3     Hs.100000, Hs.256957, Hs.253884
AGGCTACGGA   3     Hs.119122, Hs.211582, Hs.183297
GGGCTGGGGT   3     Hs.183698, Hs.118757, Hs.90436
CCCTGGGTTC   2     Hs.52891, Hs.111334
CACAAACGGT   2     Hs.2043, Hs.195453
GTGAAGGCAG   2     Hs.4221, Hs.77039
GGGCATCTCT   2     Hs.75061, Hs.76807
ATGGCTGGTA   2     Hs.254246, Hs.182426
CGCCGCCGGC   2     Hs.182825, Hs.132753
AGGGCTTCCA   2     Hs.29797, Hs.276544
TTGGTGAAGG   2     Hs.278674, Hs.75968
GTGGCCACGG   1     Hs.112405
GTTCACATTA   1     Hs.84298
TGGTGTTGAG   1     Hs.275865
CCCATCGTCC   1     Hs.151604
GTTGTGGTTA   1     Hs.75415
TTGTAATCGT   1     Hs.125078
CCCACAACCT   1     Hs.252136
GAGGGAGTTT   1     Hs.76064
CCAGAACAGA   1     Hs.111222
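The multi-match / single-match / no match outcome for a tag can be sketched as a lookup in a tag-to-gene index. This is an illustrative sketch, not the authors' software. The toy reference below copies a few tag and UniGene ID pairs from the table above; the function names and the unmatched query tag are hypothetical.

```python
from collections import defaultdict

# A few (tag, UniGene cluster) pairs copied from the table above.
REFERENCE_PAIRS = [
    ("CTGACCTGTG", "Hs.277477"),
    ("CTGACCTGTG", "Hs.181244"),
    ("CTGACCTGTG", "Hs.77961"),
    ("CCCTGGGTTC", "Hs.52891"),
    ("CCCTGGGTTC", "Hs.111334"),
    ("GTGGCCACGG", "Hs.112405"),
]


def build_tag_index(pairs):
    """Map each tag to the set of gene IDs whose sequence contains it."""
    index = defaultdict(set)
    for tag, gene_id in pairs:
        index[tag].add(gene_id)
    return index


def classify_tag(tag, index):
    """Label a tag as 'no match', 'single-match', or 'multi-match'."""
    genes = index.get(tag, ())       # .get avoids creating empty entries
    if not genes:
        return "no match"
    return "single-match" if len(genes) == 1 else "multi-match"


if __name__ == "__main__":
    index = build_tag_index(REFERENCE_PAIRS)
    # "AAAAAAAAAA" is a made-up query used only to show the no-match case.
    for tag in ("GTGGCCACGG", "CCCTGGGTTC", "AAAAAAAAAA"):
        print(tag, classify_tag(tag, index), sorted(index.get(tag, ())))
```

Only single-match tags identify a gene directly; multi-match and no-match tags are the cases where a longer 3' sequence is needed.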

Tag Frequency Groups for 10-base Tag Set Containing 878,938 Tags for UniGene Human

Unique Tags among 878,938 EST Derived Tags

Unique Tags among 32,851 Gene Derived Tags

Converting tag into longer 3' sequence

[Figure labels: 5' end; 3' end; SAGE tag; 3' longer sequence]

Generation of Longer 3' cDNA for Gene Identification (GLGI)

TAAAAAAAAAAACTCGCCGGCGAANNNNNNNNNNATTTTTTTTTTTGAGCGGCCGCTT

[Figure labels: SAGE tag (NNNNNNNNNN), 10 bases; longer sequence, hundred bases; sense extension; antisense extension (TGAGCGGCCGCTT); flanking sequences TAAAAAAAAAAACTCGCCGGCGAA and TGAGCGGCCGCTT]

UniGene Human 3’ Part Length Distribution

Myeloid Tag Matches with UniGene Human SAGE Tag Reference Database

SAGE Tag Processing with GIST

k-mer tree
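The slides name a k-mer tree but do not describe how GIST builds or stores it. One common reading is a tree with one level per base, where the node reached after k bases holds the IDs of the sequences containing that k-mer. The sketch below follows that reading; the class `KmerTree` and its methods are hypothetical and are not GIST's API.

```python
class KmerTree:
    """Tree over the bases A/C/G/T; one level per base of the k-mer.

    Illustrative only: the actual GIST data structure is not shown
    in the slides.
    """

    _IDS = "ids"  # key for the payload; cannot collide with 1-char bases

    def __init__(self):
        self.root = {}

    def insert(self, kmer, gene_id):
        """Record that gene_id contains kmer."""
        node = self.root
        for base in kmer:
            node = node.setdefault(base, {})
        node.setdefault(self._IDS, set()).add(gene_id)

    def lookup(self, kmer):
        """Return the set of gene IDs stored for kmer (empty if absent)."""
        node = self.root
        for base in kmer:
            node = node.get(base)
            if node is None:
                return set()
        return set(node.get(self._IDS, ()))


if __name__ == "__main__":
    tree = KmerTree()
    # Tag and ID pairs taken from the Hashimoto tag table in these slides.
    tree.insert("CCCTGGGTTC", "Hs.52891")
    tree.insert("CCCTGGGTTC", "Hs.111334")
    tree.insert("GTGGCCACGG", "Hs.112405")
    print(sorted(tree.lookup("CCCTGGGTTC")))
    print(sorted(tree.lookup("GTGGCCACGA")))
```

A lookup touches at most k nodes regardless of how many tags are stored, which is the usual reason to prefer a tree of this kind over scanning the reference set.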

GIST Performance with Improved IO

Conspirators

Sanggyu Lee, Janet D. Rowley, San Ming Wang

Terry Clark, Andrew Huntwork, Josef Jurek, L. Ridgway Scott