comparison of array detected transcription map with gencode/havana annotations in encode regions
Post on 18-Dec-2015
Comparison of array detected transcription map with
GENCODE/HAVANA annotations in ENCODE regions
AFFX Transcriptome Group / Computation Molecular Biology
S. Bekiranov, P. Kapranov, S. Brubaker, I. Bell, J. Cheng, J. Drenkow, S. Ghosh, D. Kampa-Bailey, G. Helt, J. Long, G. Madhavan, J. Manak, S. Patel, V. Sementchenko, H. Tammana, A. Piccolboni
Support: NCI Contract (21XS019C Phases I-III) 2001-2006; NHGRI ENCODE Grant; AFFYMETRIX
Acknowledgements
NCI
Harvard Medical School: K. Struhl, H. Hirsch, H. H. Ng, E. Sekinger
Broad Institute: B. Bernstein, M. Kamal, K. Lindblad-Toh, D. J. Huebert, S. McMahon, E. K. Karlsson, E. J. Kulbokas III, S. L. Schreiber, E. S. Lander
Transcription Map & Modification Site Generation… II
1. Median Scaling: Scale all features on chip such that chip median = M.
2. Quantile Normalization (QN): QN feature intensities within replicates only. QN Treatment and Control separately.
3. Probe Mapping to Genome: Map PM, MM pairs to genome via exact 25-mer alignment of PM.
4. Wilcoxon Signed Rank Test:
• Perform test on probe-pair signal S = log2(PM-MM)
• Apply a sliding window to estimate intensity of each probe pair as a pseudo-median of all probes in the window.
• A sliding window makes use of neighboring probes; this reduces false positive rate and increases sensitivity.
• Window size varies w/ experiment: RNA ~50 bp, IP ~250 bp
5. Map and Site Generation:
RNA
• Join probes w/ intensity above the 5% FPR threshold & maxgap, minrun to generate transcribed fragments
Chromatin IP
• Generate Hodges-Lehmann estimator to estimate expression level: logDiff = log2(max((PM-MM)T, 1)) - log2(max((PM-MM)C, 1)), where T = treatment and C = control
• Generate p-value estimate per probe
• Join probes w/ p-value 10^-5 & maxgap, minrun to generate modification/transcription factor binding sites
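The sliding-window pseudo-median in step 4 can be sketched as follows. This is a minimal illustration, not the group's actual code: the function names and the `half_window` parameter are hypothetical, the signal is floored at 1 before taking the log (an assumption borrowed from the logDiff expression in step 5), and the slide does not say whether the quoted window sizes are full or half widths.

```python
import math
from itertools import combinations_with_replacement
from statistics import median


def probe_signal(pm, mm):
    """log2 of the PM-MM difference, floored at 1 so the log is defined (assumption)."""
    return math.log2(max(pm - mm, 1.0))


def pseudo_median(values):
    """Hodges-Lehmann estimator: median of all pairwise (Walsh) averages."""
    walsh = [(a + b) / 2.0 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


def sliding_pseudo_median(positions, signals, half_window):
    """For each probe, pseudo-median of the signals of all probes whose
    genomic position lies within +/- half_window bp of that probe."""
    smoothed = []
    for center in positions:
        window = [s for p, s in zip(positions, signals)
                  if abs(p - center) <= half_window]
        smoothed.append(pseudo_median(window))
    return smoothed


if __name__ == "__main__":
    # Hypothetical probe start positions (bp) and probe-pair signals.
    positions = [0, 35, 70, 500]
    signals = [1, 2, 3, 10]
    print(sliding_pseudo_median(positions, signals, half_window=50))
```

An isolated probe (position 500 here) keeps its own value because its window contains only itself; probes with neighbors are pulled toward the local consensus, which is how the window suppresses single-probe outliers.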
[Flowchart: CEL files -> Compute median (M) of all chip medians (if multiple arrays in a set) -> Median Scaling -> Quantile Normalization -> Probe Mapping to Genome -> Wilcoxon Signed Rank Test -> RNA or IP -> RNA: Transfrag Generation / Chromatin IP: Site Generation]
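The first two boxes of the flow, median scaling and quantile normalization, can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, each chip is a plain list of feature intensities, and the quantile reference is taken as the mean of the sorted values across chips (the usual convention; the slides do not specify it). Per the slides, quantile normalization would be applied within replicates only, with treatment and control handled separately.

```python
from statistics import median


def median_scale(chips):
    """Scale every chip so its median equals M, the median of all chip medians."""
    chip_medians = [median(chip) for chip in chips]
    target = median(chip_medians)
    return [[x * target / m for x in chip]
            for chip, m in zip(chips, chip_medians)]


def quantile_normalize(chips):
    """Give all chips the same intensity distribution.

    The reference distribution is the mean of the sorted values across
    chips (assumption); each feature receives the reference value of its rank.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(sc[i] for sc in sorted_chips) / len(chips)
                 for i in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # feature indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    replicates = [[5.0, 1.0, 3.0], [2.0, 6.0, 4.0]]  # hypothetical intensities
    print(median_scale(replicates))
    print(quantile_normalize(replicates))
```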
Filtration of 10 Chromosome Data
(Cheng, J., et al. Science Express; March 24, 2005) (see UCSC Browser for 8 cell line data; see Version 33)
• Low Complexity Repeats
• Processed Pseudogenes
• BLAT hits more than itself (lose some members of gene families)
• Using all filters reduces the number of transfrags by ~20%; ~30% of these are pseudogenes. With BLAT data, the reduction is 14%.
RACE Model
(Need isothermal RT for unannotated transfrags)
RACE Analysis of Coding Gene
DiGeorge Critical Region 14 gene
Un-annotated transfrags of PISD are part of at least 9 different, yet overlapping sense-antisense transcripts
Sense Strand
Anti-sense strand
RACE Regions Validated for 768 Loci
Region | Total 5' and 3' + 5' or 3' successful RACE | Total successful 5' or 3' RACE | Percent total success considering total 256
Intergenic | 178/256 | 78 | 70%
Intronic | 213/256 | 48 | 83%
Exonic | 243/256 | 13 | 95%
Data sets analyzed
• Part 1:
a) Analysis done on v34 of the human genome. Total number of ENCODE regions analyzed = 12 (region ENm006 ignored for this analysis since no annotations are available for v34).
b) Set of known/validated exons.
c) Set of predicted exons (from multiple gene predictions).
d) Array detected transcript maps from HL-60 cell lines at 4 time points after RA stimulation (i.e., one cell line at 4 biological states).
• Part 2:
a) Analysis done on v35 of the human genome. Total number of ENCODE regions analyzed = 44.
b) Set of known/validated exons.
c) Set of Vega putative exons.
d) Set of predicted exons outside sets b & c (from multiple gene predictions).
e) Array detected transcript maps from HL-60 cell lines at 4 time points after RA stimulation.
Coverage of interrogated regions using the algorithms used to call transfrags
[Figure labels: Genomic sequence; Probes (35 bp avg. distance); Repeats (RepeatMasker); Annotation (e.g. Vega); Exon 1 < 100% covered; Exon 2 is 100% covered]
Analyses done only within interrogated regions
How comparisons are carried out using arrays, annotations and predicted regions
[Figure labels: Genomic sequence; Probes; Positive probes; Transfrags after minrun/maxgap parameters; Annotation (Exon 2); Predicted exons; X]
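The comparison pictured above (positive probes joined into transfrags, then overlapped with exons) can be sketched as follows. The exact definitions of maxgap and minrun are not given in the slides, so the ones used here are assumptions: maxgap is the largest allowed distance between the start positions of consecutive positive probes, and minrun is the minimum fragment length in bp. All function names are hypothetical.

```python
def join_probes(positive_positions, maxgap, minrun, probe_len=25):
    """Join positive probes into fragments (start, end), end exclusive.

    Consecutive positive probes are merged when their start positions are
    at most maxgap bp apart; fragments shorter than minrun bp are dropped.
    """
    frags = []
    start = prev = None
    for pos in sorted(positive_positions):
        if start is None:
            start = prev = pos
        elif pos - prev <= maxgap:
            prev = pos
        else:
            frags.append((start, prev + probe_len))
            start = prev = pos
    if start is not None:
        frags.append((start, prev + probe_len))
    return [(s, e) for s, e in frags if e - s >= minrun]


def exon_coverage(exon, frags):
    """Fraction of exon bp (start, end exclusive) covered by the union of frags."""
    s, e = exon
    covered = 0
    cursor = s  # everything left of cursor is already counted or outside the exon
    for fs, fe in sorted(frags):
        lo, hi = max(fs, cursor), min(fe, e)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / (e - s)


if __name__ == "__main__":
    # Hypothetical positive probe start positions (bp).
    frags = join_probes([0, 35, 70, 300, 1000, 1035], maxgap=50, minrun=40)
    for exon in [(50, 150), (1000, 1050), (400, 500)]:
        cov = exon_coverage(exon, frags)
        print(exon, "detected (>= 1 bp):", cov > 0, "covered > 75%:", cov > 0.75)
```

The two booleans printed per exon correspond to the two detection criteria used in the coverage tables: overlap by at least 1 bp, and more than 75% of exon bp overlapped by a transfrag.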
Coverage of annotation by array detected transfrags from HL60 cell line in 13 ENCODE regions
Analysis results of 12/13 ENCODE regions
Exon set | Total number of exons interrogated | Number of exons detected by array generated transfrags (overlap by at least 1 bp) | Number of exons detected by array generated transfrags (> 75% of exon bp overlapped by a transfrag)
Known/Validated | 1852 | 1068 [57.7%] (74% avg. bp coverage) | 700 (37.7%)
Predicted | 1181 | 360 [30.5%] (69.2% avg. bp coverage) | 175 (14.8%)
Size distribution of annotated/validated exons detected by ENCODE array transfrags
[Figure: # of exons (0-450) vs. exon size (40 bp bins, 1-25); legend: 100 pct, < 100 pct, < 75 pct, < 50 pct, < 25 pct, 0 pct]
• Mode size of annotated exons is ~120 bp
• Detection of exons is not dependent upon size (bp) of the exon (i.e. small exons are not biased against)
• If an exon is detected by a transfrag, 65% of these are covered at >75%
Size distribution of predicted exons detected by ENCODE array transfrags
[Figure: # of exons (0-250) vs. exon size (40 bp bins, 1-25); legend: 100 pct, < 100 pct, < 75 pct, < 50 pct, < 25 pct, 0 pct]
• Mode size of predicted exons is ~120 bp
• Approximately 30.5% of predicted exons are covered (i.e. at least 1 bp coverage) by transfrags
• If an exon is detected by a transfrag, 48.6% of these are covered at >75%
Coverage of annotation by array detected transfrags from HL60 cell line in all 44 ENCODE regions
Analysis results of 44 ENCODE regions
Exon set | Total number of exons interrogated | Number of exons detected by array generated transfrags (overlap by at least 1 bp) | Number of exons detected by array generated transfrags (> 75% of exon bp overlapped by a transfrag)
Known/Validated | 6467 | 3487 [53.9%] (70% avg. bp coverage) | 2142 (33.1%)
Predicted | 4455 | 809 [18.2%] (62.23% avg. bp coverage) | 361 (8.1%)
Vega Putative | 185 | 39 [21.1%] (35.71% avg. bp coverage) | 3 (1.6%)
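As a quick arithmetic check, the percentages for the 44 regions follow directly from the exon counts. The sketch below recomputes them; the variable and key names are hypothetical, and the last ratio (exons covered at >75% as a share of detected exons) is the quantity quoted in the bullet points under the size-distribution figures.

```python
# (exon set, exons interrogated, detected at >= 1 bp, covered at > 75%)
# Counts taken from the 44-region analysis.
rows = [
    ("Known/Validated", 6467, 3487, 2142),
    ("Predicted", 4455, 809, 361),
    ("Vega Putative", 185, 39, 3),
]


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    name: {
        "detected_pct": pct(detected, total),
        "covered75_pct": pct(covered75, total),
        "covered75_of_detected_pct": pct(covered75, detected),
    }
    for name, total, detected, covered75 in rows
}

if __name__ == "__main__":
    for name, stats in summary.items():
        print(name, stats)
```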
Size distribution of known/validated exons detected by ENCODE array transfrags
[Figure: # of exons (0-1400) vs. exon size (40 bp bins, 1-25); legend: 100 pct, < 100 pct, < 75 pct, < 50 pct, < 25 pct, 0 pct]
• Mode size of annotated exons is ~120 bp
• Detection of exons is not dependent upon size (bp) of the exon (i.e. small exons are not biased against)
• If an exon is detected by a transfrag, 61.4% of these are covered at >75%
Size distribution of predicted exons detected by ENCODE array transfrags
[Figure: # of exons (0-1600) vs. exon size (40 bp bins, 1-25); legend: 100 pct, < 100 pct, < 75 pct, < 50 pct, < 25 pct, 0 pct]
• Mode size of predicted exons is ~80 bp
• Approximately 18.2% of predicted exons are detected by transfrags (i.e. by at least 1 bp)
• If an exon is detected by a transfrag, 44.6% of these are covered at >75%
Important Caveats To Recall In Pondering the Prediction vs Array Results
• Only one cell line used in this evaluation.
• We have set very conservative thresholds for transfrag prediction. Other thresholds can be used.
• Strand information not deducible from transfrag map. TUFs (transcripts of unknown function) are collections of transfrags shown to be on the same molecule by RACE-RT/PCR-cloning/sequencing.
• Array interrogation resolution is 20 bp on average for the non-repeat portion of the genome and probes are 25-mers. Thus, the boundaries of transfrags are not as precise as with arrays at 5 bp interrogation resolution, and some small exons will not be interrogated or detected.
• Have not included other functional features (e.g. TF binding) which would provide additional confidence to transfrag data. These will be added under ENCODE project.
Conclusions
• Array based method detects ~53.9% of known/validated exons.
• Similarly, array based method provides evidence for ~18.2% of predicted exons. These detected exons should be analyzed further to improve the annotation.
• A combination of array based RNA map generation followed by RACE experiments can significantly improve the rate of validation of gene predictions.
• Transfrags that map outside validated and predicted exons can be used to improve gene prediction programs and can form the basis for further experiments.