-
Functional annotation of ChIP-peaks
Minghui Wang, Qi Sun
Bioinformatics Facility
Institute of Biotechnology
-
[Figure: panels Sense, Antisense, Antisense_step1, Antisense_step2; correlations shown: R = -0.27, R = 0.13, R = 0.79]
peak_annotation
Columns: seqnames, start, end, width, strand, length, summit, tags, X.10.log10.pvalue., fold_enrichment, FDR..., annotation, geneChr, geneStart, geneEnd, geneLength, geneStrand, geneId, transcriptId, distanceToTSS
First row (truncated): 1I40584225168*16841364612.73.5710.99Promoter (
-
Experimental design
Biorep 1, Biorep 2, Biorep 3
Young: TR ($u_1$), CL ($u_2$)
Old: TR ($u_3$), CL ($u_4$)
$u_1 - u_2 \;?\; 0$
$u_3 - u_4 \;?\; 0$
$(u_1 - u_2) - (u_3 - u_4) \;?\; 0$ is for ????
-
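GLM (Poisson)
$$
Y_i =
\begin{pmatrix} 34 \\ 12 \\ 42 \\ 18 \\ 44 \\ 20 \\ 25 \\ 10 \\ 32 \\ 15 \\ 38 \\ 14 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix} \mu \\ \alpha \\ \beta \\ \alpha * \beta \end{pmatrix}
+ \varepsilon_i
$$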
out
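To make the role of the four coefficients concrete, here is a minimal Python sketch. It assumes the slide's count column reads as twelve two-digit values and that the design columns are intercept, α (stage), β (TR vs CL) and α*β. It also assumes a log link, which is the usual choice for a Poisson GLM but is not stated on the slide. With one parameter per group the fitted group means equal the observed group means, so the coefficients can be read off directly without an iterative fit.

```python
import math

# Counts and design rows as read off the slide (assumed: 12 two-digit counts).
counts = [34, 12, 42, 18, 44, 20, 25, 10, 32, 15, 38, 14]
# Columns: intercept, alpha (stage), beta (TR vs CL), alpha*beta.
# Which stage is coded alpha = 1 is not recoverable from the slide.
design = [(1, 1, 1, 1), (1, 1, 0, 0)] * 3 + [(1, 0, 1, 0), (1, 0, 0, 0)] * 3


def group_mean(alpha, beta):
    """Mean count of the samples whose design row has the given alpha and beta."""
    vals = [y for y, row in zip(counts, design) if row[1] == alpha and row[2] == beta]
    return sum(vals) / len(vals)


# Log of each group mean; under a log link these are sums of coefficients.
log_m = {(a, b): math.log(group_mean(a, b)) for a in (0, 1) for b in (0, 1)}

mu = log_m[(0, 0)]                      # baseline group (alpha = 0, beta = 0)
alpha = log_m[(1, 0)] - mu              # stage effect in CL samples
beta = log_m[(0, 1)] - mu               # TR vs CL effect in the alpha = 0 stage
interaction = log_m[(1, 1)] - log_m[(1, 0)] - log_m[(0, 1)] + mu

print(f"mu={mu:.3f} alpha={alpha:.3f} beta={beta:.3f} alpha*beta={interaction:.3f}")
```

The interaction term is the log-scale version of the "difference of differences" question: it is zero when the TR vs CL fold change is the same in both stages.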
-
Identify enriched regions within Young or Old
GLM
Biorep 1
Biorep 2
Biorep 3
Pu M, et al. (2015) Trimethylation of Lys36 on H3 restricts gene expression change during aging and impacts life span. Genes Dev 29(7):718-31
-
Identify enrichment regions between young and old stages
Young
Old
Biorep 1
Biorep 2
Biorep 3
Biorep 1
Biorep 2
Biorep 3
GLM
Pu M, et al. (2015) Trimethylation of Lys36 on H3 restricts gene expression change during aging and impacts life span. Genes Dev 29(7):718-31
-
Visualization & Annotation
-
Downstream analysis workflow
Peak calling
Enriched regions
Visualization with IGV
Nearest genes
Relationship to gene features
Density plotting of gene features
Functional enrichment (e.g. GO categories)
Motif analysis
Other advanced analysis
-
Tools for visualization
https://deeptools.readthedocs.io/en/latest/
http://software.broadinstitute.org/software/igv/
http://homer.ucsd.edu/homer/
-
deepTools
Tools for BAM and bigWig file processing
- multiBamSummary computes read coverages over bam files. Output used for plotCorrelation or plotPCA
- multiBigwigSummary extracts scores from bigwig files. Output used for plotCorrelation or plotPCA
- correctGCBias corrects GC bias from bam file. Don't use it with ChIP data
- bamCoverage computes read coverage per bins or regions
- bamCompare computes log2 ratio and others of read coverage of two samples per bins or regions
- bigwigCompare computes log2 ratio and others from bigwig scores of two samples per bins or regions
- computeMatrix prepares the data from bigwig scores for plotting with plotHeatmap or plotProfile
Tools for QC
- plotCorrelation plots heatmaps or scatterplots of data correlation
- plotPCA plots PCA
- plotFingerprint plots the distribution of enriched regions
…
Heatmaps and summary plots
- plotHeatmap plots one or multiple heatmaps of user selected regions over different genomic scores
- plotProfile plots the average profile of user selected regions over different genomic scores
- plotEnrichment plots the read/fragment coverage of one or more sets of regions
…
-
deepTools
[mingh@cbsumm16 ChIP_seq_workshop_2017]$ bamCoverage -h
bamCoverage -b reads.bam -o coverage.bw optional
Optional arguments:
--help, -h show this help message and exit
--scaleFactor SCALEFACTOR
--binSize INT bp
…
usage: bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw
Optional arguments:
--help, -h show this help message and exit
--scaleFactorsMethod
--binSize INT bp
--ratio {log2,ratio,subtract,add,mean,reciprocal_ratio,first,second}
…
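As an illustration of what a per-bin log2 ratio between a treatment and a control track means, here is a small Python sketch. It is not deepTools code. The function name, the scale factors and the pseudocount of 1 are assumptions chosen for the example, and the values bamCompare actually uses depend on its options.

```python
import math


def log2_ratio_per_bin(treatment, control, scale_t=1.0, scale_c=1.0, pseudocount=1.0):
    """Return log2((t*scale_t + pc) / (c*scale_c + pc)) for each pair of bins.

    treatment, control: per-bin read counts of equal length.
    The pseudocount avoids division by zero in empty bins (assumed value).
    """
    if len(treatment) != len(control):
        raise ValueError("treatment and control must have the same number of bins")
    return [
        math.log2((t * scale_t + pseudocount) / (c * scale_c + pseudocount))
        for t, c in zip(treatment, control)
    ]


# Three hypothetical bins: enriched, empty in both samples, strongly enriched.
print(log2_ratio_per_bin([3, 0, 7], [1, 0, 1]))
```

Positive values mark bins where the treatment has more (scaled) coverage than the control; bins empty in both samples come out as 0.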
-
bigwig
peak_region
-
HOMER
makeTagDirectory [options] [alignment file 2]
or
makeTagDirectory reps-combined -d rep-1 rep-2 rep-3

Treatment file:
Optimizing tag files...
Estimated genome size = 119638054
Estimated average read density = 0.014636 per bp
Total Tags = 1751052.0
Total Positions = 1559937
Average tag length = 50.0
Median tags per position = 1 (ideal: 1)
Average tags per position = 1.123
Fragment Length Estimate: 291
Peak Width Estimate: 345
Autocorrelation quality control metrics:
Same strand fold enrichment: 3.9
Diff strand fold enrichment: 3.8
Same / Diff fold enrichment: 1.0
Guessing sample is ChIP-Seq or unstranded RNA-Seq - autocorrelation looks good.

Control file:
Optimizing tag files...
Estimated genome size = 25588101
Estimated average read density = 0.090854 per bp
Total Tags = 2324782.0
Total Positions = 754412
Average tag length = 48.9
Median tags per position = 2 (ideal: 1)
Average tags per position = 3.082
!! Might have some clonal amplification in this sample if sonication was used
Fragment Length Estimate: 155
Peak Width Estimate: 154
!!! No reliable estimate for peak size
Setting Peak width estimate to be equal to fragment length estimate
Autocorrelation quality control metrics:
Same strand fold enrichment: 1.2
Diff strand fold enrichment: 1.2
Same / Diff fold enrichment: 1.0
Guessing sample is ChIP-Seq - may have low enrichment with lots of background
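Two of the numbers in these logs can be reproduced by simple division, which helps when reading the clonality warning. The sketch below is an interpretation of the log, not HOMER code: the reported values are consistent with average tags per position = Total Tags / Total Positions and read density = Total Tags / Estimated genome size. The function and variable names are made up for the example.

```python
def tag_stats(total_tags, total_positions, genome_size):
    """Recompute two summary ratios from values printed in the log."""
    return {
        "avg_tags_per_position": round(total_tags / total_positions, 3),
        "read_density_per_bp": round(total_tags / genome_size, 6),
    }


# Values copied from the two logs shown on the slide.
first_log = tag_stats(1751052.0, 1559937, 119638054)
second_log = tag_stats(2324782.0, 754412, 25588101)

print(first_log)
print(second_log)
```

A value close to 1 tag per position (first log) suggests little duplication, while about 3 tags per position (second log) is what triggers the clonal amplification warning.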
-
HOMER
makeUCSCfile [options] -bigWig -o output
bedGraphToBigWig should be available in your executable path
wget https://github.com/ENCODE-DCC/kentUtils/blob/v302.1.0/bin/linux.x86_64/bedGraphToBigWig
chmod 777 bedGraphToBigWig
genome_size.file
1 30427671
2 19698289
3 23459830
4 18585056
5 26975502
Mt 366924
Pt 154478
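The genome size file is a two-column list of sequence name and length. If it is not at hand, it can be derived from the reference FASTA. The following Python sketch shows one way to do that; the helper names and the toy sequences are hypothetical, and a real run would pass an open FASTA file instead of the in-memory example.

```python
import io


def chrom_sizes(fasta_handle):
    """Return {sequence name: length} from a FASTA file handle, in file order."""
    sizes = {}
    name = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


def format_sizes(sizes):
    """Format as one 'name<TAB>length' line per sequence."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes.items())


# Toy FASTA standing in for a real genome file.
demo = io.StringIO(">1 toy chromosome\nACGTAC\nGT\n>Mt\nACG\n")
print(format_sizes(chrom_sizes(demo)), end="")
```

The sequence names must match the chromosome names used in the alignment files, otherwise the bigWig conversion will not find them.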
-
Tools for gene features analysis
-
HOMER
annotatePeaks.pl options >output
annotatePeaks.pl test_results_peaks.narrowPeak_chr tair10 >out

est_results_peaks.narrowP  Chr   Start     End       Strand  Pea  FD  Annotation     Detailed Annotat  Distance to TSS  Nearest Promot  Nearest U  Nearest Re  Gene Name  Gene Alias
test_results_peak_14368    Chr5  6833504   6837577   +                exon (AT5G20   exon (AT5G20250   1881             AT5G20250.4     At.74986   NM_00103    DIN10      DARK IND
test_results_peak_1382     Chr1  6971312   6973001   +                exon (AT1G20   exon (AT1G20110   671              AT1G20110.1     At.15444   NM_10186    AT1G20110  T20H2.10|
test_results_peak_855      Chr1  4347808   4349969   +                promoter-TSS   promoter-TSS (A   390              AT1G12760.1     At.43884   NM_00103    AT1G12760  T12C24.29
test_results_peak_15041    Chr5  15843775  15845935  +                exon (AT5G39   exon (AT5G39570   896              AT5G39570.1     At.20492   NM_12331    AT5G39570  MIJ24.6|M
test_results_peak_154      Chr1  739488    742090    +                intron (AT1G0  intron (AT1G0309  1110             AT1G03090.2     At.24059   NM_10019    MCCA       -
test_results_peak_6386     Chr2  16483892  16485127  +                exon (AT2G39   exon (AT2G39480   530              AT2G39480.1     At.63501   NM_12950    PGP6       F12L6.14|F
test_results_peak_3313     Chr1  24046891  24048872  +                promoter-TSS   promoter-TSS (A   741              AT1G64720.1     At.74749   NM_10514    CP5        F13O11.4|
test_results_peak_8490     Chr3  7116011   7117776   +                exon (AT3G20   exon (AT3G20410   692              AT3G20410.1     At.8182    NM_11293    CPK9       domain pro
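The "Distance to TSS" and nearest-gene columns come from comparing each peak with annotated transcription start sites. The Python sketch below shows the general idea only. It assumes the peak centre is the reference point and that the sign follows the gene's strand (negative means upstream); HOMER's exact rules may differ. The gene names and coordinates are hypothetical.

```python
from bisect import bisect_left


def nearest_tss(peak_start, peak_end, tss_list):
    """Return (gene_id, signed distance) for the TSS closest to the peak centre.

    tss_list: non-empty list of (position, strand, gene_id) tuples for one
    chromosome, sorted by position.
    """
    center = (peak_start + peak_end) // 2
    positions = [t[0] for t in tss_list]
    i = bisect_left(positions, center)
    # Only the TSS just left and just right of the centre can be the nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, strand, gene = min(candidates, key=lambda t: abs(t[0] - center))
    offset = center - pos
    if strand == "-":
        # On the minus strand, larger coordinates are upstream of the TSS.
        offset = -offset
    return gene, offset


# Hypothetical annotation for one chromosome.
tss = [(1000, "+", "geneA"), (5000, "-", "geneB")]
print(nearest_tss(1300, 1500, tss))
print(nearest_tss(5200, 5400, tss))
```

Sorting the TSS positions once per chromosome and using a binary search keeps the lookup fast even for tens of thousands of peaks.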
-
PAVIS
features distance to TSS
-
PeakAnalyzer
Peak file
GTF file
-
peak
-
Tools for density plot
https://github.com/shenlab-sinai/ngsplot
-
ngs.plot
ngs.plot.r -G genome -R region -C [cov|config] file -O name [Options]
-G Genome name. Use ngsplotdb.py list to show available genomes.
-R Genomic regions to plot: tss, tes, genebody, exon, cgi, enhancer, dhs or bed
-C Indexed bam file or a configuration file for multiplot
-O Name for output: multiple files will be generated
Options
-L Flanking region size (will override flanking factor)
-N Flanking region factor
-RB The fraction of extreme values to be trimmed on both ends. default=0, 0.05 means 5% of extreme values will be trimmed
-S Randomly sample the regions for plot, must be: (0, 1]
-P #CPUs to use. Set 0 (default) for auto detection
-AL Algorithm used to normalize coverage vectors: spline (default), bin
-CS Chunk size for loading genes in batch (default=100)
-MQ Mapping quality cutoff to filter reads (default=20)
-FL Fragment length used to calculate physical coverage (default=150)
https://github.com/shenlab-sinai/ngsplot/wiki/SupportedGenomes
-
ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.rmdup.sort.bam -O hesc.H3k4me3.tss -T H3K4me3 -L 3000 -FL 300
ngs.plot
-
ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.rmdup.sort.bam:hesc.Input.rmdup.sort.bam -O hesc.H3k4me3vsInp.tss -T H3K4me3 -L 3000
ngs.plot
-
ngs.plot.r -G hg19 -R genebody -C config.hesc.k36.txt -O hesc.k36.genebody -D ensembl -FL 300
hesc.H3k36me3.rmdup.sort.bam high_expressed_genes.txt "High"
hesc.H3k36me3.rmdup.sort.bam medium_expressed_genes.txt "Med"
hesc.H3k36me3.rmdup.sort.bam low_expressed_genes.txt "Low"
ngs.plot
-
deepTools
computeMatrix scale-regions -S H3K27Me3-input.bigWig H3K4Me1-Input.bigWig H3K4Me3-Input.bigWig -R genes19.bed genesX.bed --beforeRegionStartLength 3000 --regionBodyLength 5000 --afterRegionStartLength 3000 --skipZeros -o matrix.mat.gz
plotProfile -m matrix.mat.gz -out ExampleProfile1.png --numPlotsPerRow 2 --plotTitle "Test data profile"
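The idea behind scale-regions is that genes of different lengths are stretched or shrunk to a common body length before averaging. The Python sketch below illustrates that idea on toy data. It is a simplification, not the deepTools implementation: it averages per-base signal into equal-width bins and ignores flanking regions, strand and zero skipping. The function names are hypothetical.

```python
def scale_region(signal, n_bins):
    """Average a per-base signal into n_bins equal-width bins."""
    n = len(signal)
    if n_bins <= 0 or n < n_bins:
        raise ValueError("need at least one base per bin")
    scaled = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = signal[lo:hi]
        scaled.append(sum(chunk) / len(chunk))
    return scaled


def average_profile(regions, n_bins):
    """Scale every region to n_bins and average bin by bin across regions."""
    scaled = [scale_region(r, n_bins) for r in regions]
    return [sum(col) / len(col) for col in zip(*scaled)]


# Two toy regions of different length, scaled to two bins each.
regions = [[0, 0, 2, 2], [4, 4, 4, 4, 8, 8, 8, 8]]
print(average_profile(regions, 2))
```

Each row of the scaled matrix corresponds to one region, which is what a heatmap shows, and the column means are the average profile curve.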
-
HOMER
annotatePeaks.pl -hist …
-
ChIPseeker
-
Functional enrichment
Over-represented functional annotations of nearest genes of peaks
• Gene Ontology
• Biological Pathways
Typical tools
• DAVID https://david.ncifcrf.gov/
• GREAT http://bejerano.stanford.edu/great/public/html/
• Blast2go https://www.blast2go.com/
• Mapman http://mapman.gabipd.org
• Homer http://homer.salk.edu/homer/ngs/annotation.html
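Over-representation is usually assessed with a statistical test on gene counts. One common formulation is the one-sided hypergeometric test, sketched below in Python. This is a generic illustration and does not claim to reproduce the exact statistic of any tool listed above; the function name and the example numbers are made up.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) for a hypergeometric draw.

    n_background: genes in the background set
    n_annotated:  background genes carrying the annotation (e.g. a GO term)
    n_selected:   genes nearest to peaks
    n_overlap:    selected genes carrying the annotation
    """
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 background genes, 4 annotated, 3 selected, all 3 annotated.
print(hypergeom_enrichment_p(10, 4, 3, 3))
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.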
-
Webpage tools
-
Oliver et al. (2004) The Plant Journal
-
HOMER
findGO.pl [options]
http://homer.ucsd.edu/homer/microarray/go.html
-
Motif pattern analysis
-
Motif analysis
meme [optional arguments]
MEME (http://meme.sdsc.edu/meme/cgi-bin/meme.cgi)
-
HOMER
-
biomaRt & Bioconductor
-
ChIPseeker
-
ChIPpeakAnno
peak
-
peak
-
promoter
-
genome