qi sun, william lai and jeff glaubitz - cornell university · 2020. 11. 28. · epigenomics data...

35
Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz Bioinformatics Facilty & Epigenomics Facility

Upload: others

Post on 18-Mar-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Epigenomics data analysis

11/30/2020 – 12/18/2020

Qi Sun, William Lai and Jeff GlaubitzBioinformatics Facilty & Epigenomics Facility

Page 2: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

• gene• CDS• mRNA• exon• five_prime_UTR• three_prime_UTR• rRNA• tRNA

GFF files

• ncRNA• tmRNA• transcript• mobile_genetic_element• origin_of_replication• promoter• repeat_region

Genome feature annotation &

Read enrichment pattern

Genome features (gff3 feature types)

Page 3: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Genome feature annotation & Epigenomicsdata analysis

• File formats

• Public databases

• Genome browser

• Bedtools

• Deeptools

• Homer

Week 1Genome features

• Peak calling

• Data QC

Week 2Peak calling

Week 3Integrated analysis

Page 4: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

•Axt format•BAM format•BED format•BED detail format•bedGraph format•barChart and bigBarChart format•bigBed format•bigGenePred table format•bigPsl table format•bigMaf table format•bigChain table format•bigNarrowPeak table format•bigWig format•Chain format

•CRAM format•GenePred table format•GFF format•GTF format•HAL format•Hic format•Interact and bigInteract format•MAF format•Microarray format•Net format•Personal Genome SNP format•PSL format•VCF format•WIG format

File formats listed on UCSC web site

https://genome.ucsc.edu/FAQ/FAQformat.html#format3

Page 5: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

AxtBAMBEDBED detailbedGraphbarChart&bigBarChartbigBedbigGenePred tablebigPsl tablebigMaf tablebigChain tablebigNarrowPeak tablebigWigChain

CRAMGenePred tableGFFGTFHALHicInteract&bigInteractMAFMicroarrayNetPersonal Genome SNPPSLVCFWIG

General formatbroadPeak

gappedPeak

narrowPeak

pairedTagAlign

peptideMapping

RNA elements

tagAlign

Encode format2bit

fasta

fastQ

nib

Sequence

File formats listed on UCSC Genome Browser web sitehttps://genome.ucsc.edu/FAQ/FAQformat.html#format3

Page 6: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Column Name

1 Chromosome/contig2 Source of annotation3 Feature type4 Start position5 End position

GFF2, GFF3 and GTF

6 Score7 Strand ( + or -)8 Phase (0,1 or 2)9 Attributes or Group

9-column tab-delimited text file

chr1 Maker CDS 380 401 . + 0 gene_001

chr1 Maker CDS 501 650 . + 2 gene_001

chr1 Maker CDS 700 707 . + 2 gene_001

chr1 Maker start_codon 380 382 . + 0 gene_001

chr1 Maker stop_codon 708 710 . + 0 gene_001

Gff2 example:

Page 7: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

chr1 Maker CDS 380 401 . + 0 gene_001

chr1 Maker CDS 501 650 . + 2 gene_001

chr1 Maker CDS 700 707 . + 2 gene_001

chr1 Maker start_codon 380 382 . + 0 gene_001

chr1 Maker stop_codon 708 710 . + 0 gene_001

Gff2 example: Feature type

GFF2 vs GTFGroup

Feature type:

group:

GFF2: "CDS" "start_codon" "stop_codon" and "exon"

GTF: All GFF2 types plus optional types: 5UTR, 3UTR, inter, inter_CNS, and intron_CNS

GFF2: Single group attribute, not able to represent multiple splicing forms of the same gene

GTF: Two mandatory attributes: gene_id “gene_001"; transcript_id “gene_001.1"

Page 8: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN

chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1

chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001

chr1 Maker exon 3000 3902. + . ID=exon00003;Parent=mRNA00001

chr1 Maker CDS 1201 1500. + 0ID=cds00001;Parent=mRNA00001

chr1 Maker CDS 3000 3902. + 0ID=cds00001;Parent=mRNA00001

GFF3 vs GTF

GFF3: Two mandatory attributes: ID and Parent

GTF: Two mandatory attributes: gene_id “gene0001"; transcript_id

“mRNA00001"

GFF3 example:

Page 9: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

gffread, a tool to convert between GFF3 and GTF

gffread -E -T -o output.gtf input.gff3

gffread -E -G -o output.gff3 input.gtf

GFF3 -> GTF format

FTF -> GFF3 format

Page 10: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN

chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1

chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001

chr1 Maker exon 1050 1500. + .gene_id “gene0001"; transcript_id “mRNA00001“; gene_name "AT1G01010";

Lost information when converting from GFF3 to GTF

GFF3

GTF

gffread -E -G -o output.gff3 input.gtfmRNA feature created from

first exon

Page 11: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

gffread input.gff3 -g input.fasta -w transcript.fasta -x protein.fasta

Extract mRNA and protein sequences from the genomeusing gffread

• Chromosome names should match between files; • “Chr1”, “chr1” and “1” are different

Page 12: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

CDSantisense_RNAantisense_lncRNAexonfive_prime_UTRgenelnc_RNAmRNAmiRNAmiRNA_primary_transcriptncRNAproteinpseudogene

pseudogenic_exonpseudogenic_tRNApseudogenic_transcriptrRNAsnRNAsnoRNAtRNAthree_prime_UTRtranscript_regiontransposable_elementtransposable_element_genetransposon_fragmentuORF

zcat ara.gff.gz|cut -f3|sort |uniq

Chr1 Araport11 lnc_RNA 11101 11372 . + . ID=AT1G03987.1;Parent=AT1G03987;Name=AT1G03987.1Chr3 Araport11 tRNA 2918413 2918486 74 - . ID=AT3G09505.1;Parent=AT3G09505;Note=tRNA-ArgChr2 Araport11 transposable_element 2782038 2785706 . + .ID=AT2TE12270;Name=AT2TE12270;Alias=ATCOPIA75

Other features in gff files:

Page 13: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Column Name

1 Chromosome/contig2 Start position3 End position4 Name5 Score

BED

chr1 1000 5000 gene1 . + 1000 5000 0 2 567,488, 0,3512chr2 2000 6000 gene2 . - 2000 6000 0 2 433,399, 0,3601

6 Strand ( + or -)7-9 thickStart, thickEnd, itemRgb

10 blockCount11 blockSizes12 blockStarts

chr1 1000 5000chr2 2000 6000

3-column format

12-column format

Page 14: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

BED GFF3

chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN

chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1

chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001

chr1 999 9000 . . +

GFF3

BED

• Most software ignore the last 6 columns of BED

BED, bam0-start, half-open (0-based): first nucleotide: 0; last nucleotide: 100

GFF, GTF, vcf, genbank, sam, blast output1-start, fully-closed (1-based): first nucleotide: 1; last nucleotide: 100

Coordinate system for a 100-bp sequence)

Page 15: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Start/end positions of stranded DNA sequence:

chr1 0 100 . . +

chr1 0 100 . . -

Plus strand

Minus strand

chr1 1 100

chr1 100 1

BLAST output

Minus strand

BED

Plus strand Transcript start site in BED file:

Gene on plus strand gene: start position (add 1)

Gene on minus strand: end position

Page 16: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Stranded read position in SAM file format (1-based)

Leftmost alignment position on reference

Page 17: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Use LINUX “awk” to convert file format

chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN

chr1 999 9000 . . +

GFF3

BED

awk 'BEGIN { OFS = "\t" } {if ($3=="gene") print $1, $4-1, $5, ".", ".", $7}' input.gff3 > output.bed

Awk command:

Page 18: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

SAM/BAM file formats

HWUSI-EAS525_0042_FC:6:23:10200:18582#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT agafgfaffcfdf[fdcffcggggccfdffagggg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0HWUSI-EAS525_0042_FC:3:28:18734:20197#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hghhghhhhhhhhhhhhhhhhhghhhhhghhfhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0HWUSI-EAS525_0042_FC:3:94:1587:14299#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hfhghhhhhhhhhhghhhhhhhhhhhhhhhhhhhg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0

READ alignment to the reference genome

1 9 44 . 40 -1 9 44 . 40 -1 9 44 . 40 -

Chr start end name score strand

Convert to bed file:

Page 19: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Wiggle (wig) file format:

Page 20: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Sparse vs dense data profile

Dense

Sparse

Page 21: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Wiggle (wig), BedGraph, and bigwig file format:

fixedStep chrom=chr3 start=400601 step=100 span=5112233

variableStep chrom=chr3 span=5400601 11400701 22400801 33

Fixed step wig (1-based)

Variable step wig (1-based)

BedGraph (0-based)chr3 400600 400605 11chr3 400700 400705 22chr3 400800 400805 33

Good for dense data

Good for sparse data

Page 22: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

BigWig file format replacing the Wig format:

• Compressed and indexed binary format;

• For both sparse and dense data;

• Recommended format for genome browser;

Page 23: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Bedtools genomecovBam => bedgraph

Deeptools bamCoverageBam => bedgraph or bigwig

Kent Utility:Conversion between wig, bigwig,

Software for file conversion:

Page 24: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Public genomic databases

NCBI Refseq : NCBI gene annotation

Ensembl: Ensembl gene annotation

UCSC: Integrated with UCSC genome browser

* All three databases use the same genome assembly

Page 25: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Download from NCBI Refseq FTP site

• Genome fasta• GFF3• GTF• Protein sequence• Transcript sequence

Page 26: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Download from ENSEMBL FTP

Page 27: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

UCSC table browser

Page 28: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Ensembl Biomart:

Features:

Page 29: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Ensembl Biomart:

Ortholog in other species:

Page 30: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

NCBI

Ensembl

UCSC

JGI

General databases

Species specific databases

Drosophila: Flybase

C. elegans: Wormbase

Yeast: SGD

Maize: MaizeGDB

Arabidopsis: TAIR

Matching Gene/Transcript ID

Ensembl BioMart provides IDs from many different databases

Page 31: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Genome browsers

IGV – desktop application UCSC Genome browser– web based

Page 32: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Software running on your laptop/desktop

IGV – Integrated Genome Viewer

Page 33: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Select pre-loaded reference genome Create your own reference genome db

Genome fasta

GFF3 or GTF file

Page 34: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

Load data track

Page 35: Qi Sun, William Lai and Jeff Glaubitz - Cornell University · 2020. 11. 28. · Epigenomics data analysis 11/30/2020 – 12/18/2020 Qi Sun, William Lai and Jeff Glaubitz. Bioinformatics

View multiple tracks

View large data files, not need to upload