qi sun, william lai and jeff glaubitz - cornell university · 2020. 11. 28. · epigenomics data...

Epigenomics data analysis

11/30/2020 – 12/18/2020

Qi Sun, William Lai and Jeff GlaubitzBioinformatics Facilty & Epigenomics Facility

• gene• CDS• mRNA• exon• five_prime_UTR• three_prime_UTR• rRNA• tRNA

GFF files

• ncRNA• tmRNA• transcript• mobile_genetic_element• origin_of_replication• promoter• repeat_region

Genome feature annotation &

Read enrichment pattern

Genome features (gff3 feature types)

Genome feature annotation & Epigenomicsdata analysis

• File formats

• Public databases

• Genome browser

• Bedtools

• Deeptools

• Homer

Week 1Genome features

• Peak calling

• Data QC

Week 2Peak calling

Week 3Integrated analysis

•Axt format•BAM format•BED format•BED detail format•bedGraph format•barChart and bigBarChart format•bigBed format•bigGenePred table format•bigPsl table format•bigMaf table format•bigChain table format•bigNarrowPeak table format•bigWig format•Chain format

•CRAM format•GenePred table format•GFF format•GTF format•HAL format•Hic format•Interact and bigInteract format•MAF format•Microarray format•Net format•Personal Genome SNP format•PSL format•VCF format•WIG format

File formats listed on UCSC web site

https://genome.ucsc.edu/FAQ/FAQformat.html#format3

https://genome.ucsc.edu/goldenPath/help/axt.html

https://genome.ucsc.edu/FAQ/FAQformat.html#format5.1












https://genome.ucsc.edu/goldenPath/help/chain.html










https://genome.ucsc.edu/goldenPath/help/net.html






AxtBAMBEDBED detailbedGraphbarChart&bigBarChartbigBedbigGenePred tablebigPsl tablebigMaf tablebigChain tablebigNarrowPeak tablebigWigChain

CRAMGenePred tableGFFGTFHALHicInteract&bigInteractMAFMicroarrayNetPersonal Genome SNPPSLVCFWIG

General formatbroadPeak

gappedPeak

narrowPeak

pairedTagAlign

peptideMapping

RNA elements

tagAlign

Encode format2bit

fasta

fastQ

nib

Sequence

File formats listed on UCSC Genome Browser web sitehttps://genome.ucsc.edu/FAQ/FAQformat.html#format3


Column Name

1 Chromosome/contig2 Source of annotation3 Feature type4 Start position5 End position

GFF2, GFF3 and GTF

6 Score7 Strand ( + or -)8 Phase (0,1 or 2)9 Attributes or Group

9-column tab-delimited text file

chr1 Maker CDS 380 401 . + 0 gene_001



chr1 Maker start_codon 380 382 . + 0 gene_001

chr1 Maker stop_codon 708 710 . + 0 gene_001

Gff2 example:




chr1 Maker start_codon 380 382 . + 0 gene_001

chr1 Maker stop_codon 708 710 . + 0 gene_001

Gff2 example: Feature type

GFF2 vs GTFGroup

Feature type:

group:

GFF2: "CDS" "start_codon" "stop_codon" and "exon"

GTF: All GFF2 types plus optional types: 5UTR, 3UTR, inter, inter_CNS, and intron_CNS

GFF2: Single group attribute, not able to represent multiple splicing forms of the same gene

GTF: Two mandatory attributes: gene_id “gene_001"; transcript_id “gene_001.1"

chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN

chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1

chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001


chr1 Maker CDS 1201 1500. + 0ID=cds00001;Parent=mRNA00001

chr1 Maker CDS 3000 3902. + 0ID=cds00001;Parent=mRNA00001

GFF3 vs GTF

GFF3: Two mandatory attributes: ID and Parent

GTF: Two mandatory attributes: gene_id “gene0001"; transcript_id

“mRNA00001"

GFF3 example:

gffread, a tool to convert between GFF3 and GTF

gffread -E -T -o output.gtf input.gff3

gffread -E -G -o output.gff3 input.gtf

GFF3 -> GTF format

FTF -> GFF3 format




chr1 Maker exon 1050 1500. + .gene_id “gene0001"; transcript_id “mRNA00001“; gene_name "AT1G01010";

Lost information when converting from GFF3 to GTF

GFF3

GTF

gffread -E -G -o output.gff3 input.gtfmRNA feature created from

first exon

gffread input.gff3 -g input.fasta -w transcript.fasta -x protein.fasta

Extract mRNA and protein sequences from the genomeusing gffread

• Chromosome names should match between files; • “Chr1”, “chr1” and “1” are different

CDSantisense_RNAantisense_lncRNAexonfive_prime_UTRgenelnc_RNAmRNAmiRNAmiRNA_primary_transcriptncRNAproteinpseudogene

pseudogenic_exonpseudogenic_tRNApseudogenic_transcriptrRNAsnRNAsnoRNAtRNAthree_prime_UTRtranscript_regiontransposable_elementtransposable_element_genetransposon_fragmentuORF

zcat ara.gff.gz|cut -f3|sort |uniq

Chr1 Araport11 lnc_RNA 11101 11372 . + . ID=AT1G03987.1;Parent=AT1G03987;Name=AT1G03987.1Chr3 Araport11 tRNA 2918413 2918486 74 - . ID=AT3G09505.1;Parent=AT3G09505;Note=tRNA-ArgChr2 Araport11 transposable_element 2782038 2785706 . + .ID=AT2TE12270;Name=AT2TE12270;Alias=ATCOPIA75

Other features in gff files:

Column Name

1 Chromosome/contig2 Start position3 End position4 Name5 Score

BED

chr1 1000 5000 gene1 . + 1000 5000 0 2 567,488, 0,3512chr2 2000 6000 gene2 . - 2000 6000 0 2 433,399, 0,3601

6 Strand ( + or -)7-9 thickStart, thickEnd, itemRgb

10 blockCount11 blockSizes12 blockStarts

chr1 1000 5000chr2 2000 6000

3-column format

12-column format

BED GFF3




chr1 999 9000 . . +

GFF3

BED

• Most software ignore the last 6 columns of BED

BED, bam0-start, half-open (0-based): first nucleotide: 0; last nucleotide: 100

GFF, GTF, vcf, genbank, sam, blast output1-start, fully-closed (1-based): first nucleotide: 1; last nucleotide: 100

Coordinate system for a 100-bp sequence)

Start/end positions of stranded DNA sequence:

chr1 0 100 . . +

chr1 0 100 . . -

Plus strand

Minus strand

chr1 1 100

chr1 100 1

BLAST output

Minus strand

BED

Plus strand Transcript start site in BED file:

Gene on plus strand gene: start position (add 1)

Gene on minus strand: end position

Stranded read position in SAM file format (1-based)

Leftmost alignment position on reference

Use LINUX “awk” to convert file format

chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN

chr1 999 9000 . . +

GFF3

BED

awk 'BEGIN { OFS = "\t" } {if ($3=="gene") print $1, $4-1, $5, ".", ".", $7}' input.gff3 > output.bed

Awk command:

SAM/BAM file formats

HWUSI-EAS525_0042_FC:6:23:10200:18582#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT agafgfaffcfdf[fdcffcggggccfdffagggg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0HWUSI-EAS525_0042_FC:3:28:18734:20197#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hghhghhhhhhhhhhhhhhhhhghhhhhghhfhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0HWUSI-EAS525_0042_FC:3:94:1587:14299#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hfhghhhhhhhhhhghhhhhhhhhhhhhhhhhhhg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0

READ alignment to the reference genome

1 9 44 . 40 -1 9 44 . 40 -1 9 44 . 40 -

Chr start end name score strand

Convert to bed file:

Wiggle (wig) file format:

Sparse vs dense data profile

Dense

Sparse

Wiggle (wig), BedGraph, and bigwig file format:

fixedStep chrom=chr3 start=400601 step=100 span=5112233

variableStep chrom=chr3 span=5400601 11400701 22400801 33

Fixed step wig (1-based)

Variable step wig (1-based)

BedGraph (0-based)chr3 400600 400605 11chr3 400700 400705 22chr3 400800 400805 33

Good for dense data

Good for sparse data

BigWig file format replacing the Wig format:

• Compressed and indexed binary format;

• For both sparse and dense data;

• Recommended format for genome browser;

Bedtools genomecovBam => bedgraph

Deeptools bamCoverageBam => bedgraph or bigwig

Kent Utility:Conversion between wig, bigwig,

Software for file conversion:

Public genomic databases

NCBI Refseq : NCBI gene annotation

Ensembl: Ensembl gene annotation

UCSC: Integrated with UCSC genome browser

* All three databases use the same genome assembly

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Download from NCBI Refseq FTP site

• Genome fasta• GFF3• GTF• Protein sequence• Transcript sequence

Download from ENSEMBL FTP

UCSC table browser

Ensembl Biomart:

Features:

Ensembl Biomart:

Ortholog in other species:

NCBI

Ensembl

UCSC

JGI

General databases

Species specific databases

Drosophila: Flybase

C. elegans: Wormbase

Yeast: SGD

Maize: MaizeGDB

Arabidopsis: TAIR

Matching Gene/Transcript ID

Ensembl BioMart provides IDs from many different databases

Genome browsers

IGV – desktop application UCSC Genome browser– web based

Software running on your laptop/desktop

IGV – Integrated Genome Viewer

Select pre-loaded reference genome Create your own reference genome db

Genome fasta

GFF3 or GTF file

Load data track

View multiple tracks

View large data files, not need to upload

qi sun, william lai and jeff glaubitz - cornell university · 2020. 11. 28. · epigenomics data...

Documents