qi sun, william lai and jeff glaubitz - cornell university · 2020. 11. 28. · epigenomics data...
TRANSCRIPT
Epigenomics data analysis
11/30/2020 – 12/18/2020
Qi Sun, William Lai and Jeff GlaubitzBioinformatics Facilty & Epigenomics Facility
• gene• CDS• mRNA• exon• five_prime_UTR• three_prime_UTR• rRNA• tRNA
GFF files
• ncRNA• tmRNA• transcript• mobile_genetic_element• origin_of_replication• promoter• repeat_region
Genome feature annotation &
Read enrichment pattern
Genome features (gff3 feature types)
Genome feature annotation & Epigenomicsdata analysis
• File formats
• Public databases
• Genome browser
• Bedtools
• Deeptools
• Homer
Week 1Genome features
• Peak calling
• Data QC
Week 2Peak calling
Week 3Integrated analysis
•Axt format•BAM format•BED format•BED detail format•bedGraph format•barChart and bigBarChart format•bigBed format•bigGenePred table format•bigPsl table format•bigMaf table format•bigChain table format•bigNarrowPeak table format•bigWig format•Chain format
•CRAM format•GenePred table format•GFF format•GTF format•HAL format•Hic format•Interact and bigInteract format•MAF format•Microarray format•Net format•Personal Genome SNP format•PSL format•VCF format•WIG format
File formats listed on UCSC web site
https://genome.ucsc.edu/FAQ/FAQformat.html#format3
AxtBAMBEDBED detailbedGraphbarChart&bigBarChartbigBedbigGenePred tablebigPsl tablebigMaf tablebigChain tablebigNarrowPeak tablebigWigChain
CRAMGenePred tableGFFGTFHALHicInteract&bigInteractMAFMicroarrayNetPersonal Genome SNPPSLVCFWIG
General formatbroadPeak
gappedPeak
narrowPeak
pairedTagAlign
peptideMapping
RNA elements
tagAlign
Encode format2bit
fasta
fastQ
nib
Sequence
File formats listed on UCSC Genome Browser web sitehttps://genome.ucsc.edu/FAQ/FAQformat.html#format3
Column Name
1 Chromosome/contig2 Source of annotation3 Feature type4 Start position5 End position
GFF2, GFF3 and GTF
6 Score7 Strand ( + or -)8 Phase (0,1 or 2)9 Attributes or Group
9-column tab-delimited text file
chr1 Maker CDS 380 401 . + 0 gene_001
chr1 Maker CDS 501 650 . + 2 gene_001
chr1 Maker CDS 700 707 . + 2 gene_001
chr1 Maker start_codon 380 382 . + 0 gene_001
chr1 Maker stop_codon 708 710 . + 0 gene_001
Gff2 example:
chr1 Maker CDS 380 401 . + 0 gene_001
chr1 Maker CDS 501 650 . + 2 gene_001
chr1 Maker CDS 700 707 . + 2 gene_001
chr1 Maker start_codon 380 382 . + 0 gene_001
chr1 Maker stop_codon 708 710 . + 0 gene_001
Gff2 example: Feature type
GFF2 vs GTFGroup
Feature type:
group:
GFF2: "CDS" "start_codon" "stop_codon" and "exon"
GTF: All GFF2 types plus optional types: 5UTR, 3UTR, inter, inter_CNS, and intron_CNS
GFF2: Single group attribute, not able to represent multiple splicing forms of the same gene
GTF: Two mandatory attributes: gene_id “gene_001"; transcript_id “gene_001.1"
chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001
chr1 Maker exon 3000 3902. + . ID=exon00003;Parent=mRNA00001
chr1 Maker CDS 1201 1500. + 0ID=cds00001;Parent=mRNA00001
chr1 Maker CDS 3000 3902. + 0ID=cds00001;Parent=mRNA00001
GFF3 vs GTF
GFF3: Two mandatory attributes: ID and Parent
GTF: Two mandatory attributes: gene_id “gene0001"; transcript_id
“mRNA00001"
GFF3 example:
gffread, a tool to convert between GFF3 and GTF
gffread -E -T -o output.gtf input.gff3
gffread -E -G -o output.gff3 input.gtf
GFF3 -> GTF format
FTF -> GFF3 format
chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001
chr1 Maker exon 1050 1500. + .gene_id “gene0001"; transcript_id “mRNA00001“; gene_name "AT1G01010";
Lost information when converting from GFF3 to GTF
GFF3
GTF
gffread -E -G -o output.gff3 input.gtfmRNA feature created from
first exon
gffread input.gff3 -g input.fasta -w transcript.fasta -x protein.fasta
Extract mRNA and protein sequences from the genomeusing gffread
• Chromosome names should match between files; • “Chr1”, “chr1” and “1” are different
CDSantisense_RNAantisense_lncRNAexonfive_prime_UTRgenelnc_RNAmRNAmiRNAmiRNA_primary_transcriptncRNAproteinpseudogene
pseudogenic_exonpseudogenic_tRNApseudogenic_transcriptrRNAsnRNAsnoRNAtRNAthree_prime_UTRtranscript_regiontransposable_elementtransposable_element_genetransposon_fragmentuORF
zcat ara.gff.gz|cut -f3|sort |uniq
Chr1 Araport11 lnc_RNA 11101 11372 . + . ID=AT1G03987.1;Parent=AT1G03987;Name=AT1G03987.1Chr3 Araport11 tRNA 2918413 2918486 74 - . ID=AT3G09505.1;Parent=AT3G09505;Note=tRNA-ArgChr2 Araport11 transposable_element 2782038 2785706 . + .ID=AT2TE12270;Name=AT2TE12270;Alias=ATCOPIA75
Other features in gff files:
Column Name
1 Chromosome/contig2 Start position3 End position4 Name5 Score
BED
chr1 1000 5000 gene1 . + 1000 5000 0 2 567,488, 0,3512chr2 2000 6000 gene2 . - 2000 6000 0 2 433,399, 0,3601
6 Strand ( + or -)7-9 thickStart, thickEnd, itemRgb
10 blockCount11 blockSizes12 blockStarts
chr1 1000 5000chr2 2000 6000
3-column format
12-column format
BED GFF3
chr1 Maker gene 1000 9000. + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000. + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500. + . ID=exon00002;Parent=mRNA00001
chr1 999 9000 . . +
GFF3
BED
• Most software ignore the last 6 columns of BED
BED, bam0-start, half-open (0-based): first nucleotide: 0; last nucleotide: 100
GFF, GTF, vcf, genbank, sam, blast output1-start, fully-closed (1-based): first nucleotide: 1; last nucleotide: 100
Coordinate system for a 100-bp sequence)
Start/end positions of stranded DNA sequence:
chr1 0 100 . . +
chr1 0 100 . . -
Plus strand
Minus strand
chr1 1 100
chr1 100 1
BLAST output
Minus strand
BED
Plus strand Transcript start site in BED file:
Gene on plus strand gene: start position (add 1)
Gene on minus strand: end position
Stranded read position in SAM file format (1-based)
Leftmost alignment position on reference
Use LINUX “awk” to convert file format
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN
chr1 999 9000 . . +
GFF3
BED
awk 'BEGIN { OFS = "\t" } {if ($3=="gene") print $1, $4-1, $5, ".", ".", $7}' input.gff3 > output.bed
Awk command:
SAM/BAM file formats
HWUSI-EAS525_0042_FC:6:23:10200:18582#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT agafgfaffcfdf[fdcffcggggccfdffagggg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0HWUSI-EAS525_0042_FC:3:28:18734:20197#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hghhghhhhhhhhhhhhhhhhhghhhhhghhfhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0HWUSI-EAS525_0042_FC:3:94:1587:14299#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hfhghhhhhhhhhhghhhhhhhhhhhhhhhhhhhg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
READ alignment to the reference genome
1 9 44 . 40 -1 9 44 . 40 -1 9 44 . 40 -
Chr start end name score strand
Convert to bed file:
Wiggle (wig) file format:
Sparse vs dense data profile
Dense
Sparse
Wiggle (wig), BedGraph, and bigwig file format:
fixedStep chrom=chr3 start=400601 step=100 span=5112233
variableStep chrom=chr3 span=5400601 11400701 22400801 33
Fixed step wig (1-based)
Variable step wig (1-based)
BedGraph (0-based)chr3 400600 400605 11chr3 400700 400705 22chr3 400800 400805 33
Good for dense data
Good for sparse data
BigWig file format replacing the Wig format:
• Compressed and indexed binary format;
• For both sparse and dense data;
• Recommended format for genome browser;
Bedtools genomecovBam => bedgraph
Deeptools bamCoverageBam => bedgraph or bigwig
Kent Utility:Conversion between wig, bigwig,
Software for file conversion:
Public genomic databases
NCBI Refseq : NCBI gene annotation
Ensembl: Ensembl gene annotation
UCSC: Integrated with UCSC genome browser
* All three databases use the same genome assembly
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
Download from NCBI Refseq FTP site
• Genome fasta• GFF3• GTF• Protein sequence• Transcript sequence
Download from ENSEMBL FTP
UCSC table browser
Ensembl Biomart:
Features:
Ensembl Biomart:
Ortholog in other species:
NCBI
Ensembl
UCSC
JGI
General databases
Species specific databases
Drosophila: Flybase
C. elegans: Wormbase
Yeast: SGD
Maize: MaizeGDB
Arabidopsis: TAIR
Matching Gene/Transcript ID
Ensembl BioMart provides IDs from many different databases
Genome browsers
IGV – desktop application UCSC Genome browser– web based
Software running on your laptop/desktop
IGV – Integrated Genome Viewer
Select pre-loaded reference genome Create your own reference genome db
Genome fasta
GFF3 or GTF file
Load data track
View multiple tracks
View large data files, not need to upload