TRANSCRIPT
FROM ENCODE DATA TO ENCODE ANALYSES
J. Seth Strattan, PhD, ENCODE Data Coordinating Center (DCC), Asia Pacific Bioinformatics Conference, January 2016
J. Seth Strattan, PhD, ENCODE DCC
ENCODE: Metadata, Data, and Analyses
So far, you have learned
• The ENCODE Portal is the canonical source for ENCODE metadata and data.
• The Portal also documents ENCODE standards, such as antibody standards and data release.
• The Portal links to documentation and tutorials.
• How to use the Portal to browse and search what ENCODE has done.
Focus for the rest of the course
• Visualization of ENCODE data.
• Programmatic search and download of ENCODE metadata and data.
• ENCODE data analyses, and how you can replicate them.
Find an experiment
Use metadata to find data:
• Search for “H3K9ac neural tube”
• Facet on ChIP-seq; mouse; mm10 assembly
• Select an experiment, for example
https://www.encodeproject.org/experiments/ENCSR087PLZ/
• Note metadata on protocols, replicates
• Graph: files are related by processing steps
• Download from the graph or a list
• Click on “Visualize Data” to visualize the
results of this experiment.
Visualize the experiment
Adjust the browser settings to display fold-over-signal in “full”
Find several experiments
Use metadata to find data:
• Search for “H3K9ac neural tube”
• Facet on ChIP-seq; mouse; mm10 assembly
• Get a list of several experiments
• Click on “Visualize Data” to visualize
all the experiments matching this
search.
Visualize several experiments
Stage-dependent H3K9ac signal is present at Pax9 in neural tube at e11.5 and e13.5.
Find & download several experiments
Use metadata to find data:
• Search for “H3K9ac neural tube”
• Facet on ChIP-seq; mouse; mm10 assembly
• Get a list of several experiments
• Click on “Download” to download
selected metadata and complete
links to data.
Download several experiments
• “Download” produces a file with a list of
links to all the files for all the experiments in
your search.
• You can iterate through the list in your own
script.
• Or:
xargs -n 1 curl -O -L < files.txt
• The first link is to a file called metadata.tsv
that contains metadata you need to
interpret what each file is.
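If you prefer your own script to the xargs one-liner, the loop over the link list might look like the following minimal Python sketch. It assumes the list was saved as files.txt with one URL per line; the helper names (read_links, local_name, download_all) are invented for this illustration, and stripping surrounding quotes from each line is a defensive assumption, not documented Portal behaviour.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def read_links(text):
    """Return the non-empty lines of a files.txt-style link list."""
    links = []
    for line in text.splitlines():
        line = line.strip().strip('"')  # assumption: tolerate quoted lines
        if line:
            links.append(line)
    return links


def local_name(url):
    """Name the local file after the last path component, like `curl -O`."""
    return Path(urlparse(url).path).name


def download_all(links, dest="."):
    """Download every link into dest; urlopen follows redirects, like `curl -L`."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    for url in links:
        target = dest_dir / local_name(url)
        with urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        print("saved", target)


# Typical use, once files.txt is in the working directory:
#   download_all(read_links(Path("files.txt").read_text()))
```

Because the first link in the list is metadata.tsv, that file lands in the same directory as the data files and can be used to interpret them.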
Download several experiments
• metadata.tsv: Each line contains metadata on a file from the download package.
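Since metadata.tsv is tab-separated with one file per line, it can be read with a standard CSV parser. The sketch below is illustrative only: the column names used in the usage note ("File accession", "File format", "Output type") are assumptions about the header row, so check them against the header of your own metadata.tsv. The helper names are hypothetical.

```python
import csv
import io


def load_metadata(tsv_text):
    """Parse metadata.tsv text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def select_files(rows, criteria):
    """Keep the rows whose columns equal every value in `criteria`.

    `criteria` maps a column header (assumed to exist in the file)
    to the value that column must have.
    """
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]
```

For example, select_files(rows, {"File format": "bigWig"}) would keep only the signal tracks, assuming a "File format" column is present.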
Programmatic access via the ENCODE REST API
• All Portal content is accessible via URLs; just add ?format=json
• The database record is returned in JSON format
• JSON can be parsed in your language of choice
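As one possible illustration in Python, the sketch below builds a ?format=json URL for a Portal page and parses the returned record. The helper names are invented for this example. The search parameters (type, searchTerm) and the record fields read by summarize ("accession", "files") are assumptions about the API and should be checked against a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PORTAL = "https://www.encodeproject.org"


def json_url(path, params=None):
    """Build a Portal URL for `path` that requests the JSON representation."""
    query = dict(params or {})
    query["format"] = "json"
    return f"{PORTAL}/{path.strip('/')}/?{urlencode(query)}"


def fetch_json(url):
    """GET the URL and parse the response body as JSON (needs network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def summarize(record):
    """Return (accession, number of files); both field names are assumptions."""
    return record.get("accession"), len(record.get("files", []))


# The experiment used earlier in this course:
experiment_url = json_url("experiments/ENCSR087PLZ")
# A search equivalent to typing "H3K9ac neural tube" into the Portal
# (parameter names are assumptions):
search_url = json_url("search", {"type": "Experiment",
                                 "searchTerm": "H3K9ac neural tube"})
```

Calling summarize(fetch_json(experiment_url)) would then report the accession and how many files the record lists.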
The ENCODE Portal: Recap
• Interactive access to ENCODE metadata via faceted browsing and search
• Interactive retrieval of ENCODE data one file at a time
• Batch download of ENCODE metadata and data files
• Programmatic access using the ENCODE REST API
Next: ENCODE Data Analysis Pipelines
• What do they produce?
• How can they be run?
Pipelines Demonstration and Exercise
To set up an account: https://www.encodeproject.org/tutorials/apbc-2016/
Click “Prepare to run web-based pipelines”
Log in ->
DCC Delivers ENCODE Data
[Figure: Sample, Library, Primary Data (fastq reads), Processed Data]
AWS S3 Bucket ENCODE Files
ENCODE DCC Delivers ENCODE Metadata
ENCODE Analysis Pipelines as Deliverables
Goals:
1. Deploy ENCODE-defined pipelines for ChIP-seq, RNA-seq, DNase-seq, methylation.
2. Use those pipelines to generate the standard ENCODE peaks, quantitations, CpG.
3. Capture metadata to make clear what software, versions, parameters, inputs were used.
4. Capture, accession, and distribute the output.
5. Deliver exactly the same pipelines in a form that anyone can run on their data or with ENCODE data – one experiment or 1000.
Replicability – Provenance – Ease of Use – Scalability
Deployment Platform Considerations
Replicability – Provenance – Ease of Use – Scalability
We chose to deploy first to a web/cloud-based platform, DNAnexus
Code is open source and adaptable for deployment to your HPC environment: https://github.com/ENCODE-DCC
                       Develop   Share     Run       Elastic            Provenance  Cost
HPC Cluster (Scripts)  Hard      Hard      Hard      Cluster-Dependent  Moderate    Obscure/Subsidized
HPC Container          Hard      Moderate  Moderate  Cluster-Dependent  Good        Obscure/Subsidized
Web/Cloud              Moderate  Easy      Easy      Highly             Excellent   Apparent but Low
Schema: ENCODE ChIP-seq IDR Pipeline
[Figure: pipeline schema] Steps and outputs:
• fastq reads
• Map: BAM, BAI (processed, mapped reads)
• Pool Replicates
• Subsample Pseudoreplicates: 2 pseudoreplicates per replicate, 2 pseudoreplicates per pool
• Call Peaks: Peak Calls
• IDR (TF) or Overlap (Histone): IDR-thresholded/replicable Peak Calls
• Signal Tracks: bigWig
Target: TFs
• Key Software: bwa; Picard markDuplicates; samtools; MACS2 (signal tracks); SPP (PeakSeq, GEM future); IDR2
• Input Files: fastqs (SE or PE); two biological replicates; matched controls
• Output Files: one bam per replicate; bigWig fold signal over control; bigWig p-value signal over control; bed/bigBed true replicates peaks; bed/bigBed pooled replicates peaks; bed/bigBed IDR thresholded peaks
• QA Metrics: NRF (Non-redundant fraction); PBC1 and 2 (PCR bottleneck coefficients); number of distinct uniquely-mapping reads; NSC/RSC (Strand cross-correlation); IDR Rescue Ratio; IDR Self-Consistency Ratio; IDR Reproducibility Test
Target: Histone Mods
• Key Software: MACS2 for peaks; overlap thresholding; IDR2 (future)
• Output Files: bed/bigBed replicated peaks
https://github.com/ENCODE-DCC/chip-seq-pipeline
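To make the "Pool Replicates" and "Subsample Pseudoreplicates" steps concrete, here is a toy Python sketch, not the pipeline's actual code. It assumes reads are held in memory as a list, which is unrealistic for real BAM-scale data. Each pseudoreplicate is a random half of the reads, which gives 2 pseudoreplicates per replicate and 2 per pool. The function names and the fixed seed are choices made for this illustration.

```python
import random


def pool(*replicates):
    """Concatenate the reads of several replicates into one pooled set."""
    pooled = []
    for reads in replicates:
        pooled.extend(reads)
    return pooled


def make_pseudoreplicates(reads, seed=0):
    """Shuffle the reads and split them into two halves.

    A fixed seed keeps the split reproducible; the halves differ in
    size by at most one read.
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy data: read identifiers standing in for mapped reads.
rep1 = [f"rep1_read{i}" for i in range(10)]
rep2 = [f"rep2_read{i}" for i in range(11)]

rep1_pr1, rep1_pr2 = make_pseudoreplicates(rep1)          # per replicate
pool_pr1, pool_pr2 = make_pseudoreplicates(pool(rep1, rep2))  # per pool
```

Peaks would then be called on each replicate, the pool, and every pseudoreplicate, so that their agreement can be compared in the IDR or overlap step.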
Pipelines Demonstration and Exercise
To set up an account: https://www.encodeproject.org/tutorials/apbc-2016/
Log in -> Exercises
• Histone ChIP-seq
• RNA-seq
Uniformly Processed Data On the ENCODE Portal
Histone ChIP-seq Example
https://www.encodeproject.org/experiments/ENCSR087PLZ/
• Pipeline graph shows relationships between files
• Click on files to see more file metadata and download links
• Click on steps to see more software metadata and download links
Transcription Factor ChIP-seq Example
https://www.encodeproject.org/experiments/ENCSR077DKV/
• Same mapping, signal tracks and peak calls
• Also have the IDR-thresholded peak calls
• “Conservative” set, based on “true” replicates; “optimal” set if peaks can be rescued by pseudo-replication.
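The relationship between the conservative and optimal sets, and the IDR ratios listed among the QA metrics, can be sketched from four peak counts. The Python below is a hedged illustration, not the pipeline's code. It assumes n_true is the number of peaks passing IDR between true replicates, n_pooled_pseudo the number passing between pooled pseudoreplicates, and n_self1 and n_self2 the numbers passing within each replicate's own pseudoreplicates. The ratio threshold of 2 and the pass/borderline/fail labels are assumptions drawn from general ENCODE guidance, not from these slides.

```python
def ratio(a, b):
    """Return larger/smaller of two positive counts."""
    low, high = sorted((a, b))
    if low <= 0:
        raise ValueError("peak counts must be positive")
    return high / low


def idr_summary(n_true, n_pooled_pseudo, n_self1, n_self2, max_ratio=2.0):
    """Summarize replicate concordance from IDR-thresholded peak counts."""
    rescue = ratio(n_pooled_pseudo, n_true)
    self_consistency = ratio(n_self1, n_self2)
    # Count how many of the two ratios exceed the (assumed) threshold.
    failures = sum(r > max_ratio for r in (rescue, self_consistency))
    return {
        "conservative_set_size": n_true,
        # Pseudo-replication can "rescue" peaks: take the larger set.
        "optimal_set_size": max(n_true, n_pooled_pseudo),
        "rescue_ratio": rescue,
        "self_consistency_ratio": self_consistency,
        "reproducibility": ("pass", "borderline", "fail")[failures],
    }
```

With counts such as 20000 true-replicate peaks and 30000 pooled-pseudoreplicate peaks, the optimal set would be the larger, pseudo-replication-rescued one, while the conservative set stays at 20000.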
ENCODE ChIP-seq Quality Metrics: Resources
https://github.com/ENCODE-DCC/chip-seq-pipeline
Estimates, descriptions, and references:
• Depth: number of uniquely mapping reads; number of distinct uniquely mapping reads. Reference: Jung YL, et al. Nucleic Acids Research. 2014;42(9):e74
• Library Complexity: Non-Redundant Fraction; PCR Bottleneck Coefficient. Reference: Landt S, et al. Genome Res. 2012. 22: 1813-1831
• ChIP Quality: Normalized Strand Cross-Correlation; Relative Strand Cross-Correlation
• Replicate Concordance: IDR Rescue Ratio; IDR Self-Consistency Ratio; IDR Reproducibility Test. Reference: Li Q, et al. Annals Applied Statistics. 2011, Vol. 5, No. 3, 1752–1779
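The library complexity estimates can be illustrated with a small Python sketch. It uses the commonly cited definitions, stated here as assumptions: NRF is distinct mapped positions divided by total uniquely mapped reads, PBC1 is positions with exactly one read divided by distinct positions, and PBC2 is positions with exactly one read divided by positions with exactly two. Representing each read as a (chromosome, start, strand) tuple is a simplification chosen for this example.

```python
from collections import Counter


def library_complexity(read_positions):
    """Compute NRF, PBC1 and PBC2 from positions of uniquely mapped reads."""
    counts = Counter(read_positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads given")
    distinct = len(counts)                               # distinct positions
    m1 = sum(1 for c in counts.values() if c == 1)       # seen exactly once
    m2 = sum(1 for c in counts.values() if c == 2)       # seen exactly twice
    return {
        "total_reads": total,
        "distinct_reads": distinct,
        "NRF": distinct / total,
        "PBC1": m1 / distinct,
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


# Toy input: six reads over four positions, two of them duplicated.
toy_reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"), ("chr2", 40, "+"),
    ("chr3", 7, "-"),
]
toy_metrics = library_complexity(toy_reads)
```

Values near 1 for NRF and PBC1 indicate that few reads are duplicates of one another; heavily PCR-bottlenecked libraries push these values down.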
Schema: ENCODE WGBS Pipeline
Ben Hitz, PhD, ENCODE DCC
[Figure: pipeline schema] Steps and outputs, per replicate:
• FASTQ (SE/PE)
• Trim Reads
• Map (converted genome) with BISMARK (v 0.10): BAM (Bismark)
• Extract methyl calls: BigWigs, BigBEDs (.bb)
• Map to λ genome: QC metrics (non-bisulfite conversion rate)
Bed/BigBed files for:
• CG context
• CHG context
• CHH context
https://github.com/ENCODE-DCC/dna-me-pipeline
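The λ genome step is what makes a conversion QC metric possible: λ phage DNA is unmethylated, so any cytosine still called methylated in reads mapped to it reflects incomplete bisulfite conversion. A minimal Python sketch of that arithmetic follows; the function name and the per-context dictionary layout are assumptions for illustration, not the pipeline's actual output format.

```python
def non_conversion_rate(methylated_calls, unmethylated_calls):
    """Fraction of λ-genome cytosines that were NOT converted by bisulfite."""
    total = methylated_calls + unmethylated_calls
    if total == 0:
        raise ValueError("no cytosine calls on the λ genome")
    return methylated_calls / total


def overall_rate(calls_by_context):
    """Pool (methylated, unmethylated) counts over CG, CHG and CHH contexts."""
    methylated = sum(m for m, _ in calls_by_context.values())
    unmethylated = sum(u for _, u in calls_by_context.values())
    return non_conversion_rate(methylated, unmethylated)


# Hypothetical λ-genome call counts: (methylated, unmethylated) per context.
lambda_calls = {"CG": (10, 2490), "CHG": (8, 2492), "CHH": (12, 4988)}
lambda_rate = overall_rate(lambda_calls)
```

With the hypothetical counts above, 30 of 10000 cytosines remain unconverted, so lambda_rate comes out at about 0.003.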
Schema: ENCODE RNA-seq Pipeline
[Figure: pipeline schema] Steps and outputs, per replicate:
• FASTQ (SE/PE)
• Map Reads: BAM (STAR), BAM (tophat)
• Signal Tracks: BigWigs (.bw)
• Quantification: RSEM file
Across replicates:
• IDR/MAD: QC & filtered quantification
For each Mapper (STAR, tophat)
BAM files:
• mapped to genome
• mapped to transcriptome
BigWig files:
• plus/minus strand (paired)
• uniquely mapped
• multi+uniquely mapped
Quantifications (RSEM):
• genome
• transcriptome
https://github.com/ENCODE-DCC/long-rna-seq-pipeline
Uniformly Processed Data On the ENCODE Portal
RNA-seq Example
https://www.encodeproject.org/experiments/ENCSR368QPC/
• Pipeline graph shows relationships between files
• Click on files to see more file metadata and download links
• Click on steps to see more software metadata and download links
Results from the ChIP-‐seq exercise
Click “Download” to generate temporary URLs to the selected files
Visualize on the UCSC Genome Browser
Pipeline Workshop Summary
DCC Goals:
1. Deploy ENCODE-defined pipelines for ChIP-seq, RNA-seq, DNase-seq, methylation.
2. Use those pipelines to generate the standard ENCODE peaks, quantitations, CpG.
3. Capture metadata to make clear what software, versions, parameters, inputs were used.
4. Capture, accession, and distribute the output.
5. Deliver exactly the same pipelines in a form that anyone can run on their data or with ENCODE data – one experiment or 1000.
Replicability – Provenance – Ease of Use – Scalability
Contributors
ENCODE Data Coordinating Center
Mike Cherry, PI, Stanford
Jim Kent, co-PI, UCSC
Eurie Hong, Project Manager
Pipeline Developers: Ben Hitz (WGBS, Software Lead); Tim Dreszer (RNA-seq, DNase-seq); J. Seth Strattan (ChIP-seq)
Portal Developers: Laurence Rowe; Nikhil Podduturi; Forrest Tanaka
Data Wranglers: Esther Chan; Jean Davidson; Venkat Malladi; Cricket Sloan; J. Seth Strattan
QA & Biocuration Assistance: Brian Lee; Marcus Ho; Aditi Narayanan
Support Staff: Stuart Miyasato; Matt Simison; Zhenhua Wang
ENCODE Data Analysis Center
Zhiping Weng, PI, University of Massachusetts
Mark Gerstein, co-PI, Yale
Methylation: Junko Tsuji, U Mass; Eric Mendenhall, U Alabama, HAIB
RNA-seq: Alex Dobin, CSHL; Carrie Davis, CSHL; Rafael Irizarry, Harvard; Xintao Wei, UConn; Brent Graveley, UConn; Colin Dewey, U Wisconsin; Roderic Guigó, CRG; Sarah Djebali, CRG
ChIP-seq: Anshul Kundaje, Stanford; Nathan Boley, Stanford; Jin Lee, Stanford
DNAnexus
Mike Lin; Andrey Kislyuk; Singer Ma; Brett Hannigan; Ohad Rodeh; Joe Dale; George Asimenos
@encodedcc encode-[email protected] https://github.com/ENCODE-DCC/