rbp database: the encode eclip resource for rna binding protein … · each step of rna processing...

35
RBP database: the ENCODE eCLIP resource for RNA binding protein targets Eric Van Nostrand [email protected] Yeo Lab, UCSD 06/08/2016

Upload: buidieu

Post on 09-May-2019

232 views

Category:

Documents


1 download

TRANSCRIPT

RBP database: the ENCODE eCLIP resource for RNA binding protein targets

[email protected]

YeoLab,UCSD06/08/2016

ImageadaptedfromGenomeResearchLimited

Each step of RNA processing is highly regulated

StephanieHuelga

RNAbindingproteins(RBPs)actastransfactorstoregulateRNAprocessingsteps

EsOmated>1000RBPsinhuman

RNAprocessingplayscriOcalrolesindevelopmentandhumanphysiology

MutaOonoralteraOonofRNAbindingproteinsplayscriOcalrolesindisease

250 RNA Binding Proteins

CLIP-Seq(ChIP-Seq) Bind-N-Seq

RNAi &RNA-Seq

Yeo

Fu Graveley

Burge

K562 & HepG2 cells

ENCORE

ENCORE: ENCODE RNA regulaAon group

Lcuyer

RBP Localization

RBP Data ProducAon Overview (Released data only as of 6/8/16)

1,303Completed/ReleasedExperiments

6920456

27489

2024048

eCLIP-SeqRNAi/RNA-SeqChIP-SeqImagingeCLIP-SeqRNAi/RNA-SeqChIP-SeqRNABind-N-Seq

HepG

2K5

62

344RNABindingProteins

Outline

eCLIPoverview Methodoutline ENCODEsubmi_eddatastructure ENCODEeCLIPpipelinewalkthrough

Whatkindsofanalysescanbedone?

Toolscomingsoon

IdenOficaOonofRNAbindingproteintargetsbyeCLIP-seq

High-throughputsequencing

Dataprocessing&peakcalling

eCLIP computaAonal pipeline

PEfastqfiles

(2x50bp)

Adaptertrimmedfastq

Adaptertrimming

Cutadaptx2

RepeOOveelementremoval

STARmaptomodifiedrepBase

Repeatelementmapping

PEmappingbamfile

Genomemapping

PESTARmapvshg19+SJdb

PEmapping,dup-removed

bamfile

PCRduplicateremoval

CustomscriptnowbasedoffbothPEreads+randommer

Peaks

R2onlymapped,rmDupbamfile

InputnormalizaOon

Customscript

Uniquelymappedreads

Usablereads

Peakcalling

CLIPper(usesR2only)

Repeat-removedfastq

Input-normalized

Peaks

PEfastqfiles

(2x50bp)

Adaptertrimmedfastq

Adaptertrimming

Cutadaptx2

RepeOOveelementremoval

STARmaptomodifiedrepBase

Repeatelementmapping

PEmappingbamfile

Genomemapping

PESTARmapvshg19+SJdb

PEmapping,dup-removed

bamfile

PCRduplicateremoval

CustomscriptnowbasedoffbothPEreads+randommer

Peaks

R2onlymapped,rmDupbamfile

Input-normalized

Peaks

InputnormalizaOon

Customscript

Uniquelymappedreads

Usablereads

Peakcalling

CLIPper(usesR2only)

Repeat-removedfastq

FilesavailableonDCC

eCLIP computaAonal pipeline

Biosample1

eCLIPReplicate1

Size-matchedinput

Biosample2

eCLIPReplicate2

R1+R2fastqfiles

Paired-endmapping(STAR)

Input-normalizedpeaks

PEfastqfiles

(2x50bp)

Adaptertrimmedfastq

Adaptertrimming

Cutadaptx2

RepeOOveelementremoval

STARmaptomodifiedrepBase

Repeatelementmapping

PEmappingbamfile

Genomemapping

PESTARmapvshg19+SJdb

PEmapping,dup-removed

bamfile

PCRduplicateremoval

CustomscriptnowbasedoffbothPEreads+randommer

Peaks

R2onlymapped,rmDupbamfile

Input-normalized

Peaks

InputnormalizaOon

Customscript

Uniquelymappedreads

Usablereads

Peakcalling

CLIPper(usesR2only)

Repeat-removedfastq

eCLIP computaAonal pipeline

AnalysisSOPavailableat:https://www.encodeproject.org/documents/dde0b669-0909-4f8b-946d-3cb9f35a6c52/@@download/attachment/eCLIP_analysisSOP_v1.P.pdf

Linked at boLom of each eCLIP experiment:

DemulAplexing (already has been done for files on ENCODE DCC)

File details: fastq files

DATASET.R1.fastq.gz: @CCAAC:SN1001:449:HGTN3ADXX:1:1101:1373:1964 1:N:0:1 CAAATGCCCCTGAGGACAAAGCTGCTGCCGGGCCTCTCTCTCTG + FFFFFFIIFIIIFIIFIFIFIIIIIIIIIIIIIIIIIIIIIIFI @CAGAT:SN1001:449:HGTN3ADXX:1:1101:1669:1914 1:N:0:1 TTAGAGACAGGGTCTCGCTCCGTTGCTCAGGCTGGAGTGCAGTG + FFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII ...

DATASET.R2.fastq.gz: @CCAAC:SN1001:449:HGTN3ADXX:1:1101:1373:1964 2:N:0:1 GAGAGAGGAGTGGGAAGTTGGGATAGTACCCAGAGAGAGAGGCCCG + FFFFFBFFBFBFFFFFIFFFIFFIFIIIIIIFIIIIFFIFIIFFIF @CAGAT:SN1001:449:HGTN3ADXX:1:1101:1669:1914 2:N:0:1 TTGTACCACTGCACTCCAGCCTGAGCAACGGAGCGAGACCCTGTCT + FFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIII ...

@CCAAC=random-mer(first5or10ntofsequencedread2)hasbeenremovedfromthe5endofread2andappendedtoreadname

Anyin-linebarcodehasbeenremoved(aspartofdemulOplexing)

Adaptor trimming:

Adaptor trimming:

KeyconsideraOonweveobservedthatadaptor-concatamerfragments(evenatextremelylowfrequency)yieldhigh-scoringeCLIPpeaks

Difficulttotrimallwithonepass Cutadapt(bydefault)willmissadaptorswith5truncaOons

Toavoidthis,weerronthesideofover-trimming

RepeAAve element removal MajorityofRNAinmostcellsarerRNA/tRNA/repeats ThesecanmapandcausestrangearOfacts(parOcularlyrRNA,asa40ntrRNAreadwith1or2sequencingerrorscanmapuniquelytooneofthevariousrRNApseudogenesinthegenome)

ToavoidfalseposiOves,weFIRSTmapallreadsagainstaRepBasedatabase,andonlytakereadsthatremainunmappedforfurtherprocessing

Mapping to human genome

Weperformpaired-endmappingwithSTARtothehumangenomeplussplicejuncOondatabase,keepingonlyuniquelymappedreads

PCR duplicate removal Next,wecomparereadsthatmaptothesamelocaOon(basedonthemappedstartofR1andstartofR2)basedontheirrandom-mersequence

IftworeadsmaptothesameposiOonandhavethesamerandom-mer,oneisdiscarded

Input:bamfilecontainingonlyuniquelymappedreads Output:bamfilecontainingonlyUsable(uniquelymapped,non-PCRduplicate)reads

eCLIP significantly decreases PCR duplicaAon rate

File details: bam files

CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989 147 chr1 14771 255 43M = 14681 -133 CACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGT B

Peak callingStep1)IniOalclusteridenOficaOonwithCLIPper(spline-fisngwithtranscript-levelbackgroundnormalizaOon)

Step2)Compareclustersagainstsize-matchedinput

Step3)Compressclusters(asCLIPperistranscript-level,itcanoccasionallycalloverlappingpeaksthisstepiteraOvelyremovesoverlappingpeaksbykeepingtheonewithgreaterenrichmentaboveinput)

Why input normalize?

WeseemRNAbackgroundatnearlyallabundantgenes

buttruesignalishighlyenrichedabovethisbackground

Input normalizaAon removes false-posiAves and idenAfies confident binding sites

File details: bed narrowPeak (input-normalized peaks)

track type=narrowPeak visibility=3 db=hg19 name="RBFOX2_HepG2_rep01" description="RBFOX2_HepG2_rep01 input-normalized peaks"

Chr7 4757099 4757219 RBFOX2_HepG2_rep01 1000 + 6.539331235 400 -1 -1

Chr7 99949578 99949652 RBFOX2_HepG2_rep01 1000 + 5.233511963 400 -1 -1

Chr7 1027402 1027481 RBFOX2_HepG2_rep01 1000 + 5.243129966 69.5293984 -1 -1

chr \t start \t stop \t dataset_label \t 1000 \t strand \t log2(eCLIP fold-enrichment over size-matched input) \t -log10(eCLIP vs size-matched input p-value) \t -1 \t -1

Note:p-valueiscalculatedbyFishersExacttest(minimump-value2.2x10-16),withchi-squaretest(log10(p-value)setto400ifp-valuereported==0)

Ourtypicalstringentcutoffs:require-log10(p-value)5andlog2(fold-enrichment)3

What can we do with the eCLIP database?

Individual RBP analyses

RBFOX2Nucleoli

eCLIPanalysis RBPlocalizaOon

IntegraOonwithknockdownRNA-seq

An RNA-centric view of RBP-binding

in silico screen of a desired RNA against all CLIP datasets to idenAfy the best-binding RBPs

Integrated global views of RBP binding

Tools available soon (next few months):

eCLIPprocessingpipelineonDNANexus(shouldbeready~July)

FollowedquicklybyIDR&q/cmetricsforvalidaOngyourowneCLIPdatasets

RNA-centricbrowser(websiteatalphastagenow)

AllowuserstoqueryRNAsorgenomicregionsofinterestagainstourENCODEeCLIPdatabase

IntegraOonwithENCODEencyclopedia

Factorbook-likesummariesforeachRBP

Acknowledgements

Funding:

GeneYeoBrentGraveleyChrisBurgeEricLcuyerXiang-DongFu

ComputaOonal:GabrielPra_EricVanNostrandShashankSatheBrianYee

Experimental:EricVanNostrandStevenBlueThaiNguyenChelseaGelboin-BurkhartRuthWangInesRabanoAlumni:BalajiSundararamanKeriElkinsRebeccaStanton