ChIP: Quick Overview. Chromatin Immunoprecipitation · 11/21/13


2012 - BMMB 597D: Analyzing Next Generation Sequencing Data

Week 13, Lecture 26

István Albert

Biochemistry and Molecular Biology and Bioinformatics Consulting Center

Penn State

Protein-DNA interactions: ChIP-Chip and ChIP-Seq studies

(many other techniques where DNA is isolated in some manner)

•  ChIP → Chromatin Immuno-Precipitation (refers to sample preparation)
•  Chip → microarray technology to detect bound genomic locations
•  Seq → high throughput sequencing to detect bound genomic locations

Chromatin Immuno-Precipitation

It is a well-known methodology for detecting protein-DNA interactions:

•  transcription factor binding
•  polymerase binding
•  chromatin structure and modifications

ChIP: Quick Overview

[Figure: double-stranded DNA with bound proteins P1 and P2, followed through the steps below; the antibody (IP) step isolates P1.]

•  Crosslink bound proteins
•  Fragment/digest DNA around bound locations
•  Isolate with protein-specific antibody
•  Reverse crosslink (separate)

This is the DNA fragment that gets sequenced


Sample origins

Understanding the sample preparation is essential for analysis

•  WGS, whole genome sequencing (shotgun) → random DNA fragments covering the entire genome
•  ChIP-Seq → DNA fragments covering “isolated” locations in the genome

The ChIP output

•  a DNA sample enriched for fragments associated with the events under study
•  BUT we are measuring an ensemble of cells that may be in different states!
•  Coverage depends on the number of sites and the efficiency of the IP (precipitation) step.
•  Fragment accuracy depends on the fragmentation strategy: sonication, MNase digestion, lambda-nuclease digestion

Note: lots of other DNA fragments can make it through as well!

ChIP-Seq → High throughput sequencing

•  Fragments are sequenced
•  Aligned against the genome (Bowtie is a good choice for ChIP-Seq)
•  The output is in the form of intervals (start/end) where each read matches the genome

Sequencing process

[Figure: original two-stranded DNA fragment, with the + and - strands labeled.]

Sequencing proceeds from 5' to 3'. This is where we get reads from.

The original DNA fragment could be longer/shorter than the length of the read


Other techniques: ChIP-exo

Tiny footprint. Note how the reads may be longer than the bound location

Type of events of interest

•  100s of bases
•  ~20-40 bases
•  ~10 bases

And of course various fine scale settings: translational

WOR - The World of Read counting

•  Aligners → base match
•  Mappers → start/end location

A correct mapping could still be an incorrect alignment.

•  Read counts → estimates of abundance, binding, occupancy, gene expression
•  Typically we prefer shorter and more numerous reads → better statistical power

Mapping to the genome

What we really need are just:

chrom, start, end, strand

The 5' end location of each fragment corresponds to

•  the start coordinate for the + strand
•  the end coordinate for the - strand

ChIP-Seq is about locations. The 5' end matters!

The rest of the read is needed only to place the read
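The start/end rule above can be sketched in a few lines of Python. This is only an illustration: the `Read` tuple, the `five_prime_end` helper and the two example reads are hypothetical, and the coordinates are assumed to be 1-based and inclusive, so that `end` is itself the last aligned base.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """A mapped read reduced to the four fields named on the slide."""
    chrom: str
    start: int   # leftmost aligned base (assumed 1-based, inclusive)
    end: int     # rightmost aligned base (assumed 1-based, inclusive)
    strand: str  # "+" or "-"


def five_prime_end(read: Read) -> int:
    """Return the genomic coordinate of the read's 5' end."""
    if read.strand == "+":
        return read.start  # + strand: the 5' end is the start coordinate
    if read.strand == "-":
        return read.end    # - strand: the 5' end is the end coordinate
    raise ValueError(f"unknown strand: {read.strand!r}")


# Hypothetical reads, made up for illustration only.
reads = [
    Read("chr1", 100, 135, "+"),
    Read("chr1", 180, 215, "-"),
]
positions = [five_prime_end(r) for r in reads]
print(positions)
```

With other coordinate conventions (for example 0-based, half-open intervals) the - strand case would need an off-by-one adjustment.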


It is about position not coverage!

Most misused concept: position vs. coverage

0 0 0 0 0 1 1 1 1 1 4 4 4 4 4 4 4 5 5 6 6 6 6 5 5 5 5 5 2 2 2 2 2 2 2 1 1 0 0 0 0

Figure labels: maximal coverage; average start.
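A small sketch may help show why the two notions differ. The reads below are invented for the example (they are not the data behind the slide), all are assumed to lie on the + strand, and intervals are assumed to be 0-based and half-open.

```python
def coverage(intervals, length):
    """Per-base coverage: number of read intervals overlapping each base."""
    depth = [0] * length
    for start, end in intervals:  # assumed 0-based, half-open [start, end)
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# Hypothetical + strand reads (start, end), invented for this example.
intervals = [(2, 8), (3, 9), (3, 9), (4, 10)]

depth = coverage(intervals, length=12)
max_depth = max(depth)
max_positions = [pos for pos, d in enumerate(depth) if d == max_depth]

# For + strand reads the 5' end is the start coordinate.
mean_start = sum(start for start, _ in intervals) / len(intervals)

print("coverage:        ", depth)
print("maximal coverage:", max_positions)
print("average start:   ", mean_start)
```

In this toy case the maximal coverage falls on bases 4 to 7, while the average 5' start is 3.0: the coverage maximum is shifted into the body of the reads and away from the position the reads actually report.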

Other considerations

•  For single-end sequencing each fragment may correspond to 0, 1 or 2 reads. If it has two reads, we don't know which two formed the fragment.
•  For paired-end sequencing each fragment corresponds to 0, 1 or 2 reads (1 when a read has no mate). We know which two reads correspond to one another → fragment size estimation.
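As a rough sketch of fragment size estimation from paired-end data: once the two mates of a pair are known, the fragment spans from the leftmost start to the rightmost end. The `fragment_size` helper and the pairs below are hypothetical, with 0-based, half-open coordinates assumed.

```python
from statistics import median


def fragment_size(mate1, mate2):
    """Span covered by a read pair, from leftmost start to rightmost end."""
    (start1, end1), (start2, end2) = mate1, mate2
    return max(end1, end2) - min(start1, start2)


# Hypothetical mate pairs ((start, end), (start, end)), for illustration only.
pairs = [
    ((100, 136), (250, 286)),
    ((400, 436), (520, 556)),
    ((900, 936), (1060, 1096)),
]

sizes = [fragment_size(m1, m2) for m1, m2 in pairs]
print("fragment sizes:", sizes)
print("median size:   ", median(sizes))
```

For single-end data no such pairing is available, so the fragment size has to be inferred indirectly, for instance from the offset between + and - strand read pileups.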

Peak Calling

•  Process of finding the locations enriched due to events of interest

We will need to define

•  Peak Region - contiguous set of basepairs that belong to a peak
•  Enrichment Level - read-based measure of supporting evidence

Practical exercise (also homework)

•  We have two datasets → the same binding factor was simulated as if it had a short or a long footprint (long.fq, short.fq)
•  We will visualize and investigate these: detect bound locations, fragment size, peak locations etc. with one or more tools


The alignment script

•  Use the files called short.fq and long.fq
•  Create an alignment relative to the yeast genome
•  Visualize the alignment in IGV

Visualize in IGV

[Figure: IGV view of the alignment showing strand bias.]

Deduplication

•  A distinction needs to be made between natural vs. artificial (PCR) duplicates
•  There is no obvious consensus: the more accurate the method, the more likely that we have natural duplicates
•  Look for obvious flaws (strand biases); paired-end sequencing helps identify artificial duplicates

[Figure: deduplicated reads.]
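One very simple deduplication strategy, sketched here under stated assumptions, keeps a single read per (chrom, 5' position, strand). This treats every duplicate as artificial, which is exactly the simplification the slide warns about, so it should be read as an illustration only. The `deduplicate` helper and the reads are hypothetical; coordinates are assumed 1-based and inclusive.

```python
from collections import Counter


def deduplicate(reads):
    """Keep the first read seen for each (chrom, 5' position, strand)."""
    seen = set()
    kept = []
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        key = (chrom, five_prime, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, end, strand))
    return kept


# Hypothetical reads (chrom, start, end, strand), for illustration only.
reads = [
    ("chr1", 100, 135, "+"),
    ("chr1", 100, 135, "+"),  # same 5' end and strand: dropped
    ("chr1", 100, 140, "+"),  # same 5' end, different length: dropped
    ("chr1", 180, 215, "-"),
    ("chr1", 175, 215, "-"),  # same 5' end on the - strand: dropped
    ("chr1", 300, 335, "+"),
]

kept = deduplicate(reads)
strand_counts = Counter(strand for _, _, _, strand in kept)
print(len(reads), "reads ->", len(kept), "after deduplication")
print("reads per strand:", dict(strand_counts))
```

The per-strand count at the end is a crude check for the strand bias mentioned above: a large imbalance between + and - reads around a site is a hint that something went wrong.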


“Poor man's” peak predictor

“Poor man's” peak prediction

bioawk tools on GitHub

Homework 26

•  Using the data and scripts found in the file lec26.tar.gz on the webpage, produce the plot seen on the previous slide.
•  How many peaks can you find for a coverage threshold of 10?
•  How sensitive is the peak prediction (number of peaks found) to the threshold?
•  Modify the awk script to generate 1 base long intervals that indicate the midpoint of the peaks.
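The idea behind a threshold-based peak predictor can be sketched in Python. This is not the course's awk script (that lives in lec26.tar.gz); it is a minimal stand-in that assumes a peak is a contiguous run of bases whose coverage is at or above the threshold. The function names are hypothetical, and the input is the toy coverage track from the "position not coverage" slide rather than the homework data.

```python
def call_peaks(depth, threshold):
    """Return half-open (start, end) runs where depth >= threshold."""
    peaks = []
    start = None
    for pos, value in enumerate(depth):
        if value >= threshold and start is None:
            start = pos                 # a run begins
        elif value < threshold and start is not None:
            peaks.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:               # run reaches the end of the track
        peaks.append((start, len(depth)))
    return peaks


def midpoint_intervals(peaks):
    """Turn each peak into a 1 base long interval at its midpoint."""
    return [((s + e) // 2, (s + e) // 2 + 1) for s, e in peaks]


# Toy coverage track copied from the earlier slide (41 bases).
depth = ([0] * 5 + [1] * 5 + [4] * 7 + [5] * 2 + [6] * 4
         + [5] * 5 + [2] * 7 + [1] * 2 + [0] * 4)

for threshold in (4, 5, 6, 7):
    peaks = call_peaks(depth, threshold)
    print(threshold, peaks, midpoint_intervals(peaks))
```

Running the loop over several thresholds is one way to see how sensitive the peak regions are to that single parameter: on this toy track the peak shrinks from bases 10-27 at threshold 4 to bases 19-22 at threshold 6, and disappears at 7.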
