ChIP: Quick Overview · Chromatin Immuno-Precipitation · 11/21/13
TRANSCRIPT
2012 - BMMB 597D: Analyzing Next Generation Sequencing Data
Week 13, Lecture 26
István Albert
Biochemistry and Molecular Biology and Bioinformatics Consulting Center
Penn State
Protein-DNA interactions: ChIP-Chip and ChIP-Seq studies
(many other techniques where DNA is isolated in some manner)
• ChIP → Chromatin Immuno-Precipitation (refers to sample preparation)
• Chip → microarray technology to detect bound genomic locations
• Seq → high throughput sequencing to detect bound genomic locations
Chromatin Immuno-Precipitation
It is a well-known methodology to detect protein-DNA interactions:
• transcription factor binding
• polymerase binding
• chromatin structure and modifications
ChIP: Quick Overview
[Figure: proteins P1 and P2 bound to double stranded DNA; labels: IP, Double stranded DNA, Proteins, P1]
• Crosslink bound proteins
• Fragment/digest DNA around bound locations
• Isolate with protein specific antibody
• Reverse cross link (separate)
This is the DNA fragment that gets sequenced
Sample origins
Understanding the sample preparation is essential for analysis
• WGS, whole genome sequencing (shotgun) → random DNA fragments covering the entire genome
• ChIP-Seq → DNA fragments covering “isolated” locations in the genome
The ChIP output
• a DNA sample enriched for fragments associated with the events under study
• BUT we are measuring an ensemble of cells that may be in different states!
• Coverage depends on the number of sites, efficiency of the IP (precipitation) step.
• Fragment accuracy depends on fragmentation strategy: sonication, MNase digestion, lambda nuclease digestion
Note: Plus lots of other DNA fragments can make it through!
ChIP-Seq → High throughput sequencing
• Fragments are sequenced
• Aligned against genome (Bowtie is a good choice for ChIP-Seq)
• The output is in the form of the intervals (start/end) where each read matches the genome
Sequencing proceeds from the 5’ to 3’. This is where we get reads from.
Sequencing process
[Figure: original two stranded DNA fragment, + and - strands]
The original DNA fragment could be longer/shorter than the length of the read
Other techniques: ChIP-Exo
Tiny footprint. Note how the reads may be longer than the bound location
Type of events of interest
100s of bases
~ 20-40 bases
~ 10 bases
And of course various fine scale settings: translational
WOR - The World of Read counting
• Aligners → base match
• Mappers → start/end location
A correct mapping could still be an incorrect alignment.
• Read Counts → estimates of abundance, binding, occupancy, gene expression
• Typically we prefer shorter and more numerous reads → better statistical power
Mapping to the genome
What we really need are just:
chrom, start, end, strand
The 5’ end location of each fragment corresponds to
• the start coordinate for the + strand
• the end coordinate for the - strand
ChIP-Seq is about locations. The 5’ end matters!
The rest of the read is needed only to place the read
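The strand rule above can be sketched in a few lines. This is only an illustration: the `Read` tuple, the `five_prime` helper and the example coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive so that the "end" of a - strand read is its 5’ base.

```python
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    start: int   # leftmost aligned base (assumed 1-based, inclusive)
    end: int     # rightmost aligned base (assumed 1-based, inclusive)
    strand: str  # "+" or "-"


def five_prime(read: Read) -> int:
    """Return the genomic coordinate of the 5' end of an aligned read."""
    if read.strand == "+":
        return read.start   # + strand: the 5' end is the start coordinate
    if read.strand == "-":
        return read.end     # - strand: the 5' end is the end coordinate
    raise ValueError(f"unknown strand: {read.strand!r}")


# Made-up example reads, not taken from the lecture data.
reads = [
    Read("chrI", 100, 135, "+"),
    Read("chrI", 180, 215, "-"),
]
for r in reads:
    print(r.chrom, five_prime(r), r.strand)
# prints "chrI 100 +" then "chrI 215 -"
```

With 0-based, half-open intervals (BED style) the - strand case would be `end - 1` instead.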
It is about position not coverage!
Most misused concept: position vs coverage
[Figure: per-base coverage track] 0 0 0 0 0 1 1 1 1 1 4 4 4 4 4 4 4 5 5 6 6 6 6 5 5 5 5 5 2 2 2 2 2 2 2 1 1 0 0 0 0
Maximal coverage
Average start
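To make the position vs coverage distinction concrete, here is a small sketch that computes both quantities from the same reads. The reads are made up for illustration (they are not the numbers on the slide), coordinates are assumed to be 1-based and inclusive, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical reads as (start, end, strand), 1-based inclusive coordinates.
reads = [
    (3, 7, "+"),
    (4, 8, "+"),
    (5, 9, "+"),
    (10, 14, "-"),
    (11, 15, "-"),
]


def coverage(reads, length):
    """Per-base count of reads overlapping positions 1..length."""
    cov = [0] * (length + 1)          # index 0 is unused
    for start, end, _strand in reads:
        for pos in range(start, end + 1):
            cov[pos] += 1
    return cov[1:]


def five_prime_counts(reads):
    """Count reads by the position of their 5' end, kept separate per strand."""
    counts = Counter()
    for start, end, strand in reads:
        pos = start if strand == "+" else end
        counts[(pos, strand)] += 1
    return counts


cov = coverage(reads, 16)
starts = five_prime_counts(reads)

print(cov)             # the whole read contributes to coverage
print(sorted(starts))  # only the 5' end contributes to position counts
```

In this toy case the coverage maximum sits in the middle of the + strand reads, while the 5’ ends cluster at the two edges of the region, which is the difference the slide is pointing at.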
Other considerations
• For single end sequencing each fragment may correspond to 0, 1 or 2 reads. If it has two reads we don’t know which two formed the fragment
• For paired end sequencing each fragment corresponds to 0, 1 or 2 reads (1 means no mate). We know which two reads correspond to one another → fragment size estimation.
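For paired end data the fragment size follows directly from the outer coordinates of the two mates. A minimal sketch, assuming 1-based inclusive coordinates and properly paired mates on the same chromosome; the `fragment_size` helper and the example pairs are hypothetical.

```python
from statistics import median


def fragment_size(mate1, mate2):
    """Length of the fragment spanned by two mates, each given as (start, end)."""
    left = min(mate1[0], mate2[0])     # leftmost base covered by either mate
    right = max(mate1[1], mate2[1])    # rightmost base covered by either mate
    return right - left + 1


# Made-up mate pairs, not taken from the lecture data.
pairs = [
    ((100, 135), (250, 285)),
    ((400, 435), (520, 555)),
]
sizes = [fragment_size(a, b) for a, b in pairs]
print(sizes)           # → [186, 156]
print(median(sizes))   # a robust summary of the fragment size distribution
```

For single end data no such pairing exists, so the fragment size has to be estimated indirectly, for example from the offset between + and - strand read pileups.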
Peak Calling
• Process of finding the locations enriched due to events of interest
We will need to define
• Peak Region - contiguous set of basepairs that belong to a peak
• Enrichment Level - read based measure of supporting evidence
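One simple way to turn these two definitions into code is a coverage threshold: a peak region is a contiguous run of bases at or above the threshold, and its enrichment level is the maximal coverage inside the run. This is a sketch of that idea only, not the lecture's own script; `call_peaks`, the threshold values and the toy coverage vector are assumptions made for illustration.

```python
def call_peaks(cov, threshold):
    """Return (start, end, max_cov) for each contiguous run with cov >= threshold.

    start and end are 0-based indices into cov, and end is inclusive.
    """
    peaks = []
    start = None
    for i, c in enumerate(cov):
        if c >= threshold:
            if start is None:
                start = i                      # a new peak region opens here
        elif start is not None:
            peaks.append((start, i - 1, max(cov[start:i])))
            start = None                       # the region closed at i - 1
    if start is not None:                      # a region that runs to the end
        peaks.append((start, len(cov) - 1, max(cov[start:])))
    return peaks


# Toy per-base coverage, made up for this example.
cov = [0, 1, 4, 5, 6, 5, 2, 1, 0, 0, 3, 4, 3, 0]
print(call_peaks(cov, 3))   # → [(2, 5, 6), (10, 12, 4)]
print(call_peaks(cov, 5))   # → [(3, 5, 6)]
```

Raising the threshold shrinks or drops regions, so the number of peaks depends on a parameter that has to be chosen and justified.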
Practical exercise (also homework)
• We have two datasets → the same binding factor was simulated as if it had a short or a long footprint (long.fq, short.fq)
• We will visualize and investigate these: detect bound locations, fragment size, peak locations etc. with one or more tools
The alignment script
• Use the files called short.fq and long.fq
• Create an alignment relative to the yeast genome
• Visualize the alignment in IGV
Visualize in IGV
strand bias
Deduplication
• Distinction needs to be made between natural vs artificial (PCR) duplicates
• There is no obvious consensus: the more accurate the method the more likely that we have natural duplicates
• Look for obvious flaws (strand biases); paired end sequencing helps identify artificial duplicates
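The most basic form of deduplication keeps one read per (chromosome, 5’ end, strand). The sketch below shows that strategy under stated assumptions: 1-based inclusive coordinates, a hypothetical `deduplicate` helper and made-up reads. It cannot tell natural duplicates from PCR duplicates; it simply collapses both.

```python
def deduplicate(reads):
    """Keep the first read seen for each (chrom, 5' end, strand) key.

    reads is an iterable of (chrom, start, end, strand) tuples.
    """
    seen = set()
    kept = []
    for chrom, start, end, strand in reads:
        pos = start if strand == "+" else end   # 5' end of the read
        key = (chrom, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, end, strand))
    return kept


# Made-up reads: several share a 5' end on the same strand.
reads = [
    ("chrI", 100, 135, "+"),
    ("chrI", 100, 135, "+"),   # exact duplicate
    ("chrI", 100, 140, "+"),   # same 5' end, different read length
    ("chrI", 180, 215, "-"),
    ("chrI", 170, 215, "-"),   # same 5' end on the - strand
    ("chrI", 101, 136, "+"),   # shifted by one base, kept
]
print(len(deduplicate(reads)))   # → 3
```

With paired end data the key could also include the mate's position, which makes it less likely that two independent fragments are collapsed by mistake.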
Deduplicated reads
“Poor man’s” peak predictor
“Poor man’s” peak prediction
bioawk tools on GitHub
Homework 26
• Using the data and scripts found in the file lec26.tar.gz on the webpage produce the plot seen on the previous slide.
• How many peaks can you find for a coverage threshold of 10?
• How sensitive is the peak prediction (number of peaks found) to the threshold?
• Modify the awk script to generate 1 base long intervals that indicate the midpoint of the peaks.
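The midpoint step in the last item is simple interval arithmetic. The sketch below shows the idea in Python rather than awk, so it is not the homework script itself; it assumes peaks arrive as 0-based, half-open (BED style) intervals, and the `midpoint_intervals` helper and example peaks are hypothetical.

```python
def midpoint_intervals(peaks):
    """Map (chrom, start, end) peaks to 1 base long intervals at their midpoint.

    Coordinates are assumed 0-based and half-open, as in BED files.
    """
    out = []
    for chrom, start, end in peaks:
        mid = (start + end) // 2          # integer midpoint of the interval
        out.append((chrom, mid, mid + 1))  # half-open interval of length 1
    return out


# Made-up peak intervals, not results from the lecture data.
peaks = [("chrI", 100, 200), ("chrI", 500, 541)]
for chrom, start, end in midpoint_intervals(peaks):
    print(chrom, start, end, sep="\t")
# prints "chrI 150 151" then "chrI 520 521" (tab separated)
```

The same computation carries over to an awk one-liner once the column layout of the peak file is known.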