ChIP-seq data: filtering and mapping reads
![Page 1: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/1.jpg)
ChIP-seq data: filtering and mapping reads
D. Puthier, C. Rioualen, J. van Helden
Galaxy Workshop — Cuernavaca, 2017
![Page 2: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/2.jpg)
Dataset used
● The estrogen receptor (ESR1) is a key factor in breast cancer development.
● Goal of the study: understand how ESR1 binding depends on the presence of cofactors, in particular GATA3, which is mutated in breast cancers.
● Approaches: GATA3 silencing (siRNA), ChIP-seq on ESR1 in WT vs. siGATA3 conditions, chromatin profiling.
Theodorou,V., Stark,R., Menon,S. and Carroll,J.S. (2013) GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility. Genome Res, 23, 12–22.
![Page 3: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/3.jpg)
Read archives
● Sequence Read Archive (SRA): https://www.ncbi.nlm.nih.gov/sra
○ Provides access to unaligned reads in sra format.
○ SRA read files need to be converted to fastq (see later).
○ Linked to Gene Expression Omnibus (GEO)
■ https://www.ncbi.nlm.nih.gov/geo/
● European Nucleotide Archive (ENA): http://www.ebi.ac.uk/ena
○ The European database of short read sequences.
○ Provides direct access to raw reads in fastq format.
○ Linked to ArrayExpress
■ https://www.ebi.ac.uk/arrayexpress/
![Page 4: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/4.jpg)
Getting information about the study
Protocol
● Go to the GEO web site (https://www.ncbi.nlm.nih.gov/geo/).
● Choose "Search" and paste GSE40129 (GSE stands for GEO Series Experiment). Click "GO" to get information about this experiment.
● In the "sample section" (middle of the page), click on "More" to visualize all sample names. Click on the GSM986059 hyperlink (GSM stands for GEO SaMple) to get information about this sample.
● In the "relations" section, select the SRX176856 hyperlink to open the SRA page corresponding to this sample.
● Click on the SRR link (bottom left) to access the record of the run.
NB: You can also get sequence data from the website of the European Nucleotide Archive (ENA): https://www.ebi.ac.uk/ena
![Page 5: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/5.jpg)
Getting information about the study
Exercises
● Q1: What is the HTS platform used to sequence this sample?
● Q2: Is this experiment single-end or paired-end sequencing?
● Q3: How many runs (i.e. lanes) are associated with this sample?
● Q4: How many reads were produced (# of Spots)?
● Q5: Select the hyperlink to the run SRR540188. What is the sequence of the first read?
![Page 6: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/6.jpg)
![Page 7: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/7.jpg)
Raw data
![Page 8: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/8.jpg)
The raw data are provided in fastq format
■ Header
■ Sequence
■ + (optional header)
■ Quality (Sanger quality score or other format)
@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36
GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################
@QSEQ32.249997 HWUSI-EAS1691:3:1:17257:12994#0/1 PF=1 length=36
TGTACAACAACAACCTGAATGGCATACTGGTTGCTG
+
DDDD<BDBDB??BB*DD:D#################
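The four-line record layout can be illustrated with a small reader. This is only a minimal sketch, assuming an uncompressed file with exactly four lines per record (no wrapped sequences); the names `FastqRecord` and `parse_fastq` are hypothetical and not part of the course material.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # text after the leading "@"
    sequence: str  # base calls
    quality: str   # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming 4 lines per record."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


# The two records shown on the slide.
example = """@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36
GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################
@QSEQ32.249997 HWUSI-EAS1691:3:1:17257:12994#0/1 PF=1 length=36
TGTACAACAACAACCTGAATGGCATACTGGTTGCTG
+
DDDD<BDBDB??BB*DD:D#################
"""

records = list(parse_fastq(example.splitlines()))
for rec in records:
    print(rec.header.split()[0], len(rec.sequence))
```

In practice a dedicated library would normally be preferred; the sketch is only meant to make the record structure explicit.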
![Page 9: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/9.jpg)
The Sanger quality score
● Sanger quality score (Phred quality score): measures the quality of each base call.
○ Based on p, the probability of error (the probability that the corresponding base call is incorrect).
○ Qsanger = -10*log10(p)
○ p = 0.01 <=> Qsanger = 20
● Quality scores are encoded as ASCII characters with an offset of 33.
● Note that SRA has adopted the Sanger quality score, although original fastq files may use a different quality score encoding (see: http://en.wikipedia.org/wiki/FASTQ_format).
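As a worked version of the definition above, the relation can be inverted to recover the error probability from a score. The symbol c for the quality character is introduced here only for illustration.

```latex
\begin{align*}
Q_{\text{sanger}} &= -10 \log_{10}(p)
  &&\Longleftrightarrow& p &= 10^{-Q_{\text{sanger}}/10} \\
Q_{\text{sanger}} &= 20 &&\Rightarrow& p &= 10^{-2} = 0.01 \\
Q_{\text{sanger}} &= 30 &&\Rightarrow& p &= 10^{-3} = 0.001 \\
Q_{\text{sanger}} &= \operatorname{ASCII}(c) - 33
  && && \text{(offset-33 encoding of quality character } c\text{)}
\end{align*}
```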
![Page 10: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/10.jpg)
ASCII 33 encoding
● Storing PHRED scores as single characters gives a simple and space-efficient encoding:
○ Range 0-40
○ ! is 0
○ " is 1
○ # is 2
○ $ is 3
○ …
○ I is 40
![Page 11: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/11.jpg)
Exercise: Qsanger quality score conversion
● In the short read below, the first 5 residues are Gs, but they are associated with different quality scores. Compute the error probability (p) associated with each of them:
@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36
GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################

| Character | ASCII | Qsanger = -10*log10(p) | p |
|---|---|---|---|
| `.` | 46 | 13 | 0.05 |
| `+` | 43 | 10 | 0.1 |
| `5` | 53 | 20 | 0.01 |
| `:` | 58 | 25 | 3.2e-3 |
| `=` | 61 | 28 | 1.6e-3 |
| `A` | 65 | 32 | 6.3e-4 |
| `#` | 35 | 2 | 0.63 |
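The conversion in this exercise can be checked with a few lines of code. This is a minimal sketch assuming the offset-33 (Sanger) encoding described earlier; the function names are hypothetical.

```python
def phred33_to_scores(quality: str) -> list[int]:
    """Convert an offset-33 quality string to integer Phred scores."""
    return [ord(char) - 33 for char in quality]


def error_probability(score: int) -> float:
    """Invert Q = -10*log10(p) to get the base-call error probability."""
    return 10 ** (-score / 10)


# Quality characters of the five leading Gs in the read from the slide.
first_five = "=.+5:"
for char, score in zip(first_five, phred33_to_scores(first_five)):
    print(f"{char!r}  ASCII={ord(char)}  Q={score}  p={error_probability(score):.2g}")
```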
![Page 12: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/12.jpg)
Connecting to the Galaxy server
Protocol
● Open a connection to the TIB2017 server (http://132.248.220.36/). Two solutions:
a. Enter the login and password you received before the beginning of the school.
b. Register in the “User” menu.
![Page 13: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/13.jpg)
About Galaxy…
![Page 14: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/14.jpg)
Getting a dataset from the shared library
Protocol
1. In the upper right corner, click on Unnamed history and rename this workspace to ChIP-Seq_mapping.
2. Use Shared Data > Data libraries > Theodorou > FASTQ.
3. Select siNT_ER_E2_r1_SRX176856_chr1.fq. Click on to history.
4. Set Select history to ChIP-Seq_mapping. Click import.
5. Go to your history and use the pencil associated with the dataset to rename it to “ESR1_chr1.fq”.
![Page 15: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/15.jpg)
Quality Control with FastQC
Protocol
1. Select FastQC from the toolbox.
2. Select ESR1_chr1.fq as input dataset. Press Execute.
3. Display the data for the corresponding result in your history (right panel).
Q: Carefully inspect all the statistics. What do you think of the overall quality of the sequencing?
NB: Comprehensive documentation on how to interpret FastQC results is available here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
![Page 16: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/16.jpg)
Quality control for high-throughput sequence data
● First step of analysis
○ Quality control
○ Ensure proper quality of selected reads
![Page 17: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/17.jpg)
Quality control with FastQC program
[Figure: FastQC plots. Axis labels recovered: Quality, Position in read, Nb Reads, Mean Phred Score.]
Look also at over-represented sequences.
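To make the "quality versus position in read" view concrete, the mean Phred score at each read position can be computed directly from the quality strings. This is only an illustrative sketch assuming offset-33 encoding; it is not how FastQC itself is implemented, and the function name is hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Reads may have different lengths; each position is averaged over
    the reads that are long enough to cover it.
    """
    totals, counts = [], []
    for quality in quality_strings:
        for i, char in enumerate(quality):
            if i == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# The two quality strings shown on the raw-data slide.
qualities = [
    "=.+5:<<<<>AA?0A>;A*A################",
    "DDDD<BDBDB??BB*DD:D#################",
]
profile = mean_quality_per_position(qualities)
print([round(q, 1) for q in profile[:5]], "...", profile[-1])
```

A drop of the profile toward the right end of the reads is the typical pattern that motivates trimming.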
![Page 18: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/18.jpg)
Read Trimming
● A pre-processing step
○ Input read ends with poor quality values are trimmed (most generally the right end)
○ May be a crucial step when working with aligners that perform global alignments
■ Otherwise, lots of reads may be unmapped
● Several software tools for read trimming
○ Sickle (sliding window-based trimming)
○ FASTX-Toolkit (cut a defined number of nucleotides)
○ Trimmomatic
○ Cutadapt (delete ends using bwa algorithm)
○ ...
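The idea of sliding window-based trimming can be sketched as follows. This is a simplified illustration, not Sickle's actual algorithm: it assumes offset-33 qualities, trims only the right (3') end at the first window whose mean quality falls below the threshold, and the window size of 5 is an arbitrary choice. The function name is hypothetical.

```python
def trim_3prime(seq, qual, threshold=20, window=5, min_length=25, offset=33):
    """Trim the 3' end of a read with a sliding window.

    Returns (sequence, quality) after trimming, or None when the read
    becomes shorter than min_length and should be discarded.
    """
    scores = [ord(char) - offset for char in qual]
    if not scores:
        return None
    cut = len(scores)
    # Slide the window from the 5' end; cut where mean quality first drops.
    for start in range(max(len(scores) - window + 1, 1)):
        win = scores[start:start + window]
        if sum(win) / len(win) < threshold:
            cut = start
            break
    if cut < min_length:
        return None
    return seq[:cut], qual[:cut]


# Hypothetical 36-base read: 30 good bases (Q40) then 6 poor ones (Q2).
result = trim_3prime("A" * 36, "I" * 30 + "#" * 6)
print(len(result[0]) if result else "discarded")
```

The thresholds 20 and 25 mirror the Quality Threshold and Length Threshold values used in the protocol; a read that is poor from the start is discarded rather than trimmed.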
![Page 19: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/19.jpg)
Trimming
Protocol
1. Search for the Sickle tool using the Galaxy search engine (upper left corner).
2. Set Single-End or Paired-End reads to Single-end.
3. Select the file ESR1_chr1.fq.
4. Set Quality Threshold to 20 and Length Threshold to 25.
5. Execute.
6. Rename the Single-End output of Sickle to ESR1_chr1_sickle.fq.
7. Perform a new FastQC analysis using the trimmed reads as input.
Q: How many reads do you retrieve after trimming? How does it compare with the input fastq file?
![Page 20: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/20.jpg)
Quality Control on trimmed data
Protocol
1. Select FastQC from the toolbox.
2. Select ESR1_chr1_sickle.fq as input dataset. Press Execute.
3. Rename the new FastQC result ESR1_chr1_sickle_fastQC.
4. Display the corresponding result by clicking on the eye of the FastQC Webpage.
Q: Carefully inspect all the statistics. What do you think of the overall quality of the sequencing? Compare the number of reads before and after trimming.
![Page 21: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/21.jpg)
Mapping
![Page 22: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/22.jpg)
Mapping
● Find out the position of the reads within the genome.
○ One position in the genome
○ Many possible positions (repeat regions, duplicate regions, pseudogenes…)
[Figure: reads aligned to a reference genome.]
![Page 23: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/23.jpg)
Seed and extend
● A seed is mapped to several positions
○ Check whether flanking bases are compatible with the read sequence
● An index has to be produced before the mapping to store the coordinates of seeds (k-mers).
[Figure: a seed matching three candidate positions; one is accepted (✅) and two are rejected (❌).]
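The seed-and-extend principle can be sketched with a plain dictionary as the k-mer index. This is a toy illustration under strong assumptions (forward strand only, seed taken from the start of the read, no gaps); real aligners use compressed indexes and more elaborate extension. All names and the toy genome are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k):
    """Store the start coordinates of every k-mer (seed) of the genome."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def seed_and_extend(read, genome, index, k, max_mismatches=0):
    """Return genome positions where the read aligns.

    The first k bases of the read are used as the seed; each seed hit is
    then extended by comparing the flanking bases with the read.
    """
    hits = []
    for pos in index.get(read[:k], []):
        candidate = genome[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run past the end of the genome
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


genome = "ACGTACGGACGTTT"
index = build_index(genome, k=4)
print(index["ACGT"])                                   # seed positions
print(seed_and_extend("ACGTT", genome, index, k=4))    # positions kept
```

Here the seed ACGT occurs at two positions, but only one of them survives the extension step.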
![Page 24: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/24.jpg)
Mapping ChIP-seq reads with Bowtie2
Protocol
1. From the tool panel, select Bowtie2.
2. Set Is this single or paired library to Single-end.
3. Set FASTQ file to ESR1_chr1_sickle.fq.
4. Set Will you select a reference genome from your history or use a built-in index to Use a built-in genome index.
a. Set Select the reference genome to hg19 (chr1 only).
5. Set Do you want to use presets to default.
6. Set Save the bowtie2 mapping statistics to the history to Yes.
7. Press Execute. Rename the output to ESR1_chr1.bam.
Q: What is the format of the resulting dataset? What should it contain?
![Page 25: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/25.jpg)
Aligner output: SAM/BAM files
● SAM: “Sequence Alignment/Map”
● BAM: binary/compressed version of SAM
● Store information related to alignments
○ Read alignment coordinates
○ Mapping quality
○ CIGAR string
○ Bitwise FLAG
■ read paired, read mapped in proper pair, read unmapped, ...
○ ...
![Page 26: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/26.jpg)
Bitwise flag
● Several pieces of information are encoded in the 2nd column (FLAG) of the SAM/BAM file:
○ read paired
○ read mapped in proper pair
○ read unmapped
○ mate unmapped
○ read reverse strand
○ mate reverse strand
○ first in pair
○ second in pair
○ not primary alignment
○ ...
These binary flags are stored in a single column.
![Page 27: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/27.jpg)
Bitwise flag
● 00000000001 → 2^0 = 1 (read paired)
● 00000000010 → 2^1 = 2 (read mapped in proper pair)
● 00000000100 → 2^2 = 4 (read unmapped)
● 00000001000 → 2^3 = 8 (mate unmapped)
● 00000010000 → 2^4 = 16 (read reverse strand)
● 00000001001 → 2^0+2^3 = 9 (read paired, mate unmapped)
● 00000001101 → 2^0+2^2+2^3 = 13 (read paired, read unmapped, mate unmapped)
● ...
https://broadinstitute.github.io/picard/explain-flags.html
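Decoding a flag amounts to testing each bit in turn. The sketch below uses the bit values defined in the SAM specification; the function name `decode_flag` is hypothetical.

```python
# Bit values from the SAM specification, with the labels used on the slides
# where available.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list[str]:
    """List the properties whose bit is set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for value in (9, 13, 16):
    print(value, format(value, "011b"), decode_flag(value))
```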
![Page 28: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/28.jpg)
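The extended CIGAR string
● Examples of CIGAR operations:
○ M alignment “match” (can be a sequence match or mismatch!)
○ I insertion to the reference
○ D deletion from the reference
● Example: 5 aligned bases, a 2-base deletion from the reference, then 7 aligned bases give 5M2D7M.
Reference: ATTCAGATGCAGTA
Read:      ATTCA--TGCAGTA
CIGAR:     5M2D7M
● http://samtools.sourceforge.net/SAM1.pdf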
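A CIGAR string can be split into (length, operation) pairs with a regular expression, which also gives the number of reference and read bases covered. This is a minimal sketch using the operation letters of the SAM specification; the function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Re-assemble to detect characters the pattern silently skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("5M2D7M"), reference_span("5M2D7M"), read_length("5M2D7M"))
```

For the slide's example, the alignment spans 14 reference bases while the read itself is 12 bases long.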
![Page 29: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/29.jpg)
Filtering
![Page 30: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/30.jpg)
Repeats and low complexity regions
● Some regions may contain repeats.
● Some regions may be of low complexity.
○ E.g. AT-rich, GC-rich
● Reads falling into these regions may be ambiguous.
● The mappability depends on read size.
○ As reads get longer, they become less ambiguous.
○ Mappability can be used as a measure of uniqueness.
![Page 31: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/31.jpg)
Mappability
● Mappability (a): the inverse of the number of times a read of a given length, starting at a given position, can align to the genome.
○ a=1 (the read aligns once)
○ a=1/n (the read aligns n times)
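The definition can be illustrated on a toy sequence by counting how often each k-mer occurs. This is a simplified sketch assuming exact matches on the forward strand only; real mappability tracks also consider mismatches and the reverse strand. The function name and the toy genome are hypothetical.

```python
from collections import Counter


def mappability(genome: str, k: int) -> list[float]:
    """Return a = 1/n for the k-mer starting at each genome position,
    where n is the number of exact occurrences of that k-mer."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1 / counts[kmer] for kmer in kmers]


toy_genome = "ACGTACGA"
print(mappability(toy_genome, 3))  # ACG occurs twice: a = 0.5 at two positions
print(mappability(toy_genome, 5))  # longer reads: every position is unique
```

The second call shows the effect of read size: with longer reads, every position of this toy genome becomes uniquely mappable.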
![Page 32: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/32.jpg)
Unireads? Multireads?
● The first aligners defined the notions of unireads and multireads.
● A uniread is thought to map to a single position on the genome.
● A multiread is thought to map to several positions on the genome.
○ Which position/gene produced the signal?
[Figure: a uniread and a multiread placed on a genome carrying genes G1, G2, G3 and G4.]
![Page 33: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/33.jpg)
How to deal with ‘multireads’?
● Keep 1 position randomly
● Keep all possible positions
● Keep none
![Page 34: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/34.jpg)
Filtering for Mapping Quality (MAPQ)
● Uniqueness has no meaning as we don’t know the true sequence.
○ Indeed, each nucleotide has an associated quality/probability of error.
● The notion has been superseded by the mapping quality score.
○ The mapping quality score is computed from the probability that the alignment is wrong:
■ takes mappability and sequence quality into account
■ MAPQ = -10*log10(Prob(alignment is wrong))
● p=0.01 -> MAPQ: 20
● p=0.001 -> MAPQ: 30
● p=0.0001 -> MAPQ: 40
● ...
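The relation between MAPQ and the error probability, and the effect of a MAPQ threshold, can be sketched as follows. The alignment records are hypothetical dictionaries invented for illustration; real aligners estimate MAPQ with their own heuristics, so this only shows the scale and the filtering step.

```python
import math


def mapq_from_error(p: float) -> float:
    """MAPQ = -10*log10(probability that the alignment is wrong)."""
    return -10 * math.log10(p)


def error_from_mapq(mapq: float) -> float:
    """Probability that an alignment with this MAPQ is wrong."""
    return 10 ** (-mapq / 10)


def filter_by_mapq(alignments, min_mapq=30):
    """Keep alignments whose mapping quality reaches the threshold."""
    return [aln for aln in alignments if aln["mapq"] >= min_mapq]


# Hypothetical alignments (read names and values are made up).
alignments = [
    {"read": "r1", "mapq": 42},
    {"read": "r2", "mapq": 1},
    {"read": "r3", "mapq": 30},
]
kept = filter_by_mapq(alignments, min_mapq=30)
print([aln["read"] for aln in kept], error_from_mapq(30))
```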
![Page 35: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/35.jpg)
Protocol
1. Select “Filter BAM datasets on a variety of attributes” from the toolbox.
2. Apply filter to ESR1_chr1.bam.
3. Set Select BAM property to filter on to mapQuality (selected by default).
4. Set Filter on read mapping quality (phred scale) to “>=30”.
5. Click Execute.
6. Rename the two result files as follows:
a. ESR1_chr1_filtering_parameters_txt
b. ESR1_chr1_filtered.bam
![Page 36: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/36.jpg)
Protocol
Q1: Check that the overall mean MAPQ is improved on the ESR1 ChIP after filtering.
● Use sam-stats from the toolbox to compute statistics on ESR1_chr1.bam and ESR1_chr1_filtered.bam.
● Use default parameters.
● Mean MAPQ should change from ~33 to ~40.
NB: You could also check that the number of alignments decreased after filtering using flagstat.
![Page 37: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/37.jpg)
Filtering for PCR duplicates
● PCR duplicates
○ Related to poor library complexity
○ The same set of fragments is amplified
■ May indicate that immuno-precipitation failed
○ Tools to check for duplicates
■ FastQC report (duplicate diagram)
■ PCR bottleneck metric (ENCODE)
![Page 38: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/38.jpg)
QC: PBC (PCR Bottleneck Coefficient)
● An approximate measure of library complexity
● PBC = N1/Nd
○ N1 = number of genomic positions with exactly 1 read aligned
○ Nd = number of genomic positions with ≥ 1 read aligned
● Value:
○ 0-0.5: severe bottlenecking
○ 0.5-0.8: moderate bottlenecking
○ 0.8-0.9: mild bottlenecking
○ 0.9-1.0: no bottlenecking
https://genome.ucsc.edu/ENCODE/qualityMetrics.html
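The PBC formula can be computed from the list of alignment start positions. This is a minimal sketch in which each aligned read is represented by a hypothetical (chromosome, start, strand) tuple; the function name and the example positions are made up for illustration.

```python
from collections import Counter


def pcr_bottleneck_coefficient(positions) -> float:
    """PBC = N1 / Nd.

    N1: number of genomic positions with exactly one read aligned.
    Nd: number of genomic positions with at least one read aligned.
    """
    counts = Counter(positions)
    n_distinct = len(counts)
    if n_distinct == 0:
        raise ValueError("no aligned reads")
    n_one = sum(1 for count in counts.values() if count == 1)
    return n_one / n_distinct


# Hypothetical alignments: the first position is covered by two reads.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 200, "+"),
    ("chr1", 300, "-"),
]
print(round(pcr_bottleneck_coefficient(reads), 2))
```

With two singleton positions out of three distinct positions, this toy library would fall in the "moderate bottlenecking" range of the scale above.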
![Page 39: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/39.jpg)
Protocol
Remove duplicates from the filtered aligned reads.
1. Select the tool MarkDuplicates from the toolbox.
2. Set SAM/BAM dataset or dataset collection to ESR1_chr1_filtered.bam.
3. Set If true do not write duplicates to YES.
4. Set The scoring strategy for choosing the non-duplicate to SUM_OF_BASE_QUALITIES (default).
5. Rename the two output files:
a. ESR1_chr1_filtered_nodup.bam
b. ESR1_chr1_filtered_nodup_metrics
6. Run sam-stats on the duplicate-filtered bam file.
Q: What is the percentage of duplicates?
![Page 40: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/40.jpg)
Visualization
![Page 41: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/41.jpg)
Integrative Genomics Viewer (IGV)
![Page 42: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/42.jpg)
Protocol
1. Download the filtered dataset.
a. Take care to download both the bam file and its associated index.
2. Start IGV.
a. Select hg19 as a genome (Menu > Genomes > Upload from server).
b. Load the bam file using File > Load from file.
c. Go to chr1.
d. Check out gene “RNF223” for instance.
3. The bam file contains exhaustive information. Only a fraction of a bam dataset is loaded into memory at once. We will compute a lightweight file (a tdf) that will contain only coverage information.
a. In IGV, select menu Tools > Run IGV tools > count. Browse to the bam file (not the *.bai!). Press Run.
b. Close the igvtools window.
c. Load the tdf file.
![Page 43: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/43.jpg)
Make sure you select the correct genome version!
● ACTB (chr5) mm9 vs mm10 in IGV (Integrative Genomics Viewer)
Rainbow colors on coverage tracks correspond to mismatches!
![Page 44: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/44.jpg)
Customize the visualization parameters...
![Page 45: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/45.jpg)
BAM files are fat
● BAM files are fat as they contain exhaustive information about read alignments.
○ Memory issues (can only visualize a fraction of the BAM).
● Need a more lightweight file format containing only genomic coverage information:
○ ❌ Wig (not compressed, not indexed)
○ ✅ TDF (compressed, indexed)
○ ✅ BigWig (compressed, indexed)
![Page 46: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/46.jpg)
Coverage file and read extension
● BAM files do not contain fragment locations but read locations.
● We need to extend reads to compute fragment coordinates before coverage analysis.
● Not required for paired-end (PE) data.
[Figure: reads extended to fragments, with coverage counted per window (wi, wi+1, wi+2, wi+3, wi+4).]
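Read extension and window-based coverage can be sketched as follows. This is a simplified illustration assuming 0-based half-open coordinates and a single fixed fragment length (200 bp here is an arbitrary, hypothetical value); it is not the implementation used by any particular tool, and the function names are made up.

```python
from math import ceil


def extend_read(start, end, strand, fragment_length):
    """Extend a read to the estimated fragment, in its 5'->3' direction.

    Reads on the + strand are extended to the right; reads on the
    - strand are extended to the left (not below coordinate 0).
    """
    if strand == "+":
        return start, max(end, start + fragment_length)
    return max(0, min(start, end - fragment_length)), end


def window_coverage(fragments, window, region_length):
    """Count, for each fixed-size window, the fragments overlapping it."""
    counts = [0] * ceil(region_length / window)
    for start, end in fragments:
        first = start // window
        last = (min(end, region_length) - 1) // window
        for w in range(first, last + 1):
            counts[w] += 1
    return counts


# Two hypothetical 36-bp reads on opposite strands.
fragments = [
    extend_read(100, 136, "+", 200),
    extend_read(400, 436, "-", 200),
]
print(fragments, window_coverage(fragments, window=100, region_length=500))
```

After extension the two fragments overlap in the third window, which is where the protein-bound site would be expected between a + strand read and a - strand read.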
![Page 47: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/47.jpg)
Library size normalization
● Signal needs to be normalized.
○ A simplistic approach: normalize coverage to 1x.
○ Beware: popular but not optimal.
[Figure: coverage tracks for three ChIP samples, one of them with a peak.]
● ChIP 1 (10 reads): ✅ already normalized to 1x coverage.
● ChIP 2 (20 reads): ✅ should be decreased 2-fold to get 1x coverage.
● ChIP 3 (20 reads): ❌ decreasing 2-fold would underestimate the peak signal. Problem...
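Normalizing to 1x coverage means scaling the signal so that the average coverage over the effective genome equals 1. A minimal sketch of the scale factor, assuming a single fragment length for all reads; the fragment length (100) and genome size (1000) below are hypothetical numbers chosen so that 10 reads give exactly 1x.

```python
def scale_factor_1x(n_reads, fragment_length, effective_genome_size):
    """Factor that brings the mean coverage to 1x.

    Mean coverage = n_reads * fragment_length / effective_genome_size,
    so the scale factor is its inverse.
    """
    mean_coverage = n_reads * fragment_length / effective_genome_size
    return 1 / mean_coverage


print(scale_factor_1x(10, 100, 1000))  # sample with 10 reads
print(scale_factor_1x(20, 100, 1000))  # sample with 20 reads
```

The factor depends only on the total number of reads, not on how they are distributed, which is why a sample whose extra reads are concentrated in a peak is penalized by this approach.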
![Page 48: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/48.jpg)
Protocol
1. Find the bamCoverage tool
2. Set BAM files:
a. ESR1_chr1_filtered_nodup.bam
b. input_chr1_filtered_nodup.bam
3. Set Bin size in bp to 25
4. Set Scaling/Normalization method to Normalize to 1x
5. Set Effective genome size to user specified and enter 199400000 in Effective genome size (this is because we restricted the analysis to reads belonging to chromosome 1).
6. Region of the genome to limit the operation to: chr1
7. Execute, and rename the output to ESR1_chr1_filtered_nodup.bw and input_chr1_filtered_nodup.bw
8. Download the resulting files (bigwig and BAM) and open them in the IGV browser.
9. In IGV, right click on the left panel : select set data range, and set Max Value to 100.
![Page 49: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/49.jpg)
Extracting a workflow
![Page 50: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/50.jpg)
Our workflow
[Figure: workflow diagram. Raw data (fastq) → trimmed reads (fastq) → mapped reads (bam) → filtered reads (bam) → coverage file (bigWig). Quality control: fastqc (html) on the raw data and on the trimmed reads. Downstream steps: visualization (bigwig), peak calling (bed), annotation, clustering, motif discovery.]
![Page 51: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/51.jpg)
Extracting a workflow
Protocol
1. In the history menu, select history options.
2. Click on Extract workflow.
3. Set the name of the new workflow to ChIP-Seq_mapping.
4. Using the menu go to workflow > ChIP-Seq_mapping > edit.
5. Move the boxes in order to optimize the readability of the workflow.
6. Rename the input elements according to their connections.
![Page 52: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/52.jpg)
Protocol
1. Create a new history: History > Create new.
2. Rename this workspace: INPUT.
3. Select Shared Data > Data Libraries > Theodorou > FASTQ > MCF_input_r3_SRX176888_chr1.fq
4. Use to history to import the dataset into the INPUT history.
5. Click on Galaxy (top left) and go to the INPUT history.
6. Rename the dataset to input_chr1.
7. Select workflow > ChIP-Seq_mapping > run. Set the proper input files.
8. Click Run workflow at the bottom of the page.
9. Rename the datasets.
10. Load the results into IGV.
a. Create a .tdf file for the input bam file.
b. Compare it with the previous .tdf, corresponding to the ChIP sample.
c. Beware of the data scale!
d. For readability you can rename the tracks by right-clicking on them.
![Page 53: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/53.jpg)
Comparison between the input and the chip samples
![Page 54: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/54.jpg)
Why we use an input...
![Page 55: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/55.jpg)
Thank you
![Page 56: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/56.jpg)
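Bowtie: a fast and very popular aligner
● Burrows-Wheeler Transform-based algorithm. Two phases: “seed and extend”.
● The Burrows-Wheeler Transform of a text T, BWT(T), can be constructed as follows:
○ The character $ is appended to T, where $ is a character not in T that is lexicographically less than all characters in T.
○ The Burrows-Wheeler Matrix of T, BWM(T), is obtained by computing the matrix whose rows comprise all cyclic rotations of T sorted lexicographically.
○ BWT(T) is the last column of BWM(T).
Example with T = acaacg:

| Index | Cyclic rotations of T$ | Index after sorting | BWM(T) (sorted rotations) | Last column |
|---|---|---|---|---|
| 1 | `acaacg$` | 7 | `$acaacg` | `g` |
| 2 | `caacg$a` | 3 | `aacg$ac` | `c` |
| 3 | `aacg$ac` | 1 | `acaacg$` | `$` |
| 4 | `acg$aca` | 4 | `acg$aca` | `a` |
| 5 | `cg$acaa` | 2 | `caacg$a` | `a` |
| 6 | `g$acaac` | 5 | `cg$acaa` | `a` |
| 7 | `$acaacg` | 6 | `g$acaac` | `c` |

BWT(T) = `gc$aaac`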
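The construction described above (append $, sort all cyclic rotations, read the last column) can be written directly. This is a naive sketch for small texts only; practical indexes are built from suffix arrays rather than by materializing every rotation. The function name `bwt` is hypothetical.

```python
def bwt(text: str) -> tuple[str, list[int]]:
    """Return BWT(text) and the 1-based start index of each sorted rotation.

    Assumes "$" does not occur in text and sorts before every character
    of text (true for letters in ASCII order).
    """
    t = text + "$"
    # Sort rotation start positions by the rotation they produce.
    order = sorted(range(len(t)), key=lambda i: t[i:] + t[:i])
    # The last character of the rotation starting at i is t[i - 1].
    last_column = "".join(t[i - 1] for i in order)
    return last_column, [i + 1 for i in order]


transform, indices = bwt("acaacg")
print(transform, indices)
```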
![Page 57: ChIP-seq data: filtering and mapping readspedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf... · ChIP-seq data: filtering and mapping reads D. Puthier, C. Rioualen, J. van Helden](https://reader036.vdocuments.net/reader036/viewer/2022062603/5f05fa297e708231d415acc4/html5/thumbnails/57.jpg)
Bowtie principle
● Burrows-Wheeler Matrices have a property called the Last First (LF) Mapping:
○ The ith occurrence of character C in the last column corresponds to the same text character as the ith occurrence of C in the first column.
○ Example: searching “AAC” in ACAACG
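The LF mapping is what allows a pattern to be searched from its last character to its first, narrowing a range of rows of the Burrows-Wheeler Matrix at each step. The sketch below is a naive illustration of this backward search (occurrences are counted by scanning the last column, which real implementations replace with precomputed tables); it is not Bowtie's actual code, and the function name is hypothetical.

```python
def backward_search(pattern: str, text: str) -> list[int]:
    """Return the 1-based positions of pattern in text using LF mapping."""
    t = text + "$"
    order = sorted(range(len(t)), key=lambda i: t[i:] + t[:i])
    last = [t[i - 1] for i in order]   # last column of the matrix
    first = sorted(t)                  # first column of the matrix
    # smaller[c]: number of characters in t that sort before c.
    smaller = {c: first.index(c) for c in set(t)}

    def occurrences(c, row):
        """Occurrences of c in the last column above the given row."""
        return last[:row].count(c)

    low, high = 0, len(t)              # current row range [low, high)
    for c in reversed(pattern):
        if c not in smaller:
            return []
        low = smaller[c] + occurrences(c, low)
        high = smaller[c] + occurrences(c, high)
        if low >= high:
            return []
    return sorted(order[row] + 1 for row in range(low, high))


print(backward_search("aac", "acaacg"))
print(backward_search("ac", "acaacg"))
```

For the slide's example, the search for "aac" ends on a single row of the matrix, the rotation starting at position 3 of the text.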