생물학 연구를 위한 컴퓨터 활용기술 8강
![Page 1: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/1.jpg)
Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
8th Lecture 2015.11.3
NGS Analysis I : align NGS read into reference genome
![Page 2: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/2.jpg)
Syllabus
Week : Topic
Week 1 Introduction : Why we need to learn this stuff?
Week 2 Basics of Unix and running BLAST on your PC
Week 3 Unix Command Prompt II and shell scripts
Week 4 Basics of programming (Python programming)
Week 5 Python Scripting II and sequence manipulations
Week 6 IPython Notebook and Pandas
Week 7 Basics of Next Generation Sequencing and Tutorial
Week 8
Week 9 Next Generation Sequencing Analysis I
Week 10 Next Generation Sequencing Analysis II
Week 11 R and statistical analysis
Week 12 Bioconductor I
Week 13 Bioconductor II
Week 14 Network analysis
![Page 3: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/3.jpg)
What we can do with NGS data
Resequencing
De novo genome sequencing
Is there a reference sequence for your favorite organism?
Yes No
NGS Sequencing Data
Sequence Assembly
Output : Sequence Contigs
Alignment with reference genome
Output : variants (SNP, Structural Variations)
Gene Predictions
Functional Classifications…
Association study with phenotypes
![Page 4: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/4.jpg)
Resequencing
Reference sequences : well-established genome sequence
We are interested in understanding genome level differences
Snyder M et al. Genes Dev. 2010;24:423-431
SNP/Indel
Phased SNP
Deletion
Insertion
Inversion
![Page 5: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/5.jpg)
ACGTTTGGATACTGCAAACCTATG
ACGTTTGTATACTGCAAACATATG
SNP (Single Nucleotide Polymorphisms)
• Change in Single Nucleotide Sequence
• Compared with the human reference sequence, an individual human has 3 – 4 million SNPs
• Some of them are very frequent, while others are very rare
- Common Variant (20-40% frequency in populations)
- Rare Variant (less than 1%)
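The two example sequences on this slide can be compared position by position to locate the SNPs. The snippet below is a minimal sketch; treating the first sequence as the reference and the second as the sample is an assumption, and `find_snps` is a hypothetical helper name.

```python
def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# The two sequences shown on the slide (which one is the reference is assumed).
reference = "ACGTTTGGATACTGCAAACCTATG"
sample = "ACGTTTGTATACTGCAAACATATG"

snps = find_snps(reference, sample)
print(snps)  # [(8, 'G', 'T'), (20, 'C', 'A')]
```

This only works for sequences that are already aligned and of equal length; real reads must first be aligned to the reference.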
![Page 6: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/6.jpg)
SNPs vs. SNVs
Both are single nucleotide variants
• SNP
– Known variant in the species (well characterized)
– Exists at a specific frequency in populations
– Verified in populations
– Registered in dbSNP (http://www.ncbi.nlm.nih.gov/snp)
• SNV
– Variant found in a specific individual (not well characterized)
– Very low frequency
Really a matter of frequency of occurrence
http://ccsb.stanford.edu/education/Nair_NGS.pptx
Single Nucleotide Variants
![Page 7: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/7.jpg)
TGCAAACCTATG
Indel (Insertion/Deletion)
• Deletion or addition of bases (less than 1 kb)
- 300,000-600,000 indels per person
• Large Scale Structural Variation (more than 2 kbp)
- more than 1,000 per person
TGCAAAC-TATG
TGCAAACC-TATG
TGCAAACCCTATG
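As a small illustration of the difference between an SNV, an insertion and a deletion, the sketch below classifies a variant from the lengths of its reference and alternative alleles. The allele strings are hypothetical examples written the way variant callers usually report them; `classify_variant` is not part of any tool used in this lecture.

```python
def classify_variant(ref, alt):
    """Classify a variant by comparing the reference and alternative allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same length but longer than one base


# Hypothetical alleles: one substitution, one extra C, one missing C.
examples = [("C", "T"), ("C", "CC"), ("CC", "C")]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)  # ['SNV', 'insertion', 'deletion']
```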
![Page 8: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/8.jpg)
Today, we will learn how to find these variants from NGS sequencing
- Reference Genome Sequences (Fasta format)
- Sequence Data (Fastq format)
Software
- bwa, samtools, bcftools
• Most of the software is Unix based
• For big eukaryotic genomes, it is difficult to run on an ordinary PC
• But for small eukaryotes or bacteria, it would be ok…
WorkFlow
Sequencing Data (FastQ)
Reference Genome Sequence
Alignment File (sam format)
Mapping
![Page 9: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/9.jpg)
Some information about NGS data
Single Read (SR) or Paired End (PE)
Read Length
Depth of Coverage (DNA)
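Depth of coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below applies that formula; all three input numbers are hypothetical round values chosen for illustration, not values taken from the data set used later in this lecture.

```python
def depth_of_coverage(n_reads, read_length, genome_size):
    """Average number of times each base of the genome is covered by a read."""
    return n_reads * read_length / genome_size


# Hypothetical example: 1 million paired-end fragments (2 reads each),
# 100 bp reads, and a genome of 12 million bp.
n_reads = 2 * 1_000_000
read_length = 100
genome_size = 12_000_000

coverage = depth_of_coverage(n_reads, read_length, genome_size)
print(f"{coverage:.1f}x")  # 16.7x
```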
![Page 10: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/10.jpg)
SRA
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
Sequence Read Archive : Repository for NGS Data
![Page 11: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/11.jpg)
![Page 12: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/12.jpg)
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP018525
During this study, they performed 47 RNA sequencing experiments (160.8 Gbp)
![Page 13: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/13.jpg)
Sources
Accessions
Type of Experiments
![Page 14: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/14.jpg)
Install SRA Toolkit
To download NGS data archived in NCBI/SRA, you need to install the SRA Toolkit
http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
![Page 15: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/15.jpg)
tar -xvzf sratoolkit.2.5.4-mac64.tar.gz
Extract archive
cd bin
pwd
(In the case of mac)
![Page 16: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/16.jpg)
Setup sratoolkit in your PATH
Add these lines into your .bash_profile in your home directory
Setup path
![Page 17: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/17.jpg)
Download sra file
Let's download a data file (it is BIG)
prefetch ERR560539
ERR560539 : SRA id
Maximum file size download limit is 20,971,520KB
2015-11-02T01:09:26 prefetch.2.5.4: 1) Downloading 'ERR560539'...
2015-11-02T01:09:26 prefetch.2.5.4: Downloading via http...
2015-11-02T01:23:08 prefetch.2.5.4: 1) 'SRR032988' was downloaded successfully
The file will be saved in ~/ncbi/public/sra
![Page 18: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/18.jpg)
Convert sra file into FASTQ file
fastq-dump --split-files ERR560539
ERR560539 : <sra id>
Read 1887328 spots for ERR560539
Written 1887328 spots for ERR560539
ls
ERR560539_1.fastq ERR560539_2.fastq
Paired End reads (Forward and Reverse, each read 5' to 3')
![Page 19: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/19.jpg)
See end of fastq file
Quality
Sequence
Size of file
About 2.9 GB
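Each FASTQ record has four lines: a header starting with '@', the sequence, a '+' separator, and a quality string with one character per base. The sketch below parses one record and converts the quality characters to Phred scores. The record itself is made up, and the Phred+33 encoding (score = ASCII code minus 33) is an assumption about the file; `parse_fastq_record` is a hypothetical helper.

```python
def parse_fastq_record(text):
    """Split one 4-line FASTQ record into (read name, sequence, Phred scores)."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumed Phred+33 encoding: quality score = ASCII code - 33
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


# A made-up record for illustration only.
record = """@read1
ACGTTTGGAT
+
IIIIIHHH#!
"""

name, sequence, scores = parse_fastq_record(record)
print(name, sequence)
print(scores)  # [40, 40, 40, 40, 40, 39, 39, 39, 2, 0]
```

Low scores at the end of the read, as in this made-up example, are the kind of pattern FastQC reports in its per-base quality plot.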
![Page 20: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/20.jpg)
Quality Control of FASTQ using FastQC
Download and install FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Open Fastq file
![Page 21: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/21.jpg)
![Page 22: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/22.jpg)
![Page 23: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/23.jpg)
Install bwa, samtools, bcftools
bwa: aligns short Illumina reads to a reference genome sequence
Genome sequence
Sequencing Data
Find matching positions and align the sequences
samtools : converts data formats and, together with bcftools, finds variants
![Page 24: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/24.jpg)
Install bwa, samtools, bcftools
1. Download the source files and compile them following the instructions
https://github.com/lh3/bwa/
https://github.com/samtools/samtools/
https://github.com/samtools/bcftools
2. Install via Homebrew (Mac) or apt-get (Ubuntu Linux)
http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install Homebrew
brew tap homebrew/science
http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/
brew install bwa
brew install samtools
brew install bcftools
![Page 25: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/25.jpg)
samtools
![Page 26: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/26.jpg)
bwa
![Page 27: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/27.jpg)
bcftools
![Page 28: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/28.jpg)
What we will do
Align sequencing reads to the reference genome
First, we will download the reference genome
https://support.illumina.com/sequencing/sequencing_software/igenome.html
We will use the Saccharomyces cerevisiae genome (sacCer3)
Download this file into the current directory
![Page 29: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/29.jpg)
Extract reference genome
tar -xvzf Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz
cp ./Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa .
mv genome.fa yeast
First, you need to generate an index file for the genome sequence
bwa index yeast
You can think of the 'index' as something like an address book of the genome, for fast access.
![Page 30: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/30.jpg)
Then, download the NGS sequence data to analyze
Download ERR560539.sra
prefetch ERR560539
fastq-dump --split-files ERR560539
Saccharomyces cerevisiae isolated from wine
![Page 31: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/31.jpg)
Running bwa (Align NGS reads to the Reference)
bwa mem -t 4 yeast ERR560539_1.fastq ERR560539_2.fastq > ERR560539.sam
mem : method for alignment (if the NGS reads are longer than 50 bp, select this)
-t : number of threads. If the CPU of your computer (server) has 4 cores, use -t 4
The two fastq files contain the NGS reads
The output is saved as the ERR560539.sam file
For the yeast alignment, it takes 259.789 sec on a 4-core computer
![Page 32: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/32.jpg)
Sam file
Records the location of each read in the reference
Read Name
Starting Position
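Each alignment line in a SAM file is tab-separated, with eleven mandatory fields; the read name is the first field and the starting position is the fourth. The sketch below splits one line into named fields and decodes two bits of the FLAG value. The alignment line is a made-up example, not taken from ERR560539.sam, and `parse_sam_line` is a hypothetical helper.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line):
    """Turn one SAM alignment line into a dictionary of its mandatory fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)   # bit 0x4: read unmapped
    record["is_reverse"] = bool(record["FLAG"] & 0x10)   # bit 0x10: reverse strand
    return record


# Made-up alignment line for illustration only.
line = "read1\t99\tchrI\t1000\t60\t10M\t=\t1200\t210\tACGTTTGGAT\tIIIIIHHHHH"
alignment = parse_sam_line(line)
print(alignment["QNAME"], alignment["RNAME"], alignment["POS"])  # read1 chrI 1000
```

Header lines in a real SAM file start with '@' and would need to be skipped before calling this function.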
![Page 33: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/33.jpg)
Convert Sam file to Bam file and indexing
samtools view -b -@ 4 ERR560539.sam > ERR560539.bam
Convert sam to bam (binary sam file)
-b : output 'bam' file
-@ 4 : uses 4 threads (for 4 Core CPU)
samtools sort -@ 4 ERR560539.bam ERR560539.sorted
Sort bam file (-@ 4 : uses 4 threads)
Generate index file
samtools index ERR560539.sorted.bam
Resulting files: ERR560539.bam, ERR560539.sam, ERR560539.sorted.bam, ERR560539.sorted.bam.bai
Now what?
![Page 34: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/34.jpg)
Let's visualize the data : Integrative Genomics Viewer (IGV)
https://www.broadinstitute.org/igv/download
![Page 35: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/35.jpg)
https://www.broadinstitute.org/software/igv/download
![Page 36: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/36.jpg)
In our examples, select sacCer3
![Page 37: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/37.jpg)
Zoom in
Zoom out
Select chromosome
Locations
![Page 38: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/38.jpg)
Load bam file
File->Load from file-> Select yeast.sorted.bam
![Page 39: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/39.jpg)
SNP
Gene
Zoom it
![Page 40: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/40.jpg)
Reference : C
Sequenced : T
![Page 41: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/41.jpg)
Missing in Sequenced Genome?
Low sequencing Depth
![Page 42: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/42.jpg)
![Page 43: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/43.jpg)
Find out Variants
samtools mpileup -g -f yeast yeast.sorted.bam > yeast.bcf
Examine every position in the genome and check the alignment
Estimate the likelihood of an alternative allele
bcftools call -c -v yeast.bcf > yeast.vcf
Write out the variants as yeast.vcf
Open yeast.vcf in nano editor
![Page 44: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/44.jpg)
Header
Variants
![Page 45: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/45.jpg)
DP : Raw read depth
How many sequence reads support this variant?
DP4 : Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
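The four DP4 numbers can be used to ask what fraction of the high-quality bases support the alternative allele. The sketch below is one simple way to compute this; the DP4 string is a hypothetical value and `alt_fraction_from_dp4` is not a bcftools function.

```python
def alt_fraction_from_dp4(dp4):
    """Fraction of high-quality bases supporting the alternative allele.

    dp4 is the comma-separated DP4 value:
    ref-forward, ref-reverse, alt-forward, alt-reverse.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        return None  # no high-quality bases at this position
    return (alt_fwd + alt_rev) / total


# Hypothetical DP4 value: 3 reference bases, 39 alternative bases.
fraction = alt_fraction_from_dp4("2,1,20,19")
print(round(fraction, 3))  # 0.929
```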
![Page 46: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/46.jpg)
Visualize
‘Load from Files’ in IGV
Select VCF file (yeast.vcf)
![Page 47: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/47.jpg)
SNV
![Page 48: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/48.jpg)
Data mining from variant data
A VCF file is just a text file, so we can handle it with Unix utilities and Pandas
head -n 50 yeast.vcf
Print out the first 50 lines of yeast.vcf
Header lines start with '##'
We want to remove the lines starting with ##. How?
![Page 49: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/49.jpg)
grep -v "^##" yeast.vcf > yeast2.vcf
Display the lines except those starting with '##', and save them as yeast2.vcf
We can do primitive filtering using grep
grep 'chrIX' yeast2.vcf | wc -l
2818
Display the variants in chrIX and count the lines
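The same two grep steps can be sketched in plain Python: drop the lines starting with '##', then count the variant lines per chromosome. The VCF text below is a tiny made-up example, so the counts are not the ones from yeast.vcf; the function names are hypothetical.

```python
from collections import Counter

# Tiny made-up VCF text for illustration only.
vcf_text = """##fileformat=VCFv4.2
##contig=<ID=chrI>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chrI\t100\t.\tC\tT\t50\t.\tDP=14
chrIX\t200\t.\tG\tA\t60\t.\tDP=81
chrIX\t300\t.\tA\tG\t70\t.\tDP=30
"""


def strip_meta_lines(text):
    """Equivalent of grep -v "^##": keep every line that does not start with '##'."""
    return [line for line in text.splitlines() if not line.startswith("##")]


def count_per_chromosome(lines):
    """Count variant lines by their first column, skipping the '#CHROM' column line."""
    return Counter(line.split("\t")[0] for line in lines if not line.startswith("#"))


lines = strip_meta_lines(vcf_text)
counts = count_per_chromosome(lines)
print(counts["chrIX"])  # 2
```

Counting by the first column is slightly stricter than grep 'chrIX', which matches the text anywhere in the line.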
Filtering with Pandas
![Page 50: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/50.jpg)
Data mining using IPython Notebook
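The notebook cells themselves are shown only as slide images, so the following is a minimal sketch of how a VCF file could be loaded into a Pandas DataFrame, not the notebook's actual code. It reads from an in-memory example; `read_vcf` and the example text are hypothetical, and renaming '#CHROM' to 'CHROM' is a convenience choice.

```python
import io

import pandas as pd


def read_vcf(handle):
    """Load a VCF file-like object into a DataFrame, skipping the '##' meta lines."""
    lines = [line for line in handle if not line.startswith("##")]
    table = pd.read_csv(io.StringIO("".join(lines)), sep="\t")
    # The column line starts with '#CHROM'; drop the '#' for easier access.
    return table.rename(columns={"#CHROM": "CHROM"})


# Tiny made-up VCF text; with a real file, pass open("yeast.vcf") instead.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrI\t100\t.\tC\tT\t50\t.\tDP=14;MQ=20\n"
    "chrIX\t200\t.\tG\tA\t60\t.\tDP=81;MQ=60\n"
    "chrIX\t300\t.\tA\tG\t70\t.\tDP=30;MQ=40\n"
)

vcf = read_vcf(io.StringIO(example_vcf))
print(vcf[["CHROM", "POS", "REF", "ALT"]])
```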
![Page 51: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/51.jpg)
![Page 52: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/52.jpg)
The information is stored as DP=14;VDB=2.6447e-06…
We want to convert it into columns of the DataFrame. How?
![Page 53: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/53.jpg)
Define functions
DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1;MQ0…
Convert the string into a dictionary
{'DP':81, 'VDB':5.92922e-11, 'SGB':-0.693147…}
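The function defined in the notebook is only visible in the slide image, so here is one possible way to write such a conversion. `info_to_dict` is a hypothetical name; values that are not numbers are kept as strings, and keys without '=' are stored as True.

```python
def info_to_dict(info):
    """Convert a VCF INFO string such as 'DP=81;MQSB=1' into a dictionary."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue
        key, separator, value = item.partition("=")
        if not separator:
            result[key] = True  # flag without a value
            continue
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value  # keep lists such as '1,2,3,4' as text
    return result


parsed = info_to_dict("DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1")
print(parsed)  # {'DP': 81, 'VDB': 5.92922e-11, 'SGB': -0.693147, 'MQSB': 1}
```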
![Page 54: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/54.jpg)
View single column as series
![Page 55: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/55.jpg)
Apply the 'split' function to each row
Convert as list
![Page 56: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/56.jpg)
Generate DataFrame from list
![Page 57: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/57.jpg)
Save as a new DataFrame named info
Select two columns in info (DP, MQ) and add them to vcf
![Page 58: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/58.jpg)
Keep the variants where DP (read depth) is higher than 50 and MQ (Mapping Quality) is higher than 30
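The steps from the last few slides (split INFO, build the info DataFrame, add DP and MQ to vcf, then filter) can be sketched together as below. This is an illustration under assumptions, not the notebook's code: the three-row DataFrame is made up, and `parse_info` is a hypothetical helper that keeps every value as text before DP and MQ are converted to integers.

```python
import pandas as pd

# Made-up variant table; in the lecture this comes from yeast2.vcf.
vcf = pd.DataFrame({
    "CHROM": ["chrI", "chrI", "chrIX"],
    "POS": [100, 250, 200],
    "INFO": ["DP=81;MQ=60", "DP=14;MQ=20", "DP=120;MQ=25"],
})


def parse_info(info):
    """Split 'KEY=VALUE;KEY=VALUE' into a dictionary of strings."""
    return dict(item.split("=", 1) for item in info.split(";") if "=" in item)


# One dictionary per row -> one column per INFO key.
info = pd.DataFrame(vcf["INFO"].apply(parse_info).tolist(), index=vcf.index)

# Add the two columns of interest to vcf as integers.
vcf["DP"] = info["DP"].astype(int)
vcf["MQ"] = info["MQ"].astype(int)

# Keep variants with read depth above 50 and mapping quality above 30.
filtered = vcf[(vcf["DP"] > 50) & (vcf["MQ"] > 30)]
print(filtered[["CHROM", "POS", "DP", "MQ"]])
```

Only the first row passes both conditions in this example; the third row has enough depth but too low a mapping quality.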
![Page 59: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/59.jpg)
How many filtered SNVs are found on 'chrI'?
Unfiltered
![Page 60: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/60.jpg)
Save back filtered VCF data…
Save as vcf3.vcf
grep "^##" yeast.vcf > header.vcf
Extract the header region of the VCF
cat header.vcf vcf3.vcf > filtered.vcf
Attach the header back
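For readers who prefer to stay in Python, the grep and cat steps can be sketched as a single function. The file contents below are made up and written to a temporary directory; `reattach_header` is a hypothetical helper, and the sketch assumes vcf3.vcf was saved with its '#CHROM' column line.

```python
import tempfile
from pathlib import Path


def reattach_header(original_vcf, body_vcf, output_vcf):
    """Copy the '##' lines of original_vcf in front of body_vcf and write output_vcf."""
    with open(original_vcf) as handle:
        header = [line for line in handle if line.startswith("##")]
    body = Path(body_vcf).read_text()
    Path(output_vcf).write_text("".join(header) + body)


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "yeast.vcf"
    body = Path(tmp) / "vcf3.vcf"
    output = Path(tmp) / "filtered.vcf"
    # Made-up file contents for illustration only.
    original.write_text("##fileformat=VCFv4.2\n#CHROM\tPOS\nchrI\t100\nchrI\t250\n")
    body.write_text("#CHROM\tPOS\nchrI\t100\n")
    reattach_header(original, body, output)
    result = output.read_text()

print(result)
```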
Open in IGV and compare original variant calling and filtered one..
![Page 61: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/61.jpg)
Filtered
Original
![Page 62: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/62.jpg)
Common Question Examples..
1. Find all SNVs present in exons
2. Find SNVs present in the promoter regions of genes
3. Find SNVs present in specific genes of interest
4. Filter SNVs which cause loss of function in genes
… You need another set of tools to answer these questions
We will look at them in the next lectures
![Page 63: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/63.jpg)
SNV Filtering
Pre-processing in the mapping phase and SNV filtering help minimize false positives
• Absent in dbSNP
• Exclude LOH events
• Retain non-synonymous
• Sufficient depth of read coverage
• SNV present in given number of reads
• High mapping and SNV quality
• SNV density in a given bp window
• SNV greater than a given bp from a predicted indel
• Strand balance/bias
• Concordance across various SNV callers
http://ccsb.stanford.edu/education/Nair_NGS.pptx
![Page 64: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/64.jpg)
Variant Annotation
• Interpretation of the variants actually found
• SeattleSeq
– annotation of known and novel SNPs
– includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g., missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
• Annovar
– Gene-based annotation
– Region-based annotations
– Filter-based annotation
http://snp.gs.washington.edu/SeattleSeqAnnotation/
http://www.openbioinformatics.org/annovar/
http://ccsb.stanford.edu/education/Nair_NGS.pptx
![Page 65: 생물학 연구를 위한 컴퓨터 활용기술 8강](https://reader031.vdocuments.net/reader031/viewer/2022020716/58a1d85c1a28abb6678b58a7/html5/thumbnails/65.jpg)