DNA/RNA read simulators

A Look at DNA/RNA Simulation

Upload: ccr-collaborative-bioinformatics-resource

Post on 17-Feb-2017


Page 1: DNA/RNA read simulators

A Look at DNA/RNA Simulation

Page 2: DNA/RNA read simulators

General Outline

• Brief overview of available simulators

• Pattnaik, et al. (2014). SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data. BMC Bioinformatics, 15:40.

• Griebel, et al. (2012). Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucl. Acids Res. 40 (20): 10073-10083.

• Mu, et al. (2015). VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, 31 (9): 1469-1471.

• Conclusions/Suggestions

Page 3: DNA/RNA read simulators

Brief Overview

• Read simulators:

– Wgsim (2009): basic sequencing simulation; dummy quality scores

– MetaSim (2008): uses pre-defined sequence context error models; multiple genome input

– ART (2012): uses pre-trained quality score distribution profile

– piRS (2012): creates quality score and cycle matrix from real data to generate empirical error profile

• Variation/Read simulators:

– GemSIM (2012): generates empirical error models from real data, multiple genome input, random generation of SNPs and Indels

– MAQ (2008): error model based on a quality score profile from a first-order Markov chain, random SNP and Indel generation

– DWGSIM (2009): based on wgsim of samtools. SNPs and Indels

– BEERS (2009): RNAseq simulator, random sampling from a set of gene models, copy distributions generated from a gene quantification file

– SInC (2014): pre-defined quality profile error generation, tool for generating custom profiles, random SNPs, indels, and CNVs

• Multi-step simulators:

– Flux Sim (2012): RNAseq experiment simulator, simulates transcription and sequencing from realistic statistical models

– VarSim (2015): genome and read simulation and validation framework

Page 4: DNA/RNA read simulators

SInC

• Three-part variation simulator and a read generator

• Variation modules model SNPs, Indels, and CNVs (copy number variations)

• Read generator module models short-read sequencing using a real-data derived quality distribution profile.

• Multi-threaded for fast read generation.

• Performed a small evaluation versus 4 other variation simulators.

Page 5: DNA/RNA read simulators

SInC

• SNPs, indels, and CNVs are randomly distributed across the reference genome by separate modules using command-line parameters

• Reads are generated using a pre-defined error profile distribution

• However, a separate tool is available to generate custom error profiles from real data sets
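
As a rough illustration of what "randomly distributed across the reference genome" means for SNPs, here is a minimal Python sketch. It is not SInC's code or its command-line interface: the function name `place_random_snps`, the toy reference string, and the uniform choice of positions and alternate bases are all assumptions made for the example.

```python
import random

def place_random_snps(reference, n_snps, seed=0):
    """Substitute n_snps distinct, randomly chosen positions (hypothetical helper).

    Returns the mutated sequence and a list of (position, ref_base, alt_base).
    """
    rng = random.Random(seed)
    bases = "ACGT"
    # Only positions holding an unambiguous base are eligible (skip N runs).
    eligible = [i for i, b in enumerate(reference) if b in bases]
    positions = sorted(rng.sample(eligible, n_snps))
    seq = list(reference)
    snps = []
    for pos in positions:
        ref = seq[pos]
        # Assumption: every alternate base is equally likely.
        alt = rng.choice([b for b in bases if b != ref])
        seq[pos] = alt
        snps.append((pos, ref, alt))
    return "".join(seq), snps

reference = "ACGTACGTNNACGTACGTAC"  # toy reference, not real data
mutated, snps = place_random_snps(reference, n_snps=3, seed=42)
```

Keeping the list of inserted SNPs alongside the mutated sequence is what later allows a variant caller's output to be checked against a known truth set.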

Page 6: DNA/RNA read simulators

SInC Workflow

Page 7: DNA/RNA read simulators

SInC Evaluation using GATK and Pindel

Page 8: DNA/RNA read simulators

SInC Evaluation

Page 9: DNA/RNA read simulators

FluxSim

• Generic RNA-seq experiment simulator

• Multiple modules simulating different stages of RNA Illumina library construction and sequencing, as well as a transcriptome simulator.

• Simulator Modules/Stages: transcription, fragmentation, reverse transcription, size selection, adapter ligation/PCR amplification, sequencing

Page 10: DNA/RNA read simulators

Outline of the Flux Simulator pipeline.

Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083

© The Author(s) 2012. Published by Oxford University Press.

Page 11: DNA/RNA read simulators

FluxSim Transcription

• FluxSim models gene expression by sampling from a power law distribution (i.e. modified Zipf’s law with exponential mRNA decay).

– This relationship models the networked nature of cellular gene expression, with many lowly expressed genes (low ranked), several moderately expressed genes, and a few very highly expressed genes (high ranked).
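
A small Python sketch of a rank-abundance curve of this kind may help. It assumes a functional form of a power law multiplied by an exponential decay term, y(r) = x0 · r^k · exp(-(r/x1) - (r/x1)^2); the exact parametrization should be checked against Griebel et al. (2012). The parameter values below are illustrative, not FluxSim defaults, and in this sketch rank 1 denotes the most abundant transcript.

```python
import math

def expression_by_rank(n_genes, k=-0.6, x0=1e4, x1=5000.0):
    """Expression level per abundance rank (hypothetical helper).

    Assumed form: x0 * rank**k * exp(-(rank/x1) - (rank/x1)**2).
    k, x0 and x1 are illustrative values, not taken from the source.
    """
    levels = []
    for rank in range(1, n_genes + 1):
        decay = math.exp(-(rank / x1) - (rank / x1) ** 2)
        levels.append(x0 * rank ** k * decay)
    return levels

levels = expression_by_rank(10000)
total = sum(levels)
# Fraction of all simulated RNA molecules contributed by each rank.
fractions = [v / total for v in levels]
```

On a log-log plot the power-law part gives a straight line for the top ranks, while the exponential term bends the curve downward for the long tail of weakly expressed genes.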

Page 12: DNA/RNA read simulators

FluxSim: log-log plot of three real cellular transcriptome datasets

Page 13: DNA/RNA read simulators

FluxSim Sequencing

• A quality profile based model for Illumina sequencing

– Quality values are randomly drawn from a pre-defined empirical distribution dependent on cycle position

– Nucleotides are mutated according to the quality score error probability

– Nucleotide mutation choice/preference is determined based on the quality score using a first order Markov process
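
The first two steps above can be sketched in Python as follows. The Phred relationship p = 10^(-Q/10) is standard; everything else is an assumption for illustration: `simulate_read` and the toy `profile` are hypothetical, and the substituted base is picked uniformly here rather than with the quality-dependent first-order Markov process the slide describes.

```python
import random

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def simulate_read(fragment, quality_by_cycle, seed=0):
    """Draw a quality per cycle and mutate bases accordingly (hypothetical helper).

    quality_by_cycle[i] lists the quality values observed at cycle i,
    standing in for an empirical per-cycle distribution.
    """
    rng = random.Random(seed)
    read, quals = [], []
    # The read is no longer than the number of cycles in the profile.
    for cycle, base in enumerate(fragment[:len(quality_by_cycle)]):
        q = rng.choice(quality_by_cycle[cycle])
        if rng.random() < phred_to_error_prob(q):
            # Simplification: uniform choice among the three other bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
        quals.append(q)
    return "".join(read), quals

fragment = "ACGTACGTAC"
# Toy profile: high qualities in early cycles, lower ones later.
profile = [[40, 38, 35]] * 5 + [[30, 20, 10]] * 5
read, quals = simulate_read(fragment, profile, seed=1)
```

Because the reported quality and the injected error come from the same draw, the simulated quality strings stay consistent with the actual error rate at each cycle.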

Page 14: DNA/RNA read simulators

VarSim

• Multi-step simulator and validation framework

– 1) simulates perturbed diploid genomes from a reference by inserting variants (VarSim simulates SNVs, deletions, insertions, MNPs, complex variants, tandem duplications and inversions) from the distribution profiles of existing databases

– 2) uses a third-party read simulator to generate sequenced reads (currently configured to use DWGSIM or ART) from the perturbed genomes

– 3) reads are mapped back to the original reference genome using a modified vcf2diploid (Rozowsky et al., 2011) map file (MFF file)

Page 15: DNA/RNA read simulators

VarSim Validation

– read alignments (from mapping software, e.g. BWA-mem) are validated using read header metadata

– Variants (from variant caller software, e.g. FreeBayes) are validated against ‘true’ variants that were inserted into the perturbed genome

– Accuracy of variant calling is reported based on sensitivity (TPR) and precision (PPV/FDR), broken down by variant type and size, as a JSON file with SVG plots
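
The accuracy metrics named above can be computed as in the following Python sketch. This is not VarSim's comparison code: `variant_calling_metrics` is a hypothetical helper, variants are represented as (chromosome, position, ref, alt) tuples, and calls are matched to the truth set by exact equality, which a real comparison would need to relax for indels and structural variants.

```python
import json

def variant_calling_metrics(true_variants, called_variants):
    """Compare called variants with the inserted truth set (hypothetical helper).

    Assumption: a call counts as a true positive only on an exact match.
    """
    truth, calls = set(true_variants), set(called_variants)
    tp = len(truth & calls)   # called and truly inserted
    fp = len(calls - truth)   # called but never inserted
    fn = len(truth - calls)   # inserted but missed
    tpr = tp / (tp + fn) if truth else 0.0
    ppv = tp / (tp + fp) if calls else 0.0
    fdr = 1.0 - ppv if calls else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TPR": tpr, "PPV": ppv, "FDR": fdr}

truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "GA"), ("chr2", 900, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 41, "G", "A")}
metrics = variant_calling_metrics(truth, calls)
report = json.dumps(metrics, indent=2)  # JSON text, as in the slide's report format
```

In this toy example two of the four inserted variants are recovered and one of the three calls is spurious, so sensitivity is 0.5 and precision is 2/3.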

Page 16: DNA/RNA read simulators

VarSim simulation and validation workflow.

John C. Mu et al. Bioinformatics 2015;31:1469-1471

© The Author 2014. Published by Oxford University Press.

Page 17: DNA/RNA read simulators

Validation results for some popular secondary analysis tools.

John C. Mu et al. Bioinformatics 2015;31:1469-1471

© The Author 2014. Published by Oxford University Press.

Page 18: DNA/RNA read simulators

Conclusions/Suggestions

• There are no comprehensive evaluations (that I could find) of DNA/RNA simulators other than the incomplete SInC comparison.

• However, SInC and VarSim appear to be good candidates for genome variation and gDNA simulation, while FluxSim appears to be the only fully realized RNA simulator.

• A pipeline with SInC or VarSim genome perturbation combined with FluxSim transcription and library prep/sequencing might allow validation of RNAseq tools with biologically complex simulated data.

Page 19: DNA/RNA read simulators

FluxSim Evaluation

Comparison of simulated reads with experimental evidence in different sequencing protocols.

Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083

© The Author(s) 2012. Published by Oxford University Press.