Dr Victor Chang AC 1936-1991, Pioneering Cardiothoracic Surgeon and Humanitarian
Bioinformatics analysis of single-cell RNA-seq data
Joshua W. K. Ho, PhD
Head, Bioinformatics and Systems Medicine Laboratory, Victor Chang Cardiac Research Institute
Senior Lecturer (Conjoint), UNSW Sydney
@joshuawkho
2018 Winter School in Mathematical & Computational Biology, University of Queensland, 3 July 2018
-
Bioinformatics challenges:
• Scalability (>1 million cells)
• Technical noise (dropouts)
-
RNA-seq alignment and transcript reconstruction
-
Cloud computing to enable scalability
Cloud computing + Big Data Framework
• Cloud computing
  • A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources
  • Key characteristics – elasticity + pay-as-you-go model
  • Advantages – low entry cost + scalability
• Big Data framework
  • Hadoop – a software framework for distributed processing of big data on large-scale clusters (YARN for resource management, HDFS for big data storage, and MapReduce as the analytics engine)
  • Spark – a general-purpose data-analytics engine for analysis of big data using in-memory computation (allows a speed-up of up to 100x compared to MapReduce)
-
Existing tools
• Halvade (https://github.com/biointec/halvade)
  • Written in Hadoop MapReduce
  • Designed to perform variant calling of genomic data from FASTQ files
  • Provides support for transcriptomic analysis
• SparkBWA (https://github.com/citiususc/SparkBWA)
  • Written in Spark
  • Designed to perform alignment of FASTQ files only
• SparkSeq (https://bitbucket.org/mwiewiorka/sparkseq/wiki/Home)
  • Written in Spark
  • Designed to perform interactive analysis of BAM files
• Limitations:
  • Halvade and SparkBWA do not offer multi-sample analysis
  • SparkSeq does not perform alignment, which is the main bottleneck in the analysis
-
Falco framework
MapReduce Spark
Andrian Yang, Michael Troup
Yang et al. (2017) Bioinformatics
-
Falco framework features
Ease of use
• Falco provides a helper script to launch an EMR cluster and submit jobs to the cluster
• Users can easily configure the cluster and jobs by modifying the configuration file passed to the helper script
Customisation
• Falco allows users to add custom alignment and/or quantification tools
• Users will need to implement a custom function to call the aligner/quantification tool
• Custom tools must be compatible with the divide-and-conquer approach
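The divide-and-conquer requirement can be illustrated with a toy sketch: the input is split into chunks, each chunk is processed independently, and the partial results are merged. This is not Falco's code. The function names are hypothetical, and the "quantifier" is a stand-in that assumes each read already carries its gene label, where a real tool would align the read first.

```python
from collections import Counter
from itertools import islice


def split_into_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_genes_in_chunk(chunk):
    """Toy stand-in for running an aligner + quantifier on one chunk.

    Each record is assumed to be a (read_id, gene) pair; a real tool
    would align the read and assign it to a feature.
    """
    return Counter(gene for _read_id, gene in chunk)


def merge_counts(partial_counts):
    """Combine per-chunk counts; valid because read counts are additive."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


reads = [("r1", "Actb"), ("r2", "Gapdh"), ("r3", "Actb"),
         ("r4", "Pou5f1"), ("r5", "Actb")]
partials = [count_genes_in_chunk(c) for c in split_into_chunks(reads, chunk_size=2)]
gene_counts = merge_counts(partials)
print(dict(gene_counts))
```

A tool fits this pattern only if running it on the chunks and merging gives the same answer as running it on the whole file, which is why the merge step here is a plain sum.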
[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
analysis_script = run_pipeline_multiple_files.py
analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts
analysis_script_local_location = source/spark_runner
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
annotation_file = vM9_ERCC.gtf
strand_specificity = NONE
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
picard_extra_args =
region = us-west-2
Sample configuration for running analysis job
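As a rough illustration of how such a configuration could be inspected before submitting a job, the sketch below reads a shortened copy of the sample with Python's standard configparser module. It assumes the file is plain INI syntax; it is not Falco's own loader, and memory_in_gb is a hypothetical helper added only for this example.

```python
import configparser

# Shortened copy of the sample configuration from the slide.
SAMPLE = """\
[job_config]
name = mESC analysis job
analysis_script = run_pipeline_multiple_files.py

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
region = us-west-2
"""


def load_job_config(text):
    """Parse INI-style text; interpolation is disabled so values are read literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def memory_in_gb(value):
    """Hypothetical helper: convert a string such as '30g' to the integer 30."""
    if not value.lower().endswith("g"):
        raise ValueError(f"expected a value ending in 'g', got {value!r}")
    return int(value[:-1])


config = load_job_config(SAMPLE)
args = config["script_arguments"]
print(args["aligner_tool"], args["counter_tool"])
print(args.getboolean("run_picard"))
print(memory_in_gb(config["spark_config"]["executor_memory"]))
```

Empty entries such as aligner_extra_args are read as empty strings, so optional arguments can be left blank as in the sample.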
-
Benchmarking
• Single-cell RNA-seq data sets
  • Mouse embryonic stem cell (mESC) data (869 samples)
    • 200bp paired-end reads, 1.28×10^12 bases, 1.02 Tb FASTQ.gz files
  • Human brain data (466 samples)
    • 100bp paired-end reads, 2.95×10^11 bases, 213.66 Gb FASTQ.gz files
• Performance comparison of Falco against single-node standalone runs
  • STAR+featureCount (S+F)
    • Mouse: speedup of 2.6x – 33.4x
    • Brain: speedup of 5.1x – 145.4x
  • HISAT2+HTSeq (H+H)
    • Mouse: speedup of 2.5x – 58.4x
    • Brain: speedup of 4.0x – 132.5x
System      Nodes              Mouse - embryonic stem cell (hours)   Human - brain (hours)
                               S+F       H+H                         S+F       H+H
Standalone  1 (1 process)      93.7      154.7                       85.67     65.34
Standalone  1 (5 processes)    29.3      33.8                        99.09     67.08
Standalone  1 (12 processes)   21.1      16.4                        115.71    55.15
Standalone  1 (16 processes)   18.5      13.6                        114.11    67.98
Falco       10                 7.0       2.7                         32.13     65.34
Falco       20                 4.1       1.6                         39.64     67.08
Falco       30                 3.3       1.4                         57.68     67.68
Falco       40                 2.8       1.1                         76.08     67.98
Table 1. Runtime analysis of single cell datasets
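The quoted speedups are ratios of standalone runtime to Falco runtime. As a minimal sketch, the snippet below recomputes the range for the mouse STAR+featureCount column only, using the tabulated hours. Because those hours are rounded to one decimal place, the result only approximately matches the quoted 2.6x to 33.4x. The variable and function names are illustrative.

```python
# Runtimes in hours, copied from the mouse S+F column of Table 1.
standalone_hours = {
    "1 process": 93.7,
    "5 processes": 29.3,
    "12 processes": 21.1,
    "16 processes": 18.5,
}
falco_hours = {10: 7.0, 20: 4.1, 30: 3.3, 40: 2.8}  # keyed by number of nodes


def speedup(baseline_hours, new_hours):
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


# Smallest speedup: fastest standalone run against the slowest Falco run.
lowest = speedup(min(standalone_hours.values()), max(falco_hours.values()))
# Largest speedup: slowest standalone run against the fastest Falco run.
highest = speedup(max(standalone_hours.values()), min(falco_hours.values()))

print(f"speedup range: {lowest:.1f}x to {highest:.1f}x")
```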
-
Cost effectiveness by using AWS spot instances
Utilising spot instances
• AWS allows utilisation of unused Amazon computing capacity, known as Spot instances
  • Typically cheaper compared to ‘on-demand’ cost
  • To use a spot instance, the user needs to bid for the resource
• Use of spot instances for analysis provides a saving of ~65% compared to using ‘on-demand’ instances
• Alternative use: decrease runtime by utilising more instances for a given ‘on-demand’ price
Figure 3. Spot instance price history for September to October
Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

Dataset        Nodes  Time (hours)  On-demand cost (USD)  Spot cost (USD)  % Savings
Mouse - ESC    10     8             247.20                85.67            65.34
Mouse - ESC    20     5             301.00                99.09            67.08
Mouse - ESC    30     4             258.00                115.71           55.15
Mouse - ESC    40     3             356.40                114.11           67.98
Human - brain  10     3             92.70                 32.13            65.34
Human - brain  20     2             120.40                39.64            67.08
Human - brain  30     2             179.00                57.68            67.68
Human - brain  40     2             237.60                76.08            67.98

Table 3. Falco cost analysis - on-demand vs spot instances for HISAT2+HTSeq

Dataset        Nodes  Time (hours)  On-demand cost (USD)  Spot cost (USD)  % Savings
Mouse - ESC    10     12            370.80                128.40           65.37
Mouse - ESC    20     7             421.40                138.60           67.11
Mouse - ESC    30     5             447.50                144.50           67.71
Mouse - ESC    40     4             475.20                152.00           68.01
Human - brain  10     5             154.50                53.50            65.37
Human - brain  20     3             180.60                59.40            67.11
Human - brain  30     2             179.00                57.80            67.71
Human - brain  40     2             237.60                76.00            68.01
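The % Savings column follows from the two cost columns as (on-demand − spot) / on-demand × 100. A small sketch that recomputes it for the mouse rows of Table 3 (the function name is illustrative, not part of Falco):

```python
def percent_savings(on_demand_usd, spot_usd):
    """Percentage saved by paying the spot price instead of the on-demand price."""
    return (on_demand_usd - spot_usd) / on_demand_usd * 100.0


# (nodes, on-demand cost in USD, spot cost in USD) for Mouse - ESC in Table 3.
table3_mouse = [
    (10, 370.80, 128.40),
    (20, 421.40, 138.60),
    (30, 447.50, 144.50),
    (40, 475.20, 152.00),
]

for nodes, on_demand, spot in table3_mouse:
    print(f"{nodes} nodes: {percent_savings(on_demand, spot):.2f}% saved")
```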
-
Scaling up to a larger data set
Data set (for Standalone + Falco)
• Single-cell mouse oligodendrocytes from the central nervous system (SRP066613)
• 6,283 samples of 50bp single-end reads, totalling 231.02 Gbp, stored in 200 Gb of fastq.gz files
• Standalone + Falco
  • Preprocessing with Trimmomatic
  • Alignment with STAR
  • Quantification with featureCount
  • Clustering with CIDR
• Cell Ranger – custom pipeline designed by 10x Genomics for the Chromium platform
  • Alignment with STAR
  • Timing is approximated from the runtime of a different mouse scRNA-seq dataset
Figure. Number of cells processed per second (y-axis, 0.0 to 1.5). Bars: 1 process, 12 processes, 16 processes and Cell Ranger (labelled Standalone); 10 nodes and 40 nodes (labelled Falco).
-
Next step – using Falco for transcript reconstruction
Andrian Yang, Abhinav Kishore
-
Discovery of novel transcript isoforms in published data
• Identification of novel transcripts and isoforms
-
Availability
Source code
• Falco is available to download from GitHub
• Our work on Falco has been featured in a Nature Toolbox article
Check out Falco at github.com/VCCRI/Falco
-
Technical noise in scRNA-seq: Dropouts
Figure 1 (a) Types of cell-to-cell variability observed in single-cell RNA-seq measurements. A smoothed scatter plot compares gene expression estimates from two cells of the same type (MEF cells), illustrating prevalence of dropout events, over-dispersion, and high-magnitude outliers.
Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742
-
The dropout problem in scRNA-seq data analysis
Fig. 1b. Heat maps showing the relationship between dropout rate and mean non-zero expression level for three published single-cell data sets including an approximate double exponential model fit.
Pierson, E. and Yau, Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241
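A minimal sketch of the double exponential relationship mentioned in the caption, assuming the form p = exp(−λx²), where x is the mean non-zero log expression level and λ is a decay parameter. The value of λ below is made up for illustration and is not a fitted estimate from any of the three data sets.

```python
import math


def dropout_probability(mean_log_expression, decay):
    """Double exponential decay: p = exp(-decay * x**2)."""
    return math.exp(-decay * mean_log_expression ** 2)


decay = 0.1  # hypothetical value, chosen only to show the shape of the curve

for x in (0.0, 1.0, 2.0, 4.0, 6.0):
    rate = dropout_probability(x, decay)
    print(f"mean log expression {x:.1f} -> dropout rate {rate:.3f}")
```

Under this form the dropout rate is 1 at zero expression and falls quickly as expression increases, which is the qualitative pattern the heat maps show.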
-
Dropouts
1. What is the cause of zero read counts?
• Biological reason:
  o True non-expression: stochastic variability due to cell-to-cell variation (transcriptional bursts)
• Technical reason (dropout):
  o Low starting mRNA that causes a transcript to be ‘missed’ during the initial reverse transcription step, and hence not detected during sequencing – cannot be recovered by deeper sequencing!
  o Amplification biases
  o Low sequencing depth
• Dropouts impact clustering by inflating cell-to-cell dissimilarity
-
Dropouts
How do we deal with dropouts?
• Ignore dropouts
  • Keep the zeros, and proceed as usual
  • Remove rows that have ‘too many’ zeros, then proceed as usual
  • Focus on only key ‘marker genes’ that are not excessively affected by zeros
• Account for the dropouts explicitly through a statistical mixture model
  • When performing differential expression analysis (for example), take into account the variance that can be attributed to excessive zeros (e.g. SCDE; Kharchenko et al. 2014)
  • ZIFA: modified probabilistic principal component analysis (PCA) that incorporates a global zero-inflation parameter to account for dropouts (Pierson et al. 2015)
• Imputation (using a variety of methods)
Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742
Pierson, E. and Yau, Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241
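The simplest option above, removing rows with ‘too many’ zeros, could look like the sketch below. The threshold of 50% and the gene names are arbitrary choices for illustration; the slides do not prescribe a cut-off.

```python
def zero_fraction(row):
    """Fraction of cells in which the gene has a zero count."""
    return sum(1 for value in row if value == 0) / len(row)


def filter_genes(counts, max_zero_fraction):
    """Keep genes whose fraction of zero counts is at most max_zero_fraction."""
    return {
        gene: row
        for gene, row in counts.items()
        if zero_fraction(row) <= max_zero_fraction
    }


# Toy count matrix: one row per gene, one column per cell.
counts = {
    "GeneA": [5, 0, 7, 3],
    "GeneB": [0, 0, 0, 2],
    "GeneC": [1, 4, 0, 0],
}

kept = filter_genes(counts, max_zero_fraction=0.5)
print(sorted(kept))
```

This discards information about lowly expressed genes, which is one motivation for the model-based and imputation approaches listed after it.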
-
https://github.com/VCCRI/CIDR
Dr Paul Lin
-
Figure. Effect of dropouts on clustering, with and without CIDR dissimilarity.
a. Scatter plots of Gene 1 against Gene 2 for cells c1 to c8: no dropout; with dropout.
b. Hierarchical clustering dendrograms (height) of cells c1 to c8: no dropout; with dropout (Adjusted Rand Index: 0.16); with dropout, CIDR dissimilarity (Adjusted Rand Index: 1.0).
c. Mean squared distance

                        No dropout   With dropout (DO)   CIDR   Shrinkage rate (DO-CIDR)/DO
Between clusters (BC)   13.0         54.5                40.6   0.25
Within clusters (WC)    3.8          18.5                2.1    0.89
Ratio (BC/WC)           3.4          2.9                 19.5

d. Dropout rate function (dropout rate against x, curves P(x) and W(x)); expected distance against x for (x2-x1)=0 and (x2-x1)=4, Euclidean vs CIDR; expected shrinkage rate [E(Data)-E(CIDR)]/E(Data) against x for (x2-x1) = 0, 2, 4, 6, 8, 10, 12.
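The summary numbers in the figure above can be related with a short calculation. The sketch assumes the three mean squared distances are ordered no dropout, with dropout (DO) and CIDR, giving 13.0, 54.5, 40.6 between clusters and 3.8, 18.5, 2.1 within clusters. It then computes the shrinkage rate (DO − CIDR) / DO and the between/within ratio. The inputs are rounded, so the results match the figure only approximately.

```python
def shrinkage_rate(dropout_distance, cidr_distance):
    """Relative reduction in distance when CIDR dissimilarity replaces the raw one."""
    return (dropout_distance - cidr_distance) / dropout_distance


def separation_ratio(between, within):
    """Between-cluster over within-cluster distance; larger means clearer clusters."""
    return between / within


# Mean squared distances read from the figure (assumed column order).
between = {"no_dropout": 13.0, "dropout": 54.5, "cidr": 40.6}
within = {"no_dropout": 3.8, "dropout": 18.5, "cidr": 2.1}

bc_shrink = shrinkage_rate(between["dropout"], between["cidr"])
wc_shrink = shrinkage_rate(within["dropout"], within["cidr"])
print(f"shrinkage between clusters: {bc_shrink:.3f}")
print(f"shrinkage within clusters:  {wc_shrink:.3f}")

ratios = {key: separation_ratio(between[key], within[key]) for key in between}
for key, value in ratios.items():
    print(f"BC/WC ratio ({key}): {value:.1f}")
```

The point of the comparison is that CIDR shrinks within-cluster distances far more than between-cluster distances, so the BC/WC ratio rises well above its value under dropout.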
-
CIDR is fast and accurate
-
Figure. Clusters output by algorithms.
a-e. Scatter plots of PC1 against PC2 with numbered clusters: (a) prcomp, (b) t-SNE, (c) ZIFA, (d) RaceID, (e) CIDR.
f. Adjusted Rand Index (0.0 to 1.0) for prcomp, t-SNE, ZIFA, RaceID and CIDR.
Cell type legend: astrocytes, endothelial, fetal quiescent neurons, fetal replicating neurons, microglia, neurons, oligodendrocytes, oligodendrocyte precursor cells.
-
Figure. Heat map of log(TPM) (colour scale 0 to 15).
Row groups: Neurons, Astrocytes, Oligodendrocytes, Endothelial.
Columns: Neuron 1-3, Astrocyte 1-3, Oligodendrocyte 1-3, Endothelial 1-3, CIDR 1, CIDR 2, CIDR 3, prcomp 1/ZIFA 1/CIDR 4, CIDR 5, prcomp 2/ZIFA 2/CIDR 6, tSNE 1-3, RaceID 1-3.
-
starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality
• Enabling widespread use of VR visualisation using low-cost ($10) VR headsets, and a person’s own smartphone (with a web browser)
• Support interaction using head movement, keyboard, remote gamepad, and voice control
Jianfu Li, Yu Yao
-
Using starmap to visualise a scRNA-seq data set of 68,000 cells
https://www.youtube.com/watch?v=_LLidDFQH8A
-
starmap interaction
-
starmap
starmap demo: https://vccri.github.io/starmap/
starmap source code: https://github.com/VCCRI/starmap
bioRxiv preprint: https://www.biorxiv.org/content/early/2018/05/17/324855
-
Scaling up clustering of scRNA-seq data by borrowing ideas from flow cytometry analysis
-
Clustering methods
Xiaoxin (Sean) Ye
-
Ultrafast grid density-based clustering for single cell data - FlowGrid
Speeding up clustering of single cell data from hours to seconds
Xiaoxin (Sean) Ye https://github.com/VCCRI/FlowGrid
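To give a feel for why grid-based density clustering is fast, here is a deliberately simplified sketch of the general idea: bin the points into a regular grid, keep the bins that contain enough points, and join adjacent dense bins into clusters. It is not FlowGrid's actual algorithm or API. The function name, the parameters bins_per_dim and min_points, and the adjacency rule are all assumptions made for this example.

```python
from collections import defaultdict, deque
from itertools import product


def grid_density_cluster(points, bins_per_dim, min_points):
    """Cluster points by joining adjacent dense grid bins.

    Returns one label per point; -1 marks points that fall in sparse bins.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def bin_of(point):
        index = []
        for d in range(dims):
            span = (highs[d] - lows[d]) or 1.0  # guard against a flat axis
            position = int((point[d] - lows[d]) / span * bins_per_dim)
            index.append(min(position, bins_per_dim - 1))  # maximum goes in last bin
        return tuple(index)

    # Group point indices by the bin they fall into.
    members = defaultdict(list)
    for i, point in enumerate(points):
        members[bin_of(point)].append(i)

    dense = {b for b, idx in members.items() if len(idx) >= min_points}
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]

    # Breadth-first search over dense bins that touch each other.
    bin_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in bin_label:
            continue
        bin_label[start] = next_label
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for offset in offsets:
                neighbour = tuple(c + o for c, o in zip(current, offset))
                if neighbour in dense and neighbour not in bin_label:
                    bin_label[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1

    labels = [-1] * len(points)
    for b, label in bin_label.items():
        for i in members[b]:
            labels[i] = label
    return labels


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (9.8, 9.9), (9.9, 10.0), (10.0, 9.8),
          (5.0, 5.0)]
labels = grid_density_cluster(points, bins_per_dim=5, min_points=2)
print(labels)
```

After the single pass that assigns points to bins, the work depends on the number of non-empty bins, not the number of points, which is what makes this family of methods attractive for millions of events.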
-
FlowCAP I data sets
Lymph
• Diffuse Large B-cell Lymphoma
• Number of events: 10197
• Number of dimensions: 3
StemCell
• Hematopoietic Stem Cell Transplant
• Number of events: 9780
• Number of dimensions: 4
GvHD
• Graft versus Host Disease
• Number of events: 23377
• Number of dimensions: 4
Data source: http://flowcap.flowsite.org/codeanddata/
-
Performance
                                 FlowGrid        FlowSOM         FlowPeaks       Flock
Data set   Events   Dimension   Time(s)  ARI    Time(s)  ARI    Time(s)  ARI    Time(s)  ARI
Lymph      10197    3           0.05     0.84   1.27     0.94   0.18     0.90   0.16     0.89
GvHD       23377    4           0.02     0.98   1.73     0.97   0.43     0.97   0.78     0.69
StemCell   9780     4           0.02     0.85   1.39     0.98   0.10     0.96   0.29     0.95

Events (million)   Dimension   FlowGrid Time(s)   FlowSOM Time(s)   FlowPeaks Time(s)
0.2                4           0.04               5.04              2.98
1.5                4           0.24               33.32             15.11
11.9               4           2.42               303.46            103.99
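The ARI columns report the Adjusted Rand Index, which scores agreement between a clustering and reference labels (1 for identical partitions, about 0 for chance-level agreement). A self-contained sketch of the standard pair-counting formula follows; the function name is illustrative and this is not the code used to produce the table.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two items to compare partitions")

    # Pairs of items that share a cluster in both, in the truth, and in the prediction.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_pairs - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

The first call shows that the index ignores how clusters are numbered; only the grouping of items matters.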
-
[email protected]
http://bioinformatics.victorchang.edu.au
@joshuawkho
We are currently recruiting:
• Lab Head (Faculty position), Bioinformatics
• Postdoctoral Fellow
• Research Assistant
• PhD students (scholarship available)