bioinformatics analysis of single-cell rna-seq...

34
Dr Victor Chang AC 1936-1991, Pioneering Cardiothoracic Surgeon and Humanitarian Bioinformatics analysis of single-cell RNA-seq data Joshua W. K. Ho, PhD Head, Bioinformatics and Systems Medicine Laboratory Victor Chang Cardiac Research Institute Senior Lecturer (Conjoint), UNSW Sydney @joshuawkho 2018 Winter School in Mathematical & Computational Biology, University of Queensland, 3 July 2017

Upload: others

Post on 21-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

  • DrVictorChangAC1936-1991,PioneeringCardiothoracicSurgeonandHumanitarian

    Bioinformatics analysis of single-cell RNA-seq data

    Joshua W. K. Ho, PhD Head, Bioinformatics and Systems Medicine Laboratory

    Victor Chang Cardiac Research Institute Senior Lecturer (Conjoint), UNSW Sydney

    @joshuawkho

    2018 Winter School in Mathematical & Computational Biology, University of Queensland, 3 July 2017

  • Bioinforma;cschallenges:-  Scalability(>1millioncells)-  Technicalnoise(dropouts)

  • RNA-seqalignmentandtranscriptreconstruc;on

  • Cloud computing to enable scalability

    Cloud computing + Big Data Framework •  Cloud computing

    •  A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources

    •  Key characteristics – elasticity + pay-as-you-go model •  Advantages – low entry cost + scalability

    •  Big Data framework •  Hadoop – a software framework for distributed processing of big data in large scale cluster (YARN for resource

    management, HDFS for big data storage, and MapReduce for analytics engine) •  Spark – a general purpose data-analytics engine for analysis of big data using in-memory computation (allows a

    speed up of up to 100x compared to MapReduce)

  • Existing tools

    •  Halvade (https://github.com/biointec/halvade) •  Written in Hadoop MapReduce •  Designed to perform variant calling of genomic data from FASTQ files •  Provides support for transcriptomic analysis

    •  SparkBWA (https://github.com/citiususc/SparkBWA) •  Written in Spark •  Designed to perform alignment of FASTQ files only

    •  SparkSeq (https://bitbucket.org/mwiewiorka/sparkseq/wiki/Home) •  Written in Spark •  Designed to perform interactive analysis of BAM files

    •  Limitations: •  Halvade and SparkBWA does not offer multi-sample analysis •  SparkSeq does not perform alignment – which is the main bottleneck in analysis

  • Falco framework

    MapReduce Spark

    AndrianYang Yangetal(2017)Bioinforma)csMichaelTroup

  • Falco framework features

    Ease of use •  Falco provides helper script to launch EMR cluster and submit

    jobs to the cluster •  User can easily configure the cluster and jobs by modifying

    the configuration file passed to the helper script

    Customisation •  Falco allows user to add custom alignment and/or quantification

    tools •  User will need to implement custom function to call the

    aligner/quantification tool •  Custom tool must be compatible with divide-and-conquer

    approach

    [job_config] !name = mESC analysis job !action_on_failure = CONTINUE !analysis_script = run_pipeline_multiple_files.py !analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts !analysis_script_local_location = source/spark_runner !upload_analysis_script = True !![spark_config] !driver_memory = 30g !executor_memory = 30g !![script_arguments] !input_location = s3://[YOUR-BUCKET]/mESC_clean !output_location = s3://[YOUR-BUCKET]/mESC_gene_counts !annotation_file = vM9_ERCC.gtf !strand_specificity = NONE !run_picard = True !aligner_tool = STAR !aligner_extra_args = !counter_tool = featureCount!counter_extra_args = -t exon -g gene_name!picard_extra_args = !region = us-west-2 !

    Sample configuration for running analysis job

  • Benchmarking

    •  Single-cell RNA-seq data sets •  Mouse embryonic stem cell (mESC) data (869

    samples) •  200bp paired-end reads,1.28×1012 bases, 1.02Tb

    FASTQ.gz files) •  Human brain data (466 samples)

    •  100bp paired-end reads, 2.95×1011 bases, 213.66 Gb FASTQ.gz files

    •  Performance comparison of Falco against single-node

    •  STAR+featureCount (S+F) •  Mouse: speedup of 2.6x – 33.4x •  Brain: speedup of 5.1x – 145.4x

    •  HISAT2+HTSeq (H+H) •  Mouse: speedup of 2.5x – 58.4x •  Brain: speedup of 4.0x – 132.5x

    System Nodes Mouse - embryonic stem cell (hours)

    Human - brain (hours)

    S+F H+H S+F H+H

    Standalone

    1 (1 process) 93.7 154.7 85.67 65.34

    1 (5 processes) 29.3 33.8 99.09 67.08

    1 (12 processes) 21.1 16.4 115.71 55.15

    1 (16 processes) 18.5 13.6 114.11 67.98

    Falco

    10 7.0 2.7 32.13 65.34

    20 4.1 1.6 39.64 67.08

    30 3.3 1.4 57.68 67.68

    40 2.8 1.1 76.08 67.98

    Table 1. Runtime analysis of single cell datasets

  • Cost effectiveness by using AWS spot instances

    Utilising spot instances •  AWS allows utilisation of unused Amazon computing capacity – known as

    Spot instances •  Typically cheaper compared to ‘on-demand’ cost

    •  To use spot instance, user needs bid for the resource •  Use of spot instance for analysis provides a savings of ~65% compared

    to using ‘on-demand’ instances •  Alternative use - decrease runtime by utilising more instances for a

    given ‘on-demand’ price Figure 3. Spot instance price history for September to October

    Table2.Falcocostanalysis-on-demandvsspotinstances

    Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

    Dataset Number of nodes

    Time (hours)

    On-demand cost (USD)

    Spot cost (USD)

    % Savings

    Mouse - ESC

    10 8 247.20 85.67 65.34 20 5 301.00 99.09 67.08 30 4 258.00 115.71 55.15 40 3 356.40 114.11 67.98

    Human - brain

    10 3 92.70 32.13 65.34 20 2 120.40 39.64 67.08 30 2 179.00 57.68 67.68 40 2 237.60 76.08 67.98

    Table 3. Falco cost analysis - on-demand vs spot instances for HISAT2+HTSeq

    Dataset Number of nodes

    Time (hours)

    On-demand cost (USD)

    Spot cost (USD)

    % Savings

    Mouse - ESC

    10 12 370.80 128.40 65.37

    20 7 421.40 138.60 67.11

    30 5 447.50 144.50 67.71

    40 4 475.20 152.00 68.01

    Human - brain

    10 5 154.50 53.50 65.37

    20 3 180.60 59.40 67.11

    30 2 179.00 57.80 67.71

    40 2 237.60 76.00 68.01

  • Scaling up to a larger data set

    Data set (for Standalone + Falco) •  Single-cell Mouse oligodendrocyte from central nervous

    system (SRP066613) •  6,283 samples of 50bp single-ended reads, totalling to

    231.02 Gbp stored in 200 Gb of fastq.gz file. •  Standalone + Falco

    •  Preprocessing with Trimmomatic •  Alignment with STAR •  Quantification with featureCount •  Clustering with CIDR

    •  Cell Ranger – custom pipeline designed by chromium •  Alignment with STAR •  Timing is approximated from runtime of a different

    mouse scRNA-seq dataset 0.0

    0.5

    1.0

    1.5

    1 Process 12 Processes16 Processes Cell Ranger

    Standalone

    10 Nodes 40 Nodes

    Num

    ber o

    f cel

    ls p

    roce

    ssed

    per

    sec

    onds

    Falco

  • Next step – using Falco for transcript reconstruction

    AndrianYang AbhinavKishore

  • Discovery of novel transcript isoforms in published data

    •  Identification of novel transcript and isoform

  • Availability

    Source code •  Falco is available to download from Github •  Our work on Falco has been featured in a Nature Toolbox article Checkout Falco at

    github.com/VCCRI/Falco

  • Technical noise in scRNA-seq: Dropouts

    Figure 1 (a) Types of cell-to-cell variability observed in single-cell RNA-seq measurements. A smoothed scatter plot compares gene expression estimates from two cells of the same type (MEF cells), illustrating prevalence of dropout events, over-dispersion, and high-magnitude outliers.

    Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742

  • The dropout problem in scRNA-seq data analysis

    Fig. 1b. Heat maps showing the relationship between dropout rate and mean non-zero expression level for three published single-cell data sets including an approximate double exponential model fit.

    Pierson, E. and Yau, Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241

  • Dropouts

    1. What is the cause of zero read counts? •  Biological reason: o  True non-expression: Stochastic variability due to cell-to-cell variations (transcriptional burst)

    •  Technical reason (dropout): o  Low starting mRNA that cause a transcript to be ‘missed’ during the initial reverse transcription step, and hence not

    being detected during sequencing – cannot be recovered by deeper sequencing! o  Amplification biases o  Low sequencing depth o  Impact clustering by inflating

    cell-to-cell dissimilarity

  • Dropouts

    How do we deal with dropouts?

    •  Ignore dropouts •  Keep the zeros, and proceed as usual •  Remove rows that have ‘too many’ zeros, then proceed as usual •  Focus on only key ‘marker genes’ that are not excessively affected by zeros

    •  Account for the dropouts explicitly through a statistical mixture model •  When performing differential expression analysis (for example), take into account of variance that can be attributed

    to excessive zeros (e.g, SCDE; Kharchenko et al. 2014) •  ZIFA: Modified probabilistic principal component analysis (PCA) that incorporate global zero-inflation parameter to

    account for dropouts (Pierson et al. 2015) •  Imputation (using a variety of methods)

    Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742

    Pierson, E. and Yau, Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241

  • h/ps://github.com/VCCRI/CIDR

    DrPaulLin

  • S - B f E

    S-

    Bf

    ENo dropout

    Gene k

    Gen

    e - ck

    c-c/ cB

    cw

    cf

    cP

    cE

    S - B f E

    S-

    Bf

    E

    With dropout

    Gene k

    ck

    c-c/ cB

    cw

    cf

    cP

    cEcB

    c/

    ck c-

    cw cP cf cE

    k-

    /B

    w

    No dropout

    Hei

    ght

    ck c/

    c- cB cw cP cf cE

    S-

    Bf

    EkS

    With dropout

    cB

    c/ ck c-

    cw cP cf cE

    S-

    Bf

    EkS

    With dropoutq CIDR dissimilarity

    Adjusted Rand Index[ SOkf Adjusted Rand Index[ kOS

    Gen

    e -

    a b

    c

    CIDR

    13.0 54.5 40.6 0.25

    3.8 18.5 2.1 0.89

    3.4 2.9 19.5

    Meansquared distance

    Nodropout

    Withdropout

    :DO.

    Shrinkage rate:DO-CIDR.

    )DO

    Betweenclusters :BC.

    Withinclusters :WC.

    Ratio :BC)WC.

    S - B f E kS k-

    SOS

    SO-

    SOB

    SOf

    SOE

    kOS

    Dropout rate function

    x

    Dro

    pout

    rate

    P:x.W:x.

    EuclideanCIDR

    S - B f E kS k-

    Sw

    kSkw

    -S

    x

    :x--xk.=S

    Exp

    ecte

    d di

    stan

    ce

    S - B f E kS k-

    SkS

    -S/S

    BSwS

    fS

    x

    :x--xk.=B

    S - B f E kS k-

    SOS

    SO-

    SOB

    SOf

    SOE

    kOS

    x

    :x--xk.=S:x--xk.=-:x--xk.=B:x--xk.=f:x--xk.=E:x--xk.=kS:x--xk.=k-

    [E:D

    ata.

    -E:C

    IDR

    .] ) E

    :Dat

    a.

    Expected shrinkage rate

    EuclideanCIDR

    d

  • CIDRisfastandaccurate

  • −200 −100 0 100−150

    −100

    −50

    0

    50

    100

    PC1

    PC2

    1

    2

    aprcomp

    −40 −20 0 20 40 60

    −20

    0

    20

    40

    60

    80

    PC1

    PC2

    1

    23

    bt−SNE

    −2 −1 0 1 2 3 4 5

    −1

    0

    1

    2

    3

    4

    PC1

    PC2

    1

    2

    cZIFA

    −40 −20 0 20 40 60

    −20

    0

    20

    40

    60

    80

    PC1

    PC2

    1

    2

    3

    dRaceID

    −50 0 50

    −60

    −40

    −20

    0

    20

    40

    60

    PC1

    PC2

    1 23

    4

    5

    6

    eCIDR

    prcomp t−SNE ZIFA RaceID CIDRAd

    just

    ed R

    and

    Inde

    x0.0

    0.2

    0.4

    0.6

    0.8

    1.0f

    astrocytesendothelial

    fetal quiescent neuronsfetal replicating neurons

    microglianeurons

    oligodendrocytesoligodendrocyte precursor cells

    Clusters output by algorithms:

  • Neurons

    Astrocytes

    Oligodendrocytes

    Endothelial

    Nuer

    on 1

    Neur

    on 2

    Neur

    on 3

    Astro

    cyte

    1As

    trocy

    te 2

    Astro

    cyte

    3O

    ligod

    endr

    ocyt

    e 1

    Olig

    oden

    droc

    yte

    2O

    ligod

    endr

    ocyt

    e 3

    Endo

    thel

    ia 1

    Endo

    thel

    ial 2

    Endo

    thel

    ial 3

    CIDR

    1CI

    DR 2

    CIDR

    3pr

    com

    p 1/

    ZIFA

    1/C

    IDR

    4CI

    DR 5

    prco

    mp

    2/ZI

    FA 2

    /CID

    R 6

    tSN

    E 1

    tSN

    E 2

    tSN

    E 3

    Race

    ID 1

    Race

    ID 2

    Race

    ID 3

    log(TPM)151050

  • starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality

    •  EnablingwidespreaduseofVRvisualisa;onusinglow-cost($10)VRheadsets,andaperson’sownsmartphone(withawebbrowser)

    •  Supportinterac;onusingheadmovement,keyboard,remotegamepad,andvoicecontrol

    JianfuLiYuYao

  • Usingstarmaptovisualiseadatasetof68,000cellsfromascRNA-seqdata

    h_ps://www.youtube.com/watch?v=_LLidDFQH8A

  • Starmapinterac;on

  • starmapstarmapdemo:h/ps://vccri.github.io/starmap/

    starmapsourcecode:h/ps://github.com/VCCRI/starmapbioRxivpreprint:h/ps://www.biorxiv.org/content/early/2018/05/17/324855

  • DrVictorChangAC1936-1991,PioneeringCardiothoracicSurgeonandHumanitarian

    ScalingupclusteringofscRNA-seqdatabyborrowingideasfromflowcytometryanalysis

  • Clusteringmethods

    Xiaoxin(Sean)Ye

  • Ultrafastgriddensity-basedclusteringforsinglecelldata-FlowGrid

    Speedingupclusteringofsinglecelldatafromhourstoseconds

    Xiaoxin(Sean)Ye h_ps://github.com/VCCRI/FlowGrid

  • FlowCap I DataSets

    Lymph• DiffuseLargeB-cellLymphoma

    • Numberofevents:10197• Numberofdimension:3

    StemCell• Hematopoie;cStemCellTransplant

    • Numberofevents:9780• Numberofdimension:4

    GvHD• GralversusHostDisease• Numberofevents:23377• Numberofdimension:4

    Datasource:h_p://flowcap.flowsite.org/codeanddata/

  • Performance

    DataSet Events Dimension FlowGrid FlowSOM FlowPeaks Flock Time(s) ARI Time(s) ARI Time(s) ARI Time(s) ARI

    Lymph 10197 3 0.05 0.84 1.27 0.94 0.18 0.90 0.16 0.89 GvHD 23377 4 0.02 0.98 1.73 0.97 0.43 0.97 0.78 0.69

    StemCell 9780 4 0.02 0.85 1.39 0.98 0.10 0.96 0.29 0.95

    Events(million) Dimension FlowGridTime(s) FlowSOMTime(s) FlowPeaksTime(s) 0.2 4 0.04 5.04 2.98 1.5 4 0.24 33.32 15.11 11.9 4 2.42 303.46 103.99

  • [email protected]_p://bioinforma;cs.victorchang.edu.au@joshuawkho

    Wearecurrentlyrecrui;ng:•  LabHead(Facultyposi;on),

    Bioinforma;cs•  PostdoctoralFellow•  ResearchAssistant•  PhDstudents(scholarshipavailable)