Single-cell RNA-seq analysis Winter School on Mathematical and Computational Biology 2019, UQ, 2 July 2019 Dr Joshua W. K. Ho Associate Professor School of Biomedical Sciences The University of Hong Kong Dr Kitty Lo Dr Pengyi Yang Prof Jean Yang School of Mathematics and Statistics University of Sydney Sydney Australia


  • Group minions based on their similarity of physical appearance – clustering

    Identifying distinguishing features between different groups of minions – differential expression analysis

  • Example – diverse cell types in the mouse nervous system

    Zeisel et al. (2018), Cell

  • Exponential growth in single cell RNA seq technologies

    Svensson et al., Nature Protocols (2018)

  • Droplet-based technologies are now dominating

    Macosko et al. (2015), Cell

    10X Genomics is a commercial provider of a droplet-based scRNAseq platform

  • scRNAseq experiments approaching 1 million cells

    Saunders et al. (2018), Cell

    690,000 individual cells from 9 regions of adult mouse brain

  • Number of scRNAseq tools also increasing rapidly

    Downloaded from www.scrna-tools.org

  • Steps in scRNA-seq analysis

    Zappia et al. (2018)

    Software
    •  CellRanger for 10X Genomics data
       •  https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome
    •  Seurat (all-purpose single cell R package)
       •  https://satijalab.org/seurat/
    •  Scanpy (a Python package)
       •  https://scanpy.readthedocs.io/
    •  Follow their online tutorials … easy to use

  • Batch effect removal

    •  Seurat (all-purpose single cell R package) for very basic normalization
    •  Batch effect correction
       •  mnnCorrect
       •  ZINB-Wave
       •  scMerge
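As a rough illustration of what "batch effect correction" means, the sketch below centres one gene's expression within each batch. This is a deliberately simple stand-in, not the method used by mnnCorrect, ZINB-Wave or scMerge; the function name, the expression values and the batch labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from its cells, then add back the overall mean."""
    overall = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_mean[batch] + overall
            for value, batch in zip(values, batches)]


# Hypothetical log-expression of one gene in six cells from two batches.
expr = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
batch = ["A", "A", "A", "B", "B", "B"]

corrected = center_by_batch(expr, batch)
print(corrected)
```

Centring like this only makes sense when every batch contains a similar mix of cell types; when the composition differs between batches it also removes real biological differences, which is the situation the dedicated tools listed above are designed for.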

  • Liver fetal development time course datasets

    Time points: E9.5, E10.5, E11.5, E12.5, E13.5, E14.5, E15.5, E16.5, E17.5

    Datasets: GSE87795 (Su et al.), GSE90047 (Yang et al.), GSE87038 (Dong et al.), GSE96981 (Camp et al.)

    Cell counts shown: N = 320 cells, N = 389 cells, N = 79 cells, N = 448 cells

  • tSNE of liver fetal development time course datasets

    Panels: highlighted by cell types; highlighted by batches

    Challenge: Strong “batch effect”

  • scMerge

    scMerge R package and website: https://sydneybiox.github.io/scMerge/

    PNAS: https://doi.org/10.1073/pnas.1820006116

  • Coming back to our motivational data – Liver fetal development time course datasets

    [Figure: tSNE plots (tSNE1 vs tSNE2) for logcounts and scMerge_scSEG, labelled "Before scMerge" and "After scMerge", coloured by cell type (cholangiocyte, endothelial cell, epithelial cell, hematopoietic, hepatoblast/hepatocyte, immune cell, mesenchymal cell, stellate cell) and by batch (GSE87038, GSE87795, GSE90047, GSE96981). Annotated populations: E10.5 hepatoblasts, E17.5 cholangiocytes, E17.5 hepatocytes.]

  • Cell assignment

    Science questions
    •  What cell types are present in the dataset?
    •  Can we identify the cell types?
    •  What is the cell type composition?
    •  Are the cells transitioning from one state to another?

    Analysis techniques
    •  Clustering (unsupervised learning)
    •  Classification (supervised learning)
    •  Dimension reduction

  • Dimension reduced plot of our data (tSNE plot)

    [Figure: t-SNE plot, axes tsne1 and tsne2]

    How many cell types are there? What are the cell types?

  • k-means clustering

    [Figure: t-SNE plot, axes tsne1 and tsne2]

    How many cell types are there? What are the cell types?
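To make the k-means idea concrete, here is a minimal pure-Python sketch of Lloyd's algorithm on 2-D coordinates. The points are hypothetical, the number of clusters k is chosen by hand, and the deterministic farthest-first initialisation is an assumption made so the example is reproducible; real analyses would normally rely on the clustering routines in packages such as Seurat or Scanpy.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centres(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen centres."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centres))
        )
    return centres


def kmeans(points, k, max_iter=100):
    """Alternate between assigning points to the nearest centre and recomputing centres."""
    centres = farthest_first_centres(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


# Hypothetical 2-D coordinates for six cells forming two obvious groups.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels, centres = kmeans(points, k=2)
print(labels)
print(centres)
```

The algorithm answers "which cells group together" only after k has been fixed, so the question of how many cell types there are still has to be settled separately.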

  • Clustering algorithms

    k-means

    Hierarchical

    RaceID

    SC3

    CIDR

    countClust

    RCA

    SIMLR

    Luke Zappia, et al. PLoS Comp. Bio. 2018

    25%+

    Clustering algorithms in single cell research

  • Phase 4: Gene identification

    Science questions
    •  Which genes are differentially expressed between cell types?
    •  What are the marker genes for each cell type?

  • Differences between single cell and bulk RNAseq

    •  Single cell gene expressions show a bimodal expression pattern – abundant genes are either highly expressed or undetected.
    •  This can be technical (drop-outs).
       •  Drop-outs lead to technical zeroes in the data.
       •  Technical zeroes are due to low capture efficiency in scRNAseq experiments.
    •  Many methods have been proposed to deal with drop-outs
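The link between capture efficiency and technical zeroes can be sketched with a toy simulation in which every transcript molecule is captured independently with a fixed probability. The molecule count per cell, the efficiencies and the function names are hypothetical, and real drop-out behaviour is more complex than this binomial model.

```python
import random


def capture(true_count, efficiency, rng):
    """Molecules observed when each one is captured independently with probability `efficiency`."""
    return sum(1 for _ in range(true_count) if rng.random() < efficiency)


def zero_fraction(true_counts, efficiency, seed=0):
    """Fraction of cells in which the gene is observed with a count of zero."""
    rng = random.Random(seed)
    observed = [capture(count, efficiency, rng) for count in true_counts]
    return sum(1 for count in observed if count == 0) / len(observed)


# Hypothetical gene present at 5 molecules in each of 1000 cells.
true_counts = [5] * 1000

for efficiency in (1.0, 0.3, 0.1):
    print(efficiency, zero_fraction(true_counts, efficiency))
```

Under this model the expected zero fraction is `(1 - efficiency) ** true_count`, so a gene that is expressed in every cell can still appear undetected in many of them when capture efficiency is low.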

  • Differential expression analysis

    •  Simple statistical test
       •  Wilcoxon rank test, t-test
    •  Methods developed for bulk RNAseq DE
       •  DESeq2
       •  EdgeR
       •  Voom-Limma
    •  scRNA specific
       •  Seurat
       •  MAST
       •  DECENT
       •  D3E
       •  … many more!
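As an illustration of the simplest option in the list, the sketch below computes a Wilcoxon rank-sum (Mann-Whitney) test for one gene between two groups of cells. It uses mid-ranks for ties and a plain normal approximation with no tie or continuity correction, which is an assumption made to keep it short; the expression values are hypothetical.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (U statistic for x, z score, two-sided p-value) by normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, z, p


# Hypothetical counts of one gene in two clusters of five cells each.
cluster_a = [0, 0, 1, 2, 0]
cluster_b = [5, 7, 6, 9, 8]

u, z, p = rank_sum_test(cluster_a, cluster_b)
print(u, z, p)
```

With only five cells per group the normal approximation is crude; an exact test or a library implementation would be preferable for real data.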

  • DE methods comparisons for scRNAseq

    Soneson and Robinson (2018), Nature Methods

  • LKS Faculty of Medicine

    Making scRNA-seq analysis more scalable

  • Cloud computing to enable scalability

    •  Cloud computing + Big Data Framework
    •  Cloud computing
       •  A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources
       •  Key characteristics – elasticity + pay-as-you-go model
       •  Advantages – low entry cost + scalability
    •  Big Data framework
       •  Hadoop – a software framework for distributed processing of big data in large scale cluster (YARN for resource management, HDFS for big data storage, and MapReduce for analytics engine)
       •  Spark – a general purpose data-analytics engine for analysis of big data using in-memory computation (allows a speed up of up to 100x compared to MapReduce)
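The map and reduce steps behind these frameworks can be sketched in plain Python. This is not the Hadoop or Spark API: the chunks, gene names and function names are hypothetical, and everything runs in one process, whereas a cluster would run the map step on many nodes in parallel.

```python
from collections import Counter
from functools import reduce


def map_chunk(reads):
    """Map step: count reads per gene within one chunk of the input."""
    return Counter(reads)


def merge(left, right):
    """Reduce step: combine two partial count tables into one."""
    return left + right


# Hypothetical input: each read is represented by the gene it aligned to,
# and the input has been split into three chunks.
chunks = [
    ["Actb", "Gapdh", "Actb"],
    ["Gapdh", "Sox2"],
    ["Actb"],
]

partial_counts = [map_chunk(chunk) for chunk in chunks]  # parallel on a cluster
total = reduce(merge, partial_counts, Counter())
print(dict(total))
```

Because each chunk is processed independently and the partial results merge in any order, adding nodes shortens the map step without changing the final counts.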

  • Falco framework

    [Figure: Falco framework, with MapReduce and Spark components]

    Yang et al. (2017), Bioinformatics

    Andrian Yang, Michael Troup

  • Falco framework features

    •  Ease of use
       •  Falco provides helper script to launch EMR cluster and submit jobs to the cluster
       •  User can easily configure the cluster and jobs by modifying the configuration file passed to the helper script
    •  Customisation
       •  Falco allows user to add custom alignment and/or quantification tools
       •  User will need to implement custom function to call the aligner/quantification tool
       •  Custom tool must be compatible with divide-and-conquer approach

    [job_config]
    name = mESC analysis job
    action_on_failure = CONTINUE
    analysis_script = run_pipeline_multiple_files.py
    analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts
    analysis_script_local_location = source/spark_runner
    upload_analysis_script = True

    [spark_config]
    driver_memory = 30g
    executor_memory = 30g

    [script_arguments]
    input_location = s3://[YOUR-BUCKET]/mESC_clean
    output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
    annotation_file = vM9_ERCC.gtf
    strand_specificity = NONE
    run_picard = True
    aligner_tool = STAR
    aligner_extra_args =
    counter_tool = featureCount
    counter_extra_args = -t exon -g gene_name
    picard_extra_args =
    region = us-west-2

    Sample configuration for running analysis job
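The sample configuration uses an INI-style layout of sections and `key = value` pairs. The sketch below reads an abridged copy of it with Python's standard `configparser` module to show how the settings are organised; how Falco's own helper script parses the file is not shown in the slides, so the use of `configparser` here is an assumption.

```python
import configparser

# Abridged copy of the sample configuration shown on the slide.
CONFIG_TEXT = """
[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
region = us-west-2
"""

# interpolation=None keeps values literal, so no character is treated specially.
config = configparser.ConfigParser(interpolation=None)
config.read_string(CONFIG_TEXT)

aligner = config["script_arguments"]["aligner_tool"]
run_picard = config["script_arguments"].getboolean("run_picard")
executor_memory = config["spark_config"]["executor_memory"]
extra_args = config["script_arguments"]["aligner_extra_args"]

print(config.sections())
print(aligner, run_picard, executor_memory, repr(extra_args))
```

Keys left blank, such as `aligner_extra_args`, come back as empty strings, which matches the slide's use of them as optional extra arguments.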

  • Benchmarking

    •  Single-cell RNA-seq data sets
       •  Mouse embryonic stem cell (mESC) data (869 samples)
          •  200bp paired-end reads, 1.28×10^12 bases, 1.02 Tb FASTQ.gz files
       •  Human brain data (466 samples)
          •  100bp paired-end reads, 2.95×10^11 bases, 213.66 Gb FASTQ.gz files
    •  Performance comparison of Falco against single-node
       •  STAR+featureCount (S+F)
          •  Mouse: speedup of 2.6x – 33.4x
          •  Brain: speedup of 5.1x – 145.4x
       •  HISAT2+HTSeq (H+H)
          •  Mouse: speedup of 2.5x – 58.4x
          •  Brain: speedup of 4.0x – 132.5x

    Table 1. Runtime analysis of single cell datasets

    System       Nodes               Mouse - embryonic stem cell (hours)    Human - brain (hours)
                                     S+F        H+H                         S+F        H+H
    Standalone   1 (1 process)       93.7       154.7                       …          …
                 1 (5 processes)     29.3       33.8                        …          …
                 1 (12 processes)    21.1       16.4                        …          …
                 1 (16 processes)    18.5       13.6                        …          …
    Falco        10                  7.0        2.7                         …          …
                 20                  4.1        1.6                         …          …
                 30                  3.3        1.4                         …          …
                 40                  2.8        1.1                         …          …
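A speedup figure is the baseline runtime divided by the improved runtime. The sketch below applies that to the mouse STAR+featureCount (S+F) column of Table 1, pairing the fastest standalone run with the slowest Falco run for the lower bound and the slowest standalone run with the fastest Falco run for the upper bound. That pairing is an assumption about how the quoted range was derived.

```python
# Mouse S+F runtimes in hours, taken from Table 1.
standalone_hours = {"1 process": 93.7, "16 processes": 18.5}
falco_hours = {10: 7.0, 40: 2.8}  # keyed by number of nodes


def speedup(baseline_hours, improved_hours):
    """How many times faster the improved run is than the baseline."""
    return baseline_hours / improved_hours


lowest = speedup(standalone_hours["16 processes"], falco_hours[10])
highest = speedup(standalone_hours["1 process"], falco_hours[40])

print(f"{lowest:.1f}x to {highest:.1f}x")
```

With the rounded table values this gives roughly 2.6x and 33.5x, close to the 2.6x – 33.4x range quoted for the mouse data; the small gap is consistent with the table reporting runtimes to one decimal place.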

  • Cost effectiveness using AWS spot instances

    •  Utilising spot instances
       •  AWS allows utilisation of unused Amazon computing capacity – known as Spot instances
       •  Typically cheaper compared to ‘on-demand’ cost
       •  To use a spot instance, the user needs to bid for the resource
    •  Use of spot instances for analysis provides a savings of ~65% compared to using ‘on-demand’ instances
       •  Alternative use - decrease runtime by utilising more instances for a given ‘on-demand’ price

    Figure 3. Spot instance price history for September to October

    Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

    Dataset         Number of nodes   Time (hours)   On-demand cost (USD)   Spot cost (USD)   % Savings
    Mouse - ESC     10                8              247.20                 85.67             65.34
                    20                5              301.00                 99.09             67.08
                    30                4              258.00                 115.71            55.15
                    40                3              356.40                 114.11            67.98
    Human - brain   10                3              92.70                  32.13             65.34
                    20                2              120.40                 39.64             67.08
                    30                2              179.00                 57.68             67.68
                    40                2              237.60                 76.08             67.98
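The "% Savings" column follows from the two cost columns as one minus the ratio of spot cost to on-demand cost. The sketch below recomputes it for three of the Mouse - ESC rows of Table 2; the function and variable names are hypothetical.

```python
def percent_savings(on_demand_usd, spot_usd):
    """Percentage saved by paying the spot price instead of the on-demand price."""
    return (1 - spot_usd / on_demand_usd) * 100


# (nodes, on-demand cost in USD, spot cost in USD) from the Mouse - ESC rows of Table 2.
mouse_rows = [
    (10, 247.20, 85.67),
    (20, 301.00, 99.09),
    (40, 356.40, 114.11),
]

savings = {nodes: round(percent_savings(on_demand, spot), 2)
           for nodes, on_demand, spot in mouse_rows}

for nodes, value in savings.items():
    print(f"{nodes} nodes: {value:.2f}% saved")
```

The same ratio also supports the alternative use mentioned above: at roughly one third of the on-demand price, about three times as many spot instances fit into a given on-demand budget.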

  • Scaling up to a larger data set

    •  Data set (for Standalone + Falco)
       •  Single-cell Mouse oligodendrocyte from central nervous system (SRP066613)
       •  6,283 samples of 50bp single-ended reads, totalling to 231.02 Gbp stored in 200 Gb of fastq.gz files
    •  Standalone + Falco
       •  Preprocessing with Trimmomatic
       •  Alignment with STAR
       •  Quantification with featureCount
       •  Clustering with CIDR
    •  Cell Ranger – custom pipeline designed by chromium
       •  Alignment with STAR
       •  Timing is approximated from runtime of a different mouse scRNA-seq dataset

    [Figure: number of cells processed per second for Standalone (1 process, 12 processes, 16 processes), Cell Ranger, and Falco (10 nodes, 40 nodes)]

  • Falco software

    •  Source code
       •  Falco is available to download from GitHub
    •  Our work on Falco has been featured in a Nature Toolbox article

    Check out Falco at github.com/VCCRI/Falco

  • starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality

    •  Enabling widespread use of VR visualisation using low-cost ($10) VR headsets, and a person’s own smartphone (with a web browser)

    •  Support interaction using head movement, keyboard, remote gamepad, and voice control

    Jianfu Li, Yu Yao

  • Using starmap to visualise a data set of 68,000 cells from a scRNA-seq data

  • starmap

    •  starmap demo: https://vccri.github.io/starmap/
    •  starmap source code: https://github.com/VCCRI/starmap
    •  bioRxiv preprint: https://www.biorxiv.org/content/early/2018/05/17/324855

  • https://www.abacbs.org/giw/

    Full-paper submission (for oral presentation and journal publication): This week!

    Abstract submission (for oral or poster presentation): 1 September 2019?

  • THANK YOU

    We are recruiting:
    -  PhD students ($57K pa scholarship)
    -  Research assistants
    -  Postdoctoral fellows
    -  Bioinformaticians (staff)
    -  Faculty

    [email protected]
    https://holab-hku.github.io/
    @joshuawkho

    HKU-USydney Strategic Partnership Fund – ‘Single Cell Plus’