Single-cell RNA-seq analysis

Winter School on Mathematical and Computational Biology 2019, UQ, 2 July 2019

Dr Joshua W. K. Ho, Associate Professor, School of Biomedical Sciences, The University of Hong Kong

Dr Kitty Lo, Dr Pengyi Yang, Prof Jean Yang

School of Mathematics and Statistics, University of Sydney, Sydney, Australia

Group minions based on their similarity of physical appearance – clustering

Identifying distinguishing features between different groups of minions – differential expression analysis

Example – diverse cell types in the mouse nervous system

Zeisel et al. (2018), Cell

Exponential growth in single-cell RNA-seq technologies

Svensson et al., Nature Protocols (2018)

Droplet-based technologies are now dominating

Macosko et al. (2015), Cell

10X Genomics is a commercial provider of a droplet-based scRNA-seq platform

scRNA-seq experiments approaching 1 million cells

Saunders et al. (2018), Cell

690,000 individual cells from 9 regions of the adult mouse brain

Number of scRNA-seq tools also increasing rapidly

Downloaded from www.scrna-tools.org

Steps in scRNA-seq analysis

Zappia et al. (2018)

Software
•  CellRanger for 10X Genomics data
    •  https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome
•  Seurat (all-purpose single cell R package)
    •  https://satijalab.org/seurat/
•  Scanpy (a Python package)
    •  https://scanpy.readthedocs.io/
•  Follow their online tutorials; they are easy to use

Batch effect removal

•  Seurat (all-purpose single cell R package) for very basic normalization
•  Batch effect correction
    •  mnnCorrect
    •  ZINB-WaVE
    •  scMerge
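As an illustration of what "very basic normalization" can mean, here is a minimal plain-Python sketch of library-size normalisation followed by a log(1 + x) transform. The scale factor of 10,000 and the toy counts are assumptions chosen for illustration; this is not the code of Seurat or any other package, and it does not correct batch effects.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Divide one cell's counts by its total, rescale, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # A cell with no counts stays all-zero rather than dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]

# Two hypothetical cells with the same composition but a 10-fold
# difference in sequencing depth.
cell_a = [10, 0, 90]
cell_b = [100, 0, 900]

norm_a = log_normalize(cell_a)
norm_b = log_normalize(cell_b)
print(norm_a)
print(norm_b)
```

After normalisation the two cells have the same profile, because only the relative abundance within each cell is kept.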

Time points: E9.5, E10.5, E11.5, E12.5, E13.5, E14.5, E15.5, E16.5, E17.5

Datasets: GSE87795 (Su et al.), GSE90047 (Yang et al.), GSE87038 (Dong et al.), GSE96981 (Camp et al.)

Dataset sizes shown on the slide: N = 320 cells, N = 389 cells, N = 79 cells, N = 448 cells

Liver fetal development time course datasets

tSNE of liver fetal development time course datasets

Panels: highlighted by cell types; highlighted by batches

Challenge: Strong “batch effect”

scMerge

scMerge R package and website: https://sydneybiox.github.io/scMerge/

PNAS: https://doi.org/10.1073/pnas.1820006116

Coming back to our motivating data – liver fetal development time course datasets

Figure: tSNE plots (tSNE1 vs tSNE2) of the liver datasets before scMerge (logcounts) and after scMerge (scMerge_scSEG), coloured by cell type (cholangiocyte, endothelial cell, epithelial cell, hematopoietic, hepatoblast/hepatocyte, immune cell, mesenchymal cell, stellate cell) and by batch (GSE87038, GSE87795, GSE90047, GSE96981). Labelled populations: E10.5 hepatoblasts, E17.5 cholangiocytes, E17.5 hepatocytes.

Cell assignment

Science questions
•  What cell types are present in the dataset?
•  Can we identify the cell types?
•  What is the cell type composition?
•  Are the cells transitioning from one state to another?

Analysis techniques
•  Clustering (unsupervised learning)
•  Classification (supervised learning)
•  Dimension reduction

Dimension-reduced plot of our data (tSNE plot)

Figure: t-SNE plot of the cells (axes: tsne1, tsne2)

How many cell types are there? What are the cell types?

k-means clustering

Figure: t-SNE plot of the cells (axes: tsne1, tsne2)

How many cell types are there? What are the cell types?
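The following is a minimal plain-Python sketch of k-means (Lloyd's algorithm) on 2D coordinates such as a tSNE embedding. The coordinates and starting centroids are hypothetical. Note that the number of clusters has to be supplied by the user, which is why the question "how many cell types are there?" matters.

```python
import math

def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2D embedding with two well-separated groups of cells.
embedding = [(-10.0, -10.0), (-9.0, -11.0), (-11.0, -9.0),
             (10.0, 10.0), (9.0, 11.0), (11.0, 9.0)]

# k = 2 is chosen by the user through the two starting centroids.
labels, centroids = kmeans(embedding, centroids=[embedding[0], embedding[3]])
print(labels)
print(centroids)
```

The result depends on the starting centroids, so in practice k-means is usually run from several random starts.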

Clustering algorithms

k-means

Hierarchical

RaceID

SC3

CIDR

countClust

RCA

SIMLR

Luke Zappia, et al. PLoS Comp. Bio. 2018

25%+

Clustering algorithms in single cell research

Phase 4: Gene identification

Science questions
•  Which genes are differentially expressed between cell types?
•  What are the marker genes for each cell type?

Differences between single cell and bulk RNAseq

•  Single-cell gene expression shows a bimodal expression pattern – abundant genes are either highly expressed or undetected.
•  This can be technical (drop-outs).
    •  Drop-outs lead to technical zeroes in the data.
    •  Technical zeroes are due to low capture efficiency in scRNA-seq experiments.
•  Many methods have been proposed to deal with drop-outs
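To see how low capture efficiency produces technical zeroes, here is a small sketch. It assumes that each mRNA molecule is captured independently with the same probability. That simple binomial model is an assumption made for illustration and is not taken from the slides. Under it, a gene with n molecules in a cell is missed entirely with probability (1 - p)^n.

```python
def dropout_probability(n_molecules, capture_efficiency):
    """Probability that none of a gene's molecules is captured.

    Assumes independent capture of each molecule (illustrative model only).
    """
    if not 0.0 <= capture_efficiency <= 1.0:
        raise ValueError("capture_efficiency must be between 0 and 1")
    return (1.0 - capture_efficiency) ** n_molecules

# Hypothetical capture efficiency of 10% for genes of varying abundance.
for n in (1, 5, 20, 100):
    print(n, dropout_probability(n, 0.10))
```

Under this model, lowly expressed genes are far more likely to appear as zeroes than abundant ones, even though they are present in the cell.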

Differential expression analysis

•  Simple statistical tests
    •  Wilcoxon rank test, t-test
•  Methods developed for bulk RNA-seq DE
    •  DESeq2
    •  edgeR
    •  voom-limma
•  scRNA-seq specific
    •  Seurat
    •  MAST
    •  DECENT
    •  D3E
    •  … many more!
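As a sketch of the simplest option above, this is a plain-Python Wilcoxon rank-sum (Mann-Whitney U) test using the normal approximation with a tie correction. Ties matter for scRNA-seq because many values are zero. The expression values are hypothetical, and the normal approximation is rough for small groups, so treat this as an illustration rather than a substitute for a statistics library.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns the U statistic for x and the approximate p-value.
    """
    n1, n2 = len(x), len(y)
    # Pool both groups, remembering which group each value came from.
    pooled = sorted((v, g) for g, group in enumerate((x, y)) for v in group)
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        midrank = (i + 1 + j) / 2.0    # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    n = n1 + n2
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0                  # all values identical: no evidence
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value

# Hypothetical expression of one gene in two clusters of cells.
cluster_a = [0, 0, 1, 5, 8, 9, 12]
cluster_b = [0, 0, 0, 0, 1, 0, 2]
u_stat, p_value = rank_sum_test(cluster_a, cluster_b)
print(u_stat, p_value)
```

In a real analysis the test would be repeated for every gene, and the p-values would then need a multiple-testing correction.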

Comparison of DE methods for scRNA-seq

Soneson and Robinson (2018), Nature Methods

LKS Faculty of Medicine

Making scRNA-seq analysis more scalable

Cloud computing to enable scalability

•  Cloud computing + Big Data framework
•  Cloud computing
    •  A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources
    •  Key characteristics – elasticity + pay-as-you-go model
    •  Advantages – low entry cost + scalability
•  Big Data framework
    •  Hadoop – a software framework for distributed processing of big data on large-scale clusters (YARN for resource management, HDFS for big data storage, and MapReduce as the analytics engine)
    •  Spark – a general-purpose data-analytics engine for analysis of big data using in-memory computation (allows a speed-up of up to 100x compared to MapReduce)
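The map and reduce pattern behind these frameworks can be sketched in plain Python. This is not Hadoop or Spark code. It only shows the idea of counting reads per gene in independent chunks (map) and then merging the partial counts (reduce). The gene names and chunk size are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_count(chunk):
    """Map step: count gene assignments within one chunk of reads."""
    return Counter(chunk)

def reduce_counts(left, right):
    """Reduce step: merge two partial count tables."""
    return left + right

# Hypothetical gene assignment for each aligned read.
read_assignments = ["Actb", "Gapdh", "Actb", "Sox2",
                    "Actb", "Gapdh", "Sox2", "Actb"]

# Split the reads into chunks that could be processed on separate workers.
chunk_size = 3
chunks = [read_assignments[i:i + chunk_size]
          for i in range(0, len(read_assignments), chunk_size)]

partial_counts = map(map_count, chunks)
gene_counts = reduce(reduce_counts, partial_counts, Counter())
print(gene_counts)
```

Because each chunk is counted independently, the map step can run in parallel. Only the small partial count tables have to be brought together at the end.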

Falco framework

MapReduce Spark

Andrian Yang, Michael Troup

Yang et al. (2017), Bioinformatics

Falco framework features
•  Ease of use
    •  Falco provides a helper script to launch an EMR cluster and submit jobs to the cluster
    •  Users can easily configure the cluster and jobs by modifying the configuration file passed to the helper script
•  Customisation
    •  Falco allows users to add custom alignment and/or quantification tools
    •  Users will need to implement a custom function to call the aligner/quantification tool
    •  Custom tools must be compatible with the divide-and-conquer approach

[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
analysis_script = run_pipeline_multiple_files.py
analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts
analysis_script_local_location = source/spark_runner
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
annotation_file = vM9_ERCC.gtf
strand_specificity = NONE
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
picard_extra_args =
region = us-west-2

Sample configuration for running analysis job
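A configuration in this INI-style layout can be read with Python's standard `configparser` module. The sketch below shows one way to do that on a shortened copy of the sample. It is an illustration only and is not Falco's own loading code. The variable names are hypothetical.

```python
import configparser

# Shortened copy of the sample configuration, embedded so the example is
# self-contained. A real job would read the file from disk instead.
SAMPLE_CONFIG = """
[job_config]
name = mESC analysis job
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
region = us-west-2
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)

job_name = parser.get("job_config", "name")
upload_script = parser.getboolean("job_config", "upload_analysis_script")
script_args = dict(parser["script_arguments"])

# Empty options such as aligner_extra_args come back as empty strings,
# so splitting them yields an empty argument list.
extra_aligner_args = script_args["aligner_extra_args"].split()
extra_counter_args = script_args["counter_extra_args"].split()

print(job_name, upload_script)
print(extra_counter_args)
```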

Benchmarking
•  Single-cell RNA-seq data sets
    •  Mouse embryonic stem cell (mESC) data (869 samples)
        •  200bp paired-end reads, 1.28×10^12 bases, 1.02 Tb FASTQ.gz files
    •  Human brain data (466 samples)
        •  100bp paired-end reads, 2.95×10^11 bases, 213.66 Gb FASTQ.gz files
•  Performance comparison of Falco against single-node
    •  STAR+featureCount (S+F)
        •  Mouse: speedup of 2.6x – 33.4x
        •  Brain: speedup of 5.1x – 145.4x
    •  HISAT2+HTSeq (H+H)
        •  Mouse: speedup of 2.5x – 58.4x
        •  Brain: speedup of 4.0x – 132.5x

System      Nodes               Mouse - embryonic stem cell (hours)   Human - brain (hours)
                                S+F       H+H                         S+F       H+H
Standalone  1 (1 process)       93.7      154.7                       85.67     65.34
Standalone  1 (5 processes)     29.3      33.8                        99.09     67.08
Standalone  1 (12 processes)    21.1      16.4                        115.71    55.15
Standalone  1 (16 processes)    18.5      13.6                        114.11    67.98
Falco       10                  7.0       2.7                         32.13     65.34
Falco       20                  4.1       1.6                         39.64     67.08
Falco       30                  3.3       1.4                         57.68     67.68
Falco       40                  2.8       1.1                         76.08     67.98

Table 1. Runtime analysis of single cell datasets
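As a worked check of the quoted speedups, the sketch below uses the mouse S+F runtimes listed in Table 1. It assumes the quoted range compares the fastest standalone run with the slowest Falco run (lower bound) and the slowest standalone run with the fastest Falco run (upper bound). That reading is an assumption. Because the table values are rounded, the result only approximately matches the quoted 2.6x – 33.4x.

```python
# Mouse ESC, STAR+featureCount runtimes in hours, copied from Table 1.
standalone_hours = {"1 process": 93.7, "5 processes": 29.3,
                    "12 processes": 21.1, "16 processes": 18.5}
falco_hours = {10: 7.0, 20: 4.1, 30: 3.3, 40: 2.8}  # keyed by number of nodes

def speedup(baseline_hours, new_hours):
    """How many times faster the new runtime is than the baseline."""
    return baseline_hours / new_hours

# Lower bound: best standalone configuration vs smallest Falco cluster.
min_speedup = speedup(min(standalone_hours.values()), max(falco_hours.values()))
# Upper bound: single-process standalone vs largest Falco cluster.
max_speedup = speedup(max(standalone_hours.values()), min(falco_hours.values()))

print(f"{min_speedup:.2f}x to {max_speedup:.2f}x")
```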

Cost effectiveness using AWS spot instances

•  Utilising spot instances
    •  AWS allows utilisation of unused Amazon computing capacity – known as Spot instances
    •  Typically cheaper compared to ‘on-demand’ cost
    •  To use a spot instance, the user needs to bid for the resource
•  Use of spot instances for analysis provides a saving of ~65% compared to using ‘on-demand’ instances
    •  Alternative use - decrease runtime by utilising more instances for a given ‘on-demand’ price

Figure 3. Spot instance price history for September to October

Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

Dataset         Number of nodes   Time (hours)   On-demand cost (USD)   Spot cost (USD)   % Savings
Mouse - ESC     10                8              247.20                 85.67             65.34
Mouse - ESC     20                5              301.00                 99.09             67.08
Mouse - ESC     30                4              258.00                 115.71            55.15
Mouse - ESC     40                3              356.40                 114.11            67.98
Human - brain   10                3              92.70                  32.13             65.34
Human - brain   20                2              120.40                 39.64             67.08
Human - brain   30                2              179.00                 57.68             67.68
Human - brain   40                2              237.60                 76.08             67.98
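The savings column can be recomputed from the two cost columns as (on-demand - spot) / on-demand. The short sketch below does this for three rows copied from Table 2. The function name is hypothetical.

```python
def percent_savings(on_demand_usd, spot_usd):
    """Relative saving of spot over on-demand pricing, in percent."""
    return (on_demand_usd - spot_usd) / on_demand_usd * 100.0

# (dataset, nodes, on-demand cost in USD, spot cost in USD) from Table 2.
rows = [
    ("Mouse - ESC", 10, 247.20, 85.67),
    ("Mouse - ESC", 20, 301.00, 99.09),
    ("Human - brain", 10, 92.70, 32.13),
]

savings = [round(percent_savings(on_demand, spot), 2)
           for _, _, on_demand, spot in rows]

for (dataset, nodes, _, _), saving in zip(rows, savings):
    print(f"{dataset}, {nodes} nodes: {saving}% saved")
```

These three rows reproduce the tabulated savings of 65.34%, 67.08% and 65.34%, in line with the ~65% figure quoted on the slide.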

Scaling up to a larger data set
•  Data set (for Standalone + Falco)
    •  Single-cell mouse oligodendrocytes from the central nervous system (SRP066613)
    •  6,283 samples of 50bp single-end reads, totalling 231.02 Gbp stored in 200 Gb of fastq.gz files
•  Standalone + Falco
    •  Preprocessing with Trimmomatic
    •  Alignment with STAR
    •  Quantification with featureCount
    •  Clustering with CIDR
•  Cell Ranger – custom pipeline designed by 10X Genomics for its Chromium platform
    •  Alignment with STAR
    •  Timing is approximated from the runtime of a different mouse scRNA-seq dataset

Figure: number of cells processed per second (y-axis, 0.0 to 1.5) for Standalone (1, 12 and 16 processes), Cell Ranger, and Falco (10 and 40 nodes)

Falco software
•  Source code
    •  Falco is available to download from GitHub
    •  Our work on Falco has been featured in a Nature Toolbox article

Check out Falco at github.com/VCCRI/Falco

starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality

•  Enabling widespread use of VR visualisation using low-cost ($10) VR headsets, and a person’s own smartphone (with a web browser)
•  Support interaction using head movement, keyboard, remote gamepad, and voice control

Jianfu Li, Yu Yao

Using starmap to visualise a data set of 68,000 cells from scRNA-seq data

starmap
•  starmap demo: https://vccri.github.io/starmap/
•  starmap source code: https://github.com/VCCRI/starmap
•  bioRxiv preprint: https://www.biorxiv.org/content/early/2018/05/17/324855

https://www.abacbs.org/giw/

Full-paper submission (for oral presentation and journal publication): This week!
Abstract submission (for oral or poster presentation): 1 September 2019?

THANK YOU

We are recruiting:
-  PhD students ($57K pa scholarship)
-  Research assistants
-  Postdoctoral fellows
-  Bioinformaticians (staff)
-  Faculty

[email protected]
https://holab-hku.github.io/
@joshuawkho

HKU-USydney Strategic Partnership Fund – ‘Single Cell Plus’
