ACE: Explaining cluster from an adversarial perspective


Yang Young Lu¹, Timothy C. Yu², Giancarlo Bonora¹, William Stafford Noble¹ ³

Abstract

A common workflow in single-cell RNA-seq analysis is to project the data to a latent space, cluster the cells in that space, and identify sets of marker genes that explain the differences among the discovered clusters. A primary drawback to this three-step procedure is that each step is carried out independently, thereby neglecting the effects of the nonlinear embedding and inter-gene dependencies on the selection of marker genes. Here we propose an integrated deep learning framework, Adversarial Clustering Explanation (ACE), that bundles all three steps into a single workflow. The method thus moves away from the notion of "marker genes" to instead identify a panel of explanatory genes. This panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit differences between closely related cell types. Empirically, we demonstrate that ACE is able to identify gene panels that are both highly discriminative and nonredundant, and we demonstrate the applicability of ACE to an image recognition task.¹

¹Department of Genome Sciences, University of Washington, Seattle, WA. ²Graduate Program in Molecular and Cellular Biology, University of Washington, Seattle, WA. ³Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA. Correspondence to: William Stafford Noble <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

¹The Apache-licensed source code of ACE will be available at bitbucket.org/noblelab/ace.

1. Introduction

Single-cell sequencing technology has enabled the high-throughput interrogation of many aspects of genome biology, including gene expression, DNA methylation, histone modification, chromatin accessibility, and genome 3D architecture (Stuart & Satija, 2019). In each of these cases, the resulting high-dimensional data can be represented as a sparse matrix in which rows correspond to cells and columns correspond to features of those cells (gene expression values, methylation events, etc.). Empirical evidence suggests that this data resides on a low-dimensional manifold with latent semantic structure (Welch et al., 2017). Accordingly, identifying groups of cells in terms of their inherent latent semantics and thereafter reasoning about the differences between these groups is an important area of research (Plumb et al., 2020).

In this study, we focus on the analysis of single-cell RNA-seq (scRNA-seq) data. This is the most widely available type of single-cell sequencing data, and its analysis is challenging not only because of the data's high dimensionality but also due to noise, batch effects, and sparsity (Amodio et al., 2019). The scRNA-seq data itself is represented as a sparse, cell-by-gene matrix, typically with tens to hundreds of thousands of cells and tens of thousands of genes. A common workflow in scRNA-seq analysis (Pliner et al., 2019) consists of three steps: (1) learn a compact representation of the data by projecting the cells to a lower-dimensional space; (2) identify groups of cells that are similar to each other in the low-dimensional representation, typically via clustering; and (3) characterize the differences in gene expression among the groups, with the goal of understanding what biological processes are relevant to each group. Optionally, known "marker genes" may be used to assign cell type labels to the identified cell groups.

A primary drawback to the above three-step procedure is that each step is carried out independently. Here, we propose an integrated, deep learning framework for scRNA-seq analysis, Adversarial Clustering Explanation (ACE), that projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters (Figure 1). At a high level, ACE first "neuralizes" the clustering procedure by reformulating it as a functionally equivalent multi-layer neural network (Kauffmann et al., 2019). In this way, in combination with a deep autoencoder that generates the low-dimensional representation, ACE is able to attribute each cell's group assignment all the way back to the input genes by leveraging gradient-based neural network explanation methods. Next, for each sample, ACE seeks small perturbations of its input gene expression profile that lead the neuralized clustering model to alter the group assignment. These adversarial perturbations allow ACE to define a concise gene set signature for each cluster or pair of clusters. In particular, ACE attempts to answer the question, "For a given cell cluster, can we identify a subset of genes whose expression profiles are sufficient to identify members of this cluster?" We frame this problem as a ranking task, where thresholding the ranked list yields a set of explanatory genes.

ACE's joint modeling approach offers several benefits relative to the existing state of the art. First, most existing methods for the third step of the analysis pipeline, identifying genes associated with a given group of cells, treat each gene independently (Love et al., 2014). These approaches ignore the dependencies among genes that are induced by gene networks and often yield lists of genes that are highly redundant. ACE, in contrast, aims to find a small set of genes that jointly explain a given cluster or pair of clusters. Second, most current methods identify genes associated with a group of cells without considering the nonlinear embedding model that maps the gene expression to the low-dimensional representation where the groups are defined in the first place. To our knowledge, the only exception is the global counterfactual explanation (GCE) algorithm (Plumb et al., 2020), but that algorithm is limited to using a linear transformation. A third advantage of ACE's integrated approach is its ability to take into account batch effects during the assignment of genes to clusters. Standard nonlinear embedding methods, such as t-SNE (Van der Maaten & Hinton, 2008) and UMAP (McInnes & Healy, 2018; Becht et al., 2019), cannot take such structure into account and hence may lead to incorrect interpretation of the data (Amodio et al., 2019; Li et al., 2020). To address this problem, deep autoencoders with integrated denoising and batch correction can be used for scRNA-seq analysis (Lopez et al., 2018; Amodio et al., 2019; Li et al., 2020). We demonstrate below that batch effect structure can be usefully incorporated into the ACE model.

A notable feature of ACE's approach is that, by identifying genes jointly, the method moves away from the notion of a "marker gene" to instead identify a "gene panel." As such, genes in the panel may not be solely enriched in a single cluster but may together be predictive of the cluster. In particular, in addition to a ranking of genes, ACE assigns a Boolean to each gene indicating whether its inclusion in the panel is positive or negative, i.e., whether the gene's expression is enriched or depleted relative to cluster membership. We have applied ACE to both simulated and real datasets to demonstrate its empirical utility. Our experiments demonstrate that ACE identifies gene panels that are highly discriminative and exhibit low redundancy. We further provide results suggesting that ACE is useful in domains beyond biology, such as image recognition.

2. Related work

ACE falls into the paradigm of deep neural network interpretation methods, which have been developed primarily in the context of classification problems. These methods can be loosely categorized into three types: feature attribution methods, counterfactual-based methods, and model-agnostic approximation methods. Feature attribution methods assign an importance score to individual features so that higher scores indicate higher importance to the output prediction (Simonyan et al., 2013; Shrikumar et al., 2017; Lundberg & Lee, 2017). Counterfactual-based methods typically identify the important subregions within an input sample by perturbing the subregions (by adding noise, rescaling (Sundararajan et al., 2017), blurring (Fong & Vedaldi, 2017), or inpainting (Chang et al., 2018)) and measuring the resulting changes in the predictions. Lastly, model-agnostic approximation methods approximate the model being explained by using a simpler, surrogate function which is self-explainable (e.g., a sparse linear model) (Ribeiro et al., 2016). Recently, some interpretation methods have emerged to understand models beyond classification tasks (Samek et al., 2020; Kauffmann et al., 2020; 2019), including the one we present in this paper for the purpose of cluster explanation.

ACE's perturbation approach draws inspiration from adversarial machine learning (Xu et al., 2020), in which imperceptible perturbations are maliciously crafted to mislead a machine learning model into predicting incorrect outputs. In particular, ACE's approach is closest to the setting of a "white-box attack," which assumes complete knowledge of the model, including its parameters, architecture, gradients, etc. (Szegedy et al., 2013; Kurakin et al., 2016; Madry et al., 2017; Carlini & Wagner, 2017). In contrast to these methods, ACE re-purposes the malicious adversarial attack for a constructive purpose: identifying sets of genes that explain clusters in scRNA-seq data.

ACE operates in combination with a deep autoencoder that generates the low-dimensional representation. In this paper, ACE uses SAUCIE (Amodio et al., 2019), a commonly used scRNA-seq embedding method that incorporates batch correction. In principle, ACE generalizes to any off-the-shelf scRNA-seq embedding method, including SLICER (Welch et al., 2016), scVI (Lopez et al., 2018), scANVI (Xu et al., 2021), DESC (Li et al., 2020), and ItClust (Hu et al., 2020).

3. Approach

3.1. Problem setup

We aim to carry out three analysis steps for a given scRNA-seq dataset, producing a low-dimensional representation of each cell's expression profile, a cluster assignment for each cell, and a concise set of "explanatory genes" for each cluster or pair of clusters.

Figure 1. ACE workflow. ACE takes as input a single-cell gene expression matrix and learns a low-dimensional representation for each cell. Next, a neuralized version of the k-means algorithm is applied to the learned representation to identify cell groups. Finally, for pairs of groups of interest (either each group compared to its complement, or all pairs of groups), ACE seeks small perturbations of the input gene expression profile that lead the neuralized clustering model to alter the assignment from one group to the other. The workflow employs a combined objective function to induce the nonlinear embedding and clustering jointly. ACE produces as output the learned embedding, the cell group assignments, and a ranked list of explanatory genes for each cell group.

Let $X = (x_1, x_2, \cdots, x_n)^T \in \mathbb{R}^{n \times p}$ be the normalized gene expression matrix obtained from a scRNA-seq experiment, where rows correspond to $n$ cells and columns correspond to $p$ genes. ACE relies on the following three components: (1) an autoencoder to learn a low-dimensional representation of the scRNA-seq data; (2) a neuralized clustering algorithm to identify groups of cells in the low-dimensional representation; and (3) an adversarial perturbation scheme to explain differences between groups by identifying explanatory gene sets.

3.2. Learning the low-dimensional representation

Embedding scRNA-seq expression data into a low-dimensional space aims to capture the underlying structure of the data, based upon the assumption that the biological manifold on which cellular expression profiles lie is inherently low-dimensional. Specifically, ACE aims to learn a mapping $f(\cdot): \mathbb{R}^p \mapsto \mathbb{R}^d$ that transforms the cells from the high-dimensional input space $\mathbb{R}^p$ to a lower-dimensional embedding space $\mathbb{R}^d$, where $d \ll p$. To accurately represent the data in $\mathbb{R}^d$, we use an autoencoder consisting of two components, an encoder $f(\cdot): \mathbb{R}^p \mapsto \mathbb{R}^d$ and a decoder $g(\cdot): \mathbb{R}^d \mapsto \mathbb{R}^p$. This autoencoder optimizes the generic loss

$$\min_\theta \sum_{i=1}^{n} \|x_i - g(f(x_i))\|_2^2 \tag{1}$$

Finally, we denote $Z = (z_1, z_2, \cdots, z_n)^T \in \mathbb{R}^{n \times d}$ as the low-dimensional representation obtained from the encoder, where $z_i = f(x_i) \in \mathbb{R}^d$ is the embedded representation of cell $x_i$.
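To make the objective concrete, here is a minimal PyTorch sketch of Equation 1. ACE's embedding step actually uses SAUCIE (Section 2), so the layer widths, depth, and training loop below are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: encoder f maps R^p -> R^d, decoder g maps R^d -> R^p."""
    def __init__(self, p: int, d: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(p, hidden), nn.ReLU(), nn.Linear(hidden, d))
        self.decoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, p))

    def forward(self, x):
        z = self.encoder(x)           # z_i = f(x_i)
        return z, self.decoder(z)     # g(f(x_i))

model = Autoencoder(p=2000, d=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 2000)             # stand-in for 64 normalized expression profiles
z, x_hat = model(x)
loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # Equation 1, averaged over the minibatch
opt.zero_grad()
loss.backward()
opt.step()
```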

The autoencoder in ACE can be extended in several important ways. For example, in some settings, Equation 1 is augmented with a task-specific regularizer $\Omega(X)$:

$$\min_\theta \sum_{i=1}^{n} \|x_i - g(f(x_i))\|_2^2 + \Omega(X). \tag{2}$$

As mentioned in Section 2, the scRNA-seq embedding method used by ACE, SAUCIE, encodes in $\Omega(X)$ a batch correction regularizer by using maximum mean discrepancy. In this paper, ACE uses SAUCIE coupled with a feature selection layer (Abid et al., 2019), with the aim of minimizing redundancy and facilitating selection of diverse explanatory gene sets.
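As a rough illustration of what $\Omega(X)$ can look like, the sketch below computes a squared maximum mean discrepancy between the embeddings of two batches under an RBF kernel. The bandwidth and the pairing of batches are our assumptions; SAUCIE's exact regularizer differs in its details.

```python
import torch

def rbf_mmd2(z_a: torch.Tensor, z_b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two batches of embeddings under an RBF kernel."""
    def k(u, v):
        return torch.exp(-torch.cdist(u, v) ** 2 / (2 * sigma ** 2))
    return k(z_a, z_a).mean() + k(z_b, z_b).mean() - 2 * k(z_a, z_b).mean()

# Omega(X) would then penalize batch-specific structure in the embedding, e.g.:
# loss = reconstruction_loss + lambda_mmd * rbf_mmd2(f(x_batch1), f(x_batch2))
```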

3.3. Neuralizing the clustering step

To carry out clustering in the low-dimensional space learned by the autoencoder, ACE uses a neuralized version of the k-means algorithm. This clustering step aims to partition $Z \in \mathbb{R}^{n \times d}$ into $C$ groups, where each group potentially corresponds to a distinct cell type.

The standard k-means algorithm aims to minimize the following objective function by identifying a set of group centroids $\{\mu_c \in \mathbb{R}^d : c = 1, 2, \cdots, C\}$:

$$\min \sum_{i,c} \delta_{ic} \, o_c(z_i) \tag{3}$$

where $\delta_{ic}$ indicates whether cell $z_i$ belongs to group $c$, and the "outlierness" measure $o_c(z_i)$ of cell $z_i$ relative to group $c$ is defined as $o_c(z_i) = \|z_i - \mu_c\|^2$.

Following Kauffmann et al. (2019), we neuralize the k-means algorithm by creating a neural network containing $C$ modules, each with two layers. The architecture is motivated by a soft assignment function that quantifies, for a particular cell $z_i$ and a specified group $c$, the group assignment probability score

$$p_c(z_i) = \frac{\exp(-\beta o_c(z_i))}{\sum_k \exp(-\beta o_k(z_i))} \tag{4}$$

where the hyperparameter $\beta$ controls the clustering fuzziness. As $\beta$ approaches infinity, Equation 4 approaches the indicator function for the closest centroid and thus reduces to hard clustering. To measure the confidence of group assignment, we use a logit function written as

$$m_c(z_i) = \log\left(\frac{p_c(z_i)}{1 - p_c(z_i)}\right) = \beta \cdot {\min_{k \neq c}}^{\beta}\left\{\|z_i - \mu_k\|^2 - \|z_i - \mu_c\|^2\right\} \tag{5}$$

where ${\min_k}^{\beta}\{\cdot\} = -\frac{1}{\beta}\log\sum_k \exp(-\beta\{\cdot\})$ indicates a soft min-pooling layer. (See Kauffmann et al. (2019) for a detailed derivation.) The rationale for using the logit function is that if there is as much confidence supporting the group membership as against it, then the confidence score $m_c(z) = 0$. Additionally, Equation 5 has the following interpretation: the data point $z$ belongs to the group $c$ if and only if the distance to its centroid is smaller than the distance to all other competing groups. Equation 5 further decomposes into a two-layer neural network module:

$$h_{ck}(z_i) = w_{ck}^T z_i + b_{ck}, \qquad m_c(z_i) = \beta \cdot {\min_{k \neq c}}^{\beta} h_{ck}(z_i) \tag{6}$$

where the first layer is a linear transformation layer with parameters $w_{ck} = 2 \cdot (\mu_c - \mu_k)$ and $b_{ck} = \|\mu_k\|^2 - \|\mu_c\|^2$, and the second layer is the soft min-pooling layer introduced in Equation 5. ACE constructs one such module for each of the $C$ clusters, as illustrated in Figure 1.
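The following numpy sketch evaluates Equations 5 and 6 directly from a set of precomputed centroids. It is a didactic re-implementation of the neuralized module, not ACE's code; in ACE the same computation is expressed as network layers so that gradients flow back through the encoder.

```python
import numpy as np

def cluster_logits(z: np.ndarray, mu: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Logit m_c(z) of Equation 5 for each cluster c, via the layers of Equation 6.

    z  : (d,)   embedded cell
    mu : (C, d) cluster centroids
    """
    C = mu.shape[0]
    m = np.empty(C)
    for c in range(C):
        # First layer: h_ck(z) = w_ck^T z + b_ck for every competitor k != c,
        # with w_ck = 2 (mu_c - mu_k) and b_ck = ||mu_k||^2 - ||mu_c||^2,
        # which equals ||z - mu_k||^2 - ||z - mu_c||^2.
        ks = [k for k in range(C) if k != c]
        h = np.array([2 * (mu[c] - mu[k]) @ z + mu[k] @ mu[k] - mu[c] @ mu[c]
                      for k in ks])
        # Second layer: soft min-pooling, min^beta{h} = -(1/beta) log sum_k exp(-beta h_k).
        m[c] = beta * (-(1.0 / beta) * np.log(np.exp(-beta * h).sum()))
    return m

mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
print(cluster_logits(np.array([0.5, 0.2]), mu, beta=2.0))  # positive only for cluster 0
```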

3.4. Explaining the groups

ACE's final step aims to induce, for each cluster identified by the neuralized k-means algorithm, a ranking on genes such that highly ranked genes best explain that cluster. We consider two variants of this task: the one-vs-rest setting compares the group of interest $Z_s = f(X_s) \subseteq Z$ to its complement set $Z_t = f(X_t) \subseteq Z$, where $X_t = X \setminus X_s$; the one-vs-one setting compares one group of interest $Z_s = f(X_s) \subseteq Z$ to a second group of interest $Z_t = f(X_t) \subseteq Z$. In each setting, the goal is to identify the key differences between the source group $X_s \subseteq X$ and the target group $X_t \subseteq X$ in the input space, i.e., in terms of the genes.

We treat this as a neural network explanation problem by finding the minimal perturbation of a sample within the group of interest, $x \in X_s$, that alters the group assignment from the source group $s$ to the target group $t$. Specifically, we optimize an objective function that is a mixture of two terms: the first term is the difference between the current sample $x$ and the perturbed sample $\tilde{x} = x + \delta$, where $\delta \in \mathbb{R}^p$, and the second term quantifies the difference in group assignments induced by the perturbation.

The objective function for the one-vs-one setting is

$$\min_\delta \|\delta\|_1 + \lambda \max(0, \alpha + m_s(x + \delta) - m_t(x + \delta)) \tag{7}$$

where $\lambda > 0$ is a tradeoff coefficient: small values encourage a small perturbation of $x$, whereas large values encourage a stronger alteration toward the target group. The second term penalizes the situation in which the group logit for the source group $s$ is still larger than that of the target group $t$, up to a pre-specified margin $\alpha > 0$. In this paper we fix $\alpha = 1.0$. The difference between the current sample $x$ and the perturbed sample $\tilde{x}$ is measured by the L1 norm to encourage sparsity and non-redundancy. Note that Equation 7 assumes that the input expression matrix is normalized so that a perturbation added to one gene is equivalent to that same perturbation added to a different gene.

Analogously, in the one-vs-rest case, the objective function for the optimization is

$$\min_\delta \|\delta\|_1 + \lambda \max(0, \alpha + m_s(x + \delta) - \max_{t \neq s} m_t(x + \delta)) \tag{8}$$

where the second term penalizes the situation in which the group logit for the source group $s$ is larger than that of all non-source target groups.

Finally, with the $\delta \in \mathbb{R}^p$ obtained by optimizing either Equation 7 or Equation 8, ACE quantifies the importance of the $i$th gene relative to a perturbation from source group $s$ to target group $t$ as the absolute value of $\delta_i$, thereby inducing a ranking in which highly ranked genes are more specific to the group of interest.
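A hedged PyTorch sketch of the one-vs-one optimization in Equation 7 follows. It assumes a differentiable callable logits that composes the trained encoder with the neuralized clustering module of Equation 6; the optimizer, learning rate, and step count are illustrative choices, not values prescribed by the paper.

```python
import torch

def explain_pair(x, logits, s, t, lam=1.0, alpha=1.0, steps=500, lr=0.01):
    """Optimize Equation 7: min_delta ||delta||_1 + lam * hinge, for one cell.

    x      : (p,) expression profile of a cell in source group s
    logits : callable mapping a (p,) tensor to the vector of group logits m_c
    """
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        m = logits(x + delta)
        hinge = torch.clamp(alpha + m[s] - m[t], min=0.0)  # margin term of Eq. 7
        loss = delta.abs().sum() + lam * hinge             # L1 term encourages sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
    # |delta_i| ranks gene i; sign(delta_i) marks enrichment vs. depletion.
    return delta.detach()
```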

4. Baseline methods

We compare ACE against six methodologically distinct baseline methods, each of which induces a ranking on genes in terms of group-specific importance, analogous to ACE.


DESeq2 (Love et al., 2014) is a representative statistical hypothesis testing method that tests for differential gene expression based on a negative binomial model. The main caveat of DESeq2 is that it treats each gene as independent.

The Jensen-Shannon distance (JSD) (Cabili et al., 2011) is a representative distribution distance-based method, which quantifies the specificity of a gene to a cell group. Similar to DESeq2, JSD considers each gene independently.

Global counterfactual explanation (GCE) (Plumb et al., 2020) is a compressed sensing method that aims to identify consistent differences among all pairs of groups. Unlike ACE, GCE requires a linear embedding of the scRNA-seq data.

The gene relevance score (GRS) (Angerer et al., 2020) is a gradient-based explanation method that aims to attribute a low-dimensional embedding back to the genes. The main limitations of GRS are two-fold. First, the embedding used in GRS is constrained to be a diffusion map, which is chosen specifically to make the gradient easy to calculate. Second, taking the gradient with respect to the embedding only indirectly measures the group differentiation, compared to taking the gradient with respect to the group difference directly, as in ACE.

SmoothGrad (Smilkov et al., 2017) and SHAP (Lundberg & Lee, 2017), which are designed primarily for classification problems, are two representative feature attribution methods. Each one computes an importance score that indicates each gene's contribution to the clustering assignment. SmoothGrad relies on knowledge of the model, whereas SHAP does not.

5. Results

5.1. Performance on simulated data

To compare ACE to each of the baseline methods, we used a recently reported simulation method, SymSim (Zhang et al., 2019), to generate two synthetic scRNA-seq datasets: one "clean" dataset and one "complex" dataset. In both cases, we simulated many redundant genes in order to adequately challenge methods that aim to detect a minimal set of informative genes.

The simulation of the clean dataset uses a protocol similar to that of Plumb et al. (2020). We first used SymSim to generate a background matrix containing simulated counts from 500 cells, 2000 genes, and five distinct clusters. We then used this background matrix to construct our simulated dataset of 500 cells by 220 genes. The simulated data comprises three sets of genes: 20 causal genes, 100 dependent genes, and 100 noise genes. To select the causal genes, we identified all genes that are differentially expressed by SymSim's criteria (nDiff-EVFgene > 0 and |log2 fold-change| > 0.8) between at least one pair of clusters, and we selected the 20 genes that exhibit the largest average fold-change across all pairs of clusters in which the gene was differentially expressed. A UMAP embedding on these causal genes alone confirms that they are jointly capable of separating cells into their respective clusters (Fig. 2A). Next, we simulated 100 dependent genes, which are weighted sums of 1–10 randomly selected causal genes, with added Gaussian noise. As such, a dependent gene is highly correlated with a causal gene or with a linear combination of multiple causal genes. The weights were sampled from a continuous uniform distribution, U(0.01, 0.8), and the Gaussian noise was sampled from N(0, 1). As expected, the dependent genes are also jointly capable of separating cells into their respective clusters (Fig. 2A). Lastly, we found all genes that were not differentially expressed between any cluster pair in the ground truth, and we randomly sampled 100 noise genes. These genes provide no explanation of the clustering structure (Fig. 2A).
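For the dependent genes, the generation procedure can be sketched in a few lines of numpy. The causal matrix below is a random stand-in for the SymSim counts, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_causal, n_dependent = 500, 20, 100
causal = rng.poisson(5.0, (n_cells, n_causal)).astype(float)  # stand-in for SymSim counts

dependent = np.empty((n_cells, n_dependent))
for j in range(n_dependent):
    k = rng.integers(1, 11)                                # 1-10 causal parents
    parents = rng.choice(n_causal, size=k, replace=False)  # chosen at random
    w = rng.uniform(0.01, 0.8, size=k)                     # weights ~ U(0.01, 0.8)
    dependent[:, j] = causal[:, parents] @ w + rng.normal(0, 1, n_cells)  # + N(0, 1)
```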

To simulate the complex dataset, we used SymSim to add dropout events and batch effects to the background matrix generated previously. We then selected the exact same causal and noise genes as in the clean dataset and used the exact same random combinations and weights to generate the dependent genes. Thus, the clean and complex datasets contain the same 220 genes; however, the complex dataset enables us to gauge how robust ACE is to the artifacts of technical noise observed in real single-cell RNA-seq datasets (Fig. 2B).

To compare the different gene ranking methods, we need to specify the ground truth cluster labels and a performance measure. We observe that the embedding representation learned by ACE exhibits clear cluster patterns even in the presence of dropout events and batch effects, and thus ACE's k-means clustering is able to recover these clusters (Appendix Figure A.1). Accordingly, to compare different methods for inducing gene rankings, we provide ACE and each baseline method with the ground truth clustering labels from the original study (Zheng et al., 2017). ACE then calculates the group centroids used in Equation 3 by averaging the data points of the corresponding ground truth cluster. The embedding layer together with the group centroids are then used to build the neuralized clustering model (Equation 6). Each method produces gene rankings for every cluster in a one-vs-rest fashion. To measure how well a gene ranking captures clustering structure, we use the Jaccard distance to compare a cell's k nearest neighbors (k-NN) when using a subset of top-ranked genes against the cell's k-NN when using all genes. To compute the k-NN, we use the Euclidean distance metric. The Jaccard distance is defined as

$$JD(i) = 1 - \frac{|S_{\text{full}} \cap S_{\text{sub}}|}{|S_{\text{full}} \cup S_{\text{sub}}|} \tag{9}$$


Figure 2. Comparing ACE to baseline methods on simulated scRNA-seq datasets. Each dataset consists of 20 causal genes, 100 dependent genes, and 100 noise genes. (A) UMAP embeddings of cells composing the clean dataset. Panels correspond to embeddings using the three subsets of genes (causal, dependent, and noise), as well as all of the genes together. (B) Same as panel A, but for the complex dataset. (C) Comparison of methods via Jaccard distance as a function of the number of genes in the ranking. ACE performs substantially better than each of the baseline methods on the clean dataset. The gray dashed line indicates the mean Jaccard distance achieved by the 20 causal genes alone. (D) Same as panel C, but for the complex dataset.

where $S_{\text{full}}$ represents cell $i$'s k-NNs when using all genes, and $S_{\text{sub}}$ represents cell $i$'s k-NNs when using a subset of top-ranked genes. If the subset of top-ranked genes does a good job of explaining a cluster of cells, then $S_{\text{full}} \cap S_{\text{sub}}$ and $S_{\text{full}} \cup S_{\text{sub}}$ should be nearly equal, and the Jaccard distance should approach 0. We select the gene ranking used to derive a subset of top-ranked genes based on the cell's cluster assignment. For example, if the cell belongs to cluster 2, we use the cluster 2 vs. rest gene ranking. Thus, to obtain a global measure of how well the clustering structure is captured by a subset of top-ranked genes, we report the mean Jaccard distance across all cells.
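Equation 9, averaged over cells, can be computed with scikit-learn's NearestNeighbors; the sketch below makes the stated Euclidean metric explicit and leaves k as an input.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_jaccard_distance(X_full, X_sub, k: int = 30) -> float:
    """Mean of Eq. 9 over cells: k-NN sets on all genes vs. on a top-ranked subset."""
    def knn_sets(X):
        nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
        idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop each cell itself
        return [set(row) for row in idx]
    full, sub = knn_sets(X_full), knn_sets(X_sub)
    return float(np.mean([1 - len(f & s) / len(f | s) for f, s in zip(full, sub)]))
```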

Our analysis shows that ACE considerably outperforms each of the baseline methods on the clean dataset, indicating that it is superior at identifying the minimal set of informative genes (Fig. 2C). Notably, ACE outperforms the mean Jaccard distance achieved by the causal genes alone before reaching 20 genes used, suggesting that the method successfully identifies dependent genes that are more informative than individual causal genes. ACE also performs strongly on the complex dataset, though it appears to perform on par with SmoothGrad and SHAP (Fig. 2D). Notably, these three methods (ACE, SHAP, and SmoothGrad) share a common feature: they all employ the SAUCIE framework, which provides automatic batch effect correction, highlighting the utility of DNN-based dimensionality reduction and interpretation methods for single-cell RNA-seq applications.

5.2. Real data analysis

We next applied ACE to a real dataset of peripheral blood mononuclear cells (PBMCs) (Zheng et al., 2017), represented as a cell-by-gene log-normalized expression matrix containing 2638 cells and 1838 highly variable genes. The cells in the dataset were previously categorized into eight cell types, obtained by performing Louvain clustering (Blondel et al., 2008) and annotating each cluster on the basis of differentially expressed marker genes. As shown in Figure 3A and Appendix Figure A.2, ACE's k-means clustering successfully recovers the reported cell types based upon the 10-dimensional embedding learned by SAUCIE.

Figure 3. Comparing ACE to baseline methods on the PBMC dataset. (A) UMAP embedding of PBMC cells labelled by ACE's k-means clustering assignment. (B) Classification performance of each method, as measured by AUROC, as a function of the number of genes in the set. Error bars correspond to the standard error of the mean of AUROC scores from each test split across different target groups. (C) Redundancy among the top k genes, as measured by Pearson correlation, as a function of k. Error bars correspond to the standard error of the mean calculated from the group-specific correlations. (D) Overlaps among the top 18 genes (corresponding to 1% of 1838 genes) identified by all seven methods with respect to the CD4 T cell cluster.

We first aimed to quantify the discriminative power of the top-ranked genes identified by ACE in comparison to the six baseline methods. To do this, we applied all six baseline methods to the PBMC dataset using the groups identified by the k-means clustering based on the SAUCIE embedding. For each group of cells, we extracted the top-k group-specific genes reported by each method, where k ranges over 1%, 2%, ..., 100% of all genes. Given the selected gene subset, we then trained a support vector machine (SVM) classifier with a radial basis function kernel to separate the target group from the remaining groups. The SVM training involves two hyperparameters, the regularization coefficient C and the bandwidth parameter σ. The σ parameter is adaptively chosen so that the training data is Z-score normalized, using the default settings in Scikit-learn (Pedregosa et al., 2011). The C parameter is selected by grid search over $\{5^{-5}, 5^{-4}, \cdots, 5^{0}, \cdots, 5^{4}, 5^{5}\}$. The classification performance, in terms of area under the receiver operating characteristic curve (AUROC), is evaluated by 3-fold stratified cross-validation, and an additional 3-fold cross-validation is applied within each training split to determine the optimal C hyperparameter. Finally, AUROC scores from each test split across different target groups are aggregated and reported in terms of the mean and the standard error of the mean. Two cell types, megakaryocytes and dendritic cells, are excluded due to insufficient sample size (< 50). As shown in Figure 3B, the top-ranked genes reported by ACE are among the most discriminative across all methods, particularly when the inclusion size is small (≤ 3%). The only method that yields superior performance is DESeq2.
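Under the stated protocol, the evaluation for one target group can be sketched as follows with scikit-learn; the random seeds and the exact nesting of the cross-validation are our assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def auroc_for_gene_subset(X_top: np.ndarray, y_binary: np.ndarray) -> np.ndarray:
    """Nested 3-fold CV: the inner loop selects C, the outer loop reports AUROC."""
    grid = {"C": [5.0 ** e for e in range(-5, 6)]}          # 5^-5, ..., 5^5
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
    clf = GridSearchCV(SVC(kernel="rbf", gamma="scale"), grid, cv=inner)
    return cross_val_score(clf, X_top, y_binary, cv=outer, scoring="roc_auc")
```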

We next tested the redundancy of top-ranked genes, since it is desirable to identify diverse explanatory gene sets with minimum redundancy. Specifically, for each target group of cells, we calculate the Pearson correlations between all gene pairs within the top k genes, for varying values of k. The mean and standard error of the mean of these correlations are computed within each group and then averaged across different target groups. The results of this analysis (Figure 3C) suggest that the top-ranked genes reported by ACE are among the least redundant across all methods. Other methods that exhibit low redundancy include GRS and the two methods that use the same SAUCIE model (i.e., SmoothGrad and SHAP). In conjunction with the discriminative power analysis in Figure 3B, we conclude that ACE achieves a powerful combination of high discriminative power and low redundancy.
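For one group, the redundancy measure reduces to the mean pairwise correlation among the top-ranked genes, as in this sketch (whether correlations are taken in absolute value is not specified above; the sketch uses them as signed):

```python
import numpy as np

def topk_redundancy(X: np.ndarray, top_idx) -> tuple:
    """Mean and standard error of pairwise Pearson correlations among top genes.

    X : (cells, genes) expression matrix; top_idx : indices of the top-k genes.
    """
    corr = np.corrcoef(X[:, top_idx], rowvar=False)  # (k, k) gene-gene correlations
    iu = np.triu_indices_from(corr, k=1)             # each unordered pair once
    pairs = corr[iu]
    return pairs.mean(), pairs.std(ddof=1) / np.sqrt(pairs.size)
```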

Finally, to better understand how these methods differ from one another, we investigated the consistency among the top-ranked genes reported by each method. For this analysis, we focused on one particular group, CD4 T cells. We discover strong disagreement among the methods (Figure 3D). Surprisingly, no single gene is selected among the top 1% by all methods. Among all methods, ACE's panel has the largest overlap with genes reported by at least one other method (14 out of 18 genes). The four genes that ACE uniquely identifies (red bar in Figure 3D), CCL5, GZMK, SPOCD1, and SNRNP27, are depleted rather than enriched relative to other cell types. It is worth mentioning that both CCL5 and GZMK are enriched in CD8 T cells (Thul et al., 2017), the closest cell type to CD4 T cells (Figure 3A). This observation suggests that ACE identifies genes that exhibit highly discriminative changes in expression between two closely related cell types. Indeed, among ACE's 18-gene panel, 15 genes are depleted rather than enriched, suggesting that much of the CD4 T cell identity may be due to inhibition rather than activation of specific genes. In summary, ACE is able to move away from the notion of a "marker gene" to instead identify a highly discriminative, nonredundant gene panel.

5.3. Image analysis

Although we developed ACE for application to scRNA-seq data, we hypothesized that the method would be useful in domains beyond biology. Explanation methods are potentially useful, for example, in the analysis of biomedical images, where the explanations can identify regions of the image responsible for assignment of the image to a particular phenotypic category. As a proof of principle for this general domain, we applied ACE to the MNIST handwritten digits dataset (LeCun, 1998), with the aim of studying whether ACE can identify which pixels in a given image explain why the image was assigned to one digit versus another. Specifically, we solve the optimization problem in Equation 7 for each input image, seeking an image-specific set of pixel modifications, subject to the constraint that the perturbed image pixel values are restricted to lie in the range [0, 1]. Note that this task is somewhat different from the scRNA-seq case: in the MNIST case, ACE finds a different set of explanatory pixels for each image, whereas in the scRNA-seq case, ACE seeks a single set of genes that explains label differences across all cells in the dataset.

Figure 4. Applying ACE to the MNIST dataset. ACE is able to explain 20 types of digit transitions in a pixel-wise manner. These digit transitions are chosen such that each digit category is covered at least once in both directions.

ACE was applied to this dataset as follows. We used a simple convolutional neural network architecture containing two convolution layers, each with a modest filter size (5 × 5), a modest number of filters (32), and ReLU activation, followed by a max pooling layer with pool size (2 × 2), a fully connected layer, and a softmax layer. The model was trained on the MNIST training set (60,000 examples) for 10 epochs, using Adam (Kingma & Ba, 2015) with an initial learning rate of 0.001. The network achieves 98.7% classification accuracy on the test set of 10,000 images. We observe that the embedding representation in the last pooling layer exhibits well-separated cluster patterns (Appendix Figure A.3). Since our goal is not to learn the cluster structure per se, for simplicity, we fixed the number of groups to be the number of digit categories (i.e., 10) and calculated the group centroids used in Equation 3 by averaging the data points of the corresponding category. The embedding layer together with the group centroids are then used to build the neuralized clustering model (Equation 6).
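The described architecture, written as a hedged PyTorch sketch; the width of the fully connected layer and the placement of the softmax inside the cross-entropy loss are our assumptions, since the text does not specify them.

```python
import torch.nn as nn

# Two 5x5 conv layers (32 filters, ReLU), one 2x2 max pool, a fully connected
# layer, and a softmax output, as described above. Input: 1x28x28 MNIST images.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(),    # 28 -> 24
    nn.Conv2d(32, 32, kernel_size=5), nn.ReLU(),   # 24 -> 20
    nn.MaxPool2d(kernel_size=2),                   # 20 -> 10
    nn.Flatten(),
    nn.Linear(32 * 10 * 10, 128), nn.ReLU(),       # fully connected (width assumed)
    nn.Linear(128, 10),                            # logits; softmax via CrossEntropyLoss
)
# Trained for 10 epochs with Adam (lr = 0.001) and cross-entropy on MNIST.
```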

The results of this analysis show that ACE does a good job of identifying sets of pixels that accurately explain differences between pairs of digits. We examined the pixel-wise explanations of 20 pairs of digits, randomly selected to cover each digit category at least once in both directions (Fig. 4). For example, to convert "8" to "5," ACE disconnects the top right and bottom left of the "8," as expected. Similarly, to convert "8" to "3," ACE disconnects the top left and bottom left of the "8." It is worth noting that the modifications introduced by ACE are inherently symmetric. For example, to convert "1" to "7" and back again, ACE suggests adding and removing the same part of the "7."

6. Discussion and conclusion

In this work, we have proposed a deep learning-based scRNA-seq analysis pipeline, ACE, that projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters. Compared to existing state-of-the-art methods, ACE jointly takes into consideration both the nonlinear embedding of cells to a low-dimensional representation and the intrinsic dependencies among genes. As such, the method moves away from the notion of a "marker gene" to instead identify a panel of genes. This panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit important differences between closely related cell types. Our experiments demonstrate that ACE identifies gene panels that are highly discriminative and exhibit low redundancy. We also provide results suggesting that ACE's approach may be useful in domains beyond biology, such as image recognition.

This work points to several promising directions for future research. In principle, ACE can be used in conjunction with any off-the-shelf scRNA-seq embedding method; thus, empirical investigation of the utility of generalizing ACE to use embedders other than SAUCIE would be interesting. Another possible extension is to apply neuralization to alternative clustering algorithms. For example, in the context of scRNA-seq analysis, the Louvain algorithm (Blondel et al., 2008) is commonly used and may be a good candidate for neuralization. A final promising direction for future work is to provide confidence estimates for the top-ranked group-specific genes, in terms of q-values (Storey, 2003), with the help of the recently proposed knockoffs framework (Barber & Candès, 2015; Lu et al., 2018).


References

Abid, A., Balin, M. F., and Zou, J. Concrete autoencoders for differentiable feature selection and reconstruction. International Conference on Machine Learning, 2019.

Amodio, M., Dijk, D. V., Srinivasan, K., Chen, W. S., Mohsen, H., Moon, K. R., Campbell, A., Zhao, Y., Wang, X., Venkataswamy, M., and Krishnaswamy, S. Exploring single-cell data with deep multitasking neural networks. Nature Methods, pp. 1–7, 2019.

Angerer, P., Fischer, D. S., Theis, F. J., Scialdone, A., and Marr, C. Automatic identification of relevant genes from low-dimensional embeddings of single-cell RNA-seq data. Bioinformatics, 2020.

Barber, R. F. and Candès, E. J. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.

Becht, E., McInnes, L., Healy, J., Dutertre, C., Kwok, I. W. H., Ng, L. G., Ginhoux, F., and Newell, E. W. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38–44, 2019.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

Cabili, M. N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., and Rinn, J. L. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & Development, 25(18):1915–1927, 2011.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.

Chang, C., Creager, E., Goldenberg, A., and Duvenaud, D. Explaining image classifiers by counterfactual generation. arXiv preprint arXiv:1807.08024, 2018.

Fong, R. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437, 2017.

Hu, J., Li, X., Hu, G., Lyu, Y., Susztak, K., and Li, M. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nature Machine Intelligence, 2(10):607–618, 2020.

Kauffmann, J., Esders, M., Montavon, G., Samek, W., and Müller, K. From clustering to cluster explanations via neural networks. arXiv preprint arXiv:1906.07633, 2019.

Kauffmann, J., Müller, K., and Montavon, G. Towards explaining anomalies: a deep Taylor decomposition of one-class models. Pattern Recognition, 101:107198, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Li, X., Wang, K., Lyu, Y., Pan, H., Zhang, J., Stambolian, D., Susztak, K., Reilly, M. P., Hu, G., and Li, M. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nature Communications, 11(1):1–14, 2020.

Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053–1058, 2018.

Love, M., Huber, W., and Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014.

Lu, Y. Y., Fan, Y., Lv, J., and Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, 2018.

Lundberg, S. M. and Lee, S. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 2017.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

McInnes, L. and Healy, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Pliner, H. A., Shendure, J., and Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nature Methods, 16(10):983–986, 2019.


Plumb, G., Terhorst, J., Sankararaman, S., and Talwalkar, A. Explaining groups of points in low-dimensional representations. In International Conference on Machine Learning, 2020.

Ribeiro, M., Singh, S., and Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 1135–1144, New York, NY, USA, 2016. ACM.

Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., and Müller, K. R. Toward interpretable machine learning: Transparent deep neural networks and beyond. arXiv preprint arXiv:2003.07631, 2020.

Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Storey, J. D. The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013–2035, 2003.

Stuart, T. and Satija, R. Integrative single-cell analysis. Nature Reviews Genetics, 20:252–272, 2019.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Thul, P., Åkesson, L., Wiking, M., Mahdessian, D., Geladaki, A., Blal, H., Alm, T., Asplund, A., Björk, L., Breckels, L., et al. A subcellular map of the human proteome. Science, 356(6340), 2017.

Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Way, G. and Greene, C. Bayesian deep learning for single-cell analysis. Nature Methods, 15(12):1009–1010, 2018.

Welch, J., Hartemink, A., and Prins, J. SLICER: inferring branched, nonlinear cellular trajectories from single-cell RNA-seq data. Genome Biology, 17(1):1–15, 2016.

Welch, J. D., Hartemink, A. J., and Prins, J. F. MATCHER: manifold alignment reveals correspondence between single-cell transcriptome and epigenome dynamics. Genome Biology, 18(1):138, 2017.

Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M., and Yosef, N. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Molecular Systems Biology, 17(1):e9620, 2021.

Xu, H., Ma, Y., Liu, D., Liu, H., Tang, J., and Jain, A. Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17(2):151–178, 2020.

Zhang, X., Xu, C., and Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nature Communications, 10(1):1–16, 2019.

Zheng, G. X. Y., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J., Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W., Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson, N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049, 2017.

Appendix

Figure A.1. The embedding representation learned by SAUCIE exhibits well-separated cluster patterns on both (A) the clean and (B) the complex simulated scRNA-seq datasets.

Figure A.2. The embedding representation learned by SAUCIE exhibits similar cluster patterns using either (A) the Louvain algorithm or (B) k-means clustering on the PBMC dataset.

Figure A.3. The embedding representation in the last pooling layer of the convolutional neural network exhibits well-separated cluster patterns among the 10 digit categories on the MNIST dataset.