Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 165–180, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


Parallax: Visualizing and Understanding the Semantics of Embedding Spaces via Algebraic Formulae

Piero Molino
Uber AI Labs
San Francisco, CA, USA
[email protected]

Yang Wang
Uber Technologies Inc.
San Francisco, CA, USA
[email protected]

Jiawei Zhang∗
Facebook
Menlo Park, CA, USA
[email protected]

Abstract

Embeddings are a fundamental component of many modern machine learning and natural language processing models. Understanding them and visualizing them is essential for gathering insights about the information they capture and the behavior of the models. In this paper, we introduce Parallax[1], a tool explicitly designed for this task. Parallax allows the user to use both state-of-the-art embedding analysis methods (PCA and t-SNE) and a simple yet effective task-oriented approach where users can explicitly define the axes of the projection through algebraic formulae. In this approach, embeddings are projected into a semantically meaningful subspace, which enhances interpretability and allows for more fine-grained analysis. We demonstrate[2] the power of the tool and the proposed methodology through a series of case studies and a user study.

1 Introduction

Learning representations is an important part of modern machine learning and natural language processing. These representations are often real-valued vectors, also called embeddings, and are obtained both as byproducts of supervised learning or as the direct goal of unsupervised methods. Independently of how the embeddings are learned, there is much value in understanding what information they capture, how they relate to each other and how the data they are learned from influences them. A better understanding of the embedded space may lead to a better understanding of the data, of the problem and the behavior of the model, and may lead to critical insights in improving such models. Because of their high-dimensional nature, they are hard to visualize effectively.

∗ Work done while at Purdue University
[1] http://github.com/uber-research/parallax
[2] https://youtu.be/CSkJGVsFPIg

Figure 1: Screenshot of Parallax.

In this paper, we introduce Parallax, a tool for visualizing embedding spaces. The most widely adopted projection techniques (Principal Component Analysis (PCA) (Pearson, 1901) and t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008)) are available in Parallax. They are useful for obtaining an overall view of the embedding space, but they have a few shortcomings: 1) projections may not preserve distance in the original space, 2) they are not comparable across models, and 3) they do not provide interpretable axes, preventing more detailed analysis and understanding.

PCA projects embeddings on a lower-dimensional space that has the directions of the highest variance in the dataset as axes. Those dimensions do not carry any interpretable meaning, so by visualizing the first two dimensions of a PCA projection, the only insight obtainable is the semantic relatedness (Budanitsky and Hirst, 2006) between points, observed through their relative closeness; topical clusters can therefore be identified. Moreover, as the directions of highest variance differ from embedding space to embedding space, the projections are incompatible among different embedding spaces, and this makes them incomparable, a common issue among dimensionality reduction techniques.


t-SNE, differently from PCA, optimizes a loss that encourages embeddings that are in their respective close neighborhoods in the original high-dimensional space to be close in the lower-dimensional projection space. t-SNE projections visually approximate the original embedding space better and topical clusters are more clearly distinguishable, but they do not solve the issue of comparability of two different sets of embeddings, nor do they solve the lack of interpretability of the axes or allow for fine-grained inspection.

For these reasons, there is value in mapping embeddings into a more specific, controllable and interpretable semantic space. In this paper, a new and simple method to inspect, explore and debug embedding spaces at a fine-grained level is proposed. This technique is made available in Parallax alongside PCA and t-SNE for goal-oriented analysis of the embedding spaces. It consists of explicitly defining the axes of projection through formulae in vector algebra that use embedding labels as atoms. Explicit axis definition assigns interpretable and fine-grained semantics to the axes of projection. This makes it possible to analyze in detail how embeddings relate to each other with respect to interpretable dimensions of variability, as carefully crafted formulae can map (to a certain extent) to semantically meaningful portions of the space. The explicit axes definition also allows for the comparison of embeddings obtained from different datasets, as long as they have common labels and are equally normalized.
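As an illustration of the approach just described, here is a minimal sketch (toy vectors and hand-evaluated formulae, not Parallax's actual code or API): each axis is the vector obtained by evaluating an algebraic expression over embedding labels, and each item's coordinates are its cosine similarities to the axis vectors.

```python
import numpy as np

# Toy embedding table, for illustration only.
emb = {
    "king":  np.array([0.8, 0.1, 0.3]),
    "queen": np.array([0.7, 0.9, 0.2]),
    "man":   np.array([0.9, 0.0, 0.4]),
    "woman": np.array([0.6, 1.0, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Axes are vector-algebra expressions over the labels,
# e.g. "man - woman" and "king + queen".
axis_x = emb["man"] - emb["woman"]
axis_y = emb["king"] + emb["queen"]

# Each item's coordinates are its similarities to the two axis vectors.
coords = {w: (cosine(v, axis_x), cosine(v, axis_y)) for w, v in emb.items()}
```

Any expression that evaluates to a vector (sums, differences, averages of labeled embeddings) can serve as an axis in the same way.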

We demonstrate three visualizations that Parallax provides for analyzing subspaces of interest of embedding spaces and a set of example case studies including bias detection, polysemy analysis and fine-grained embedding analysis, but additional ones, like diachronic analysis and the analysis of representations obtained through graph learning or any other means, may be performed as easily. Moreover, the proposed visualizations can be used for debugging purposes and, in general, for obtaining a better understanding of the embedding spaces learned by different models and representation learning approaches. We show how this methodology can be widely used through a series of case studies on well-known models and data, and furthermore, we validate its usefulness for goal-oriented analysis through a user study.

The Parallax interface, shown in Figure 1, presents a plot on the left side (scatter or polar) and controls on the right side that allow users to define the parameters of the projection (which measure to use, values for the hyperparameters, the formulae for the axes in case explicit axes projections are selected, etc.) and additional filtering and visualization parameters. Filtering parameters define logic rules applied to embedding metadata to decide which of them should be visualized, e.g., the user can decide to visualize only the most frequent words, or only verbs if metadata about part-of-speech tags is made available. Filters on the embeddings themselves can also be defined, e.g., the user can decide to visualize only the embeddings with cosine similarity above 0.5 to the embedding of “horse”.
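The two kinds of filtering rules described above can be sketched as follows; the vocabulary, the metadata schema, and the `visible` helper are hypothetical, invented purely for illustration:

```python
import numpy as np

# Toy setup: random 50-d vectors and invented metadata.
rng = np.random.default_rng(0)
vocab = ["horse", "pony", "stable", "democracy"]
emb = {w: rng.normal(size=50) for w in vocab}
meta = {w: {"pos": "NOUN", "rank": i} for i, w in enumerate(vocab)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def visible(word, ref="horse", min_sim=0.5):
    ok_meta = meta[word]["pos"] == "NOUN"              # metadata rule
    ok_sim = cosine(emb[word], emb[ref]) >= min_sim    # embedding rule
    return ok_meta and ok_sim

shown = [w for w in vocab if visible(w)]
```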

Figure 2: In the top we show professions plotted on “male” and “female” axes in Wikipedia embeddings. In the bottom we show their comparison in the Wikipedia and Twitter datasets.

In particular, Parallax’s capability of explicitly defining axes is useful for goal-oriented analyses, e.g., when the user has a specific analysis goal in mind, like detecting bias in the embedding space. Goals are defined in terms of dimensions of variability (axes of projection) and items to visualize (all the embeddings that are projected, after filtering). In the case of a few dimensions of variability


(up to three) and potentially many items of interest, a Cartesian view is ideal. Each axis is the vector obtained by evaluating the algebraic formula it is associated with, and the coordinates displayed are similarities or distances of the items with respect to each axis. Figure 2 shows an example of a bi-dimensional Cartesian view. In the case where the goal is defined in terms of many dimensions of variability, a polar view is preferred. The polar view can visualize many more axes by showing them in a circle, but it is limited in the number of items it can display, as each item will be displayed as a polygon with each vertex lying on a different axis, and too many overlapping polygons would make the visualization cluttered. Figure 5 shows an example of a five-dimensional polar view.

The use of explicit axes allows for interpretable comparison of different embedding spaces, trained on different corpora or on the same corpora but with different models, or even trained on two different time slices of the same corpora. The only requirement for embedding spaces to be comparable is that they contain embeddings for all labels present in the formulae defining the axes. Moreover, embeddings in the two spaces do not need to be of the same dimension, but they need to be normalized. Items will now have two sets of coordinates, one for each embedding space, and thus they will be displayed as lines. Short lines are interpreted as items being embedded similarly in the subspaces defined by the axes in both embedding spaces, while long lines are interpreted as really different locations in the subspaces, and their direction gives insight on how items shift in the two subspaces. The bottom side of Figure 2 shows an example of how to use the Cartesian comparison view to compare embeddings in two datasets.
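A rough sketch of how the comparison view's coordinates could be computed under the stated requirements (shared axis labels, normalized embeddings, possibly different dimensionalities); the toy spaces and the `project` helper are assumptions, not Parallax's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["nurse", "doctor", "chef"]
axis_labels = ("he", "she")

def normalize(v):
    return v / np.linalg.norm(v)

# Two spaces of different dimensionality, both L2-normalized,
# sharing the item and axis labels.
space_a = {w: normalize(rng.normal(size=50)) for w in words + list(axis_labels)}
space_b = {w: normalize(rng.normal(size=25)) for w in words + list(axis_labels)}

def project(space, word):
    x_axis, y_axis = space[axis_labels[0]], space[axis_labels[1]]
    v = space[word]
    # Dot products equal cosine similarities because all vectors are unit-norm.
    return float(v @ x_axis), float(v @ y_axis)

# Each word gets two (x, y) pairs, one per space, and is drawn as a line
# from its position in space_a to its position in space_b.
segments = {w: (project(space_a, w), project(space_b, w)) for w in words}
```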

2 Case Studies

In this section, a few goal-oriented use cases are presented, but Parallax’s flexibility allows for many others. We used 50-dimensional publicly available GloVe (Pennington et al., 2014) embeddings trained on Wikipedia and Gigaword 5, summing to 6 billion tokens (for short, Wikipedia), and on 2 billion tweets containing 27 billion tokens (Twitter).

Bias detection  The task of bias detection is to identify, and in some cases correct for, bias in data that is reflected in the embeddings trained on such data. Studies have shown how embeddings incorporate gender and ethnic biases (Garg et al., 2018; Bolukbasi et al., 2016; Islam et al., 2017), while other studies focused on warping spaces in order to de-bias the resulting embeddings (Bolukbasi et al., 2016; Zhao et al., 2017). We show how our proposed methodology can help visualize biases.

To visualize gender bias with respect to professions, the goal is defined with the formulae avg(he, him) and avg(she, her) as two dimensions of variability, in a similar vein to (Garg et al., 2018). A subset of the professions used by (Bolukbasi et al., 2016) is selected as items, and cosine similarity is adopted as the measure for the projection. The Cartesian view visualizing Wikipedia embeddings is shown in the left of Figure 2. Nurse, dancer, and maid are the professions closer to the “female” axis, while boss, captain, and commander end up closer to the “male” axis.
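The same projection can be sketched with toy two-dimensional vectors (illustrative only; the real embeddings are 50-dimensional GloVe vectors): the axes are avg(he, him) and avg(she, her), and each profession is placed by its cosine similarity to each axis.

```python
import numpy as np

# Hand-crafted toy vectors chosen so that "nurse" leans toward the
# she/her direction and "boss" toward the he/him direction.
emb = {
    "he":    np.array([1.0, 0.1]), "him": np.array([0.9, 0.2]),
    "she":   np.array([0.1, 1.0]), "her": np.array([0.2, 0.9]),
    "nurse": np.array([0.3, 0.8]), "boss": np.array([0.8, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

male_axis = (emb["he"] + emb["him"]) / 2     # avg(he, him)
female_axis = (emb["she"] + emb["her"]) / 2  # avg(she, her)

coords = {w: (cosine(emb[w], male_axis), cosine(emb[w], female_axis))
          for w in ("nurse", "boss")}
# With these toy vectors, nurse scores higher on the female axis
# and boss scores higher on the male axis.
```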

The Cartesian comparison view comparing the embeddings trained on Wikipedia and Twitter is shown in the right side of Figure 2. Only the embeddings with a line length above 0.05 are displayed. The most interesting words in this visualization are the ones that shift the most in the direction of negative slope. In this case, chef and doctor are closer to the “male” axis in Twitter than in Wikipedia, while dancer and secretary are closer to the bisector in Twitter than in Wikipedia.

Polysemy analysis  Methods for representing words with multiple vectors by clustering contexts have been proposed (Huang et al., 2012; Neelakantan et al., 2014), but widely used pre-trained vectors conflate meanings in the same embedding.

Widdows (2003) showed how, using a binary orthonormalization operator that has ties with the quantum logic not operator, it is possible to remove part of the conflated meaning from the embedding of a polysemous word. The authors define the operator nqnot(a, b) = a − (a · b / |b|²) b, and we show with a comparison plot how it can help distinguish the different meanings of a word.
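The operator translates directly into code: it removes from a its component along b, leaving a vector orthogonal to b. The toy vectors below are our own, for illustration:

```python
import numpy as np

def nqnot(a, b):
    # nqnot(a, b) = a - ((a . b) / |b|^2) * b
    return a - (a @ b) / (b @ b) * b

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
residual = nqnot(a, b)
# residual = [0., 1., 0.], which is orthogonal to b: residual @ b == 0
```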

For illustrative purposes, we choose the same polysemous word used by (Widdows, 2003), suit, and use the nqnot operator to orthonormalize with respect to lawsuit and dress, the two main meanings used as dimensions of variability. The items in our goal are the 20,000 most frequent words in the Wikipedia embedding space, removing stop-words. In Figure 3, we show the overall plot and we zoom in on the items that are closest to each axis. Words closer to the axis negating lawsuit are all related to dresses and the act of wearing something,


Figure 3: Plot of embeddings in Wikipedia with suit negated with respect to lawsuit and dress respectively as axes.

Figure 4: The top figure is a fine-grained comparison of the subspace on the axes google and microsoft in Wikipedia; the bottom one is the t-SNE counterpart.

while words closer to the axis negating dress are related to law. This visualization clearly confirms the ability of the nqnot operator to disentangle multiple meanings from polysemous embeddings.

Fine-grained embedding analysis  We consider embeddings that are close to be semantically related, but even close embeddings may have nuances that distinguish them. When projecting in two dimensions through PCA or t-SNE we are conflating a multidimensional notion of similarity to a bi-dimensional one, losing the fine-grained distinctions. The Cartesian view allows for a more fine-grained visualization that emphasizes nuances that could otherwise go unnoticed.

Figure 5: Two polar views of countries and foods.

To demonstrate this capability, we select as dimensions of variability single words in close vicinity to each other in the Wikipedia embedding space: google and microsoft, as google is the closest word to microsoft and microsoft is the 3rd closest word to google. As items, we pick the 30,000 most frequent words, removing stop-words and the 500 most frequent words (as they are too generic), and keeping only the words that have a cosine similarity of at least 0.4 with both google and microsoft and a cosine similarity below 0.75 with respect to google + microsoft, as we are interested in the most polarized words.
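The selection rule above can be sketched as follows; the toy 3-dimensional vectors and word list are invented for illustration and only mimic the stated thresholds:

```python
import numpy as np

# Toy vectors: "twitter" is related to both anchors but polarized,
# "tech" is related to both but too close to their sum (too generic),
# and "banana" is unrelated.
emb = {
    "google":    np.array([1.0, 0.0, 0.0]),
    "microsoft": np.array([0.0, 1.0, 0.0]),
    "twitter":   np.array([0.6, 0.55, 0.9]),
    "tech":      np.array([0.7, 0.7, 0.1]),
    "banana":    np.array([-0.5, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g, m = emb["google"], emb["microsoft"]
items = [w for w in emb
         if w not in ("google", "microsoft")
         and cosine(emb[w], g) >= 0.4          # similar to google
         and cosine(emb[w], m) >= 0.4          # similar to microsoft
         and cosine(emb[w], g + m) < 0.75]     # but polarized, not generic
# items -> ["twitter"]
```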

The left side of Figure 4 shows how, even if those embeddings are close to each other, it is easy to identify peculiar words (highlighted with red dots). The ones that relate to web companies and services (twitter, youtube, myspace) are much closer to the google axis. Words related to legal issues (lawsuit, antitrust), videogames (ps3, nintendo, xbox) and traditional IT companies are closer to the microsoft axis.

For contrast, the t-SNE projection is shown in the right side of Figure 4: it is hard to appreciate the similarities and differences among those embeddings other than seeing them being close in the projected space. This confirms, on one hand, that the notion of similarity between terms in an embedding space hides many nuances that are captured in those representations, and, on the other hand, that the proposed methodology enables a more detailed inspection of the embedded space.

Accuracy                  Factor              F(1,91)  p-value
Projection × Task         Projection          46.11    0.000***
                          Task                1.709    0.194
                          Projection × Task   3.452    0.066
Projection × Obfuscation  Projection          57.73    0.000***
                          Obfuscation         23.93    0.000***
                          Projection × Obf    5.731    0.019*

Table 1: Two-way ANOVA analyses of Task (Commonality vs. Polarization) and Obfuscation (Obfuscated vs. Non-obfuscated) over Projection (Explicit Formulae vs. t-SNE).

Multi-dimensional similarity nuances can be visualized using the polar view. In Figure 5, we show how to use Parallax to visualize a small number of items on more than two axes, specifically five food-related items compared over five countries’ axes. The most typical food from a specific country is the closest to that country’s axis, with sushi being predominantly close to Japan and China, dumplings being close to both Asian countries and Italy, pasta being predominantly closer to Italy, chocolate being close to European countries, and champagne being closer to France and Italy. This same approach could also be used for bias detection among different ethnicities, for instance.
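One plausible way to compute the polar view's polygons, sketched with hypothetical axis words and random toy vectors (not Parallax's code): each item gets one radial coordinate per axis, and the axes are placed at equally spaced angles around a circle.

```python
import math
import numpy as np

axes = ["italy", "france", "japan"]
rng = np.random.default_rng(3)
emb = {w: rng.normal(size=50) for w in axes + ["pasta", "sushi"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def polar_polygon(word):
    # One vertex per axis: the angle is fixed by the axis index and the
    # radius is the item's similarity to that axis.
    vertices = []
    for i, axis in enumerate(axes):
        angle = 2 * math.pi * i / len(axes)
        r = cosine(emb[word], emb[axis])
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices

polygons = {w: polar_polygon(w) for w in ("pasta", "sushi")}
```

Drawing each item as such a polygon is what makes too many overlapping items cluttered, which is why the polar view suits few items over many axes.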

3 User Study

We conducted a user study to find out if and how visualizations using user-defined semantically meaningful algebraic formulae help users achieve their analysis goals. What we are not testing for is the projection quality itself, as in PCA and t-SNE it is obtained algorithmically, while in our case it is explicitly defined by the user. We formalized the research questions as: Q1) Does Explicit Formulae outperform t-SNE in goal-oriented tasks? Q2) Which visualization do users prefer?

To answer these questions we invited twelve subjects among data scientists and machine learning researchers, all acquainted with interpreting dimensionality reduction results. We defined two types of tasks, namely Commonality and Polarization, in which subjects were given a visualization together with a pair of words (used as axes in Explicit Formulae, or highlighted with a big font and a red dot in case of t-SNE). We asked the subjects to identify either common or polarized words w.r.t. the two provided ones. The provided pairs were: banana & strawberry, google & microsoft, nerd & geek, book & magazine. The test subjects were given a list of eight questions, four per task type, and their proposed lists of five words were compared with a gold standard provided by a committee of two computational linguistics experts. The tasks were fully randomized within subjects to prevent learning effects. In addition, we obfuscated half of our questions by replacing the words with a random numeric ID to prevent prior knowledge from affecting the judgment. We track the accuracy of the subjects by calculating the number of words provided that are present in the gold standard set, and we also collected an overall preference for either visualization.

As reported in Table 1, two-way ANOVA tests revealed significant differences in accuracy for the factor of Projection against both Task and Obfuscation, which is a strong indicator that the proposed Explicit Formulae method outperforms t-SNE in terms of accuracy in both Commonality and Polarization tasks. We also observed significant differences in Obfuscation: subjects tend to have better accuracy when the words are not obfuscated. We ran post-hoc t-tests that confirmed how the accuracy of Explicit Formulae on Non-obfuscated is significantly better than on Obfuscated, which in turn is significantly better than t-SNE Non-obfuscated, which is significantly better than t-SNE Obfuscated. Concerning Preference, nine out of twelve (75%) subjects chose Explicit Formulae over t-SNE. In conclusion, our answers to the research questions are that (Q1) Explicit Formulae lead to better accuracy in goal-oriented tasks, and (Q2) users prefer Explicit Formulae over t-SNE.

4 Related Work

A considerable body of research has gone into distributional semantics and embedding methods (see (Lenci, 2018) for a comprehensive overview), but we will focus on the embedding visualization literature. In their recent paper, (Heimerl and Gleicher, 2018) extracted a list of routinely conducted tasks where embeddings are employed in visual analytics for NLP, such as comparing concepts, finding analogies, and predicting contexts. iVisClustering (Lee et al., 2012) represents topic clusters as their most representative keywords and displays them as a 2D scatter plot and a set of linked visualization components supporting interactively constructing topic hierarchies. ConceptVector (Park et al., 2018) makes use of multiple keyword sets to encode the relevance scores of documents and topics: positive words, negative words, and irrelevant words. It allows users to select and build a concept iteratively. (Liu et al., 2018) display pairs of analogous words obtained through analogy by projecting them on a 2D plane, obtained through PCA and an SVM that finds the plane separating words on the two sides of the analogy. Besides word embeddings, visualization has been used to understand topic modeling (Chuang et al., 2012) and how topic models evolve over time (Havre et al., 2002). Compared to existing literature, our work allows for more fine-grained direct control over the conceptual axes and the filtering logic, allowing users to: 1) define concepts based on explicit algebraic formulae beyond single keywords, 2) filter depending on metadata, 3) perform multidimensional projections beyond the common 2D scatter plot view using the polar view, and 4) perform comparisons between embeddings from different data sources. Those features are absent in other proposed tools.

5 Conclusions

We presented Parallax, a tool for embedding visualization, and a simple methodology for projecting embeddings into lower-dimensional, semantically meaningful subspaces through explicit algebraic formulae. We showed how this approach allows goal-oriented analyses, more fine-grained analysis and cross-dataset comparisons through a series of case studies and a user study.

References

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, pages 4349–4357.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

Jason Chuang, Christopher D. Manning, and Jeffrey Heer. 2012. Termite: Visualization techniques for assessing textual topic models. In AVI, pages 74–77.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences.

Susan Havre, Elizabeth G. Hetzler, Paul Whitney, and Lucy T. Nowell. 2002. ThemeRiver: Visualizing thematic changes in large document collections. IEEE Trans. Vis. Comput. Graph., 8(1):9–20.

Florian Heimerl and Michael Gleicher. 2018. Interactive analysis of word vector embeddings. Comput. Graph. Forum, 37(3):253–265.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.

Aylin Caliskan Islam, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora necessarily contain human biases. Science, 356:183–186.

Hanseung Lee, Jaeyeon Kihm, Jaegul Choo, John T. Stasko, and Haesun Park. 2012. iVisClustering: An interactive visual document clustering via topic modeling. Comput. Graph. Forum, 31(3):1155–1164.

Alessandro Lenci. 2018. Distributional models of word meaning. Annual Review of Linguistics, 4(1):151–171.

Shusen Liu, Peer-Timo Bremer, Jayaraman J. Thiagarajan, Vivek Srikumar, Bei Wang, Yarden Livnat, and Valerio Pascucci. 2018. Visual exploration of semantic relationships in neural word embeddings. IEEE Trans. Vis. Comput. Graph., 24(1):553–562.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP, pages 1059–1069.

Deok Gun Park, Seungyeon Kim, Jurim Lee, Jaegul Choo, Nicholas Diakopoulos, and Niklas Elmqvist. 2018. ConceptVector: Text visual analytics via interactive lexicon building using word embedding. IEEE Trans. Vis. Comput. Graph., 24(1):361–370.

K. Pearson. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Dominic Widdows. 2003. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In ACL, pages 136–143.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, pages 2979–2989.


A Appendix

In this appendix we show all the images presented in the main body of the paper in full size, to make them easier to read.


Figure 6: Screenshot of Parallax.


Figure 7: Professions plotted on “male” and “female” axes in Wikipedia embeddings.


Figure 8: Professions plotted on “male” and “female” axes in Wikipedia and Twitter embeddings.


Figure 9: Plot of embeddings in Wikipedia with suit negated with respect to lawsuit and dress respectively as axes.


Figure 10: Plot of embeddings in Wikipedia with apple negated with respect to fruit and computer respectively as axes.


Figure 11: Fine-grained comparison of the subspace on the axis google and microsoft in Wikipedia.


Figure 12: Fine-grained comparison of the subspace on the axes nqnot(google, microsoft) and nqnot(microsoft, google) in Wikipedia.


Figure 13: t-SNE visualization of google and microsoft in Wikipedia.


Figure 14: Two polar views of countries and foods in Wikipedia.