ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 22




Principal Word Vectors

Ali Basirat


Dissertation presented at Uppsala University to be publicly examined in Room 22-0008, Humanistiska teatern, 752 38 Uppsala, Saturday, 8 September 2018 at 09:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Hinrich Schütze.

Abstract

Basirat, A. 2018. Principal Word Vectors. Studia Linguistica Upsaliensia 22. 159 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-0365-9.

Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequency of seeing words in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to be processed. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form.

We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in the literature, enable the principal word embedding method to train high quality word embeddings in an efficient way.

We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it.

The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.

Keywords: word, context, word embedding, principal component analysis, PCA, sparse matrix, singular value decomposition, SVD, entropy

Ali Basirat, Department of Linguistics and Philology, Box 635, Uppsala University, SE-751 26 Uppsala, Sweden.

© Ali Basirat 2018

ISSN 1652-1366
ISBN 978-91-513-0365-9
urn:nbn:se:uu:diva-353866 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-353866)


To Marzieh


Contents

1 Introduction
    1.1 Word Representations
    1.2 Research Questions
    1.3 Contributions
    1.4 Outline
    1.5 Relation to Published Work

Part I: Background

2 Mathematical Concepts
    2.1 Linear Algebra
        2.1.1 Vector
        2.1.2 Matrix
    2.2 Probability and Statistics
    2.3 Principal Component Analysis
    2.4 Generalized Principal Component Analysis

3 Words and Word Embeddings
    3.1 Words
        3.1.1 Morphology
        3.1.2 Syntax
        3.1.3 Semantics
    3.2 Word Embeddings
        3.2.1 General Idea
        3.2.2 Word Embeddings in Distributional Semantics
        3.2.3 Word Embedding in Language Modelling
        3.2.4 Evaluation

Part II: Principal Component Analysis for Word Embeddings

4 Contextual Word Vectors
    4.1 Corpus
    4.2 Context
    4.3 Feature and Word Variables
    4.4 Mixtures of Contextual Word Vectors
    4.5 Distributions of Mixtures of Contextual Word Vectors
    4.6 Combined Feature Variables
    4.7 Summary

5 Principal Word Vectors
    5.1 Word Embedding through Generalized PCA
    5.2 Parameters of Generalized PCA
        5.2.1 Metric and Weight Matrices
        5.2.2 Transformation Function
        5.2.3 Eigenvalue Weighting Matrix
    5.3 Centred Singular Value Decomposition
    5.4 Summary

6 Connections with Other Methods
    6.1 HAL and HPCA
    6.2 Random Indexing (RI)
    6.3 GloVe
    6.4 RSV
    6.5 Summary

Part III: Experiments

7 Experimental Settings
    7.1 Evaluation Metrics
        7.1.1 Intrinsic Evaluation
        7.1.2 Extrinsic Evaluation
    7.2 Initial Settings
    7.3 Comparison Settings
    7.4 Training Corpus
    7.5 Summary

8 Results
    8.1 Parameters of Principal Word Embedding
        8.1.1 Feature Variables
        8.1.2 Number of Dimensions
        8.1.3 Weighting and Transformation
    8.2 Comparison
    8.3 Summary

9 Conclusion
    9.1 Limitations of PCA
    9.2 Effective and Efficient PCA
    9.3 PCA Limitations in Other Methods
    9.4 Evaluation
    9.5 Future Work

References


Acknowledgement

I would like to thank many people who helped and encouraged me while working on this thesis. First and foremost, I want to thank my advisors Joakim Nivre and Christian Hardmeier, who gave me enough freedom to develop my own ideas. They have always supported me and shown an interest in my ideas. I have had many inspiring discussions with them, not only on this thesis, but also on how to manage and develop ideas. I am extremely grateful to Joakim for his special support on the technical aspects of my research and his efforts to make enough computational resources available for me. I have always received invaluable advice from Joakim. Both Joakim and Christian made a great effort to improve this thesis.

I also would like to thank Magnus Sahlgren for his valuable comments and advice on the first draft of this thesis. I am extremely grateful to all my colleagues in the computational linguistics group: Bengt Dahlqvist, Beáta Megyesi, Eva Petterson, Fabienne Cap, Gongbo Tang, Jörg Tiedemann, Mats Dahllöf, Marie Dubremetz, Miryam de Lhoneux, Sara Stymne, Sharid Loaiciga, and Yan Shao. I thank them for their scholarly interactions and helpful comments on my work. Among the people in our group, I had the opportunity to closely collaborate with Miryam, Sara, and Yan. That was a great experience for me. I am also greatly thankful for the technical support provided by Per Starbäck. I deeply thank Marc Tang, my officemate, for the great discussions we had on the connections between linguistics and mathematics. We consulted Michael Dunn and Joakim Nivre about many interesting ideas emerging from our discussions. This section gives me an opportunity to thank both Michael and Joakim for the advice and guidelines on the ideas. I benefited from the linguistic and stylistic suggestions provided by Esther Bond. I thank her for proofreading the text.

I would like to express my gratitude to everybody in the Department of Linguistics and Philology. I owe them a great debt of gratitude. Special acknowledgements go to Alexander Nilsson, Ali Yildiz, Carina Jahani, Forogh Hashabeiky, Heinz Werner Wessler, Inga-Lill Holmberg, Jaroslava Obrtelova, Johan Heldt, Josefin Lindgren, Karin Koltay, Linnéa Öberg, Maryam Nourzaei, Niklas Edenmyr, and Rima Haddad. I would also like to thank the people belonging to our local choir in the department. The choir, named LingSing, was managed by Karin Koltay. It was a great experience amid all of the academic and scholarly activities.

The experiments in the thesis were carried out on two high performance computing (HPC) clusters, named Abel and Kebnekaise, in Oslo and Umeå, respectively. The Abel cluster is a high performance computing facility at UiO hosted at USIT (Universitetets senter for informasjonsteknologi) by the Research Infrastructure Services group. The Kebnekaise cluster is a supercomputer at HPC2N (High Performance Computing Center North). I would like to express my deep sense of gratitude to Stephan Oepen, who gave me the permission to access large amounts of resources on the Abel cluster within the NeIC-NLPL project.

I am immensely grateful to my parents for their lifelong support, especially for their assistance during the final steps of writing this thesis. They have always inspired me to follow my dreams and to pursue my scientific ideas. I would also like to thank my parents-in-law for all their aid, and for taking care of my family while I was writing the final version of the thesis. I would never have been able to complete this work without their support and encouragement. My final and special thanks go to my wife, Marzieh Homayoun, to whom I owe a debt of gratitude. She gave me love in my life and supported me while I was working on this thesis.


1. Introduction

The distributional representation of words plays an important role in most modern approaches to natural language processing. In this representation, words are associated with real-valued vectors, called word embeddings or word vectors, which capture global syntactic and semantic dependencies between words in a corpus. This representation enables the application of powerful machine learning techniques, developed for continuous data, to the discrete and symbolic observations of words. It creates a bridge between mathematics and linguistics and enables the use of algebraic methods for reasoning about words.

In the literature, the process of associating words with vectors is called word embedding. Several word embedding methods have been proposed by researchers in different areas of linguistics, cognitive science, and computer science. They are based on the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts. The occurrences of words in contexts are modelled by word count data represented by high-dimensional vectors. These vectors then undergo dimensionality reduction techniques to generate low-dimensional word vectors.

In this thesis, we study the use of principal component analysis (PCA) for word embedding. The limitations of using PCA for word embedding are studied, and a generalized PCA is proposed to mitigate these limitations. We also study how the proposed method is related to other popular word embedding methods.

1.1 Word Representations

The word is one of the basic elements of language. It is studied in different research areas such as philosophy, theology, linguistics and mathematics. In the philosophy of language, words are studied as the meaningful elements of language that reflect meanings in the mind. In theology and religious schools of thought, the concept of the word plays an important role in understanding holy concepts and the world in general. John's Gospel describes the nature of God as a word and the Quran introduces Jesus as the word of God. In linguistics, the word is often defined as the minimal syntactic unit of a language. In computer science and related fields, a word is often regarded simply as a string of symbols. This is related to the form that words take in written languages, such as English, where words are rendered as sequences of characters.


In written languages, words are formed by placing characters of the language next to each other. The relationships between words in a written language are governed by the morphological rules of the language. For example, the words pruned and pruning are different inflections of the word prune. Setting aside the morphological relations between words, the words of a written language can be seen as distinct symbols formed by the characters of the language. In this symbolic representation, the symbols associated with the words pruned and pruning are completely independent of each other. This discrete representation has been used as one of the basic approaches to word representation in computer programs that process natural languages.

The lack of information about relationships between words is a weakness of the symbolic representation of words. This weakness can be addressed in two ways. The first is to use a morphological analyser that determines the morphological relationships between the words. The second is to use a vector representation of the words. In this thesis, we focus on the second solution and study an approach to generating vector representations for words. In the vector representation of words, each word (i.e. each sequence of letters) is associated with a vector. A vector is a mathematical data structure that is characterized by its magnitude and its direction. The association between words and vectors is created so that word similarities are reflected in vector similarities. In other words, similar words are associated with similar vectors.

From a geometric point of view, vectors can be seen as arrows that refer to points in a multi-dimensional space. In this model, vectors are equivalent to the points they refer to, and the similarity between vectors is measured by the distance between their corresponding end points. Forming an association between words and vectors can be seen as embedding or distributing the words in the space formed by the vectors, in such a way that the vectors associated with similar words are clustered together in the vector space. Hence, different names are used to refer to the vectors associated with words, such as word vectors, word embeddings, or distributional representations of words. Figure 1.1 shows an example of word embeddings in a two-dimensional vector space. We see that similar words are clustered together. For example, a set of verbs are clustered at the top of the figure and a set of nouns are clustered at the bottom of the figure. Within the verb cluster, there are two sub-clusters: auxiliary verbs and past tense verbs. In the noun cluster, there are multiple sub-clusters of nouns including weekdays, country names, family members, and furniture names.


Figure 1.1. A two-dimensional representation of word embeddings. Words with similar meanings are close to each other.

1.2 Research Questions

Techniques for word embedding have been developed in two research areas: 1) distributional semantics, and 2) language modelling. Techniques developed in the area of distributional semantics mostly use algebraic matrix factorization methods to create the word embeddings. These techniques are known as spectral methods of word embedding. Conversely, techniques developed in the area of language modelling mostly use neural networks to create the word embeddings. Levy and Goldberg (2014b) show that both techniques are closely related to each other. In both cases, the word embeddings are generated from the low-rank factors of a co-occurrence matrix, i.e., a matrix counting the frequency with which words appear together.

Although all word embedding techniques follow this principle in theory, in practice some of the techniques that do not explicitly follow the factorization principle result in word embeddings that are more meaningful for certain tasks. These successful techniques have been developed in both areas of word embedding. For example, the successful word embedding techniques GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013a) differ in that one is a distributional semantics method (GloVe) and the other a language modelling method (word2vec).


However, those methods that explicitly follow the matrix factorization principle (e.g. LSA (Landauer and Dumais, 1997) and HPCA (Lebret and Collobert, 2014)) are not among the most successful methods of word embedding. In a more general way, one might ask why methods that explicitly follow the factorization principle are not as good as other methods. In order to shed some light on this question, we study a simple spectral method of word embedding, which makes use of principal component analysis (PCA) to create word embeddings. We try to answer the question using this simple spectral method. More formally, we formulate the main research questions of this thesis as follows:

1. What are the limitations of using PCA for word embeddings?
2. How can we make effective and efficient use of PCA for word embeddings?
3. How are the limiting factors of PCA avoided or handled in other word embedding methods?
4. How do word embeddings based on PCA perform empirically with respect to intrinsic and extrinsic evaluation criteria?

In order to answer the first research question, we study principal component analysis in detail and show what makes it unsuitable for data based on word co-occurrences. The second question is answered by proposing a PCA-based word embedding method, which transforms the co-occurrence data in such a way that it becomes suitable for PCA. We also propose an algorithm that enables principal component analysis to be used on the data to create the word embeddings in an efficient way. This method is referred to as the principal word embedding method, or principal word embedding for short. We answer the third question by studying the relationships between the principal word embedding method and other popular word embedding methods. In this study, we investigate how the limiting factors of using PCA for word embedding are addressed in the other word embedding methods. In order to answer the fourth question, we organize a series of experiments and generate different sets of PCA-based word embeddings. These embeddings are then evaluated with multiple intrinsic and extrinsic evaluation metrics.

1.3 Contributions

In this thesis, we propose a spectral word embedding method that employs principal component analysis to train word embeddings from both raw and annotated corpora. The main contributions of the thesis can be summarized in six points:

1. A detailed study of the limitations of PCA for word embedding. We analyse the distribution of word co-occurrence data and show that it is not well suited for performing PCA.


2. An adaptive transformation function that mitigates the problem with the data distribution. The transformation function reshapes the data distribution into a distribution that is more suitable for PCA.

3. A generalized method for PCA. The method includes the adaptive transformation function and generalizes the concept of a corpus to make it possible to train word embeddings from both raw and annotated corpora.

4. An efficient method of singular value decomposition. A randomized singular value decomposition is proposed which makes it possible to compute the principal components of a sparse mean-centred matrix without explicitly performing the mean-centring step. This makes it easier to compute the principal components of a large sparse matrix while preserving the sparsity of the matrix (a sketch of the implicit centring idea is given after this list).

5. Two intrinsic evaluation metrics. The metrics provide a statistical view of the spread and the discriminability of the word embeddings.

6. An empirical study of the proposed principal word embedding method and its contribution to different NLP tasks.
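
As a rough illustration of the implicit mean-centring idea in point 4, the sketch below never forms the dense centred matrix A − 1μ^T explicitly; it only defines the action of that matrix on vectors through the sparse matrix A and its column means μ, so the sparsity of A is preserved. This is a hedged sketch rather than the thesis's algorithm: the function name centred_svd is hypothetical, and it relies on SciPy's iterative svds solver instead of the randomized SVD developed later in the thesis.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, svds

    def centred_svd(A, k):
        """Rank-k SVD of the column-centred matrix A - 1*mu^T, where mu holds the
        column means of the sparse matrix A. The centred matrix is never built;
        only its products with vectors are computed, so A stays sparse."""
        m, n = A.shape
        mu = np.asarray(A.mean(axis=0)).ravel()   # column means, shape (n,)
        ones = np.ones(m)

        def matvec(x):                            # (A - 1 mu^T) x  =  A x - 1 (mu . x)
            x = np.ravel(x)
            return A @ x - ones * (mu @ x)

        def rmatvec(y):                           # (A - 1 mu^T)^T y  =  A^T y - mu (1 . y)
            y = np.ravel(y)
            return A.T @ y - mu * (ones @ y)

        C = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec)
        return svds(C, k=k)                       # U, singular values, V^T

    # Toy usage: a 1000 x 5000 sparse count matrix reduced to 50 dimensions.
    A = sp.random(1000, 5000, density=0.01, format="csr", random_state=0)
    U, S, Vt = centred_svd(A, k=50)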

1.4 Outline

This thesis consists of an introduction chapter, Chapter 1, three main parts, and a conclusion chapter, Chapter 9. In this outline, we provide a bird's eye view of the topics presented in different parts of this thesis.

Part I presents the theoretical background of the thesis. It starts with a chapter focused on the basic mathematical concepts in linear algebra, probability and statistics. In linear algebra, we provide a short introduction to vectors and matrices. Our introductory chapter on probability and statistics is focused on random variables and their descriptors such as the mean, variance, distribution, and entropy. In addition, individual sections are devoted to principal component analysis (PCA) and generalized principal component analysis. We also provide a brief introduction to topics related to PCA such as the singular value decomposition and the eigenvalue decomposition.

The second chapter of Part I, Chapter 3, provides an introduction to words and word embeddings. It introduces words from a linguistic point of view and reviews related work on word embeddings. In this chapter, word embeddings are introduced as vector representations of words that bridge the gap between linguistics and mathematics.

Part II is about the use of principal component analysis for word embedding. This part can be seen as the core of the thesis, where we answer most of our research questions. Chapter 4 introduces the basic concepts needed to formulate the PCA-based word embedding method. In this chapter, we introduce contextual word vectors as basic mathematical abstractions to represent words. Then we introduce a mixture of contextual word vectors to which PCA is applied to generate the desired low-dimensional word vectors (also called word embeddings). In this chapter, we address our first research question by showing why raw contextual word vectors are normally unsuited for PCA. Chapter 5 introduces principal word vectors as low-dimensional vectors obtained from principal component analysis of mixtures of contextual word vectors. In this chapter, we answer our second research question and explain how word embeddings can efficiently and effectively be trained by PCA. Chapter 6 studies the connections between the principal word embedding method and other popular word embedding methods. This chapter is mostly focused on our third research question.

Part III, which consists of two chapters, describes our experimental settings and reports the experimental results. Chapter 7 describes the experimental settings and introduces multiple intrinsic and extrinsic evaluation metrics to assess principal word vectors. These metrics are used to answer our fourth research question. It also introduces comparison settings, which are used to compare the principal word embedding method with other popular word embedding methods. Chapter 8 reports and discusses the results obtained from the experiments. Finally, Chapter 9 concludes the thesis.

1.5 Relation to Published Work

The principal word embedding method introduced in this thesis is connected to our previous publications as follows. Basirat and Nivre (2017) investigate the effect of different transformation functions on principal word vectors. In that paper, word vectors are evaluated based on their contribution to the task of dependency parsing. The empirical results show that the best transformation function using the dependency parsing evaluation metric is the seventh root transformation function. The word embedding method introduced by Basirat and Nivre (2017) is then also used by de Lhoneux et al. (2017) to train a universal dependency parsing model. De Lhoneux et al. (2017) make use of the universal dependency features of words to train a set of cross-lingual word vectors for all of the languages represented in the CoNLL-2017 shared task on parsing to universal dependencies. Basirat and Tang (2018) investigate the presence of lexical and morpho-syntactic features in the word vectors trained using the approach of Basirat and Nivre (2017). Basirat and Tang (2018) show that the information encoded into the principal word vectors is very rich in that it contains some specific morpho-syntactic features of nouns such as grammatical gender, but it is very poor in terms of other types of features such as the mass/count feature. This thesis introduces a word embedding method that generalizes the previous method of Basirat and Nivre (2017).


Part I: Background


2. Mathematical Concepts

In this chapter, we provide an introduction to the mathematical concepts required to read the remainder of this thesis. In Section 2.1, we introduce basic concepts in linear algebra such as vectors and matrices. In Section 2.2, we introduce basic concepts in probability and statistics such as random variables, probability distributions and entropy. Readers who are already familiar with these concepts can skip this chapter.

2.1 Linear Algebra

We provide a brief introduction to the basic concepts of linear algebra that are used in the thesis: the vector and the matrix. Both concepts are studied extensively in linear algebra and other areas of mathematics. However, we focus only on those aspects of vectors and matrices which are relevant to the remaining parts of the thesis.

2.1.1 Vector

A vector is a mathematical object that is characterized by its magnitude and direction. In analytic geometry, a vector is represented by an arrow originating from the centre of a Cartesian coordinate system and pointing to a position referenced by a tuple. In this representation, the number of dimensions of a vector is equal to the dimensionality of the coordinate system. Figure 2.1 represents an example of a two-dimensional Cartesian coordinate system with two vectors u = (1, 1) and v = (−1, 1).

A vector in an n-dimensional coordinate system is represented by an ordered tuple such as (u1, . . . , un) referring to a point in the coordinate system. Based on the Pythagorean theorem, the magnitude of the vector u = (u1, . . . , un) is equal to the length of its corresponding line segment:

‖u‖ = √( ∑_{i=1}^{n} ui^2 )    (2.1)

Other terms used for the magnitude of a vector are the norm or the size of the vector. The direction of the vector u is characterized by a unit-length vector called the direction vector:

û = u / ‖u‖    (2.2)
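
As a small numerical illustration of Equations 2.1 and 2.2 (an example added for this text, not taken from the thesis), the magnitude and the direction vector of u = (1, 1) can be computed with NumPy as follows:

    import numpy as np

    u = np.array([1.0, 1.0])              # the vector u = (1, 1) from Figure 2.1
    magnitude = np.sqrt(np.sum(u ** 2))   # Equation 2.1; equal to np.linalg.norm(u)
    direction = u / magnitude             # Equation 2.2: a unit-length vector along u

    print(magnitude)                      # 1.4142... (the square root of 2)
    print(direction)                      # [0.7071... 0.7071...]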


Figure 2.1. Two vectors u = (1, 1) and v = (−1, 1) in a two-dimensional Cartesian coordinate system. The vectors b1 = (1, 0) and b2 = (0, 1) are the basis vectors of the coordinate system.

The origin of a Cartesian coordinate system is referenced by a vector whose elements are zero. This vector is called the null vector of the coordinate system. The null vector or zero vector of a Cartesian coordinate system is a vector with zero magnitude and an arbitrary direction. The common notation for the m-dimensional null vector is 0 = (0, . . . , 0).

In linear algebra, a vector is defined as the most basic element of a linear space, also called a vector space. The two basic operations defined by a vector space are addition and multiplication. The addition of two vectors u = (u1, . . . , un) and v = (v1, . . . , vn) results in a vector as below:

u+ v = (u1 + v1, . . . ,un + vn) (2.3)

The multiplication of a scalar number k ∈ R and a vector u = (u1, . . . , un) is defined as:

ku = (ku1, . . . ,kun) (2.4)

The vector ku in Equation 2.4 has the same direction as u and its magnitude is k‖u‖. Given the addition and multiplication operations, each vector u = (u1, . . . , un) in an n-dimensional vector space can be expressed by a linear combination of a set of basis vectors {b1, . . . , bn} that span the entire vector space:

u = ∑_{i=1}^{n} ui bi    (2.5)

The basis vector bi = (0, . . . , 1, . . . , 0) in Equation 2.5 is a unit-size vector with all elements equal to zero except for a one at position i.

Vectors can be compared with each other in different ways. For example, two vectors can be compared with each other based on the Euclidean distance between them. The Euclidean distance between two vectors u = (u1, . . . , un) and v = (v1, . . . , vn) is equal to the magnitude of the vector u − v, computed as below:

‖u − v‖ = √( ∑_{i=1}^{n} (ui − vi)^2 )    (2.6)

Vectors can also be compared with regard to the cosine of the angle between them. The cosine of the angle between two vectors u = (u1, . . . , un) and v = (v1, . . . , vn) is computed as below:

cos θ = (u · v) / (‖u‖ ‖v‖)    (2.7)

where the dot product u · v in the numerator of the fraction is defined as:

u · v = ∑_{i=1}^{n} ui vi    (2.8)

If the vectors u and v are unit-length vectors, then the dot product between them is equal to the cosine of the angle between them. Two non-zero vectors u and v are orthogonal if the dot product between them is equal to zero, i.e., the cosine of the angle between them is equal to zero. In other words, two vectors are orthogonal if they are perpendicular to each other. A set of mutually orthogonal and unit-length vectors is called orthonormal.
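
The comparisons defined in Equations 2.6–2.8 can be checked numerically; the following short NumPy snippet (an illustrative example, not part of the thesis) uses the two vectors of Figure 2.1, which turn out to be orthogonal:

    import numpy as np

    u = np.array([1.0, 1.0])
    v = np.array([-1.0, 1.0])

    euclidean = np.linalg.norm(u - v)                          # Equation 2.6
    dot = u @ v                                                # Equation 2.8
    cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))     # Equation 2.7

    print(euclidean)    # 2.0
    print(dot, cosine)  # 0.0 0.0: u and v are orthogonal (perpendicular)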

Vectors can be dependent on or independent of each other. A set of vectors is said to be dependent if one of the vectors is a linear combination of the other vectors. Formally, the non-zero vectors u1, . . . , un are dependent on each other if there exist real numbers α1, . . . , αn, with at least one non-zero value, such that:

∑_{i=1}^{n} αi ui = 0    (2.9)

Correspondingly, the non-zero vectors u1, . . . , un are independent of each other if the only solution to Equation 2.9 is αi = 0 for i = 1, . . . , n. An example of a set of independent vectors is the set of basis vectors of a vector space.

The space spanned by a set of independent vectors A = {u1, . . . , un} with ui ≠ 0 is formed by all vectors obtained from linear combinations of the vectors in A as follows:

∑_{i=1}^{n} αi ui    (2.10)

where αi ∈ R.
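
In practice, the linear dependence defined by Equation 2.9 is usually tested by checking the rank of the matrix whose columns are the given vectors (the rank is introduced formally in Section 2.1.2). The following NumPy sketch is an illustrative example with made-up vectors:

    import numpy as np

    # Three vectors in a 3-dimensional space; w = u + 2v, so {u, v, w} is dependent.
    u = np.array([1.0, 0.0, 1.0])
    v = np.array([0.0, 1.0, 1.0])
    w = u + 2 * v

    M = np.column_stack([u, v, w])
    print(np.linalg.matrix_rank(M))            # 2 < 3: the columns satisfy Equation 2.9 non-trivially

    # The basis vectors are independent: the rank equals the number of vectors.
    print(np.linalg.matrix_rank(np.eye(3)))    # 3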

2.1.2 Matrix

A matrix is a mathematical object formed by a two-dimensional array of numbers. An m×n matrix A is represented by a rectangular array consisting of m rows and n columns as follows:

A = ⎡ a1,1  …  a1,n ⎤
    ⎢   ⋮    ⋱   ⋮  ⎥
    ⎣ am,1  …  am,n ⎦

A matrix is called a real matrix if the elements of the matrix are real numbers. The element ai,j of the matrix A refers to the entry in the i-th row and the j-th column of the matrix A. We also use the notation A(i, j) to index the element (i, j) of A.

In the remainder of this part, we introduce some algebraic characteristics of a matrix, which are used for describing the matrix. Then we introduce special matrices that appear in matrix calculations. Finally, we provide an introduction to the different matrix decomposition methods used in this thesis.

A matrix is characterized by several factors such as the size, determinant, norm, and rank of the matrix. The size of a matrix is determined by the number of rows and the number of columns of the matrix. A matrix with m rows and n columns is referred to as an m×n matrix. The notation used for an m×n matrix A is [A]_{m×n}. Another common notation used for an m×n real matrix A is (ai,j) ∈ R^{m×n}. A matrix with equal numbers of rows and columns is called a square matrix.

Each column of an m×n matrix A can be seen as an m-dimensional vector and each row of the matrix can be seen as an n-dimensional vector. In this respect, a matrix is an array of column or row vectors. The number of dimensions of the vector space spanned by the vectors in a matrix is called the rank of the matrix. More formally, the rank of a matrix is the maximum number of linearly independent column or row vectors in the matrix. The rank of an m×n matrix A is smaller than or equal to the minimum of the number of rows and the number of columns of the matrix, i.e., Rank(A) ≤ min(m, n).

Another characteristic of a square matrix A is the determinant of the matrix, denoted det(A). The determinant of a square matrix measures the volume of the region formed by the vectors in the matrix. The determinant is also used to test the linear dependencies between the vectors of a matrix. The determinant of a matrix is equal to zero if and only if the vectors of the matrix are linearly dependent.

The matrix norm generalizes the concept of the vector norm. A matrix norm ‖A‖ is a non-negative real number associated with the matrix A. A special type of matrix norm is the Frobenius norm, defined as:

‖A‖_F = √( ∑_{i=1}^{m} ∑_{j=1}^{n} A(i, j)^2 )    (2.11)

Some special types of matrices that are often seen in matrix calculations are zero matrices, one matrices, sparse matrices, diagonal matrices, identity matrices, and triangular matrices. A matrix with all elements equal to zero is called the zero matrix. An m×n zero matrix is denoted [0]_{m×n}, or simply 0 if the size of the matrix is clear from the context. Similarly, a matrix with all elements equal to one is called a one matrix and is denoted [1]_{m×n} or 1. A matrix with many zero elements and a few non-zero elements is called a sparse matrix. The degree of sparsity (or the sparsity) of a matrix is measured by the ratio of zero elements to total elements in the matrix. On the other hand, the density of a matrix is measured by the ratio of non-zero elements to total elements in the matrix. The density of an m×n matrix with k < mn non-zero elements is k/(mn) and its sparsity is equal to 1 − k/(mn).
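
The Frobenius norm of Equation 2.11 and the density and sparsity measures defined above are easy to compute with SciPy's sparse matrices; the small matrix below is only an illustrative example:

    import numpy as np
    import scipy.sparse as sp

    A = sp.csr_matrix(np.array([[1.0, 0.0, 0.0, 2.0],
                                [0.0, 0.0, 3.0, 0.0]]))   # a 2 x 4 matrix with k = 3 non-zero elements

    m, n = A.shape
    k = A.nnz                                  # number of non-zero elements
    density = k / (m * n)                      # 3 / 8 = 0.375
    sparsity = 1 - density                     # 0.625

    frobenius = np.sqrt((A.data ** 2).sum())   # Equation 2.11: sqrt(1 + 4 + 9)
    print(density, sparsity, frobenius)        # 0.375 0.625 3.7416...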

A matrix with all elements equal to zero except for the entries on the main diagonal of the matrix is called a diagonal matrix. The main diagonal entries of an m×n matrix A are the (i, i)-th entries for i = 1, . . . , m. An example of a 3×3 diagonal matrix is:

⎡ a1,1    0     0  ⎤
⎢   0   a2,2    0  ⎥
⎣   0     0   a3,3 ⎦

A square diagonal matrix with all elements on the main diagonal of the matrix equal to one is called an identity matrix. The m×m identity matrix is denoted as Im. An example of a 3×3 identity matrix is:

⎡ 1 0 0 ⎤
⎢ 0 1 0 ⎥
⎣ 0 0 1 ⎦

A triangular matrix is a matrix with all elements above or below the main diagonal of the matrix equal to zero. If the zero elements are below the main diagonal, the matrix is called an upper triangular matrix; if they are above the main diagonal, it is called a lower triangular matrix. An example of a 3×3 upper triangular matrix is:

⎡ a1,1  a1,2  a1,3 ⎤
⎢   0   a2,2  a2,3 ⎥
⎣   0     0   a3,3 ⎦

Among the basic operations on matrices are matrix transposition, matrix addition, scalar multiplication, matrix multiplication, and general element-wise transformation. Matrix transposition flips the rows and the columns of a matrix. The transposition of the m×n matrix A, denoted by A^T, is the n×m matrix whose (i, j)-th element is equal to the (j, i)-th element of A:

A^T(i, j) = A(j, i)    (2.12)

The addition of two m×n matrices A = (ai,j) ∈ R^{m×n} and B = (bi,j) ∈ R^{m×n} results in a matrix of the same size as A and B. The addition of A and B is a matrix whose elements are the sums of the corresponding elements in A and B:

A + B = (ai,j + bi,j)    (2.13)

The addition operation on matrices is both commutative and associative. According to the commutativity of matrix addition, for every two matrices A and B, if A and B have the same size, then A + B = B + A. According to the associativity of matrix addition, for every three matrices A, B, and C, if A, B, and C have the same size, then (A + B) + C = A + (B + C).

The scalar multiplication of the scalar number k and the matrix A = (ai,j) ∈ R^{m×n} is defined as:

kA = (kai,j)    (2.14)

In the scalar multiplication of a scalar and a matrix, all elements of the matrix are multiplied by the scalar.

The multiplication of two matrices A = (ai,j) ∈ R^{m×l} and B = (bi,j) ∈ R^{l×n} results in an m×n matrix C whose (i, j)-th element is equal to the dot product of the vector in the i-th row of A and the vector in the j-th column of B:

Ci,j = ∑_{k=1}^{l} ai,k bk,j    (2.15)

Matrix multiplication is an associative operation, i.e., for every three matrices A, B, and C, if both AB and BC are defined, we have (AB)C = A(BC). However, matrix multiplication is not commutative, i.e., the matrix products AB and BA are not equal in general. For each square matrix A with det(A) ≠ 0, there is an inverse matrix A^{−1} whose multiplication with A results in the identity matrix, i.e., AA^{−1} = I.
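
A short NumPy check of the multiplication rules just described (Equation 2.15, non-commutativity, and the inverse); the two matrices are arbitrary examples:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    B = np.array([[0.0, 1.0],
                  [1.0, 0.0]])

    print(A @ B)                              # Equation 2.15: row-by-column dot products
    print(np.allclose(A @ B, B @ A))          # False: matrix multiplication is not commutative

    A_inv = np.linalg.inv(A)                  # exists because det(A) = -2 is non-zero
    print(np.allclose(A @ A_inv, np.eye(2)))  # True: A A^{-1} = I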

Any real-valued function f : R → R can be applied to the elements of a real matrix. This can be seen as an element-wise transformation of the matrix. An example of an element-wise transformation function is the power transformation function f(x) = x^k. The element-wise application of the power transformation function to a real matrix such as:

A = ⎡ a1,1  …  a1,n ⎤
    ⎢   ⋮    ⋱   ⋮  ⎥
    ⎣ am,1  …  am,n ⎦

results in:

f(A) = ⎡ a1,1^k  …  a1,n^k ⎤
       ⎢    ⋮     ⋱     ⋮  ⎥
       ⎣ am,1^k  …  am,n^k ⎦

in which each element of A is raised to the power k.
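
Element-wise power transformations of this kind reappear later in the thesis as transformation functions applied to co-occurrence counts. The NumPy sketch below is only an illustration; the exponent k = 1/7 (the seventh root mentioned in Section 1.5) and the matrix are example choices:

    import numpy as np

    A = np.array([[0.0, 1.0, 8.0],
                  [27.0, 64.0, 128.0]])   # a small matrix of non-negative counts (made up)

    k = 1.0 / 7.0                         # an example exponent: the seventh-root transformation
    f_A = A ** k                          # element-wise power transformation f(A)
    print(f_A)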

A matrix can be decomposed into a product of matrices called factors. Matrix decomposition is often used to make it easier to solve matrix equations.

It is also used to approximate a matrix by its factors. For example, an extensively used matrix decomposition is the singular value decomposition, which enables a low-rank approximation of a matrix. Let A be an m×n matrix of rank k. A rank-k′ approximation of A with 0 < k′ < k is an m×n matrix B of rank k′ that minimizes the Frobenius norm ‖A − B‖_F.

There are several types of matrix decomposition. In this part, we provide an introduction to three matrix decomposition methods that are used in this thesis:

• QR decomposition,
• Eigen-decomposition, and
• Singular Value Decomposition (SVD)

Any m×n real matrix A can be decomposed into two matrices Q and R as follows:

A = QR (2.16)

where Q is an m×m orthonormal matrix and R is an m×n upper triangular matrix. The orthonormal matrix Q is also called a basis matrix for the column vectors of A. If A is a square matrix (i.e. m = n) and A has k linearly independent column vectors, then the first k columns of Q span the space of the column vectors of A. If A is a rectangular matrix with m < n, then the columns of Q span the space of the column vectors of A. Otherwise, if A is a rectangular matrix with m > n, then the first n columns of Q span the space of the column vectors of A. The first n columns of Q in this case do not span the m-dimensional space of the column vectors of A, but they span an n-dimensional subspace of the space spanned by the column vectors of A.
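
Equation 2.16 is available as a standard library routine; the following NumPy example (illustrative only) uses the "complete" mode so that Q is the full m×m orthonormal matrix described above:

    import numpy as np

    A = np.random.rand(5, 3)                   # an arbitrary rectangular matrix with m > n
    Q, R = np.linalg.qr(A, mode="complete")    # Q is 5 x 5 orthonormal, R is 5 x 3 upper triangular

    print(np.allclose(Q @ R, A))               # True: A = QR (Equation 2.16)
    print(np.allclose(Q.T @ Q, np.eye(5)))     # True: the columns of Q are orthonormal
    print(np.allclose(np.tril(R, k=-1), 0))    # True: everything below the main diagonal of R is zero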

A vector x ≠ 0 is an eigenvector of the square matrix A if:

Ax = λx (2.17)

where λ is a number called the eigenvalue of the matrix A corresponding to the eigenvector x. An n×n matrix with n ∈ N has a set of eigenvalues Λ = {λ1, . . . , λn} of size n, where each eigenvalue λi ∈ Λ is a complex number. In this thesis, we deal with the real part of the eigenvalues and ignore their imaginary parts. Therefore, we assume that the eigenvalues of a matrix are real numbers. In Equation 2.17, the product of A and x results in a vector in the same space as x. If the resulting vector obtained from Ax has the same direction as x, then the vector x is called an eigenvector of A and the scaling factor between Ax and x is equal to the eigenvalue λ. In other words, if we consider A as a linear transformation, a vector x ≠ 0 is an eigenvector of A if its transformation under A lies on the one-dimensional subspace along x. A matrix can have multiple eigenvectors and eigenvalues. The number of positive eigenvalues of a matrix is smaller than or equal to the rank of the matrix. The product of the eigenvalues of a matrix is equal to the determinant of the matrix. More concretely, let A be an n×n matrix (n ∈ N) and Λ = {λ1, . . . , λn} be the set of eigenvalues of A; then the determinant of A and its eigenvalues are related to each other as follows:

det(A) = ∏_{i=1}^{n} λi    (2.18)

The ordered set of positive eigenvalues of a matrix is called the spectrum of the eigenvalues of the matrix. The eigenvalue decomposition plays an important role in principal component analysis, which will be described in Section 2.3.
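
Equations 2.17 and 2.18 can be verified numerically; the symmetric 2×2 matrix below is an arbitrary example:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    eigenvalues, eigenvectors = np.linalg.eig(A)
    x = eigenvectors[:, 0]                         # an eigenvector of A
    lam = eigenvalues[0]                           # the corresponding eigenvalue

    print(np.allclose(A @ x, lam * x))             # True: Equation 2.17, Ax = lambda x
    print(np.allclose(np.prod(eigenvalues),
                      np.linalg.det(A)))           # True: Equation 2.18, det(A) = product of eigenvalues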

Any m×n matrix A can be decomposed into the product of three matrices as follows:

A = UΣV^T    (2.19)

where U is an m×m orthonormal matrix (i.e. U^T U = Im), Σ is an m×n diagonal matrix with non-negative entries, and V is an n×n orthonormal matrix (i.e. V^T V = In). This matrix decomposition is called the singular value decomposition. The column vectors of U and the column vectors of V are the left and the right singular vectors of A, respectively, and the diagonal entries of Σ are the singular values of A. Each of the column vectors in U and V is associated with one of the singular values in Σ. The singular vectors associated with the top singular values are called the top or leading singular vectors. The space spanned by the first 1 ≤ k ≤ m top positive left singular vectors of A is the best k-dimensional subspace in the sense that it minimises the sum of the squared distances between the column vectors of A and the subspace. For example, if A contains two-dimensional column vectors (i.e. m = 2), then the space spanned by the first top singular vector is a line, and the sum of the squared distances between the column vectors of A and this line is smaller than for any other line in the space of the column vectors of A. Similarly, the space spanned by the first 1 ≤ k ≤ n top positive right singular vectors of A is the best k-dimensional subspace that minimises the sum of the squared distances between the row vectors of A and the subspace.

Unlike the eigen-decomposition, which is applicable only to square matrices, the singular value decomposition can be performed on any matrix (rectangular or square). The singular vectors and the singular values of a matrix A are related to the eigenvectors and the eigenvalues of the matrices AA^T and A^T A. The left singular vectors of A are equal to the eigenvectors of AA^T and the right singular vectors of A are equal to the eigenvectors of A^T A. The positive singular values of A are equal to the square roots of the eigenvalues of AA^T or A^T A, which are the same.
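
The singular value decomposition of Equation 2.19, its relation to the eigenvalues of A^T A, and a best low-rank approximation can all be checked with NumPy; the matrix here is an arbitrary example:

    import numpy as np

    A = np.random.rand(4, 3)

    U, S, Vt = np.linalg.svd(A, full_matrices=True)    # Equation 2.19: A = U Sigma V^T
    Sigma = np.zeros((4, 3))
    Sigma[:3, :3] = np.diag(S)
    print(np.allclose(U @ Sigma @ Vt, A))              # True

    # The positive singular values are the square roots of the eigenvalues of A^T A.
    eigvals = np.linalg.eigvalsh(A.T @ A)              # eigenvalues in ascending order
    print(np.allclose(np.sort(S), np.sqrt(np.clip(eigvals, 0.0, None))))   # True

    # A rank-1 approximation built from the leading singular vectors and value.
    A1 = S[0] * np.outer(U[:, 0], Vt[0])
    print(np.linalg.norm(A - A1))                      # the smallest Frobenius error over all rank-1 matrices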

2.2 Probability and Statistics

In this section, we provide a brief introduction to some basic concepts of probability and statistics: sample spaces, random variables, probability distributions, and the entropy of random variables.

A sample space is a set consisting of all possible outcomes of an experiment. For example, the sample space of tossing two coins is:

S = {HH, TH, HT, TT}

where H and T are short notations for Head and Tail, and each element of S is a possible outcome, e.g., HH denotes both coins showing Head. An event is a subset of a sample space. It is common to describe an event with a declarative statement for which the elements of the event are true. For example, the event A = {HH, TH, HT} is associated with the statement at least one of the coins shows Head.

A random variable x is a mapping from a sample space to a set of values.

A numerical random variable is one that maps a sample space to a numericalset:

x : S→ R (2.20)

where S is a sample set and R is a numerical set such as the set of real numbersR. In other words, a numerical random variable is the numerical representationof the outcome of an experiment. For example, one may define the randomvariable x as representing the number of heads in the experiment of tossingtwo coins. In this example, the possible values of x are {0,1,2}. From nowon, we use the term random variable to mean numerical random variable.

Random variables are usually denoted by capital letters (e.g. X, Y, and Z) and their values are denoted by the corresponding lowercase letters (e.g. x, y, and z). It is also quite common to represent a random variable with bold lowercase letters (e.g. x, y, z). In this thesis, we adopt the latter notation, i.e., bold lowercase letters are used for random variables and their values are denoted by normal (non-bold) lowercase letters. The notation x = x refers to an event E ⊆ S for which the value of x is x. In other words, for every element e ∈ E, we have x(e) = x. The value x in this example is known as a realization of the random variable x when the outcome e has occurred randomly according to a probability function that describes the random variable. Later on in this section, we will study the probability distribution functions that describe random variables.

Random variables are generally divided into two classes: discrete random variables and continuous random variables. A random variable is said to be a discrete random variable if its range is a finite or countably infinite set. An example of a discrete random variable is a random variable that counts the number of heads in the experiment of tossing two coins. A random variable is said to be a continuous random variable if its range is an uncountably infinite set (e.g. the set of real numbers ℝ). An example of a continuous random variable is a random variable that measures the blood pressure of a randomly selected person.

A discrete random variable is characterized by a probability function that assigns a probability to each of the elements in the output range of the random variable.


x    Event       f_x(x)
0    {TT}        0.25
1    {TH, HT}    0.5
2    {HH}        0.25

Table 2.1. The probability function of the random variable x that counts the number of heads in tossing two coins.

The common notation for a probability function defined on the random variable x is f_x(x), i.e., f_x(x) = p(x = x). An example of a probability function is shown in Table 2.1.

A continuous random variable is characterized by a probability density function that assigns a real value to the output range of the random variable. Similar to the probability function, the probability density function defined on the random variable x is denoted by f_x(x). However, unlike the values of a probability function, the values of a probability density function are not probability values; they indicate how the probability is distributed over the range of the random variable and are used to compute the probability of the random variable taking a value in an interval. The probability of a continuous random variable x taking a value in the real-valued interval (a, b) is computed by integrating the probability density function as below:

p(x ∈ (a, b)) = ∫_a^b f_x(x) dx    (2.21)

where f_x(x) is the probability density function defined on the values of the random variable x.

The two main ways of estimating the value of a probability density function for a data point are parametric and non-parametric estimation. Let x be a random point sampled from the random variable x. Parametric approaches to probability density estimation assume that x follows a certain probability density function with a fixed set of parameters and use the function to estimate the probability density of x. By contrast, non-parametric methods make no assumption about the probability density function of x and try to estimate the function from a sampled dataset. A non-parametric way to estimate the probability density function of a random variable is the kernel density estimation (KDE), also known as Parzen's window (Parzen, 1962). Let {x_1, …, x_n} be independent, identically distributed random points sampled from the random variable x. KDE estimates the underlying probability density function of x as follows:

f_x(x) = (1 / nh) ∑_{i=1}^{n} K((x − x_i) / h)    (2.22)

where K : ℝ → ℝ is a kernel function and h > 0 is a smoothing parameter, called the bandwidth. A common example of the kernel function K is the Gaussian kernel defined as

K(x) = exp(−x²/2) / s    (2.23)

where s is a normalizing parameter defined as

s = ∫ exp(−x²/2) dx    (2.24)
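A minimal NumPy sketch of the kernel density estimator in Equation 2.22 with the Gaussian kernel of Equations 2.23 and 2.24 might look as follows; the sample points and the bandwidth are arbitrary choices made only for illustration:

import numpy as np

def gaussian_kernel(u):
    # Gaussian kernel of Equation 2.23; the normalizing constant s of Equation 2.24 is sqrt(2*pi).
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def kde(x, sample, h):
    # Kernel density estimate of Equation 2.22 at the point x.
    return np.mean(gaussian_kernel((x - sample) / h)) / h

sample = np.array([0.2, 0.5, 0.9, 1.1, 1.4])  # hypothetical i.i.d. sample drawn from x
print(kde(1.0, sample, h=0.3))                # estimated density at x = 1.0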

The expected value or the mean value of a random variable is a measure of the central tendency of the values of the random variable when the associated experiment is repeated. The expected value of a discrete random variable x with the probability function f_x is computed as:

E[x] = ∑_{x∈R_x} x f_x(x)    (2.25)

where R_x represents the range of x. Similarly, the expected value of a continuous random variable x with the probability density function f_x is computed as

E[x] = ∫_{−∞}^{+∞} x f_x(x) dx    (2.26)

The variance of a random variable x is defined as the expected value of the squared distance between the values of the random variable and the expected value of the random variable:

σ²_x = E[(x − E[x])²]    (2.27)

Variance is a measure of variation in the values of a random variable. A small value of the variance of a random variable indicates that the values of the random variable are massed around their mean value. The joint variation of two random variables is measured by the covariance between the random variables. The covariance between two random variables x and y is defined as:

Cov(x, y) = E[(x − E[x])(y − E[y])]    (2.28)

where E[x] and E[y] are the expected values of x and y.

The amount of uncertainty in a random variable is measured by the entropy of the random variable. The entropy of a random variable x is defined as:

H(x) = −E[log f_x(x)]    (2.29)


where f_x(x) is the probability (distribution or density) function of x. If x is a discrete random variable, then the entropy of x is computed as:

H(x) = −∑_{x∈R_x} f_x(x) log f_x(x)    (2.30)

A high entropy value in Equation 2.30 indicates that the values of the random variable x are difficult to predict. An example of a random variable with high entropy is a binary random variable that takes the value of 1 with probability p = 0.5 and takes the value 0 with probability p = 0.5. This uniform probability distribution shows that the value of the random variable is highly unpredictable, i.e., at each trial, both success and failure are equally likely to happen. The entropy of this Bernoulli random variable is H ≈ 0.7. A low entropy value shows that the value of the random variable is more predictable. For example, if x takes the value of 1 with a probability close to zero, then we can confidently predict the value of the random variable as 0. If the probability of taking the value of 1 is equal to 0 (i.e. f_x(1) = 0), then the only outcome of the random variable will be 0. In this case, the entropy of the random variable is zero. This indicates that the random variable is completely certain about its values.
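The entropy values mentioned above can be reproduced with a short sketch. The function below computes Equation 2.30 using the natural logarithm, under which the uniform Bernoulli distribution has entropy log 2 ≈ 0.69:

import numpy as np

def entropy(probabilities):
    # Entropy of a discrete random variable (Equation 2.30), using the natural logarithm.
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]                              # outcomes with zero probability contribute nothing
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))                    # uniform Bernoulli distribution: about 0.693
print(entropy([1.0, 0.0]))                    # deterministic outcome: 0.0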

Among the different types of random variables, we provide an introduction to those variables that are used in this thesis. A Bernoulli random variable has two possible values, 0 and 1. A Bernoulli random variable x is defined by a single parameter p ∈ [0, 1] that determines the probability of x taking the value of 1, i.e., p(x = 1) = p and p(x = 0) = 1 − p. The parameter p is sometimes called the probability of success. The mean of the Bernoulli random variable x is equal to its probability of success, i.e., E(x) = p. The variance of the Bernoulli random variable x with parameter p is p(1 − p).

The sum of n ∈ N trials of the Bernoulli random variable x with parameter p is a binomial random variable:

y = x_1 + · · · + x_n    (2.31)

The probability distribution of a binomial random variable defined by the parameters p and n is:

f_y(y; p, n) = \binom{n}{y} p^y (1 − p)^{n−y}    (2.32)

The joint probability of m binomial random variables x_i ∼ B(n_i, p_i) for i = 1, …, m with ∑_{i=1}^{m} n_i = N is given by:

f(x_1, …, x_m; N, n_1, …, n_m) = \binom{N}{n_1, …, n_m} ∏_{i=1}^{m} p_i^{x_i}    (2.33)

A vector of random variables x = (x_1, …, x_m) with the joint distribution function in Equation 2.33 is called a multinomial random variable.


A multivariate random variable is a vector of random variables representing a single mathematical system. For example, the temperature and the volume of the gas inside a balloon form a mathematical system that models a specific state of the balloon. We use bold capital letters to represent both multivariate random variables and vectors of random variables. In the example above, if we use t and v as the random variables representing the temperature and the volume of the balloon, then the multivariate random variable B representing the state of the balloon is B = (t, v)^T.

All of the statistical and probabilistic concepts we defined for a scalar random variable generalize to multivariate random variables. In fact, a scalar random variable can be seen as a single-dimensional multivariate random variable. The probability (distribution or density) function of a multivariate random variable is defined as the joint probability of the elements of the random variable. If we consider X = (x_1, …, x_m)^T as an m-dimensional random variable, its probability (density) function is defined as:

f_X((x_1, …, x_m)^T) = p(x_1, …, x_m)

where (x_1, …, x_m)^T is a realization of the multivariate random variable X.

Similar to scalar random variables, the probability density of a multivariate random variable can be estimated in parametric and non-parametric ways. In parametric estimation, we assume that the random variable follows a certain probability density function on the basis of a set of parameters. In non-parametric estimation, the probability density function of a random variable is estimated on the basis of a dataset sampled from the random variable using a kernel density estimator. Let {X_1, …, X_n} be independent, identically distributed random points sampled from the m-dimensional multivariate random variable X. The underlying probability density function of X is estimated as:

f_X(X) = (1 / nh^m) ∑_{i=1}^{n} K((X − X_i) / h)    (2.34)

where K : ℝ^m → ℝ is a kernel function and h > 0 is a bandwidth parameter. The multivariate version of the Gaussian kernel function K is defined as:

K(X) ∝ exp(−‖X‖² / 2)    (2.35)

The covariance between the elements of a multivariate random variable (or more generally between the elements of a vector of random variables) is measured by a covariance matrix. The covariance matrix of an m-dimensional random variable X = (x_1, …, x_m)^T is an m×m matrix whose (i, j)th element is the covariance between x_i and x_j. The covariance matrix of a multivariate random variable X is computed as follows:

Cov(X) = E[(X − E[X])(X − E[X])^T]    (2.36)


Figure 2.2. An example of data points with different values of generalized variance (GV). The data points in (a) have a higher GV than the data points in (b).

We use the symbol Σ_X to denote the covariance matrix of the multivariate random variable X.

Generalized variance measures the variance of a vector of random variables. Unlike the covariance matrix, which measures the joint variation of each pair of random variables in a vector of random variables, the generalized variance is a scalar that provides an overall view of the variation of the entire random vector. The generalized variance of a vector of random variables is defined as the determinant of the covariance matrix of the random variables. Let X = (x_1, …, x_m)^T be a vector of random variables. The generalized variance of X is defined as:

GV(X) = det(Σ_X)    (2.37)

where Σ_X is the covariance matrix of X. As mentioned in Section 2.1.2, the determinant of a square matrix is equal to the product of the eigenvalues of the matrix. Using this, Equation 2.37 can be written as:

GV(X) = ∏_{i=1}^{m} λ_i    (2.38)

where m is the dimensionality of the random vector X and λ_i (i = 1, …, m) is an eigenvalue of the covariance matrix Σ_X. Figure 2.2 depicts an example of data samples with different values of GV.
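As an illustration of Equations 2.37 and 2.38, the following sketch computes the generalized variance of a hypothetical two-dimensional sample with NumPy, once from the determinant and once from the eigenvalues of the sample covariance matrix:

import numpy as np

rng = np.random.default_rng(1)
# A hypothetical sample of 500 observations from a two-dimensional random vector X.
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)

cov = np.cov(X, rowvar=False)                 # sample covariance matrix Sigma_X
gv_det = np.linalg.det(cov)                   # GV(X) as a determinant (Equation 2.37)
gv_eig = np.prod(np.linalg.eigvalsh(cov))     # GV(X) as a product of eigenvalues (Equation 2.38)

print(gv_det, gv_eig)                         # the two values coincide up to rounding errors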

The above statistical metrics provide a statistical description of a single vector of random variables modelling a mathematical system. The statistical relationships between two vectors of random variables can be described by other statistical concepts such as distribution distances and the discriminant ratio between the vectors. A distribution distance measures the similarity between two probability distributions. One type of statistical distance that is often used for word embedding is the Hellinger distance. The Hellinger distance between two probability distribution functions f and g defined over a discrete random variable x is defined as:

D²_x(f, g) = (1/2) ∑_{x∈R_x} (√f(x) − √g(x))²    (2.39)
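The Hellinger distance of Equation 2.39 can be computed with a few lines of NumPy; the two distributions below are arbitrary examples defined over the same three outcomes:

import numpy as np

def hellinger_squared(f, g):
    # Squared Hellinger distance between two discrete distributions (Equation 2.39).
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return 0.5 * np.sum((np.sqrt(f) - np.sqrt(g))**2)

f = [0.25, 0.50, 0.25]                        # hypothetical distribution over three outcomes
g = [0.40, 0.40, 0.20]                        # another hypothetical distribution over the same outcomes
print(hellinger_squared(f, g))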

The discriminant ratio between two random variables measures the separation between the distributions of the random variables. The Fisher discriminant ratio (FDR) is a special type of discriminant ratio that is widely used in machine learning. It is used as a feature selection metric to evaluate the integrity of a set of features. For a given set of data items and their category labels, FDR is equal to the sum of the positive eigenvalues of the between class covariance matrix of the data (Σ_B) multiplied by the inverse of the within class covariance matrix of the data Σ_W:

FDR = ∑_i λ_i    (2.40)

where λ_i > 0 is the ith top positive eigenvalue of Σ_B Σ_W^{−1}, which can be computed through the generalized eigenvalue equation Σ_B A_i = λ_i Σ_W A_i, with A_i and λ_i being the ith eigenvector and eigenvalue, respectively. The within class covariance is computed as the sum of the covariances of the data in each class weighted by the class probabilities:

Σ_W = ∑_{i=1}^{c} p_i Σ_i    (2.41)

where p_i is the marginal probability of seeing the data belonging to the ith class, Σ_i is the covariance of the data belonging to the ith class, and c is the total number of classes. The between class covariance is computed as the covariance of the mean vectors of the classes:

Σ_B = E[(μ_i − μ)(μ_i − μ)^T]    (2.42)

where μ_i is the mean vector of the ith class and μ is the overall mean of the data. A high value of FDR indicates that the data is well separated into different classes depending on their labels. A small value of FDR, by contrast, indicates that the classes overlap with each other and the boundary between the classes is not clear. Figure 2.3 depicts two examples of data sets with different values of FDR.
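The following sketch illustrates Equations 2.40 to 2.42 on a small hypothetical two-class sample; the eigenvalues of Σ_B Σ_W^{−1} are obtained with SciPy's solver for the generalized symmetric eigenvalue problem:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
# Hypothetical two-class data in two dimensions.
X0 = rng.normal(size=(100, 2))                         # class 0, centred at the origin
X1 = rng.normal(size=(100, 2)) + np.array([3.0, 1.0])  # class 1, shifted away from class 0
classes = [X0, X1]

n_total = sum(len(Xc) for Xc in classes)
priors = [len(Xc) / n_total for Xc in classes]         # class probabilities p_i
means = [Xc.mean(axis=0) for Xc in classes]
overall_mean = sum(p * m for p, m in zip(priors, means))

# Within-class covariance (Equation 2.41) and between-class covariance (Equation 2.42).
Sigma_W = sum(p * np.cov(Xc, rowvar=False) for p, Xc in zip(priors, classes))
Sigma_B = sum(p * np.outer(m - overall_mean, m - overall_mean) for p, m in zip(priors, means))

# Generalized eigenvalue problem Sigma_B A_i = lambda_i Sigma_W A_i (Equation 2.40).
eigenvalues = eigh(Sigma_B, Sigma_W, eigvals_only=True)
fdr = eigenvalues[eigenvalues > 0].sum()
print(fdr)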

2.3 Principal Component Analysis

Principal component analysis (PCA) is a method used to study the structure of a data matrix. The main aim of PCA is to reduce the dimensionality of a data set in such a way that most of the variance in the data is retained.


Figure 2.3. An example of data samples with different values of Fisher discriminant ratio (FDR). In each figure, the data belongs to two classes, circle (black) and square (red). FDR measures the separability of the data in two classes. (a) The data has a small value of FDR, i.e., the boundary between the two classes is not clear and the classes overlap with each other. (b) The data has a relatively larger value of FDR, i.e., the boundary between the classes is very clear and the data is well separated.

A detailed study of PCA is provided by Jolliffe (2002). PCA can be studied from different views, which differ in the way that the data is interpreted. In the geometric view, the data matrix is interpreted as a collection of vectors in the Euclidean space spanned by the column or row vectors of the matrix. In the statistical view, the data matrix is interpreted as a sample of a multivariate distribution.

We focus on the statistical view of PCA, which deals with the study of the structure of the covariance of a vector of random variables X = (x_1, …, x_m)^T. The vector of random variables may or may not be a multivariate random variable, i.e., it may or may not model a mathematical system. PCA looks for a vector of independent latent variables Y = (y_1, …, y_k) (k ≪ m), inferred from the original variables, X, that retains most of the variation in the original data. The latent variables, called principal components, are linear functions of the original variables:

Y = A^T(X − E[X])    (2.43)

In Equation 2.43, E[X] is the expected vector of the vector of random variables, X, and the m×k matrix A = [A_1 … A_k] is composed of the k dominant eigenvectors of the covariance matrix of X, i.e., Σ_X A_j = λ_j A_j, where Σ_X is the covariance matrix of X and λ_j is the jth top eigenvalue of Σ_X. The mean subtraction step in Equation 2.43 centres the data sampled from X around the mean vector E[X].


Figure 2.4. An example of two-dimensional data sampled from the vector of random variables X = (x_1, x_2)^T. The black circles are the original data sampled from X and the red circles are their corresponding one-dimensional principal components. The vector α_i = λ_i A_i (i = 1, 2) is in the direction of the ith eigenvector of the covariance matrix of X, Σ_X. A_i is the ith eigenvector of Σ_X associated with λ_i, the ith top eigenvalue of Σ_X.

In other words, the mean vector of the vector of random variables Z = X − E[X] is the null vector 0.

The orthonormal columns of A make the components of the latent multivariate random variable Y independent from each other. This basically means that the covariance matrix Σ_Y is a diagonal matrix whose diagonal elements are equal to the eigenvalues of Σ_X, i.e., Σ_Y(j, j) = λ_j (j = 1, …, k), where λ_j is the eigenvalue associated with A_j.

Figure 2.4 shows an example of principal component analysis of two-dimensional random vectors X = (x_1, x_2)^T. The data sampled from X are shown with black circles. The vectors α_1 and α_2 correspond to the eigenvalue decomposition of the covariance matrix Σ_X. Let the columns of the matrix A = [A_1 A_2] be the eigenvectors of Σ_X and their corresponding eigenvalues be λ_1 and λ_2 with λ_1 > λ_2. The vector α_1 = λ_1 A_1 is the scalar product of the top eigenvalue λ_1 and its corresponding eigenvector A_1. This vector shows the direction of the largest variance in the data. The red dots are the one-dimensional principal components of the data sampled from X. The principal components lie on a line, hence their dimensionality is equal to one. The principal components are obtained from the projection of the original data, the black circles, onto the vector A_1, which is in the direction of α_1.

From the above argument about PCA, we see that computing the principal components of a data matrix sampled from a random vector X involves computing the covariance matrix of the random vector, Σ_X, and computing its matrix of eigenvectors, A. Depending on the dimensionality of X, computing the covariance matrix Σ_X can be very expensive. An alternative way to compute the matrix of eigenvectors A is to perform the singular value decomposition of a data matrix X sampled from the random vector X. Given the m×n sample matrix X drawn from the random vector X = (x_1, …, x_m)^T, the matrix of eigenvectors A can be efficiently computed by the singular value decomposition of the mean-centred sample matrix X̃ = X − X̄1_n^T, where X̄ = (1/n) X 1_n is the sample mean vector and 1_n is the vector of n ones. The left singular vectors of X̃ are the eigenvectors of the covariance matrix Σ_X, and the singular values are equal to √((n−1)Λ), where Λ is the vector of eigenvalues of Σ_X. Given the singular value decomposition X̃ = UΣV^T, the k-dimensional principal components of X can be efficiently computed by:

Y = U_k^T X̃    (2.44)

where the columns of the m×k matrix U_k are the left singular vectors corresponding to the k dominant singular values of X̃. Replacing X̃ with the product of its SVD factors, UΣV^T, the principal components can be computed by:

Y = Σ_k V_k^T    (2.45)

where the columns of the n×k matrix V_k are the right singular vectors corresponding to the k dominant singular values of X̃, which lie on the main diagonal of the k×k diagonal matrix Σ_k.

Algorithm 1 outlines the three main steps for performing PCA on a sample matrix X. The inputs to this algorithm are:

• the m×n matrix X sampled from the vector of random variables
• the number of dimensions k ≪ m

The first step consists of computing the mean vector E and forming the mean-centred data matrix X̃. On Line 2, 1_n is an n-dimensional vector of ones. Then, in the second step, the singular factors of the mean-centred data are computed. Finally, in the third step, the k-dimensional principal components of X are constructed. In the next section, we show how this classic definition of PCA can be generalized.

Although PCA does not make any assumption about the distribution of the data, Jolliffe (2002, Chapters 2 and 3) states that many of the interesting properties of PCA are derived from an assumption about the normality of the data. For example, PCA maximizes the mutual information between the desired low-dimensional random vectors and the original high-dimensional random vectors if the original data is distributed normally. In practice, too, it has been seen that the classic PCA in Algorithm 1 is sometimes unable to extract the most informative features of the data. In order to extend the applicability of the classic PCA, different types of generalizations have been proposed by researchers.


Algorithm 1 Principal component analysis.
1: procedure PCA(X, k)
2:    Compute the mean vector E and form the centred data X̃ ← X − E1_n^T
3:    Compute the rank-k singular value decomposition (U_k, Σ_k, V_k) ← SVD(X̃) such that X̃ ≈ U_k Σ_k V_k^T
4:    Form the k-dimensional principal components Y ← Σ_k V_k^T
5:    return Y
6: end procedure
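A direct NumPy transcription of Algorithm 1 might look as follows. It is only a sketch of the classic procedure, under the convention used above that the columns of the m×n matrix X are the n observations:

import numpy as np

def pca(X, k):
    # Algorithm 1: principal component analysis of an m x n sample matrix X.
    E = X.mean(axis=1, keepdims=True)          # mean vector of the m variables (Line 2)
    X_centred = X - E                          # mean-centred data (Line 2)
    _, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
    Sigma_k = np.diag(s[:k])                   # k dominant singular values (Line 3)
    return Sigma_k @ Vt[:k, :]                 # k-dimensional principal components (Line 4)

# Example: 5 variables observed 200 times (hypothetical data).
X = np.random.default_rng(3).normal(size=(5, 200))
Y = pca(X, k=2)
print(Y.shape)                                 # (2, 200): one 2-dimensional vector per observation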

An extensive study of the adaptation and generalization of PCA on different tasks has been done by Jolliffe (2002, Chapter 14). In the next section, we introduce a generalized version of PCA, enabling the application of PCA in many practical environments.

2.4 Generalized Principal Component Analysis

PCA is generalized in various ways (Jolliffe, 2002; Vidal et al., 2016). The generalizations are often done to enable the analysis to make use of a priori knowledge about the data and to capture the non-linearities in a covariance matrix. The most common ways to generalize the classical PCA are:

• to add a metric matrix and a weight matrix to the classic model
• to perform a transformation on the vector of random variables
• to form the principal components from an arbitrary set of eigenvalues instead of the top eigenvalues

A common way to generalize PCA is to add an m×m metric matrix Φ and an n×n weight matrix Ω to the classical definition of PCA. These matrices provide for using a priori knowledge in the PCA. The metric matrix Φ is used for weighting the random variables. A metric matrix relates different metric systems together. For a vector of random variables consisting of m random variables, we define a metric matrix as an m×m matrix whose (i, j)th element relates the metric systems of the ith and jth random variables. Using a metric matrix Φ defined over a vector of random variables, the distance between two observations x and y sampled from the vector of random variables is defined as:

distance(x, y) = (x − y)^T Φ (x − y)    (2.46)

The weight matrix Ω is used to weight the observations. For example, one may use this matrix to reduce the effect of noisy data on the analysis of the data. For a sample of size n, we define a weight matrix as an n×n diagonal matrix whose diagonal elements are the weights of the observed data in the sample.

The other common way to generalize PCA is to perform a transformation on the sample data (or the vector of random variables). This transformation is often performed before computing the singular value decomposition of the sample data. This adds a degree of non-linearity to the classical PCA and is helpful when the non-linear relationship between the variables is of interest.

Another generalization is to give the method enough flexibility to form the final low-dimensional vectors Y not only on the basis of the top eigenvalues but also on the basis of an arbitrary set of eigenvalues and their corresponding eigenvectors. This can be done through an m×m diagonal matrix, which is used to weight the eigenvalues in Σ. We refer to this eigenvalue weighting matrix as Λ. The diagonal elements of Λ are non-negative real numbers that control the variance of the data along the corresponding eigenvectors. The presence of zero values among the diagonal elements is basically equivalent to eliminating the dimension formed by the corresponding eigenvector. So if the desired number of dimensions is to be k ≪ m, then we must have exactly k non-zero values among the diagonal elements of Λ.

The generalizations of PCA are modelled through a set of parameters. Algorithm 2 is a generalized version of PCA, called GPCA, based on these parameters. The inputs to the algorithm are:

• the m×n matrix X sampled from the vector of random variables,
• the m×m metric matrix Φ,
• the n×n weight matrix Ω,
• the m×m diagonal eigenvalue weighting matrix Λ, and
• the transformation function f

Algorithm 2 Generalized principal component analysis.
1: procedure GPCA(X, Φ, Ω, Λ, f)
2:    Form X̃ ← f(ΦXΩ)
3:    Compute the mean vector E and form the centred data X̃ ← X̃ − E1_n^T
4:    Compute the singular value decomposition (U, Σ, V) ← SVD(X̃)
5:    Weight the singular values with Λ: Σ_1 ← ΛΣ
6:    Form the low-dimensional principal components Y ← Σ_1 V^T
7:    return Y
8: end procedure

On Line 2, the algorithm applies several transformations to the sample matrix X using the weight matrices and the transformation function f. On Line 3, the column vectors of the transformed matrix X̃ are centred around their mean. Then, on Line 4, the singular value decomposition of the mean-centred data matrix is computed. The singular values from which the low-dimensional vectors are computed are weighted and selected on Line 5 via the weights provided by the input matrix Λ. Usually, in practice, the matrix Λ is not fed to the algorithm as an input argument, but rather it is computed in the algorithm as a function of the matrix Σ. In this case, the number of dimensions should be specified in the input argument list. The final low-dimensional vectors are computed on Line 6. The number of dimensions of these vectors is equal to the number of positive elements in the diagonal matrix Λ. If we substitute Σ_1 on Line 6 with ΛΣ (see Line 5), then the low-dimensional vectors in Y will be as follows:

Y = ΛΣV^T    (2.47)

where Σ and V are the matrix of singular values and the matrix of right singular vectors of X̃, respectively.
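A sketch of Algorithm 2 in NumPy is given below. The metric matrix, the weight matrix, the eigenvalue weighting matrix, and the transformation function used in the example are arbitrary illustrative choices rather than the settings used for principal word embedding:

import numpy as np

def gpca(X, Phi, Omega, Lambda, f):
    # Algorithm 2: generalized principal component analysis (assumes m <= n).
    X_t = f(Phi @ X @ Omega)                      # weighted and transformed data (Line 2)
    X_c = X_t - X_t.mean(axis=1, keepdims=True)   # mean-centred data (Line 3)
    _, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    Sigma1 = Lambda @ np.diag(s)                  # weighted singular values (Line 5)
    return Sigma1 @ Vt                            # low-dimensional principal components (Line 6)

m, n = 5, 200
X = np.random.default_rng(4).normal(size=(m, n))        # hypothetical sample matrix
Phi = np.eye(m)                                         # identity metric matrix
Omega = np.eye(n)                                       # identity weight matrix
Lambda = np.diag([1.0, 1.0, 0.0, 0.0, 0.0])             # keep only the two dominant directions
Y = gpca(X, Phi, Omega, Lambda, f=lambda M: M)          # identity transformation function
print(Y.shape)                                          # (5, 200); only the first two rows are non-zero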


3. Words and Word Embeddings

This chapter is devoted to the concept of the word as one of the fundamental units of a language. As mentioned in Section 1.1, the concept of the word has been deeply studied in various fields such as linguistics, philosophy, theology, and mathematics. In Section 3.1, we provide a brief introduction to the concept of the word from the linguistic point of view. This is followed by introducing an algebraic representation of words known as word embeddings in Section 3.2. Our study of word embeddings is focused on the popular methods and evaluation metrics of word embeddings used in the literature.

3.1 Words

In this section, we introduce the concept of the word from different linguistic perspectives. First, we provide an introduction to the word concept from a morphological point of view. Then we give a syntactic view of the word and, finally, we define the word from a semantic perspective. In this part, we assume that the reader is already familiar with morphology, syntax, and semantics from linguistics.

3.1.1 Morphology

Morphology is a branch of linguistics that studies the structure of words and the relationships between words in a language (Matthews, 1991). In morphology, the definition of a word is connected to the notions word form, morpheme, and lexeme. In written languages, a word form is a sequence of characters that is understandable by the readers of the language and follows the morphological rules of the language. For example, the sequence of letters 'p', 'r', 'u', 'n', 'e', and 'd' forms the word pruned, which is understandable by English readers. Word forms are divided into smaller morphological units called morphemes.

A morpheme is the minimal morphological unit of a language. In the example above, the word form pruned consists of two morphemes prune and ed:

pruned ← prune + ed

where the silent letter e at the end of prune is eliminated according to the morphological rules of English. That is, the silent ending e of a verb is eliminated when the verb has the suffix -ed.


The morphemes that convey the core meanings of a word form are called the roots of word forms and the morphemes that convey less lexical meaning are called the affixes of word forms. A word form can include multiple roots and affixes. For example, the compound word black-board comprises two root morphemes black and board, and the complex word rewritable comprises two affixes re and able and the root morpheme write:

rewritable ← re + write + able

The two main types of affixes are the derivational and the inflectional affixes. Derivational affixes change the meaning or the class of word forms. For example, the affix able in the word rewritable changes the class of the word form rewrite from a verb to an adjective. Inflectional affixes, however, are grammatical units of a language that do not change the meaning of the word forms. The part that remains unchanged in all morphological inflections of a word form is called the stem of the word form. For example, the stem of the words pruned, prunes, and pruning is prune. The set of morphological inflections of a stem is associated with a basic unit of meaning called a lexeme. More formally, a lexeme is a basic unit of meaning which conveys the meaning of a set of word forms regardless of their inflections.

In this thesis, we do not deal with processing the internal structure of word forms. We consider a word form to be a word. In other words, a word is defined as a sequence of connected morphemes. In this definition, compound words (e.g. black-board) are considered one single word if their comprising morphemes are written in a connected way, i.e., with no space between the morphemes. If the morphemes are separated by a space, then each morpheme (or sequence of connected morphemes) is considered as a word. For example, the sequence of morphemes black board is considered to be made up of two words black and board, while the word form black-board is considered as a single word. We treat the punctuation marks and symbols of a language the same as word forms. A sequence of connected punctuation marks and symbols is treated as a word form. For example, the period '.' by itself is treated as a single word, but if it is in a sequence of symbols (e.g. '. . . '), the entire sequence is treated as a word.

3.1.2 Syntax

In linguistics, syntax looks at the structure of sentences (Matthews, 1981). Sentences, as the maximal syntactic units of a language, are hierarchically formed by smaller syntactic units such as phrases and words. The phrases themselves are recursively formed by other phrases and/or words. An example of a syntactic hierarchical structure is shown in Figure 3.1. In this figure, the hierarchy is shown in the form of a tree structure whose top (root) node, denoted by S, represents the abstract notion of a sentence. The internal nodes represent different types of phrases and syntactic categories of words.


Figure 3.1. The hierarchical syntactic tree structure of the sentence his garden is near the village.. The root node S denotes a sentence as the maximal syntactic unit. The sentence is formed by a noun phrase denoted by NP, a verb phrase denoted by VP, and the sentence-ending punctuation denoted by a dot. These phrases themselves consist of smaller syntactic units. The noun phrase NP comprises a possessive pronoun denoted by PRP$ and a noun denoted by NN. The verb phrase VP comprises a verb denoted by VBZ and a prepositional phrase denoted by PP, which itself includes the sub-phrases IN and NP, etc. Words are the smallest syntactic units in this hierarchy.

The leaves of the tree represent the words of the sentence. This tree structure is known as a parse tree or, more concretely, as a constituency parse tree, which is generated by a parser (Jurafsky and Martin, 2009, Part 3).

In the hierarchy of syntactic units, words are defined as the minimal syntactic units of a language. In other words, words are the atomic units of syntax which cannot be divided into smaller units. In Figure 3.1, the sentence his garden is near the village. is made up of the words his, garden, is, near, the, village and the final punctuation. These words are the leaves of the hierarchical syntactic tree. As mentioned in Section 3.1.1, the final punctuation '.' is treated as a word form. In this example, the sentence is tokenised by spaces and each token is equivalent to a word.

The task of identifying the word boundaries in a sentence is called tokenisation or segmentation, and the tool that finds the word boundaries is a tokeniser. A tokeniser may need both morphological and syntactic information about a language to find the word boundaries. The word boundaries in many languages, including many of the Indo-European languages, are partially marked by space characters. Some languages, such as Chinese, however, do not reserve any character for word boundaries.


In this thesis, we assume that word boundaries in sentences are marked by a space delimiter. In other words, we assume that sentences are tokenised by a tokeniser, which is able to determine word boundaries in a given sentence. Thus, the syntactic words in a sentence are sequences of characters separated by space delimiters. This includes the punctuation marks and the word forms as explained in Section 3.1.1.

Our syntactic definition of words is directly influenced by the performance of tokenisers in determining the word boundaries. A tokeniser may incorrectly identify a sequence of characters as a word, or it may not be able to determine some sequences of characters as individual words. The tokenisers currently available often show weak performance in determining the boundaries of multi-token words. A multi-token word is a word formed by multiple words that function as a single syntactic unit. Some examples of multi-token words are phrasal verbs (e.g. move on, take off, and write down) and compound words (e.g. farmhouse, birdhouse, watermelon, and domain name system). The elements of a multi-token word may or may not be connected to each other. For example, the elements of the multi-token word domain name system are usually written separately, but the elements of the multi-token word farmhouse, farm + house, are often connected to each other. The identification of disconnected multi-token units is a non-trivial task that is ignored by many of the tokenisers available. Depending on the performance of a tokeniser, a multi-token word may or may not be considered a single word in the sentence. If the multi-token word is not recognized as a single word, then its elements are considered individual words. For example, the multi-token word domain name system is considered to be three words, domain + name + system, but it is considered to be a single word if it is written without a space, e.g., domain-name-system.

Words are classified into different syntactic categories called parts of speech

or POS. In Figure 3.1, part-of-speech tags are the nodes directly connected to the words. For example, the part-of-speech tag of the word garden is noun, denoted by NN. This means that the syntactic role of the word garden in the example can be taken by other words in the category of noun (e.g. cat, computer, or school).

A word form in its isolated form may be associated with different part-of-speech

tags, but when it appears in a sentence as a word it takes only one syntactic role. For example, the word form stone may appear as both a VERB and a NOUN. However, in the sentence Mary threw a stone, it is a NOUN, and in the sentence Mary stoned the olives, it appears as a VERB. The process of assigning part-of-speech tags to words is called part-of-speech tagging, and the tool that does part-of-speech tagging is called a part-of-speech tagger.


3.1.3 Semantics

Semantics is the study of meaning (Lyons, 1995; Cann et al., 2009). In linguistics, semantics is concerned with the meaning of linguistic units such as words, phrases, and sentences. In philosophy (Wittgenstein, 1953), Ludwig Wittgenstein defined the meaning of a word in a language as its use in the language: "the meaning of a word is its use in the language". There are many interpretations of this definition of the meaning of a word. One interpretation of this definition is as follows: a word takes its meaning from its relationships with other words. This interpretation suggests that the meaning of a word in a language is not absolute but relative to other words in the language.

Distributional semantics (Clark, 2015; Sahlgren, 2006; Ó Séaghdha and Korhonen, 2014) is a branch of semantics that looks for a distributional representation of meanings. In distributional semantics, the relative interpretation of meaning is used to represent a word in a language based on its presence with other words in the language, i.e., the meaning of a word comes from its usage in a language. More concretely, each word is represented by a vector whose elements are functions of the frequency of seeing the word in the context of other words. This vector representation of words provides a systematic way to model the relative meaning of words in a language. In this representation, the vectors associated with semantically similar words are close to each other. This relative closeness is the key factor for representing the relative meaning of the words. In other words, each word takes its meaning based on its distance from other words.

In written languages, word vectors are often associated with the syntactic words, sequences of characters seen with space delimiters, as described in Section 3.1.2. Word vectors are affected by the problem with multi-token words also seen in syntactic words. If a multi-token word is not recognised as a single word, each element of the multi-token word will be seen as an individual word. This can result in noisy information in the vectors associated with the words comprising the multi-token word.

Multi-word expressions (Sag et al., 2002) are some of the multi-token words that are harmed by the deficiency of tokenisers. A multi-word expression is a multi-token word whose meaning is not predictable from its comprising words. Hence, we define the concept of multi-word expression as a semantic concept, unlike the concept of multi-token word, which is defined as a syntactic concept. Not all multi-token words are considered to be multi-word expressions. For example, the multi-token word domain name system is not a multi-word expression since its meaning is predictable from its comprising words. Similarly to other types of multi-token words, the comprising elements of a multi-word expression may or may not be separated by the space delimiter. Disconnected multi-word expressions are often treated as multiple words. A typical kind of multi-word expression is an idiom. For example, the idiom kick the bucket in English means to die. If this idiom is recognised as a


multi-word expression, then a vector is associated with the entire word kick-the-bucket. Otherwise, each of the comprising words, kick, the, and bucket, is associated with its own vector.

In this thesis, we treat multi-word expressions similarly to multi-token words. If a multi-word expression is recognised as a single word, it is treated as a single word. Otherwise, it is considered as multiple words.

3.2 Word Embeddings

Word embedding is the process of embedding words into a vector space. A word embedding method associates each word in a language with a vector in such a way that the similarities between words are reflected through the similarities between vectors, i.e., similar words are associated with similar vectors. The vectors associated with words are called word embeddings, or word vectors. Word embeddings are built on the distributional information of words in different contextual environments. Hence, the entire representation of words through vectors is called the distributional representation of words.

Word embeddings have led to great improvements in NLP tasks. The continuous vector representation of words makes it possible to apply machine learning tools such as neural networks to language-related tasks (Collobert et al., 2011; Kalchbrenner and Blunsom, 2013; Chen and Manning, 2014). Word embedding also opens a new window to study the linguistic aspects of words from an algebraic point of view (Mikolov et al., 2013c; Hartung et al., 2017; Basirat and Tang, 2018).

As we discussed earlier, words are the most basic elements of syntax and vectors are the most basic elements of a linear space. In this respect, one can argue that word embedding bridges the gap between linguistics (more specifically syntax) and mathematics (more specifically linear algebra). In other words, word embedding enables the application of algebraic syntactic modelling. In this section, we do not study how algebraic operations can be applied to word embeddings. Instead, we provide an introduction to how word embeddings are associated with words. To this end, first, we outline the general idea of word embedding. Then we give an overview of some of the popular word embedding methods that are related to the principal word embedding method introduced in this thesis. We finish this part by looking at the standard evaluation metrics used for assessing the quality of a set of word embeddings.

3.2.1 General Idea

The numerical representation of words is not a recent idea. An example of numerical systems being used to represent letters and words can be seen in the old scripting languages such as Arabic. In Arabic, before the eighth century, a numerical system was used to associate the letters of the language with natural


numbers. This system is called the Abjad and is used to associate every word in the language with an integer. The numerical system of Arabic words has been studied in a pseudoscientific area called Aljafr.

The modern numerical representation of words, known as word embeddings, uses a vector space to represent words. A word embedding method associates each word in a language with a vector. The association between words and vectors is done based on the contextual environments of the words. Words that appear in similar contextual environments are associated with similar vectors.

Depending on their area of development, word embedding methods can be divided into two main types:

• methods that are developed in distributional semantics (Schütze, 1992; Lund and Burgess, 1996; Landauer and Dumais, 1997; Sahlgren, 2006; Pennington et al., 2014; Lebret and Collobert, 2014; Basirat and Nivre, 2017)
• methods that are developed in the area of language modelling (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013a)

Levy and Goldberg (2014b) show that these two types of method are connected to each other. In both areas, word embeddings are created by applying dimensionality reduction techniques to a high-dimensional vector space that models the contextual environments of words. The dimensions of the high-dimensional vector space correspond to a set of contextual features (e.g. words, and syntactic categories of words) and its vectors correspond to the words in a language. A high-dimensional vector space that models the contextual environment of words is referred to as a contextual vector space. The vectors of a contextual vector space are referred to as the contextual word vectors.

The definition of context is a key factor determining the type of information that is encoded into the contextual word vectors. The two main types of context used in the literature are the window-based context, also known as the bag-of-words context, and the dependency context (Padó and Lapata, 2007; Vulic and Korhonen, 2016; Levy and Goldberg, 2014a). A window-based context of a word is formed by all words in a sequence (or window) of surrounding words. A dependency-based context is formed by words that are in certain dependency relations with the word in question. The relative position of words is either completely ignored by these types of context or, in some cases, modelled through weighting mechanisms based on the positional distance between the words in a sentence.

The two types of word embedding methods differ in the way that they form a contextual word vector and in the dimensionality reduction technique they apply to the vector space. The word embedding methods developed in the area of distributional semantics first form a matrix of contextual word vectors. Then they apply a dimensionality reduction technique on the word vectors.


These methods mostly use spectral methods of dimensionality reduction such as principal component analysis. The methods developed in the area of language modelling form the vector space in parallel with the application of a dimensionality reduction technique. These methods mostly make use of neural networks and auto-encoders for dimensionality reduction.

The principal word embedding method, introduced in this thesis, is a distributional semantic method. In Section 3.2.2, we outline some of the popular word embedding methods developed in distributional semantics. In Section 3.2.3, we provide a brief introduction to the word embedding methods developed in the area of language modelling. A more elaborate introduction to these methods is provided by Goldberg and Hirst (2017).

3.2.2 Word Embeddings in Distributional Semantics

Distributional semantics studies the semantic similarities between words from a statistical and algebraic point of view (Sahlgren, 2006; Clark, 2015). In distributional semantics, the process of embedding words into a vector space is done such that similar words are associated with similar vectors. Algorithm 3 outlines the main steps shared by almost all word embedding methods developed in the area of distributional semantics. The algorithm consists of three steps and takes four input arguments. The input arguments are:

1. a corpus E = {e_1, …, e_T} of size T,
2. a set of contextual features F = {f_1, …, f_m} of size m,
3. a vocabulary list V = {v_1, …, v_n} of size n,
4. an integer k ≪ m indicating the number of dimensions of the final word vectors.

First, a matrix of contextual word vectors is created, and then a transformation is applied to the elements of the matrix of contextual word vectors. Finally, low-dimensional word vectors are constructed by applying a dimensionality reduction technique on the column vectors of the matrix.

Algorithm 3 The main steps shared by all word embedding methods. E is a corpus, F is a feature set, V is a vocabulary list, and k is the dimensionality of the word embeddings.
1: procedure WORD-EMBEDDING(E, F, V, k)
2:    M ← BUILD-CONTEXTUAL-MATRIX(E, F, V)
3:    M ← TRANSFORM(M)
4:    Y ← REDUCE-DIMENSION(M, k)
5:    return Y
6: end procedure

The function BUILD-CONTEXTUAL-MATRIX scans the input corpus E and counts the frequency of the vocabulary items in V in the contextual environment of the contextual features in F. This function returns an m×n matrix whose elements are the frequency of seeing words in the domain of different contextual features. This matrix is referred to as a co-occurrence matrix when the contextual features are the same as the vocabulary items, i.e., when F = V. Since the columns and the rows of a co-occurrence matrix correspond to the vocabulary items in a language, a co-occurrence matrix is also called a word-word matrix. The size of a co-occurrence matrix is in the quadratic order of the vocabulary size and it is usually very sparse. Despite the large size of a co-occurrence matrix, its high degree of sparsity means that it is feasible to process a co-occurrence matrix with limited memory and computing resources. When the contextual features are not the same as the word forms (i.e. F ≠ V), the matrix is called a contextual matrix. An example of a contextual matrix is a document-word matrix, which counts the frequency of seeing words in different documents. The contextual features in a document-word matrix are the documents. A co-occurrence matrix is a special case of a contextual matrix. Depending on the contextual features in use, a contextual matrix may or may not be a sparse matrix. In general, the degree of sparsity of a contextual matrix is proportional to the number of contextual features. As the number of contextual features is increased, it is more likely that the sparsity of the contextual matrix will also increase.
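As an illustration of the BUILD-CONTEXTUAL-MATRIX step with a window-based context, the following sketch counts word-word co-occurrences in a tiny tokenised corpus; the corpus, the vocabulary, and the window size are hypothetical and chosen only for illustration:

import numpy as np

def build_cooccurrence_matrix(corpus, vocabulary, window=2):
    # Count how often each vocabulary item occurs within a symmetric window
    # around each other vocabulary item (a word-word co-occurrence matrix, F = V).
    index = {word: i for i, word in enumerate(vocabulary)}
    M = np.zeros((len(vocabulary), len(vocabulary)))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            if word not in index:
                continue
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i and sentence[j] in index:
                    M[index[sentence[j]], index[word]] += 1   # rows: contextual features, columns: words
    return M

corpus = [["his", "garden", "is", "near", "the", "village", "."]]   # hypothetical tokenised corpus
vocabulary = ["his", "garden", "is", "near", "the", "village", "."]
M = build_cooccurrence_matrix(corpus, vocabulary)
print(M.shape)                                # (7, 7): an m x n contextual matrix with m = n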

The word embedding methods developed in the area of distributional semantics make use of both window-based and dependency contexts. Some examples of the methods that use the window-based context are HAL (Lund and Burgess, 1996), HPCA (Lebret and Collobert, 2014), and GloVe (Pennington et al., 2014). These methods are designed to process raw corpora, i.e., the set of contextual features F is equal to the vocabulary V. In other words, the contextual features in these methods are the vocabulary units. Padó and Lapata (2007) define the context of a word in a sentence as all words in the sentence that are in a dependency relationship with the target word. This method uses a dependency parsed corpus as its training corpus. Although Padó and Lapata (2007) show that the dependency context can result in higher accuracies in certain tasks, Kiela and Clark (2014) argue that better results can be obtained from the window-based context than from the dependency context if the vectors are extracted from a fairly large corpus with a small window size. The idea of using rich contextual features is explored by Kiela and Clark (2014). They use different types of contextual features such as lemmas, part-of-speech tags, and supertags. This can also be seen in LSA (Landauer and Dumais, 1997), where the contextual features and the context of a word are the label and the content of the document the word belongs to. LSA has been designed for document clustering. It uses a document-word matrix for embedding documents into a vector space. This matrix can be used for embedding words based on the documents they belong to. In this view, LSA can be seen as a word embedding method that uses the document labels as contextual features and the word's documents as the context of the words.


The goal of the transformation function TRANSFORM on Line 3 of Algorithm 3 is to reduce the dominance of high-frequency words in a contextual matrix. The function defined in Equation 3.1 is one of the logarithmic transformation functions commonly used for this purpose. This function drastically reduces the value of very large numbers:

M_{i,j} = log(M_{i,j}) if M_{i,j} > 0, and M_{i,j} = 0 otherwise    (3.1)

Salton and Buckley (1988) propose using a weighting mechanism called tf-idf to reduce the negative effect of the most frequently used words that are seen in many different contexts. The tf-idf transformation is defined as follows:

M_{i,j} = log(M_{i,j}) log(n / ∑_{l=1}^{n} sign(M_{i,l}))    (3.2)

tf-ifd can be seen as an extension of the log transformation in Equation 3.1.The first log in tf-idf is similar to the logarithmic function in Equation 3.1and the second log decreases the weight of those elements that are seen indifferent contexts. tf-idf was initially used for document embedding. Thesecond logarithmic term in tf-idf does not take into account the frequencyof words. It only relies on the total number of words, n, and the number ofcontextual environment types that a word is seen in, not the frequency of seeingthe word in different contextual environments. tf-icf (Reed et al., 2006) is ageneralization of tf-idf that takes the frequency of seeing words in differentcontexts into account. The tf-icf transformation is defined as follow:

M_{i,j} = \log(M_{i,j}) \log\left(\frac{\sum_{k=1}^{m}\sum_{l=1}^{n} M_{k,l}}{\sum_{l=1}^{n} M_{i,l}}\right)    (3.3)

Positive pointwise mutual information (PPMI) (Church and Hanks, 1990) is defined as follows:

M_{i,j} = \max\left(0,\; \log\left(\frac{M_{i,j}\, \sum_{k=1}^{m}\sum_{l=1}^{n} M_{k,l}}{\sum_{k=1}^{m} M_{k,j}\, \sum_{l=1}^{n} M_{i,l}}\right)\right)    (3.4)

PPMI measures the association between two random variables. In the context of word embeddings, it measures the association between two words. Levy and Goldberg (2014b) and Melamud and Goldberger (2017) provide a detailed analysis of the effect of the pointwise mutual information transformation on word embeddings. Lebret and Collobert (2014) propose using the Hellinger transformation, defined in Equation 3.5, on a probability co-occurrence matrix:

M_{i,j} = \sqrt{\frac{M_{i,j}}{\sum_{l=1}^{m} M_{l,j}}}    (3.5)

where m ≤ n is the number of most frequent words used as context and the square root is an element-wise function. The elements of a probability co-occurrence matrix are the probabilities of seeing context words after words.


The probabilities are estimated by the fraction inside the square root in Equation 3.5. Each column of a probability co-occurrence matrix defines a probability distribution over the different context words that are seen with a word, because the sum of the elements of each column is equal to one. Lebret and Collobert (2014) use this property of the probability co-occurrence data to explain why the Hellinger transformation is suitable: it makes it possible to compute the Hellinger distance between these distributions. However, Basirat and Nivre (2017) argue that the Hellinger transformation makes the distribution of the co-occurrence data more suitable for performing the dimensionality reduction step. They show that better results can be obtained with the seventh root transformation in Equation 3.6, instead of the square root transformation:

M_{i,j} = \sqrt[7]{\frac{M_{i,j}}{\sum_{l=1}^{n} M_{i,l}}}    (3.6)

In addition to the degree of transformation, another difference between the transformation functions in Equation 3.5 and Equation 3.6 lies in the probability values. The Hellinger transformation is applied to the probabilities of seeing context words after words, but the seventh root transformation in Equation 3.6 is applied to the probabilities of seeing words after context words.
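As a concrete illustration of the transformations above, the following sketch applies Equations 3.1-3.6 to a small count matrix. It is a minimal NumPy example written for this summary rather than the author's implementation; the matrix M is assumed to be a dense m × n array of co-occurrence counts, and the small guards against division by zero are not part of the original equations:

    import numpy as np

    def log_transform(M):
        # Equation 3.1: element-wise logarithm of the non-zero counts.
        out = np.zeros_like(M, dtype=float)
        out[M > 0] = np.log(M[M > 0])
        return out

    def tf_idf(M):
        # Equation 3.2: log counts weighted by how many of the n columns a row occurs in.
        n = M.shape[1]
        df = np.maximum(np.count_nonzero(M, axis=1), 1)
        return log_transform(M) * np.log(n / df)[:, None]

    def tf_icf(M):
        # Equation 3.3: tf-idf generalized with frequencies instead of presence counts.
        icf = np.log(M.sum() / np.maximum(M.sum(axis=1), 1))
        return log_transform(M) * icf[:, None]

    def ppmi(M):
        # Equation 3.4: positive pointwise mutual information.
        total, row, col = M.sum(), M.sum(axis=1, keepdims=True), M.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(M * total / (row * col))
        pmi[~np.isfinite(pmi)] = 0.0
        return np.maximum(pmi, 0.0)

    def hellinger(M):
        # Equation 3.5: square root of the column-normalized probabilities.
        return np.sqrt(M / np.maximum(M.sum(axis=0, keepdims=True), 1))

    def seventh_root(M):
        # Equation 3.6: seventh root of the row-normalized probabilities.
        return (M / np.maximum(M.sum(axis=1, keepdims=True), 1)) ** (1 / 7)

    M = np.array([[10, 0, 3], [2, 5, 0], [1, 1, 1]], dtype=float)
    print(ppmi(M))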

Among the popular methods of dimensionality reduction, principal component analysis (PCA) is widely used to reduce the dimensionality of contextual word vectors (Lund and Burgess, 1996; Landauer and Dumais, 1997; Lebret and Collobert, 2014; Basirat and Nivre, 2017). Dahl et al. (2012) and Hinton and Salakhutdinov (2006) use restricted Boltzmann machines (RBMs). It can be shown that RBMs are connected to PCA (Jolliffe, 2002; Hinton and Salakhutdinov, 2006). Mixture models (Hofmann, 1999; Blei et al., 2003) and non-linear methods (Roweis and Saul, 2000; Hinton and Salakhutdinov, 2006) are some of the other approaches that have been used for reducing the dimensionality of a semantic space. GloVe (Pennington et al., 2014) formulates the problem of dimensionality reduction as a regression problem. Although GloVe

does not explicitly use PCA, Basirat and Nivre (2017) show that the regression formulation is equivalent to a kernel principal component analysis of the column vectors. HPCA (Lebret and Collobert, 2014) uses the singular value decomposition to estimate the principal components of the co-occurrence matrix. Lebret and Collobert (2014) ignore the centring step in PCA in order to take advantage of the data sparsity. RSV (Basirat and Nivre, 2017) uses the same approach to compute the principal components. However, it performs the mean subtraction step to centre the column vectors around their mean. This results in a dense matrix that requires a large amount of memory.


3.2.3 Word Embedding in Language Modelling

A language model is a statistical tool that estimates a probability function on the strings of a language. For a given string of size n, s = w_1, w_2, ..., w_n, a language model estimates the probability p(s) using the chain rule:

p(s = w_1, w_2, \ldots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1^{t-1})    (3.7)

where w_1^{t-1} is the substring w_1 ... w_{t-1}. The conditional probability p(w_t | w_1^{t-1}) can be estimated using the Markov assumption:

p(w_t \mid w_1^{t-1}) \approx p(w_t \mid w_{t-1})    (3.8)

Equation 3.8 is referred to as the first order Markov assumption. A language model that uses the first order Markov assumption relies only on a single preceding word to estimate the string probability in Equation 3.7. This weakens the language model when there are longer dependencies between the elements of the string. This weakness is mitigated by higher order Markov assumptions. A kth order Markov assumption is defined as:

p(w_t \mid w_1^{t-1}) \approx p(w_t \mid w_{t-k}^{t-1})    (3.9)

The parameter k is sometimes referred to as the context size of the language model. In practice, the value of k is often smaller than 4. A language model with k = n − 1 is called an n-gram language model. A high value of n needs a large amount of computational resources, which may not always be available. Another problem with Markov-based language models is the coverage of the model. Due to the generative nature of languages, it is often impossible to memorize all sequences in a language. A remedy to this problem is to use back-off models (Katz, 1987; Kneser and Ney, 1995) and smoothing techniques (Chen and Goodman, 1996) to estimate the probability of an unseen word sequence from models with smaller contexts.
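To make the Markov assumption concrete, the sketch below estimates a first order (bigram) model from a toy corpus. It uses add-one smoothing as a simple stand-in for the back-off and smoothing techniques cited above, so it is only an illustration of Equations 3.7 and 3.8, not of those specific methods:

    from collections import Counter

    def train_bigram_lm(tokens, alpha=1.0):
        # p(w_t | w_{t-1}) with add-alpha smoothing over the observed vocabulary.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab_size = len(unigrams)

        def prob(word, prev):
            return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

        return prob

    corpus = "the cat sat on the mat the cat ate".split()
    p = train_bigram_lm(corpus)

    # Equation 3.7 under the first order Markov assumption (Equation 3.8),
    # ignoring the probability of the first word for brevity.
    sentence = "the cat sat".split()
    score = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        score *= p(word, prev)
    print(score)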

Another way to mitigate the weaknesses of the n-gram language models is to augment the language model with a neural network that is able to estimate the string probabilities. In the remainder of this section, we provide an overview of the neural network architectures used for this purpose.

Bengio et al. (2003) propose a neural network architecture with h + 1 hidden layers that learns a distributed representation for words and a probability function for word sequences. In this approach, each word in a vocabulary is associated with a real-valued vector. These vectors are then used to express the joint probability function in Equation 3.7. The word vectors and the probability function are learnt together to maximise the log-likelihood of the training data. A |V| × m matrix is used to store the m-dimensional word vectors. Row i of this matrix is used as the vector associated with word i in the vocabulary. This matrix serves as the first hidden layer of the neural network, called


the projection layer. The remaining h hidden layers are used to estimate the joint probability function in Equation 3.7. Bengio et al. (2003) use the value h = 1 in their experiments, i.e., the entire neural network has two hidden layers. The first hidden layer, modelling the word vectors, uses a linear activation function, but the other hidden layers use the ordinary hyperbolic tangent activation function. The word vectors in this architecture are trained when the neural network is trained.

Collobert et al. (2011) propose a task-independent feed-forward neural network for sequence labelling tasks such as part-of-speech tagging, chunking, and named-entity recognition. They use a multi-layer neural network whose input is a sequence of words and whose output is the label of a word inside the sequence (e.g. the word in the middle of the sequence). The first layer of this neural network extracts features for each word in the input sequence. The second layer extracts features from the entire sequence, and the remaining layers are standard feed-forward neural network layers. Similar to the architecture proposed by Bengio et al. (2003), the word vectors are stored in an m × |V| matrix whose ith column is the word vector associated with the ith word in the vocabulary list V. This matrix serves as the shared projection matrix of the neural network and is initialized by a set of word vectors that are trained with a language model. The word vectors are then updated during the training phase of the entire neural network, whose objective function is maximised over a training set. The neural network is trained with a labelled data set, but the initial word vectors are trained with a large unlabelled data set. A language model similar to that of Bengio et al. (2003) is used for generating the initial word vectors. The language model is trained with a pairwise ranking criterion that returns higher scores for correct phrases than for incorrect phrases. In order to speed up the learning process of the language model, Collobert et al. (2011) use curriculum learning (Bengio et al., 2009), which initializes the matrix of word vectors with previously trained word vectors.

Mikolov et al. (2013c) introduce two neural network architectures to compute word vectors. One of the proposed architectures is a feed-forward neural network similar to the language model of Bengio et al. (2003) where the non-linear hidden layers are removed. This model is called the continuous bag-of-words (CBOW) model. For a sequence of words, CBOW predicts the word in the middle of the sequence. It uses the average of the vectors of the surrounding words as the input to the hidden layer of the neural network and predicts the middle word in the output layer. The other method is similar to CBOW, but instead of predicting the middle word, it predicts the surrounding words. This model is called the continuous skip-gram model. For a sequence of words, the continuous skip-gram model takes the word in the middle of the sequence as input and predicts the words within a certain range before or after the input word. The continuous skip-gram model uses a log-linear classifier together with a projection layer for this purpose. The projection matrix in the hidden layer of the classifier forms the word embeddings.
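The following sketch shows the forward pass of a CBOW-style model: the context vectors are averaged in the projection layer and a softmax over the vocabulary scores the middle word. The vocabulary size, dimensionality, and random weights are arbitrary illustrative choices, and the training loop, negative sampling, and other details of the original models are omitted:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim = 1000, 50
    W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # projection layer: the word vectors
    W_out = rng.normal(scale=0.1, size=(dim, vocab_size))  # output layer weights

    def cbow_forward(context_ids, target_id):
        # Average the context vectors, score every vocabulary word, and return
        # the cross-entropy loss of the middle (target) word.
        h = W_in[context_ids].mean(axis=0)
        scores = h @ W_out
        scores -= scores.max()                 # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return -np.log(probs[target_id])

    loss = cbow_forward(context_ids=[3, 17, 42, 8], target_id=5)
    print(loss)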


3.2.4 Evaluation

Different evaluation metrics are used to assess the information captured by a set of word embeddings. The evaluation metrics used in the literature can be classified into three groups:

• Word similarity metrics
• Word analogy metrics
• Application-based metrics

Lai et al. (2016) argue that word embedding evaluation metrics can be grouped into three categories. The first category consists of the metrics that evaluate semantic properties of word embeddings. This category matches our first class of evaluation metrics, i.e., the word similarity metrics. The second category consists of the metrics that evaluate the contribution of word embeddings to NLP tasks. The third category consists of the metrics that evaluate the contribution of word embeddings to neural network classifiers. These two categories match our third class of evaluation metrics, i.e., the application-based metrics.

Word similarity metrics evaluate a set of word embeddings on the basis of the correlation between similar vectors and similar words. These metrics often use a dictionary that associates pairs of words with humanly judged similarity rankings. For the word pairs listed in a dictionary, the word similarity metric computes the correlation between the humanly judged ranks associated with the words and the cosine similarity between the corresponding word vectors. A high correlation obtained from a set of word embeddings and a dictionary indicates that the word embeddings successfully capture the information about word similarities encoded in the dictionary. The word similarity benchmark proposed by Faruqui and Dyer (2014) is a tool that evaluates a set of word embeddings with regard to several word similarity dictionaries. It evaluates a set of word vectors on 13 different word similarity benchmarks. Each benchmark contains pairs of English words associated with their similarity rankings. The tool reports the correlation between the similarity rankings provided by the word similarity benchmarks and the cosine similarity between the word vectors. Baroni et al. (2014) and Schnabel et al. (2015) evaluate word embeddings on a variety of word similarity benchmarks which rely on cosine similarities.
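A word similarity metric of this kind can be computed in a few lines of code. The sketch below is a generic illustration, not the Faruqui and Dyer tool itself: it takes word pairs with human similarity scores and a dictionary of word vectors, and reports the Spearman correlation between the human scores and the cosine similarities, skipping out-of-vocabulary pairs; the toy embeddings and scores are made up for the example:

    import numpy as np
    from scipy.stats import spearmanr

    def word_similarity_score(pairs, embeddings):
        # pairs: iterable of (word1, word2, human_score); embeddings: dict word -> vector.
        human, model = [], []
        for w1, w2, score in pairs:
            if w1 in embeddings and w2 in embeddings:
                v1, v2 = embeddings[w1], embeddings[w2]
                cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
                human.append(score)
                model.append(cosine)
        return spearmanr(human, model).correlation

    embeddings = {w: np.random.default_rng(i).normal(size=10)
                  for i, w in enumerate(["car", "automobile", "fruit", "banana"])}
    pairs = [("car", "automobile", 9.0), ("car", "fruit", 1.5), ("fruit", "banana", 8.0)]
    print(word_similarity_score(pairs, embeddings))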

Word analogy metrics rely on test sets of analogy questions of the form "a is to b as c is to d", where a, b, and c are known words and d is an unknown word that is requested. For example, an analogy question is "saw is to see as returned is to d", and the word d is expected to be return. This evaluation metric is widely used in the literature (Lai et al., 2016; Köhn, 2015). Mikolov et al. (2013c) develop two types of analogy test sets, a syntactic test set and a semantic test set. The syntactic test set consists of questions testing different forms of adjectives (e.g. base form, comparative, and superlative), common nouns (e.g. singular and plural), and verbs (e.g. base form, past tense, and 3rd person present tense). An example of a syntactic analogy question is "year is to years as law is to d", and the word d is expected to be laws. The semantic test set includes questions on the basis of the semantic relation similarities in SemEval-2012 Task 2. For example, a semantic analogy task can be formed by the words that are in the relation Class-Inclusion, e.g., "clothing is to shirt as animal is to d", and a solution to d can be goat. Mikolov et al. (2013c) show that an analogy question corresponds to an equation in a semantic vector space that represents the word vectors associated with the words in the question. The analogy question "a is to b as c is to d" corresponds to the equation:

v_b - v_a = v_d - v_c    (3.10)

where v_x is the vector associated with the word x. Equation 3.10 suggests that the answer to an analogy question as above is the word associated with the vector with greatest cosine similarity to y = v_b - v_a + v_c. Given this, a set of word embeddings is evaluated based on the number of analogy questions that are answered correctly.
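The sketch below answers an analogy question with Equation 3.10: it builds y = v_b − v_a + v_c and returns the vocabulary word whose vector has the greatest cosine similarity to y, excluding the three question words. It mirrors the standard evaluation procedure in outline only; the tiny embedding matrix used here is random and purely illustrative:

    import numpy as np

    def answer_analogy(a, b, c, vocab, E):
        # vocab: list of words; E: matrix whose rows are the corresponding word vectors.
        idx = {w: i for i, w in enumerate(vocab)}
        y = E[idx[b]] - E[idx[a]] + E[idx[c]]
        E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = E_norm @ (y / np.linalg.norm(y))
        for w in (a, b, c):               # the question words are not allowed as answers
            sims[idx[w]] = -np.inf
        return vocab[int(np.argmax(sims))]

    vocab = ["see", "saw", "return", "returned", "law", "laws"]
    E = np.random.default_rng(0).normal(size=(len(vocab), 25))
    print(answer_analogy("saw", "see", "returned", vocab, E))   # ideally "return"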

Application-based metrics measure the meaningfulness of word embeddings in a specific application (e.g. machine translation, parsing, and part-of-speech tagging). In these evaluation metrics, a set of word embeddings is used for training a model in an application. The embeddings are then evaluated based on their contribution to the learning process.

Nayak et al. (2016a) introduce a framework, named veceval, consisting of several standard NLP tasks (applications) used for evaluating a set of word embeddings. This framework makes use of basic neural network architectures for the tasks. It evaluates a set of word embeddings on the basis of the performance of a neural network trained with the embeddings on a certain task. The framework includes tasks that test the syntactic and semantic properties of word embeddings. The tasks that test the syntactic properties of word embeddings are part-of-speech tagging and chunking. The tasks that test the semantic properties of word embeddings are named-entity recognition, sentiment classification, and question classification.

Köhn (2015) proposes to analyse a set of word embeddings through application-based multilingual evaluation metrics. He uses a linear classifier that takes a word vector as its input and predicts the dependency features of the word associated with the word vector, arguing that a linear classifier gives more information about the structure of the vector space. Lai et al. (2016) evaluate a set of word embeddings with regard to their contributions to multiple tasks including text classification, part-of-speech tagging, named-entity recognition, and sentiment classification.



Basirat and Nivre (2017) evaluate a set of word embeddings in a dependency parsing task. Basirat and Tang (2018) propose a novel word embedding evaluation metric based on the presence of information about rich linguistic features, such as lexical and morpho-syntactic features, in the word embeddings. They use a dictionary consisting of words associated with their lexical and morpho-syntactic features. The words in this dictionary are replaced with their corresponding embeddings. This results in a list of word embeddings labelled by a set of lexical and morpho-syntactic features. This list defines a standard classification problem in which the word embeddings are the observations and the features are the classes. The classification problem is tackled by linear discriminant analysis and a feed-forward neural network. The final accuracy of the classifiers on a test set is used as the evaluation metric for the word embeddings.

To sum up, we introduced three types of evaluation metrics in this section. The first is based on word similarities. The second is based on a series of word analogy questions. The third is based on the contribution of a set of word vectors to a task. Word similarity and word analogy evaluation metrics are sometimes considered intrinsic evaluation metrics in the literature, and application-based evaluation metrics are considered extrinsic evaluation metrics. There is a weak correlation between the intrinsic evaluation metrics, as they are called in the literature, and the extrinsic evaluation metrics. Chiu et al. (2016) show that word embeddings with high values on the intrinsic evaluation metrics do not necessarily work well on downstream tasks such as part-of-speech tagging and named-entity recognition. They argue that the word similarity benchmarks that relate highly similar words to each other are good predictors of downstream performance. Yaghoobzadeh and Schütze (2016) refer to the intrinsic evaluation metrics as point-based intrinsic evaluation metrics because they consider word embeddings as points in a vector space and ignore the similarities in the subspaces of the vector space. They argue that these methods do not say anything about the underlying reasons why a low or high result is obtained from a set of word embeddings. Instead, they introduce several criteria which provide some insights about different properties of word embeddings. We classify the three types of evaluation metrics, the word similarity metric, the word analogy metric, and the application-based metric, as extrinsic evaluation metrics since they assess a set of word vectors in certain tasks. Later on, in Section 7.1.1, we introduce some intrinsic evaluation metrics that provide a statistical view of the distribution of a set of word vectors.


Part II:
Principal Component Analysis for Word Embeddings


4. Contextual Word Vectors

In this chapter, we introduce contextual word vectors as the initial frequency-based representation of words in a corpus and study their distribution as a mixture model. The contextual word vectors then undergo a dimensionality reduction technique, which generates a set of low-dimensional word embeddings. In this chapter, we provide a response to our first research question, why principal component analysis (PCA) is not suitable for processing contextual word vectors (see Section 1.2).

A contextual word vector associated with a word is a real-valued vector whose elements are the frequencies of seeing the word in different contexts occurring in a corpus. In the following sections, we formulate the concepts of corpus and context and show how an abstract statistical model can model different types of corpus and context. In Section 4.1, we express the basic concept of corpus as a stochastic process that contains a source of information from which a set of contextual word vectors is extracted. Then in Section 4.2, the concept of context is introduced. This is followed by introducing the basic concepts of feature and word variables in Section 4.3. These variables are used in Section 4.4 to construct contextual word vectors and a mixture model as a unique representation of all contextual word vectors associated with a set of words. The distribution of the mixture model of contextual word vectors is studied in Section 4.5, where we show that the distribution of the mixture model is not suitable for PCA. Finally, in Section 4.6, we propose different ways to combine feature variables and form different types of contextual word vectors.

4.1 Corpus

A set of words in a language forms a vocabulary set, and a set of words indexed by natural numbers forms a corpus. In other words, a vocabulary set consists of unique words, but a corpus consists of words that may or may not be repeated in the corpus, hence an indexed set is used to define a corpus. This basically means that a corpus is a subset of V × N, where V is a vocabulary set and N takes care of the words' indices.¹ A sequence of words in a corpus may form abstract units such as phrases, sentences, paragraphs, and documents. If we consider these units as the basic units of a corpus, then a corpus can be seen as a collection of phrases, sentences, paragraphs, or documents.

¹ This does not imply that any subset of V × N is a corpus.


The above definition of corpus covers the set of corpora which are known as raw corpora. However, corpora can be released in annotated forms too. A raw corpus consists of a collection of words with no annotations. An annotated corpus provides some abstract information about linguistic units such as characters, words, sentences, and documents in the corpus. This information is usually represented through symbols that are associated with the elements of the corpus. For example, a corpus can be annotated with part-of-speech tags at the word level, or document genre at the document level.

Regardless of the type of annotation, we refer to the set of annotation symbols used in a corpus as a feature set. In practice, word boundaries may or may not be marked. Since we define a corpus by its constituent words, in this thesis, we restrict ourselves to corpora with word boundaries and we consider a raw corpus to be an annotated corpus whose words are annotated by their forms. In addition, we consider a word to be the most basic element of a corpus and associate the annotation information provided for more abstract units such as sentences and documents with their constituent words. For example, a genre associated with a document in a corpus is associated with all words forming the document. This allows us to develop a model of word embedding that can benefit from different levels of annotation provided by corpora. It also makes it possible to formalize a corpus with a vocabulary set and a set of annotation features.

Denoting a vocabulary set consisting of n words as V = {v_1, ..., v_n}, and a feature set consisting of m features as F = {f_1, ..., f_m}, a corpus E of size T is defined as a set of triples (word, feature, index), E = {e_1, ..., e_T}, where e_i ∈ V × F × N. The index i in each triple e_i = (v, f, i) ∈ E, where v ∈ V and f ∈ F, represents the relative position of the word v with respect to the beginning of the corpus E. This indexing system takes care of the word order in a corpus. We use the vocabulary lookup function V : E → V, the feature lookup function F : E → F, and the index function I : E → N to map each element e ∈ E onto its corresponding entries in V, F, and N, respectively. In other words, each element e_i ∈ E is a triple (v, f, i) such that V(e_i) = v, F(e_i) = f, and I(e_i) = i.

To sum up, we define a corpus as a set of annotated words indexed by their positions. Each entry in a corpus is a triple consisting of a word, a feature describing the word, and a natural number that indexes the relative position of the entry in the corpus. Words are chosen from a vocabulary set consisting of a set of unique words, and features are chosen from a feature set consisting of a set of unique descriptor symbols.
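The corpus definition above can be mirrored directly in code. The sketch below builds the triples e_i = (v, f, i) from a token list and a parallel list of feature symbols, together with the lookup functions V, F, and I; the part-of-speech tags in the example are chosen for illustration only:

    from typing import List, Tuple

    Triple = Tuple[str, str, int]   # (word, feature, index)

    def make_corpus(words: List[str], feats: List[str]) -> List[Triple]:
        # Index each annotated word by its position in the corpus (starting at 1).
        return [(v, f, i + 1) for i, (v, f) in enumerate(zip(words, feats))]

    # Lookup functions V, F, and I from the definition above.
    def V(e: Triple) -> str: return e[0]
    def F(e: Triple) -> str: return e[1]
    def I(e: Triple) -> int: return e[2]

    E = make_corpus(["his", "garden", "is", "near", "the", "village"],
                    ["PRON", "NOUN", "AUX", "ADP", "DET", "NOUN"])
    assert V(E[1]) == "garden" and F(E[1]) == "NOUN" and I(E[1]) == 2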

4.2 Context

Context is a key concept in generating a set of contextual word vectors. We define a context as a certain type of connection between the elements of a corpus. For example, at the surface level, one can define a neighbourhood connection between words (e.g. e_{i-1} is the element immediately preceding e_i). Another example is at the syntax level, where a dependency connection is defined between the elements of a corpus that form a sentence. One can also extend the domain of connections to discourse and documents using dependency-based discourse or document parsing. More formally, for a given corpus E = {e_1, ..., e_T}, as described in Section 4.1, we define a context as a function C : E → P(E), where P(E) is the power set of E. This function is referred to as a context function, or context for short. In this definition, the context function connects each element in the corpus to a set of elements in the same corpus. The power set in the range of the context function gives it enough flexibility to model different types of connections between the corpus elements.

Figure 4.1. The singleton context C(e_i) = e_{i-1} (i > 1), mapping each element e_i onto its preceding element e_{i-1}. The context of the element e_1 is defined by an artificial node that indicates the beginning of the corpus.

A limited variant of the context function is the singleton context, which maps individual elements of a corpus onto other individual elements, i.e., C : E → E. In other words, the singleton context of an element in a corpus is another element in the same corpus. Figure 4.1 shows an example of the singleton context, mapping each element e_i ∈ E onto its preceding element e_{i-1} ∈ E and mapping e_1 to itself. Two examples of the singleton context function (or context) used in the literature are the neighbourhood context and the dependency context.² Let E = {e_1, ..., e_T} be a corpus of size T. The neighbourhood context of the word e_t ∈ E with parameter τ ∈ Z is:

C_n(e_t; \tau) = \begin{cases} e_1 & t + \tau < 1 \\ e_{t+\tau} & 1 \le t + \tau \le T \\ e_T & t + \tau > T \end{cases}    (4.1)

Depending on the sign of the parameter τ, the neighbourhood context returns the word at the τth position before or after e_t. We will refer to the neighbourhood context with τ < 0 as the backward neighbourhood context and the neighbourhood context with τ > 0 as the forward neighbourhood context. Figure 4.2 shows examples of the neighbourhood context.

² We later show how a neighbourhood context is related to the window-based context, which is commonly used in the literature.

Figure 4.2. (a): a left neighbourhood context with parameter τ = −n; (b): a right neighbourhood context with parameter τ = +n, where n > 0 is a positive integer.
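Equation 4.1 translates into a small clamped-offset lookup. The sketch below uses 1-based positions, as in the text, over a toy corpus of (word, index) pairs; it is an illustration of the definition rather than code from the thesis:

    def neighbourhood_context(E, t, tau):
        # Equation 4.1: the element at offset tau from e_t, clamped to the corpus boundaries.
        pos = min(max(t + tau, 1), len(E))
        return E[pos - 1]

    E = [("his", 1), ("garden", 2), ("is", 3), ("near", 4), ("the", 5), ("village", 6)]
    assert neighbourhood_context(E, t=3, tau=-1) == ("garden", 2)   # backward context
    assert neighbourhood_context(E, t=1, tau=-2) == ("his", 1)      # clamped to e_1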

A dependency context is defined by the dependency relations between words (Levy and Goldberg, 2014a; Vulic and Korhonen, 2016). The dependency relations are represented by directed graphs whose nodes are the words in E ∪ {root}, where root is a special node used as the root node of a dependency tree. The dependency context of the word e ∈ E with parameter τ ∈ {0} ∪ Z⁺ is defined as follows:³

C_d(e; \tau) = \begin{cases} C_d(\mathrm{Pa}(e); \tau - 1) & \tau > 0 \\ e & \tau = 0 \end{cases}    (4.2)

where the function Pa : E → E ∪ {root} returns the parent word of the input word e ∈ E with respect to the dependency graph of E. The parent function Pa returns the special element root if the input is the root node of a dependency tree. Figure 4.3 shows an example of the dependency context, which connects the words of the sentence his garden is near the village . to each other. In this figure, the context parameter τ is equal to 1, hence the immediate dependency parent of a word is chosen as the context of the word. Later on, in Section 4.6, we elaborate on how more complicated contexts used in the literature can be formed from multiple singleton contexts.
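Equation 4.2 can be read as following the parent function Pa a fixed number of times. The sketch below does exactly that over a toy dependency tree whose head assignments are arbitrary and only serve to illustrate the recursion; it is not tied to the analysis shown in Figure 4.3:

    def dependency_context(e, tau, parent):
        # Equation 4.2: apply the parent function Pa tau times; Pa maps the root to "root".
        while tau > 0 and e != "root":
            e = parent[e]
            tau -= 1
        return e

    # A toy dependency tree over five corpus elements (heads chosen arbitrarily).
    parent = {"e1": "e2", "e2": "e3", "e3": "root", "e4": "e5", "e5": "e3"}
    assert dependency_context("e1", tau=1, parent=parent) == "e2"
    assert dependency_context("e1", tau=3, parent=parent) == "root"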

4.3 Feature and Word Variables

In this section, we introduce two random variables called feature variables and word variables. These random variables are used in Section 4.4 to construct a mixture model of contextual word vectors from which principal word vectors are extracted.

³ We use Z⁺ instead of N to be more consistent with τ ∈ Z in the neighbourhood context.


Figure 4.3. The dependency context for the sentence his garden is near the village. The dependency relations are based on Universal Dependencies. The edges map each word onto its context word based on the dependency relations between them. In this representation, the context words are the head words of the dependency relations, hence the edges end at the head words, as opposed to dependency trees, where the edges start from head words and end at modifiers.

We associate each feature f_i ∈ F with a Bernoulli variable f_i(e_t), called a feature variable, whose value for a given element e_t ∈ E is defined as below:

\mathbf{f}_i(e_t) = \begin{cases} 1 & F(C(e_t)) = f_i \\ 0 & \text{otherwise} \end{cases}    (4.3)

where C : E → E is a singleton context function. The array of all feature variables associated with the contextual features can be seen as a vector of random variables whose values for each occurrence of a word in a corpus are zero except for one random variable, which indicates the context of the word. Each realization of this random vector is a unit-length binary vector, which is known as a one-hot vector in the literature. These one-hot vectors indicate the occurrence of a contextual feature. Similarly, we associate each word v_j ∈ V with a Bernoulli random variable v_j(e_t), called a word variable, whose value for a given element e_t ∈ E is defined as below:

\mathbf{v}_j(e_t) = \begin{cases} 1 & V(e_t) = v_j \\ 0 & \text{otherwise} \end{cases}    (4.4)

Word variables indicate the occurrence of a word in a corpus. The array of all word variables associated with all words in a vocabulary set can be seen as a vector of Bernoulli random variables. Each realization of this random vector indicates the occurrence of a word in a corpus. In the next section, we show how the feature and word variables are used to generate a set of contextual word vectors forming a mixture model.


4.4 Mixtures of Contextual Word Vectors

For a given set of feature variables F = {f_1, ..., f_m} corresponding to the feature set F, a set of word variables V = {v_1, ..., v_n} corresponding to the vocabulary set V, and a training corpus E = {e_1, ..., e_T}, we define the contextual word vector V^{(j)} associated with the word v_j ∈ V as an m-dimensional random vector whose ith element (i = 1, ..., m) is the frequency of seeing the feature f_i ∈ F in the context of the word v_j ∈ V in the training corpus E:

\mathbf{v}^{(j)}_i = \sum_{t=1}^{T} \mathbf{f}_i(e_t)\, \mathbf{v}_j(e_t)    (4.5)

Each element v^{(j)}_i of the contextual word vector follows a binomial distribution as below:

\mathbf{v}^{(j)}_i \sim B(n(v_j),\, p(f_i \mid v_j))    (4.6)

where n(v_j) = \sum_{t=1}^{T} v_j(e_t) is the marginal frequency of seeing v_j in E, and p(f_i | v_j) = p(f_i = 1 | v_j = 1). This is because the product f_i(e_t) v_j(e_t) in Equation 4.5 follows the Bernoulli distribution, and its sum is therefore a binomial random variable. So the contextual word vector V^{(j)} is a vector of binomial random variables, i.e., V^{(j)} = (v^{(j)}_1, ..., v^{(j)}_m), where v^{(j)}_i (i = 1, ..., m) is defined as in Equation 4.6. If the context function in the feature variables is a singleton context (see Section 4.2), then we have:

\sum_{k=1}^{m} \mathbf{v}^{(j)}_k = n(v_j)    (4.7)

where n(v_j) is the marginal frequency of v_j in the corpus. This basically means that the sum of the elements of a contextual word vector associated with a word and obtained from a corpus with a singleton context function is equal to the marginal frequency of seeing the word in the corpus. This shows that the contextual word vector V^{(j)} using a singleton context function follows the multinomial distribution as below:

f(V^{(j)}; n(v_j), P^{(j)}) = \binom{n(v_j)}{v^{(j)}_1, \ldots, v^{(j)}_m} \prod_{i=1}^{m} p(f_i \mid v_j)^{v^{(j)}_i}    (4.8)

where V^{(j)} = (v^{(j)}_1, ..., v^{(j)}_m)^T is a realization of V^{(j)}, i.e., v^{(j)}_i ∈ Z⁺ ∪ {0} and \sum_{i=1}^{m} v^{(j)}_i = n(v_j) (i = 1, ..., m), and P^{(j)} = (p(f_1 | v_j), ..., p(f_m | v_j)) is a vector of probabilities related to the word v_j. We use the term contextual word vector to refer both to the vector of random variables, denoted by V, and to its realization, denoted by V. The distinction between the two concepts is made through the notation: bold versus regular format.
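The construction in Equation 4.5 amounts to counting, for every word, how often each feature occurs in its context. The sketch below does this for a toy corpus of (word, feature, index) triples annotated with part-of-speech tags and a backward neighbourhood context; the names and the tiny example are illustrative only:

    import numpy as np

    def contextual_word_vectors(corpus, context, vocab, features):
        # Equation 4.5: M[i, j] counts how often feature f_i is seen in the context of word v_j.
        f_index = {f: i for i, f in enumerate(features)}
        v_index = {v: j for j, v in enumerate(vocab)}
        M = np.zeros((len(features), len(vocab)), dtype=int)
        for t, (word, _feat, _idx) in enumerate(corpus):
            _, ctx_feat, _ = context(corpus, t)   # singleton context of e_t
            M[f_index[ctx_feat], v_index[word]] += 1
        return M

    corpus = [("the", "DET", 1), ("cat", "NOUN", 2), ("sat", "VERB", 3),
              ("on", "ADP", 4), ("the", "DET", 5), ("mat", "NOUN", 6)]
    backward = lambda E, t: E[max(t - 1, 0)]      # neighbourhood context with tau = -1
    vocab = sorted({w for w, _, _ in corpus})
    features = sorted({f for _, f, _ in corpus})
    M = contextual_word_vectors(corpus, backward, vocab, features)
    # Each column of M is a realization of the contextual word vector V^(j),
    # and its column sum equals the marginal frequency n(v_j) (Equation 4.7).
    word_counts = [sum(1 for w, _, _ in corpus if w == v) for v in vocab]
    assert M.sum(axis=0).tolist() == word_counts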

We represent the set of contextual word vectors associated with all words in the vocabulary set through a mixture model. This mixture model then undergoes principal component analysis to form the low-dimensional word vectors.


Given that each contextual word vector associated with a word in a corpus is a vector of random variables with a certain distribution characterized by the occurrence of the word in the corpus, the set of contextual word vectors {V^{(1)}, ..., V^{(n)}} associated with all words in the vocabulary set V forms a mixture model with the following probability mass function:

p(\mathbf{V}) = \sum_{j=1}^{n} p_j\, f(\mathbf{V}^{(j)})    (4.9)

where p_j is the mixture weight associated with the jth mixture component with the distribution f(V^{(j)}). We refer to V in Equation 4.9 as the mixture of contextual word vectors. If the contextual word vectors use a singleton context function, then the mixture of contextual word vectors will be a mixture of multinomial distributions as below:

p(V) = \sum_{j=1}^{n} p_j\, f(V; n(v_j), p(f_1 \mid v_j), \ldots, p(f_m \mid v_j))    (4.10)

where V = (v_1, ..., v_m) is an m-dimensional vector representing a realization of the mixture of contextual word vectors. The mixture weights in Equation 4.9 can be defined in several ways. For example, one can define the mixture weights as a uniform distribution over words or as the marginal distribution of words in a corpus. If the mixture weights are defined as a uniform distribution over words, then we have:

p_j = \frac{1}{n}    (4.11)

where n is the size of the vocabulary set. If the mixture weights are defined as the marginal distribution of words in a corpus, then we have:

p_j = p(\mathbf{v}_j = 1)    (4.12)

where p(v_j = 1) is the marginal distribution of the word v_j in the corpus, p(v_j). For a corpus of size T, the probability p(v_j) is estimated by the relative frequency of the word in the corpus:

p(v_j) = \frac{n(v_j)}{T}    (4.13)

The values of these marginal probabilities follow a Zipfian distribution as below:

p(v_j) \approx \frac{C}{j}    (4.14)

where j is the word index in a list of words consisting of n words sorted in descending order of frequency. The constant C in Equation 4.14 is set such that the sum of the probabilities is equal to 1, i.e.,

C = \frac{1}{\sum_{k=1}^{n} k^{-1}}    (4.15)


Using the harmonic series approximation \sum_{k=1}^{n} k^{-1} \approx \log n, the constant C in Equation 4.15 can be estimated as

C \approx \frac{1}{\log n}    (4.16)

Using Equation 4.13 to estimate the marginal distribution of words, the mixture weights in Equation 4.12 take the overall frequency of seeing words in a corpus into account. They assign higher weights to the most frequent words and lower weights to the less frequent words. According to Zipf's law, the frequently seen words constitute a small number of words in a vocabulary list and the majority of words are less frequent words. This basically means that if the marginal distribution of words is used as the mixture weights, then the distribution of the mixture of contextual word vectors in Equation 4.9 will be highly influenced by a small number of frequent words and the accumulation of many less frequent words. On the other hand, if the uniform distribution in Equation 4.11 is used to define the mixture weights, then the mixture of contextual word vectors in Equation 4.9 is equally influenced by all contextual word vectors used as mixture components.
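The two choices of mixture weights are easy to compare on a toy corpus. The sketch below computes the uniform weights of Equation 4.11, the marginal weights of Equations 4.12-4.13, and the two estimates of the Zipfian constant C from Equations 4.15 and 4.16; the tiny corpus is made up, and the printout only illustrates that the harmonic-sum and 1/log n estimates are of the same order:

    import math
    from collections import Counter

    tokens = "the cat sat on the mat the cat ate the fish".split()
    T = len(tokens)
    counts = Counter(tokens)
    n = len(counts)

    uniform = {w: 1 / n for w in counts}               # Equation 4.11
    marginal = {w: c / T for w, c in counts.items()}   # Equations 4.12-4.13

    C_exact = 1 / sum(1 / k for k in range(1, n + 1))  # Equation 4.15
    C_approx = 1 / math.log(n)                         # Equation 4.16
    print(C_exact, C_approx)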

We summarize this section as follows. A contextual word vector associated with a word is a vector of random variables, associated with a set of features, which count the frequency of seeing the word with each feature in a corpus. The set of contextual word vectors associated with all words in a vocabulary set forms a mixture model. In order to have a better view of the mixture of contextual word vectors, we study its distribution in the next section.

4.5 Distributions of Mixtures of Contextual Word Vectors

In this section, we study the distribution of the mixture of contextual word vectors. Our study is focused on the mean vector and the covariance matrix of the distribution in Equation 4.9. In our calculations, we assume that the mixture weights are defined by the marginal distribution of words as defined in Equation 4.12.

The mean vector of the mixture of contextual word vectors with the distribution in Equation 4.9 is:

\mu^{\mathbf{V}} = \sum_{k=1}^{n} p(v_k)\, \mu^{\mathbf{V}^{(k)}}    (4.17)

where p(v_k) is the marginal distribution of the word v_k and \mu^{V^{(k)}} is the mean vector of the kth contextual word vector, V^{(k)}, forming the mixture model. Given that p(v_k) follows the Zipfian distribution and the fact that V^{(k)} follows a binomial distribution (see Equation 4.6), the ith element of the mean vector \mu^{V} will be as below:

\mu^{\mathbf{V}}_i = \sum_{k=1}^{n} p(v_k)\, \mu^{\mathbf{V}^{(k)}}_i = \sum_{k=1}^{n} n(v_k)\, p(v_k)\, p(f_i \mid v_k) \approx T C^2 \sum_{k=1}^{n} \frac{p(f_i \mid v_k)}{k^2}    (4.18)

where we use \mu^{V^{(k)}}_i = n(v_k)\, p(f_i | v_k), p(v_k) \approx C/k, n(v_k) \approx TC/k, and T is the corpus size. The product TC^2 in Equation 4.18 can be approximated by T/(\log n)^2, using Equation 4.16. So, the ith element of the mean vector \mu^{V} can be approximated by:

\mu^{\mathbf{V}}_i \approx \frac{T}{(\log n)^2} \sum_{k=1}^{n} \frac{p(f_i \mid v_k)}{k^2}    (4.19)

Equation 4.19 shows that the elements of the mean vector are affected by three factors: the corpus size T, the vocabulary size n, and the distribution of features over words p(f_i | v_k). We study the effect of these parameters on the elements of the mean vector in two steps. Equation 4.19 has two factors, T/(\log n)^2 and \sum_{k=1}^{n} p(f_i | v_k)/k^2. The first factor directly depends on the corpus size and the vocabulary size. The second factor depends on the distribution of the features over the words.

The parameters T, the corpus size, and n, the vocabulary size, are dependent on each other (Egghe, 2007). A large corpus is expected to have a large vocabulary set too. On the basis of Heaps' law (Heaps, 1978, Section 7.5), the number of distinct words in a corpus is a function of the corpus size. Using the same notation as before, Heaps' law relates the corpus size T and the vocabulary size n as:

T \approx K n^{\beta}    (4.20)

where 10 ≤ K ≤ 100 is an integer and β ∈ (0,1). Using Equation 4.20 to simplify the fraction T/(\log n)^2 appearing in Equation 4.19 as a function of n, we have:

\tau(n) = \frac{K n^{\beta}}{(\log n)^2}    (4.21)

Figure 4.4 shows how the values of \tau(n) vary with respect to n, with K = 100 and β = 0.6. The value of \tau increases as the vocabulary size n increases. We see that the rate of increase of \tau with respect to n is very small. A vocabulary consisting of n = 10^5 words results in a value of 700 < \tau < 800.


Figure 4.4. The variation of \tau = T/(\log n)^2 versus n, where we use Heaps' law to estimate T \approx K n^{\beta} with K = 100 and β = 0.6.

Now we study the second factor of Equation 4.19. For fixed values of T and n, Equation 4.19 shows that the mean vector strictly depends on the distribution of the features over the words. The presence of k^2 in the denominator of p(f_i | v_k)/k^2 in this equation shows that the mean vector is more affected by words with small values of k, the most frequent words. In other words, the mean values of those features that are seen with the most frequent words are higher than those of features that are seen with less frequent words, as would be expected. In many practical cases where the size of the feature set is large enough (e.g. word forms or n-grams of word forms are used as features), the distribution of the features over words can be expected to be close to the Zipfian distribution. In these cases, we have p(f_i | v_k) \approx I/i, where i is the index of the feature f_i seen with v_k in the list of features sorted in descending order of frequency and I is the probability normalizing factor with the value I = (\sum_{i=1}^{m} i^{-1})^{-1}. Substituting the Zipfian approximation p(f_i | v_k) \approx I/i into Equation 4.19, we have:

\mu^{\mathbf{V}}_i \approx \frac{T I}{i (\log n)^2} \sum_{k=1}^{n} \frac{1}{k^2} \approx \frac{\pi^2 T}{6\, i \log m\, (\log n)^2}    (4.22)


where we use I \approx 1/\log m and \sum_{k=1}^{n} k^{-2} \approx \pi^2/6. Equation 4.22 shows that if the features are distributed with the Zipfian distribution over the words, then the elements of the mean vector of the mixture of contextual word vectors are directly proportional to the corpus size T and inversely proportional to the logarithm of the size of the feature set and the logarithm of the size of the vocabulary set. The effect of the vocabulary size is to a large extent cancelled out by the presence of the corpus size T in the numerator of the fraction, since the vocabulary size n is directly proportional to the corpus size T. As shown above, on the basis of Heaps' law, the ratio T/(\log n)^2 is on the scale of hundreds when the vocabulary size is on the scale of hundreds of thousands. Therefore, we cannot completely eliminate the effect of these parameters on the elements of the mean vector. Nevertheless, it is safe to conclude that the elements of the mean vector are highly dependent on the size of the feature set. As we increase the size of the feature set, the mean vector will get closer to zero. The presence of the index i in the denominator of Equation 4.22 shows that \mu^{V}_i becomes smaller as i increases. This basically means that the elements of the mean vector related to the most frequent features, for which i is small, are much higher than those related to the other features. Most of the elements of the mean vector are related to the less frequent features, and their values are drastically smaller than the elements related to the most frequent features. This can be seen as additional support for the above argument regarding the effect of the feature set size on the mean vector, since a large feature set leads to a small mean vector.

Nevertheless, the assumption about the Zipfian distribution of features over words might not be a valid assumption in all circumstances. For example, if a small set of part-of-speech tags is used as the feature set, then the distribution is more likely not to follow the Zipfian distribution, but to be closer to the uniform distribution. In order to estimate the mean values in a more general way, we use the maximum entropy principle, which states that the best distribution to model an observed data set is the one with maximum entropy. When we have no knowledge about the data distribution, the best distribution to model it is the uniform distribution. Using the maximum entropy principle, if we assume that the feature f_i is uniformly distributed over all words with probability p(f_i | v_k) \approx 1/m, where m is the size of the feature set, then the ith element of the mean vector in Equation 4.22 is approximated as below:

\mu^{\mathbf{V}}_i \approx \frac{T \pi^2}{6 m (\log n)^2}    (4.23)

where we use \sum_{k=1}^{n} k^{-2} \approx \pi^2/6 and C \approx 1/\log n. In order to show how the elements of the mean vector vary with respect to the corpus size T and the vocabulary size n, we set the size of the feature set to a constant value. Figure 4.5 shows how the values of \mu^{V}_i vary with respect to n and different values of m = 50, 100, 500, 1000.


Figure 4.5. An approximation of the elements of the mean vector \mu^{V}_i \approx T\pi^2 / (6m(\log n)^2) with respect to the size of the feature set m = 50, 100, 500, 1000, and the size of the vocabulary set 10^4 ≤ n ≤ 10^5. The corpus size T is approximated on the basis of Heaps' law T \approx K n^{\beta} with K = 100 and β = 0.6.

In general, the figure shows that the values of \mu^{V}_i become very small as the feature size m increases. The values of \mu^{V}_i are directly proportional to the vocabulary size n, but the rate of increase in the values of \mu^{V}_i becomes very small as the feature size m increases. This means that, even with a large vocabulary set, the mean vector of the contextual word vectors becomes close to the null vector when the feature set size m increases. The length of the mean vector measures the closeness of the vector to the null vector. Using Equation 4.23, an approximation of the length of the mean vector is:

\left\| \mu^{\mathbf{V}} \right\| \approx \sqrt{\sum_{k=1}^{m} \left( \frac{T\pi^2}{6m(\log n)^2} \right)^2} = \sqrt{m \left( \frac{T\pi^2}{6m(\log n)^2} \right)^2} = \frac{T\pi^2}{6\sqrt{m}\,(\log n)^2}    (4.24)

Figure 4.6 shows the variations of the length of the mean vector. As the feature size m increases, the length of the mean vector becomes smaller. The length of the mean vector is directly proportional to the vocabulary size n. However,


Figure 4.6. An approximation of the length of the mean vector \mu^{V} with respect to the size of the feature set m = 50, 100, 500, 1000, and the size of the vocabulary set 10^4 ≤ n ≤ 10^5. The corpus size T is approximated on the basis of Heaps' law T \approx K n^{\beta} with K = 100 and β = 0.6.

the growth rate of the length of the mean vector with respect to n becomes very small as the feature size m increases. This shows that the mean vector is relatively close to the null vector even with a large vocabulary set.

Now we turn our attention to the study of the covariance of the mixture of contextual word vectors, \Sigma^{V}. Denoting the covariance matrix of the kth contextual word vector forming the mixture model in Equation 4.9 as \Sigma^{V^{(k)}}, the covariance matrix of the mixture of contextual word vectors is:

\Sigma^{\mathbf{V}} = \sum_{k=1}^{n} p(v_k)\, \Sigma^{\mathbf{V}^{(k)}}    (4.25)

where p(v_k) is the marginal probability of observing the kth word in the vocabulary set. Given that the elements of the contextual word vectors follow the binomial distribution in Equation 4.6, for the diagonal elements of \Sigma^{V^{(k)}} (k = 1, ..., n), we have:

\Sigma^{\mathbf{V}^{(k)}}_{(i,i)} = n(v_k)\, p(f_i \mid v_k)\,(1 - p(f_i \mid v_k))    (4.26)

where n(v_k) is the marginal frequency of the word v_k associated with V^{(k)} and i = 1, ..., m is an index.


Assuming Zipf's law for the words, the diagonal elements (i, i) of the overall covariance matrix will be as below:

\Sigma^{\mathbf{V}}_{(i,i)} = \sum_{k=1}^{n} p(v_k)\, \Sigma^{\mathbf{V}^{(k)}}_{(i,i)} = \sum_{k=1}^{n} n(v_k)\, p(v_k)\, p(f_i \mid v_k)(1 - p(f_i \mid v_k)) \approx T C^2 \sum_{k=1}^{n} \frac{p(f_i \mid v_k)}{k^2} (1 - p(f_i \mid v_k))    (4.27)

where we use p(v_k) \approx C/k, n(v_k) \approx TC/k, and T is the corpus size. For the off-diagonal elements of \Sigma^{V^{(k)}}, we have:

\Sigma^{\mathbf{V}^{(k)}}_{(i,j)} = -\,n(v_k)\, p(f_i \mid v_k)\, p(f_j \mid v_k)    (4.28)

with i ≠ j. Using Zipf's law, the off-diagonal elements can be approximated as below:

\Sigma^{\mathbf{V}}_{(i,j)} = \sum_{k=1}^{n} p(v_k)\, \Sigma^{\mathbf{V}^{(k)}}_{(i,j)} = -\sum_{k=1}^{n} n(v_k)\, p(v_k)\, p(f_i \mid v_k)\, p(f_j \mid v_k) \approx -\,T C^2 \sum_{k=1}^{n} \frac{p(f_i \mid v_k)\, p(f_j \mid v_k)}{k^2}    (4.29)

Similar to the analysis of the mean vector of the mixture of contextual word vectors, we analyse the covariance matrix under two scenarios about the distribution of features over the words. First, we assume that the distribution of features over words follows the Zipfian distribution, i.e., p(f_i | v_k) \approx I/i, where I = (\sum_{i=1}^{m} i^{-1})^{-1}. Second, we use the maximum entropy principle and assume that the features are uniformly distributed over words, i.e., p(f_i | v_k) \approx 1/m. With the first scenario, the diagonal element (i, i) is approximated as:

\Sigma^{\mathbf{V}}_{(i,i)} \approx \frac{T C^2 I (i - I)}{i^2} \sum_{k=1}^{n} \frac{1}{k^2} \approx \frac{T\pi^2}{6\, i \log m\, (\log n)^2} \left( 1 - \frac{1}{i \log m} \right)    (4.30)

and the off-diagonal element (i, j) with i ≠ j is approximated as:

\Sigma^{\mathbf{V}}_{(i,j)} \approx -\frac{I^2 C^2 T}{i j} \sum_{k=1}^{n} \frac{1}{k^2} \approx -\frac{T\pi^2}{6\, i j\, (\log m)^2 (\log n)^2}    (4.31)


where we use I \approx 1/\log m, C \approx 1/\log n, and \sum_{k=1}^{n} 1/k^2 \approx \pi^2/6. Equation 4.30 and Equation 4.31 can be simplified into:

\Sigma^{\mathbf{V}}_{(i,j)} = -\frac{A(1 - \alpha)}{i j}    (4.32)

where A = \frac{T\pi^2}{6(\log m \log n)^2} and \alpha is defined as below:

\alpha = \begin{cases} i \log m & i = j \\ 0 & i \neq j \end{cases}    (4.33)

Equation 4.32 shows that the covariance between the feature variables is directly proportional to the parameter A and inversely proportional to the indices i and j. Given that T, m, and n are positive integers, and assuming that m > 2 and n > 1, A is always a positive real number and \alpha is a real number larger than 1 for i = j. Thus, the diagonal elements are positive since \alpha > 1 and A > 0, and the off-diagonal elements of the covariance matrix are negative since A > 0. The values of A are directly proportional to the corpus size T and inversely proportional to the feature size m and the vocabulary size n. As shown in Figure 4.4, the ratio \tau = T/(\log n)^2 increases with the vocabulary size n, which itself is a function of the corpus size T. Given that A = \tau\pi^2/(6(\log m)^2), with a constant value of m, A also increases linearly with the corpus size and the vocabulary size. With a constant value of \tau, the variation of the values of A is inversely proportional to the logarithm of the size of the feature set, m. A small feature set results in relatively larger values of A and a very large feature set can result in significantly smaller values of A. In addition to the parameter A, the covariance between feature variables is also affected by their indices in the list of features sorted in descending order of the feature frequencies. The absolute value of the covariance between the feature variables associated with less frequent features, for which the feature indices are high, is much smaller than the absolute value of the covariance between the feature variables associated with features of relatively higher frequency. The diagonal elements of the covariance matrix are higher than the absolute values of the off-diagonal elements in the same row and column. Figure 4.7 provides a visualization of what we described about the elements of the covariance matrix in Equation 4.32. The fact that the diagonal elements of the covariance matrix sharply decrease as the index i increases shows that the majority of the data spread is due to the disproportionate contribution of the most frequent features, which have low indices. In terms of the eigen-spectrum, this picture of the distribution of contextual word vectors implies a sharp decrease in the spectrum of the eigenvalues of their covariance matrix. In other words, most of the variation of the data is along a small number of top eigenvectors of the covariance matrix. Later on, in Section 8.1.2, we will make use of this property to determine the optimal number of dimensions for the principal word vectors.
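The sharp decay of the eigenvalue spectrum can be checked with a small simulation. The sketch below samples one multinomial contextual word vector per word under the Zipfian scenario discussed above (using the same feature distribution for every word, which is a simplification of the model) and prints the share of variance carried by the top eigenvalues of the weighted covariance matrix; the sizes are arbitrary illustrative values:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, T = 200, 5000, 10**6            # feature set size, vocabulary size, corpus size

    word_p = 1 / np.arange(1, n + 1); word_p /= word_p.sum()   # Zipfian word marginals
    feat_p = 1 / np.arange(1, m + 1); feat_p /= feat_p.sum()   # Zipfian feature distribution

    # One multinomial sample per word (Equation 4.8), with n(v_k) proportional to its marginal.
    counts = np.array([rng.multinomial(int(T * p), feat_p) for p in word_p])

    cov = np.cov(counts.T, aweights=word_p)          # covariance weighted by word marginals
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    print(eigvals[:5] / eigvals.sum())               # a handful of eigenvalues dominate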


Figure 4.7. The covariance matrix of the mixture of contextual word vectors with the assumption that the features are distributed over words with a Zipfian distribution. The intensity of black indicates the absolute value of the covariances.

Alternatively, if we assume that the features are uniformly distributed over words, i.e., p(f_i | v_k) \approx 1/m, the element (i, j) of the covariance matrix will be as below:

\Sigma^{\mathbf{V}}_{(i,j)} = \frac{T \pi^2 \alpha}{6 (m \log n)^2}    (4.34)

where α is defined as:

\alpha = \begin{cases} m - 1 & i = j \\ -1 & i \neq j \end{cases}    (4.35)

Equations 4.34 and 4.35 show that the elements of the covariance matrix become smaller as the number of features increases. The diagonal elements of the covariance matrix are larger than the absolute values of the off-diagonal elements. Using m \approx m - 1 in Equation 4.35, the diagonal elements of the covariance matrix will be as small as the corresponding mean values in Equation 4.23. As shown in Figure 4.5, the elements of the mean vector can be very small when the number of features increases. Similarly, the elements of the covariance matrix can be very small as the number of features increases. This, together with the fact that the diagonal elements of the covariance matrix are higher than the off-diagonal values, shows that, depending on the size of the feature set, the elements of the covariance matrix can be very small. This basically means that the contextual word vectors are tightly massed around their mean vector, which itself is close to the null vector.


We summarize this part as follows. For a given corpus of a certain size, both the mean vector and the covariance matrix of the mixture of contextual word vectors are highly affected by the number of features. Depending on the size of the feature set, the mean vector of the mixture of contextual word vectors can be very close to the null vector and the contextual word vectors can be massed around the mean vector. The spectrum of the eigenvalues of the covariance matrix of the contextual word vectors sharply decreases, and most of the variation in the data is along a few eigenvectors associated with the top eigenvalues. These properties of the covariance matrix limit the usage of dimensionality reduction techniques, such as PCA, which rely on data variance. When PCA is performed on the mixture of contextual word vectors, the resulting word vectors are highly influenced by the top eigenvectors of the covariance matrix of the mixture of contextual word vectors. As mentioned in Section 2.3, PCA does not make any assumption about the data distribution, but it works better on data with a normal distribution. However, the covariance matrix of the mixture of contextual word vectors indicates that the distribution of the mixture of contextual word vectors is far from the normal distribution. This is an answer to our first research question in Section 1.2, which addresses the limitations of using PCA for word embeddings. Later on, in Section 5.2.2, we show how this problem with the data distribution can be mitigated through a transformation function that facilitates the PCA of the contextual word vectors.

4.6 Combined Feature Variables

In this section, we examine how contextual word vectors are affected by different types of feature variables and their combinations. As explained in Section 4.2, a set of feature variables is defined by a feature set and a context function. The features are defined over words and describe word occurrences in a corpus. Some examples of features are part-of-speech tags, lemmas, word forms, and even the label of the document to which a word belongs. A context function is a mapping between the elements of a corpus. Each element is mapped onto a subset of elements in the corpus.

Depending on the feature set and context function in use, different sets of feature variables and different types of contextual word vectors can be formed. In the following, we elaborate on different methods of combining multiple sets of feature variables. We also examine how the feature combination approaches affect the distribution of the resulting contextual word vectors.

Let $\{F_1, \ldots, F_k\}$ be a set of $k$ disjoint sets of feature variables with $F_i = \{f^{(i)}_1, \ldots, f^{(i)}_{m_i}\}$ $(i = 1, \ldots, k)$. We consider three ways to combine these feature sets. The first is to form a set of joint feature variables from the Cartesian product set of the original sets of feature variables. In this approach, the combined set of feature variables is

$$F = F_1 \times F_2 \times \cdots \times F_k \qquad (4.36)$$

The size of the set of joint feature variables $F = \{f_1, \ldots, f_m\}$ is $m = \prod_{i=1}^{k} m_i$ and its $j$th element is $f_j = \prod_{i=1}^{k} f^{(i)}_{j_i}$, where $j = 1, \ldots, m$, and $j_i$ $(i = 1, \ldots, k)$ is a feature index in $F_i$ such that $j_1 + \sum_{i=2}^{k} (j_i - 1) \prod_{l=1}^{i-1} m_l = j$. In other words, $f_j$ is equal to the product of the feature variables in the $j$th element of the Cartesian product $\prod_{i=1}^{k} F_i$. This means that the joint feature variable $f_j$ is a Bernoulli random variable and the contextual word vectors formed by the joint feature variables follow the multinomial distribution in Equation 4.8. Since the number of joint feature variables in $F$ is much bigger than the number of individual feature variables in each $F_i$ $(i = 1, \ldots, k)$, the contextual word vectors generated with the joint feature variables are expected to be more densely concentrated around their mean vector than the contextual word vectors generated with each $F_i$ $(i = 1, \ldots, k)$. This is due to the effect of the number of feature variables on the covariance matrix of the contextual word vectors, as discussed in Section 4.5.

As an example, let us assume that the feature variable set $F$ is formed by the Cartesian product of two sets of feature variables $F_1$ and $F_2$, which are defined as follows. $F_1$ is defined by a set of part-of-speech tags as its feature set and the neighbourhood context with parameter $\tau = -1$ as its context function. $F_2$ is defined by a set of supertags as its feature set and the dependency context with parameter $\tau = -1$ as its context function. Then the joint feature variable $f_l = (f^{(1)}_i, f^{(2)}_j)$, where $f^{(1)}_i \in F_1$ and $f^{(2)}_j \in F_2$, for a given token $e_t$ is $1$ if $f^{(1)}_i(e_t) = 1$ and $f^{(2)}_j(e_t) = 1$, i.e., the part-of-speech tag of the immediately preceding token, $e_{t-1}$, matches the part-of-speech tag associated with $f^{(1)}_i$ and the supertag of the parent word matches the supertag associated with $f^{(2)}_j$.

The joint approach of feature combination can be seen as an extension of the n-gram models, which are used in language modelling. This approach can also be seen as a generalization of the feature combination methods used in the parsing community, where many different combinations of textual features are used by a parser (McDonald, 2006; Zhang and Nivre, 2011). The difference between the joint approach of feature combination and the n-gram models is as follows. An n-gram model is defined as a contiguous sequence of items, but the variables generated by the joint approach are not necessarily a contiguous sequence. By way of illustration, in the previous example, the joint feature variables are formed by the part-of-speech tags belonging to the immediately preceding words and the supertags of parent words. However, these (preceding words and parent words) are not necessarily in a contiguous sequence. In addition to this, the joint feature variables provide us with a systematic way to combine different types of contextual features. The elements of the contextual word vectors built with the joint feature variables basically show the frequency of seeing each word with all combinations of the features in the original sets of feature variables. Since not all combinations of feature variables are likely to occur in the corpus, one can eliminate those joint variables that are never seen in the corpus, i.e., $\sum_{t=1}^{T} f_j(e_t) = 0$, where $\{e_1, \ldots, e_T\}$ is the training corpus.
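To make the index arithmetic of the Cartesian product concrete, the following sketch (not from the thesis) enumerates the joint features of two toy feature sets and checks the mapping $j = j_1 + \sum_{i=2}^{k}(j_i-1)\prod_{l=1}^{i-1} m_l$; the feature sets and their sizes are hypothetical.

```python
def joint_index(js, sizes):
    """Map per-set feature indices (1-based) to the joint 1-based index
    j = j_1 + sum_{i>=2} (j_i - 1) * prod_{l<i} m_l, as in the text."""
    j, stride = js[0], 1
    for ji, mi in zip(js[1:], sizes[:-1]):
        stride *= mi
        j += (ji - 1) * stride
    return j

# Two hypothetical feature sets: part-of-speech tags of the preceding token (F1)
# and supertags of the parent word (F2).
F1 = ["DET", "NOUN", "VERB"]   # m1 = 3
F2 = ["S1", "S2"]              # m2 = 2
sizes = [len(F1), len(F2)]

# Enumerate the Cartesian product F1 x F2; every pair gets a distinct joint index.
for j1, f1 in enumerate(F1, start=1):
    for j2, f2 in enumerate(F2, start=1):
        print((f1, f2), "->", joint_index([j1, j2], sizes))
```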

The second way to combine multiple sets of feature variables is to form the union set of feature variables:

$$F = \bigcup_{i=1}^{k} F_i \qquad (4.37)$$

The size of the union set of feature variables $F = \{f_1, \ldots, f_m\}$ is $m = \sum_{i=1}^{k} m_i$, where $m_i$ is the size of the feature variable set $F_i$. The contextual word vectors obtained from this approach are equivalent to the concatenation of the contextual word vectors obtained from each set of feature variables. Denoting the contextual word vectors obtained from each feature variable set $F_i$ $(i = 1, \ldots, k)$ as $\{V_1, \ldots, V_k\}$, all associated with the same word, the corresponding union contextual word vector will be $V = (V_1^T, \ldots, V_k^T)^T$. A contextual word vector obtained by the union feature variables does not follow the multinomial distribution in Equation 4.8. This is because the range of the feature lookup function and the context function of the union set of feature variables are changed to their corresponding power sets $\mathcal{P}(F)$ and $\mathcal{P}(E)$ respectively, i.e., $F : E \to \mathcal{P}(F)$ and $C : E \to \mathcal{P}(E)$. In other words, each $e_t \in E$ $(t = 1, \ldots, T)$ activates $k$ feature variables in the union contextual word vector $V$, each of which belongs to one of the contextual word vectors forming $V$. However, the multinomial distribution in Equation 4.8 requires that each observation $e_t \in E$ activates only one feature variable. The contextual word vector $V$ obtained from the union of the feature variable sets $F_i$ $(i = 1, \ldots, k)$ follows the joint distribution of the multinomial distributions associated with each $F_i$ $(i = 1, \ldots, k)$. Nevertheless, the individual feature variables in $F$ follow the binomial distribution in Equation 4.6. Since our arguments in Section 4.5 rely on the fact that feature variables follow binomial distributions, we can conclude that the mean vector and the covariance matrix of the mixture of union contextual word vectors will have the same properties as the mean vector and covariance matrix of the mixture of contextual word vectors (see Section 4.5).
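As a small illustration of the union combination (not from the thesis), the sketch below stacks two toy contextual matrices built from different feature sets over the same words; stacking the matrices is the same as concatenating the contextual word vectors of each word. The counts are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

# Two hypothetical contextual matrices over the same five words: M1 uses one feature
# set (m1 = 3 features), M2 another (m2 = 2 features). The union combination simply
# stacks them, i.e. concatenates the contextual word vectors column-wise.
M1 = csr_matrix(np.array([[2, 0, 1, 0, 3],
                          [0, 1, 0, 2, 0],
                          [1, 0, 0, 1, 1]]))
M2 = csr_matrix(np.array([[0, 2, 1, 0, 0],
                          [1, 0, 0, 3, 1]]))

M_union = vstack([M1, M2]).tocsr()   # (m1 + m2) x n contextual matrix
print(M_union.shape)                 # (5, 5): each column is (V1^T, V2^T)^T for one word
```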

The third way to combine multiple sets of feature variables is to add up the contextual word vectors obtained from each set. The addition approach of feature combination requires that the sets of feature variables be of equal size, since vector addition is only defined for vectors of equal length. Therefore, the addition approach is restricted to word vectors with equal numbers of dimensions. In addition to this algebraic restriction, the addition approach restricts the nature of the feature variables. In fact, the addition approach of feature combination only makes sense if there is a one-to-one mapping between the elements of the feature sets such that the corresponding feature variables are associated with the same feature and the same type of context (e.g. the neighbourhood context) but different context parameters. For example, we can add together the contextual word vectors built with the neighbourhood context with different parameter values $\tau \in \mathbb{Z}$ $(0 < t + \tau < T)$. However, it probably makes no sense to add a contextual word vector built with the neighbourhood context to another contextual word vector built with the dependency context, although the addition might be algebraically possible.

The addition approach of feature combination can be used to form the window-based context used in the literature. Let $V_\tau$ be a contextual word vector obtained from the neighbourhood context with parameter $\tau$. We define a window-based contextual word vector for word $v$ with three parameters $a$, $n$, and $\alpha$ as below:

$$V = \sum_{\tau=a}^{a+n-1} \alpha_\tau V_\tau \qquad (4.38)$$

where $a \in \mathbb{Z}$ indicates the beginning of the window, $n$ is the window length, and $\alpha$ is a vector of weights $\alpha_\tau \in \mathbb{R}^{+}$ associated with each word in the window. Depending on the values of $a$ and $n$, one can form asymmetric (backward and forward) and symmetric window-based contextual word vectors corresponding to the asymmetric and symmetric window-based contexts. A backward window-based contextual word vector is formed with the backward window-based context of length $k \in \mathbb{N}$, i.e., the $k$ preceding words, with $a = -k$, $n = k$, and $\alpha_\tau = \frac{1}{|\tau|}$. Similarly, a forward window-based contextual word vector is formed with the forward window-based context of length $k \in \mathbb{N}$, i.e., the $k$ succeeding words, with $a = 1$, $n = k$, and $\alpha_\tau = \frac{1}{\tau}$. For a symmetric window-based contextual word vector, we set $a = -k$, $n = 2k+1$, and $\alpha_\tau = \frac{1}{|\tau|}$ for $\tau \neq 0$ and $\alpha_0 = 0$. The window-based context can be generalized to the dependency context as well. We can form an ancestral context by adding multiple contextual word vectors obtained from a dependency context with different parameter values.

It can be shown that the contextual word vectors obtained from the addition approach of feature combination follow a multinomial distribution if the coefficients $\alpha_\tau$ in Equation 4.38 are positive integers, i.e., $\alpha_\tau \in \mathbb{Z}^{+}$. However, in general, the contextual word vectors obtained from the addition approach can be approximated by the Gaussian distribution with $\mu = \sum_{\tau=a}^{a+n-1} \alpha_\tau \mu_\tau$ and $\Sigma = \sum_{\tau=a}^{a+n-1} \alpha_\tau \Sigma_\tau$, where $\mu_\tau$ and $\Sigma_\tau$ are the mean vectors and covariance matrices of the multinomial distributions corresponding to $V_\tau$ $(\tau = a, \ldots, a+n-1)$.
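A minimal sketch of Equation 4.38 (not from the thesis) is given below: it builds neighbourhood contextual vectors for single offsets and adds them with the weights $\alpha_\tau = 1/|\tau|$ to obtain a symmetric window-based contextual word vector. Word forms are used as the contextual features and the tiny corpus is made up for the example.

```python
import numpy as np

def neighbourhood_vector(tokens, target, feature_index, tau):
    """Contextual count vector V_tau for `target`: counts of the feature (here the
    word form) seen at offset tau from each occurrence of `target`."""
    v = np.zeros(len(feature_index))
    for t, tok in enumerate(tokens):
        if tok == target and 0 <= t + tau < len(tokens):
            v[feature_index[tokens[t + tau]]] += 1
    return v

def symmetric_window_vector(tokens, target, feature_index, k):
    """Weighted sum over offsets tau = -k..k (tau != 0) with alpha_tau = 1/|tau|,
    i.e. Equation 4.38 with a = -k and n = 2k + 1."""
    total = np.zeros(len(feature_index))
    for tau in range(-k, k + 1):
        if tau != 0:
            total += (1.0 / abs(tau)) * neighbourhood_vector(tokens, target, feature_index, tau)
    return total

tokens = "the cat sat on the mat".split()
feature_index = {w: i for i, w in enumerate(sorted(set(tokens)))}
print(symmetric_window_vector(tokens, "the", feature_index, k=2))
```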

4.7 Summary

We have introduced contextual word vectors as high-dimensional random vectors whose elements count frequencies of seeing a set of features in the context of words. Words and contextual features were represented as Bernoulli random variables called word variables and feature variables. Word variables indicate the occurrence of words in a corpus and feature variables indicate the occurrence of contextual features (i.e. a set of symbols used for annotating a corpus) in the corpus. We also defined a context as a key concept that indicates if a contextual feature is seen with a word or not. Contexts were defined as functions that map each element of a corpus to an element in the same corpus.

Contextual word vectors are considered to be the primitive units used for training low-dimensional word vectors. A set of contextual word vectors associated with words in a vocabulary set forms a mixture of contextual word vectors. A set of low-dimensional word vectors can be trained by principal component analysis of a mixture of contextual word vectors. We studied the distribution of this mixture model with two aims: 1) to provide a better view of the model, and 2) to see if the model is suitable for PCA or not.

In our studies on the distribution of mixtures of contextual word vectors, we showed that, depending on the distribution of contextual features over words, the mean vector of the mixture model can be very close to the null vector and its covariance matrix can be dominated by a small number of the most frequent features. This results in a sharp decay in the spectrum of eigenvalues of the covariance matrix. The principal components of a mixture model with such a covariance matrix will be highly influenced by a small number of the top eigenvalues and their corresponding eigenvectors. This reduces the effect of the majority of eigenvalues with relatively smaller values, and of their corresponding eigenvectors, on the principal word vectors. Based on this disproportionate effect of the eigenvalues on the principal word vectors, we concluded that the distribution of the mixture of contextual word vectors is not very suitable for PCA.

We concluded this chapter with a study of different combination approaches used to combine different sets of mixtures of contextual word vectors. In this study, we showed how the distribution of contextual word vectors is affected by the combination approaches.


5. Principal Word Vectors

In this chapter, we introduce principal word vectors as a compact representation of mixtures of contextual word vectors. As mentioned in Section 4.4, a mixture of contextual word vectors is a random vector consisting of a mixture of binomial random variables. A compact representation of a mixture of contextual word vectors can be obtained by performing a dimensionality reduction technique on the random vectors. Principal component analysis (PCA) is a standard method of dimensionality reduction, which is used in many scientific areas (see Section 2.3). However, as explained in Section 4.5, the distribution of mixtures of contextual word vectors is unsuitable for PCA. In this chapter, we use the generalized PCA (GPCA), introduced in Section 2.4, to mitigate the problem with the data distribution and make it possible to process mixtures of contextual word vectors. We consider this solution as an answer to our second research question in Section 1.2: how can PCA be efficiently and effectively used to reduce the dimensionality of mixtures of contextual word vectors? In Section 5.1, we explain how to set up the GPCA algorithm to effectively reduce the number of dimensions of a mixture of contextual word vectors and generate a set of word embeddings called principal word vectors or principal word embeddings. In addition to alleviating the problem with the data distribution, GPCA makes it possible to use extra knowledge in the process of generating principal word vectors. This is achieved through the additional weighting parameters employed by GPCA. A study of how to set up these parameters is provided in Section 5.2. We also introduce an efficient algorithm for estimating the singular value decomposition (SVD) of very large data matrices such as a data matrix sampled from a mixture of contextual word vectors. This algorithm, explained in Section 5.3, makes it possible to generate principal word vectors in an efficient way.

5.1 Word Embedding through Generalized PCA

As mentioned in Section 2.3, the principal components of a vector of random variables are latent variables defined over the top eigenvectors of the covariance of the original random variables. We define a principal word vector as a realization of the principal components of a mixture of contextual word vectors, defined in Chapter 4. More specifically, we define a set of principal word vectors associated with a set of words as the realization of the principal components of a set of contextual word vectors associated with the words.


We refer to the matrix of contextual word vectors as a contextual matrix. Formally, for a given corpus formed by $n$ word variables and $m$ feature variables, the columns of the $m \times n$ contextual matrix $M$ are the contextual word vectors associated with the word variables. Denoting the contextual word vector associated with word $v_i \in V$ $(i = 1, \ldots, n)$ as $V_i$, the corresponding contextual matrix is $M = (V_1 \ldots V_n)$.

As mentioned in Section 4.5, the distribution of the mixture of contextual word vectors is unsuitable for PCA. The GPCA algorithm in Algorithm 2 can be used to construct the principal word vectors. GPCA adds several parameters to the classic PCA algorithm. These parameters normalize the distribution of the mixture of contextual word vectors and make it suitable for principal component analysis. They also provide a standard way to add extra knowledge to the principal word embedding method. In this part, we detail how the GPCA algorithm can be used for this aim. The GPCA algorithm is applied on a contextual matrix. We describe a procedure that sets the input parameters of the GPCA algorithm and generates a set of low-dimensional principal word vectors using GPCA.

Algorithm 4 summarizes the three main steps for generating a set of $k$-dimensional principal word vectors from the corpus $E$, characterized by a set of feature variables and a set of word variables. The number of dimensions $k$ is smaller than or equal to the number of feature variables. First, on Line 2, a contextual matrix is built by scanning the input corpus and counting the frequency of words with different features in their contexts. This step is a bottleneck in the algorithm. It is expanded in the procedure BUILD-CONTEXTUAL-MATRIX. This procedure consists of a loop with two search steps on Line 12 and Line 13, and an assignment on Line 14. The two search steps can be done rapidly by hashing mechanisms such as the hash chain proposed by Zobel et al. (2001). The efficiency of the assignment step is highly dependent on the algorithm and data structure used to implement the contextual matrix. In practice, the sparsity of the contextual matrix $M$, which is defined as the ratio of the number of zeros to the total number of elements, is large enough to consider it a sparse matrix.

Second, we set the parameters of the GPCA algorithm and the transformation function $f$. The weight matrices $\Phi$, $\Omega$, and $\Lambda$ are set on lines 3 and 4. The parameters of the transformation function are set on Line 5. These parameters can be set based on a priori knowledge of the distribution of features and words, or they can be set on the basis of statistics provided by the contextual matrix. In this thesis, we do not investigate the former approach. Instead, we study the latter approach in Section 5.2. The third step is to run the GPCA algorithm in Algorithm 2 on the contextual matrix using the above parameters.

The corpus $E$ in Algorithm 4 is characterized by a set of feature variables and a set of word variables. Depending on the number of feature variables and the number of word variables, the contextual matrix $M$ on Line 2 of Algorithm 4 can be a very large matrix with a high degree of sparsity.


Algorithm 4 The algorithm to generate a set of k-dimensional principal word vectors for words in the corpus $E$, characterized by a set of feature variables $F = \{f_1, \ldots, f_m\}$ and a set of word variables $V = \{v_1, \ldots, v_n\}$.

1: procedure PRINCIPAL-WORD-VECTOR(E, k)
2:   M ← BUILD-CONTEXTUAL-MATRIX(E)
3:   Build the weight matrices Φ and Ω
4:   Build the diagonal eigenvalue matrix Λ having k positive elements
5:   Train the transformation function f
6:   Y ← GPCA(M, Φ, Ω, Λ, f)
7:   return Y
8: end procedure

9: procedure BUILD-CONTEXTUAL-MATRIX(E)
10:   Initialize the m×n contextual matrix M with zeros
11:   for t ← 1 to T do
12:     Find the feature index i for which f_i(e_t) = 1
13:     Find the word index j for which v_j(e_t) = 1
14:     M(i, j) ← M(i, j) + 1
15:   end for
16:   return M
17: end procedure
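A minimal sketch of BUILD-CONTEXTUAL-MATRIX is given below (an illustration under assumed inputs, not the thesis implementation). It uses Python dictionaries as the hash-based lookups of Lines 12 and 13 and collects COO triplets that are summed into a sparse matrix, which is one way to keep the assignment of Line 14 efficient for a sparse contextual matrix; the toy corpus, the choice of word forms as features, and the helper callables feature_of and word_of are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_contextual_matrix(corpus, feature_of, word_of, feature_index, word_index):
    """Sparse sketch of BUILD-CONTEXTUAL-MATRIX: feature_of(t) and word_of(t) play
    the role of Lines 12-13 (finding the active feature and word of token t); the
    COO triplets replace the in-place increment on Line 14."""
    rows, cols = [], []
    for t in range(len(corpus)):
        i = feature_index.get(feature_of(t))   # hash lookup for the feature index
        j = word_index.get(word_of(t))         # hash lookup for the word index
        if i is not None and j is not None:
            rows.append(i)
            cols.append(j)
    data = np.ones(len(rows))
    shape = (len(feature_index), len(word_index))
    return csr_matrix((data, (rows, cols)), shape=shape)  # duplicate entries are summed

# Toy usage: word forms as features, neighbourhood context with tau = -1.
tokens = "the cat sat on the mat".split()
w_idx = {w: i for i, w in enumerate(sorted(set(tokens)))}
f_idx = dict(w_idx)  # features are word forms here
M = build_contextual_matrix(tokens,
                            feature_of=lambda t: tokens[t - 1] if t > 0 else None,
                            word_of=lambda t: tokens[t],
                            feature_index=f_idx, word_index=w_idx)
print(M.toarray())
```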

In practice, the vocabulary size can be on the scale of hundreds of thousands to millions of words, and the number of contextual features can range from tens to billions of features. If $M$ is a large sparse matrix, then the performance of Algorithm 2 in Section 2.4 is affected by the mean-centring step and the SVD step on Lines 3 and 4 of Algorithm 2. The output of the mean-centring step is always a dense matrix, regardless of the sparsity state of the contextual matrix.

Since the contextual matrix is often a large matrix, the subsequent processing of such a large matrix requires a large amount of memory and CPU time. This partially answers our first research question in Section 1.2, dealing with the limiting factors of PCA for word embedding. We recall that another answer to this research question is the distribution of the mixture of contextual word vectors, as explained in Section 4.5. In Section 5.3, we introduce a randomized SVD algorithm that enables the singular value decomposition of the mean-centred contextual matrix without needing to construct the actual mean-centred matrix.


5.2 Parameters of Generalized PCA

In this section, we study the GPCA parameters in the following order. First, we introduce different metric and weight matrices and examine how the principal word vectors are influenced by these matrices. Then we propose some transformation functions to mitigate the problem with the distribution of the mixture of contextual word vectors as described in Section 4.5. Finally, we study different eigenvalue weighting matrices that make it easier to control the variance of the principal word vectors.

5.2.1 Metric and Weight Matrices

The metric matrix $\Phi$ and the weight matrix $\Omega$ provide GPCA with a priori knowledge about feature variables and word variables, respectively. The matrix $\Phi$ defines a metric on the feature variables so that the distance between two contextual word vectors $V^{(i)}$ and $V^{(j)}$ is $(V^{(i)} - V^{(j)})^T \Phi (V^{(i)} - V^{(j)})$. This is useful for scaling the values in the contextual word vectors. The matrix $\Omega$ is usually used to weight the observations. In our model it is used to weight the contextual word vectors. Although neither the metric matrix $\Phi$ nor the weight matrix $\Omega$ needs to be diagonal, we restrict our research to some specific diagonal metric and weight matrices that are connected to related work in the literature. In the remainder of this section, we introduce some examples of these matrices and explain how they are related to the weighting mechanisms used in the literature. We also study how contextual word vectors are affected by different combinations of the weight and metric matrices.

Table 5.1 gives a list of diagonal metric and weight matrices that can be used with contextual word vectors. Denoting the matrix of contextual word vectors by $X$, we define $Y = \Phi X \Omega$ to be the product matrix, which is formed on Line 2 of Algorithm 2. We use the pair $(\Phi, \Omega)$ to form different combinations of the metric and weight matrices in Table 5.1.

Using $(\Phi, \Omega) = (I_m, I_n)$, the product matrix $Y = I_m X I_n$ will be equal to $X$. The inverse of the feature frequency matrix, iff, is a metric matrix that gives higher weights to the less frequent features.¹ If we use $(\mathrm{iff}, I_n)$, then

$$Y(i,j) \approx p(c_i \mid v_j) \qquad (5.1)$$

is an estimate of the conditional probability of seeing the feature $c_i$ conditioned on $v_j$, i.e., $p(c_i = 1 \mid v_j = 1)$. If we use word forms as contextual features in Equation 5.1 and apply a square root transformation to the probability $p(c_i \mid v_j)$, then Equation 5.1 will be equivalent to the Hellinger transformation on a probability co-occurrence matrix used by Lebret and Collobert (2014) (see Equation 3.5). The weight matrix iwf can be used to cancel out the Zipfian effect of the word distributions by reducing the disproportionate effect of the very frequent words on the elements of contextual word vectors.

¹The name iff is used in the same way as other well-established terms such as the inverse of document frequency, idf, and should not be confused with the logical connective if and only if.


Name    Definition
$I_m$   $\Phi(i,i) = 1$                        The $m \times m$ identity matrix
iff     $\Phi(i,i) = \frac{1}{n(c_i)}$         The inverse of feature frequency
isf     $\Phi(i,i) = \frac{1}{\sigma_{V_i}}$   The inverse of STD of feature frequency
$I_n$   $\Omega(i,i) = 1$                      The $n \times n$ identity matrix
iwf     $\Omega(i,i) = \frac{1}{n(v_i)}$       The inverse of word frequency

Table 5.1. The list of diagonal metric and weight matrices. $n(c_i)$ is the frequency of seeing the contextual feature $c_i$ in the corpus $E = \{e_1, \ldots, e_T\}$, i.e., $n(c_i) = \sum_{t=1}^{T} \mathbf{c}_i(e_t)$, where $\mathbf{c}_i$ is the contextual variable corresponding to $c_i$. $n(v_i)$ is the frequency of seeing word $v_i$ in the corpus $E = \{e_1, \ldots, e_T\}$, i.e., $n(v_i) = \sum_{t=1}^{T} \mathbf{v}_i(e_t)$, where $\mathbf{v}_i$ is the word variable corresponding to $v_i$. $\sigma_{V_i}$ is the standard deviation (STD) of the $i$th random variable in the set of contextual word vectors, i.e., the $i$th row of the input matrix $X$ used in Algorithm 4.

If we use $(I_m, \mathrm{iwf})$, then

$$Y(i,j) \approx p(v_j \mid c_i) \qquad (5.2)$$

is an estimate of the conditional probability of seeing the word $v_j$ conditioned on $c_i$, i.e., $p(v_j = 1 \mid c_i = 1)$. If we use word forms as contextual features in Equation 5.2 and apply a seventh root transformation to the probability $p(v_j \mid c_i)$, then Equation 5.2 will be equivalent to Equation 3.6 used by Basirat and Nivre (2017). If we use $(\mathrm{iff}, \mathrm{iwf})$, then the elements of the product matrix $Y$ will be as follows:

$$Y(i,j) = \frac{n(c_i, v_j)}{n(c_i)\, n(v_j)} \qquad (5.3)$$

where $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Using the element-wise transformation function $f(Y) = \max(0, \log(TY))$, where $T$ is the corpus size, the element $(i,j)$ of $f(Y)$ will be equal to the positive pointwise mutual information between the contextual feature $c_i$ and the word $v_j$. In this case, if we set the contextual features to word forms, then Equation 5.3 is equivalent to Equation 3.4. When we use $(\mathrm{isf}, I_n)$, the product matrix $Y$ will be equal to the correlation matrix of $X$,


so Algorithm 2 performs PCA on the correlation matrix of contextual word vectors rather than their covariance matrix. The use of a correlation matrix instead of a covariance matrix is a way to mitigate the heterogeneous metric problem in observations, where the elements of the observation vectors are represented in different metric systems (e.g. kilogram and metre). The heterogeneous metric problem does not arise directly in contextual word vectors, since their elements are word frequencies. However, a similar problem can be imagined when one considers the imbalanced contributions of contextual features to the word vectors. For example, if we use word forms as contextual features, then due to the Zipfian distribution of words, a small number of the elements of contextual word vectors, associated with the most frequent words, will take very large values, while the majority of the elements, associated with the remaining words, will take relatively smaller values. These disproportionate values of contextual word vectors can be viewed as an instance of the heterogeneous metric problem in the observations.
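The weighting combinations above can be written in a few lines of linear algebra. The sketch below (not from the thesis) forms the product matrix $Y = \Phi X \Omega$ for the (iff, $I_n$), ($I_m$, iwf), and (iff, iwf) pairs on a made-up co-occurrence matrix and applies the element-wise transformation $f(Y) = \max(0, \log(TY))$ of Equation 5.3; the toy counts are assumptions for illustration only.

```python
import numpy as np

# Hypothetical small contextual (co-occurrence) matrix X: m features x n words.
X = np.array([[10., 2., 0.],
              [ 3., 5., 1.],
              [ 0., 1., 4.]])
T = X.sum()                                  # corpus size (total number of counted events)

Phi_iff = np.diag(1.0 / X.sum(axis=1))       # iff: inverse feature frequency (rows)
Omega_iwf = np.diag(1.0 / X.sum(axis=0))     # iwf: inverse word frequency (columns)

Y_51 = Phi_iff @ X                           # product matrix for (iff, I_n), cf. Equation 5.1
Y_52 = X @ Omega_iwf                         # product matrix for (I_m, iwf), cf. Equation 5.2
Y_53 = Phi_iff @ X @ Omega_iwf               # Equation 5.3: n(c_i, v_j) / (n(c_i) n(v_j))

# Element-wise transformation f(Y) = max(0, log(TY)) giving positive PMI values.
ppmi = np.maximum(0.0, np.log(np.clip(T * Y_53, 1e-12, None)))
print(ppmi.round(2))
```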

5.2.2 Transformation Function

Data transformation is a common preprocessing step in the principal component analysis of special data. It is also seen as a way to apply non-linear principal component analysis to a set of random variables. As discussed in Section 4.5, the eigenvalues of the covariance matrix of the mixture of contextual word vectors decay sharply as the number of features increases. This indicates a long elliptical distribution for the mixture of contextual word vectors, which is very wide along a few top eigenvectors and very narrow along the remaining eigenvectors (see Section 4.5). Principal component analysis of data with such a distribution is highly influenced by the data along the top eigenvectors and cannot capture enough information about the data variation along the other eigenvectors. In order to mitigate this problem, we propose compressing the data along the top eigenvectors and expanding them along the remaining eigenvectors while preserving the order of the eigenvectors with respect to their eigenvalues. This transformation reshapes the data distribution from a long elliptical distribution to an elliptical distribution that is more similar to the normal distribution with a diagonal covariance matrix. This can be achieved through the application of any monotonically increasing concave function that preserves the given order of the data and magnifies small numbers in its domain. Some examples of such transformation functions are the logarithm, the hyperbolic tangent, and the power transformation functions.

The expansion effect of the transformation function increases the entropy of the mixture of contextual word vectors. We use this property to tune the parameters of the transformation function. Using $\mathcal{V}$ to denote the mixture of contextual word vectors and $\mathbf{f}(\mathcal{V}; \theta)$ for the transformation function defined with the parameter set $\theta$, the optimal parameter set $\theta$ is one that maximizes


the entropy of the data:

$$\theta = \operatorname*{argmax}_{\theta}\; H(\mathbf{f}(\mathcal{V}; \theta)) \qquad (5.4)$$

The optimal value of $\theta$ in Equation 5.4 can be estimated by iterative optimization techniques such as genetic algorithms or simulated annealing with the objective function:

$$g(\theta) = H(\mathbf{f}(\mathcal{V}; \theta)) \qquad (5.5)$$

In order to compute the entropy of the mixture of contextual word vectors $\mathcal{V}$ in Equation 5.4, we need to compute the probability density function $f(\mathbf{f}(\mathcal{V}; \theta))$, since:

$$H(\mathbf{f}(\mathcal{V}; \theta)) = -\sum_{\mathcal{V}} f(\mathbf{f}(\mathcal{V}; \theta)) \log f(\mathbf{f}(\mathcal{V}; \theta)) \qquad (5.6)$$

We use Gaussian kernel density estimation (KDE) to calculate the probability density function $f$ (see Equation 2.34 in Section 2.2).

In cases where the dimensionality of $\mathcal{V} = (V_1, \ldots, V_m)$ is too high, we partition $\mathcal{V}$ into a limited number of vectors $\{b_1, \ldots, b_k\}$ and estimate the entropy as the mean of the entropies obtained from each of these vectors. Each vector $b_i$ of length $l$ is characterized by a set of random indices $s_i = \{i_1, \ldots, i_l\}$, which is a subset of $\{1, \ldots, m\}$ such that $\bigcup_{i=1}^{k} s_i = \{1, \ldots, m\}$ and for every $i \neq j$ we have $s_i \cap s_j = \emptyset$. The elements of $b_i$ are the sequence of elements of $\mathcal{V}$ with the indices in $s_i$, i.e., $b_i = (V_{i_1}, \ldots, V_{i_l})$. Given the set of vectors, we estimate the entropy in Equation 5.4 as below:

$$H(\mathbf{f}(\mathcal{V}; \theta)) \approx \frac{1}{k} \sum_{i=1}^{k} H(\mathbf{f}(b_i; \theta)) \qquad (5.7)$$

where $k$ is the number of partitions and $H(\mathbf{f}(b_i; \theta))$ is the entropy as in Equation 5.6. In this approach, the dependencies between the feature variables in each set $s_i$ $(i = 1, \ldots, k)$ are taken into consideration, but the dependencies between the feature variables in different sets are not considered. In other words, the sets of feature variables $s_1, \ldots, s_k$ are assumed to be independent of each other. Although this is not a correct assumption, it is sufficient for our purpose, since it covers part of the dependencies between the features and enables us to estimate the entropy efficiently. In addition, the independence assumption about the feature variables enables us to process the vectors $b_i$ $(i = 1, \ldots, k)$ in parallel. We leave the exact computation of the entropy to future work.
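A minimal sketch of this tuning procedure is given below (an illustration, not the thesis implementation). It restricts the transformation family to a simple power function $\mathbf{f}(x; \theta) = x^{\theta}$, estimates the entropy of random low-dimensional blocks with a Gaussian KDE as in Equation 5.7, and replaces the genetic or simulated-annealing search with a plain grid search; the toy count matrix and all parameter values are assumptions for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

def partition_entropy(Xt, block=2, rng=None):
    """Approximate the entropy as the mean entropy of random low-dimensional blocks
    of feature rows (cf. Equation 5.7); each block's density is a Gaussian KDE."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(Xt.shape[0])
    entropies = []
    for s in np.array_split(idx, max(1, Xt.shape[0] // block)):
        b = Xt[s, :]                                   # block: a few feature rows, words as samples
        b = b + 1e-6 * rng.standard_normal(b.shape)    # jitter to avoid a singular KDE covariance
        dens = np.clip(gaussian_kde(b)(b), 1e-12, None)
        entropies.append(-np.mean(np.log(dens)))       # resubstitution entropy estimate
    return float(np.mean(entropies))

def tune_power_transform(X, thetas=np.linspace(0.1, 1.0, 10)):
    """Grid search (instead of genetic algorithms or simulated annealing) for the
    exponent of f(x; theta) = x**theta that maximizes the entropy estimate."""
    scores = [partition_entropy(np.power(X, th), rng=0) for th in thetas]
    return thetas[int(np.argmax(scores))]

rng = np.random.default_rng(0)
rates = 50.0 / np.arange(1, 21)                        # Zipfian-like feature rates (20 features)
X = rng.poisson(rates[:, None], size=(20, 200)).astype(float)  # toy contextual matrix
print("selected exponent:", tune_power_transform(X))
```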

5.2.3 Eigenvalue Weighting Matrix

The eigenvalue weighting matrix $\Lambda$ is used on Line 5 of Algorithm 2. As shown in Equation 2.47, the final low-dimensional vectors obtained from Algorithm 2 are linearly influenced by $\Lambda$. This matrix is used to control the variance of the principal components along the selected eigenvectors.


It is also used to choose the desired number of dimensions of the principal components. A study of how to fill the elements of $\Lambda$ is provided by Jolliffe (2002, Chapter 6.3). In this section, we present two eigenvalue weighting matrices that are used in our experiments. The first is the matrix used in classic PCA. Assuming that the singular values on the diagonal elements of the matrix $\Sigma$ in Algorithm 2 are sorted in descending order, the classic PCA algorithm defines the matrix $\Lambda$ as

$$\Lambda = \sqrt{n-1}\begin{bmatrix} I_k & 0 \\ 0 & 0 \end{bmatrix} \qquad (5.8)$$

where $n$ is the number of observations (i.e. the number of words), $I_k$ is the $k \times k$ identity matrix, and $0$ is the matrix of zeros. The constant coefficient $\sqrt{n-1}$ serves to compute eigenvalues from singular values. This matrix chooses the top $k$ eigenvalues and their corresponding eigenvectors, which account for most of the variation in the data matrix, e.g., the contextual matrix. Substituting Equation 5.8 into Equation 2.47, we reach:

$$Y = \sqrt{n-1}\, \Sigma^{(k,k)} V^{(k,n)T} \qquad (5.9)$$

where $\Sigma^{(k,k)}$ is a $k \times k$ matrix consisting of the top $k$ rows and columns of $\Sigma$, and $V^{(k,n)}$ is a $k \times n$ matrix consisting of the top $k$ rows and $n$ columns of $V$. The second is the matrix inspired by the word embedding approach proposed by Basirat and Nivre (2017). In this approach, the effect of the eigenvalues on the data variance is completely ignored. Thus, the principal components have the same variance along all eigenvectors. This can be implemented by an eigenvalue weighting matrix whose elements are inversely proportional to the singular values in $\Sigma$:

$$\Lambda = \alpha\sqrt{n-1}\begin{bmatrix} \Sigma^{(k,k)^{-1}} & 0 \\ 0 & 0 \end{bmatrix} \qquad (5.10)$$

where $\alpha$ is a constant, $n$ is the number of observations (i.e. the number of words), and $\Sigma^{(k,k)^{-1}}$ is the inverse of the matrix $\Sigma^{(k,k)}$ consisting of the top $k$ rows and $k$ columns of $\Sigma$. This matrix results in an identical variance of $\alpha^2$ along the top $k$ eigenvectors. In other words, the resulting principal components are normally distributed with mean vector $0$ and covariance matrix $\alpha^2 I_k$. The fact that the elements of the principal components are independent of each other and follow the normal distribution makes them suitable to be used in neural networks (LeCun et al., 2012). The parameter $\alpha$ should be set to the standard deviation of the initial weights of the neural network. Substituting Equation 5.10 into Equation 2.47, we obtain:

$$Y = \alpha\sqrt{n-1}\, V^{(k,n)T} \qquad (5.11)$$


where $V^{(k,n)}$ is a $k \times n$ matrix consisting of the top $k$ rows and $n$ columns of $V$. In this case, the principal word vectors are the top $k$ right singular vectors of the mean-centred contextual matrix scaled by the coefficient $\alpha\sqrt{n-1}$.
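The two weightings differ only in how the SVD factors of the mean-centred matrix are recombined. The sketch below (not from the thesis) computes both variants directly from Equations 5.9 and 5.11 on a toy matrix; the input data and the value of alpha are assumptions for illustration.

```python
import numpy as np

def principal_word_vectors(X, k, alpha=None):
    """k-dimensional word vectors from the data matrix X (features x words), using
    the classic weighting of Eq. 5.9 (alpha is None) or the equalized-variance
    weighting of Eq. 5.11 (variance ~ alpha**2 along each of the k dimensions)."""
    n = X.shape[1]                                    # number of words (observations)
    Xc = X - X.mean(axis=1, keepdims=True)            # mean-centre the rows
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    if alpha is None:
        return np.sqrt(n - 1) * (S[:k, None] * Vt[:k, :])   # Eq. 5.9
    return alpha * np.sqrt(n - 1) * Vt[:k, :]               # Eq. 5.11

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(50, 200)).astype(float)          # toy contextual matrix
W_classic = principal_word_vectors(X, k=10)
W_flat = principal_word_vectors(X, k=10, alpha=0.1)
print(W_classic.shape, W_flat.shape)
print(W_flat.std(axis=1)[:3])                               # close to alpha along each dimension
```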

5.3 Centred Singular Value Decomposition

As explained in Section 2.3, the set of principal components of a mean-centred matrix can be efficiently computed from the singular values and singular vectors of the matrix. The process of mean centring and singular value decomposition can be very costly if the matrix being processed is a large matrix such as a contextual matrix (see Section 5.1). In this section, we propose a method of singular value decomposition that combines both the mean centring and the singular value decomposition in a single step. This method enables us to approximate the principal components of a large matrix in an efficient way.

Let $X$ be an $m \times n$ matrix, $E$ an $m$-dimensional vector, and $\mathbf{1}_n$ an $n$-dimensional vector of ones. Algorithm 5, inspired by the randomized matrix factorization method introduced by Halko et al. (2011), returns a rank-$k$ approximation of the singular value decomposition of the data matrix $\mathbf{X} = X - E\mathbf{1}_n^T$, whose columns are centred around the vector $E$. The algorithm consists of three main steps.

The first step is to approximate a rank-$K$ basis matrix $Q_1$ $(k < K \ll n)$ that captures most of the information in the input matrix $X$. Halko et al. (2011) propose setting the parameter $K = \min(m, 2k)$. The rank-$K$ basis matrix $Q_1$ is estimated on lines 2–4 of Algorithm 5. On Line 2, a random matrix is drawn from the standard Gaussian distribution. This random matrix is then used on Line 3 to form an $m \times K$ sample matrix $X_1$ consisting of $K$ $m$-dimensional vectors. The columns of $X_1$ are independent random points sampled from the range of $X$. By way of illustration, as shown in Equation 5.12, the $i$th element of the $j$th sampled vector, $X_1(i,j)$, is the linear combination of the $i$th elements of all vectors in $X$ with the random linear coefficients in the $j$th column of $\Omega$:

$$X_1(i,j) = \sum_{k=1}^{n} X(i,k)\, \Omega(k,j) \qquad (5.12)$$

Hence, the column vectors of $X_1$ are in the same space as the column vectors of $X$. This basically means that the basis matrix of the column space of $X_1$ is approximately equivalent to the basis matrix of the column space of $X$. This basis matrix is the $Q_1$ matrix of the QR factorization of the matrix $X_1$ (see Line 4):

$$X_1 = Q_1 R_1 \qquad (5.13)$$

Since $X_1$ in Equation 5.13 is an $m \times K$ matrix, $Q_1$ will be an $m \times m$ matrix and $R_1$ will be an $m \times K$ matrix if we use one of the standard methods of QR factorization such as the Householder algorithm (Householder, 1958).


The last $m - K$ rows of $R_1$ are zero, and only the first $K$ rows contain non-zero elements. In many practical usages of word embeddings, the number of contextual features $m$ is much larger than $K$ (i.e. $K \ll m$), so $m - K$ is large and many rows of $R_1$ are zero. In this case, it is more efficient to omit the zero rows of $R_1$ and their corresponding columns in $Q_1$. This factorization, in which the zero rows of $R_1$ and their corresponding columns in $Q_1$ are omitted, is called an economy-size QR factorization. Using the economy-size QR factorization, $Q_1$ will be an $m \times K$ matrix and $R_1$ will be a $K \times K$ matrix. Since in practice $K \ll m$, we are more interested in the economy-size QR factorization of $X_1$ than in the full QR factorization, because the economy-size QR factorization can be computed more efficiently than the full QR factorization.

However, the efficiency gain obtained from the economy-size QR factorization comes at the cost of a loss of accuracy. If we consider the column vectors of $Q_1$ in Equation 5.13 to be the basis matrix of the column space of $X_1$, then we have

$$X_1 - Q_1 Q_1^T X_1 = [0]_{m \times K} \qquad (5.14)$$

When we use the economy-size QR factorization, we lose part of the information provided by the omitted columns of $Q_1$, so the right-hand side of Equation 5.14 will not be the zero matrix but an approximation of the zero matrix. In this case, the columns of $Q_1$ will only span a $K$-dimensional subspace of the column space of $X_1$. In other words, we have

$$X_1 - Q_1 Q_1^T X_1 \approx [0]_{m \times K} \qquad (5.15)$$

Denoting the Frobenius norm of a matrix by $\|\cdot\|_F$ (see Equation 2.11), if we define the reconstruction error made by the basis matrix $Q_1$ in approximating the range of $X_1$ as

$$\varepsilon = \left\| X_1 - Q_1 Q_1^T X_1 \right\|_F \qquad (5.16)$$

then the error decreases as the value of $K$ is increased. This is because, as we increase $K$, $Q_1$ becomes closer to the full matrix obtained from the full QR factorization. If we set $K$ to $m$, then the reconstruction error in Equation 5.16 is $0$ and we recover the equality in Equation 5.14. This, however, comes at the cost of using more memory and CPU time to compute the QR factorization. Hence, we set $K$ to a reasonably small number such as

$$K = \min(2k, m) \qquad (5.17)$$

where $m$ is the dimensionality of the mixture of contextual word vectors and $k$ is the desired number of dimensions.

Since the column vectors of $X_1$ and $X$ are in the same space, $Q_1$ will also approximate the basis of the column space of $X$. In other words, from Equation 5.15, we have

$$X \approx Q_1 Q_1^T X \qquad (5.18)$$


where we replace $X_1$ with $X$. Due to the relatively smaller size of $X_1$ in comparison to the size of $X$ ($m \times K$ versus $m \times n$), the economy-size QR factorization of $X_1$ can be computed in a more efficient way than the QR factorization of $X$. This enables us to approximate the basis of the column space of $X$ in an efficient way.

In order to compute the basis of the mean-centred matrix $\mathbf{X} = X - E\mathbf{1}_n^T$, we update the parameter $Q_1$ with regard to the mean vector $E$. Line 5 uses the QR-update algorithm proposed by Golub and Van Loan (1996, p. 607) to update the QR factorization $X_1 = Q_1 R_1$ in Equation 5.13 with respect to the input vector $E$. For a given QR factorization such as $Q_1 R_1 = X_1$ and two vectors $u$ and $v$, the QR-update algorithm computes the QR factorization of

$$QR = X_1 + uv^T \qquad (5.19)$$

by updating the already available factors $Q_1$ and $R_1$. Replacing $u$ with $-E$ and $v$ with $\mathbf{1}_n$, the QR-update on Line 5 returns the matrix $Q$ that captures most of the information in the mean-centred matrix $\mathbf{X} = X - E\mathbf{1}_n^T$. Since $Q_1$ is an approximation of the basis matrix of the column space of $X$, the matrix $Q$ can also be considered an approximation of the basis matrix of the column space of $\mathbf{X}$:

$$\mathbf{X} \approx QQ^T\mathbf{X} \qquad (5.20)$$

where $\mathbf{X}$ is the mean-centred matrix. Note that we compute the basis matrix of the mean-centred matrix $\mathbf{X}$ without explicitly building the matrix $\mathbf{X}$.

Algorithm 5 The rank-k approximation of the singular value decomposition of $X - E\mathbf{1}_n^T = U\Sigma V^T$.

1: procedure CENTRED-SVD(X, E, k, K)
2:   Draw an n × K standard Gaussian matrix Ω
3:   Form the sample matrix X1 ← XΩ
4:   Compute the economy-size QR factorization X1 = Q1R1
5:   Compute QR = Q1R1 − E1_n^T using the QR-update algorithm
6:   Form Y ← Q^T X − Q^T E1_n^T
7:   Compute the singular value decomposition Y = U1ΣV^T
8:   U ← QU1
9:   return (U, Σ, V)
10: end procedure

The second step is to project the matrix $\mathbf{X}$ onto the space spanned by $Q$, i.e., $Y = Q^T\mathbf{X}$. This step is performed on Line 6 by applying the distributive property of multiplication over addition as follows:

$$Y = Q^T X - Q^T E\mathbf{1}_n^T \qquad (5.21)$$

Note that on Line 6, instead of applying the product operator on $Q$ and $\mathbf{X} = X - E\mathbf{1}_n^T$, we use $Q^T X - Q^T E\mathbf{1}_n^T$. This is because the amount of memory required by the latter approach is significantly smaller than the amount of memory required by the former approach. In the former approach, one needs to build the dense $m \times n$ matrix $\mathbf{X} = X - E\mathbf{1}_n^T$, but in the latter approach we only deal with the two $K \times n$ matrices $Q^T X$ and $Q^T E\mathbf{1}_n^T$. Since $K \ll m$ and in many applications the number of observations $n$ is larger than or equal to (or at least comparable to) the number of variables $m$ ($m \leq n$), the latter approach is more efficient. Moreover, since the matrix $Q^T E\mathbf{1}_n^T$ repeats the column vector $Q^T E$ in all its columns, instead of building this matrix, the column vector $Q^T E$ can be efficiently subtracted from all columns of $Q^T X$ in a loop.

Finally, in the third step, on Line 7, the SVD factors of $\mathbf{X}$ are estimated from the $K \times n$ matrix $Y$ in two steps. First, the rank-$k$ SVD approximation of $Y$ is computed using a standard method of singular value decomposition:

$$Y = U_1 \Sigma V^T \qquad (5.22)$$

Then the left singular vectors are updated by $U \leftarrow QU_1$, resulting in $U\Sigma V^T = QY$ (Line 8). Replacing $Y$ with $Q^T\mathbf{X}$ and given that $\mathbf{X} \approx QQ^T\mathbf{X}$ (see Equation 5.20), we have $U\Sigma V^T \approx \mathbf{X}$, which is the rank-$k$ approximation of $\mathbf{X}$.

The differences between Algorithm 5 and the original algorithm proposed by Halko et al. (2011) are in Lines 5 and 6. As explained above, these lines make it possible to compute the singular value decomposition of the mean-centred data matrix without explicitly constructing the matrix in memory. This makes it easier to perform PCA on large sparse data matrices, since the centring step and the SVD step on Lines 3 and 4 of Algorithm 2, respectively, are merged.
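A minimal NumPy sketch of the idea behind Algorithm 5 is given below (an illustration, not the thesis implementation). Instead of the QR-update of Golub and Van Loan used on Line 5, it simply appends $-E$ to the sample matrix before the QR factorization, which likewise makes $Q$ span the range of the mean-centred matrix; the toy data, the choice of $K$, and this substitution are assumptions made for the example.

```python
import numpy as np

def centred_svd(X, E, k, K=None, rng=None):
    """Rank-k SVD of X - E @ 1_n^T without ever forming the dense centred matrix.
    The mean direction -E is appended to the sample matrix before the QR step,
    standing in for the QR-update used on Line 5 of Algorithm 5."""
    rng = np.random.default_rng(rng)
    m, n = X.shape
    K = K or min(2 * k, m)
    Omega = rng.standard_normal((n, K))                       # Line 2: Gaussian test matrix
    X1 = X @ Omega                                            # Line 3: sample the range of X
    Q, _ = np.linalg.qr(np.hstack([X1, -E.reshape(-1, 1)]))   # Lines 4-5 (see note above)
    Y = Q.T @ X - np.outer(Q.T @ E, np.ones(n))               # Line 6: project without centring X
    U1, S, Vt = np.linalg.svd(Y, full_matrices=False)         # Line 7
    return (Q @ U1)[:, :k], S[:k], Vt[:k, :]                  # Line 8 and the rank-k truncation

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(300, 1000)).astype(float)          # toy count matrix
E = X.mean(axis=1)
U, S, Vt = centred_svd(X, E, k=20, rng=0)
S_exact = np.linalg.svd(X - E[:, None], compute_uv=False)[:20]
print(S[:5].round(1))        # approximate leading singular values of the centred matrix
print(S_exact[:5].round(1))  # exact values for comparison; accuracy improves with larger K
```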

5.4 Summary

In this chapter, we introduced principal word vectors as a low-dimensional representation of a mixture of contextual word vectors. We showed that the distribution of a mixture of contextual word vectors is far from the normal distribution and that it poses problems for PCA. This is due to some limiting factors of using PCA with non-normal random vectors. We proposed using a generalized PCA (GPCA) to mitigate the problem with the data distribution. The GPCA algorithm uses a transformation function to remedy the problem with the data distribution. It also introduces weighting matrices that make it possible to use extra knowledge in the process of generating principal word vectors. These matrices cover different types of weighting and transformation mechanisms in the literature. In addition, we proposed using an efficient algorithm of singular value decomposition (SVD) that makes it possible to process very large data matrices. The SVD algorithm incorporates the mean-centring step of PCA without explicitly computing the dense centred matrix. This enables us to compute the principal components of large sparse matrices in an efficient way.


6. Connections with Other Methods

In this chapter, we examine how the principal word embedding method is connected with other popular word embedding methods. We also answer our third research question: how are the limiting factors of PCA avoided or handled in other word embedding methods (see Section 1.2)? As explained before, two shortcomings of using PCA for word embedding are

• the skewed distribution of mixtures of contextual word vectors (see Section 4.5), and

• the large size of contextual matrices sampled from mixtures of contextual word vectors (see Section 5.1).

The former issue is mitigated by applying a transformation function to the word vectors (see Section 5.2.2). As will be explained in this chapter, this factor is sometimes ignored by other word embedding methods, and sometimes it is mitigated by applying a transformation function to word frequencies. The latter issue concerns the space and computational complexity of contextual matrices. Principal component analysis of a large contextual matrix needs a large amount of memory and CPU time. This issue too is usually ignored by other word embedding methods. However, some word embedding methods try to mitigate this issue through software engineering or sequential processing techniques. The principal word embedding method uses an SVD algorithm that makes it possible to compute the PCA of a large contextual matrix in an efficient way (see Section 5.3).

As mentioned in Section 3.2, word embedding methods can be classified into two major groups: methods developed in the area of distributional semantics, and methods developed in the area of language modelling. Since the principal word embedding method is closer to the methods developed in the area of distributional semantics, we study the relationship between principal word embeddings and some of the methods developed in that area. Among the many word embedding methods, we choose those methods that are described in Section 3.2.2: HAL (Lund and Burgess, 1996), RI (Sahlgren, 2006), HPCA (Lebret and Collobert, 2014), GloVe (Pennington et al., 2014), and RSV (Basirat and Nivre, 2017). As shown in Algorithm 3, these methods involve three main steps. First, they build a contextual (co-occurrence) matrix. Then this matrix undergoes a transformation function. Finally, a set of low-dimensional word vectors is extracted from the transformed matrix by a dimensionality reduction technique.

In the following sections, we explain how the principal word embedding method is related to the other methods. We examine the relationships between the principal word embedding method and the methods of HAL and HPCA together. The connections between the principal word embedding method and the other methods are studied in individual sections.

6.1 HAL and HPCA

This section is focused on how the principal word embedding method is related to HAL (Lund and Burgess, 1996) and HPCA (Lebret and Collobert, 2014). In general, HAL and HPCA follow the same approach as the principal word embedding method. They build a matrix of low-dimensional word vectors by performing PCA on a matrix of contextual word vectors. Contextual word vectors in HAL are built from word co-occurrences that are weighted with a window-based symmetric local weighting function. Lund and Burgess (1996) claim that the co-occurrence matrix formed by these vectors can be fairly well estimated by a small number of principal components. HAL uses PCA to train a matrix of low-dimensional word vectors from a co-occurrence matrix. However, Lund and Burgess (1996) do not clarify how PCA is applied to the co-occurrence matrix. Since they do not mention any transformation step, we conclude that no transformation is carried out on the co-occurrence matrix. They also do not provide a clear description of how a co-occurrence matrix is analysed in terms of required memory and processing time. This means that HAL does not apply any special techniques to process very large co-occurrence matrices. In general, we conclude that HAL ignores both limiting factors of using PCA for word embedding.

HPCA, by contrast, does address the limiting factors of PCA for word embedding. HPCA generalizes the word embedding approach adopted by HAL. In HPCA, the co-occurrence matrix in HAL is replaced with a probability co-occurrence matrix as described in Section 3.2.2. Lebret and Collobert (2014) suggest transforming the elements of the probability co-occurrence matrix through a Hellinger transformation (see Equation 3.5). This transformation reshapes the distribution of the co-occurrence data in such a way that the data becomes more suitable for performing PCA. The transformed matrix then undergoes PCA. Given a probability co-occurrence matrix $M$, the low-dimensional vectors in HPCA are generated by:

$$Y = \Sigma^{(k,k)} V^{(k,n)T} \qquad (6.1)$$

where $Y$ is the matrix of $k$-dimensional data, $U\Sigma V^T$ is the singular value decomposition of the element-wise square root of $M$, $\sqrt{M}$, and $A^{(a,b)}$ represents a sub-matrix of $A$ consisting of the top $a$ rows and top $b$ columns of $A$. Lebret and Collobert (2014) ignore the mean-centring step in PCA. This preserves the sparsity of the transformed matrix and makes it possible to process the matrix in an efficient way. However, performing PCA on non-centred data makes the results of PCA unreliable, especially for the dominant principal components. Lebret and Collobert (2014) perform a normalization step after performing PCA. Denoting the empirical mean of the column vectors of $Y$ in Equation 6.1 as $\bar{Y} \approx E(Y)$ and their standard deviation as $\sigma(Y)$, Lebret and Collobert (2014) suggest normalizing the elements of the word vectors to have zero mean and a fixed standard deviation of $\lambda \leq 1$:

$$\Upsilon = \frac{\lambda (Y - \bar{Y})}{\sigma(Y)} \qquad (6.2)$$

When using word embeddings with a neural network, this normalization helps to prevent the network weights from becoming saturated (LeCun et al., 2012). The numerator of the fraction in Equation 6.2 centres the final word embeddings around their mean. In other words, HPCA carries out a mean-centring step after performing PCA on $\sqrt{M}$. If the vectors in $\sqrt{M}$ had been centred before performing PCA, then $\bar{Y}$ would be equal to $0$. Substituting Equation 6.1 into Equation 6.2, together with the facts that $\bar{Y} = 0$ and $\sigma(Y) = \frac{1}{\sqrt{n-1}}\Sigma^{(k,k)}$, where $n$ is the number of words, we obtain:

$$\Upsilon = \lambda\sqrt{n-1}\, V^{(k,n)T} \qquad (6.3)$$

which is equivalent to Equation 5.11 representing the $k$-dimensional principal word vectors.

In general, the above argument shows that HAL and HPCA are special cases of principal word embedding. Principal word embedding differs from HAL and HPCA in the following ways.

• It generalizes the co-occurrence matrix to a contextual matrix, which enables the principal word embedding method to model different types of contexts and different types of contextual features. In other words, principal word embedding can generate word embeddings for both raw and annotated corpora with different types of context. By contrast, HAL and HPCA can only process raw corpora with window-based contexts.

• It generalizes the transformation step to an adaptive transformation function whose parameters are automatically tuned with regard to a training corpus. By contrast, HAL does not use any transformation function and HPCA uses a fixed transformation function.

• It centres the column vectors of the transformed contextual matrix before performing PCA and computes the principal components of a mean-centred matrix. HPCA, however, postpones the mean-centring until after performing PCA, and the description of HAL does not mention whether mean-centring is performed or not.

With regard to the limiting factors of PCA for word embedding, we conclude that HAL ignores both limiting factors. HPCA, however, addresses and provides solutions to both limiting factors. The first limiting factor, related to the data distribution, is mitigated by performing a Hellinger transformation on the co-occurrence data. The second limiting factor, related to the memory and processing units required to process co-occurrence data, is addressed by a randomized method of SVD that speeds up the processing of the co-occurrence data. HPCA also avoids data centring in PCA to preserve the sparsity of the data. This also helps make efficient use of memory.

6.2 Random Indexing (RI)

In this section, we examine the relationships between the principal word embedding method and RI (Sahlgren, 2006). We also study how the two restricting factors of PCA for word embedding mentioned in the beginning of this chapter are addressed in RI. One of the restricting factors is related to the distribution of contextual word vectors. This is addressed in RI by a context selection method that eliminates highly frequent context words, i.e., function words that contain grammatical information, and keeps the frequent content words. In terms of contextual word vectors, RI uses word forms as contextual features and eliminates function words from the list of contextual features. This is equivalent to multiplying a co-occurrence matrix $M$ by a diagonal matrix $\Phi$ defined as

$$\Phi(i,i) = \begin{cases} 1 & \text{if } i \text{ corresponds to a selected context word (a frequent content word)} \\ 0 & \text{otherwise} \end{cases} \qquad (6.4)$$

Using the matrix Φ, the matrix obtained from the context selection in RI is:

$$\tilde{M} = \Phi M \qquad (6.5)$$

The matrix $\Phi$ plays the same role as a metric matrix in the principal word embedding method (see Section 5.1). This shows that the context selection in RI is a special case of weighting in principal word embedding. Unlike principal word embedding, RI does not perform any other transformation on the matrix $\tilde{M}$ in Equation 6.5. Another difference between principal word embedding and RI is in the method of dimensionality reduction. Sahlgren (2006) suggests reducing the dimensionality of a co-occurrence matrix through random projection (Achlioptas, 2001). Random projection is based on the Johnson-Lindenstrauss lemma, stating that any point in a high-dimensional space can be projected to a lower-dimensional space while the pointwise distances between the point and other points in the original space are retained. It uses a random projection matrix whose unit-length rows, forming the lower-dimensional space, are almost orthogonal to each other. RI employs this idea of near orthogonality and associates each context unit (word) in the language with a low-dimensional unit-length random vector. For each word in a vocabulary list, RI scans a training corpus and accumulates the context vectors associated with the context words in the contextual environment of the word.


The accumulation of context vectors is equivalent to multiplying the matrix $\tilde{M}$ in Equation 6.5 by a random projection matrix whose columns correspond to the random context vectors. RI performs this accumulation step in a sequential way while scanning the training corpus. This makes RI very efficient in terms of memory usage and makes it possible to process very large co-occurrence matrices. This is the way that RI solves the issues related to processing large co-occurrence matrices, as mentioned in the beginning of this chapter.
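The sequential accumulation can be sketched in a few lines (an illustration, not Sahlgren's implementation). Each context word receives a sparse, nearly orthogonal index vector, and the embedding of a word is the running sum of the index vectors of its neighbours; the toy corpus, the dimensionality, and the ternary index vectors are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # dimensionality of the random index vectors

def index_vector(d, nnz=4, rng=rng):
    """Sparse ternary index vector with a few +1/-1 entries, nearly orthogonal to
    the others with high probability (in the spirit of Achlioptas-style projections)."""
    v = np.zeros(d)
    pos = rng.choice(d, size=nnz, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nnz)
    return v

tokens = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(tokens))
index_vectors = {w: index_vector(d) for w in vocab}     # one random vector per context word
embeddings = {w: np.zeros(d) for w in vocab}

# Sequential accumulation: for each token, add the index vectors of its neighbours
# (a symmetric window of width 1); no co-occurrence matrix is ever materialized.
for t, w in enumerate(tokens):
    for tau in (-1, 1):
        if 0 <= t + tau < len(tokens):
            embeddings[w] += index_vectors[tokens[t + tau]]

print(embeddings["cat"][:8])
```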

We sum up the relationships between RI and the principal word embedding method as follows. In the principal word embedding method, we associate context words with fully orthogonal vectors, as opposed to RI, which associates context words with nearly orthogonal vectors. The transformation step in the principal word embedding method is more sophisticated than the transformation step in RI (see Equation 6.5). This addresses the first limiting factor of PCA for word embedding, which deals with the distribution of co-occurrence data. Principal word embedding makes use of PCA to reduce the dimensionality of contextual word vectors, but RI uses a sequential random projection for this aim. This addresses the second limiting factor of PCA for word embedding and enables RI to process very large co-occurrence matrices.

6.3 GloVe

This section is focused on the relationships between the principal word embedding method and GloVe (Pennington et al., 2014). It also explains how GloVe addresses the two shortcomings of using PCA for word embedding noted in the preamble of this chapter.

Similar to the other word embedding methods discussed in this chapter, GloVe extracts a set of low-dimensional word vectors from a co-occurrence matrix. GloVe addresses the first limiting factor of using PCA for word embedding, dealing with the distribution of co-occurrence data, by a transformation function. It applies a predefined transformation function to the co-occurrence matrix in order to diminish the disproportionate effect of very large and very small numbers in the co-occurrence matrix. The small numbers in the co-occurrence matrix undergo a power transformation and the remaining elements undergo a logarithmic transformation function. The threshold boundary determining small and large numbers is defined by end users. Unlike GloVe, principal word embedding uses an adaptive transformation function to resolve the problem of the disproportionate effect of frequently/rarely used words (see Section 5.2.2). The transformation step in principal word embedding is tuned automatically based on the distribution of co-occurrence data and normalizes the distribution to make it more suitable for PCA. In addition, principal word embedding introduces a weighting mechanism which makes it possible to model different types of linear transformations used in the literature (see Section 5.2.1). Moreover, it allows for using different types of contexts and different types of contextual features. This is achieved by formulating the concept of context variables (see Section 4.2), according to which context is defined as a function and a contextual feature can be any symbol that describes a word. GloVe, by contrast, only supports the window-based context with word forms as contextual features. Principal word embedding is therefore more flexible than GloVe in the types of co-occurrence matrices it admits.
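As an illustration of the kind of transformation just described, the sketch below power-transforms the small elements of a co-occurrence matrix and log-transforms the remaining ones. The threshold x_max and the exponent alpha are user-defined placeholders, and the function only mirrors the description above; it is not the exact formulation used in the GloVe software.

```python
import numpy as np

def piecewise_transform(M, x_max=100.0, alpha=0.75):
    """Power-transform co-occurrence counts below x_max and log-transform the rest.

    x_max and alpha are user-defined; this mirrors the description of GloVe's
    treatment of small and large counts rather than its exact implementation."""
    M = np.asarray(M, dtype=float)
    out = np.zeros_like(M)
    small = M < x_max
    out[small] = (M[small] / x_max) ** alpha    # damp the effect of rare co-occurrences
    out[~small] = np.log(M[~small])             # compress very large counts
    return out

counts = np.array([[1.0, 5.0, 250.0], [0.0, 120.0, 3.0]])
print(piecewise_transform(counts))
```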

Once a co-occurrence matrix is built, GloVe uses a global log-bilinear regression model to generate a set of low-dimensional word vectors. Pennington et al. (2014) argue that a matrix of word vectors ϒ will have the following property:

$$\Upsilon^{T}\Upsilon = f(M) + b\mathbf{1}_{n}^{T} \qquad (6.6)$$

where the n × n matrix M is a co-occurrence matrix, f is the transformation function used by GloVe, the n-dimensional vector b is a bias vector, and 1n is the n-dimensional vector of ones; n is the number of words and the columns of the matrix ϒ correspond to the n word vectors. Denoting the ith column of ϒ as ϒi and assuming ‖ϒi‖ = 1 for i = 1, ..., n, the left-hand side of Equation 6.6 measures the cosine similarity between the unit-length word vectors ϒi in a kernel space and the right-hand side is the corresponding kernel matrix. With this formulation, the matrix of k-dimensional word vectors ϒ can be estimated by kernel principal component analysis (Schölkopf et al., 1998) of the kernel matrix

$$K = f(M) + b\mathbf{1}_{n}^{T} \qquad (6.7)$$

Kernel PCA is a variant of PCA that enables the extraction of non-linear relations from a set of random variables. Using kernel PCA, a k-dimensional estimate of the word vectors in ϒ is

$$\Upsilon = \sqrt{\Sigma_{(k,k)}}\, V_{(k,n)}^{T} \qquad (6.8)$$

where UΣV^T = K is the singular value decomposition of the n × n kernel matrix K in Equation 6.7, Σ(k,k) is the diagonal matrix of the top k singular values in Σ, and V(k,n) is the sub-matrix of V consisting of the top k rows and top n columns of V. Let M = UΣV^T be the singular value decomposition of the co-occurrence matrix M. If we use the second degree polynomial kernel K = M^T M instead of Equation 6.7, then the singular value decomposition of K will be

$$K = M^{T}M = V\Sigma^{T}U^{T}U\Sigma V^{T} = V\Sigma^{2}V^{T} \qquad (6.9)$$

where V is the right singular matrix of M and Σ is the diagonal matrix of singular values of M. Applying the SVD in Equation 6.9 to Equation 6.8, the low-dimensional word vectors in ϒ are in the same directions as the word vectors generated by the principal word embedding method in Equation 5.9. This shows that if we replace the kernel matrix in Equation 6.7 with the second degree polynomial kernel K = M^T M, then the word vectors generated by GloVe and the word vectors generated by the principal word embedding method are distributed in the same directions but with different variances.
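The relationship between Equation 6.8 and Equation 6.9 can be checked numerically. The sketch below uses a small random matrix as a stand-in for a transformed co-occurrence matrix and verifies that the polynomial kernel K = M^T M has the squared singular values of M, and that the kernel-PCA word vectors of Equation 6.8 point in the same directions as the top right singular vectors of M. The matrix size, the random data, and the use of NumPy are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
M = rng.random((n, n))                         # stand-in for a transformed co-occurrence matrix

_, s, Vt = np.linalg.svd(M)                    # M = U Sigma V^T
K = M.T @ M                                    # second degree polynomial kernel (Equation 6.9)
_, sk, Vkt = np.linalg.svd(K)                  # K = V Sigma^2 V^T

# The kernel's singular values are the squared singular values of M ...
assert np.allclose(sk, s ** 2)

# ... and the kernel-PCA word vectors of Equation 6.8 lie in the same directions as the
# top-k right singular vectors of M, up to sign and scale.
Y_kernel = np.sqrt(sk[:k])[:, None] * Vkt[:k, :]
Y_directions = Vt[:k, :]
cos = np.abs(np.sum(Y_kernel * Y_directions, axis=1)) / (
    np.linalg.norm(Y_kernel, axis=1) * np.linalg.norm(Y_directions, axis=1))
assert np.allclose(cos, 1.0)
```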

Now we turn our attention to how the limiting factors of using PCA for word embedding are handled in GloVe. The problem related to the data distribution is handled by a logarithmic and a power transformation, whose parameters are manually tuned by end users. The other problem, related to the size of a co-occurrence matrix and the computational resources required to process this matrix, is handled by software engineering techniques. GloVe uses a buffering technique that avoids storing the entire matrix in the main memory. Instead, it keeps the information related to the most frequent words in memory and stores the remaining information on disk. Due to the Zipfian distribution of words, the most frequent words account for a small fraction of the vocabulary and can easily be kept in the main memory. Since the frequencies of the less frequent words, kept on disk, are much smaller than those of the words kept in the main memory, the frequency of disk accesses is relatively small, and the matrix can be processed in reasonable time and with a reasonable amount of memory. This enables GloVe to process a co-occurrence matrix in an efficient way.

6.4 RSV

In this section, we examine the connection between the principal word embedding method and RSV (Basirat and Nivre, 2017). RSV forms a set of word vectors using the right singular vectors of a mean-centred co-occurrence matrix that undergoes a seventh root transformation function. The transformation function normalizes the co-occurrence data and makes the data suitable for the subsequent singular value decomposition. This addresses the problem of the data distribution limiting the usage of PCA for word embeddings, noted in the beginning of this chapter. RSV does not provide any solution for the other limiting factor, which is related to the size of the co-occurrence matrix.

The principal word embedding method generalizes the seventh root transformation function adopted by RSV to an adaptive power transformation function whose degree is determined based on the distribution of co-occurrence data. In addition, it generalizes the concepts of context and contextual features in a way that makes it easier to process different types of contexts and contextual features. Moreover, the principal word embedding method uses a singular value decomposition algorithm that enables it to compute the principal components of a co-occurrence matrix in an efficient way. It also uses several weight matrices, giving the method enough flexibility to model different types of transformations in the literature (see Section 5.2.1). Principal word embedding introduces an eigenvalue weight matrix Λ, which provides a way to control the spread of principal word vectors. Given the SVD of a transformed co-occurrence matrix M = UΣV^T, as explained in Section 5.2.3, the principal word embeddings are formed by the right singular vectors of M, which is V, if we use

$$\Lambda = \alpha\sqrt{n-1}\,\begin{bmatrix} \Sigma_{(k,k)}^{-1} & 0 \\ 0 & 0 \end{bmatrix} \qquad (6.10)$$

where α is a constant, n is the number of words, and Σ(k,k)^{-1} is the inverse of the matrix Σ(k,k) consisting of the top k rows and k columns of Σ. Given that the word vectors in RSV are formed by the right singular vectors of a mean-centred co-occurrence matrix, the eigenvalue weighting matrix Λ in principal word embedding generalizes the way that RSV forms low-dimensional word vectors. This confirms that principal word embedding generalizes different aspects of RSV, such as the transformation step and the dimensionality reduction step.
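The cancellation behind this choice of Λ can be made explicit. Equation 5.9 is not reproduced in this chapter, so the short derivation below assumes that the principal word vectors take the form ϒ = Λ(1/√(n−1))ΣV^T, which the factor √(n−1) in Equation 6.10 suggests; under this assumption the singular values cancel:

$$\Lambda\,\frac{1}{\sqrt{n-1}}\,\Sigma V^{T}
= \alpha\sqrt{n-1}\begin{bmatrix} \Sigma_{(k,k)}^{-1} & 0 \\ 0 & 0 \end{bmatrix}\frac{\Sigma}{\sqrt{n-1}}\,V^{T}
= \alpha\begin{bmatrix} I_{k} & 0 \\ 0 & 0 \end{bmatrix}V^{T}$$

so, up to the constant α, the non-zero rows of the result are exactly the top k right singular vectors of M, i.e., the directions used by RSV (apart from RSV's mean-centring of the co-occurrence matrix).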

6.5 Summary

We examined the relationships between the principal word embedding method, introduced in this thesis, and other word embedding methods such as HAL (Lund and Burgess, 1996), RI (Sahlgren, 2006), HPCA (Lebret and Collobert, 2014), GloVe (Pennington et al., 2014), and RSV (Basirat and Nivre, 2017). We also addressed two shortcomings of using PCA for word embedding in these methods. One of the shortcomings is related to the distribution of mixtures of contextual word vectors, which is not suitable for PCA. The other shortcoming is related to the size of the contextual matrix sampled from mixtures of contextual word vectors. The matrix is often very large, and performing principal component analysis on it requires a huge amount of memory and CPU time.

HAL and HPCA were studied together. We showed that both methods can be considered special cases of principal word embedding. We also discussed that HAL addresses none of the PCA shortcomings. HPCA addresses both shortcomings, but at the cost of eliminating the mean-centring step in PCA. HPCA mitigates the problem with the data distribution by performing a Hellinger transformation on a co-occurrence matrix. The shortcoming related to the matrix size is alleviated by eliminating data centring and preserving the sparse nature of the matrix. RI does not use PCA for dimensionality reduction. Instead, it uses a random projection as its dimensionality reduction technique. This makes RI a very efficient word embedding method, and it is the way that RI solves the problem related to data size. RI addresses the shortcoming related to the data distribution by a feature selection approach. We showed that the feature selection in RI can be modelled by using a proper weighting matrix in principal word embedding.


GloVe addresses both shortcomings of using PCA for word embedding. The shortcoming related to the data distribution is mitigated by a logarithmic transformation, and the shortcoming related to the matrix size is addressed by software engineering techniques such as buffering. We showed that the method adopted by GloVe is equivalent to a kernel principal component analysis, which is closely related to the generalized PCA used in principal word embedding.

RSV addresses the shortcoming related to the data distribution, but it ignores the shortcoming related to the data size. The data distribution in RSV undergoes a seventh root transformation, which is a special case of the power transformation in the principal word embedding method. Word vectors in RSV are formed by the right singular vectors of a mean-centred co-occurrence matrix. We showed that the same vectors can be obtained from the principal word embedding method if it is trained with proper parameters, as described in Section 6.4.


Part III: Experiments


7. Experimental Settings

In this chapter, we describe the experimental settings used in the thesis. Our experiments are divided into two main groups. The first group consists of a series of experiments that study different parameters of the principal word embeddings with regard to multiple intrinsic and extrinsic evaluation metrics. The second group consists of experiments comparing principal word embeddings with other sets of word embeddings.

The organization of this chapter is as follows. In Section 7.1, we introduce several intrinsic and extrinsic evaluation metrics used to assess the quality of word vectors. Then, in Section 7.2, we outline the organization of the experiments dealing with the parameters of the principal word embeddings. In Section 7.3, we introduce a comparison setting, which is used to compare the principal word embeddings with other sets of word embeddings. Finally, in Section 7.4, we outline the training data used to generate the word embeddings.

7.1 Evaluation Metrics

In this section, we introduce multiple intrinsic and extrinsic evaluation metrics that are used in our experiments. In Section 3.2.4, we outlined several evaluation metrics used in the literature to assess the quality of word embeddings. The evaluation metrics were classified into three groups: word similarity metrics, word analogy metrics, and application-based metrics. Our intrinsic evaluation metrics, introduced in this section, belong to none of these groups. The intrinsic evaluation metrics provide a statistical view of the distribution of the word vectors. These metrics rely on the spread and discriminability of word vectors. Our extrinsic evaluation metrics, however, belong to the word similarity and application-based evaluation metrics. In the following sections, we introduce the intrinsic and extrinsic evaluation metrics.

7.1.1 Intrinsic Evaluation

We introduce two types of intrinsic evaluation metrics that provide a statistical view of the distribution of a set of word vectors. The first metric measures the spread of a set of word vectors. The second metric measures the discriminability of a set of word vectors. The spread of word vectors is measured by the generalized variance of the word vectors. As described in Section 2.2, the generalized variance of a random vector is equal to the product of the eigenvalues of its covariance matrix. Depending on the parameters of the principal word embeddings, the generalized variance of a set of principal word vectors can be dominated by the top eigenvalues of their covariance matrix. In order to mitigate this potential problem in the values of generalized variance, we use the logarithm of the generalized variance as our measure of the spread of principal word vectors. Denoting the ith top eigenvalue of the covariance matrix of a set of word vectors by λi, the logarithm of the generalized variance of the word vectors is computed as follows:

$$\log GV = \sum_{i=1}^{k} \log \lambda_i \qquad (7.1)$$

This equation is obtained from Equation 2.38.

We use the Fisher discriminant ratio (FDR) to measure the discriminability of the word vectors (see Section 2.2). The FDR is used to measure the syntactic and semantic discriminability of word vectors. Since part-of-speech tags are more syntactic in nature, we use them to measure the syntactic discriminability of word vectors. In other words, the syntactic discriminability of word vectors is measured by the FDR of word vectors with respect to their part-of-speech tags. Similarly, since named entities such as persons, locations, and organizations are more semantic in nature, we use them to measure the semantic discriminability of word vectors. The semantic discriminability of word vectors is measured with respect to named entities. These categories provide us with some insights about the syntactic and semantic information encoded into the word vectors. Therefore, we refer to the discriminant ratios obtained from the part-of-speech tags and named entities as syntactic discriminability and semantic discriminability, respectively. Since the syntactic and semantic categories of words may vary with their contextual environment, we represent each occurrence of a word with a large vector built from the concatenation of the word vectors associated with the word itself and its surroundings. More concretely, each occurrence of a word is represented by the concatenation of the word vectors associated with the words in a symmetric window of length 7, where the word in question is in the middle of the window, i.e., three preceding words, the word at position 4, and three succeeding words. The label of the word in the middle of the window is taken to be the label of the entire vector.
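A minimal sketch of the spread metric in Equation 7.1 is given below, assuming the word vectors are stored with one k-dimensional vector per row; the random vectors are only there to make the snippet runnable.

```python
import numpy as np

def log_generalized_variance(W):
    """Logarithm of the generalized variance of a set of word vectors (Equation 7.1):
    the sum of the logarithms of the eigenvalues of their covariance matrix."""
    C = np.cov(W, rowvar=False)          # k x k covariance matrix of the word vectors
    eigvals = np.linalg.eigvalsh(C)      # eigenvalues lambda_1, ..., lambda_k
    return float(np.sum(np.log(eigvals)))

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 15))      # 1000 toy 15-dimensional word vectors
print(log_generalized_variance(W))
```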

We use universal part-of-speech tags (Nivre et al., 2016) as the syntactic categories of words and the named entity data provided by the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) as the semantic categories of words. The syntactic and semantic discriminability of the word vectors are computed on the development set of the English part of the corpus of Universal Dependencies v2.0 and the testa file of the English part of the shared task, respectively.


To sum up, we introduced two intrinsic evaluation metrics that provide a statistical view of the spread and discriminability of a set of word vectors. As intrinsic evaluation metrics, the spread and the discriminability of word vectors tell us about the internal structure of the word vectors but not about the usefulness of the word vectors for particular tasks. A high value of data spread indicates that the word vectors occupy a larger volume of their vector space than a set of word vectors with a smaller value of data spread. In terms of PCA, a high value of data spread is desirable because it is considered a measure of information. A high value of data discriminability indicates that the word vectors are well separated with respect to the word classes used for computing the FDR (e.g., part-of-speech tags).

7.1.2 Extrinsic Evaluation

The extrinsic evaluation of principal word vectors studies the contribution of principal word vectors to other tasks such as a word similarity task, part-of-speech tagging, named-entity recognition, and dependency parsing. All of these frameworks are outlined in Section 3.2.4. The word similarity benchmarks developed by Faruqui and Dyer (2014) are used to measure correlations between the word similarities captured by word embeddings and human judgements of word similarity. veceval (Nayak et al., 2016b) is employed to evaluate the contribution of word embeddings to part-of-speech tagging and named-entity recognition. The contributions of word vectors to dependency parsing are measured by the dependency parsing framework used by Basirat and Nivre (2017).

The word similarity framework reports the correlation between the similarity rankings provided by multiple word similarity benchmarks and the cosine similarities between word vectors. We report a breakdown of the results obtained from the individual benchmarks, which reveals the strengths and weaknesses of word vectors on different data sets. In addition, in order to provide an overall view of the performance of the word vectors, we report the average of the correlations obtained from all word similarity benchmarks.

As mentioned in Section 3.2.4, veceval is an application-based framework that evaluates a set of word embeddings with regard to its contribution to several standard NLP tasks, including part-of-speech tagging and named-entity recognition. We use veceval with 50-dimensional word vectors instead of the 100-dimensional word vectors used for the other experiments, because the tool is configured to process only 50-dimensional word vectors. veceval trains and tests a part-of-speech tagger on the Wall Street Journal (WSJ) corpus (Marcus et al., 1993). Sections 00–18 are used for training and sections 19–21 are used for testing. veceval trains and tests a named entity recognizer on the data provided by the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003). The shared task's training data is used to train the named entity recognizer and testa is used as the test set.

The dependency framework evaluates a set of word vectors with respect to its contributions to the dependency parsing task. Basirat and Nivre (2017) use the dependency parser of Chen and Manning (2014) to evaluate different sets of word vectors. The parser is an arc-standard system (Nivre, 2004) with a feed-forward neural network as its classifier. The parsing experiments are carried out on the WSJ annotated with Stanford typed dependencies (SD) (de Marneffe and Manning, 2008). Sections 02–21 of WSJ are used for training, and sections 22 and 23 are used as the development set and the test set respectively.[1] The part-of-speech tags are assigned to words through 10-fold cross-validation on the training sets using the Stanford part-of-speech tagger (Toutanova et al., 2003).[2] We use the parser with 100-dimensional word vectors and 400 hidden units in the hidden layer of the neural network. The remaining parameters are set to their default values. The parsing performance is reported in terms of labelled attachment score (LAS) and unlabelled attachment score (UAS).

[1] In machine learning, a development set, also known as a validation set, is used for tuning the parameters of a learning model. A test set, however, contains unseen data used to evaluate a learning model. The performance of a learning model is measured on the test set (Bishop, 2006, Page 32).
[2] Cross-validation is a learning and evaluation technique in which a data set is used for both training and validating a learning model. In k-fold cross-validation, a data set is randomly divided into k equal-sized subsets. One of the k subsets is used as development data and the remaining k−1 subsets are used as training data. This process is repeated k times so that all subsets have a chance to be used as development data. In part-of-speech tagging, each of the k development sets is tagged by a tagging model trained on the remaining folds. By the end of this process, all k folds are tagged. The average of the k results from the development data is then considered an estimate of the performance of the entire learning process.

7.2 Initial Settings

In this section, we outline the setup of our experiments, which are aimed at studying the effect of the parameters of principal word embeddings. Among all parameters of the principal word embeddings, we have selected the following parameters to study:

1. Feature variables
2. Number of dimensions
3. Metric and weight matrices
4. Transformation function

Each of these parameters can be set in several ways, resulting in a parameter space too large to analyse exhaustively. For example, one can set the feature variables in several ways, as described in Section 4.3 and Section 4.6. The number of dimensions can be set to any positive integer value smaller than or equal to the size of the feature variables. We navigate through these parameters one by one. First, we define a default setting to initialize the parameters. Using this default setting, we then study each of the parameters individually as follows. We start by forming different types of feature variables and examine how the feature variables affect the principal word vectors. A study of different types of feature variables is provided in Section 4.3 and Section 4.6. Then we focus on the effect of the number of dimensions on the principal word vectors. The number of dimensions (or dimensionality) of the principal word vectors is smaller than or equal to the number of feature variables from which the word embeddings are generated. The effect of the weighting matrices is studied together with the effect of the transformation function. Examples of metric and weight matrices can be seen in Section 5.2.1. As explained in Section 5.2.2, the transformation function can be any monotonically increasing concave function. In our experiments, we use the power transformation function:

$$f(V;\theta) = V^{\theta} \qquad (7.2)$$

where V is a contextual word vector and θ is a parameter to be optimized (see Equation 5.4).
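The sketch below illustrates what the power transformation in Equation 7.2 does to heavy-tailed co-occurrence counts; the Zipf-distributed sample and the particular values of θ are illustrative only (θ = 1/7 corresponds to the seventh-root transformation used by RSV, while the principal word embedding method tunes θ according to Equation 5.4, which is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.zipf(a=2.0, size=100_000).astype(float)   # Zipf-like counts as a stand-in

# Smaller values of theta compress the long tail of the count distribution.
for theta in (1.0, 0.5, 1.0 / 7.0):
    x = counts ** theta
    print(f"theta={theta:.3f}  mean={x.mean():10.3f}  std={x.std():10.3f}  max={x.max():12.1f}")
```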

The feature variables and the number of dimensions are evaluated with regard to the intrinsic evaluation metrics described in Section 7.1.1. The effects of the weighting matrices and the transformation function are studied with regard to both the intrinsic and the extrinsic evaluation metrics described in Section 7.1.1 and Section 7.1.2. For these experiments, the feature variables and the number of dimensions are set to their default values, and all combinations of the values of the metric matrix, the weight matrix, and the transformation function are formed and evaluated with respect to the intrinsic and extrinsic evaluation metrics.

The default setting for the above parameters is defined as follows. The sets of feature variables are initialized with word forms as their contextual feature and the backward neighbourhood context with parameter τ = 1 as their context function, i.e., the immediately preceding word. The number of dimensions is set to 100. The metric matrix and the weight matrix are both set to the identity matrix. The eigenvalue weighting matrix Λ is set as in Equation 5.8 with k = 100. The transformation function is the identity function, i.e., no transformation is done on the elements of the contextual word vectors.
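For reference, the default setting can be summarized as a small configuration object. The dictionary below is purely illustrative; the key names are hypothetical and do not correspond to the actual interface of the software used in the thesis.

```python
# Hypothetical summary of the default experimental setting described above.
DEFAULT_SETTING = {
    "contextual_feature": "word_form",
    "context_function": ("backward_neighbourhood", {"tau": 1}),  # the immediately preceding word
    "dimensions": 100,
    "metric_matrix": "identity",
    "weight_matrix": "identity",
    "eigenvalue_weighting": ("equation_5.8", {"k": 100}),
    "transformation": "identity",
}
```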

7.3 Comparison Settings

We compare principal word vectors with word vectors obtained from other popular word embedding methods. The comparisons are made on the basis of both the intrinsic and the extrinsic evaluation metrics explained in Section 7.1.1 and Section 7.1.2. The following word embedding methods are chosen for comparison:


• word2vec (Mikolov et al., 2013b)
• GloVe (Pennington et al., 2014)
• HPCA (Lebret and Collobert, 2014)
• random indexing (RI) (Sahlgren, 2006)

We train all methods on the same raw corpus used for generating the principal word vectors. This corpus is described in Section 7.4. Except for the set of word vectors extracted by word2vec, all embeddings are trained with the backward neighbourhood context with length 1, i.e., the context function for each word in the corpus points to its immediately preceding word, and the contextual features are the word forms. We train word2vec with a symmetric neighbourhood context of length one, i.e., the context is formed by the immediately preceding and the immediately succeeding words. The number of iterations in GloVe is set to 50. All methods are trained with five threads, if multi-threading is supported.

We also provide a comparison of the efficiency of the word embedding methods. The efficiency of a word embedding method as a software system does not depend only on its mathematical formulation. It may also depend on other factors such as memory management, parallelization, and how the system is divided into subsystems. Since not all of the word embedding methods follow the same software architecture, this comparison is not as straightforward as the previous comparisons. As described in Section 3.2, some of these methods, such as HPCA, GloVe, and principal word embedding, divide the task of word embedding into three main sub-tasks. In the first step, they build a co-occurrence matrix, which then undergoes a transformation function in the second step. In the third step, a set of low-dimensional word vectors is extracted from the transformed co-occurrence matrix by performing a dimensionality reduction technique. RI and word2vec, by contrast, perform all of these steps together. This makes these methods highly efficient in terms of memory usage, since they do not need to explicitly build the co-occurrence matrix in memory.

We use our own implementations of HPCA and RI. HPCA is implemented as follows: first, we scan the training corpus and build a probability co-occurrence matrix as described in Section 3.2.2. Then we perform a Hellinger transformation on this matrix. A set of low-dimensional word vectors is generated by applying PCA to the transformed matrix. These vectors are then normalized to have zero mean and a variance of one. RI is implemented as a random projection method. First, we scan the training corpus and build a co-occurrence matrix. This matrix then undergoes the PPMI transformation (see Section 3.2.2). The resulting matrix is finally multiplied by a random matrix, resulting in a set of low-dimensional word vectors.
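The two re-implementations can be summarized by the following NumPy sketch, which starts from a given count matrix rather than from a corpus scan. The orientation of the matrix (contexts in rows, words in columns), the smoothing constants, and the use of a Gaussian random projection are assumptions made for the sake of a compact, runnable example; they are not guaranteed to match the exact implementation choices used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def hpca_vectors(counts, k):
    """Probability co-occurrence matrix -> Hellinger transformation -> PCA -> standardization."""
    P = counts / counts.sum(axis=0, keepdims=True)        # co-occurrence probabilities per word
    H = np.sqrt(P)                                         # Hellinger transformation
    Hc = H - H.mean(axis=1, keepdims=True)                 # centre before PCA
    _, s, Vt = np.linalg.svd(Hc, full_matrices=False)
    W = (s[:k, None] * Vt[:k, :]).T                        # k-dimensional vector per word (column)
    return (W - W.mean(axis=0)) / W.std(axis=0)            # zero mean, unit variance

def ri_vectors(counts, k):
    """Co-occurrence matrix -> PPMI transformation -> random projection."""
    total = counts.sum()
    expected = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True)
    pmi = np.log((counts * total + 1e-12) / (expected + 1e-12))
    ppmi = np.maximum(pmi, 0.0)                            # positive pointwise mutual information
    R = rng.standard_normal((k, ppmi.shape[0]))            # random projection matrix
    return (R @ ppmi).T                                    # k-dimensional vector per word (column)

counts = rng.integers(1, 10, size=(50, 50)).astype(float)  # toy co-occurrence counts
print(hpca_vectors(counts, 10).shape, ri_vectors(counts, 10).shape)
```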

With the implementations described above, the differences between RI, HPCA, GloVe, and principal word embedding lie in their transformation and dimensionality reduction steps. The main difference between these methods, in terms of efficiency, is in their dimensionality reduction techniques. The transformation step is not computationally expensive. Hence, we compare the efficiency of these methods with respect to the CPU time required to perform the dimensionality reduction. We compare the principal word embedding method and word2vec separately.

We use Octave compiled with OpenBLAS to implement the dimensionality reductions in RI, HPCA, and principal word embedding. The multi-threading functionality of OpenBLAS speeds up the basic matrix operations and the singular value decomposition in Line 7 of Algorithm 5. Both GloVe and word2vec support multi-threading. This makes the comparison between the efficiency of the word embedding methods reasonably reliable. The comparison results are reported in Section 8.2.

7.4 Training Corpus

We detail the data used for training the word vectors in our experiments. The word vectors are trained on the English corpus provided by the CoNLL-2017 shared task (Hajič and Zeman, 2017) as additional data for training word embeddings. The corpus is annotated with universal part-of-speech tags and universal dependency relations (Nivre et al., 2016) using UDPipe (Straka and Straková, 2017). We use the annotations with no change but normalize the raw text as follows. All sequences of digits are replaced with the special token <number> and all tokens with a frequency of less than 50 are replaced with <unknown>. We also replace all uppercase letters with the corresponding lowercase letters. The normalized corpus contains around 11 billion tokens in total, with around 1 million unique tokens. This corpus is used for all experiments unless we clearly state that another data set is used.
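A minimal sketch of this normalization is given below; the whitespace tokenization and the tiny frequency threshold in the usage example are simplifications for illustration.

```python
import re
from collections import Counter

def normalize(tokens, min_freq=50):
    """Lowercase tokens, replace digit sequences with <number>, and replace
    tokens occurring fewer than min_freq times with <unknown>."""
    tokens = [re.sub(r"\d+", "<number>", t.lower()) for t in tokens]
    freq = Counter(tokens)
    return [t if freq[t] >= min_freq else "<unknown>" for t in tokens]

print(normalize("The 2 cats saw 10 dogs and 10 birds".split(), min_freq=2))
```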

7.5 Summary

We have introduced several intrinsic and extrinsic evaluation metrics to analyse principal word embeddings. The intrinsic metrics deal with the spread and discriminability of the word embeddings. The spread of a set of word embeddings is represented by the logarithm of the generalized variance of the vectors. The discriminability of word embeddings is defined as the Fisher discriminant ratio (FDR) of the vectors. Two types of discriminability have been defined: the syntactic discriminability and the semantic discriminability. The syntactic discriminability of word embeddings is computed with regard to the part-of-speech tags of the words. The semantic discriminability of word embeddings is computed with regard to the named-entity classes of words. The extrinsic metrics assess the quality of word embeddings with regard to their contributions to other tasks such as a word similarity task, part-of-speech tagging, named-entity recognition, and dependency parsing.

Our experiments with principal word embeddings are organized to study the effect of the parameters of the principal word embedding method on the evaluation metrics. Four parameters of principal word embedding were chosen: feature variables, number of dimensions, weighting matrices, and the transformation function. This was followed by a section that introduced a comparison setting by which to compare the principal word embedding method with other popular word embedding methods. The comparison is based on the efficiency of the word embedding methods and their performance on the evaluation metrics introduced in this chapter.


8. Results

In this chapter, we analyse the results obtained from our experiments in order to answer our fourth research question (see Section 1.2). First, we report the experimental results obtained from our experiments on the parameters of principal word embeddings. These experiments are based on the intrinsic and extrinsic evaluation metrics described in Section 7.1. The organization of these experiments was outlined in Section 7.2. Then we report the results obtained from the comparison between principal word embedding and other word embedding methods. The comparison is carried out with regard to the settings described in Section 7.3.

8.1 Parameters of Principal Word Embedding

The division of this section is based on our experiments on the parameters of principal word embeddings. In Section 8.1.1, we examine how principal word embeddings are affected by different types of feature variables. This study is based on our intrinsic evaluation metrics introduced in Section 7.1.1. In Section 8.1.2, we study the effect of the number of dimensions on the principal word embeddings. This study is based on both the intrinsic and the extrinsic evaluation metrics introduced in Section 7.1.1 and Section 7.1.2, respectively. The effect of weight matrices and transformation functions is studied in Section 8.1.3.

8.1.1 Feature Variables

In this section, we study the effect of feature variables on the spread and discriminability of the principal word vectors. As detailed in Section 4.4, a set of feature variables is characterized by a set of contextual features and a context function. In Section 4.6, we proposed different approaches for combining feature variables. Our experiments on the feature variables are performed on the feature variables formed by the following contextual features and context functions:

• contextual features: word forms and part-of-speech tags
• context functions: neighbourhood context and dependency context

The experiments cover every combination of the contextual features and context functions. In other words, we use the neighbourhood context with both word forms and part-of-speech tags. Similarly, the dependency context is used with both word forms and part-of-speech tags. The neighbourhood context is used with different values of τ = ±1, ..., ±10. For the dependency context, we only use τ = 1, the immediate dependency parent of a word (see Section 4.2). The feature variables formed by the neighbourhood context are combined in the following ways. We form different variants of the window-based context and the union context, namely the backward, forward, and symmetric variants. Moreover, we experiment with the joint set of feature variables obtained from the word form feature variables and the corresponding part-of-speech tag feature variables. For example, we join the feature variables obtained from the word forms and the neighbourhood context with the feature variables obtained from part-of-speech tags and the neighbourhood context with the same value of τ. We use the universal part-of-speech tags and the universal dependencies provided by our training corpus, described in Section 7.4, as the set of part-of-speech tags and the dependency context.

All parameters of principal word embedding are set to their default values except for the feature variables and the number of dimensions. The feature variables are set as above and the number of dimensions is set to 15. The reason we choose 15 dimensions is the small size of the part-of-speech tag set, which contains 17 tags. As mentioned in Section 5.1, the maximum number of dimensions of principal word embeddings is always smaller than or equal to the number of feature variables, which is determined by the size of the feature set. The word embeddings generated with word forms can have thousands of dimensions, but the word embeddings generated with part-of-speech tags can have at most 17 dimensions. We set the number of dimensions k to 15 to cancel out the effect of the number of dimensions on the results. Later on, in Section 8.1.2, we study how principal word vectors are affected by the number of dimensions. In the remainder of this section, we report the spread and the discriminability of the principal word embeddings obtained from the feature variables.

Data Spread

As described in Section 7.1.1, the spread of a set of word embeddings is measured by the logarithm of the generalized variance of the word vectors. Figure 8.1 shows the logarithm of the generalized variance of the 15-dimensional principal word vectors generated with different types of feature variables. As shown, regardless of the context in use, the principal word vectors generated with part-of-speech tags (Figure 8.1b) result in higher variance than the principal word vectors generated with word forms (Figure 8.1a) and the principal word vectors generated with the joint word forms and part-of-speech tags (Figure 8.1c). However, it is worth noting that the results reported for the principal word vectors generated with part-of-speech tags are due to the contribution of almost all of the eigenvalues, whereas the other results are due to the contribution of a small fraction of the eigenvalues, i.e., 15 eigenvalues out of thousands of possible eigenvalues. This means that the spread of the word vectors generated with word forms can be much higher if we take more eigenvalues into consideration. Later on, in Section 8.1.2, we show how the spread of principal word vectors generated with word forms is affected by the number of eigenvalues and eigenvectors taken into consideration.

The decreasing trend in the results obtained from the neighbourhood contexts shows that the word vectors become denser as the parameter |τ| increases. The principal word vectors generated with word forms and both backward and forward neighbourhood contexts result in equal generalized variance. This is because, for each value of τ > 0, the contextual matrix M(−τ) obtained from the backward neighbourhood context is equal to the transpose of the contextual matrix M(τ) obtained from the forward neighbourhood context, i.e., M(−τ) = M(τ)^T. In other words, for each pair of tokens (e_t, e_{t+τ}) in the corpus with v_i(e_t) = 1 and v_j(e_{t+τ}) = 1, we have f_j(e_t) = 1 and f_i(e_{t+τ}) = 1, i.e., e_t = (v_i, f_j) and e_{t+τ} = (v_j, f_i), since the set of contextual features F is equal to the set of vocabulary units V, i.e., f_i = v_i and f_j = v_j; hence M(−τ)_{(i,j)} = M(τ)_{(j,i)}. Thus, the covariance matrices of contextual matrices with this property are the same, resulting in the same sets of eigenvalues. The overlap between the results obtained from the forward and backward window-based contexts in Figure 8.1a is interpreted in the same way.
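The transposition argument can be verified numerically. The sketch below builds the backward (τ = −1) and forward (τ = 1) co-occurrence matrices for a random toy corpus of word indices and checks that they are transposes of each other with identical singular values, which is what makes the backward and forward curves in Figure 8.1a coincide; the corpus and vocabulary size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 20
corpus = rng.integers(0, V, size=10_000)      # toy corpus of word indices

M_bwd = np.zeros((V, V))                      # rows: preceding word as contextual feature
M_fwd = np.zeros((V, V))                      # rows: succeeding word as contextual feature
for prev, cur in zip(corpus[:-1], corpus[1:]):
    M_bwd[prev, cur] += 1.0
    M_fwd[cur, prev] += 1.0

assert np.array_equal(M_bwd, M_fwd.T)         # M(-tau) = M(tau)^T

# Consequently the two matrices share the same singular values, and the generalized
# variance computed from either of them is the same.
assert np.allclose(np.linalg.svd(M_bwd, compute_uv=False),
                   np.linalg.svd(M_fwd, compute_uv=False))
```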

The decreasing trend observed in the neighbourhood contexts changes to an increasing trend once we combine the neighbourhood contexts. The increasing trends observed in the window-based and union contexts show that the related feature combination approaches described in Section 4.6 are useful in increasing the spread of the principal word vectors. Figure 8.1 shows that the highest data spread is obtained from the symmetric window-based and union contexts, regardless of the type of contextual feature in use. The union contexts always result in a higher generalized variance than their corresponding window-based contexts.

We summarize our observations on the effect of feature variables on the data spread as follows. The spread of principal word vectors generated with the neighbourhood context decreases as the absolute value of the parameter τ increases. The spread of the word vectors generated with the window-based and union feature combination approaches increases as the parameter k increases. The union contexts result in higher values of generalized variance than their corresponding window-based contexts.

Data Discriminability

In this part, we examine the syntactic and semantic discriminability of the principal word vectors generated with the different types of feature variables explained in the beginning of Section 8.1.1. The syntactic and semantic discriminability of a set of word vectors are defined as the Fisher discriminant ratio (FDR) of the word vectors with regard to the part-of-speech tags and the named entities, as described in Section 7.1.1.


[Figure 8.1 appears here: three panels, (a) Word Form, (b) POS, and (c) Word Form × POS, each plotting the logarithm of generalized variance (vertical axis y) against x = |τ| or k for the context functions BN, FN, BW, FW, SW, BU, FU, SU, and D.]

Figure 8.1. The logarithm of generalized variance of principal word vectors generated with different types of feature variables formed with different types of contextual features and context functions. The contextual features are (a) word forms, (b) part-of-speech tags, and (c) joint word forms and part-of-speech tags. The context functions are backward neighbourhood context (BN), forward neighbourhood context (FN), backward window-based context (BW), forward window-based context (FW), symmetric window-based context (SW), backward union context (BU), forward union context (FU), symmetric union context (SU), and dependency context (D). The horizontal axis x refers to |τ| for the neighbourhood context and k for the window-based and union contexts. The vertical axis y shows the logarithm of generalized variance of the principal word vectors.


We start with the feature variables formed by the word forms and study the effect of context functions on their syntactic and semantic discriminability. This scenario is repeated for the feature variables formed by part-of-speech tags. At the end, we examine how the joint set of feature variables is affected by different types of context. The results for each series of experiments are depicted by two columns that show the syntactic (left) and the semantic (right) discriminability of the word vectors. In each figure, we show a baseline that is obtained from word vectors randomly generated by a standard Gaussian distribution.

Figure 8.2 shows the values of syntactic and semantic discriminability of principal word vectors generated with the word forms and different types of context: the neighbourhood context, the window-based context, and the union context. Figure 8.2a and Figure 8.2b show the syntactic and semantic discriminability obtained from the neighbourhood contexts with different values of τ. In order to make the comparison between the contexts more convenient, the horizontal axes in these figures show the absolute value of τ. The almost uniform trend in the results obtained from the backward neighbourhood context in Figure 8.2a shows that this context function is insensitive to the value of the parameter τ. However, the large decrease in the values obtained from the forward neighbourhood context indicates that the syntactic discriminability of the word vectors is negatively affected as τ increases. In general, we see that the backward neighbourhood context results in higher values of syntactic discriminability than the forward neighbourhood context. On the other hand, Figure 8.2b shows that the semantic discriminability of the principal word vectors is almost insensitive to the value of the parameter τ. We see that the discrimination ratio obtained from the backward neighbourhood context starts with an upward trend but falls when τ is larger than 3. However, the general trend of the graphs shows that the parameter τ is not an effective factor for the semantic discriminability of the word vectors generated with the word forms and the neighbourhood context. We also see that the semantic discrimination ratio does not differentiate between the forward context and the backward context. This is opposed to what we observed in the syntactic discriminability of the word vectors generated with the neighbourhood context, where the backward context was more informative than the forward context.

Figure 8.2c and Figure 8.2d show the syntactic and semantic discriminability of the principal word vectors generated with different types of window-based contexts, i.e., backward, forward, and symmetric window-based contexts with length k = 1, ..., 10. The ascending trend in the results obtained from the forward and backward window-based contexts in Figure 8.2c shows that the addition approach in Equation 4.38, used to form a window-based context, preserves part of the syntactic information provided by the constituent neighbourhood contexts forming the window-based context. The symmetric window-based context is formed by the linear combination of both the forward and backward window-based contexts.


[Figure 8.2 appears here: six panels plotting FDR against |τ| or k, with curves for the backward, forward, and (where applicable) symmetric contexts and a random baseline: (a) neighbourhood, syntactic disc.; (b) neighbourhood, semantic disc.; (c) window-based, syntactic disc.; (d) window-based, semantic disc.; (e) union, syntactic disc.; (f) union, semantic disc.]

Figure 8.2. The syntactic (left) and semantic (right) Fisher discriminant ratio (FDR) of principal word vectors extracted with different types of feature variables formed with word forms as contextual features and the neighbourhood context (a, b), the window-based context (c, d), and the union context (e, f).


The word vectors generated with the symmetric window-based context are more influenced by the backward window-based context than by the forward window-based context. In general, we see that the symmetric window-based context is as good as the backward window-based context. In terms of syntactic discriminability, there is no clear advantage in prioritizing one over the other. However, in terms of computational resources, the backward window-based context with a small value of k is preferable because it results in higher sparsity in the contextual matrix and reduces the computation time and memory consumption. Similar to the neighbourhood context, the backward and symmetric window-based contexts result in higher syntactic discriminability than the forward window-based context. This contrasts with the semantic discriminability of word vectors generated with the window-based context, where both backward and forward window-based contexts result in almost similar values of the Fisher discriminant ratio (see Figure 8.2d). Despite the similar performance of the different types of window-based context on the semantic discriminability of the word vectors, we see that the symmetric window-based context consistently results in slightly higher semantic discriminability. This is in accordance with the observations made by Lebret and Collobert (2015), who found that the symmetric window-based context with a fairly large window size was better than the backward window-based context for semantic-oriented tasks. Nevertheless, the overall trend of the variations in the results presented in Figure 8.2d shows that the semantic discrimination ratio of principal word vectors obtained with the symmetric window-based context is not very sensitive to the window type and window length k.

Figure 8.2e and Figure 8.2f show the syntactic and semantic discrimination ratios obtained from the principal word vectors generated with the union context with different lengths k. Both the backward and symmetric contexts act similarly to what we observed in the window-based context in both the syntactic and semantic cases. Figure 8.2e shows that the contextual information provided by the backward context is more helpful for determining the syntactic categories of words than the forward context. This contrasts with the semantic discriminant ratio in Figure 8.2f, where the forward context with large values of k is more meaningful in determining the semantic categories of words. The best results in terms of syntactic and semantic discriminant ratios are obtained from the symmetric union context with k = 1 and k = 3. We see almost no change in the values of semantic discriminability as k becomes larger. However, increasing the value of k has a negative effect on the syntactic discriminability of the principal word vectors.

We summarize our observations on the feature variables formed by word forms and different types of context as follows. The backward and symmetric contexts are more informative in determining the syntactic categories of words. The symmetric contexts are more informative in determining the semantic categories of words.


Figure 8.3 shows the syntactic and semantic discriminant ratios obtained from the 15-dimensional principal word vectors generated with part-of-speech tags as their contextual features and different types of contexts. Figure 8.3a and Figure 8.3b show the results obtained from the neighbourhood context. As shown in Figure 8.3a, the forward neighbourhood context starts with a fairly high result that drastically decreases as the value of τ increases. Conversely, the backward neighbourhood context starts with a small value of the FDR that dramatically increases as the parameter |τ| increases. This shows that the syntactic categories of the preceding words provide more information for the syntactic discriminability of the words than the syntactic categories of the succeeding words. This observation about the meaningfulness of the backward context for the syntactic categories of words is also in accordance with our previous observations about word forms. The difference between the two observations is that the syntactic discriminability of word vectors generated with part-of-speech tags increases as the value of |τ| becomes larger, but the syntactic discriminability of word vectors generated with word forms is almost insensitive to the value of τ. Nevertheless, Figure 8.3b does not show the same pattern for the semantic categories of words. Instead, it shows that the backward neighbourhood context with higher values of |τ| does not necessarily lead to higher semantic discriminability. The semantic discriminability is almost insensitive to the value of τ. The forward neighbourhood context with |τ| = 1 results in higher semantic discriminability than the backward neighbourhood context with the same value of |τ|. This order changes as the value of |τ| increases.

Figure 8.3c and Figure 8.3d show the syntactic and semantic discriminability of word vectors generated with the backward, forward, and symmetric window-based contexts with length k. As shown in Figure 8.3c, the difference between the syntactic discriminability of the word vectors generated with different types of window-based contexts vanishes as the window length k increases. Both the forward and symmetric window-based contexts with length k = 1 result in higher syntactic discriminability than the backward window-based context. This contrasts with the results obtained from word forms, where the backward and symmetric window-based contexts result in higher syntactic discriminability (see Figure 8.2c). Figure 8.3c shows that increasing the value of the window length k has a positive effect on the syntactic discriminability of word vectors generated with the backward context and a negative effect on the word vectors generated with the forward and symmetric contexts. However, we see that the results obtained from the backward window-based context are never higher than those obtained from the forward and symmetric window-based contexts. On the other hand, increasing the window length k has a positive effect on the semantic discriminability of the word vectors generated with the forward window-based context. We also see a descending trend in the results obtained from the backward and symmetric contexts as the window size k increases.


[Figure 8.3 appears here: six panels plotting FDR against |τ| or k, with curves for the backward, forward, and (where applicable) symmetric contexts and a random baseline: (a) neighbourhood, syntactic disc.; (b) neighbourhood, semantic disc.; (c) window-based, syntactic disc.; (d) window-based, semantic disc.; (e) union, syntactic disc.; (f) union, semantic disc.]

Figure 8.3. The syntactic (left) and semantic (right) Fisher discriminant ratio (FDR) of principal word vectors generated with different types of feature variables formed with part-of-speech tags as contextual features and the neighbourhood context (a, b), the window-based context (c, d), and the union context (e, f) as context functions.


Figure 8.3e and Figure 8.3f show the results obtained from the principal word vectors generated with the union context and part-of-speech tags. Figure 8.3e shows that a fairly high syntactic discriminability can be obtained from the symmetric union context. We see that the syntactic discriminability of the word vectors slightly increases as the parameter k in the symmetric union context increases. However, Figure 8.3f shows that the semantic discriminability of the word vectors generated with the symmetric union context is completely insensitive to the parameter k. Among the principal word vectors generated with part-of-speech tags and different types of union context, the best result in terms of semantic discriminability is obtained from the forward union context with a small value of k. We see that there is an inverse relationship between the values of the semantic discriminability obtained from the forward union context and the parameter k, i.e., as the value of k increases, the value of the semantic discriminability of the word vectors decreases.

We summarize our observations on the feature variables generated with the part-of-speech tags as follows. In general, we see that the results obtained from part-of-speech tags behave differently from the previous experiments with word forms. This shows the importance of the contextual features for the quality of principal word vectors. The syntactic and semantic discriminability of principal word vectors generated with part-of-speech tags are not solely dependent on the context direction, backward versus forward. We see that the backward neighbourhood contexts result in high values of syntactic discriminability, but the backward window-based contexts result in fairly low values of syntactic discriminability. The same holds for the semantic discriminability of word vectors: the forward window-based contexts result in the highest semantic discriminability, but the forward neighbourhood contexts result in low values of semantic discriminability.

In our last series of experiments with feature variables, we extract 15-dimensional principal word vectors from the joint set of feature variables formed by the word forms and the part-of-speech tags. The word vectors are extracted with different types of context functions as before. Figure 8.4 shows the values of syntactic and semantic discriminability obtained from these word vectors. In comparison with the previous experiments on word forms and part-of-speech tags presented in Figure 8.2 and Figure 8.3, we see that the results obtained from the joint feature variables are similar to those obtained from the word forms in Figure 8.2, with slight improvements. This shows that the joint approach of feature combination can take advantage of its constituent feature variables. For example, here we see that the joint feature variables inherit the regularities of the context direction in feature variables formed by word forms and part of the information provided by the part-of-speech tags.

In the last part of this section, we summarize the results obtained from the syntactic and semantic discriminability of the principal word vectors.


Figure 8.4. The syntactic (left) and semantic (right) Fisher discriminant ratio (FDR) of principal word vectors extracted with the joint set of feature variables formed by the word forms and part-of-speech tags, and different types of context function including the neighbourhood context (a, b), the window-based context (c, d), and the union context (e, f).


In addition to the neighbourhood context function and its two extensions, the window-based context and the union context, we also study the results obtained from the dependency context. Figure 8.5a shows the best values of syntactic discriminability obtained from different types of feature variables. Each bar in this figure shows the maximum value of syntactic discriminability obtained from a certain type of feature variable. For example, the first bar from the left shows the maximum value of syntactic discriminability obtained from the feature variables formed by the neighbourhood context and word forms with different values of τ. We see that in most cases the joint set of feature variables formed by the word forms and part-of-speech tags results in higher syntactic discriminability than the feature variables formed by the word forms or part-of-speech tags individually. We also see that the dependency context does not yield satisfactory results despite the laborious annotation of the training corpus. This agrees with the observations of Kiela and Clark (2014), who state that a small window-based context works better than a dependency context if the vectors are extracted from a fairly large corpus. Interestingly, the feature variables formed with a neighbourhood context and part-of-speech tags result in a high syntactic discriminability, which is on a par with that of more complicated contexts (e.g. the window-based context and the union context) and contextual features (e.g. the joint word form and part-of-speech tag). This shows that if a part-of-speech tagged corpus is available, one can efficiently generate a set of principal word vectors with a high degree of syntactic separability using the backward neighbourhood context with a large value of |τ|. Figure 8.5b shows the best results of semantic discriminability obtained from the feature variables. We see that in most cases the word vectors generated with part-of-speech tags result in higher semantic separability. Similarly to the syntactic discriminability, the dependency context does not show any advantage over the other types of context. The best semantic discriminant ratio is obtained from the feature variables formed with the part-of-speech tags and the window-based context with a large window size k. Our final conclusion on this part is that feature variables formed with different types of contextual features and contexts can be meaningful to different tasks.

To sum up, both components of feature variables, the feature set and the context type, play an important role in the discriminability of principal word vectors. A set of linguistically rich features (e.g. part-of-speech tags) can lead to high syntactic and semantic discriminability if they are used with the proper context. The joint approach of feature combination is a way to take advantage of different types of features. The results presented in this section suggest that the feature variables need to be studied more carefully by themselves. We leave the study of more complicated and more informative features to future work.


Figure 8.5. The best values of Fisher discriminant ratio (FDR) obtained from the principal word vectors with regard to words' part-of-speech tags (top) and named entities (bottom). The word vectors are generated with different types of context function and two types of contextual features, word forms and part-of-speech tags. The bars are grouped by context type: backward and forward neighbourhood (BN, FN), backward, forward, and symmetric window-based (BW, FW, SW), backward, forward, and symmetric union (BU, FU, SU), and dependency (D) contexts.


8.1.2 Number of Dimensions
The number of dimensions, also known as the dimensionality, of word vectors is a key factor for controlling the amount of variation in the original contextual word vectors encoded into the principal word vectors. The dimensionality of principal word vectors is smaller than or equal to the dimensionality of the contextual word vectors, i.e., the number of feature variables. Several ad hoc rules have been proposed by researchers to find the smallest number of dimensions that retains most of the data variation (Jolliffe, 2002, Chapter 6.1). One of the metrics used for this aim is the cumulative percentage of total variance. The total variance of a data matrix is defined as the sum of the eigenvalues of its covariance matrix:

TV(X) = \sum_{i=1}^{m} \lambda_i    (8.1)

where m is the rank of the covariance matrix of the data matrix X and λ_i (i = 1, . . . , m) are the eigenvalues of the covariance matrix. This concept is related to that of the generalized variance of data, which is defined as the product of the eigenvalues of a covariance matrix (see Equation 2.37). Denoting the total variance of a rank m covariance matrix as TV_m, the cumulative percentage of total variance looks for the smallest value k with k < m for which we have

100 \times \frac{TV_k}{TV_m} > \rho    (8.2)

where the threshold value ρ is a task-dependent parameter which usually takes a value between 70% and 90%.
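As a minimal sketch of this rule in Python (assuming the eigenvalues of the covariance matrix are already available; the function name and the default threshold are illustrative, not part of the thesis):

    import numpy as np

    def smallest_k_by_total_variance(eigenvalues, rho=90.0):
        """Smallest k whose top-k eigenvalues retain at least rho percent
        of the total variance (Equations 8.1 and 8.2)."""
        lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
        cum_pct = 100.0 * np.cumsum(lam) / lam.sum()               # cumulative % of TV
        return int(np.searchsorted(cum_pct, rho) + 1)              # 1-based number of dimensions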

Figure 8.6a shows the spectrum of the top 1000 eigenvalues of the covariance matrix of contextual word vectors trained with our default setting. The steep decay in the eigenvalues shows that most of the data variation is encoded into the first few dimensions. Figure 8.6b shows the cumulative percentage of total variance of principal word vectors with m = 1000. As shown, the first 10 dimensions account for more than 99% of the total variance of the data. So, according to the rule of cumulative percentage of total variance, the optimal number of dimensions k should be smaller than 10. However, as we will see later, a large amount of information is encoded into the other dimensions, which significantly affects the spread and discriminability of the principal word vectors. This shows that the rule of cumulative percentage of total variance does not give us the best value of k.

In order to mitigate the problem with the cumulative percentage of total variance and to provide a better view of the data variation, we propose using the log-eigenvalue diagram (Craddock, 1969) for finding the optimal number of dimensions. In this approach, the decision on the number of dimensions k is made on the basis of the logarithm of the eigenvalues, not the eigenvalues themselves.


Figure 8.6. (a) The spectrum of the top 1000 eigenvalues of the covariance matrix of principal word vectors. (b) The cumulative percentage of total variance (TV). The horizontal axis k is the number of dimensions.


The idea behind this approach is that the logarithms of the eigenvalues corresponding to noise in the original data should be small and close to each other. In other words, if we plot the diagram of the logarithms of the eigenvalues sorted in increasing order, called the LEV diagram, we should look for "a point beyond which the graph becomes approximately a straight line" (Jolliffe, 2002, p. 118).

The LEV approach is closely connected to the logarithm of generalized variance used to measure the spread of the principal word vectors. As shown in Equation 7.1, the generalized variance of a set of k-dimensional principal word vectors is equal to the cumulative sum of the logarithms of the top k eigenvalues. Similarly to the cumulative percentage of total variance, one can define the cumulative percentage of the logarithm of generalized variance as 100 \times \frac{LGV_k}{LGV_m}, where k and m are positive integers with k < m, and look for the optimal value of k. Figure 8.7a shows the LEV diagram of the 1000-dimensional principal word vectors, and Figure 8.7b shows the values of the cumulative percentage of the logarithm of generalized variance of the principal word vectors, called the LGV diagram. Although we see a sharp drop in the LEV diagram (Figure 8.7a), the LGV diagram increases linearly with a gentle slope of around 0.1 (see Figure 8.7b). This shows that the large reduction in the spectrum of eigenvalues, which we previously saw in Figure 8.6a, does not necessarily mean that the corresponding dimensions with small eigenvalues encode noisy information and should be eliminated. In fact, our experiments on high-dimensional word vectors show that increasing the number of dimensions significantly improves the spread and the discriminability of the principal word vectors. This is in accordance with our observation in the LGV diagram, which shows that none of the dimensions corresponding to the top 1000 eigenvalues encode noisy data.
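Both diagrams can be computed from the same eigenvalue spectrum. The following is a minimal Python sketch under the assumption that LGV_k is the cumulative sum of the top k log-eigenvalues, as stated above; the function name is illustrative:

    import numpy as np

    def lev_and_lgv_percentage(eigenvalues):
        """Return the LEV curve (log-eigenvalues, descending) and the
        cumulative percentage of the logarithm of generalized variance."""
        lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
        lev = np.log(lam)                       # LEV diagram values
        lgv = np.cumsum(lev)                    # LGV_k for k = 1, ..., m
        lgv_pct = 100.0 * lgv / lgv[-1]         # cumulative percentage of LGV
        return lev, lgv_pct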

Figure 8.8a shows the generalized variance of principal word vectors with different numbers of dimensions ranging from 1 to 1000. Not surprisingly, the values of generalized variance are linearly related to the cumulative percentage of LGV shown in Figure 8.7b. We see that the values of generalized variance are linearly affected by the number of dimensions: increasing the number of dimensions k leads to a higher amount of generalized variance. Figure 8.8b shows the syntactic and semantic discriminability of the principal word vectors with respect to the number of dimensions k. The figure shows that the amount of discriminability of principal word vectors is linearly affected by the number of dimensions. Both the syntactic and semantic discriminability of word vectors increase as the value of k increases. Nevertheless, it is worth noting that generating and processing high-dimensional word vectors require more computational resources, which in many circumstances are not easily available.

We summarize our observations on the number of dimensions as follows. Multiple experiments with two evaluation metrics were carried out to find the optimal number of dimensions of principal word vectors with regard to their spread and discriminability.


Figure 8.7. (a) The LEV diagram and (b) the LGV diagram of principal word vectors with a maximum of 1000 dimensions. The horizontal axis k is the number of dimensions of principal word vectors (the number of principal components). LEV stands for the logarithm of eigenvalue and LGV stands for the logarithm of generalized variance.


Figure 8.8. (a) The logarithm of generalized variance of the k-dimensional principal word vectors. (b) The syntactic and semantic discriminability of the k-dimensional principal word vectors.


One of the metrics uses the cumulative percentage of total variance and the other uses the diagram of the logarithm of the eigenvalues. We have shown that the latter metric is better suited to selecting the dimensionality of word vectors. The experimental results show that the spread and the discriminability of the principal word vectors are directly proportional to the number of dimensions. We also observed that the syntactic discriminability of the principal word vectors grows more steeply than the semantic discriminability of the principal word vectors.

8.1.3 Weighting and Transformation
In this section, we examine how principal word vectors are affected by the transformation step on Line 2 of Algorithm 2 on Page 38. Our experiments in this part are divided into two main types. The first is a set of experiments focused on our intrinsic evaluation metrics, i.e., the data spread and the data discriminability. The second is a set of experiments focused on our extrinsic evaluation metrics, i.e., the contribution of the word vectors to the word similarity benchmark and other NLP tasks such as part-of-speech tagging, named-entity recognition, and dependency parsing. The eigenvalue weighting matrix Λ is selected from the matrices described in Section 5.2.3. In the first series of experiments, the eigenvalue weighting matrix Λ is set to

\Lambda = \sqrt{n-1} \begin{bmatrix} I_k & 0 \\ 0 & 0 \end{bmatrix}    (8.3)

where n is the vocabulary size and k is the number of dimensions, i.e., k = 100. In the second series of experiments, the matrix is set to

\Lambda = \alpha \sqrt{n-1} \begin{bmatrix} \Sigma_{(k,k)}^{-1} & 0 \\ 0 & 0 \end{bmatrix}    (8.4)

where α is a constant, n is the vocabulary size, and \Sigma_{(k,k)}^{-1} is the inverse of the matrix \Sigma_{(k,k)} consisting of the top k rows and k columns of the covariance matrix of the word vectors. The reason for choosing the eigenvalue weighting matrix in Equation 8.4 is that the word vectors generated by this matrix are more suitable for use by a neural network (see Section 5.2.3), and the tools used in these experiments (i.e. veceval (Nayak et al., 2016b) for tagging and the parser of Chen and Manning (2014) for parsing) use neural networks as their classifiers (see Section 7.1.2).
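A minimal sketch of how the two eigenvalue weighting matrices can be built, assuming the covariance matrix of the word vectors is available as a NumPy array and taking the \sqrt{n-1} scaling from Equations 8.3 and 8.4 as reconstructed above (the helper name is illustrative):

    import numpy as np

    def eigenvalue_weighting(cov, n, k, alpha=None):
        """Build the eigenvalue weighting matrix of Eq. 8.3 (alpha=None)
        or Eq. 8.4 (alpha given), zero-padded to the size of cov."""
        m = cov.shape[0]
        lam = np.zeros((m, m))
        if alpha is None:                     # Equation 8.3
            lam[:k, :k] = np.sqrt(n - 1) * np.eye(k)
        else:                                 # Equation 8.4
            lam[:k, :k] = alpha * np.sqrt(n - 1) * np.linalg.inv(cov[:k, :k])
        return lam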

The transformation is controlled by three parameters: the weight matrices Φ and Ω and the transformation function f. As explained in Section 5.2.1, Ω is a matrix used for weighting words and Φ is a metric matrix used for scaling the elements of contextual word vectors. In Table 5.1, we proposed different diagonal matrices to be used as Ω and Φ.


As explained in Section 5.2.2, the transformation function f adds some degree of non-linearity to the principal word embedding model. We restrict our experiments to the power transformation function, which can be defined in two ways. The first is to use a vector of power values whose elements correspond to the elements of contextual word vectors. Let V = (v_1, . . . , v_m) denote a contextual word vector. Then the power transformation function defined with the power vector θ = (p_1, . . . , p_m) maps V to

f(V; \theta) = (v_1^{p_1}, . . . , v_m^{p_m})^T    (8.5)

The optimal value of θ, which maximizes Equation 5.4, is estimated by simulated annealing with p_i ∈ (0,1]. The second way is to use a scalar power value for all elements of the contextual word vectors. Using the same notation as before, the power transformation function defined with the scalar power value θ = p maps V to

f(V; \theta) = (v_1^{p}, . . . , v_m^{p})^T    (8.6)

In this case too, the optimal value of the power is estimated by simulated annealing through maximizing Equation 5.4 with p ∈ (0,1].
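The two variants can be written in Python as a small sketch (the function names are illustrative, and the simulated annealing search for the optimal powers over Equation 5.4 is not reproduced here):

    import numpy as np

    def power_transform_vector(V, theta):
        """Vector power transformation (Eq. 8.5): one power per element."""
        return np.power(np.asarray(V, dtype=float), np.asarray(theta, dtype=float))

    def power_transform_scalar(V, p):
        """Scalar power transformation (Eq. 8.6): a single power for all elements."""
        return np.power(np.asarray(V, dtype=float), float(p))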

Figure 8.9 shows the values of generalized variance of principal word vectors with respect to different combinations of the weight matrices and the transformation function. In addition to the results obtained with the power transformation function, the figure shows the results obtained from the identity transformation function I, which returns its input unchanged and thus corresponds to applying no transformation.

From Figure 8.9, we see that the spread of principal word vectors is highly influenced by both the weight matrices and the transformation function. The highest data spread is obtained from the principal word vectors with no transformation, Φ = I, Ω = I, and the transformation function f = I. Most of this high data spread is due to the unbalanced and excessive contribution of the top eigenvalues in Equation 7.1. As elaborated in Section 5.2.2, the transformation function in Equation 5.4 reduces the excessive effect of the top eigenvalues by compressing the data distribution along the top eigenvectors and expanding the distribution along the other eigenvectors. This data normalization leads to some reduction in the generalized variance of the principal word vectors (see the bars related to (I,I) and the power transformation function). This reduction in the data spread is due to the very large data compression along the top eigenvectors.

The second pair of bars in Figure 8.9 (Φ = iff and Ω = I) shows the effect of inverse feature frequency on the data spread. We see that the weighting matrix leads to a large compression of the principal word vectors, which is then slightly expanded by the power transformation function.

The third pair of bars in Figure 8.9 (Φ = isf and Ω = I) shows how the spread of the principal word vectors is affected by the inverse of the standard deviation of the feature variables.


Figure 8.9. The logarithm of generalized variance of principal word vectors generated by different types of weight matrices and transformation functions. The bars are grouped by the weight matrices (Φ,Ω) ∈ {(I,I), (iff,I), (isf,I), (I,iwf), (iff,iwf)}, and each group compares the identity, vector power, and scalar power transformation functions.

In other words, it shows how the spread of the principal word vectors is affected if we compute the principal word vectors from the correlation matrix of contextual word vectors instead of using their covariance matrix. As mentioned in Section 5.2.1, performing PCA on the correlation matrix instead of the covariance matrix is a way to mitigate the imbalanced contribution of feature variables to the data spread. We see that the spread of the principal word vectors generated from the correlation matrix with no transformation is close to the spread of the principal word vectors generated from the covariance matrix with the power transformation (see the blue bar in the first pair of bars in Figure 8.9). In terms of data spread, this means that the effect of the power transformation function on the distribution of principal word vectors is as good as the effect of performing PCA on the correlation matrix. However, as we will see later, this does not mean that they will necessarily result in the same value of data discriminability. Performing the power transformation on the contextual word vectors normalized by the standard deviation of their feature variables reduces the spread of the resulting principal word vectors. Part of this reduction is due to the randomness involved in the estimation of the entropies in Equation 5.4. Part of it can also be due to the fact that the metric matrix Φ = isf normalizes the contextual word vectors along their basis vectors, which are associated with the feature variables, whereas the transformation function normalizes the data along the eigenvectors of their covariance matrix, which are not necessarily equal to the basis vectors.

The two remaining pairs of bars in Figure 8.9 show the effect of the weight matrix Ω on the spread of principal word vectors. The negative value of the data spread shows that most of the eigenvalues describing the data spread are smaller than one. This means that the spread of the principal word vectors is drastically decreased when we weight the observations with their inverse frequency. However, this negative effect of the weight matrix Ω on the spread of principal word vectors is largely cancelled out by the power transformation function. We see that the spread of the principal word vectors after performing the power transformation is comparable to that of the other sets of word vectors. In general, we find that the power transformation normalizes the data in the expected way, i.e., it compresses over-expanded data and expands highly massed data.

We now study the effect of the weighting mechanism and the transformation function on the discriminability of principal word vectors. Figure 8.10 shows the syntactic and semantic discriminability of principal word vectors obtained from the same combinations of weight matrices and transformation functions as above. As shown in the figures, the power transformation leads to a significant improvement in both the syntactic and semantic discriminability of the principal word vectors. This improvement is clearest in the cases where the discriminability is very small, Ω = iwf. All weight matrices reduce the discriminability of principal word vectors. In fact, the best results are obtained from the identity weight matrices, Φ = I and Ω = I, and the power transformation function. We see that the feature weighting, which is controlled by Φ, has a stronger effect on the syntactic and semantic discriminability of word vectors than the observation weighting, which is controlled by Ω.

In the remaining part of this section, we study how the weighting mechanism and the transformation function affect the performance of the principal word vectors in tasks such as the word similarity benchmark, part-of-speech tagging, named-entity recognition, and dependency parsing (see Section 7.1.2). As mentioned above, the eigenvalue weighting matrix Λ in these experiments is set as in Equation 8.4 with α = 0.1. The parameter α is set on the basis of the standard deviation of the initial weights of the neural network classifier used in the parser.

Figure 8.11 summarizes the results obtained from the similarity benchmark. The vertical axis is the average of the similarity correlations obtained from 13 word similarity data sets (see Section 7.1.2). As shown, the power transformations contribute more to the task than the weight matrices. The best results are obtained from the scalar power transformation function with identity weight matrices. Almost the same result is also obtained from the scalar power transformation with Φ = iff and Ω = I. Similarly to the data discriminability, we see that the weighting matrix Ω = iwf has a negative effect on the results.


Figure 8.10. The (a) syntactic and (b) semantic discriminability of principal word vectors generated by different types of weight matrices and transformation functions.


Figure 8.11. The average of the results obtained from the word similarity benchmark. The vertical axis is the average similarity correlation; the bars are grouped by the weight matrices (Φ,Ω) and the transformation function (identity, vector power, scalar power).

Given these observations, one might ask whether the weighting matrix iwf always has a negative effect on the performance of principal word vectors. Later on, we will experimentally show that the contribution of the weighting matrix Ω completely depends on the task, and different weighting matrices can be useful for different tasks.

As mentioned above, each bar in Figure 8.11 shows the average of 13 correlation values obtained from the tasks in the word similarity benchmark (Faruqui and Dyer, 2014). In order to gain a better understanding of the results obtained from each task, we break them down by task. Since the transformation functions contribute more to the average of the results than the weight matrices, we focus on the transformation functions and examine their impact on the correlation values obtained from the 13 data sets. To this end, we set Φ = I and Ω = I and examine the variation of the correlation values with regard to the transformation functions. This corresponds to the first category of bars (from left to right) in Figure 8.11. Figure 8.12 shows a breakdown of the bars with regard to the data sets. As shown, except for MC-30, the scalar power transformation gives higher results on all of the data sets. The results obtained from the vector power transformation are always weaker than the results obtained from the scalar power transformation. A comparison between the results obtained from the vector power transformation and the identity transformation shows that neither one is superior to the other and their results completely depend on the data sets.


Figure 8.12. The breakdown of the results obtained from the word similarity benchmarks with Φ = I, Ω = I, and different types of transformation functions. The benchmarks are WS-353, WS-353-SIM, WS-353-REL, MC-30, RG-65, Rare-Word, MEN, MTurk-287, MTurk-771, YP-130, SimLex-999, Verb-143, and SimVerb-3500. Information about the benchmarks can be found in Faruqui and Dyer (2014).

With some data sets (e.g. WS-353, MC-30, and RG-65), the identity transformation (i.e. no transformation) works better than the vector power transformation, and with some other data sets (e.g. Rare-Word, MEN, and Verb-143) the vector power transformation works better than the identity transformation. In this thesis, we do not investigate why a particular transformation function works better than other types of transformation on a task. Instead, from these observations, we conclude that the scalar power transformation is more likely to result in higher correlations, while the results obtained from the identity and the vector power transformation depend on the data sets.

Figure 8.13 shows how the accuracy of part-of-speech tagging is affected by principal word vectors generated with different settings of weight matrices and transformation functions. As explained in Section 7.1.2, due to the limitation of veceval in processing word vectors with an arbitrary number of dimensions, the experiments are performed with 50-dimensional word vectors. Regardless of the weight matrices, the scalar power transformation function results in higher part-of-speech tagging accuracy. The metric matrix Φ has almost no effect on the results.


Figure 8.13. Part-of-speech tagging accuracy. The tagging models are trained on sections 00–18 of WSJ and tested on sections 19–21 of WSJ. veceval is used for training and testing the tagging models.

However, the results are significantly affected by the weight matrix Ω. This can be seen in the results obtained from the identity transformation function I, when no transformation is applied to the contextual word vectors. We see that the results obtained from the identity transformation I and the weight matrix iwf are very close to the best results obtained from the scalar power transformation function. In order to study the effect of the weight matrix iwf and the scalar power transformation, we compare the two settings (I,I) with the scalar power transformation (i.e. the brown bar in the first group of bars) and (I,iwf) with the identity transformation function (i.e. the blue bar in the fourth group of bars from the left). The comparison shows that the weight matrix iwf is more important for part-of-speech tagging than the scalar power transformation.

When it comes to the vector power transformation, we see that the transformation function works better than the identity function if the weight matrix Ω is the identity matrix (see the first three groups of bars from the left). However, when the weight matrix Ω is set to iwf, the vector power transformation function results in lower accuracies than the other transformation functions. In general, the vector power transformation results in lower accuracies than the scalar power transformation, but depending on the weight matrix Ω, it results in higher or lower accuracies than the identity transformation.


We summarize our observations on the effect of weight matrices and transformation functions on the accuracy of part-of-speech tagging as follows. The metric matrix Φ has almost no effect on the accuracies. The weight matrix Ω = iwf has a significant effect on the results. Among the transformation functions, the scalar power transformation is the most meaningful to the task, especially when it is used with the weight matrix Ω = iwf. The results obtained from the scalar power transformation and the other weight matrices are not as good as the results obtained from Ω = iwf with the identity transformation. However, using Ω = iwf with the scalar power transformation function results in the maximum accuracy on part-of-speech tagging.

Figure 8.14 shows the results of named-entity recognition using the principal word vectors generated with different weighting matrices and transformation functions. The experiments are carried out with 50-dimensional word vectors as explained in Section 7.1.2. We see in all groups of bars that the scalar power transformation results in higher accuracies than the other transformation functions. The metric matrices Φ = iff and Φ = isf have negative effects on the accuracies. However, the weight matrix Ω = iwf results in significant improvements in the NER accuracies. The results obtained from Ω = I and the scalar power transformation are higher than the results obtained from the other settings. The highest accuracy is obtained from Φ = I, Ω = iwf, and the scalar power transformation. To sum up, the scalar power transformation consistently results in higher NER accuracies. The weight matrix Ω = iwf is more meaningful to the NER task than the other metric and weight matrices.

Figure 8.15 shows the dependency parsing results obtained from the principal word vectors generated with different weighting matrices and transformation functions. The figure shows the unlabelled attachment scores obtained from the dependency parser of Chen and Manning (2014) on the development set of WSJ. In most cases, the scalar power transformation results in the highest parsing accuracy. The best result is obtained from Φ = iff and Ω = iwf together with the scalar power transformation. Both the vector power transformation and the scalar power transformation result in higher parsing accuracies than the identity transformation. This shows that both transformations are meaningful to dependency parsing. This contrasts with what we observed in the experiments with part-of-speech tagging and named-entity recognition, where the vector power transformation leads to smaller accuracies than the identity transformation.

When it comes to the metric and weight matrices, we see that the results are more influenced by the weight matrix Ω than by the metric matrix Φ. The metric matrix iff has almost no effect on the parsing accuracy when it is used with Ω = I together with the identity or the vector power transformation. However, when it is used with Ω = iwf, it leads to significant increases in parsing accuracy, especially when the scalar power transformation is used.

We sum up our observations on the parsing experiments as follows.


Figure 8.14. Named-entity recognition accuracy. The recognizer is trained on the training data set provided by the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) and tested on the testa data set provided by the shared task.

The power transformation in general (i.e. the vector or scalar power transformation) has a positive effect on the parsing task. The metric matrix Φ = iff and the weight matrix Ω = iwf lead to a significant improvement in parsing accuracy if they are used together. The best parsing accuracy is obtained with (Φ,Ω) = (iff, iwf) and the scalar power transformation.

The results obtained from the intrinsic and extrinsic evaluation metrics are summarized as follows. Our observations on the intrinsic evaluation metrics, the spread and discriminability of word vectors, show that proper weight matrices and transformation functions can decrease the spread of principal word vectors and increase the discriminability of word vectors. The reduction in the spread of word vectors is equivalent to compressing the vectors. This compression is mostly in the direction of the top eigenvectors of the covariance matrix of word vectors. Hence, the weight matrices and the transformation function mitigate the excessive effect of the top eigenvectors on the word vectors. The discriminability of the principal word vectors is also highly influenced by the weighting matrices and the transformation functions. The scalar and vector power transformation functions increase the discriminability of word vectors.


Figure 8.15. The unlabelled attachment score on the development set (Section 22) of WSJ. The parser of Chen and Manning (2014) is used for training and testing the parsing models.

The highest values of syntactic discriminability of principal word vectors are obtained from the vector power transformation function, but the highest values of semantic discriminability are obtained from the scalar power transformation function. The results obtained from the extrinsic evaluation metrics show that the scalar power transformation function is the most meaningful to the different tasks. However, the optimal choice of weight matrices differs across tasks. This shows the importance of using proper weight matrices for different tasks.

8.2 Comparison
In this section, we compare the results obtained from principal word embeddings with other sets of word embeddings generated by popular methods such as word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), HPCA (Lebret and Collobert, 2014), and random indexing (RI) (Sahlgren, 2006). The comparison settings are outlined in Section 7.3. word2vec is used in two modes, the continuous bag-of-words model (CBOW) and the skip-gram model (SGRAM). We use our own implementation of HPCA and RI. The comparison is based on both the intrinsic and extrinsic evaluation metrics described in Section 7.1. These metrics include:


1. the spread of word vectors in terms of the logarithm of generalized variance (see Equation 7.1)
2. the syntactic and the semantic discriminability of the word vectors (see Equation 2.40)
3. the performance of the word vectors in the word similarity benchmark, part-of-speech tagging, named-entity recognition, and dependency parsing (see Section 7.1.2)

All sets of word vectors are extracted from the same raw corpus described in Section 7.4. Except for the set of word vectors extracted by word2vec, all embeddings are trained with the same setting, a backward neighbourhood context with length 1, i.e., the context refers to the immediately preceding word and the contextual features are the word forms. word2vec is trained with a symmetric neighbourhood context of length 1, i.e., the context is formed by the immediately preceding and succeeding words. The number of iterations in GloVe is set to 50, and in word2vec, it is set to 1. All methods are trained with five threads, if multi-threading is supported. The part-of-speech tagging and named-entity recognition experiments are carried out with 50-dimensional word embeddings, but the other experiments are performed with 100-dimensional word embeddings.

Table 8.1 summarizes the results. The table is divided into two parts. The last five rows show the results obtained from the principal word vectors with different combinations of weighting matrices Φ, Ω, and transformation function f, which are represented by the triple (Φ,Ω,f). Among all combinations of (Φ,Ω,f), we report the results of those combinations that yield the highest results on each of the evaluation metrics. For example, we report the results obtained from (Φ,Ω,f) = (I,I,I) since it yields the highest data spread (Log. GV). The comparison on parsing is made on the test set of WSJ, Section 23.

The results show that the spread of the principal word vectors generated with (I,I,I) is significantly higher than that of the other sets of word vectors. However, this set of word vectors shows poor performance when the other evaluation metrics are used. In terms of syntactic discriminability, we see that the principal word vectors trained with identity weight matrices and the vector power transformation result in the highest syntactic discriminability. The best results for semantic discriminability are obtained from HPCA and the principal word vectors with the scalar power transformation function.

In terms of word similarities, principal word vectors work better than RI and HPCA. The principal word vectors generated by (I,I,sp) are on par with the vectors generated by CBOW and GloVe. However, we see that the word vectors generated by SGRAM result in a higher value of word similarity correlation than the principal word vectors.


              Log.GV  Syn.Disc.  Sem.Disc.  Sim.Corr.   POS   NER   UAS   LAS
RI               202        2.6        0.2        0.2   95.4  93.6  90.5  88.2
HPCA             347        2.1        0.5        0.2   96.3  96.2  90.7  88.6
CBOW             622        0.8        0.1        0.5   96.3  96.2  92.1  90.1
SGRAM            564        0.9        0.2        0.6   96.4  97.2  92.1  90.0
GloVe            525        0.9        0.1        0.5   95.2  97.4  91.9  89.9
==============================================================================
(I,I,I)         1454        3.0        0.2        0.2   91.9  91.1  89.9  87.6
(I,I,vp)         772        3.9        0.4        0.4   95.2  94.5  90.5  88.3
(I,I,sp)         554        2.0        0.5        0.5   95.8  95.8  91.5  89.4
(I,iwf,sp)       314        0.8        0.4        0.2   96.2  96.7  91.3  89.1
(iff,iwf,sp)     339        1.3        0.5        0.4   96.0  96.4  91.9  89.9

Table 8.1. The comparison between principal word vectors and other sets of word vectors. The results obtained from the principal word vectors are shown in the second part of the table, below the double line. The triples show the specific parameter settings (Φ,Ω,f), where Φ and Ω are the weighting matrices and f is the transformation function. f = vp and f = sp refer to the power transformation function with a vector of power values and a scalar power value respectively. The results of POS tagging are obtained from sections 19–21 of WSJ. The results of NER are obtained from the testa data set from the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003). UAS and LAS stand for the unlabelled and labelled attachment scores respectively. The parsing results are obtained on the test set of WSJ.

In order to gain a better understanding of the comparison results, we perform a statistical significance test on the best result obtained by the principal word vectors and the results obtained by the other sets of word vectors. To this end, we use the statistical significance test method proposed by Berg-Kirkpatrick et al. (2012).
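This test is a paired bootstrap over the evaluation instances. A minimal Python sketch of the procedure, assuming per-instance scores of the two systems are available as arrays (the function name and the number of bootstrap samples are illustrative):

    import numpy as np

    def paired_bootstrap_pvalue(scores_a, scores_b, n_samples=10000, seed=0):
        """Approximate p-value for H0: A is not better than B, following the
        paired bootstrap design described by Berg-Kirkpatrick et al. (2012)."""
        rng = np.random.default_rng(seed)
        a, b = np.asarray(scores_a, dtype=float), np.asarray(scores_b, dtype=float)
        delta = a.mean() - b.mean()              # observed difference on the test set
        n, count = len(a), 0
        for _ in range(n_samples):
            idx = rng.integers(0, n, n)          # resample instances with replacement
            if a[idx].mean() - b[idx].mean() > 2 * delta:
                count += 1
        return count / n_samples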

The first row in Table 8.2 (i.e. Sim. Corr.) shows the p-value under the null hypothesis H0: A is not better than B, where A refers to the best set of principal word vectors, generated with (I,I,sp), and B is any of the other word embedding methods mentioned above. The table shows that the null hypothesis is rejected with high confidence for RI and HPCA, but not for the other methods. This confirms the superiority of principal word vectors over HPCA and RI. The p-value of 1.00 obtained from SGRAM indicates that the null hypothesis cannot be rejected. This basically means that the contribution of the principal word embeddings to the word similarity correlations is not better than the contribution of the word embeddings generated by SGRAM.

In part-of-speech tagging, the principal word vectors generated by (I,iwf,sp) result in the highest tagging accuracy among the principal word vectors. The word vectors generated by SGRAM result in the maximum part-of-speech tagging accuracy overall.


            RI    HPCA   CBOW   SGRAM   GloVe
p-value   0.00    0.00   0.54    1.00    0.43

Table 8.2. p-value of the null hypothesis H0: principal word vectors are not better than the other sets of word vectors on the average of the word similarity benchmarks.

         RI    HPCA   CBOW   SGRAM   GloVe    PWE
RER    0.35    0.21   0.22    0.19    0.22   0.22

Table 8.3. The relative error reduction (RER) on part-of-speech tagging using different sets of word embeddings. PWE refers to principal word vectors generated by Φ = I, Ω = iwf, and the scalar power transformation function.

The tagging results obtained from the principal word vectors are higher than the results obtained from RI and GloVe. However, they are lower than the results obtained from the other sets of word vectors, HPCA, CBOW, and SGRAM.

In order to know whether the differences between the POS tagging results are significant, one may want to perform a statistical significance test on the results, as we did for the results obtained from the word similarity benchmarks. However, veceval does not provide any straightforward way to perform such a test. Instead, veceval reports a relative error reduction compared with an SVD baseline model. This metric gives an indication of whether a certain improvement over the baseline is substantial. For example, one may want to know if a 1% improvement in part-of-speech tagging is substantial. The relative error reduction is not as meaningful as a statistical significance test. It does not clearly indicate whether a certain difference between the results obtained from different word embeddings is statistically significant or not. It only provides us with a rough estimate of the amount of improvement over the baseline. For example, a 1% improvement over a baseline value of 80% might not be as significant as a 1% improvement over a baseline of, say, 98%.

Table 8.3 shows the values of relative error reduction obtained from the different sets of word embeddings. SGRAM results in the smallest relative error reduction. This shows that the word embeddings trained by SGRAM are more beneficial to the task of part-of-speech tagging than the other sets of word embeddings. The principal word embeddings (PWE), CBOW, and GloVe result in the same value of relative error reduction. HPCA also results in almost the same value of relative error reduction as the principal word embeddings. This shows that the principal word embeddings are on a par with most of the word embeddings and slightly weaker than SGRAM in terms of part-of-speech tagging.


         RI    HPCA   CBOW   SGRAM   GloVe    PWE
RER    0.33    0.24   0.21    0.22    0.19   0.24

Table 8.4. The relative error reduction (RER) on named-entity recognition using different sets of word embeddings. PWE refers to principal word vectors generated by Φ = I, Ω = iwf, and the scalar power transformation function.

In named-entity recognition (NER), Table 8.1 shows that the best result is obtained from GloVe. Among the different settings of principal word vectors, (I,iwf,sp) results in the highest NER accuracy, which is higher than the results obtained from RI, HPCA, and CBOW but smaller than the results obtained from SGRAM and GloVe. In order to see whether the differences between the results are significant or not, we report the relative error reductions obtained from each set of word embeddings (see the above argument about part-of-speech tagging).

Table 8.4 shows the values of relative error reduction obtained from the different sets of word embeddings on the task of named-entity recognition. GloVe results in the smallest value of relative error reduction. This shows that the word embeddings trained by GloVe are more meaningful to NER than the other sets of word embeddings. The result obtained from PWE is as good as the result obtained from HPCA but lower than the results obtained from CBOW, SGRAM, and GloVe.

In parsing, Table 8.1 shows that the principal word vectors generated with (iff,iwf,sp) and the vectors generated by GloVe result in the same parsing scores, which are significantly higher than the results obtained from the RI and HPCA word vectors. However, we see that CBOW and SGRAM result in slightly higher parsing scores. In order to see whether the slight superiority of these embeddings is due to chance, we perform a statistical significance test on the parsing results using the method of Berg-Kirkpatrick et al. (2012). Table 8.5 shows the p-value under the null hypothesis H0: A is not better than B, where A refers to the set of principal word vectors generated by (iff,iwf,sp) and B is any of the other word embedding methods mentioned above. The table shows that the null hypothesis is rejected with high confidence for RI and HPCA, but not for the other methods. This confirms the superiority of principal word vectors over HPCA and RI. The p-values obtained from word2vec and GloVe neither clearly accept nor reject the null hypothesis. The p-values, however, indicate that these word embedding methods are as good as the principal word embedding. Hence, we conclude that, in terms of dependency parsing, the results obtained from principal word embeddings are on a par with the results obtained from GloVe and word2vec and better than the results obtained from HPCA and RI.


            RI    HPCA   CBOW   SGRAM   GloVe
p-value   0.00    0.00   0.55    0.60    0.65

Table 8.5. p-value of the null hypothesis H0: principal word vectors are not better than the other sets of word vectors on the task of dependency parsing.

         RI    HPCA   GloVe    PWE
Sec.    180     480    8040    900

Table 8.6. The amount of time (seconds) required by each of the word embedding methods to perform the dimensionality reduction. PWE refers to the principal word embedding method introduced in this thesis.

Now we turn our attention to the efficiency of the word embedding methods. As discussed in Section 7.3, not all of the word embedding methods follow the same software architecture. The architectural differences between the word embedding methods can influence the comparison between the methods. RI, HPCA, GloVe, and principal word embeddings were developed in the area of distributional semantics (see Section 3.2.2). These methods follow three main steps: 1) building a contextual matrix, 2) performing a transformation function on the contextual matrix, and 3) reducing the dimensionality of the column vectors of the transformed matrix. In terms of software engineering, the first two steps are carried out in almost identical ways. The main difference between these methods is in the way they perform the dimensionality reduction. However, word2vec is a method that was developed in the area of language modelling. It uses a neural network architecture to train low-dimensional word vectors.
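A minimal sketch of this three-step pipeline in Python, using a plain co-occurrence count matrix, the identity transformation, and an SVD-based dimensionality reduction (the corpus handling, vocabulary indexing, and function name are illustrative and do not reproduce the exact implementations compared here):

    import numpy as np

    def count_pca_embeddings(sentences, k=10):
        """Step 1: build a word-by-context count matrix from the immediately
        preceding word; Step 2: apply the (identity) transformation; Step 3:
        reduce the dimensionality with an SVD of the centred matrix."""
        vocab = sorted({w for s in sentences for w in s})
        idx = {w: i for i, w in enumerate(vocab)}
        counts = np.zeros((len(idx), len(idx)))
        for s in sentences:
            for prev, word in zip(s, s[1:]):      # backward context of length 1
                counts[idx[word], idx[prev]] += 1
        centred = counts - counts.mean(axis=0)    # centre the feature variables
        u, sigma, vt = np.linalg.svd(centred, full_matrices=False)
        return {w: u[i, :k] * sigma[:k] for w, i in idx.items()}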

In order to mitigate the effect of the architectural differences between the word embedding methods, we first compare RI, HPCA, GloVe, and principal word embeddings. Then we compare principal word embeddings with word2vec. As mentioned above, the comparison concerns the efficiency of the dimensionality reduction step and is based on the CPU time needed to perform the dimensionality reduction. Table 8.6 shows the time required by each method to perform the dimensionality reduction.

We see that in terms of the CPU time required to perform the dimensionality reduction, the most efficient method is RI, which is around five times faster than the principal word embedding. The principal word embedding is almost two times slower than HPCA but nine times faster than GloVe.


All of these methods require the same amount of time, around two hours, to scan the training corpus and to build and transform the contextual matrix. Therefore, the total amount of time required by the principal word embedding method to extract a set of word vectors from the raw corpus is around two hours and 15 minutes. By contrast, word2vec needs more than ten hours to generate the final word vectors. This shows that the principal word embedding method is faster than GloVe and word2vec but slower than RI and HPCA. Together with what is shown in Table 8.1, this shows that the principal word embedding method is more efficient than word2vec and GloVe and on a par with them in terms of the extrinsic evaluation metrics.

8.3 Summary
We summarize the results obtained from our experiments on the parameters of the principal word embedding as follows. The parameters of principal word embedding studied in this chapter are: the feature variables, the number of dimensions, the metric and weight matrices, and the transformation function. We analysed different types of feature variables formed by word forms and part-of-speech tags. The metric and weight matrices were set to different values and multiple combinations of them were studied. We set the transformation function to the identity function and two types of power transformation: a scalar power transformation and a vector power transformation.

In terms of our intrinsic evaluation metrics, the spread and the discriminability of the principal word vectors are sensitive to almost all parameters of the word embedding method. The spread of the word vectors is mostly influenced by the number of dimensions and the feature variables. The weighting matrices and the transformation functions, in general, reduce the spread of the principal word vectors. In our experiments with the weighting matrices and the transformation functions, the highest spread was obtained from word vectors trained with the identity weight matrices and the identity transformation function. By contrast, the non-identity weighting matrices and transformation functions studied in this chapter compress the word embeddings.

The discriminability of principal word vectors is also influenced by the feature variables. The highest amount of discriminability is obtained from feature variables formed by part-of-speech tags. The discriminability of word vectors is directly proportional to the number of dimensions. We have shown that the syntactic and the semantic discriminability of principal word vectors grow with different slopes as the number of dimensions increases. The power transformation functions have more effect on the discriminability of word vectors than the weighting matrices. The highest amount of discriminability is obtained from a power transformation and the identity weighting matrices.

In terms of our extrinsic evaluation metrics, the performance of principal word embeddings varies with different combinations of weight matrices and transformation functions.


mations. The metric matrices are more beneficial to the word similarity andthe dependency parsing tasks. However, the weight matrices are more ben-eficial to the part-of-speech tagging and the named-entity recognition tasks.In general, we conclude that the weighting mechanism and the transforma-tion function have high impacts on the quality of word embeddings and theircontribution to the other tasks.

We also compared principal word embedding with other word embedding methods. The results showed that principal word embeddings are better than or on par with most of the other word embeddings. In terms of the intrinsic evaluation metrics, principal word embeddings result in higher values of data spread and discriminability. In terms of the extrinsic evaluation metrics, principal word embeddings yield results comparable with those obtained from other word embedding methods. The experimental results on the efficiency of the word embedding methods show that principal word embedding is more efficient than the popular word embedding methods in terms of the time required to generate a set of word embeddings.


9. Conclusion

Word embeddings are real-valued vectors representing words of a language in a vector space. These vectors capture global syntactic and semantic regularities of words observed in a corpus. In this thesis, we study a word embedding method based on principal component analysis (PCA), called principal word embedding. The application of PCA to word embedding is not a novel idea. PCA, one of the most commonly used feature extraction and dimensionality reduction methods, has been used in document and word embedding methods such as latent semantic analysis (LSA) and hyperspace analogue to language (HAL) for many years.

When used as a technique to generate word embeddings, PCA is applied to a data matrix whose elements are the frequency of seeing words in different contexts. The word frequency data matrix is often a very large and sparse matrix. Performing PCA on this matrix requires a huge amount of memory and CPU time, which limits the use of PCA for word embedding. Another limiting factor is the distribution of the word frequencies. The word frequencies often follow a Zipfian distribution, which is not very suitable for PCA. Although PCA does not make any assumption about the data distribution, in practice it works better when the data follows a distribution closer to the normal distribution. The Zipfian distribution of word frequencies, however, is far from a normal distribution, which further limits the use of PCA for word embedding.

In this thesis, we have formulated four research questions regarding the use of PCA for word embedding. Each of these questions is studied in detail. In the remainder of this chapter, we review the four research questions and draw conclusions from our work. We also discuss future work related to this thesis.

9.1 Limitations of PCA

The first research question addresses the limitations of using PCA for word embedding. In order to answer this question, we introduced the concept of a contextual word vector, whose elements are the frequencies of seeing a word with different contextual features in different contexts. We then formulated a mixture model of contextual word vectors on which PCA is performed in order to generate a set of low-dimensional word embeddings.


We studied the distribution of the mixture model in detail. In our studies, we considered different scenarios for the distribution of contextual features over words and provided a statistical view of the mean vector and the covariance matrix of the mixture model. We showed that the mean vector and the covariance matrix are highly influenced by their number of dimensions, which is equal to the number of contextual features. The mean vector of a mixture model with a large number of dimensions is very close to the null vector. The density of the mixture model is heavily concentrated around the mean vector, with a long tail corresponding to the highly frequent contextual features. This indicates that most of the variation in the mixture model is along a few top eigenvectors of the covariance matrix of the mixture model, corresponding to a very sharp decay in the spectrum of the eigenvalues of the covariance matrix.

We argued that the distribution of the mixture of contextual word vectors is unsuitable for PCA. This is because the principal components of the mixture of contextual word vectors are highly affected by a small number of top eigenvectors of the covariance matrix of the mixture model. As a result, the values of a small number of the elements of the principal components will be much larger than the other elements. This skewed distribution of the elements of the principal components negatively affects the performance of some machine learning tools such as neural networks. In the case of neural networks trained with the standard back-propagation algorithm, such distributions may lead to weight saturation, which stops the gradient propagation. In general, PCA works better in many practical cases on normally distributed data, even though it does not make any specific distributional assumption. The distribution of the mixture of contextual word vectors, however, is far from the normal distribution. Hence, we consider the distribution of the mixture of contextual word vectors to be one of the limiting factors of using PCA for word embedding.

The other limiting factor of using PCA for word embedding is related to the dimensionality of the mixture of contextual word vectors. In practice, the dimensionality of the mixture models is often very large, resulting in a very large sample matrix to be processed by PCA. Processing and storing this matrix requires a large amount of CPU time and memory. However, due to the Zipfian distribution of word frequencies, the sample matrix often has a large degree of sparsity. The sparsity of the word frequency sample matrix helps us to process the matrix with limited resources. However, the mean-centring step in PCA destroys the sparsity of the matrix. After performing the mean-centring step, the word frequency matrix becomes a dense matrix, which means that significantly more computational resources are required to process it.

The two limitations mentioned above were mitigated by a singular value decomposition (SVD) method, which makes it easier to perform PCA on mixtures of contextual word vectors. The SVD method was proposed as a core part of a generalized PCA algorithm used as a replacement for classic PCA for word embedding. We provide a summary of this algorithm in the next section.


9.2 Effective and Efficient PCA

Our second research question concerns the effective and efficient use of PCA for word embeddings. This question was answered by proposing algorithms that mitigate the two limiting factors of using PCA for word embedding. As described in the previous section, the first factor is related to the skewed distribution of mixtures of contextual word vectors, and the second factor is related to the size of the data matrix sampled from the mixture models.

The limiting factor related to the data distribution was mitigated by an adaptive transformation function, which reshapes the data distribution and makes it more suitable for principal component analysis. We proposed a power transformation function that maximizes the entropy of the data in the word frequency matrix while preserving the relative order of its elements. We showed that, after performing the power transformation, the distribution of the data becomes closer to the normal distribution, making it more suitable for PCA.
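To make the idea concrete, the sketch below illustrates one way such an exponent could be chosen; it is not the exact optimization procedure used in the thesis. The histogram-based entropy estimate, the grid of candidate exponents, and the function names are illustrative assumptions.

import numpy as np

def transformed_entropy(freqs, alpha, bins=64):
    # Histogram-based estimate of the entropy of the power-transformed values.
    hist, _ = np.histogram(np.power(freqs, alpha), bins=bins)
    p = hist[hist > 0] / hist.sum()
    return -np.sum(p * np.log(p))

def fit_power_exponent(freqs, candidates=np.linspace(0.05, 1.0, 20)):
    # Pick the exponent whose power transformation maximizes the entropy.
    # A power function with a positive exponent preserves the relative
    # order of the frequencies, as required above.
    return max(candidates, key=lambda a: transformed_entropy(freqs, a))

# Toy usage on Zipfian-like counts.
freqs = np.random.zipf(a=2.0, size=10000).astype(float)
alpha = fit_power_exponent(freqs)
transformed = np.power(freqs, alpha)

The exponent found in this way plays a role analogous to the parameter of the scalar power transformation discussed above.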

The proposed transformation function is part of a more general algorithm, which makes word embedding through PCA easier. The generalized algorithm, referred to as generalized PCA (GPCA), makes use of a randomized SVD algorithm to mitigate the other limiting factor of PCA, concerning the CPU time and the amount of memory needed to process a word frequency matrix. The SVD method estimates the singular value decomposition of a large matrix from a small sample matrix, which is sampled from all vectors in the original matrix so that the vectors of the sampled matrix span the same space as the vectors of the original matrix. In practice, since the sampled matrix is much smaller than the original matrix, the SVD can be performed much faster. In addition, the SVD method facilitates PCA by incorporating the mean-centring step, needed for PCA, into the SVD computation. The mean-centring step destroys the sparsity of a data matrix. The randomized SVD algorithm enables us to compute the principal components of a data matrix while preserving the sparsity of the matrix. This helps us to process very large matrices with a limited amount of memory.
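A minimal sketch of this idea, assuming a randomized range finder in the style of Halko et al. (2011), is given below. It is not the GPCA algorithm itself: the oversampling value, the absence of power iterations, and the function name are simplifying assumptions. The point is that the sparse matrix X is never centred explicitly; centring is folded into the matrix products.

import numpy as np
import scipy.sparse as sp

def centred_principal_directions(X, k, oversample=10, seed=0):
    # X: sparse (n x d) word frequency matrix.
    # Returns approximate top-k principal directions of the centred data
    # without ever forming the dense matrix X - mean.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()           # column means

    # Multiplication by the centred matrix: (X - 1 mu^T) V = X V - 1 (mu^T V).
    Omega = rng.standard_normal((d, k + oversample))
    Y = X @ Omega - np.outer(np.ones(n), mu @ Omega)
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis of the range

    # Project the centred matrix onto the basis and take a small SVD.
    B = (X.T @ Q).T - np.outer(Q.T @ np.ones(n), mu)  # Q^T (X - 1 mu^T)
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k], s[:k]

# Toy usage on a random sparse matrix (hypothetical sizes).
X = sp.random(20000, 5000, density=0.001, format="csr")
V, s = centred_principal_directions(X, k=50)

Only sparse-times-dense products involving X appear, so the sparsity of the frequency matrix is preserved throughout the computation.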

In addition to the solutions provided for the limiting factors of PCA, GPCA introduces a weighting mechanism that generalizes many of the weighting and transformation approaches used in the literature. It also provides a mathematical formulation of the concepts of context and corpus. These formulations enable the principal word embedding method to process both raw and annotated corpora with different types of context.

9.3 PCA Limitations in Other Methods

Our third research question addresses how the limitations of PCA are handled in other word embedding methods. We studied the connections between multiple popular word embedding methods and the principal word embedding method proposed in this thesis.


In general, we showed that all of the selected methods are closely related to the principal word embedding method. In most instances, those methods can be seen as special cases of principal word embedding. For example, most of the methods transform the word frequency data before performing a dimensionality reduction. We showed that the transformations used by other methods are special cases of the adaptive transformation function in principal word embedding. An example of a transformation function used for word embedding is the Hellinger transformation, which is a power transformation with an exponent of 0.5. In contrast to the other methods, which use fixed transformation functions, the transformation function in principal word embedding is an adaptive function whose parameters are trained to match the word frequency data. We also showed that the dimensionality reduction techniques in all of the selected methods are closely related to PCA. For example, one of the commonly used word embedding methods studied in this thesis is GloVe. We showed that the dimensionality reduction step in GloVe is in fact equivalent to a kernel principal component analysis of word frequencies.
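As a small illustration of such a fixed transformation, the following sketch (our own, not code from the thesis) applies the Hellinger transformation to a count matrix: rows are normalized to probability distributions and a power function with a fixed exponent of 0.5 is applied.

import numpy as np

def hellinger_transform(counts):
    # counts: (n_words x n_features) co-occurrence counts as a float array.
    # Assumes every row has at least one non-zero count.
    # Row-normalize to probability distributions, then take the square root,
    # i.e. a power transformation with the fixed exponent 0.5.
    p = counts / counts.sum(axis=1, keepdims=True)
    return np.power(p, 0.5)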

We also discussed how the limiting factors of PCA are addressed in these methods. We showed that the limiting factors are completely ignored by some of the word embedding methods. Some of the methods, however, provide solutions that are covered by the more general solutions provided by principal word embedding. For example, the limiting factor related to the data distribution is mitigated by performing static transformation functions on the word frequency data. The limiting factor related to the size of the data is mitigated by software engineering techniques or by omitting the mean-centring step in PCA. The principal word embedding method mitigates both limitations by using a randomized SVD algorithm that makes it possible to perform PCA on a word frequency matrix without omitting the mean-centring step or applying sophisticated software engineering techniques.

9.4 Evaluation

Our fourth research question deals with the intrinsic and extrinsic evaluation of principal word embeddings. To this end, multiple evaluation metrics were used to evaluate principal word embeddings in both intrinsic and extrinsic ways. Our intrinsic evaluations were focused on the internal structures and the distributions of the word embeddings. We introduced two intrinsic evaluation metrics: one measures the spread of the word embeddings and the other measures their discriminability. The extrinsic evaluation metrics, by contrast, assess a set of word embeddings with regard to their contributions to other tasks such as the word similarity task, part-of-speech tagging, named-entity recognition, and dependency parsing.


Several experiments were carried out to study the effect of the word embedding parameters on the evaluation metrics. We chose four parameters to study: 1) the feature variables, 2) the number of dimensions, 3) the metric and weight matrices, and 4) the transformation function. The experimental results show that the parameters have a high impact on the quality of word embeddings and make them suitable for certain tasks. We observed that the intrinsic evaluation metrics are very sensitive to the number of dimensions. Non-identity transformation functions and weighting matrices reduce the spread and the discriminability of word embeddings. This is because these parameters diminish the disproportionate effect of the top eigenvalues and eigenvectors of the covariance matrix of the mixture of contextual word vectors and increase the effect of the remaining ones. In other words, these parameters compress the data along the top eigenvectors and expand it along the other ones. We found this compression and expansion beneficial in our extrinsic evaluation. For example, we found that a scalar power transformation function results in the best parsing accuracies. We also found that different settings of the weighting parameters and the transformation function generate word embeddings that are suitable for different tasks.

Finally, we compared the principal word embedding method with other popular word embedding methods. The comparison was based on the efficiency of the embedding methods and on both intrinsic and extrinsic evaluation metrics. In terms of the efficiency of the word embedding methods, the comparisons show that principal word embedding is faster than the advanced word embedding methods such as GloVe and word2vec, but slower than the traditional methods such as random indexing. In terms of the evaluation metrics, the comparisons show that principal word embeddings outperform the traditional word embedding methods and are on par with the more advanced word embedding methods.

9.5 Future Work

Word embeddings have led to great improvements in the performance of NLP tasks. As continuous representations of words, the word vectors generated by these methods enable NLP tasks to benefit from the power of machine learning methods that process continuous data. Word embeddings also bridge the gap between linguistics and mathematics and open new lines of research for studying words. They encode word properties into the elements of vectors and enable efficient computational processing of words. Therefore, they are important tools for researchers in linguistics, mathematics, artificial intelligence, and computer science.

The research on word embeddings can be further extended in various ways. In this section, we focus on the principal word embedding method and propose two lines of research related to the parameters and the technical aspects of the word embedding method.

We have studied four groups of parameters of principal word embedding: 1) feature variables, 2) number of dimensions, 3) weighting matrices, and 4) transformation functions. In this section, we address several lines of research based on these parameters.

Feature variables allow principal word embedding to make use of deep linguistic notions in word embedding. The specification of a feature variable has two parts: a contextual feature and a context function. The contextual feature makes it possible to use different types of linguistically motivated features, such as morphological, syntactic, and semantic features of words. The context function allows us to make use of different syntactic theories for building word embeddings. For example, one can use different types of dependency grammars for training word embeddings. These parameters can be defined in various ways. In this thesis, we studied a limited set of contextual features and context functions. Further studies of these parameters would provide more information about their effect on word embedding.

Other parameters of principal word embedding that can be studied from a linguistic point of view are the two weighting matrices used to weight feature variables and words. Our studies of these matrices were restricted to diagonal matrices whose elements were functions of word and feature frequencies. This study can be extended by using other matrices that incorporate linguistically motivated information about words and features rather than only frequency-based information. The weight matrices also provide a systematic way of performing feature selection to determine the most informative features for word embedding. From this perspective, the weighting mechanism in principal word embedding can be seen as a platform for studying the effect of different features on the quality of word embeddings.

A parameter of principal word embedding that can be of interest from a mathematical point of view is the eigenvalue weighting matrix. This matrix relaxes the PCA eigenvector selection rule and gives the principal word embedding method more freedom to choose arbitrary eigenvectors of a word frequency covariance matrix, instead of using only a small number of top eigenvectors. In our experiments, we instantiated this matrix based on some assumptions about the tasks used for evaluating word embeddings. This matrix, however, could be trained in a smarter way using machine learning techniques.1 This approach could also help to diminish the negative effect of the top eigenvalues of a word frequency covariance matrix in word embedding.
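The sketch below illustrates the general idea of such a relaxation; it is only a schematic view and not the instantiation used in the experiments. Hard top-k selection corresponds to a weight vector with ones for the top k eigenvectors and zeros elsewhere, and the weight vector itself is a hypothetical parameter that could be learned.

import numpy as np

def weighted_eigen_projection(C, weights):
    # C: covariance matrix of the (transformed) word frequency data.
    # weights: one non-negative weight per eigenvector, in descending
    # order of the corresponding eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvecs = eigvecs[:, order]
    # Scale each eigenvector by its weight instead of keeping or
    # discarding it outright.
    return eigvecs * np.asarray(weights)

# Classic PCA with k components is the special case
# weights = [1]*k + [0]*(d - k).

Projecting the transformed and centred data onto the returned matrix then yields word vectors in which each principal direction contributes in proportion to its weight.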

Another parameter of principal word embedding that deserves a deeper mathematical study is the transformation function used to reshape the distribution of the word frequency data into a distribution closer to the normal distribution.

1 The idea of training an eigenvalue weight matrix was suggested by Magnus Sahlgren in a seminar devoted to a preliminary version of this thesis.


In our studies, we considered power functions whose parameters were tuned by an entropy-based optimization method. One could also study other functions, such as different variants of the sigmoid function, for this purpose.

In addition to a deeper study of the parameters of our method, future research could also consider extensions to the method itself. Here we briefly consider two such extensions. The first is to allow the method to associate words with random vectors instead of the usual static word vectors, which could potentially increase the effectiveness of the word embedding method. The second is to allow it to generate word embeddings sequentially, which would improve efficiency.

Principal word embedding generates a set of word embeddings by performing PCA on a mixture model. The components of the mixture model are vectors associated with words. These vectors, called contextual word vectors, have elements that are functions of word frequencies. Principal word embeddings are principal components of a matrix sampled from the mixture model.

The idea of generating word embeddings from a mixture model can be extended to a method that associates words with random vectors. This is in contrast to ordinary word embedding methods, which associate words with static real-valued vectors. If we associate a word with a random vector, then each occurrence of the word will be associated with a realization of the random vector. In other words, each occurrence of a word is associated with a vector that may vary with respect to the syntactic environment of the word. Such a word embedding method can be considered a generative word embedding method. Random word vectors are expected to be more informative than ordinary word vectors with respect to the individual occurrences of words, because they generate word embeddings with regard to the contextual environments where words appear. Ordinary word vectors, by contrast, have static values trained in a training phase and remain unchanged when they are in use.

The principal word embedding method introduced in this thesis is a vector-based word embedding method, i.e., it associates each word with a static vector. Principal word vectors, built by this method, are principal components of a mixture of contextual word vectors. The principal components are realizations of a vector of latent random variables modelling most of the variation in the mixture model. In other words, principal word embedding uses PCA to project a vector sampled from the mixture model to a vector in the space of the latent variables. This approach can be extended to a generative word embedding method if we replace PCA with other methods, such as restricted Boltzmann machines (RBMs), which link two vectors of random variables. A generative word embedding method links a mixture of contextual word vectors to a low-dimensional vector of latent variables such that the distance between the distribution of the mixture model and the distribution of the latent variables is minimized. This method is expected to model the dynamic syntactic and semantic roles of words in a more effective way than the classic word vectors.

Another extension to the principal word embedding method is to add the ability to process word frequency data in a sequential way while scanning a corpus. This would lead to a significant reduction in the memory required to process the data. As described in the thesis, principal word embedding uses an SVD algorithm that makes it easier to perform PCA on word frequency data. An advantage of this algorithm is its ability to preserve the sparsity of the data. This means that the algorithm can benefit from highly efficient algorithms designed for processing sparse data, which leads to substantial improvements in the efficiency of the word embedding method. We expect that the efficiency of principal word embedding can be further improved if the data matrix is processed sequentially.
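One possible realization of this extension, sketched below under our own assumptions and not taken from the thesis, is to accumulate the random sketch used by the randomized SVD one row at a time, so that the full frequency matrix never has to be held in memory. The function name, the oversampling value, and the row-wise interface are hypothetical.

import numpy as np

def streaming_sketch(row_iter, d, k, oversample=10, seed=0):
    # Accumulate the sketch Y = A @ Omega one row at a time, where each
    # element of row_iter is a single length-d row of the frequency
    # matrix A (for example, the counts gathered for one word).
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((d, k + oversample))
    rows = []
    for row in row_iter:
        rows.append(np.asarray(row) @ Omega)   # one small dense vector per row
    # Y is n x (k + oversample): much smaller than the full n x d matrix.
    return np.vstack(rows), Omega

The orthonormal basis and the remaining steps of the randomized SVD can then be computed from the small sketch Y, as in the non-streaming case.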

In general, word embeddings and other continuous representations have dramatically changed the landscape in NLP in recent years and have improved the state of the art for a very wide range of tasks. In addition, they give us novel insights into the nature of linguistic units such as the word, which is one of the fundamental concepts not only in linguistics, but also in other research areas such as philosophy, theology, and mathematics. However, a deeper understanding of these representations, their strengths and their weaknesses, has not developed at quite the same pace. Such an understanding will be necessary if we want to continue to make progress in our field. The hope is that the research presented in this thesis may contribute to this development and inspire further research in the same vein.


References

Achlioptas, D. (2001). Database-friendly random projections. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 274–281.
Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.
Basirat, A. and Nivre, J. (2017). Real-valued syntactic word vectors (RSV) for greedy neural dependency parsing. In Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa), pages 20–28.
Basirat, A. and Tang, M. (2018). Lexical and morpho-syntactic features in word embeddings - A case study of nouns in Swedish. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence (ICAART 2018), pages 663–674.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 41–48.
Berg-Kirkpatrick, T., Burkett, D., and Klein, D. (2012). An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995–1005.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Cann, R., Kempson, R., and Gregoromichelaki, E. (2009). Semantics: An Introduction to Meaning in Language. Cambridge Textbooks in Linguistics. Cambridge University Press.
Chen, D. and Manning, C. (2014). A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.
Chen, S. F. and Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 310–318.
Chiu, B., Korhonen, A., and Pyysalo, S. (2016). Intrinsic evaluation of word vectors fails to predict extrinsic performance. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, August 2016, pages 1–6.
Church, K. W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Clark, S. (2015). Vector Space Models of Lexical Meaning, chapter 16, pages 493–522. Wiley-Blackwell.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
Craddock, J. M. and Flood, C. R. (1969). Eigenvectors for representing the 500 mb. geopotential surface over the northern hemisphere. Quarterly Journal of the Royal Meteorological Society, 95:576–593.
Dahl, G., Adams, R., and Larochelle, H. (2012). Training restricted Boltzmann machines on word observations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 679–686.
de Lhoneux, M., Shao, Y., Basirat, A., Kiperwasser, E., Stymne, S., Goldberg, Y., and Nivre, J. (2017). From raw text to universal dependencies - look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 207–217.
de Marneffe, M.-C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.
Egghe, L. (2007). Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments. Journal of the American Society for Information Science and Technology, 58(5):702–709.
Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 19–24.
Goldberg, Y. and Hirst, G. (2017). Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers.
Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins University Press, third edition.
Hajic, J. and Zeman, D., editors (2017). Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.
Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
Hartung, M., Kaupmann, F., Jebbara, S., and Cimiano, P. (2017). Learning compositionality functions on word embeddings for modelling attribute meaning in adjective-noun phrases. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 54–64.
Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects. Academic Press.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57.
Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM), 5(4):339–342.
Jolliffe, I. (2002). Principal Component Analysis. Springer Series in Statistics. Springer-Verlag.
Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, 2nd edition.
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.
Katz, S. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.
Kiela, D. and Clark, S. (2014). A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 21–30.
Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, pages 181–184.
Köhn, A. (2015). What's in an embedding? Analyzing word embeddings through multilingual evaluation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2067–2073.
Lai, S., Liu, K., He, S., and Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems, 31(6):5–14.
Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.
Lebret, R. and Collobert, R. (2014). Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 482–490.
Lebret, R. and Collobert, R. (2015). Rehabilitation of count-based models for word vector representations. In Proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), pages 417–429.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer Berlin Heidelberg.
Levy, O. and Goldberg, Y. (2014a). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 302–308.
Levy, O. and Goldberg, Y. (2014b). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.
Lund, K. and Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.
Lyons, J. (1995). Linguistic Semantics: An Introduction. Cambridge University Press.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics - Special Issue on Using Large Corpora, 19(2):313–330.
Matthews, P. H. (1981). Syntax. Cambridge Textbooks in Linguistics. Cambridge University Press.
Matthews, P. H. (1991). Morphology. Cambridge Textbooks in Linguistics. Cambridge University Press, 2nd edition.
McDonald, R. (2006). Discriminative learning and spanning tree algorithms for dependency parsing. PhD thesis, University of Pennsylvania.
Melamud, O. and Goldberger, J. (2017). Information-theory interpretation of the skip-gram negative-sampling objective function. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 167–171.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. In Proceedings of Workshop at the International Conference on Learning Representations (ICLR).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
Mikolov, T., Yih, W., and Zweig, G. (2013c). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.
Nayak, N., Angeli, G., and Manning, C. D. (2016a). Evaluating word embeddings using a representative suite of practical tasks. In Proceedings of the 1st Workshop on Evaluating Vector Space Representations for NLP, pages 19–23.
Nayak, N., Angeli, G., and Manning, C. D. (2016b). Evaluating word embeddings using a representative suite of practical tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, August 2016, pages 19–23.
Nivre, J. (2004). Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016). Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666.
Ó Séaghdha, D. and Korhonen, A. (2014). Probabilistic distributional semantics with latent variable models. Computational Linguistics, 40(3):587–631.
Padó, S. and Lapata, M. (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33:161–199.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Reed, J. W., Jiao, Y., Potok, T. E., Klump, B. A., Elmore, M. T., and Hurson, A. R. (2006). TF-ICF: A new term weighting scheme for clustering dynamic data streams. In 2006 5th International Conference on Machine Learning and Applications (ICMLA'06), pages 258–263.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Gelbukh, A., editor, Computational Linguistics and Intelligent Text Processing, pages 1–15.
Sahlgren, M. (2006). The Word-space model. PhD thesis, Stockholm University.
Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.
Schnabel, T., Labutov, I., Mimno, D., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.
Schütze, H. (1992). Dimensions of meaning. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 787–796.
Straka, M. and Straková, J. (2017). Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99.
Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180.
Vidal, R., Ma, Y., and Sastry, S. S. (2016). Generalized Principal Component Analysis. Interdisciplinary Applied Mathematics. Springer-Verlag.
Vulic, I. and Korhonen, A. (2016). Is "universal syntax" universally useful for learning distributed word representations? In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 518–524.
Wittgenstein, L. (1953). Philosophical Investigations. Cambridge University Press.
Yaghoobzadeh, Y. and Schütze, H. (2016). Intrinsic subspace evaluation of word embedding representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 236–246.
Zhang, Y. and Nivre, J. (2011). Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193.
Zobel, J., Heinz, S., and Williams, H. E. (2001). In-memory hash tables for accumulating text vocabularies. Information Processing Letters, 80:271–277.


ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia

Editors: Michael Dunn and Joakim Nivre

1. Jörg Tiedemann, Recycling translations. Extraction of lexical data from parallel corpora and their application in natural language processing. 2003.
2. Agnes Edling, Abstraction and authority in textbooks. The textual paths towards specialized language. 2006.
3. Åsa af Geijerstam, Att skriva i naturorienterande ämnen i skolan. 2006.
4. Gustav Öquist, Evaluating Readability on Mobile Devices. 2006.
5. Jenny Wiksten Folkeryd, Writing with an Attitude. Appraisal and student texts in the school subject of Swedish. 2006.
6. Ingrid Björk, Relativizing linguistic relativity. Investigating underlying assumptions about language in the neo-Whorfian literature. 2008.
7. Joakim Nivre, Mats Dahllöf and Beáta Megyesi, Resourceful Language Technology. Festschrift in Honor of Anna Sågvall Hein. 2008.
8. Anju Saxena & Åke Viberg, Multilingualism. Proceedings of the 23rd Scandinavian Conference of Linguistics. 2009.
9. Markus Saers, Translation as Linear Transduction. Models and Algorithms for Efficient Learning in Statistical Machine Translation. 2011.
10. Ulrika Serrander, Bilingual lexical processing in single word production. Swedish learners of Spanish and the effects of L2 immersion. 2011.
11. Mattias Nilsson, Computational Models of Eye Movements in Reading: A Data-Driven Approach to the Eye-Mind Link. 2012.
12. Luying Wang, Second Language Acquisition of Mandarin Aspect Markers by Native Swedish Adults. 2012.
13. Farideh Okati, The Vowel Systems of Five Iranian Balochi Dialects. 2012.
14. Oscar Täckström, Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision. 2013.
15. Christian Hardmeier, Discourse in Statistical Machine Translation. 2014.
16. Mojgan Seraji, Morphosyntactic Corpora and Tools for Persian. 2015.
17. Eva Pettersson, Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction. 2016.
18. Marie Dubremetz, Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora. 2017.
19. Josefin Lindgren, Developing narrative competence: Swedish, Swedish-German and Swedish-Turkish children aged 4–6. 2018.
20. Vera Wilhelmsen, A Linguistic Description of Mbugwe with Focus on Tone and Verbal Morphology. 2018.
21. Yan Shao, Segmenting and Tagging Text with Neural Networks. 2018.
22. Ali Basirat, Principal Word Vectors. 2018.
