Word2vec on the Italian language: first experiments

Vincenzo Lomonaco
Alma Mater Studiorum - University of Bologna
February 19, 2015

Abstract

The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent years. The vector representations of words learned by word2vec models have been shown to carry semantic meaning and are useful in various NLP tasks. In this work I try to reproduce the previously obtained results for the English language and explore the possibility of doing the same for the Italian language.

1 Introduction

Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, as they are represented as indices in a vocabulary. This choice has several good reasons: simplicity, robustness, and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling. However, these simple techniques are at their limits in many tasks.

With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they now typically outperform the simple models. Probably one of the most successful concepts is the use of distributed representations of words [2]. For example, neural network based language models significantly outperform N-gram models in many cases [1], [8], [4].

The word2vec tool was born out of this trend. It can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as I know, none of the previously proposed architectures has been successfully trained on more than a few hundred million words, with a modest word vector dimensionality between 50 and 100.
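The contrast drawn in this section between words as vocabulary indices and distributed representations can be illustrated with a minimal sketch. The three-word vocabulary and the dense vectors below are hand-picked, hypothetical values chosen only for illustration; they are not produced by word2vec and do not come from the experiments in this work.

```python
import math

# Toy vocabulary: each word is an atomic unit identified only by its index.
vocab = ["re", "regina", "pizza"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the index-based (one-hot) vector of a word."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense vectors (assumption: invented by hand, not learned).
dense = {
    "re": [0.9, 0.8, 0.1],
    "regina": [0.8, 0.9, 0.1],
    "pizza": [0.1, 0.0, 0.9],
}

# With index-based vectors, any two distinct words are orthogonal.
one_hot_sim = cosine(one_hot("re"), one_hot("regina"))

# With dense vectors, similarity can differ from pair to pair.
dense_related = cosine(dense["re"], dense["regina"])
dense_unrelated = cosine(dense["re"], dense["pizza"])

print("one-hot  re/regina:", one_hot_sim)
print("dense    re/regina:", round(dense_related, 3))
print("dense    re/pizza: ", round(dense_unrelated, 3))
```

With the index-based encoding, every pair of distinct words has similarity zero, so the representation cannot express that "re" is closer to "regina" than to "pizza". A dense representation leaves room for such graded similarities, which is what models like word2vec try to learn from data.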
The main goal of this work is to validate previously proposed experiments for the English language (especially exploring how this tool performs on smaller data sets) and then to find out whether it is possible to reproduce the same accuracy and performance with the Italian language. In section 2, the architectures proposed by word2vec are briefly summarized. In section 3, I present the corpora, the preprocessing and the test sets used. Then, in section 4, I explain in detail which experiments were performed and the results obtained. Lastly, in section 5, I draw the main conclusions.

2 Word2vec models

Many different types of models have been proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Word2vec computes distributed representations of words using neural networks, as it was previously shown that they perform significantly better than LSA at preserving linear regularities among words [6], [9] and that they are computationally cheaper than LDA on large data sets.

Practically speaking, word2vec proposes two new model architectures for learning distributed representations of words that try to minimize computational complexity. The first one is called Continuous Bag-of-Words (CBOW) and is pretty similar to the feedforward

