
Word2vec on the Italian language: first experiments

Vincenzo Lomonaco1

1Alma Mater Studiorum - University of Bologna

February 19, 2015

Abstract

The word2vec model and tools by Mikolov et al. have attracted a great deal of attention in recent years. The vector representations of words learned by word2vec models have been shown to carry semantic meaning and to be useful in various NLP tasks. In this work I try to reproduce the previously obtained results for the English language and to explore the possibility of doing the same for the Italian language.

1 Introduction

Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, as they are represented as indices in a vocabulary. This choice has several good reasons: simplicity, robustness, and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling. However, these simple techniques are at their limits in many tasks. With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they now typically outperform the simple models. Probably one of the most successful concepts is the use of distributed representations of words [2]. For example, neural network based language models significantly outperform N-gram models in many cases [1, 8, 4]. The word2vec tool was born out of this trend. It can be used for learning high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. As far as I know, none of the previously proposed architectures had been successfully trained on more than a few hundred million words, with a modest word vector dimensionality of between 50 and 100.

The main goal of this work is to validate previously proposed experiments for the English language (especially exploring how the tool performs on smaller data sets) and then to figure out whether it is possible to reproduce the same accuracy and performance for the Italian language. In section 2, the architectures proposed by word2vec are briefly summarized. In section 3, I present the corpora, the preprocessing and the test sets used. Then, in section 4, I explain in detail which experiments were performed and the results obtained. Lastly, in section 5, I draw the main conclusions.

2 Word2vec models

Many different types of models have been proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Word2vec computes distributed representations of words using neural networks, as it was previously shown that they perform significantly better than LSA at preserving linear regularities among words [6, 9], and that they are computationally cheaper than LDA on large data sets. Practically speaking, word2vec proposes two new model architectures for learning distributed representations of words that try to minimize computational complexity. The first one is called Continuous Bag-of-Words (CBOW) and is quite similar to the feedforward Neural Net Language Model (NNLM), but the non-linear hidden layer is removed and the projection layer is shared for all words. The architecture is called Continuous Bag-of-Words because the order of the words in the history does not influence the projection; furthermore, words from the future are also used. The best performance in the original work was obtained on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize the classification of a word based on another word in the same sentence; it is called Continuous Skip-gram (Skip-gram). More precisely, each current word is used as an input to a log-linear classifier with a continuous projection layer to predict words within a certain range before and after the current word. Increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity.
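As a concrete illustration of the Skip-gram setup, the sketch below enumerates the (center, context) training pairs within a fixed window. This is a simplification of my own, not the original code: the actual word2vec implementation also shrinks the window at random and subsamples frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (center, context) pairs within +/- `window` positions.

    Simplified sketch: the real word2vec code also shrinks the window
    at random and subsamples frequent words before forming pairs.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat".split(), window=1))
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Each pair then becomes one classification example: given the center word, predict the context word.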

Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context,and the Skip-gram predicts surrounding words given the current word.

3 Corpora and test sets

Due to computational and memory limits, I was forced to consider only small data sets and to compute word vectors on them. In order to highlight the correlation between data set size and word vector quality, I decided to prepare two data sets for each language: the former of 100MB and the latter of 200MB. For the English language I chose to use a 200MB chunk of the “One Billion Word Language Modeling Benchmark”, a tokenized corpus provided by Google. I further considered the small, sampled and already preprocessed version of the Wikipedia dump corpus that comes natively with word2vec as a demo data set, in order to make some comparisons. For the Italian language I chose to directly sample the plain-text ItWaC corpus (which counts more than 2 billion words) and reduce it to the same demo size of 200MB. In Table 1, further information about each data set is provided.
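The sampling step can be sketched as a byte-budget truncation on whole lines; the function below is an illustrative assumption about how such a cut could be done, not the actual script used.

```python
def take_bytes(lines, max_bytes):
    """Keep whole lines from an iterable until a UTF-8 byte budget is hit.

    Sketch of how a plain-text corpus such as ItWaC could be reduced
    to a fixed-size sample (e.g. max_bytes = 200 * 1024 ** 2).
    """
    kept, used = [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if used + n > max_bytes:
            break  # stop before exceeding the budget; never split a line
        kept.append(line)
        used += n
    return kept, used

# Usage on a real corpus file (path is illustrative):
# with open("itwac.txt", encoding="utf-8") as f:
#     sample, size = take_bytes(f, 200 * 1024 ** 2)
```

Cutting at line boundaries keeps sentences intact, which matters because word2vec treats each line as a context unit.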


Table 1: Data sets summary

Lang  Name    Size   Vocab size  Word count  Encoding
Eng   1BWLMB  100MB  60745       18037497    utf-8
Eng   text8   100MB  71291       16718843    utf-8
Eng   1BWLMB  200MB  81746       34756679    utf-8
Ita   ItWaC   100MB  90486       16691286    utf-8
Ita   ItWaC   200MB  125625      33394879    utf-8

Before feeding the text to word2vec, some preprocessing is needed. In fact, word2vec gives its best when:

• Punctuation and special characters are removed

• Words are converted to lowercase

• Numerals are converted to their word forms (e.g. 1996 becomes one nine nine six)

Even if these preprocessing steps are not strictly necessary, they can improve the accuracy and be useful for some kinds of applications. In this work I decided to remove only punctuation and special characters. To this end, I wrote some Python scripts to preprocess all the data in the same way.
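The cleaning applied here (punctuation and special-character removal only) can be sketched as below; the exact regular expressions are an assumption about how the actual scripts worked.

```python
import re

def preprocess(text, lowercase=False):
    """Remove punctuation and special characters; optionally lowercase.

    In this work only the removal step was applied; lowercasing is
    kept as an option, and numeral spelling-out is omitted entirely.
    """
    text = re.sub(r"[^\w\s]+", " ", text)    # drop punctuation/specials
    text = re.sub(r"\s+", " ", text).strip() # collapse whitespace
    return text.lower() if lowercase else text

print(preprocess("Ciao, mondo! (prova)"))  # -> Ciao mondo prova
```

Note that `\w` is Unicode-aware in Python 3, so accented Italian characters survive the filter.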

With respect to the test sets, the word2vec team provided a specific set of “questions” to evaluate word vector accuracy. Although it is easy to show that the word France is similar to Italy and perhaps some other countries, it is much more challenging to subject those vectors to a more complex similarity task. The authors follow previous observations that there can be many different types of similarity between words: for example, the word big is similar to bigger in the same sense that small is similar to smaller. On these premises they denote two pairs of words with the same relationship as a question, as in: “What is the word that is similar to small in the same sense as biggest is similar to big?” Somewhat surprisingly, these questions can be answered by performing simple algebraic operations on the vector representations of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute the vector X = vector(“biggest”) - vector(“big”) + vector(“small”). Then, the word closest to X as measured by cosine distance can be searched for in the vector space and used as the answer to the question.
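The vector-arithmetic answer can be sketched in a few lines of plain Python; this is a toy stand-in for a search over real word2vec vectors, and the example vectors are invented purely for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def analogy(vecs, a, b, c):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = [vecs[b][i] - vecs[a][i] + vecs[c][i] for i in range(len(vecs[a]))]
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

# Toy 2-d vectors, made up for illustration only:
vecs = {
    "big": (1.0, 0.0), "biggest": (1.0, 1.0),
    "small": (-1.0, 0.0), "smallest": (-1.0, 1.0),
}
print(analogy(vecs, "big", "biggest", "small"))  # -> smallest
```

Excluding the query words matters in practice: the nearest neighbor of X is frequently one of the three input words themselves.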

Thus, to measure the quality of the word vectors, a comprehensive test set containing five types of semantic questions and nine types of syntactic questions was defined. Two examples from each category are shown in Figure 2. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually; then, a large list of questions was formed by connecting two word pairs. For example, the authors made a list of 68 large American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs at random. Only single-token words are included, so multi-word entities (such as New York) are not present. The overall accuracy is then evaluated for all question types together, and for each question type separately (semantic, syntactic). A question is counted as correctly answered only if the closest word to the vector computed with the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the current models do not have any input information about word morphology.
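The scoring rule described above (exact TOP-1 match, synonyms counted as errors) amounts to the following; the toy predictor is a hypothetical stand-in for a real vector search.

```python
def top1_accuracy(questions, predict):
    """Score analogy questions under the TOP-1 rule used by word2vec's
    compute-accuracy tool: a question counts as correct only if the
    single top-ranked word equals the expected answer exactly, so
    synonyms and near-misses are counted as mistakes.

    `questions` is a list of (a, b, c, expected) tuples and `predict`
    is any function answering a:b :: c:? with one word.
    """
    correct = sum(1 for a, b, c, gold in questions if predict(a, b, c) == gold)
    return correct / len(questions)

# Toy predictor (a lookup table instead of a real vector search):
answers = {("big", "biggest", "small"): "smallest"}
predict = lambda a, b, c: answers.get((a, b, c), "?")
print(top1_accuracy([("big", "biggest", "small", "smallest"),
                     ("good", "best", "bad", "worst")], predict))  # -> 0.5
```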


Figure 2: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set for the English language.

The same test set was carried over to this work by manually translating the original word pairs into Italian and then automatically forming questions by connecting two word pairs in all their possible combinations. However, it is worth noting that switching from one language to another can sometimes lead to meaningless questions. Some words, in fact, are more common than others or cannot be translated into a single word in the other language. Every effort has been made to maintain the correspondence while preserving the meaning of the test, and in the following sections we will try to assess the soundness of this approach.
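Connecting every ordered pair of translated word pairs can be sketched as follows; the example pairs are illustrative, not taken from the actual translated test set.

```python
from itertools import permutations

def build_questions(pairs):
    """Form analogy questions a:b :: c:? (answer d) from a list of word
    pairs by taking all ordered combinations of two distinct pairs, as
    done here for the translated Italian test set."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]

# Illustrative superlative pairs:
pairs = [("grande", "grandissimo"), ("piccolo", "piccolissimo"),
         ("alto", "altissimo")]
print(len(build_questions(pairs)))  # 3 pairs -> 3 * 2 = 6 questions
```

With n pairs this yields n(n-1) questions, which is why even short manually translated lists produce test sets of thousands of questions.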

4 Experiments and results

In this section we present two main experiments performed in parallel on the English and Italian sampled corpora. The first is the accuracy measured with the test sets on the different data sets; the second is an exploratory analysis of an unsupervised word clustering attempt.

4.1 Test sets experiments

All the experiments were performed using the Skip-gram model and a vector size of 200. For the other options the default word2vec values were chosen. In Tab. 2 we can see the results reported directly by the original compute-accuracy code for the English language on the demo text8 data set. There are roughly 19K possible questions in total, but only about 60% of them could be used for this purpose. An average accuracy of 26% is reached with the TOP-1 metric, meaning that the answer is exactly the first ranked among all the guesses. This is an acceptable result given that all synonyms are considered wrong and the data set is very small.


Table 2: English text8 accuracy

Specific                     100MB-TOP1
capital-common-countries     37.35% (189 / 506)
capital-world                23.35% (339 / 1452)
currency                     6.34% (17 / 268)
city-in-state                19.22% (302 / 1571)
family                       52.94% (162 / 306)
gram1-adjective-to-adverb    5.16% (39 / 756)
gram2-opposite               13.07% (40 / 306)
gram3-comparative            31.43% (396 / 1260)
gram4-superlative            12.45% (63 / 506)
gram5-present-participle     14.11% (140 / 992)
gram6-nationality-adjective  56.24% (771 / 1371)
gram7-past-tense             18.17% (242 / 1332)
gram8-plural                 40.12% (398 / 992)
gram9-plural-verbs           17.38% (113 / 650)

Average: 26.17% (3211/12268)

Questions used / total: 62.77% (12268/19544)

Let us now consider the results for the “One Billion Word Language Modeling Benchmark” in Tab. 3. According to the table, results are slightly worse on this data set, reaching an average accuracy of 19.77%. When the data set is doubled, however, accuracy jumps to 29.98%. This is a huge improvement considering that only another 100MB chunk of data was added.

Table 3: English 1BWLMB accuracy

Specific                     100MB-TOP1            200MB-TOP1
capital-common-countries     32.81% (166 / 506)    46.64% (236 / 506)
capital-world                20.37% (355 / 1743)   35.20% (598 / 1699)
currency                     1.56% (2 / 128)       6.48% (7 / 108)
city-in-state                9.18% (184 / 2005)    13.34% (267 / 2001)
family                       52.11% (198 / 380)    54.21% (206 / 380)
gram1-adjective-to-adverb    1.72% (16 / 930)      3.01% (28 / 930)
gram2-opposite               5.53% (28 / 506)      6.06% (28 / 462)
gram3-comparative            38.06% (507 / 1332)   52.25% (696 / 1332)
gram4-superlative            11.95% (97 / 812)     20.57% (167 / 812)
gram5-present-participle     18.35% (182 / 992)    28.83% (286 / 992)
gram6-nationality-adjective  30.11% (370 / 1229)   47.76% (554 / 1160)
gram7-past-tense             22.95% (358 / 1560)   35.38% (552 / 1560)
gram8-plural                 16.29% (172 / 1056)   28.25% (317 / 1122)
gram9-plural-verbs           18.46% (120 / 650)    26.15% (170 / 650)

Average:                     19.77% (2735/13829)   29.98% (4112/13714)

Questions used / total:      70.76% (13829/19544)  70.17% (13714/19544)


Moving to the Italian language, the accuracy results are given in Tab. 4. First of all, the third grammar section was removed from the test set, since it was impossible to translate the comparative forms into single words in Italian. With respect to accuracy, there is a huge drop in basically every test section when comparing with the corresponding English results. The main reasons for this fall concern the original corpora and the sampled data sets. Consider, for example, the section city-in-state, which drops from 19% to nearly 1%. This section is based on questions about US states and cities, and it is clear that Wikipedia or the Google News corpus contain many more entities of this kind than the ItWaC corpus, which was built by crawling random .it domains and is not as clean as the other two. The same can be said for the other sections, especially the non-syntactic ones. However, even in some sections where a better accuracy was expected (such as the plural section) we see a steep drop. There are several possible explanations. First, Italian, like other European languages, is morphologically richer than English, implying that more data is needed to reach the same accuracy. Moreover, the words selected for the questions are common in English, and the same cannot necessarily be said of their Italian translations. This may imply a large difference in accuracy on small data sets, especially with a neural network model. To settle the question, however, a larger number of experiments would have to be run, varying the size of the data sets up to the state-of-the-art scale and using a benchmark designed specifically for evaluating the quality of Italian word vectors. Nevertheless, also in this case, increasing the size of the data set leads to a good improvement in accuracy in almost all the test sections, with an overall improvement of a few percentage points.

Table 4: Italian ItWaC accuracy

Specific                     100MB-TOP1            200MB-TOP1
capital-common-countries     8.01% (37 / 462)      15.37% (71 / 462)
capital-world                4.56% (32 / 702)      8.51% (74 / 870)
currency                     0.00% (0 / 156)       0.00% (0 / 182)
city-in-state                1.19% (9 / 756)       3.63% (36 / 992)
family                       6.76% (25 / 370)      9.90% (49 / 495)
gram1-adjective-to-adverb    1.03% (9 / 870)       3.68% (32 / 870)
gram2-opposite               0.15% (1 / 650)       2.85% (20 / 702)
gram4-superlative            0.71% (5 / 702)       3.98% (37 / 930)
gram5-present-participle     6.99% (65 / 930)      12.90% (128 / 992)
gram6-nationality-adjective  2.06% (26 / 1260)     5.00% (78 / 1560)
gram7-past-tense             1.25% (14 / 1122)     3.68% (49 / 1332)
gram8-plural                 3.49% (44 / 1260)     5.16% (65 / 1260)
gram9-plural-verbs           16.32% (142 / 870)    28.85% (251 / 870)

Average:                     4.12% (417/10110)     7.73% (890/11517)

Questions used / total:      71.08% (10110/14223)  80.97% (11517/14223)

4.2 Clustering experiments

The word vectors can also be used for deriving word classes from large data sets. This is achieved by performing K-means clustering on top of the word vectors. The original word2vec release contains a plain C implementation and a shell script that demonstrates its use. The output is a vocabulary file with words and their corresponding class IDs. Some examples for both languages are provided in Tab. 5 and Tab. 6 (note that there is no correlation between the words of the two languages: they are put side by side only for formatting purposes).

Even if the user can set a specific number of output classes, semantic and syntactic similarities are difficult to separate. This can be good for some applications and less good for others. From a brief exploratory analysis the class quality seems similar for both languages, but only a stricter test, with the help of a WordNet-like resource, could validate this claim.
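The clustering step can be illustrated with a minimal pure-Python K-means; this is a toy stand-in for the C implementation bundled with word2vec, shown here on invented vectors.

```python
import random

def kmeans(vectors, k, iters=10, seed=0):
    """Toy K-means over a dict word -> vector, returning word -> class id.

    Simplified stand-in for the C implementation shipped with word2vec:
    squared Euclidean distance, fixed iteration count, random init.
    """
    random.seed(seed)
    words = list(vectors)
    dim = len(vectors[words[0]])
    centers = [list(vectors[w]) for w in random.sample(words, k)]
    assign = {}
    for _ in range(iters):
        for w in words:  # assignment step: nearest center
            v = vectors[w]
            assign[w] = min(
                range(k),
                key=lambda c: sum((v[i] - centers[c][i]) ** 2 for i in range(dim)),
            )
        for c in range(k):  # update step: mean of assigned members
            members = [vectors[w] for w in words if assign[w] == c]
            if members:
                centers[c] = [sum(m[i] for m in members) / len(members)
                              for i in range(dim)]
    return assign
```

On real word2vec output the input dict would simply map each vocabulary word to its learned vector, and the resulting class IDs would be written out as in the tables below.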

Table 5: Clustering Examples

English          Italian
carnivores 234   bancario 10
carnivorous 234  bonifico 10
cetaceans 234    cambiale 10
cormorant 234    cambiali 10
coyotes 234      cambiari 10
crocodile 234    correntista 10
crocodiles 234   costitutore 10
crustaceans 234  credito 10
cultivated 234   debitore 10
danios 234       denaro 10

Table 6: Clustering Examples

English         Italian
acceptance 412  menzogneri 341
argue 412       minacciare 341
argues 412      minando 341
arguing 412     minato 341
argument 412    mistificazione 341
arguments 412   nefasta 341
belief 412      opponendo 341
believe 412     opponendosi 341
challenge 412   oppressa 341
claim 412       oppressore 341

5 Conclusion

In this work we have tried to understand word2vec, a well-known tool for learning high-quality word vectors, and to reproduce to some extent the results obtained in the original work for the English language. Moreover, we aimed to start bringing the same experiments to the Italian language and to see what happens. Using very different corpora of limited size, as well as translating the test set directly without an accurate linguistic revision, has led to a very low accuracy level for the Italian language. Scaling the number of words up to billions and enlarging the vocabulary would certainly raise the overall accuracy. The next step, however, would be to construct a new test set from scratch and use it as a benchmark for evaluating all the most used word space models in the context of the Italian language. In the end, I am confident that, with appropriate effort, it would be possible to use word2vec and its various parallel implementations the same way they are used for the English language, reaching the state of the art in terms of both performance and accuracy.


References

[1] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.

[2] James L. McClelland, David E. Rumelhart, PDP Research Group, et al. Parallel distributed processing. Explorations in the microstructure of cognition, 2:216–271, 1986.

[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[4] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Cernocky. Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH, pages 605–608, 2011.

[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[6] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.

[7] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.

[8] Holger Schwenk. Continuous space language models. Computer Speech & Language, 21(3):492–518, 2007.

[9] Alisa Zhila, Wen-tau Yih, Christopher Meek, Geoffrey Zweig, and Tomas Mikolov. Combining heterogeneous models for measuring relational similarity. In HLT-NAACL, pages 1000–1009, 2013.
