POLITECNICODIMILANO · 2016. 11. 9. · POLITECNICODIMILANO SCUOLA DI INGEGNERIA INDUSTRIALE E...

POLITECNICO DI MILANOSCUOLA DI INGEGNERIA INDUSTRIALE E

DELL’INFORMAZIONE

Corso di Laurea Magistrale in Ingegneria Matematica

Machine learning algorithms forsupervised text classification:

an industrial application on lemoteur,the search engine by Orange

Relatore: Prof. Simone Vantini Candidato:Valentina CrippaMatr. 837145

Anno Accademico 2015 - 2016

Alle mie nonne Giuseppina e Maria,per tutto quello che mi hanno insegnatoe per la famiglia che mi hanno donato.

Lo duca e io per quel cammino ascosointrammo a ritornar nel chiaro mondo;

e sanza cura aver d’alcun riposo,

salimmo sù, el primo e io secondo,tanto ch’i’ vidi de le cose belle

che porta ’l ciel, per un pertugio tondo.

E quindi uscimmo a riveder le stelle.

Dante Alighieri, "Divina Commedia"Inferno XXXIV, 133-139

Abstract

In the Internet era, efficient strategies to manage and exploit huge amountsof textual data are more and more required. In this context, Machine Learn-ing techniques turn out to be very useful, thanks to their ability to learnfrom example data or past experience how to fulfil a specific task.

In this thesis we applied some Machine Learning supervised classificationalgorithms to an industrial problem linked to a Search Engine: the aim ofthe project is to build a query interpreter able to classify a query entered byan user into one specific class between a group of predefined categories, or tolabel it as a generic query if no class is recognized.

This problem has been treated in different steps:

• Descriptive analysis: in this first step, we analysed the training dataand realised that some classes were too similar, in term of cosine sim-ilarity, to be distinguished: for this reason, we decided to merge themost similar classes in larger ones.

• Representation of the text data: one of the first challenge was to find agood representation of queries and documents. We found 2 useful rep-resentations: as a binary vector of the size of the training vocabulary,indicating the presence or the absence of a word in the phrase, or asan embedding vector obtained with the word2vec model.

• Implementation of the classifiers: we started with a simple approach,with the Naive Bayes Classifier, we passed to a Markov N-Gram Lan-guage Model, and ended with Neural Networks and Support Vector Ma-chine.

• Improvement of the classification: we combined some of the algorithms

vi

vii

into sequential classifiers, and we used the Bagging technique of Boost-ing.

• Comparison of the performances: all the classifiers have been comparedusing some performance index: the main aim is to maximize the pre-cision (the ratio of the well classified queries over the number of timesthat the classifier assigns a specific class to a query), keeping the recall(the number of times where the classifier is trigged over the total num-ber of test queries) and F1-score (the harmonic mean of precision andrecall) as large as possible.

Abstract

Nell’era di Internet e delle nuove tecnologie, sono sempre più richieste dellestrategie efficienti per gestire e sfruttare grandi quantità di dati testuali. Inquesto contesto, le tecniche di Machine Learning si rivelano molto utili, gra-zie alla loro capacità di apprendere dai dati o dalle informazioni passate comesvolgere un compito preciso.

In questa tesi alcuni algoritmi di classificazione supervisionata di MachineLearning sono stati applicati a una problematica reale legata ad un motoredi ricerca: l’obiettivo di questo progetto consiste nel costruire un interpretedi query capace di classificare una richiesta inserita da un utilizzatore nelmotore di ricerca in una classe specifica scelta in un gruppo di categoriepredefinite, o di etichettarla come query generica se non viene riconosciutanessuna classe.

Il problema è stato trattato in vari step:

• Analisi descrittiva: in questa prima fase, abbiamo analizzato il trainingset e notato che alcune categorie erano tra loro molto simili, in terminidi coseno di similitudine e difficili da distinguere: per questo motivoabbiamo deciso di raggruppare le classi più simili in altre più ampie.

• Rappresentazione dei dati testuali: uno dei primi problemi è statoquello di trovare una buona rappresentazione per query e documenti.Abbiamo trovato due rappresentazioni utili: come vettori binari di di-mensione uguale alla taglia del vocabolario dei dati training, indicandola presenza o l’assenza di ogni parola nella frase, o in vettori embedding,ottenuti con il modello word2vec.

• Implementazione dei classificatori: abbiamo iniziato con un approcciosemplice, usando il Naive Bayes Classifier, per poi passare al più com-

viii

ix

plesso Markov N-Gram Language Model e finire con Neural Networkse Support Vector Machine.

• Miglioramento della classificazione: abbiamo combinato alcuni di questialgoritmi in un classificatore sequenziale, e usato la tecnica di Boostingnota come Bagging.

• Confronto delle performances: Tutti i classificatori sono stati comparatiusando degli indici di performance: l’obiettivo é quello di massimizzarel’indice di precision (il rapporto tra le queries classificate correttamentesul numero di volte in cui il classificatore ha assegnaato una categoriaspecifica alle queries), cercando di mantenere i valori di recall (il numerodi volte in cui il classificatore si attiva sul numero totale di queries) e diF1-score (la media armonica tra precision e recall) il più alti possibile.

Contents

1 Contextualization 31.1 Presentation of the project . . . . . . . . . . . . . . . . . . . . 3

2 Information retrieval and text classification 92.1 A formal definition of Text Classification . . . . . . . . . . . . 102.2 Key performance indicators of a classifier . . . . . . . . . . . . 112.3 Preprocessing techniques . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 First basic text pre-processing . . . . . . . . . . . . . . 132.3.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . 142.3.3 Stop Words Removal . . . . . . . . . . . . . . . . . . . 142.3.4 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 Document similarity . . . . . . . . . . . . . . . . . . . . . . . 152.4.1 Boolean indexing and query-document similarity . . . . 162.4.2 Tf-idf term weighting and cosine similarity . . . . . . . 16

3 Machine Learning algorithms for text classification 203.1 Launch of the classifier . . . . . . . . . . . . . . . . . . . . . . 203.2 Naive Bayes and n-Gram Language Model . . . . . . . . . . . 21

3.2.1 Naive Bayes text classifier . . . . . . . . . . . . . . . . 213.2.2 Markov n-Grams Language Modeling . . . . . . . . . . 23

3.3 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 253.3.1 Overview on Biological and Artificial Neural Networks 253.3.2 A simple Artificial Neuron . . . . . . . . . . . . . . . . 263.3.3 Input, hidden and output layers . . . . . . . . . . . . . 273.3.4 Training a Neural Network . . . . . . . . . . . . . . . . 283.3.5 Multilayer perceptron . . . . . . . . . . . . . . . . . . . 293.3.6 Embedding . . . . . . . . . . . . . . . . . . . . . . . . 34

3.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 40

xi

CONTENTS xii

3.4.1 Linearly Separable Binary Classification . . . . . . . . 403.4.2 Nonlinear Support Vector Machines . . . . . . . . . . . 443.4.3 SVM with more than two classes . . . . . . . . . . . . 453.4.4 Limits of SVM . . . . . . . . . . . . . . . . . . . . . . 46

4 Descriptive analysis of the data and first treatments 48

5 Implementation of the classifiers 595.1 Generic queries . . . . . . . . . . . . . . . . . . . . . . . . . . 615.2 Preliminary test of the models . . . . . . . . . . . . . . . . . . 615.3 Naive Bayes Text Classifier . . . . . . . . . . . . . . . . . . . . 625.4 Naive Bayes with n-Grams Text Classifier . . . . . . . . . . . 635.5 Markov n-Grams Language Modeling . . . . . . . . . . . . . . 655.6 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . 665.7 Two-steps classifier . . . . . . . . . . . . . . . . . . . . . . . . 685.8 SVM with Bagging . . . . . . . . . . . . . . . . . . . . . . . . 735.9 Analysis of the results . . . . . . . . . . . . . . . . . . . . . . 75

5.9.1 Comparison between the different models . . . . . . . . 77

Bibliography 84

Introduction

Machine Learning is a recent but quickly growing branch of Computer Sci-ence and represents the field of study that gives computers the ability to learnwithout being explicitly programmed (A. Samuel, 1959). Machine Learningexplores the study and the construction of algorithms that can learn fromand make prediction on data.

In this work we came up with some solutions for a real industrial prob-lem proposed by Orange, the biggest French multinational telecommunica-tion corporation: we studied, implemented and tested on real data furnishedby the company some Machine Learning algorithms for supervised classifica-tion applied to queries of a search engine, in order to build a so-called queryinterpreter. Indeed, Orange manages several websites on the net, each onespecialized in one different topic (such as assistance on Orange products, TVprograms, news...) and with its own specialised search engine, in addition toa website with a general search engine, lemoteur 1. Since the different searchengines in each web-pages are deeply specialized in one topic, the companyhad the idea to link lemoteur with them, whenever it is possible. For thisreason, they decided to add a preliminary step in the search of the resultsmade by lemoteur after receiving as input a query entered by an user: a queryinterpreter should be insert in order to understand if the query belongs toone of the classes for which Orange has a specialized search engine and, inthis case, use it instead of the general search engine.

In Chapter 1 we will explain and contextualize the industrial problemproposed by Orange.

Then, in Chapter 2 and 3, we will introduce the theoretical results used in1http://www.lemoteur.fr

1

CONTENTS 2

this work. In particular, in Chapter 2, we will focus on Information Retrievaland Natural Language Processing : we will see different ways to model a textas a mathematical object and all the pre-processing steps necessary beforethe classification. In Chapter 3, instead, we will focus on the theoreticalexplanation of the supervised classification algorithm implemented.

In Chapter 4 we will present the data used for building the classifiers, thetreatments and the changes made.

Finally, in Chapter 5, we will analyse the results of the implementationof all the algorithms on the test set of data.

Chapter 1

Contextualization

1.1 Presentation of the project

The project presented in this Master Thesis is born from an industrial needof the company Orange, the biggest French multinational telecommunicationcorporation, on their search engine lemoteur 1 , whose home page is repre-sented in figure 1.1.

On the Internet, Orange manages also other web pages specialized ondifferent topics, such as:

• http://actu.orange.fr/ (figure 1.2), specialised on news;

• http://assistance.orange.fr/ (figure 1.3), specialised on assistanceon Orange products;

• http://boutique.orange.fr/ (figure 1.4), specialised on Orange of-fers for different kinds of contracts (internet, mobile phone, TV...);

• http://tv.orange.fr/ (figure 1.5), specialised on TV channels andprograms.

1http://www.lemoteur.fr

3

4

Figure 1.1: Homepage of the site http://www.lemoteur.fr

Figure 1.2: Homepage of the site http://actu.orange.fr/

5

Figure 1.3: Homepage of the site https://assistance.orange.fr/

Figure 1.4: Homepage of the site http://boutique.orange.fr/

6

Figure 1.5: Homepage of the site http://tv.orange.fr/

Each website contains a search bar, which corresponds to a specific searchengine specialized on the main topics of the web page. For instance, thesearch engine in the web page http://assistance.orange.fr/ is very use-ful to solve problems of the users of Orange products or services.

Since all the search engines on these websites are very efficient on somespecific topics, the idea of Orange was to link the generic search engine ofhttp://www.lemoteur.fr with all the specialized engines in the differentOrange websites. A way to make this connection, is to add a preliminaryclassification step in the algorithms behind lemoteur, where a query enteredby a user is, if possible, classified in one of the categories of the Orange web-sites (assistance, news, TV...). In this way, if a particular class is recognisedduring this classification step, lemoteur can pass the query to the correspon-dent search engine, which will furnish relevant results. In figures 1.6 and 1.7we represent the mode of operation of lemoteur before and after the additionof this classification step.

7

Figure 1.6: lemoteur without the preliminary classification step

Figure 1.7: lemoteur with the preliminary classification step

From this industrial need the project behind this work has been defined:build a query interpreter for the search engine lemoteur. The query inter-preter receives a string of words typed by an user and has to understandwhich class (from a pre-defined set of possible classes) the query belongs to.

8

If it does not recognize any class, the query is classified as generic.

The aim of this work is to test different supervised learning classificationalgorithms, starting from text training data for some classes, furnished byOrange and directly taken from the databases of lemoteur. Since this isthe first step of a complex work in a Big Data context, in this project onlyfew classes have been considered. The number and kind of classes can bechanged (for instance, they can be grouped or divided), in order to improveperformances. The purpose of this work is to find the best algorithms onfew classes, that will be tested directly by Orange on the entire set of class(which is still under construction and could increase over time).

Chapter 2

Information retrieval and textclassification

In the last decades, content-based document tasks have gained an importantstatus in the information system field, thanks to the increasing availability ofdocuments in digital form: this is the reason why a new branch of computerscience, called Information Retrieval (IR) (Manning et al., 2008) has beengrowing since the early ’90s. More precisely:

Definition 1 Information retrieval (IR) is finding material (usually docu-ments) of an unstructured nature (usually text) that satisfies an informationneed from within large collections (usually stored on computers). (Manninget al.,2008)

According to this definition, IR used to be an activity that only a few peopleengaged in. Now, as a result of the technological revolution, hundreds ofmillions of people all over the world engage in Information Retrieval everyday when they use a web search engine or filter their spam e-mail. IR isfast becoming the dominant form of information access, which involves animportant branch of this field: Text Classification.

Text Classification (TC), the activity of labelling natural language textswith thematic categories from a predefined set, is now a major sub-field ofthe information system discipline, thanks to the increased applicative inter-est and to the availability of more powerful hardware.

9

10

Until the late ’80s the most popular approach to TC, at least in thereal-world applications, was the knowledge engineering (KE) one (Studer etal., 1998) , consisting in manually defining a set of rules encoding expertknowledge on how to classify documents under the given categories. In the’90s (Manning et al., 2008), this approach has increasingly lost popularity(especially in the research community) in favour of the machine learning(ML) paradigm, according to which a general inductive process automaticallybuilds an automatic text classifier by learning, from a set of pre-classifieddocuments, the characteristics of the categories of interest. This kind ofapproach, which have an accuracy comparable to that achieved by humanexperts, brings to a considerable savings in terms of expert manpower andof time.

2.1 A formal definition of Text Classification

Computing a Text Classification means assigning a value (boolean or real)to each pair

< dj, ci >∈ D × Cwhere:

• D = {d1, ..., d|D|} is a collection of documents;

• C = {c1, ..., c|C|} is a set of predefined categories.

More formally, the task is to approximate the unknown target function:

Φ : D × C → V (2.1)

that describes how the documents should be classified, with a function:

Φ : D × C → V (2.2)

called classifier, modelled such that Φ and Φ coincide as much as possible.

In 2.1 and 2.2, the image set V is the same and we can distinguish twocases:

• Hard categorization: V = {True, False}

11

• Ranking categorization: V ⊆ R (often V = [0, 1])

In plain words, in hard categorization, for each document dj the classi-fier says if it belongs to the class ci or not: in the first case we will haveΦ(dj, ci) = T , in the other case we will find Φ(dj, ci) = F .

On the other hand, with ranking categorization, the classifier assigns ascore to each couple of document dj and class ci, which represents a "degreeof belief" that dj belongs to the category ci. In this case, for each documentdj, we can rank all the classes ci (i ∈ {1, ..., |C|}) from the most to the lessplausible with respect to the values of Φ(dj, ci).

2.2 Key performance indicators of a classifier

The main function of a classification algorithm is the enhancement of retrievaleffectiveness. Effective retrieval depends on two main factors:

• Items likely to be relevant to the user’s needs must be retrieved;

• Items likely to be extraneous must be rejected.

Two measures are normally used to assess the ability of a system to re-trieve the relevant and reject the nonrelevant items of a collection, known asrecall and precision, respectively. Recall is the proportion of relevant itemsretrieved, measured by the ratio of the number of relevant retrieved items tothe total number of relevant items in the collection; precision, on the otherhand, is the proportion of retrieved items that are relevant, measured bythe ratio of the number of relevant retrieved items to the total number ofretrieved items.

In principle, it is preferable that a classifier produces both high recallby retrieving everything that is relevant, and also high precision by reject-ing all items that are extraneous. The recall function of retrieval appearsto be best served by using broad, high-frequency terms that occur in manydocuments of the collection. Such terms may be expected to pull out manydocuments, including many of the relevant documents. The precision fac-tor, however, may be best served by using narrow, highly specific terms thatare capable of isolating the few relevant items from the mass of nonrelevant

12

ones. In practice, compromises are normally made by using terms that arebroad enough to achieve a reasonable recall level without at the same timeproducing unreasonably low precision.

These two index are generally merged into one generic performance index:the F1-measure, which consists on the harmonic average of the recall and theprecision:

F1−measure =recall × precisionrecall + precision

However, in our particular problem, the two index do not have the sameimportance. Since the idea behind this problem is to classify queries of ageneral search engine in order to call a specific search engine of the classpredicted, it is very important to be sure of the classification. It is muchmore dangerous to make a mistake classing a query in the wrong class (andto call the wrong classifier) instead of wrongly consider it as a query whichdoes not belong to any of the predefined categories. For this reason, inour problem, the main objective function of our query interpreter will be tomaximize the precision: it is more important to be sure to furnish the rightclassification if the classifier is trigged. Of course, it is preferable to have alsoa high value for the recall and, consequently, for the F1-score, and, for thisreason, we will supervise also their values.

2.3 Preprocessing techniques

Texts are more complex than numbers, and they can not be directly inter-preted by a classifier algorithm: that is the reason why, before computingthe classification, a preprocessing step is necessary both for the training andtest set of documents. Generally, in Text Classification (as well as in In-formation Retrieval) a text dj is represented as a vector of term weightsdj = [wj,1, ..., wj,|T |], where T is the set of terms that occur at least once inat least one document in the collection D.

Information Retrieval is essentially a matter of deciding which documentsin a collection should be retrieved to satisfy a user’s need for information.The user’s need for information is usually represented by a query (we canthink for instance to a user looking for some information on a search engine),

13

and contains one or more search terms. The retrieval decision is made bycomparing in some way the vector of terms of the query with the ones of thedocuments themselves. There are different ways to compute these vectors andto compare them. In particular, differences among approaches are accountedfor by:

• Different ways to understand what a term is;

• Different ways to select the significant terms for the meaning of a doc-ument;

• Different ways to compute terms weights (the elements of the termsvector).

However, before computing the weights, a preprocessing step is per-formed, in order to extract interesting and non-trivial knowledge from un-structured text data. In fact, the words that appear in documents and queriesoften have many structural variants. So, at the beginning, some data pre-processing techniques are applied, in order to increase the effectiveness of theIR System.

2.3.1 First basic text pre-processing

The first necessary step is to normalize all the texts. If we consider thefollowing strings:

1. ‘Je veux écouter le dernier album d’Elton John.’

2. ‘Je veux ecouter le dernier album d’Elton John.’

3. ‘Je veux écouter le dernier album d Elton John’

4. ‘je veux écouter le dernier album d’elton john.’

we will agree on the fact that they all have exactly the same meaning ("Iwant to listen to the last album by Elton John"), but for a computer, they areall different. In particular, the string number 2 is different from 1 because‘é’6=‘e’, string number 3 does not have the apostrophe and the dot andstring number 4 has lower letters instead of upper letters.

14

With this example, we can easily understand that, at the beginning of thepre-processing step, it is necessary to compute a normalization of the text,in which:

• All the upper letters are converted to lower letters;

• All the accents (`´^¨...) are removed;

• All the special characters (.,;:?!/-_...) are removed.

2.3.2 Tokenization

Tokenization is the process of breaking a stream of text into terms or tokens,which represent the atoms of the text. In general, tokens are identified withwords, but in some cases it might be interesting to consider groups of words(delimited for instance by a punctuation mark) or sentences as tokens. Thelist of tokens becomes the input for further processing.

2.3.3 Stop Words Removal

Sometimes, some extremely common words which would appear to be mean-ingless in helping select documents matching a user need are excluded fromthe vocabulary entirely. These words are called stop words, they are very fre-quently used words (like conjunctions, prepositions and pronouns) but theyare not useful in classification of documents. However, the development ofsuch stop words list in each language can be difficult and inconsistent betweentextual sources.

2.3.4 Stemming

Stemming consists on conflating the variant forms of a word into a commonrepresentation, called stem. For instance, the words presenting, presentedand presentation could all be reduced to a common stem present.

Stemming represents the most delicate step of pre-processing and thereare different ways of thinking about this step. It might happen that wordsare over-stemmed or under-stemmed. Moreover, in a big data framework,stemming is computationally very expensive and sometimes does not pro-duce optimal results: if the size of the training set is very large, stemming is

15

not always worthwhile.

There are different stemming algorithm, such as Table Look Up Approach,Successor Variety, Snowball Stemmer (for a complete list, see Frakes, 1992),which have an ad hoc implementation for the most common languages insome package of the most used programming languages1.

2.4 Document similarity

As we have already seen, in Information Retrieval a text (such as a document,a query...) is represented by a vector of terms weight. More formally, if weanalyse the context of search engines, a term vector of a document or a queryis obtained by including in each vector all possible content terms allowed inthe system (namely all the |T | = T terms t1, ..., tT that appear in at least onedocument after the pre-processing steps described in section 2.3) and addingthe associated term weight, which of course depends on the text considered:

• Document:

dj = (t0, wdj,0; t1, wdj,1; ...; tT , wdj,T ) ∀j ∈ {1, ..., |D|}

• Query:

q = (t0, wq,0; t1, wq,1; ...; tT , wq,T )

In the next subsections, we will analyse some of the most common ap-proaches to compute terms weights. Weights are used to measure the simi-larity between documents but also to compute some simple classifier. In thissection we present the term frequency–inverse document frequency weights(tf − idf weights), that will be used in the first analysis of the training doc-uments, to discover if there are some classes which are more similar thenothers.

1For instance, the nltk.stem.snowball module in Python provides an implementationof the Snowball Stemmer in 14 different languages

16

2.4.1 Boolean indexing and query-document similarity

A first simple way to define the terms weights of a text is the Boolean index-ing : the weight wdj,i of the term ti in the document dj (or the weight wq,i ofthe term ti in the query q) can only take values in {0, 1}. In particular:

• wdj,i = 1 if term ti appears at least once in document dj;

• wdj,i = 0 if term ti does not appear in document dj.

Using this definition of weight, we can introduce a measure of the similar-ity between two generic documents dj and di (the query-document similarityfunction):

Similarity(di,dj) =T∑k=1

wdi,k · wdj,k (2.3)

Since in this model the term weights are restricted to 0 and 1, the dotproduct in 2.3 simply counts the number of terms that appears both in thequery q and in the document dj.

This approach is the simplest one, but it is almost never used in realapplications, since it is more useful and significant to provide a greater degreeof discrimination among terms assigned for content representation than howit is possible using only weights of 0 and 1.

2.4.2 Tf-idf term weighting and cosine similarity

With this second approach, we will discover a more complete and signifi-cant way to assign a term weight (which will now vary continuously between0 and 1) and to measure the distance between documents. The tf − idffunction (terms frequency - inverse document frequency) has been conceivedand represents now the most used term weights definition. It is often used tocompute the similarity between documents, but is still a too simplistic way toclassify queries. However, it is very useful for a first analysis of the text data.

Using the tf − idf weights, the model of similarity between documentsexpressed in 2.3 is enhanced. Two main considerations appear important inthis improvement:

17

1. Terms that are frequently mentioned in individual documents, or doc-ument excerpts, appear to be useful for the computation of the termweights. This suggests that a term frequency (tf ) factor should beused as part of the term-weighting system, measuring the frequency ofoccurrence of the terms in the document or query texts, that we willexpress using the notation in 2.4 :

tf(tk,dj) = #(tk,dj) (2.4)

2. Term frequency factors alone cannot ensure acceptable retrieval per-formance. Specifically, when the high frequency terms are not concen-trated in a few particular documents, but instead are prevalent in thewhole collection, all documents tend to be retrieved, and this affectsthe search precision. Hence a new collection-dependent factor must beintroduced that favours terms concentrated in a few documents of acollection. The inverse document frequency (idf ) (or inverse collectionfrequency) factor performs this function. The idf factor varies inverselywith the number of documents in which a term appears (#{j : tk ∈ dj})in a collection of |D| = D documents. A typical idf factor (Salton etal., 1998) of a term tk in a collection of documents D can be computedas in 2.5

idf(tk,D) = log

(D

#{j : tk ∈ dj}

)(2.5)

Term discrimination considerations suggest that the best terms for doc-ument content identification are those able to distinguish certain individualdocuments from the remainder of the collection. This implies that the bestterms should have high term frequencies but low overall collection frequen-cies. Hence, a reasonable measure of term importance in a document can beobtained by multiplying the term tf (2.4) and idf (2.5):

tfidf(tk,dj) = #(tk,dj) · log(

D

#{j : tk ∈ dj}

)(2.6)

In order for the weights to fall in the [0, 1] interval and for the documentsto be represented by vectors of equal length, the term weights resulting from2.6 are often normalized by cosine normalization:

18

wdj,tk =tfidf(tk,dj)√∑Ts=1(tfidf(ts,dj))2

(2.7)

After the definition of the term weights, reminding that a text (query ordocument) is now represented as a vector of dimension T whose componentsare given by 2.7, we can see a text as a vector in a T -dimensional vectorspace: indeed, this is the so-called Vector Space Model. Each axe of thisvector corresponds to a term in T .

Now, we have a vectorial representation of all the documents, and wewould like to find a way to measure the similarity between them, indepen-dently from their lengths. Indeed, we can have documents with a differentnumber of terms, but this aspect should not have any consequence on thesimilarity measure.

In order to calculate the distance between two documents di and dj, wecan always use the formula 2.3, with the tf-idf weights (2.7):

Similarity(di,dj) =T∑k=1

wdi,k · wdj,k

=T∑k=1

(tfidf(tk,di)√∑Ts=1(tfidf(ts,di))2

· tfidf(tk,dj)√∑Ts=1(tfidf(ts,dj))2

)(2.8)

The function in 2.8 is known as cosine similarity : in fact, it is the dotproduct between two normalized vectors, which represents the cosine of theangle between the weights vectors of the two documents in the T−dimentionalvector space 2.

2We remind that, given two vectors a and b in an N−dimentional vector space with θthe angle between these vectors, the dot product between them is defined as:

a · b =‖ a ‖‖ b ‖ cos(θ) therefore cos(θ) =a · b

‖ a ‖‖ b ‖

19

Cosine Similarity generates a metric that says how related are two docu-ments by looking at the angle instead of magnitude3. With this model, evenif we had two vectors pointing to two far points because of their differentlength, they could still have a small angle and be considered as similar, andthat is the central point on the use of Cosine Similarity: the measurementtends to ignore the higher term count on documents, and all relevant docu-ments are treated as equally important for retrieval purpose, independentlyfrom their length.

3Notice that the function cos(x) is decreasing on [0, π2 ], which is the only interval weconsider since the weights can not be negative. This means that the more 2 vectors areclose (the smallest is the angle between them), the biggest is the cosine of the angle betweenthem.

Chapter 3

Machine Learning algorithms fortext classification

3.1 Launch of the classifier

The aim of this work is to find a classifier which is able to classify a query intoone specific class. However, it may happens that the classifier does not haveenough information in its training corpus to label a query in a specific class.Since the objective of the classifier is, first of all, to maximize the precision,and, secondly, the recall, it is better to wrongly label a query as Generic(even if it actually belongs to one of the classes) rather than to classify it ina wrong class.

For this reason, before computing the classification, we implemented apreliminary step whose aim is to understand if in the training data thereare enough information to classify the query in one of the categories of if itshould be labelled as Generic, without trigging the classifier.

We decide to perform a simple test at the beginning of each classificationin order to answer to the question ’Do we have enough information in ourtraining documents to trigger the classifier and to believe that the classifi-cation will be reliable?’. The different implemented tests are explained inchapter 5.

20

21

3.2 Naive Bayes and n-Gram Language Model

The Naive Bayes classifier is one of the most popular algorithm for supervisedclassification of texts and the first one we studied and tested: its implemen-tation is simple, but there are some hypothesis about the independence ofthe words in a text which are not realistic. For this reason we decided tofind a way to relax these hypothesis, using the n-Gram Language Model, tobuild a more realistic model.

3.2.1 Naive Bayes text classifier

One of the most known supervised learning methods is the Naive Bayesmodel. It consists on a probabilistic learning method based on the Bayes’srule: using the random variables D and C to denote the document andcategory values respectively, applying the Bayes’ formula we have:

P (C = c|D = d) =P (C = c ∩D = d)

P (D = d)=P (C = c)× P (D = d|C = c)

P (D = d)

To simplify this formula, we will write:

P (c|d) =P (c)× P (d|c)

P (d)(3.1)

Bayes’ rule decomposes the computation of a posterior probability (P (c|d))into the computation of a likelihood (P (d|c)) and a prior probability (P (c)).

As we have already seen, in text classification a document d is representedby a vector of T attributes d = (t1, ..., tT ) (notice that the attributes t1, ..., tTcan be words but also groups of n contiguous words, called n − Grams).Computing the likelihood P (d|c) is not generally trivial, since the space ofpossible documents d = (t1, ..., tT ) is vast. To simplify this computation,the Naive Bayes model introduces an additional assumption that all of theattribute values, tj, are independent given the category label, c. That is, fori 6= j, ti and tj are conditionally independent given c. This assumption isnot realistic (it would imply that the words in a sentence are independent)but greatly simplifies the computation, by reducing the equation 3.1 to:

22

P (c|d) = P (c)×∏T

j=1 P (tj|c)P (d)

(3.2)

Based on the equation 3.2, the MAP (MAximum Posterior) classifier canbe constructed by seeking the optimal category which maximizes the poste-rior P (c|d):

c = arg maxc∈C

P (c|d)

= arg maxc∈C

P (c)×∏T

j=1 P (tj|c)P (d)

= arg maxc∈C

P (c)×T∏j=1

P (tj|c)

(3.3)

(we remind that P (d) is a constant for every category c). The priordistribution P (c) can also be used to incorporate additional assumptions. Ifwe do not have any particular information a priori about the classes, we cansuppose that:

C ∼ Unif(C)In this case the equation 3.3 can be simplified as follows:

c = arg maxc∈C

T∏j=1

P (tj|c) (3.4)

Equation 3.4 is called the Maximum Likelihood Naive Bayes Classifier.Hence, the classification problem is directly linked to the computation ofP (tj|c).

The simplest and most intuitive way to compute these probabilities, is toconsider the frequencies of the terms tj in the class c:

P (tj|c) =N cj

N c(3.5)

where N cj represents the number of the occurrences of the term tj in the

class c, while N c is the total number of words in the class c.

23

There are some techniques, called smoothing, used to avoid 0-probabilitiesfor a new word never observed in the training corpus (Manning et al., 2008).However, since in our problem we want a classifier strictly based on thetraining corpus (which in the practice becomes richer and richer), we willnot use this kind of techniques: if a query contains only words which are notpresent in the training corpus, our classifier must class it as a generic query.

3.2.2 Markov n-Grams Language Modeling

Although the dominant motivation for language modeling has come fromspeech recognition, statistical language models have recently become morewidely used in many other application areas, including information retrieval.The goal of language modeling is to predict the probability of natural wordssequences, or, more simply, to put high probability on words sequences thatoften occur.

The simplest and most successful basis for language modeling is the n-Gram Model (Sidorov et al., 2014). An n − gram is a contiguous sequenceof n items (words) from a given sequence of text or speech.

In general, we know that, by the chain rule of probability, we can writethe probability of any ordered sequence of word t1, t2, ..., tT as:

P (t1, t2, ..., tT ) =T∏i=1

P (ti|t1, ..., ti−1) (3.6)

An n-gram model approximates this probability by assuming that the onlywords relevant to predicting P (ti|t1, ..., ti−1) are the previous n − 1 words,i.e. it assumes the Markov n-gram independence assumption:

P (ti|t1, ..., ti−1) = P (ti|ti−n+1, ..., ti−1) (3.7)

A straightforward maximum likelihood estimate of n-gram probabilitiesfrom a corpus is given by:

P (ti|ti−n+1, ..., ti−1) =#(ti−n+1, ..., ti)

#(ti−n+1, ..., ti−1)(3.8)

24

where #(·) is the number of occurrences of a specified gram in the train-ing corpus.

An n-gram language model can be applied to text classification in a similarmanner to a Naive Bayes Model. In this case, always assuming a uniformprior over categories:

c = arg maxc∈C

{P (c|d)}

= arg maxc∈C

{P (d|c)P (c)}(3.9)

Since we suppose that C Unif(C), the equation 3.9 becomes:

c = arg maxc∈C

{P (d|c)}

and, using the formula 3.7 of the Markov Chains, we finally obtain:

c = arg maxc∈C

T∏i=1

Pc(ti|ti−n+1, ..., ti−1) (3.10)

where P (...|c) = Pc(...).

The principle for using an n-gram language model as a text classifier is todetermine the class that makes a given document most likely to have beengenerated by the category model. Thus, we train a separate language modelfor each category, and classify a new document by evaluating its likelihoodunder each category, choosing the category according to 3.10. Hence, in or-der to classify a query, we need to know how to compute Pc(ti|ti−n+1, ..., ti−1).

In the Naive Bayes text classifier, attributes (words) are considered inde-pendent of each other given the category, However, in a language modelingbased approach, this is enhanced by considering a Makov dependence be-tween adjacent words.

We can notice that the independence assumption for the Naive Bayestext classifier can be made if we consider the terms tj as words, but it isless reasonable if the terms are n-grams (with n>1). In fact, it would meanthat, in the query "This is a query", the 2-grams "This is" and "is a" are

25

independent, while we can clearly understand that they are correlated sincethey have one word in common. For this reason, the Markov n-gram approachfurnishes a more correct model from a theoretical point of view, even if, ina big data context, the performance of the two models could be not verydifferent (Peng et al., 2003).

3.3 Artificial Neural Networks

Artificial Neural Networks (ANN) (Hagan et al., 1996), or more briefly Neu-ral Networks, are a family of machine learning models which emulate themore complex biological human neural system. Neural Networks are repre-sented by a graph, whose nodes are called Neurons and are organised in layers(input, output and hidden layers). In general, the ANN receives some inputdata which may activate a set of input neurons. After being weighted andtransformed by a function, the activations of these neurons are then passedon until finally, the output neuron is activated.

Nowadays, Neural Networks are widely exploited in several fields of Ma-chine Learning and Data Science, including classification, especially thanksto their remarkable ability to derive meaning from complicated or impre-cise data and extract patterns and detect trends that are too complex to benoticed by either humans or other more standard computer techniques.

3.3.1 Overview on Biological and Artificial Neural Net-works

The nervous system is a network of cells specialized for the reception, integra-tion and transmission of information. It includes the brain, the spinal cordand the sensory and motor nerve fibres. The fundamental unit of the nervoussystem is the neuron, which exchange messages in the form of electrical pulse.

In human beings, there is a huge number of neurons (about 1011), inter-connected using their synapses, as shown in fig. 3.1.

Summarily neurons are specialized in:

• receiving information from the internal and external environment;

26

Figure 3.1: Neurons in a Biological Neural Network (Retrieved fromhttp://www.mind.ilstu.edu)

• transmitting signals to other neurons and to effector organs;

• processing information;

• determining or modifying the differentiation of sensory receptor cellsand effector cells.

An Artifical Neural Network consists on a group of simple and identicalcomputing units (called neurons or nodes) which are able to perform complextask.

Artificial Neural Networks are similar to Biological Neural Networks inthe performing by its units of functions collectively and in parallel, ratherthan by a clear delineation of subtasks to which individual units are as-signed. The term Neural Network usually refers to models employed instatistics and artificial intelligence (Artificial Neural Network, in Wikipedia,from en.wikipedia.org/wiki/Artificial_neural_network).

3.3.2 A simple Artificial Neuron

The basic computational element of a NN is the neuron. It represents a nodeof the network and it always receives input from some other units or an ex-

27

ternal source. Each input j of a neuron i has an associated weight wi,j (whichmodels the synapses of the BNN), whose value represents the strength of theconnection between the input and the neuron.

As shown in figure 3.2, the neuron computes the weighted sum of all theinputs and evaluates the activation function in the weighted sum:

ai = g(∑j

wi,jxj)

which represents the output of the neuron.

Figure 3.2: Artificial Neuron. Retrieved fromhttp://kryten.mm.rpi.edu/SEP/index8.html

The designer chooses the activation function: the simplest way is to defineit as the identity function, and in this case we obtain a linear neuron, useful incase of linear problems (since we obtain a linear regression model). However,the most interesting application of Neural Networks is in non-linear context,and that is the reason why the most frequently used activation functions arenon-linear.

3.3.3 Input, hidden and output layers

In a Neural Network, it is important to distinguish three types of neurons:

• Input neurons: they receive data from outside the NN, their outputrepresents an input for others neurons;

28

• Output neurons: they receive data from other neurons of the NN, theiroutput is the output of the NN;

• Hidden neurons (optional): their inputs and outputs signals remainwithin the NN.

Figure 3.3: Different kinds of neurons. Retrieved fromhttp://cs231n.github.io/neural-networks-1/

3.3.4 Training a Neural Network

A key feature of neural networks is an iterative learning process in whichtraining data are presented to the network one at a time, and the weightsassociated with the input values are adjusted each time. After all trainingdata are presented, the process often starts over again. During this learningphase, the network learns by adjusting the weights so as to be able to predictthe correct class label of input samples.

The most popular neural network algorithm is Back-Propagation (an ab-breviation for "Backward propagation of errors") algorithm, used in conjunc-tion with an optimization method such as the Gradient Descent for super-vised learning. The method calculates the gradient of a loss function withrespect to all the weights in the network. The gradient is fed to the opti-mization method which in turn uses it to update the weights, in an attemptto minimize a loss function. After the definition of the architecture of the

29

neural network (number of layers, number of neurons for the hidden layers,activation function) made by the designer, the main steps of the learningphase are the following:

• We initialize the weights of the NN with random values;

• We provide the network with input data mathbfx (in our case, lines ofthe training documents represented as a binary vector) and the correctlabel y (the class of the text);

• The input is propagated forward through the network until activationreaches the output neuron;

• We compare the answer which the network has calculated y with thatwhich we wished to get:

– If y = y: no change to the network;

– If y 6= y: weights are adjusted (using a method which depends onthe type of the Neural Network implemented).

3.3.5 Multilayer perceptron

A Multilayer Perceptron (MLP) is a Feedforward Neural Network (a NNwhere connections between the units do not form a cycle) that maps sets ofinput data onto a set of appropriate outputs. An MLP consists of multiplelayers of nodes in a directed graph, with each layer fully connected to thenext one, as represented in figure 3.4. Except for the input nodes, each nodeis a neuron with a non-linear activation function. This is one of the most usedmodel of NN in classification, and it is very powerful since it can distinguishdata that are not linearly separable.

30

Figure 3.4: MLP with one hidden layer. Retrieved fromhttps://www.dtreg.com/solution/view/21

Activation function

If a Multilayer Perceptron has a linear activation function in all neurons,that is, a linear function that maps the weighted inputs to the output of eachneuron, then it is easily proved with linear algebra that any number of layerscan be reduced to the standard two-layer input-output model1. What makesa MLP different is that some neurons use a non-linear activation function:the most frequently used are

• Sigmoid function: σ(t) =1

1 + e−t

• Hyperbolic tangent function: tanh(t) =et − e−t

et + e−t

represented in figure 3.5.

Designing a MLP

Before the training phase, we have to decide the number of hidden layers andhow many neurons to use in them.

1The collapsing of hidden layer can be accomplished by simply taking an input neuronand an output neuron, and taking linear combination of all intermediate functions on thenodes that connect those two neurons together by passing through hidden layer.

31

Figure 3.5: Sigmoid function (left) and hyperbolic tangent (right)

For nearly all problems, one or two hidden layer are sufficient. Two hid-den layers are required for modeling data with discontinuities such as a sawtooth wave pattern. Using two hidden layers may improve the model, butit obviously computationally more expensive. There is no theoretical reasonfor using more than two hidden layers.

One of the most important characteristics of a perceptron network is thenumber of neurons in the hidden layer(s). If an inadequate number of neu-rons are used, the network will be unable to model complex data, and theresulting fit will be poor. If too many neurons are used, the training timemay become excessively long, and, worse, the network may over fit the data.When overfitting occurs, the network will begin to model random noise inthe data. The result is that the model fits the training data extremely well,but it generalizes poorly to new, unseen data.

There is no particular rule for choosing the number of hidden layers andneurons: they strictly depend on the problem. In order to make a goodchoice, a validation test (such as cross-validation) must be used.

Back-propagation in MLP2

Supervised learning occurs in the perceptron by changing connectionweights after each training data is processed, based on the amount of er-

2In this paragraph we will introduce a change in the notation: here the symbol t standsfor the target activation vector of the MLP and not for the vector of terms of the dictionary.

32

ror in the output compared to the expected result. In supervised learning,the objective is to tune the weights in the network such that the networkperforms a desired mapping of input to output activations. The mappingis given by a set of examples of this funcion, the training set which, in thesupervised learning context, is also called pattern set P .

Each pattern pair p of the pattern set consists of an input activationvector xp and its target activation vector tp (in our case it is a vector of oneelement containing). After training the weights, when an input activation xp

is presented, the resulting output vector ap of the net should equal the targetvector tp. The distance between the target and the actual output vector, inother words the fitness of the weights is measured by the following energy orcost function E:

E :=1

2

∑p∈P

∑n

(tpn − aps)2

where n is the number of units in the output layer. Fulfilling the learninggoal now is equivalent to finding a global minimum of E. The weights in thenetwork are changed along a search direction d(t), driving the weights in thedirection of the estimated minimum:

∆w(t) = η · d(t)

w(t+ 1) = w(t) + ∆w(t)

where the learning parameter η scales the size of the weight-step. To de-termine the search direction d(t), first order derivative information, namelythe gradient ∇E := ∂E

∂wis commonly used (algorithm of Gradient Descent).

The back-propagation algorithm performs successive computation of thegradient ∇E by propagating the error back from the output layer towardsthe input layer.

The basic idea, used to compute the partial derivatives ∇E := ∂E∂wi,j

foreach weight in the network, is to repeatedly apply the chain rule:

∂E

∂wi,j=∂E

∂ai

∂ai∂wi,j

33

where

∂ai∂wi,j

=∂ai

∂inputi

∂inputi∂wi,j

= g′(inputi)aj

with:

• inputi =∑

j wi,jxj

• g(x): activation function of the neuron.

To compute ∂E∂ai

, or the influence of the output ai of unit i on the globalerror E, the following two cases are distinguished:

• If i is ad output unit, then:

∂E

∂ai=

1

2

∂(ti − ai)2

∂si= −(ti − si)

• If i is not an output unit, then the computation of ∂E∂si

is a little morecomplicated. Again, the chain rule is applied:

∂E

∂si=

∑k∈succ(i)

∂E

∂ak

∂ak∂ai

=∑

k∈succ(i)

∂E

∂ak

∂ak∂inputk

∂inputk∂ai

=∑

k∈succ(i)

∂E

∂akg′(inputk)wk,i

(3.11)

where succ(i) denotes the set of all units k in successive layers to whichunit i has a non-zero weighted connection wk,i.

Equation 3.11 assumes knowledge of the values ∂E∂sk

for the units in suc-cessive layers to which unit i is connected. This can be provided by startingthe computation at the output layer and then successively computing thederivatives for the units in preceding layers, applying 3.11. In other words,the gradient information is successively moved from the output layer back

34

towards the input layer. Hence the name Back-propagation algorithm.

Once the partial derivatives are known, the next step in back-propagationlearning is to compute the resulting weight update. In its simplest form, theweight update is a scaled step in the opposite direction of the gradient, inother words the negative derivative is multiplied by a constant value, thelearning rate η. This minimization technique is commonly known as GradientDescent :

∆w(t) = −η · ∇E(t)

or, for a single weight:

∆wi,j(t) = −η · ∂E∂wi,j

(t)

The training phase ends when the backpropagation algorithm stop pro-ducing changes into the values of the weights, i.e. when for each trainingdata the predicted output is equal to the real expected output.

3.3.6 Embedding

Until now, in all the models we have presented, the documents/queries wererepresented by sparse vectors whose dimension was equal to the size of thedictionary of the training corpus. However, there are some other ways tomodel text data, like Embedding (Bian et al., 2014), whose idea is to repre-sent words in a continuous vector space where semantically similar words aremapped to nearby points. This new representation is obtained thanks to aspecific Artificial Neural Network.

More precisely, the Embedding is a predictive method, i.e. it tries topredict a word from its neighbours (or the neighbours from a word) in terms oflearned small, dense embedding vectors, considered parameters of the model.One of the most famous and computationally-efficient predictive model forlearning word embeddings from raw text is Word2vec (Mikolov et al., 2013).This model comes in two flavors, the Continuous Bag-of-Words model andthe Skip-Gram model. Algorithmically, these models are similar, except thatCBOW predicts target words from source context words, while the skip-gramdoes the inverse and predicts source context-words from the target words. In

35

general, the first one is more efficient on small dataset and the second onewith larger ones: that is the reason why we will focus on the skip-gram model.

Theoretical Skip-Gram model

The training objective of the Skip-gram model is to find word representa-tions that are useful for predicting the surrounding words in a sentence or adocument.

Figure 3.6: The Skip-Gram model. Retrieved from Rong, 2014.

As represented in figure 3.6, the embedded representation h ∈ RN of aninput word x ∈ {0, 1}V (where V is the size of the training corpus vocabularyand N < V ) is obtained as the one who best predict its context y1, ...,yC,with yi ∈ RV and C is the considered size of the context (the context ofa word is composed by the C nearest words). In the training phase of thisNeural Network, the algorithm of back-propagation finds the best values forthe elements of 2 matrix of weights:

• the matrix W = {wij} ∈ RV×N of the weights from the input to thehidden layer;

36

• the matrix W ′ = {w′ij} ∈ RN×V of the weights from the hidden to theoutput layer.

Each row of W is the N -dimension vector representation vw of the asso-ciated word of the input layer. Given an input word, assuming xk = 1 andxk′ = 0 ∀k′ 6= k, then

h = xTW = W(k,·) =: vwI

which os essentially copying the k-th row ofW to h. vwIis the vector rep-

resentation of the input word wI . This implies that the activation functionof the hidden layer units is simply linear, since it passes directly its weightedsum of inputs to the next layer.

From the hidden layer to the output layer, there is a different weightmatrix W ′. Using these weights, we can compute a score uc,j for each word jof the vocabulary which might be the c-th word of the context (c = 1, ..., C):

uc,j = v′Twj· h

where v′wjis the j-th column of the matrix W ′. Then, we can use soft-

max (Rong, 2014), a log-linear classification model, to obtain the posteriordistribution of words, which is a multinomial distribution for each contextword (at the end we will have C multinomial distributions), computed withthe same hidden/output matrix:

p(wc,j = wO,j|wI) = yc,j =exp(uc,j)∑Vj′=1 exp(uj′)

where wc,j is the j-th word on the c-th panel of the output layer, wO,c isthe actual c-th word in the output context words, wI is the only input word,yc,j is the output of the j-th unit on the c-th panel of the output layer, uc,j isthe net input of the j-th unit on the c-th panel of the output layer. Becausethe output layer panels share the same weights, thus:

uc,j = uj = v′wj

T · h for c = 1, ..., C

Using the back-propagation algorithm, we can find the updating equationfor the weights of the matrix W and W ′3.

3For the computation of the updating formula, see: Xin Rong, Word2vec ParameterLearning Explained arXiv preprint arXiv:1411.2738 , pp 8-9, 2014

37

Optimizing Computational Efficiency

In the theoretical Skip-gram model, there exist two vector representationsfor each word in the vocabulary: the input vector vw and the output vectorv′w. Learning the input vectors vw is cheap; but learning the output vectorsv′w is very expensive, making it impractical to scale up to large vocabulariesor large training corpora. To solve this problem, an intuition is to limit thenumber of output vectors that must be updated per training instance. Oneelegant approach to achieving this is hierarchical softmax; another approachis through sampling, called negative sampling.

Hierarchical Softmax Hierarchical softmax is an effcient way of comput-ing softmax. The model uses a binary tree to represent all words in thevocabulary. The V words must be leaf units of the tree. It can be provedthat there are V −1 inner units. For each leaf unit, there exists a unique pathfrom the root to the unit; and this path is used to estimate the probabilityof the word represented by the leaf unit. In figure 3.7 an example tree isrepresented: the white units are words in the vocabulary, and the dark unitsare inner units. An example path from root to w2 is highlighted. In theexample shown, the length of the path L(w2) is equal to 4 and n(w, j) meansthe j-th unit on the path from root to the word w

Figure 3.7: An example binary tree for the hierarchical softmax model. Re-trieved from Rong, 2014

In the hierarchical softmax model, there is no output vector representa-tion for words. Instead, each of the V − 1 inner units has an output vector

38

v′n(w,j). The probability of a word being the output word is defined as:

p(w = wO) =

L(w)−1∏j=1

σ(Jn(w, j + 1) = ch(n(w, j))K · v′Tn(w,j)h

)where ch(n) is an arbitrary fixed child of n, v′n(w,j) is the vector repre-

sentation of the inner unit n(w, j), h is the output value of the hidden layer(in the skip-gram model h = vwI

), JxK is a function defined as:

JxK =

{1 if x is true−1 otherwise

This means that the probability that the probability of a word being theoutput word is defined as the probability to take the unique path from theroot of the binary tree that brings to the desired word.

For this model, we can easily find the update equations for the weights 4

Negative Sampling Given a pair (w, c) of word and context, we denoteby p(D = 1|w, c) the probability that (w, c) comes from the corpus data Dand p(D = 0|w, c) = 1 − p(D = 1|w, c) the probability that (w, c) does notcome from he corpus data. Our goal is to find parameters to maximize theprobabilities that all of the observations indeed came from the data:

arg maxvci ,vwi

∏(w,c)∈D

p(D = 1|w, c) = arg maxvci ,vwi

log∏

(w,c)∈D

p(D = 1|w, c)

= arg maxvci ,vwi

∑(w,c)∈D

log p(D = 1|w, c)

The quantity p(D = 1|w, c) can be defined using the softmax:

p(D = 1|w, c) =1

1 + e−vc·vw

Leading to the objective:4For the computation of the updating formula, see: Xin Rong, Word2vec Parameter

Learning Explained arXiv preprint arXiv:1411.2738 , pp 11-12, 2014

39

arg maxvci ,vwi

∑(w,c)∈D

log1

1 + e−vc·vw

This objective has a trivial solution if we set the parameters vc and vw

such that p(D = 1|w, c) = 1 for every pair (w, c). This can be easily achievedby setting the parameters such that vc = vw and vc · vw = K for all vc andvw, where K is large enough number (practically, we get a probability of 1as soon as K ≈ 40).

We need a mechanism that prevents all the vectors from having the samevalue, by disallowing some (w, c) combinations. One way to do so, is topresent the model with some (w, c) pairs for which p(D = 1|w, c) must be low,i.e. pairs which are not in the data. This is achieved by generating the set D′of random (w, c) pairs, assuming they are all incorrect (the name negative-sampling stems from the set D′ of randomly sampled negative examples).The optimization objective now becomes:

arg maxvci ,vwi

∏(w,c)∈D

p(D = 1|w, c)∏

(w,c)∈D′

p(D = 0|w, c)

= arg maxvci ,vwi

∏(w,c)∈D

p(D = 1|w, c)∏

(w,c)∈D′

(1− p(D = 1|w, c)

)= arg max

vci ,vwi

∑(w,c)∈D

log p(D = 1|w, c) +∑

(w,c)∈D′

log(1− p(D = 1|w, c)

)= arg max

vci ,vwi

∑(w,c)∈D

log1

1 + e−vc·vw+

∑(w,c)∈D′

log(

1− 1

1 + e−vc·vw

)= arg max

vci ,vwi

∑(w,c)∈D

log1

1 + e−vc·vw+

∑(w,c)∈D′

log1

1 + evc·vw

= arg maxvci ,vwi

∑(w,c)∈D

log σ(vc · vw) +∑

(w,c)∈D′

log σ(−vc · vw)

In order to sample the elements of D′, word2vec uses unigram distributionraised to the 3

4th power for best quality of results 5 Even for this model, we

5Thomas Mikolov et al., "Distributed representations of words and phrases and theircompositionality.. Advances in neural information processing systems. 2013.

40

can now easily find the update equations for the weights 6

3.4 Support Vector Machine

Support Vector Machines (SVMs) are a set of supervised learning methodsused for classification and also for regression, initially conceived in Cortes etal., 1995.

3.4.1 Linearly Separable Binary Classification

We start with a training set formed by L training points, where each inputxi has D attributes (i.e. xi is a vector of dimensionality D), and is in one oftwo classes yi ∈ {1,−1}. It means that the training data is of the form:

{xi, yi} where i = 1, ..., L; yi ∈ {1,−1}; xi ∈ RD

We assume that the data is linearly separable, i.e. that we can find anhyperplane on graphs of x1, ..., xD separating the two classes. An examplewith D = 2 is represented in figure 3.8.

Figure 3.8: Hyperplane through two linearly separable classes (Retrievedfrom Fletcher, 2009)

6For the computation of the updating formula, see: Xin Rong, Word2vec ParameterLearning Explained arXiv preprint arXiv:1411.2738 , pp 13-14, 2014

41

The hyperplane can be described by

w · x + b = 0 (3.12)

where:

• w is normal to the hyperplane;

• b‖w‖ is the perpendicular distance from the hyperplane to the origin.

We define the Support Vectors as the examples of each class closest tothe separating hyperplane, and the aim of the SVM is to orientate this hy-perplane in such a way to be as far as possible from the closest members ofboth classes.

Referring to figure 3.8 and to the formula 3.12, implementing a SVMboils down to selecting the variables w and b so that our training data canbe described by:

xi ·w + b ≥ +1 for yi = +1

xi ·w + b ≤ −1 for yi = −1

These equations can be combined into:

yi(xi ·w + b)− 1 ≥ 0 ∀i (3.13)

If we now just consider the points that lie closest to the separating hy-perplane, i.e. the Support Vectors (shown in circles in the figure 3.8), thenthe two planes H1 and H2 that these points lie on can be described by:

xi ·w + b = +1 for H1

xi ·w + b = −1 for H2

Referring to figure 3.8, we define d1 as being the distance from H1 to thehyperplane and d2 from H2 to it. The hyperplane’s equidistance from H1

and H2 means that

d1 = d2 (3.14)

42

The quantity defined in 3.14 is known as the SVM’s margin. In order toorientate the hyperplane to be as far from the Support Vectors as possible,we need to maximize this margin.

Simple vector geometry (Fletcher, 2009) shows that the margin is equalto 1‖w‖ and maximizing it subject to the constraint in 3.13 is equivalent to

finding:

min ‖ w ‖ s.t. yi(xi ·w + b)− 1 ≥ 0 ∀i

Minimizing ‖ w ‖ is equivalent to minimizing 12‖ w ‖2 and the use of

this term makes it possible to perform Quadratic Programming (QP) opti-mization later on (Bertsekas, 1999). We therefore need ot find

min1

2‖ w ‖2 s.t. yi(xi ·w + b)− 1 ≥ 0 ∀i

In order to cater for the constraints in this minimization, we need toallocate them Lagrange multipliers α (Bertsekas, 1999), where αi ≥ 0 ∀i:

LP ≡1

2‖ w ‖2 −

L∑i+1

αi[yi(xi ·w + b)− 1]

≡ 1

2‖ w ‖2 −

L∑i+1

αiyi(xi ·w + b) +L∑i+1

αi

(3.15)

We wish to find the w and b which minimizes, and the α which maximizes3.15 (keeping αi ≥ 0 ∀i). We can do this by differentiating LP with respectto w and b and setting the derivatives to zero:

∂LP∂w

= 0⇒ w =L∑i=1

αiyixi (3.16)

∂LP∂b

= 0⇒L∑i=1

αiyi = 0 (3.17)

Substituting these expression into 3.15 gives a new formulation which,being dependent on α, we need to maximize:

43

LD ≡L∑i=1

αi −1

2

∑i,j

αiαjyiyjxi · xj s.t. αi ≥ 0∀i,L∑i=1

αiyi = 0

≡L∑i=1

αi −1

2

∑i,j

αiHi,jαj where Hi,j = yiyjxi · xj

≡L∑i=1

αi −1

2αTHα s.t. αi ≥ 0∀i,

L∑i=1

αiyi = 0

This new formulation LD is referred to as the Dual form of the PrimaryLP (Bertekas, 1999). It is worth noting that the Dual form requires only thedot product of each input vector xi to be calculated, this is important forthe Kernel Trick described in the next section.

Having moved from minimizing LP to maximizing LD, we need to find:

maxα

L∑i=1

αi −1

2αTHα s.t. αi ≥ 0∀i and

L∑i=1

αiyi = 0

This is a convex quadratic optimization problem, and we run a QP solver7which will return α and from 3.16 will give us w. What remains is to calcu-late b.

Any data point satisfying 3.17 which is a Support Vector xs will have theform:

ys(xs ·w + b) = 1

Substituting in 3.16 we obtain:

ys(∑m∈S

αmymxs + b) = 1

where S denotes the set of indices of the Support Vectors. S is determinedby finding the indices i where αi > 0. Multiplying through by ys and thenusing y2s = 1 from 3.4.1 and 3.4.1:

7see for instance http://abel.ee.ucla.edu/cvxopt/examples/tutorial/qp.html

44

y2s(∑m∈S

αmymxm · xs + b) = ys

b = ys −∑m∈S

αmymxm · xs

Instead of using an arbitrary Supporto Vector xs, it is better to take anaverage over all the Support Vector in S:

b =1

NS

∑s∈S

(ys −∑m∈S

αmymxm · xs)

We now have the variablesw and b that define our separating hyperplane’soptimal orientation and hence our Support Vector Machine.

3.4.2 Nonlinear Support Vector Machines

When applying the SVM to linearly separable data, we have defined a matrixH from the dot product of our input variables:

Hi,j = yiyjxi · xj = yiyjxiTxj (3.18)

We can re-write the 3.18 as following:

Hi,j = yiyjk(xi,xj) (3.19)

where k(xi,xj) is a family of functions called Kernel Functions (Cristian-ini et al., 2000) where we can find also the dot product k(xi,xj) = xi · xj,known as Linear Kernel.

The set of kernel functions is composed of variants of the dot product,in that they are all based on calculating inner products of two vectors. Thismeans that if the functions can be recast into a higher dimensionality spaceby some potentially non-linear feature mapping function x→ φ(x), only in-ner products of the mapped inputs in the feature space must be determined,without needing to explicitly calculate φ.

The reason that this Kernel Functions are useful is that there are manyclassification and regression problems that are not linearly separable in thespace of the inputs x, which might be in a higher dimensionality feature

45

space given a suitable mapping x → φ(x), like in the example in figure 3.9,where a Radial Basis Kernel8 has been used.

Figure 3.9: Dichotomous data re-mapped using Radial Basis Kernel (Re-trieved from Fletcher, 2009)

As we can see in figure 3.9, a data set that is not linearly separable inthe two dimensional data space x (as in the left hand side of the figure), canbe separable in the nonlinear feature space (right hand side of the figure)defined implicitly by a non-linear kernel function.

There are many kernel functions in literature: for a complete list, werecommend to see Souza, 2010.

3.4.3 SVM with more than two classes

Even if SVM have been created to compute binary classification, they can beextended also in multi-class frameworks, by decomposing such problems intobinary classification problems. In particular, two approaches are commonly

8

k(xi,xj) = e−‖xi−xj‖

2

2σ2

46

used: One-Vs-All and One-Vs-One (Chih-Wei et al., 2002).

The strategy One-Vs-All consists on fitting one classifier per class. Foreach classifier, the class is fitted against all the other classes. In addition toits computational efficiency (only a number of classifiers equal to the num-ber of class is needed), one advantage of this approach is its interpretability.Since each class is represented by one and one classifier only, it is possibleto gain knowledge about the class by inspecting its corresponding classifier.This is the most commonly used strategy and is a fair default choice.

One-Vs-One instead constructs one classifier per pair of classes. At pre-diction time, the class which received the most votes is selected. In the eventof a tie (among two classes with an equal number of votes), it selects theclass with the highest aggregate classification confidence by summing overthe pair-wise classification confidence levels computed by the underlying bi-nary classifiers. Since it requires to fit n∗(n−1)

2classifiers (where n is the

number of classes), this method is usually slower than One-Vs-All, due toits O(n2) complexity. However, this method may be advantageous for al-gorithms such as kernel algorithms, which don’t scale well with the numberof training samples. This is because each individual learning problem onlyinvolves a small subset of the data whereas, with One-Vs-All, the completedataset is used n times.

3.4.4 Limits of SVM

SVM is a very powerful algorithm for classification, but its implementationbecomes very slow in Big Data context: indeed the it time complexity is morethan quadratic with the number of samples which makes it hard to scale todataset with more than a couple of 10000 samples9.

For this reason, some Boosting techniques are often applied to train aclassifier with a large number of samples. One of the simplest and most usedBoosting technique for SVM is the Bagging (Kim et al., 2002).

9from http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

47

Bagging

A Bagging classifier is an ensemble meta-estimator that fits base classifierseach on random subsets of the original dataset and then aggregate their in-dividual predictions (either by voting, for the classification algorithms, orby averaging, for the regression) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of an estima-tor, by introducing randomization into its construction procedure and thenmaking an ensemble out of it, and to avoid overfitting.

Chapter 4

Descriptive analysis of the dataand first treatments

At the beginning of this work we received the first data to build the classifiers.More precisely, Orange furnished one training document for each one of thefollowing classes:

1. Accessories (for mobile phones, tablets, ...);

2. Assistance (technical assistance);

3. Contracts (for mobile phones, internet, ...);

4. Movies

5. People

6. Phones

Each document is composed of a huge number of lines taken from webpages belonging to the correspondent class. In this big data context, we ex-pect to find also some noise within the data.

The file Assistance being too big to be exploited in the computation (morethan 1 Gb, while the other file’s size are much more reduced), it has beendivided into 14 different files of about 100 Mb and we used only one of themfor the computation.

48

CHAPTER 4. DESCRIPTIVE ANALYSIS OF THE DATA AND FIRST TREATMENTS49

The table 4.1 shows some first features of the documents we used to buildthe classifiers.

Accessories Assistance Contracts Movies People PhonesSize 277 KB 107 MB 58 KB 6.27 MB 4.71 MB 2.05 MBNb ofwords 23 300 17 054 210 4 362 540 871 413 650 157 151

Nb of differentwords 1 998 106 516 612 40 606 32 529 2 416

Nb of differentstems 775 41323 237 15753 12620 937

Table 4.1: Dimensions of the training documents

In the tables from 4.2 to 4.7 we can find the 10 words with the highesttf-idf for each document (between brackets the translation in English), afterthe elimination of the stopwords.

Stem Tf-idfetui (holder for mobile phone) 0.469iphon (model of mobile phone) 0.414samsung (mobile phone brand) 0.302

folio (type of mobile phone holder) 0.271noir (black) 0.268

coqu (casing for mobile phone) 0.223galaxy (model of samsung mobile phones) 0.184

protect 0.137enceinte (speaker, but also pregnant) 0.130

plus 0.110

Table 4.2: 10 words with the highest tf-idf of the class Accessories


Words Tf-idfdépannag (repairing) 0.429

mobil 0.300install 0.253

accueil (welcome) 0.180samsung (mobile phone brand) 0.165

dossi (dossier) 0.157modif (modify) 0.155appui (press) 0.151

techniqu (technical) 0.148orang (orange, telecommunication corporation) 0.139

Table 4.3: 10 words with the highest tf-idf of the class Assistance

Words Tf-idfmobil 0.261

illimit (unlimited) 0.255internet 0.228

bouquet (contract) 0.222appel (phone calls) 0.206Go (Giga Byte) 0.191

4G 0.182runat 0.182intru 0.176

engag (commitment) 0.160

Table 4.4: 10 words with the highest tf-idf of the class Contracts


Words Tf-idfjeun (young) 0.262

plus 0.211femm (woman) 0.196

jour (day) 0.179homm (man) 0.162tout (all) 0.160vi (life) 0.160an (year) 0.155

film 0.140deux (two) 0.140

Table 4.5: 10 words with the highest tf-idf of the class Movies

Words Tf-idffilm 0.282

rol (role,part) 0.250realis (achieve) 0.250

an (year) 0.195anne (year) 0.191

the 0.179carrier (career) 0.171puis (then) 0.155

acteur (actor) 0.135cinem (cinema) 0.132

Table 4.6: 10 words with the highest tf-idf of the class People


Words Tf-idfnshyperlinklienextern 0.367

inserv 0.341runat 0.341

navigateurl 0.271inwc 0.260inen 0.245

inmutlink 0.205cssclass 0.205iphon 0.201plus 0.175

Table 4.7: 10 words with the highest tf-idf of the class Telephone

Analysing all these tables, we can notice that the most characteristicwords are coherent within their own class, except for the class Phones (4.7):the problem is tied to some noise in the training document, whose lines con-tain a lot of phone description as well as a huge number of hyper-links likethe following line:

Samsung Galaxy S6 blanc <wc ns:hyperlinklienexterne runat= inserverin target= in blank in navigateurl= inhttp://sites.orange.fr/shop/forfaitsmobiles/offres/promotion-orange-reprise.html in><wc ns:mediacontrolrunat= inserver in imagewidth= in260 in imageheight= in64 in idmedia=in12529 in></wc ns:mediacontrol></wc ns:hyperlinklienexterne>puissanceet design : osez l’excellence !serti de metal, habille de verreet galbe a la perfection, le galaxy s6 a ete faconne avec soin pourun resultat a couper le souffle. parce que sa perfection ne se limitepas a ses lignes, samsung a concentre tout son savoir-faire au coeurdu galaxy s6.orange intensifie sa couverture 4g dans les centres-villesde forte densite et permet encore plus de rapidite avec des debitsjusqu’a 223 mbit/s. <wc ns:hyperlinklienexterne runat= inserverin title= inen savoir plus sur la 4g+ in target= in blank in tooltip=inen savoir plus sur la 4g+ in text= inen savoir plus sur la 4g+in navigateurl= inhttp://reseaux.orange.fr/ in cssclass= inmutlink01in></wc ns:hyperlinklienexterne>

This kind of link is repeated in almost each line of the text, which causes


Words Tf-idfiphon (model of mobile phone) 0.260

plus 0.245target 0.205blank 0.205

ipad (model of tablet) 0.201Go (Giga Byte) 0.140

samsung (mobile phone brand) 0.132model 0.121

galaxy (model of samsung mobile phones) 0.119

Table 4.8: 10 words with the highest tf-idf of the class Telephone after thefiltering process

this noise in the data. That is the reason why we decided to execute a fil-tering on the training document of the class Phones, in which we delete allthe meaningless words of the class Phones that appeared in all this kind oflines. The words with the highest tf-idf of this class after the filtering arerepresented in the table 4.8 .

We notice that there are some words which appear in more than onetable: this suggests that there are classes more similar than others. Wecould suppose even only intuitively that some classes, like Accessories andTelephone or Movies and People, share a lot of characteristic words. In thehistograms in the figures from 4.1 to 4.6 we represent the 10 words with thehighest tf-idf for each class compared to all the others classes.


Figure 4.1: Words with the highest tf-idf of the class Accessories

Figure 4.2: Words with the highest tf-idf of the class Assistance


Figure 4.3: Words with the highest tf-idf of the class Contracts

Figure 4.4: Words with the highest tf-idf of the class Movies


Figure 4.5: Words with the highest tf-idf of the class People

Figure 4.6: Words with the highest tf-idf of the class Phones

In order to deeply understand the similarity between the classes, we com-pute the cosine similarity between each pair of training documents (we remind


Accessories Assistance Contracts Movies People PhonesAccessories - 0.1478 0.4827 0.0211 0.0314 0.5517Assistance 0.1478 - 0.1672 0.0407 0.0257 0.2450Contracts 0.4827 0.1672 - 0.0427 0.0524 0.5721Movies 0.0211 0.0407 0.0427 - 0.6017 0.0639People 0.0314 0.0257 0.0524 0.6017 - 0.0428Phones 0.5517 0.2450 0.5721 0.0639 0.0428 -

Table 4.9: Cosine similarity between the 6 classes

that we have one document for each class): from the values in table 4.9 wecan clearly notice some important similarities:

• The classes Movies and People are very similar;

• The classes Phones and Contracts are very similar;

• The classes Phones and Accessories are very similar;

• The classes Contracts and Accessories are quite similar;

• The classes Movies and People are very different from the others.

In order to compute these values, first of all, the training documents havebeen pre-processed, following the steps described in section 2.3 (consider-ing words as tokens). In order to compute tokenization, elimination of stopwords and stemming, the Python packages scikit-learn1 and nltk2, spe-cialized in machine learning classification algorithms and in natural languageprocessing, have been used.

At the end of the pre-processing step, the tf-idf matrix was computed:it consists on a matrix where the lines represent the training documents ofthe collection D = {d1, ..., dD} and the columns the terms T = {t1, ..., tT}which appear in at least one document after pre-processing. Each element ofthe matrix in position (j, k) represent the tf-idf weight, which is equal to 0if term tk does not appear in document dj, otherwise is calculated using the

1http://scikit-learn.org2http://www.nltk.org


Assistance Cinema StoreAssistance - 0.0482 0.2627Cinema 0.0482 - 0.0671Store 0.2627 0.0671 -

Table 4.10: Cosine similarity between the classes

formula 2.7.

Then, the cosine similarity (2.8) between each pair of documents is com-puted through a dot product.

Since the results show that some classes are very similar, we decided togroup them. In particular, we made the following changes:

• The classes Accessories, Contracts and Phones have been grouped intothe same class called Store;

• The classes Movies and People have been grouped into the same classcalled Cinema.

The cosine similarity between the 3 new classes is shown in the table 4.10.

We notice that the classes Assistance and Store are more similar, whilethe class Cinema is clearly more isolated.

After these modifications, Orange furnished some test queries for the 3new classes Assistance, Cinema and Store, that will be used from now on inall the classifiers, whose implementation is explained in the next chapter.

Chapter 5

Implementation of the classifiers

The first aim of this work was to build a query interpreter for the Orangesearch engine le moteur 1. The query interpreter receives a string of wordstyped by a user and has to understand which class (from a pre-defined set ofpossible classes) the query belongs to.

The classifiers we have analysed in the previous chapter have been im-plemented on real data collected from the search engine. In particular, weremind that for the training part, we had at our disposal some text docu-ments belonging to the following classes:

• Accessories (for mobile phones, tablets...);

• Assistance (technical assistance);

• Contracts (for mobile phones, internet...);

• Movies ;

• People;

• Phones.

and after the descriptive analysis step, we decide to group them into 3classes:

• Assistance;1http://www.lemoteur.fr

59

60

• Store (obtained merging the classes Accessories, Contracts and Phones);

• Cinema (obtained merging the classes People et Movies).

Starting from these documents, we trained our classifiers in order to letthem interpret a new query and classify it in one specific class or to put itin the Generic class if no matching was found. Orange asked for very goodresults in terms of performances: in particular, it is better to classify a querywhich actually belongs to one of our classes in the "General" class (false neg-ative error) instead of classifying it in the wrong class (false positive error).The false negative error should be reduced as much as possible.

In particular, Orange asked to evaluate each classifier using the sameperformance index: precision, recall and F1-score. The objective functionof our classifier is to maximise the precision, keeping the recall and the F1-measure as high as possible. We remind that:

• the precision index represents the percentage of the true positive classi-fications over all the times that the classifier proposes a class (differentfrom the Generic class):

precision =#true positives

#queries classified

• the recall index represents the percentage of the queries classified (cor-rectly or wrongly) in a class different from Generic over all the queriestests:

recall =#queries classified

#queries

• the F1-measure is a way to combine the precision and the recall and itconsists on their harmonic mean:

F1− score = 2× precision× recallprecision+ recall

61

5.1 Generic queries

Since we have a limited number of classes and they do not cover all the pos-sible themes of the queries and all the words of the French vocabulary, weexpect to find some queries which can not be classified in any of the classeswe know or which contains some words never found in the training corpus:in this case, the classifier is not triggered and the query should be labelledas Generic. Orange specified that a good classifier should minimize as muchas possible the number of queries wrongly classified (namely the number ofqueries for which the classifier has been trigged but which have been classi-fied in the wrong category) and that it is better to wrongly label a query asGeneric rather than to class it in a wrong class.

Independently from the classification algorithm, the first step that shouldbe performed is to understand if we have enough information to trigger theclassifier or if we should label the query as Generic.

We decide to perform a simple test at the beginning of each classifica-tion in order to answer to the question ’Do we have enough information inour training documents to trigger the classifier and to believe that the clas-sification will be reliable?’ : after applying the preprocessing techniques, wecount the number (or the percentage) Q of terms of the query present alsoin at least one of the training documents and we compare it with a threshold(which can be fixed or dependent to the length of the query) S :

1. if Q < S: the query is labelled as Generic;

2. if Q ≥ S: the classifier is triggered and the query will be classified withthe classification algorithm.

5.2 Preliminary test of the models

Being in a Big Data context, we will probably find lot of noise in our data,which can cause some mistakes in the test of the classifiers.

In order to confirm the model before testing it on the real data, we builtan ad hoc dataset, composed by 100 Assistance queries and 100 Cinema

62

queries that should be well classified with our training documents, whichhas been used to be sure that the model works well in an ideal context andthat there are no mistakes in its theoretical construction or in the code. Ifthis preliminary test is positive, we can proceed with the application of thealgorithm on the real test data, trying to understand the reasons behind themisclassification errors.

5.3 Naive Bayes Text Classifier

We implemented the Naive Bayes Text Classifier in Python, using only onetype of preliminary test to decide whether trigger the classifier or label thequery as Generic: a query is labelled as Generic if we do not recognize anyof its words after the preprocessing step. This means that the preliminarytest is: {

Q = 0 the query is labelled as GenericQ ≥ 1 compute the classification

We tested the model on:

• 34687 queries Assistance

• 5947 queries Cinema

• 2413 queries Store

and we obtained the results represented in table 5.1

Predicted classAssistance Cinema Store Generic

Real classAssistance 30155 1418 1963 1151Cinema 1266 3732 168 781Store 1521 162 543 187

Table 5.1: Results with Naive Bayes Text Classifier

We obtain the following index:

precision = 84.12%

63

recall = 95.08%

F1− score = 89.27%

5.4 Naive Bayes with n-Grams Text Classifier

We implemented the Naive Bayes Text Classifier using not only the words,but all the n-grams with n ∈ {1, 2, 3} together in Python: this means that webuilt a dictionary from the training documents with all the words (1-grams),2-grams and 3-grams and, for each query, we analysed all the words, 2-gramsand 3-grams. We used two types of preliminary test to decide whether trig-ger the classifier or label the query as Generic: a less conservative approach,where a query is labelled as Generic if we do not recognize any of its wordsafter the preprocessing step (as for the Naive Bayes Text Classifier), and amore conservative approach, where a query of K words is labelled as Genericif we recognize less than K n-grams) (with n ∈ 1, 2, 3). For instance, if weconsider a query of 4 words (with 9 n-grams : 4 1-grams, 3 2-grams and 2 3-grams), with the non conservative approach, the query is labelled as Genericif we do not recognize any word, while with the conservative approach, if werecognize less than 4 of its 9 n-grams.

Therefore, for a query q with K words and G n-grams, we count thenumber Q of known n-grams, and we compute the non conservative test:{


or the conservative test:{Q < K the query is labelled as GenericQ ≥ K compute the classification




64


and we obtained the results represented in tables 5.2 and 5.3.



Table 5.2: Results with Naive Bayes and n-grams Text Classifier - non con-servative



Table 5.3: Results with Naive Bayes and n-grams Text Classifier - conserva-tive

We obtained the following index with the non conservative (nc) approach:

precisionnc = 86.15%

recallnc = 95.08%

F1− scorenc = 90.40%

and, with the conservaitve (c) approach:

precisionc = 87.08%

recallc = 87.48%

F1− scorec = 87.28%

65

5.5 Markov n-Grams Language Modeling

We implemented the classifier based on Markov n-Grams Language Mod-eling with, at the same time, n ∈ {1, 2, 3} in Python: this means that webuilt a dictionary from the training documents with all the words (1-grams),2-grams and 3-grams and, for each query, we analysed all the words, 2-gramsand 3-grams. We used two types of preliminary test to decide whether trig-ger the classifier or label the query as Generic: a less conservative approach,where a query is labelled as Generic if we do not recognize any of its wordsafter the preprocessing step (as for the Naive Bayes Text Classifier), and amore conservative approach, where a query of K words is labelled as Genericif we recognize less than K n-grams) (with n ∈ 1, 2, 3). For instance, if weconsider a query of 4 words (with 9 n-grams : 4 1-grams, 3 2-grams and 2 3-grams), with the non conservative approach, the query is labelled as Genericif we do not recognize any word, while with the conservative approach, if werecognize less than 4 of its 9 n-grams.

Therefore, for a query q with K words and G n-grams, we count thenumber Q of known n-grams, and we compute the non conservative test:{


or the conservative test:{Q < K the query is labelled as GenericQ ≥ K compute the classification





and we obtained the results represented in tables 5.4 and 5.5.We obtained the following index with the non conservative (nc) approach:

precisionnc = 85.77%

66



Table 5.4: Results with non-conservative Markov n-Grams Language Mod-elling



Table 5.5: Results with conservative Markov n-Grams Language Modelling

recallnc = 95.08%

F1− scorenc = 90.18%

and, with the conservaitve (c) approach:

precisionc = 87.18%

recallc = 87.48%

F1− scorec = 87.33%

5.6 Multilayer Perceptron

We started implementing a Multilayer Perceptron in Python. In order totrain the MLP, we used as input each line of the training data, whose labelwas known. In particular, we built a dictionary composed by the V stemsappearing in the whole training corpus and we represented each training

67

input i as a V -dimensional binary vector xi = [xi,1, ..., xi,V ] where xi,k = 1 ifthe k-th word of the dictionary appears in the text and xi,k = 0 otherwise.We used the function MLPClassifier of the Python package scikit-learn2

and we tested different models with 1 and 2 hidden layers with a differentnumber of neurons in order to empirically find the best one. In order todecide whether trigger the classifier or label the query as Generic, we usedthe less conservative approach, where a query is labelled as Generic if we donot recognize any of its words after the preprocessing step. Therefore, fora query q with K words, we count the number Q of known words, and wecompute the non conservative test:{






The best MLP we obtained was composed by 2 hidden layers (the firstone with 15 neurons and the second one with 3 neurons): this model gaveun the results represented in table 5.6.



Table 5.6: Results with MLP

We obtained the following index:

precision = 40.73%

2http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html

68

recall = 95.08%

F1− score = 57.03%

5.7 Two-steps classifier

The results of the MLP are not satisfying since the precision value is verylow. From the table 5.6 we realize that the classifier seems not to distinguishthe classes Assistance and Store (which are the most similar classes - cfr thevalues of the cosine similarities in table 4.10).

Further to this observation, we decided to design a Two-step classifier.Since the classification problem between the classes Assistance and Store,which are very similar, seems to be different from the classification problembetween Cinema and the union of the two other classes, the idea is to splitthe classification in 2 different steps, as represented in figure 5.1:

1. First of all, given a query, we compute a first classification between theclasses Cinema and NotCinema, where NotCinema is the union of theclasses Assistance and Store;

2. If in the first step, the query is classified as Cinema, the algorithmends, otherwise we compute a second classification between the classesAssistance and Store.

This kind of algorithm are known as Sequential Classification Algorithms(Kołakowska et al., 2003). We tested different models and the combinationwhich gave the best results were:

1. First classifier: Naive Bayes; Second classifier: MLP with 2 hiddenlayers (2 and 16 neurons);

2. First classifier: Markov n-Grams ; Second classifier: MLP with 2 hiddenlayers (2 and 5 neurons);

3. First classifier: MLP with 2 hidden layers (2 and 30 neurons); Secondclassifier: MLP with 2 hidden layers (2 and 5 neurons)

69

Figure 5.1: Two-step classifier

4. First classifier: MLP with 2 hidden; layers (2 and 30 neurons) anda probability threshold of 0.6; Second classifier: MLP with 2 hiddenlayers (2 and 5 neurons) and a probability threshold of 0.6.

The difference between the third and the fourth models is the following:in the classic MLP (as the ones used in the third model) which computes abinary classification between 2 classes A and B, the Neural Network evaluatethe probability that the input belongs to class A (pA) and to class B (pB =1− pA) and it classifies the input in the classes with the biggest probability(i.e. in class A if pA > 0, 5 and in B otherwise). After the evaluationof the second two-steps classifier, we notice that the errors in classificationwas concentrated on low values of probabilities, while the average of theprobabilities of the true classes in true classification was higher. We decidedto use a more conservative approach, in which we defined a threshold t and:

• If pA > 0, 5 + t, the input is classed as A;

• If pB > 0, 5 + t, the input is classed as B;

• Otherwise (if pA, pB ∈ [0, 5−t; 0, 5+t]), the input is classed as Generic.

With this kind of classifier, we expect to obtain a lower recall, but wehope to improve the precision, a more important parameter in our problem.

We obtained the results expressed in tables 5.7, 5.8, 5.9 and 5.10.

70



Table 5.7: Results with Naive Bayes+MLP



Table 5.8: Results with Markov Language Modelling+MLP



Table 5.9: Results with MLP+MLP



Table 5.10: Results with MLP+MLP with threshold t = 0, 6


71

1st classifier 2nd classifier Precision Recall F1-ScoreNaive Bayes MLP 85.16% 95.08% 89.84%

Markov MLP 86.05% 95.08% 90.34%MLP MLP 82.99% 95.08% 88.62%

MLP MLP withthreshold 92.70% 81.61% 86.80%

Moreover, we tested the models with only MLP (the two-step classifierMLP+MLP without and with threshold t = 0, 6) using as input the resultof the embedding algorithm, instead of the binary vector we used before.Starting from the binary vector xi ∈ {0, 1}V of the input query (a vectorof V elements, where V is the size of the training corpus vocabulary - inthis case V = 71645 - and xi,k = 1 if the k-th word of the vocabularyappears in the input, 0 otherwise), we compute the embedded vector hi ∈ RN

(with N � V ) following the algorithm explained in 3.3.6. We tested theembedding algorithms on different values of N and we empirically choseH = 200. In figure 5.2 we represent some of the stems of the vocabulary ofour training data on the plane identified by the first and the second principalcomponent after application of PCA (Principal Component Analysis) on the200 elements (features) of the embedded vectors.

72

Figure 5.2: Some stems on the plan of the two first Principal Components ofthe 200 features of the embedding

We recognize some clusters, highlighted by the following circles:

• In the green one, we find some mobile phones’ brands;

• In the blue one, the words salut (’hello’), bonjour (’good morning’) andbonsoir (’good evening’);

• In the purple one, some colors;

• In the brown ones, some name and surname of actors.

some words from Assistance vocabulary spread all over the plane andsome outliers.

We obtained the results expressed in tables 5.11 and 5.12.

73



Table 5.11: Results with MLP+MLP with embedding



Table 5.12: Results with MLP+MLP with threshold t = 0, 6 and embedding


1st classifier 2nd classifier Precision Recall F1-ScoreMLP withembeddings

MLP withembeddings 83.44% 90.50% 86.83%

MLP withembeddings

MLP withembeddingsand threshold

90.26% 81.46% 85.64%

5.8 SVM with Bagging

As already said in 3.4.4, the SVM is too slow to be implemented in its basicform on our training dataset composed by more than 100000 training data.For this reason, we directly implemented the SVM with Bagging in Python,using the package scikit-learn3.

In order to decide whether trigger the classifier or label the query asGeneric, we used the less conservative approach, where a query is labelled as

3http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html

74

Generic if we do not recognize any of its words after the preprocessing step.Therefore, for a query q with K words, we count the number Q of knownwords, and we compute the non conservative test:{






We obtain the best compromise between computational time and perfor-mances using 1000 SVM implemented on the 5% of the training data and10% of the 200 features given by the embeddings terms. this model gave unthe results represented in table 5.13.



Table 5.13: Results with MLP


precision = 88.45%

recall = 95.08%

F1− score = 91.64%

75

5.9 Analysis of the results

In the table REF we compare the performances and the computational timeof all the classifiers we have implemented.

Computationaltime Recall Precision F1-Score

Naive Bayes 13.43 s 95.08% 84.12% 89.27%Naive Bayeswith n-Grams

non conservative18.51 s 95.08% 86.15% 90.40%

Naive Bayeswith n-Gramsconservative

19.02 s 87.48% 87.08% 87.28%

Markov Ngramsnon conservative 118.1 s 95.08% 85.77% 90.18%

Markov Ngramsconservative 118.1 s 87.48% 87.18% 87.33%

MLP 40.89 s 95.08% 40.73% 57.03%Naive Bayes+ MLP 21.19 s 95.08% 85.16% 89.84%

Markov Ngrams+ MLP 139.05 s 95.08% 86.05% 90.34%MLP + MLP 68.39 s 95.08% 82.99% 88.62%MLP + MLPthreshold t=0.6 70.01 s 81.61% 92.70% 86.80%

MLP + MLPembedding

136.58s (embedding)+32.32s 90.50% 83.44% 86.83%

MLP + MLPembedding, threshold t=0.6

136.58s (embedding)+33.10s 81.61% 90.26% 85.64%

SVM with Bagging 641 s 88.45% 95.08% 91.64%

After the implementation of the models, we analysed the false positives(the queries wrongly classified) in order to better understand the reasons ofthe mistakes and to look for some trends in errors.

The first aspect we noticed was that, in each model, the best perfor-mances are obtained with the queries of the class Assistance, while the worstones are often linked to the class Store: the Assistance training document is

76

the largest one, which implies a deeper precision in the recognition of queriesbelonging to this class. Moreover, the classes Assistance and Store are quitesimilar, and this probably caused a huge number of false positives in the classStore.

With a deeper analysis, we realized that the biggest proportion of mis-takes in classification of the class Assistance was linked to short queries (1 or2 words) that can not be directly classified as Assistance queries since theyare very general and can receive different labels. Some examples are:

• facture (bill)

• livebox (Orange modem)

• chat

• wifi extender

The Cinema test set includes a large number of queries containing:

• the names and the cities of the movie theatres

• the postal code or the name of a city

while the training corpus contains only information about films and ac-tors. As a consequence, all these kind of queries (which represents about 20%of all the Cinema queries) are classified as Generic or even in another class.Some other queries with only one or two significant words representing thetitle of a film, manga or cartoon which do not belong to the training set andwhose meaning is not directly linked to the cinema vocabulary, are classifiedin other classes. Some examples are:

• taupe from the movie ’La Taupe’ is also a colour and is classified as Storesince this word often appears in the training corpus of Accessories ;

• message from the movie ’The Message’ is classified as Assistance be-cause of all the sentences in training set like ’error message’ ;

• rogue one is a cartoon but also a PC virus.

Finally, we notice that a relevant part of the boutiques queries are classi-fied as Assistance, probably since the two classes are quite similar and thereare lot of ambiguous queries.

77

5.9.1 Comparison between the different models

The most high-performing algorithm in term of precision is the two-stepMLP+MLP model with a threshold of 0.6. This is not surprising, since theintroduction of the threshold makes the classifier more conservative: it is lessoften trigged, but when it happens, it is due to a larger degree of belief thatthe predicted class is the true one. However, this approach causes a rise ofthe number of Generic queries, which means a lower recall and, consequently,a lower F1-score.

The best compromise between precision and recall is obtained with theSVM with Bagging, which gives the highest F1-score. However, this algo-rithm is the slowest one, and this can cause some problems when it will beapplied on all the data.

We can notice that the Naive-Bayes, being the simplest algorithm, is alsothe speediest one, but its performances are comparable with the other morecomplex methods. It seems to be the best compromise between performances,complexity and computational time.

Finally, adding the embedding term model, we reduce the size of thevector representing the text lines, making the computation of the algorithmsless expensive. The computation of the embeddings requires a little bit oftime at the beginning, but let gain a lot of seconds in the classification. Thisaspect is very important if we consider that all these algorithms will be testedby Orange on a much more big data set.

Conclusion and FurtherDevelopments

The aim of this work was to build a query interpreter for the search enginelemoteur, able to recognize if a query belongs to a specific class or if it shouldbe labelled as Generic. One of the main interest of this project consists onits industrial application: a deep research in the field of text mining has beenled, where different machine learning algorithms for supervised text classifi-cation have been studied and applied on data coming from the search engineby Orange, with the aim of improving its performances.

After a descriptive analysis showing that some classes were very similar,we decided to merge some of them in order to increase the accuracy of theclassification.

Before computing the algorithms, we had to find a mathematical modelfor the text lines. After the classical preprocessing techniques, we created twodifferent models for the text: the binary vector of size equal to the size of thevocabulary and the embedding vector obtained with the word2vec algorithm.We fixed a preliminary step in order to decide if a query can be classified orif it should be labelled as Generic: this stage consisted on understanding ifin the training data there were enough information to have a large degree ofconfidence that the proposed classification was actually the true one. Then,we implemented and tested several algorithms and we combined them in se-quential classifier.

The most challenging part of this study is the industrial adaptation andapplication of some important methods and models created in research frame-works. In fact it is not always easy to conciliate the need of a company, which

78

79

often analyse only the performances and the computational time and are of-ten stagnant in old techniques, and the innovative algorithms proposed bythe literature, which requires an high level of understanding of the subject,a great open mindedness and a big and risky investment of time and money.

The implemented model can satisfy different needs: the simplest algo-rithm (the Naive Bayes Classifier) is the fastest one and its performancesare quite satisfying, but the best results are obtained with more complexmodels (with a sequence of Multi-layer Perceptrons and with a Support Vec-tor Machine with Bagging), which requires a larger computational time. Itis important to well define the need and the priority of the problem in orderto choose the best solution.

This work is the first research stage for a complex project of the companyOrange. In this thesis we have studied several algorithms and approacheson a small subset of data compared to the huge amount of data producedeach hour by lemoteur. The aim was to study the different approaches ina simpler context in order to find the best ways to undertake to extend thesolution on the entire framework. For this reason, the natural developmentsare linked to the extension of the best algorithms to all the training data andtheir validation in a more complex context.

Ringraziamenti

À toutes les personnes qui m’ont suivie chez Sogeti High Tech et Orange,merci de m’avoir donnée l’opportunité de travailler sur ce projet si intéres-sant et enrichissant. Merci en particulier à Flavien et Hatem, pour avoirautant cru en moi. Merci aussi à Alix pour tous ses conseils. Et enfin, àtous les stagiaires que j’ai rencontrés chez Sogeti, merci pour ces 6 mois quel’on a partagés ensemble dans l’open space et en dehors.

Al professor Simone Vantini, grazie per aver ancora una volta accettato diseguirmi a distanza per questa tesi. Grazie per la disponibilità e per i preziosiconsigli nella stesura di questo elaborato.

Al mio papà, grazie per essere sempre un modello di tenacia e di forzadi volontà. Alla mia mamma, grazie perché, da vicino o da lontano, tu nonmi lasci mai. A mia sorella Veronica, e coinquilina negli ultimi anni, grazieper esserti presa cura di me (so che per te resterò sempre una "piccola sedi-cenne"!). A tutti e tre, un immenso grazie per il vostro amore incondizionatoche tiene unita tutta la nostra famiglia.

Alle mie nonne Maria e Giuseppina, a cui dedico questa tesi, grazie peravermi mostrato che si può avere uno spirito semplice e grande allo stessotempo.

Alle mie zie e ai miei cugini, grazie perché con voi è proprio vero chel’unione fa la forza.

Alle mie amiche di Sondrio, le mie meraviglie, grazie perché, nonostantela distanza, è sempre bello ritrovarvi. E grazie perché so che ci siete.

81

82

A tutti gli Ingegneri Matematici, quelli incontrati fin dai primi due anni equelli conosciuti o riscoperti dopo il mio rientro, grazie perché alla fine, convoi al mio fianco, non è poi stata cosí male. E grazie perché prima di esserestati compagni di università, siete stati amici e compagni di vita.

À tous mes amis rencontrés pendant mes deux années à l’Ecole Centralede Lyon, merci car je n’aurais jamais pu imaginer que vous pourriez me don-ner et me changer autant. Merci car grâce à vous, si un jour je devrai parlerdes plus beaux années de ma jeunesse, je ne pourrai que commencer par lesmoments partagés avec vous.

À mon Pierre-Étienne, un immense merci pour avoir toujours cru en moiet pour m’avoir aidé à fair ressortir le meilleur de moi. Merci pour tout ceque l’on a vecu ensemble, malgré la distance, et pour tout ce qui encore nousattend, car le mieux doit encore arriver.MAzion

Valentina

Bibliography

[1] Frakes,William B. . 1992. "Stemming Algorithms". In s Software Engi-neering Guild, Sterling, VA 22170, chap. 8.

[2] Riedmiller, Martin. 1994. "Advanced supervised learning in multi-layerperceptrons — from backpropagation to adaptive learning algorithms".Computer Standards & Interfaces 16.3: 265-278.

[3] Cortes,Corinna and Vapnik,Vladimir . 1995. "Support-vector net-works". Machine learning 20.3: pp 273-297.

[4] Hagan, Martin T. , Demuth, Howard B., Hudson Beale, Mark ,De Jesús, Orlando. 1996. Neural network design. Vol. 20. Boston: PWSpublishing company..

[5] Salton,Gerard and Buckley, Christopher. 1998. "Term-weighting ap-proaches in automatic text retrieval". Information processing & manage-ment, 24.5: 513-523.

[6] Studer, Rudi, Benjamins, V. Richard, Fensel, Dieter. 1998. "Knowl-edge engineering: principles and methods". Data & knowledge engineer-ing 25.1: pp 161-197.

[7] Bertsekas, Dimitri P. . 1999. Nonlinear programming. Belmont: Athenascientific.

[8] Cristianini, Nello and Shawe-Taylor, John. 2000. An introduction tosupport vector machines and other kernel-based learning methods. Cam-bridge university press.

[9] Chih-Wei, Hsu and Chih-Jen, Lin. 2002. "A comparison of methodsfor multiclass support vector machines". IEEE transactions on NeuralNetworks 13.2, pp 415-425.

84

BIBLIOGRAPHY 85

[10] Johnson, Richard A. and Wichern, Dean. 2002. Applied multivariatestatistical analysis. Vol. 5. No. 8. Upper Saddle River, NJ: Prentice hall.

[11] Kołakowska, Agata and Malina, Witold. 2003. "Sequential Classi-fication". Neural Networks and Soft Computing: pp. 430-445. SpringerBerlin Heidelberg.

[12] Peng, Fuchun and Schuurmans, Dale. 2003. Combining naive Bayesand n-gram language models for text classification. Springer.

[13] Manning, Christopher D., Raghavan, Prabhakar, Schütze, Hinrich.2008. Introduction to information retrieval. Vol. 1. No. 1. Cambridge:Cambridge university press.

[14] Fletcher, Tristan . 2009. Support Vector Machinesexplained. [Online]. http://sutikno. blog. undip. ac.id/files/2011/11/SVM-Explained. pdf.[Accessed 06 06 2013].

[15] Souza,César R. 2010. Kernel functions for machine learning applica-tions. Creative Commons Attribution-Noncommercial-Share Alike 3.

[16] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg,Dean, Jeffrey. 2013."Distributed Representations of Words and Phrasesand their Compositionality". Advances in neural information processingsystems : pp. 3111-3119.

[17] Bian, Jiang, Gao, Bin, Liu, Tie-Yan. 2014. "Knowledge-powered deeplearning for word embedding". Joint European Conference on MachineLearning and Knowledge Discovery in Databases. Springer Berlin Heidel-berg.

[18] Goldberg, Yoav, Levy, Omer. 2014. word2vec Explained: Deriv-ing Mikolov et al.’s Negative-Sampling Word-Embedding Method. arXivpreprint arXiv:1402.3722.

[19] Gurusamy, Vairaprakash and Kannan, Subbu. 2014. PreprocessingTechniques for Text Mining. Madurai Kamaraj University.

[20] Rong, Xin. 2014. word2vec parameter learning explained. arXiv preprintarXiv:1411.2738.

BIBLIOGRAPHY 86

[21] Sidorov, Grigori, Velasquez, Francisco, Stamatatosb, Efstathios,Gelbukha, Alexander, Chanona-Hernándezc, Liliana. 2014. "Syn-tactic N-grams as machine learning features for natural language process-ing". Expert Systems with Applications 41.3: pp 853-860.

[22] Kim, Hyun-Chul, Pang, Shaoning, Je, Hong-Mo, Kim, Daijin, Bang,Sung-Yang. 2002. "Support Vector Machine Ensemble with Bagging".Pattern recognition with support vector machines. Springer Berlin Heidel-berg: pp 397-408.

List of Figures

1.1 Homepage of the site http://www.lemoteur.fr . . . . . . . 41.2 Homepage of the site http://actu.orange.fr/ . . . . . . . 41.3 Homepage of the site https://assistance.orange.fr/ . . . 51.4 Homepage of the site http://boutique.orange.fr/ . . . . . 51.5 Homepage of the site http://tv.orange.fr/ . . . . . . . . . 61.6 lemoteur without the preliminary classification step . . . . . . 71.7 lemoteur with the preliminary classification step . . . . . . . 7

3.1 Neurons in a Biological Neural Network (Retrieved from http://www.mind.ilstu.edu) 263.2 Artificial Neuron. Retrieved from http://kryten.mm.rpi.edu/SEP/index8.html 273.3 Different kinds of neurons. Retrieved from http://cs231n.github.io/neural-networks-1/ 283.4 MLP with one hidden layer. Retrieved from https://www.dtreg.com/solution/view/21 303.5 Sigmoid function (left) and hyperbolic tangent (right) . . . . . 313.6 The Skip-Gram model. Retrieved from Rong, 2014. . . . . . . 353.7 An example binary tree for the hierarchical softmax model.

Retrieved from Rong, 2014 . . . . . . . . . . . . . . . . . . . . 373.8 Hyperplane through two linearly separable classes (Retrieved

from Fletcher, 2009) . . . . . . . . . . . . . . . . . . . . . . . 403.9 Dichotomous data re-mapped using Radial Basis Kernel (Re-

trieved from Fletcher, 2009) . . . . . . . . . . . . . . . . . . . 45

4.1 Words with the highest tf-idf of the class Accessories . . . . . 544.2 Words with the highest tf-idf of the class Assistance . . . . . 544.3 Words with the highest tf-idf of the class Contracts . . . . . . 554.4 Words with the highest tf-idf of the class Movies . . . . . . . 554.5 Words with the highest tf-idf of the class People . . . . . . . . 564.6 Words with the highest tf-idf of the class Phones . . . . . . . 56

87

LIST OF FIGURES 88

5.1 Two-step classifier . . . . . . . . . . . . . . . . . . . . . . . . . 695.2 Some stems on the plan of the two first Principal Components

of the 200 features of the embedding . . . . . . . . . . . . . . 72

List of Tables

4.1 Dimensions of the training documents . . . . . . . . . . . . . . 494.2 10 words with the highest tf-idf of the class Accessories . . . . 494.3 10 words with the highest tf-idf of the class Assistance . . . . 504.4 10 words with the highest tf-idf of the class Contracts . . . . 504.5 10 words with the highest tf-idf of the class Movies . . . . . . 514.6 10 words with the highest tf-idf of the class People . . . . . . 514.7 10 words with the highest tf-idf of the class Telephone . . . . 524.8 10 words with the highest tf-idf of the class Telephone after

the filtering process . . . . . . . . . . . . . . . . . . . . . . . . 534.9 Cosine similarity between the 6 classes . . . . . . . . . . . . . 574.10 Cosine similarity between the classes . . . . . . . . . . . . . . 58

5.1 Results with Naive Bayes Text Classifier . . . . . . . . . . . . 625.2 Results with Naive Bayes and n-grams Text Classifier - non

conservative . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645.3 Results with Naive Bayes and n-grams Text Classifier - con-

servative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645.4 Results with non-conservative Markov n-Grams Language Mod-

elling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665.5 Results with conservative Markov n-Grams Language Modelling 665.6 Results with MLP . . . . . . . . . . . . . . . . . . . . . . . . . 675.7 Results with Naive Bayes+MLP . . . . . . . . . . . . . . . . . 705.8 Results with Markov Language Modelling+MLP . . . . . . . . 705.9 Results with MLP+MLP . . . . . . . . . . . . . . . . . . . . . 705.10 Results with MLP+MLP with threshold t = 0, 6 . . . . . . . . 705.11 Results with MLP+MLP with embedding . . . . . . . . . . . 735.12 Results with MLP+MLP with threshold t = 0, 6 and embedding 735.13 Results with MLP . . . . . . . . . . . . . . . . . . . . . . . . . 74

89

POLITECNICODIMILANO · 2016. 11. 9. · POLITECNICODIMILANO SCUOLA DI INGEGNERIA INDUSTRIALE E...

Documents

Transcript of POLITECNICODIMILANO · 2016. 11. 9. · POLITECNICODIMILANO SCUOLA DI INGEGNERIA INDUSTRIALE E...