
Text Classification of Political Documents Using Parsimonious Language Models

Sicco N.A. van Sas
[email protected]

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master's in Artificial Intelligence

Faculty of Science
University of Amsterdam

Supervisor: dr. Maarten Marx

Information and Language Processing Systems
Informatics Institute

University of Amsterdam

May 6, 2013


The journey that led to the thesis that lies in front of you has been an exciting one. I would like to thank the people who contributed to it in different ways throughout the process. Foremost, Maarten for his stimulating ideas, constructive feedback and active guidance. Ben, Breyten and the others at the Open State Foundation / Stichting Het Nieuwe Stemmen for the opportunity to combine my research with an internship, providing the initial idea for this project and opening up my mind to the possibilities for artificial intelligence to make politics more accessible. Finally, my family, friends and Lau for being there and supporting me in so many ways. Thank you all!

The code I have written for this research will be made open source and available at: https://github.com/siccovansas/plm-text-classification


Abstract

EuroVoc is the official thesaurus used to classify EU legislation and documents produced by the European Parliament. It contains nearly 7000 hierarchically structured concepts and is available in all official EU languages. Manually labeling documents with EuroVoc concepts is a time-consuming task which may be facilitated by a system which, given a document, automatically recommends a ranked list of thesaurus concepts. As a large number of hand-labeled EU documents is available, such a recommender system can be developed using supervised learning techniques. This thesis describes such a system based on parsimonious language models (PLM) and compares its effectiveness to a state-of-the-art system, JEX, which is based on the vector space model. The PLM system is language independent and needs only two parameters to optimize, in contrast to JEX, which uses large stop lists and has 6 parameters. We show that PLM outperforms JEX on a data set of European Union legislation labeled with EuroVoc concepts and on a data set of Dutch parliamentary questions labeled with a smaller taxonomy. More importantly, the PLM system is robust: it still gives useful recommendations when trained on EU documents and applied to non-EU, but still political, documents.¹

¹ A one-page abstract of this thesis has been accepted for BENELEARN'13 and a twelve-page paper has been submitted for TPDL'13.


Contents

1 Introduction
   1.1 Motivation
   1.2 Main Results
   1.3 Research Questions
   1.4 Overview

2 Related Work
   2.1 Multi-label (Text) Classification
   2.2 EuroVoc and EUR-Lex
   2.3 Parsimonious Language Models

3 Theory & Method
   3.1 Vector Space Model and JEX
      3.1.1 JEX
   3.2 Language Models and PLM
      3.2.1 Parsimonious Language Models

4 Experimental Setup
   4.1 Taxonomies
      4.1.1 EuroVoc
      4.1.2 Taxonomie Beleidsagenda
   4.2 Data Sets
      4.2.1 Acquis
      4.2.2 Dutch Parliamentary Questions
   4.3 Data Preparation
      4.3.1 Acquis
      4.3.2 Dutch Parliamentary Questions
   4.4 Implementation
      4.4.1 Concept Models
      4.4.2 Efficiency

5 Experiments and Results
   5.1 Can a PLM Classification System Outperform JEX in the Effectiveness of Its Classifications?
      5.1.1 Acquis Experiment
      5.1.2 Parliamentary Questions Experiment
   5.2 What Are the Effects of Different Document Representations on the PLM Method?
      5.2.1 Does Lemmatization Increase the Classification Scores?
      5.2.2 Do Bigram Models Increase the Classification Scores?
   5.3 How Robust Is the PLM Method?
      5.3.1 Does Training Data from a Certain Time Frame Affect the Performance?
      5.3.2 How Does the PLM System Perform on Different Data Sets and Taxonomies?
      5.3.3 How Does the System Perform When Train and Test Documents Stem from Different Sources?
      5.3.4 Does Classification Improve When Using More General and Thus Fewer Concepts?
      5.3.5 How Much Parameter Tuning Is Needed?

6 Conclude and Discuss
   6.1 Conclusion
   6.2 Discussion
   6.3 Future Work

A Calculation of Log-likelihood

B Concept Models


1 Introduction

1.1 Motivation

More and more data becomes digital and increasing amounts of data are shared via the internet. In the case of text documents, there is a need to easily find documents relevant to an information need. One approach is to use search engines to search for words deemed present in the relevant documents to be retrieved. Another method is to find documents via an index which has the documents classified into categories, be it using a controlled vocabulary or freely applied tags. Whereas the search approach is well suited for automation, the index approach works on a higher semantic level, as the topics of whole documents need to be inferred and summarized into just a few terms. An advantage compared to the search approach is that with categorized documents users will only be presented with documents relevant to a certain concept, given that the requested concept exists. Moreover, indexed documents may not even contain the exact concept term (e.g., a document labeled with the concept fish might not actually contain the word fish), but can still be fully relevant to this concept. Vice versa, documents containing a concept term might not be relevant to that concept (e.g., if the term is used just once) and should therefore not be marked as relevant for it. The search approach focuses more on occurrence and takes relevance less into account than the index approach.

Manual classification is laborious, and this poses a challenge with the increasing number of documents that could be classified. Fully automated or semi-automated approaches could reduce this effort. Recently, the European Commission's Joint Research Centre (JRC) released such a system [19], called JRC EuroVoc Indexer (JEX), which classifies arbitrary texts with concepts from the EuroVoc thesaurus. This is the official thesaurus used to classify EU legislation and documents produced by the European Parliament. It contains nearly 7000 hierarchically structured concepts and is available in all official EU languages.

JEX does multi-label text classification: with respect to a document, each EuroVoc concept is given a relevance score which is used to rank the concepts. A user can then use these ranked concepts as assistance in the manual classification process, or the system can be set to automatically annotate the document with, e.g., the 6 highest ranked concepts. JEX is created using supervised learning on a data set of European legislation in 22 languages, called JRC-Acquis, which is also released by the JRC [18]. It differs from the popular Reuters RCV1 data set [11] in having fewer documents, on average 32 000 for each language compared to more than 800 000 for RCV1, yet it features well over an order of magnitude more concepts, on average 3963 compared to RCV1's 103. This poses a tough problem, as each concept has


fewer documents labeled with it, i.e., there is less training data. Furthermore, at the classification stage the system has to deal with more concepts. Each document in the JRC-Acquis data set has on average 6 manually assigned concepts. Generally, out of the 6 concepts ranked highest by JEX, 3 are correct (more precisely, for the English JRC-Acquis data set it scores an R-precision of 0.56 and a mean average precision of 0.58, see Section 5.1).

In essence, JEX uses information retrieval much like an inverted search engine: it treats a to-be-labeled document as a query, and it returns all EuroVoc concepts ranked based on their relevance to that query. Like a search engine, it represents concepts as vectors assigning scores to words. The two main tasks in training JEX are 1) choosing which words to include in the model of a specific concept and 2) if included, which score to give that word. For the first task, JEX uses large language dependent lists of stop words and the log-likelihood corpus comparison technique [9]. Scores are calculated using a variant of tf-idf. Tuning the system is quite complicated, as it contains 6 parameters which need to be set.

We wanted to build a classifier with similar or better performance than JEX but which 1) uses no language dependent tools and 2) is easier to tune. We chose the language modeling approach to search, as it has shown good performance with long queries and is closely related to the Naive Bayes approach to text classification [10, 13]. We used the parsimonious language modeling technique [7] in order to select the words which are most relevant for each concept. Creating a parsimonious language model for a EuroVoc concept requires only one parameter to optimize: the influence of the background corpus. The system is evaluated and compared against JEX on political documents consisting of data sets from JRC-Acquis in 19 languages and a data set of Dutch parliamentary questions. The PLM system is further tested for robustness by applying it to different situations, e.g., training it on one data set and classifying documents from another data set.

1.2 Main Results

The PLM system is language independent and needs only two parameters to optimize: the influence of the background corpus on the concept model and on the model of the to-be-labeled document. With unigram language models, PLM significantly outperforms JEX in 11 of the 19 tested EU languages. With bigrams, PLM outperforms JEX in all languages. The difference in performance is even larger on a Dutch political data set with only 110 different concepts. The PLM system is robust: it still gives useful recommendations when trained on EU documents and applied to non-EU, but still political, documents.


1.3 Research Questions

The goal of this thesis is to create a robust, language independent text classification system with as few parameters as possible. The following research questions (RQs) address these issues in a hierarchical fashion.

Main RQ Can parsimonious language models be used to create a language independent, easy to optimize, multi-label text classification system?

The research focuses on the creation of a multi-label text classification system with the parsimonious language modeling technique at its core. How feasible is this technique for such a system, and can it reduce the reliance on language dependent techniques and the number of parameters that need optimization?

RQ 1 Can a PLM classification system outperform JEX in the effectiveness of its classifications?

JEX uses a similar information retrieval approach to the task of multi-label text classification. It differs from the PLM system in that it uses stop lists and more parameters.

RQ 2 What are the effects of different document representations on the PLM method?

The text in the documents can be represented in different ways. The default method works with unigrams, but what are the effects of other document representations?

RQ 2.1 Does lemmatization increase the classification scores?

With lemmatization, inflected versions of the same word are cast to one representation, thereby reducing the number of distinct words.

RQ 2.2 Do bigram models increase the classification scores?

Bigrams add to the number of unique tokens used to represent the document, as each token now consists of two consecutive words.

RQ 3 How robust is the PLM method?

When presented with different types of data, how robust are the results of the PLM classification system? Is it possible to infer general strong and weak points when dealing with certain types of data?


RQ 3.1 Does training data from a certain time frame affect the performance?

Many document collections are formed over time, which influences their content as language usage changes as well. How does this affect the PLM system?

RQ 3.2 How does the PLM system perform on different data sets and taxonomies?

More insight into the performance range of the system can be acquired by comparing its results on different data sets and taxonomies.

RQ 3.3 How does the system perform when train and test documents stem from different sources?

Normally the system is trained on data from a certain source and evaluated on that same source. In practice, though, the system might be used to classify documents from different sources.

RQ 3.4 Does classification improve when using more general and thus fewer concepts?

Taxonomies often have their concepts structured hierarchically. What happens if the system is trained to classify only the more general concepts in the hierarchy?

RQ 3.5 How much parameter tuning is needed?

Do the parameters of the system have a significant effect on the results, and do these parameters need to be tuned for different data sets?

1.4 Overview

Section 2 deals with related work on multi-label (text) classification, EuroVoc and parsimonious language models. Section 3 presents the theory and method of the JEX and PLM systems. Section 4, the experimental setup, describes the taxonomies and data sets, together with the effects of preprocessing and the implementation. The experiments and their results, structured by research question, are given in Section 5. The conclusion, discussion and future work are found in Section 6.


2 Related Work

Related work on multi-label (text) classification, EuroVoc and parsimonious language models is described in this section.

2.1 Multi-label (Text) Classification

Text classification is extensively researched [17, 1] and can be reduced to a process of three steps: feature selection, classifier induction and evaluation. Different information retrieval (IR) techniques can be used at all stages, though classifier induction is often done using non-IR methods [17].

Typical IR feature selection methods use a set/bag of words in some n-gram form and calculate weights for these words using a variant of tf-idf [13]. The quality of the data can be improved using stop lists and stemmers or lemmatizers. Non-IR methods include various ways of dimensionality reduction [1], as the large feature space can be problematic for some classifier induction algorithms. Our classifier induction method, based on parsimonious language models, is basically a dimensionality reduction approach, as it tries to find only the relevant words for each concept (i.e., giving them a high probability) and ignores irrelevant words (i.e., setting their probability to zero). We will show that JEX' approach also reduces the dimensionality.

We focus on multi-label (text) classification, but many single-label classification methods cannot be directly applied to the multi-label problem. In [20] the approaches to multi-label classification are divided into two groups: problem transformation and algorithm adaptation. In the first group, the multi-label data is converted into single-label data, thereby allowing normal classification algorithms to be applied. Our PLM method fits in this group, as we do not directly feed our algorithm the texts labeled with multiple concepts, but rather one large concatenated text consisting of all texts labeled with the same concept. With this input our algorithm functions the same as if it were doing single-label classification. Algorithm adaptation is the opposite, as normal single-label classification algorithms are adapted to work directly with multi-label data. Adapted versions are available for, amongst others, AdaBoost [16], back-propagation [21], SVM [5, 6] and Bayesian networks [2].

Multi-label classification also requires different evaluation measures than single-label classification [20, 13]. Furthermore, different measures are used depending on whether a classifier outputs a ranked list of concepts or a bipartition of unranked positive and negative concepts. Our interest is in a ranked output, for which measures like one-error, ranking loss, eleven-point interpolated average precision, mean average precision (MAP), R-precision (Rprec) and normalized discounted cumulative gain can be used. Measures used in the bipartition case are the F-measure (based on precision and recall) and Hamming loss.
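To make two of these ranked measures concrete, here is a minimal sketch computing average precision (the per-document quantity underlying MAP) and R-precision for a single ranked list of concepts; the function names and data layout are ours, not taken from any of the cited systems:

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked concept list w.r.t. the set of
    manually assigned (relevant) concepts."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for i, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            precision_sum += hits / i  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision at rank R, with R the number of relevant concepts."""
    relevant = set(relevant)
    r = len(relevant)
    return sum(1 for c in ranked[:r] if c in relevant) / r if r else 0.0
```

Averaging `average_precision` over all test documents yields MAP.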


2.2 EuroVoc and EUR-Lex

In our research we perform multi-label text classification on multilingual EUR-Lex (i.e., the EU website disclosing European legislation) documents labeled with concepts from the EuroVoc thesaurus. Several other multi-label text classification studies [12, 3, 4, 19, 15] have been conducted using (subsets of) this data.

[12] used a data set of English EuroVoc-labeled EUR-Lex documents² and performed experiments with three perceptron-based classifiers. Their method uses a stop list together with a stemmer and applies tf-idf. Of approximately 200 000 unique tokens, the 5000 with the highest document frequency were kept to reduce the dimensionality. Their best results yielded a MAP of 0.53.

² http://www.ke.tu-darmstadt.de/resources/eurlex/eurlex.html; last retrieved on 2013-06-03. All following URLs were also last retrieved on this date.

[3] uses a method to split each multi-label document into smaller single-label documents (using pointwise mutual information), assuming that each part of the text belongs to only one concept. They use a lemmatizer and only keep informative words (nouns, verbs, etc.). They then use SVM on these single-labeled documents. The method is evaluated on the Italian Acquis data set and an F1@6³ score of 0.58 is obtained.

³ The F-measure with β = 1 at rank 6, i.e., taking the F1 score over the 6 highest ranked concepts.

[4] models the EuroVoc hierarchical structure (which also contains related terms and use/used-for relations) as a Bayesian network. This method can function without any training data, but is able to incorporate new training data to improve its performance. The data (used for testing, and optionally for training) is preprocessed with a stop list and stemmer, and of each document only two or three lines of text are used. Weights are calculated using a variant of tf-idf. The evaluation is carried out on parliamentary resolutions from the regional Parliament of Andalusia (Spain), which are labeled with EuroVoc concepts. Without training documents the best score was an F1@5 of 0.31; with training documents it was an F1@5 of 0.58.

JEX [19, 15] uses a custom approach based on the vector space model with a variant of tf-idf weights. Together with a stop list they obtain F1@6 scores of 0.44-0.54 on the 22 languages in the Acquis data set.

2.3 Parsimonious Language Models

[7] shows that parsimonious language models have fewer non-zero features than regular language models while still performing at least as well in the information retrieval tasks of indexing, retrieval and feedback. Another advantage is that this method automatically removes words which occur both too frequently (i.e., stop words) and too rarely. In [14] parsimonious language models are used to improve blind relevance feedback, i.e., extending a user's query with informative terms from the set of documents relevant to the query. [8] shows that parsimonious language models outperform regular language models on four large TREC data sets (TREC-8, Terabyte '04/'05/'06).

3 Theory & Method

In this section we describe the multi-label text classification problem. We recall two main implementations: the vector space model and language models. For both, we present the details of the specific implementations which we studied: JEX and parsimonious language modeling.

Multi-label Text Classification. Given is a set C of concepts (e.g., from the EuroVoc thesaurus) and a set Q of query documents (i.e., the documents that need to be labeled). Multi-label text classification is the task of assigning to documents q ∈ Q a ranked list of concepts c ∈ C. D is the set of documents which have previously been manually labeled with at least one concept c ∈ C and which will be used as training data for supervised learning.

Let V be the vocabulary of D, i.e., the set of all tokens in D. What is seen as a token depends on the way documents are modeled; tokens can be unigrams, bigrams, lemmatized unigrams, etc. We make a clear distinction between a real query document q or an actual concept term c (e.g., environmental impact) and their respective models q and c. Training documents d ∈ D are also turned into models, d, and used to create c. All these models are |V|-ary vectors, assigning to each t ∈ V a real-valued score (depending on the method used this will be a weight or a probability). Assigning a concept c to a document q can be seen as computing a score similarity(c, q) between their respective vectors. We thus study:

Multi-label text classification with concepts C: given a document q, for each concept c ∈ C, compute similarity(c, q) and rank the concepts accordingly.

Note that we do not model the task of deciding how many labels to give to a document. This task forces us to make the following modeling decisions:

1. How to represent concepts and documents?

2. How to estimate the concept models c and query document models q?

3. How to compute the similarity of models?

4. How to assign a ranked list of concepts to a document?

We now describe two approaches to this task: the vector space model (VSM) and language modeling.


3.1 Vector Space Model and JEX

In the VSM, c is a vector in R^|V| whose elements are term weights. One of the best known methods to calculate these weights is tf-idf. The standard way to calculate the similarity between a document model q and a concept model c is cosine similarity:

\[ similarity(c, q) = \frac{c \cdot q}{|c|\,|q|} \tag{1} \]
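As a concrete illustration of Equation 1, here is a minimal sketch of cosine similarity over sparse vectors stored as Python dicts; the dict layout is our illustration, not JEX's actual data structure:

```python
import math

def cosine_similarity(c, q):
    """Cosine similarity between two sparse term-weight vectors (Equation 1)."""
    # Dot product only over the tokens the two vectors share.
    dot = sum(w * q[t] for t, w in c.items() if t in q)
    norm_c = math.sqrt(sum(w * w for w in c.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_c * norm_q) if norm_c and norm_q else 0.0
```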

3.1.1 JEX

The JRC developed JEX⁴ [19], software which performs the described multi-label text classification task. By default, it is trained on the JRC-Acquis data set (see Section 4.2.1 for more details on this data set) and is thus able to classify documents with concepts from EuroVoc in 22 languages. Their method can be retrained on different data sets labeled with concepts from different taxonomies.

⁴ http://ipsc.jrc.ec.europa.eu/?id=60

JEX' method of estimating concept models and calculating similarity is the result of more than 1500 tests in which different formulas and parameters were combined and evaluated [15]. JEX is a variant of the VSM, which we will now briefly describe. For clarity, some parameters used in our description will be denoted P1, ..., Pn; the default values used by JEX are listed at the end in Table 1.

JEX models documents and concepts using unigram tokens. V denotes this set of tokens. For a document d and a token t ∈ V, tf(d(t)) denotes the frequency of t in d.

3.1.1.1 Concept Model Estimation. The estimation of a concept vector c can be split into two major steps. First, a document model is created for each document d ∈ D using log-likelihood. In the second step, c is estimated based on the set of models d whose documents d are labeled with the concept c. The formula used is a modified idf function (which they found to yield better results than tf-idf).

Step 1: Creating Document Models. Let d be a document. Its model d is a bit vector where d(t) = 1 if token t is deemed relevant to the document, as shown in Equation 2. Relevancy is determined using log-likelihood G² [9], in which the frequency of t in d is compared to the frequency of t in the complete corpus D.

\[ d(t) = \begin{cases} 1 & \text{if } G^2(t, d, D) \ge P_1 \\ 0 & \text{otherwise} \end{cases} \tag{2} \]


Parameter P1 is a threshold value. The formula for G²(t, d, D) is given in Appendix A.

Step 2: Creating Concept Models. To create a concept model we define Dc = {d | d is labeled with c}, i.e., the set of document models whose documents are labeled with concept c. For each token t ∈ V, c(t) is calculated using Equation 3. Note that the part with the log function is a variant of idf.

\[ c(t) = \begin{cases} \alpha \cdot \log\!\left( \dfrac{\max_t N_c}{P_3 \cdot N_c} + 1 \right) & \text{if } |\{d \in D_c \mid d(t) = 1\}| \ge P_2 \\ 0 & \text{otherwise} \end{cases} \tag{3} \]

\[ \alpha = \sum_{\{d \in D_c \mid tf(d(t)) \ge 1\}} \frac{1}{N_d} \tag{4} \]

The variables are:

• Nd = |{c ∈ C | d is labeled by c}|, the number of concepts used to label d.

• Nc = |{c ∈ C | ∃d ∈ Dc : d(t) = 1}|, the number of concepts c ∈ C for which there exists a document d labeled with c in which the token t was deemed relevant.

• max_t Nc, the maximum value of Nc over all tokens t.
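The following sketch puts Equations 3 and 4 together. The data layout (dicts keyed by document and concept ids, with the bit vectors of Equation 2 precomputed) and all names are our illustration, not JEX's actual code:

```python
import math
from collections import defaultdict

def jex_concept_model(concept, docs_by_concept, doc_bits, doc_tf,
                      n_concepts_per_doc, P2=4, P3=10):
    """Illustrative sketch of Equations 3 and 4.

    docs_by_concept:    concept -> list of document ids (D_c)
    doc_bits:           doc id  -> set of tokens with bit 1 (Equation 2)
    doc_tf:             doc id  -> {token: frequency}
    n_concepts_per_doc: doc id  -> N_d, the number of concepts labeling d
    """
    # N_c: for each token, the number of concepts having at least one
    # document in which the token was deemed relevant.
    n_c = defaultdict(int)
    for docs in docs_by_concept.values():
        for t in set().union(*(doc_bits[d] for d in docs)):
            n_c[t] += 1
    max_n_c = max(n_c.values())

    d_c = docs_by_concept[concept]
    model = {}
    for t in set().union(*(doc_bits[d] for d in d_c)):
        # Condition of Equation 3: t must be relevant in >= P2 documents of D_c.
        if sum(1 for d in d_c if t in doc_bits[d]) < P2:
            continue
        # Equation 4: alpha rewards tokens occurring in sparsely labeled documents.
        alpha = sum(1.0 / n_concepts_per_doc[d]
                    for d in d_c if doc_tf[d].get(t, 0) >= 1)
        model[t] = alpha * math.log(max_n_c / (P3 * n_c[t]) + 1)
    return model
```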

Optimizations and Parameter Values. We now present further details of the JEX approach. The values of the parameters are given in Table 1.

• If a stop list is available for the current language then it will be used to remove tokens from the documents and queries. Multi-word stop lists are also advised, as they remove frequently occurring phrases, e.g., Having regard to the opinion of the European Parliament.

• All elements in a concept vector c are set to 0 if the following condition is not met: |{d ∈ Dc | Σ_{t∈V} tf(d(t)) ≥ P5}| ≥ P4. In other words, there must be at least P4 documents labeled with concept c that each have at least P5 tokens. This is done to ensure that concept vectors are only trained if there is enough data.

• Elements c(t) with a weight below threshold P6 are set to 0.

• If a concept vector c has fewer than P7 non-zero elements, then these remaining non-zero elements are also set to 0, thereby effectively removing the concept vector.


Table 1: Parameter values used by JEX.

parameter  value  description
P1         5      The minimum required G² score.
P2         4      The minimum number of documents labeled with the same concept for which the document model's bit of token t is 1.
P3         10     Used to punish tokens which are used in too many concepts.
P4         4      The minimum number of documents per concept (see the bullets above).
P5         100    The minimum number of tokens per document (see the bullets above).
P6         2      The minimum weight for an element c(t) (see the bullets above).
P7         10     The minimum number of non-zero elements in c (see the bullets above).

3.1.1.2 Classification. Given a query document q we compute, for each c ∈ C, similarity(c, q), yielding a ranked list of concepts for this document. In JEX, similarity(c, q) is calculated as follows.

• A query document q is tokenized into unigram tokens and the elements in the query model q are set to token frequencies, i.e., q(t) = tf(q(t)).

• The similarity is calculated using the cosine similarity as shown in Equation 5. The similarity score is set to 0 if c and q share fewer than 4 tokens with a non-zero weight. This is done to ensure that the similarity of the concept and query vectors cannot be based on too few overlapping tokens with high weights.

\[ similarity(c, q) = \begin{cases} \dfrac{c \cdot q}{|c|\,|q|} & \text{if } |\{t \in V \mid c(t) \neq 0 \text{ and } q(t) \neq 0\}| \ge 4 \\ 0 & \text{otherwise} \end{cases} \tag{5} \]

• The concepts are ranked based on similarity(c, q) scores.
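A sketch of this classification step, under the same assumed data layout as in the earlier sketches (sparse dicts of weights and frequencies; our illustration):

```python
import math

def jex_rank(query_tf, concept_models, min_shared=4):
    """Sketch of JEX classification (Equation 5).

    query_tf:       {token: frequency} of the query document q
    concept_models: {concept: {token: weight}} as estimated above
    """
    norm_q = math.sqrt(sum(f * f for f in query_tf.values()))
    scores = {}
    for concept, c in concept_models.items():
        shared = [t for t in c if t in query_tf]
        if len(shared) < min_shared:
            scores[concept] = 0.0  # too few overlapping tokens (Equation 5)
            continue
        dot = sum(c[t] * query_tf[t] for t in shared)
        norm_c = math.sqrt(sum(w * w for w in c.values()))
        scores[concept] = dot / (norm_c * norm_q) if norm_c and norm_q else 0.0
    # Highest cosine similarity first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```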

3.2 Language Models and PLM

While the vector space model works with real-valued weights, language models estimate the probability of observing (or generating) a token in a particular document, so the values are in the range 0 to 1. The probability of a token t in document d can be estimated as shown in Equation 6, which is the maximum likelihood estimate. A language model of a document d can


now be seen as a |V|-ary vector d with elements P(t | d).

\[ d(t) = P(t \mid d) = \frac{tf(d(t))}{\sum_t tf(d(t))} \tag{6} \]

Tokens which do not occur in document d will get a probability of 0 and thus need to be smoothed. This can be done by mixing the document model with a general (or background) language model b (this kind of smoothing is also known as Jelinek-Mercer smoothing), which is estimated on the whole collection of documents as shown in Equation 7. The mixture, Equation 8, uses a parameter λ which controls the weight distribution between the background model and the document model.

\[ b(t) = P(t \mid D) = \frac{\sum_{d \in D} tf(d(t))}{\sum_{d \in D} \sum_t tf(d(t))} \tag{7} \]

\[ d(t) = (1 - \lambda)\, b(t) + \lambda\, d(t) \tag{8} \]

A language model q of a query q can be compared to a document language model d using the Kullback-Leibler divergence, as in Equation 9. It calculates the dissimilarity between the two models, so the ranking should be sorted from the lowest to the highest score.

\[ KL(q \,\|\, d) = \sum_{t \in V} q(t) \log \frac{q(t)}{d(t)} \tag{9} \]
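A minimal sketch of Equations 6-9, with documents represented as lists of tokens (the layout and names are our illustration):

```python
import math
from collections import Counter

def mle(tokens):
    """Maximum likelihood estimate of a document model (Equation 6)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def background_model(collection):
    """Background model estimated on the whole collection (Equation 7)."""
    counts = Counter()
    for tokens in collection:
        counts.update(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def smoothed(d, b, lam):
    """Jelinek-Mercer mixture of background and document model (Equation 8)."""
    return {t: (1 - lam) * p_b + lam * d.get(t, 0.0) for t, p_b in b.items()}

def kl_divergence(q, d):
    """KL divergence of Equation 9; lower means more similar.
    Assumes d is smoothed, so d[t] > 0 for every token of q."""
    return sum(p * math.log(p / d[t]) for t, p in q.items() if p > 0)
```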

3.2.1 Parsimonious Language Models

Our system uses parsimonious language models [7] to estimate the document models. PLMs are based on the mixture language model of Equation 8, but this removes the maximum likelihood estimate. In a PLM, expectation-maximization is therefore applied to re-estimate the maximum likelihood parameters, as shown in Equations 10 and 11. d and b are initialized using Equations 6 and 7, after which the E-step and M-step are applied to each token t until the values of d converge. Depending on the value of λ, some tokens in d will be assigned a probability of 0. These are tokens which occur frequently throughout the document collection (effectively an automatic stop list) as well as terms that are highly specific to the current document. Such a stop list is thus not only automatically tuned to each language, but also to the domain of the specific data set.

\[ \text{E-step:} \qquad e_t = tf(d(t)) \cdot \frac{\lambda\, d(t)}{(1 - \lambda)\, b(t) + \lambda\, d(t)} \tag{10} \]

\[ \text{M-step:} \qquad d(t) = \frac{e_t}{\sum_{t \in V} e_t} \tag{11} \]
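The EM re-estimation can be sketched as follows; the convergence test and the pruning threshold `eps` are our choices, as the thesis only specifies the E-step and M-step themselves:

```python
def parsimonize(tf, b, lam, max_iter=100, eps=1e-6):
    """Sketch of the parsimonious EM re-estimation (Equations 10 and 11).

    tf:  {token: frequency} of the document
    b:   background model (Equation 7)
    lam: lambda in (0, 1), the weight of the document model in the mixture
    """
    total = sum(tf.values())
    d = {t: n / total for t, n in tf.items()}  # MLE initialization (Equation 6)
    for _ in range(max_iter):
        # E-step (Equation 10): expected token counts under the document
        # component of the mixture.
        e = {t: tf[t] * lam * d[t] / ((1 - lam) * b[t] + lam * d[t])
             for t in d}
        norm = sum(e.values())
        new_d = {t: v / norm for t, v in e.items()}
        converged = max(abs(new_d[t] - d[t]) for t in d) < eps
        d = new_d
        if converged:
            break
    # Tokens whose probability has effectively vanished are dropped,
    # which is what makes the model parsimonious.
    return {t: p for t, p in d.items() if p > eps}
```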



3.2.1.1 Concept Model Estimation. In our case, the documents d ∈ D consist of the concatenation of all documents labeled with the same concept c. The number of documents in D is thus the same as the number of concepts |C|. When applied to Equations 6, 7, 10 and 11 this results in c = d. In other words, a concept model c holds the tokens representative of concept c in comparison to all other concepts.
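Building the concatenated pseudo-documents can be as simple as the following sketch (the data layout is our assumption):

```python
from collections import Counter

def concept_term_frequencies(labeled_docs):
    """Build one pseudo-document per concept by concatenating all documents
    labeled with that concept; only the term frequencies are kept.

    labeled_docs: iterable of (tokens, concepts) pairs.
    """
    tf_by_concept = {}
    for tokens, concepts in labeled_docs:
        for c in concepts:
            tf_by_concept.setdefault(c, Counter()).update(tokens)
    return tf_by_concept
```

Each resulting frequency table can then be passed to a routine like `parsimonize` above to obtain the concept model c.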

3.2.1.2 Classification. A parsimonious language model q is created for every query document q. This is achieved by substituting d and d with, respectively, q and q in Equations 6, 10 and 11, but leaving the background language model b unchanged (i.e., still based on D). The comparison of both models is done using cross-entropy, Equation 12, which results in a ranking equal to that of the Kullback-Leibler divergence [10]. The best matching models have a score closer to zero than the more dissimilar models.

\[ similarity(c, q) = H(q, c) = -\sum_{t \in V} q(t) \cdot \log(c(t)) \tag{12} \]
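A sketch of ranking by Equation 12. How query tokens missing from a (parsimonious, hence sparse) concept model are treated is an implementation detail not spelled out here, so this sketch simply skips them:

```python
import math

def rank_concepts(q, concept_models):
    """Rank concepts by cross-entropy H(q, c) (Equation 12);
    lowest score (closest to zero) first."""
    scores = {}
    for concept, c in concept_models.items():
        # Skip tokens of q absent from c; a full implementation must
        # decide how to handle them.
        scores[concept] = -sum(p * math.log(c[t])
                               for t, p in q.items() if t in c)
    return sorted(scores.items(), key=lambda kv: kv[1])
```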

3.2.1.3 Parameters. Our method has only two parameters which need to be optimized manually: the λ parameters used for estimating the concept models and the query models, which from now on we will refer to as λc and λq respectively.

4 Experimental Setup

This section starts with a description of the taxonomies and data sets, followed by the effects of preprocessing. It finishes with a comparison between the unigram/bigram PLM systems and JEX regarding the creation of concept models and processing speed.

4.1 Taxonomies

The taxonomies, EuroVoc and taxonomie beleidsagenda, are the controlled vocabularies holding the concepts which are used to label the documents in the Acquis and parliamentary questions data sets respectively.

4.1.1 EuroVoc

The EuroVoc thesaurus⁶ is developed by the Publications Office of the European Union⁷ with the main purpose of indexing documents produced by the institutions of the European Union. The controlled vocabulary focuses specifically on parliamentary documents. EuroVoc version 4.3 consists of 6797 hierarchically structured concepts translated into 24 European languages. The two highest levels in the hierarchy do not hold any actual concepts (i.e., they are not used to label documents) and are used to broadly categorize the concepts in the lower levels. The highest level, called fields, contains 21 subjects, shown in Table 2.

⁶ http://eurovoc.europa.eu/
⁷ http://publications.europa.eu/

Table 2: The 21 fields which make up the highest level of the EuroVoc hierarchy.

Politics; International relations; European communities; Law; Economics; Trade; Finance; Social questions; Education and communications; Science; Business and competition; Employment and working conditions; Transport; Environment; Agriculture, forestry and fisheries; Agri-foodstuffs; Production, technology and research; Energy; Industry; Geography; International organisations.

The second level, called microthesauri, divides the field subjects into 127 more detailed subjects. Beneath the microthesauri level there are up to 6 levels which contain the actual concepts. The number of concepts used at each level of the hierarchy is shown in Table 3.

Table 3: The number of concepts at each level of the hierarchy in EuroVoc. Level 1 is the first level beneath the microthesauri level.

level      1    2     3     4    5   6
concepts   529  4202  2186  426  33  8

The observant reader will have noticed that the sum of all the concepts in the 6 levels (7384) exceeds the previously mentioned 6797 concepts. This is correct, as EuroVoc is polyhierarchical, though it tries to limit the kinds of concepts which can be used multiple times throughout the hierarchy (e.g., country name concepts). 236 concepts occur more than once in the hierarchy.


4.1.2 Taxonomie Beleidsagenda

The taxonomie beleidsagenda⁸ is a Dutch taxonomy created by the Dutch government to categorize its documents into broad policy subjects. The taxonomy is hierarchical and consists of 2 levels. The top level contains 17 main policy subjects (shown in Table 4), each containing between 4 and 10 children, resulting in 111 concepts on the second level. Only these 111 second-level concepts are used to label the documents.

Table 4: The 17 main subjects of the taxonomie beleidsagenda, translated into English.

Governance; Culture and recreation; Economy; Finance; Housing; International; Agriculture; Migration and integration; Nature and environment; Education and science; Public order and safety; Law; Spatial planning and infrastructure; Social security; Traffic; Work; Health care.

4.2 Data Sets

Our experiments make use of two data sets, Acquis and parliamentary questions. Documents in these data sets have been manually labeled with the EuroVoc and taxonomie beleidsagenda taxonomies respectively.

4.2.1 Acquis

The JRC releases the JRC-Acquis⁹ [18] data set, which contains European legislation distributed over the years 1952 to 2011, as shown in Figure 1. The data set offers documents translated into 22 European languages, all manually labeled with EuroVoc concepts. The JRC used the JRC-Acquis data set together with more labeled legal texts from EUR-Lex¹⁰ to train JEX (this data set is released as part of the advanced version of the JEX software¹¹). It is this data set that we use for some of our experiments, and we simply refer to it as the Acquis data set.

⁸ http://standaarden.overheid.nl/owms/3.5/doc/waardelijsten/overheid.taxonomiebeleidsagenda
⁹ http://ipsc.jrc.ec.europa.eu/index.php?id=198
¹⁰ Website used for distribution of EU legislation: http://eur-lex.europa.eu/
¹¹ http://ipsc.jrc.ec.europa.eu/index.php?id=60#c2693


[Figure 1: Bar plot with the number of Dutch Acquis documents published in each year (x-axis: year, y-axis: number of documents).]

The number of documents per language, shown in Table 5, varies, as more recent members of the EU did not have obsolete documents translated into their languages. There are 46 008 unique documents, of which 17 520 occur in each of the languages. The average number of tokens per document (the tokenization process is explained in Section 4.3) differs widely, which is probably due to the differing grammar of the languages. Each document is labeled with at least 1 concept, and with 5.6 concepts on average. A few languages have a handful of documents with several hundred concepts, which affects their variance, but not the median, which is 6 for each language.

4.2.2 Dutch Parliamentary Questions

This data set was obtained from http://overheid.nl, where the Dutch government publishes, amongst others, its parliamentary documents in XML format. It consists of 39 637 questions asked by members of the Dutch national parliament, together with the corresponding answers by ministers and state secretaries, distributed over the years 1995 to 2011 as shown in Figure 2. The documents are manually labeled with concepts from the beleidsagenda taxonomy. 733 documents did not have a concept and were removed. The statistics for the remaining 38 904 documents are shown in Table 6. Compared to the Acquis data set there are roughly half as many tokens per document, fewer concepts per document, and a lower variance in the number of concepts per document.

[Figure 2: Bar plot with the number of parliamentary questions published in each year (x-axis: year, y-axis: number of documents).]


Table 5: The statistics for each language in the Acquis data set. All documents have at least 1 concept and the median is 6 for each language. The last three columns describe the concepts per document.

lang.  # documents  avg. tokens  unique tokens  avg. concepts  variance  maximum
bg     22 844       1405         184 286        5.8            63.4      636
cs     22 831       1202         196 002        5.7            4.1       102
da     41 731       1149         346 498        5.4            3.2       102
de     41 889       1216         397 713        5.4            3.3       102
el     41 842       1384         281 108        5.4            3.3       102
en     41 877       1306         127 530        5.4            3.1       102
es     41 564       1470         162 506        5.4            3.2       102
et     22 925       977          356 319        5.7            4.0       102
fi     38 566       921          560 267        5.5            3.2       102
fr     42 094       1473         142 609        5.4            2.8       75
hu     22 933       1172         322 440        5.7            4.2       102
it     41 923       1362         161 195        5.4            3.2       102
lt     22 952       1106         215 943        5.7            4.1       102
lv     22 983       1101         197 523        5.6            4.0       102
mt     20 052       1521         175 401        6.2            121.7     1018
nl     41 929       1327         253 890        5.4            3.2       102
pl     23 121       1208         193 443        5.7            21.5      640
pt     41 239       1426         162 640        5.4            3.2       102
ro     25 251       1897         331 379        5.7            45.6      822
sk     22 690       1224         212 721        5.7            4.1       102
sl     22 346       1221         183 403        5.7            3.9       102
sv     38 328       1227         343 473        5.5            3.2       102
avg.   31 996       1286         250 377        5.6            14        224

Table 6: The statistics for the parliamentary questions data set. All documents have at least 1 concept and the median is 2. The last three columns describe the concepts per document.

# documents  avg. tokens  avg. concepts  variance  maximum
39 637       653          2.0            1.0       9

4.3 Data Preparation

The raw data sets need to be preprocessed before either classification system can start to estimate the concept models. First, the documents were tokenized. In the PLM system all punctuation, numbers and tags were removed; only alphanumeric words, together with the dash, were allowed. The resulting tokens were used as unigrams. Some experiments were carried out using bigrams and lemmatization. Bigrams were created through the combination of two sequential unigrams. For Dutch texts, lemmatization was done using Frog (http://ilk.uvt.nl/frog/).
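A rough sketch of this tokenization; the exact rules used in the thesis code are not given, so the regular expression below is an approximation:

```python
import re

WORD_RE = re.compile(r"[\w-]+", re.UNICODE)

def unigrams(text):
    """Keep alphanumeric words (the dash is allowed); drop punctuation,
    markup and bare numbers. Approximation of the thesis tokenizer."""
    return [t for t in WORD_RE.findall(text.lower())
            if any(ch.isalpha() for ch in t)]

def bigrams(tokens):
    """A bigram token is the combination of two sequential unigrams."""
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]
```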

Here, we first show the effect of creating Dc, i.e., the set of documents labeled with the same concept c. A document labeled with more than one concept (e.g., climate change and extraction of oil) will thus occur multiple times in Dc (e.g., once in the dc with concept c being climate change and once in the dc with c being extraction of oil). Both the JEX and PLM systems make use of Dc (though in different ways). This step is followed by depicting the effect of applying a threshold used by JEX, which we also applied to the PLM system in order to level the playing field. The threshold: a concept model is only trained if there are at least 4 documents labeled with that concept which each have at least 100 tokens. A sketch of this threshold is given below.
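The threshold can be sketched as follows (the data layout, a dict from concept to the token lists of its labeled documents, is our assumption):

```python
def apply_training_threshold(docs_by_concept, min_docs=4, min_tokens=100):
    """Keep a concept only if at least `min_docs` of its documents have at
    least `min_tokens` tokens each (the JEX threshold described above)."""
    return {concept: docs
            for concept, docs in docs_by_concept.items()
            if sum(1 for d in docs if len(d) >= min_tokens) >= min_docs}
```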

We now show the effects of these preprocessing steps on both data sets. Note that these statistics depict the full data sets, while the data sets are split into training and test sets when used in the experiments. This affects the statistics; for example, the number of documents per concept will be lower.

4.3.1 Acquis

Table 7 shows the effect of creating Dc on the Acquis data sets. The more recent EU member states have their documents labeled with roughly 3700 unique EuroVoc concepts (out of 6797), while the older member states have around 4200 unique concepts. The total number of documents increased nearly sixfold, which is expected, as the documents are on average labeled with 5.6 concepts. The average number of tokens per document has gone up from 1286 to 1348, which means that the most reused documents are documents with many tokens. This makes sense, as long documents are more likely to address multiple topics and are therefore labeled with more concepts.

Table 7: Statistics of Dc for all languages in the Acquis data set.

lang.  # unique concepts  # docs    avg. docs per concept  avg. tokens
bg     3784               132 622   35                     1494
cs     3691               129 113   35                     1249
da     4226               226 895   54                     1204
de     4235               227 825   54                     1274
el     4226               227 611   54                     1458
en     4232               227 298   54                     1369
es     4224               226 238   54                     1539
et     3697               129 575   35                     1019
fi     4109               211 833   52                     964
fr     4236               227 967   54                     1546
hu     3698               129 709   35                     1221
it     4235               227 819   54                     1427
lt     3699               129 751   35                     1150
lv     3701               129 839   35                     1153
mt     3681               124 990   34                     1654
nl     4234               227 803   54                     1395
pl     3699               131 510   36                     1260
pt     4210               224 694   53                     1498
ro     3889               143 641   37                     1949
sk     3686               128 351   35                     1280
sl     3685               126 863   34                     1266
sv     4112               211 109   51                     1277
avg.   3963               177 412   44                     1348

Table 8 shows the statistics of the data sets after applying the mentioned threshold, where each dc ∈ Dc needs to consist of at least 4 documents with at least 100 tokens. On average, each language data set contained 973 documents with fewer than 100 tokens. On average 1298 concepts did not meet the threshold and were thus removed, leaving on average 2665 concepts with enough data. Accordingly, the total number of documents shows a slight decrease, while the average number of documents per concept increases strongly. The average number of tokens per document did not change much.

Table 8: Statistics for all languages in the Acquis data set with the threshold of at least 4 documents with at least 100 tokens per concept applied to Dc.

lang.  # unique concepts  # docs    avg. docs per concept  avg. tokens
bg     2460               129 930   53                     1493
cs     2398               126 025   53                     1251
da     2932               223 611   76                     1202
de     2942               224 172   76                     1275
el     2927               222 726   76                     1467
en     2932               224 195   76                     1367
es     2927               223 167   76                     1537
et     2399               126 229   53                     1022
fi     2804               207 461   74                     968
fr     2945               225 111   76                     1542
hu     2404               127 051   53                     1218
it     2945               224 756   76                     1425
lt     2404               126 851   53                     1150
lv     2408               127 030   53                     1152
mt     2393               122 636   51                     1649
nl     2938               224 762   77                     1393
pl     2412               128 688   53                     1259
pt     2915               221 735   76                     1495
ro     2555               140 942   55                     1949
sk     2398               125 370   52                     1282
sl     2386               121 590   51                     1289
sv     2812               207 998   74                     1276
avg.   2665               174 183   64                     1348

A more detailed view of the Dutch Acquis data set is given in Figure 3. It shows a bar plot depicting the number of documents labeled with each EuroVoc concept. The plot is sorted by number of documents, so on the

left are concepts with few documents and on the right are concepts with many. The graph shows a linear or steeper increase of documents per concept (the y-axis is in logarithmic scale), so the number of documents per concept is quite unbalanced. The most used concept, import, was used to label 4054 documents. The dashed line marks the threshold below which lie 1288 concepts with fewer than 4 documents (note that more documents are removed by the preprocessing threshold, as some concepts might have, e.g., 6 documents of which only 2 have more than 100 tokens).

[Figure 3: Bar plot with the number of documents for each of the 4234 EuroVoc concepts in the Dutch Acquis data set, sorted from the concepts with the fewest documents to those with the most. The dashed line shows the threshold of at least 4 documents. The y-axis is in logarithmic scale.]

4.3.2 Dutch Parliamentary Questions

Table 9 shows the effect of creating Dc on the parliamentary questions data set. 110 out of the 111 available concepts are used. The changes in the data set (e.g., the increase in average tokens) are similar to the changes in the Acquis data set. The biggest difference between the two data sets is that there are fewer concepts in the parliamentary questions data set and consequently a higher average number of documents per concept.

Table 9: Statistics for Dc in the parliamentary questions data set.

# unique concepts  # docs   avg. docs per concept  avg. tokens
110                76 672   697                    659

Table 10 shows the statistics after application of the threshold. The effect is minor, as only one concept has fewer than 4 documents with 100 tokens. Compared to the Acquis data set, this data set makes a more even use of its (fewer) concepts, as shown in Figure 4 (note that this figure's y-axis is not in logarithmic scale, unlike that of Figure 3). The most used concept is constitutional law.

Table 10: Statistics for the parliamentary questions data set with the threshold of at least 4 documents with at least 100 tokens per concept applied to Dc.

# unique concepts  # docs   avg. docs per concept  avg. tokens
109                76 666   703                    659


[Figure 4: Bar plot with the number of documents for each of the 110 taxonomie beleidsagenda concepts in the parliamentary questions data set, sorted from the concepts with the fewest documents to those with the most. The dashed line shows the threshold of at least 4 documents.]

4.4 Implementation

In this section we describe the effect of implementing the theory behind the JEX and PLM systems, as described in Section 3, by giving a view of trained concept models and the processing speeds of the systems.

4.4.1 Concept Models

The concept models are at the heart of the classification systems. Here we give an impression of what these concept models look like using several tables and word clouds. In all cases, the concept model used belongs to the concept health control and is trained on the first chronological 90% of the English Acquis data set. Table 11 shows the effect of different λc values on the creation of a concept model using the PLM system together with unigram tokens. For each concept model it shows the 20 tokens with the highest probabilities. As a comparison, the 20 tokens with the highest weight in the JEX approach are listed as well. Table 12 depicts the top 20 tokens from the PLM system with bigram tokens for different λc values, together with the unigram PLM λc = 0.05 top 20 (to allow for easy comparison to one of the better performing unigram PLM systems). Figures 5, 6 and 7 show word clouds of the 100 highest weight/probability tokens in the JEX, unigram PLM λc = 0.05 and bigram PLM λc = 0.3 systems. These word clouds offer a more intuitive comparison of the weight/probability distribution between the highest scoring tokens in each system.

Viewing the tables and word clouds makes it clear that the concept health control is often used to label documents related to animal health and diseases. Furthermore, high λc values result in too general PLM concept models, as common stop words like the, of and in are considered to be relevant. The lower the λc value, the more specific the tokens for this concept are compared to all other concepts. Too low λc values result in too specific tokens, which leads to overfitting.

Table 19 in Appendix B shows the average number of non-zero tokens per concept model for each language. JEX' concept models (on average across all languages 222 non-zero tokens per concept model) are one order of magnitude smaller than the unigram PLM's concept models (average 4605 non-zero tokens), which in turn are one order of magnitude smaller than the bigram PLM's concept models (average 22 589 non-zero tokens).

Table 11: The 20 tokens with highest weight/probability for JEX and varying λc unigram PLM concept models, for concept health control. The models are trained on the first chronological 90% of the English Acquis data set. Each column pair gives a token and its weight (JEX) or probability (PLM).

rank | JEX | PLM λc = 0.001 | PLM λc = 0.01 | PLM λc = 0.05 | PLM λc = 0.1 | PLM λc = 0.9
1 veterinary 1.09e-01 poultry 4.06e-02 animal 4.58e-02 decision 4.09e-02 decision 3.94e-02 the 8.39e-02
2 swine 1.02e-01 swine 3.16e-02 health 4.00e-02 health 3.18e-02 health 2.39e-02 of 5.36e-02
3 intra-community 9.51e-02 equidae 2.01e-02 meat 3.46e-02 animal 3.02e-02 directive 2.32e-02 in 3.34e-02
4 animal 9.41e-02 pigs 1.92e-02 veterinary 3.37e-02 animals 2.42e-02 animal 2.22e-02 to 3.09e-02
5 health 9.23e-02 semen 1.86e-02 animals 3.23e-02 meat 2.24e-02 animals 1.80e-02 and 2.95e-02
6 disease 9.17e-02 influenza 1.72e-02 disease 2.24e-02 veterinary 2.17e-02 eec 1.76e-02 for 1.63e-02
7 infectious 8.86e-02 bivalve 1.61e-02 poultry 1.99e-02 directive 2.01e-02 meat 1.64e-02 article 1.31e-02
8 bovine 8.75e-02 avian 1.61e-02 swine 1.80e-02 disease 1.43e-02 veterinary 1.59e-02 a 1.24e-02
9 molluscs 8.74e-02 fever 1.60e-02 establishments 1.55e-02 approved 1.14e-02 annex 1.18e-02 be 1.17e-02
10 hygiene 8.72e-02 molluscs 1.59e-02 fresh 1.49e-02 poultry 1.09e-02 disease 1.04e-02 decision 1.16e-02
11 bivalve 8.64e-02 vhs 1.34e-02 bovine 1.47e-02 swine 1.01e-02 products 9.62e-03 by 1.06e-02
12 meat 8.64e-02 ovine 1.24e-02 pigs 1.37e-02 fresh 9.95e-03 approved 9.34e-03 with 1.00e-02
13 diseases 8.61e-02 veterinarian 1.22e-02 fishery 1.11e-02 bovine 9.92e-03 third 9.06e-03 shall 9.41e-03
14 animals 8.45e-02 caprine 1.14e-02 fever 1.08e-02 establishments 9.45e-03 down 8.86e-03 ec 8.40e-03
15 epidemiological 8.27e-02 ihn 1.14e-02 influenza 9.48e-03 list 8.30e-03 poultry 7.74e-03 this 8.27e-03
16 vhs 8.24e-02 tse 1.13e-02 diseases 9.03e-03 pigs 7.89e-03 fresh 7.32e-03 from 7.91e-03
17 viral 8.20e-02 establishments 9.89e-03 semen 8.90e-03 eec 7.84e-03 bovine 7.31e-03 eec 7.85e-03
18 septicaemia 8.19e-02 infected 9.61e-03 avian 8.83e-03 third 7.65e-03 swine 7.22e-03 directive 7.83e-03
19 necrosis 8.11e-02 hatching 8.73e-03 eradication 8.23e-03 fishery 7.57e-03 list 7.19e-03 on 7.80e-03
20 haemorrhagic 8.10e-02 blood 8.12e-03 equidae 7.97e-03 standing 7.38e-03 amended 7.11e-03 or 7.54e-03

Table 12: The 20 tokens with the highest probability for the unigram PLM λc = 0.05 and varying λc bigram PLM concept models, for concept health control. The models are trained on the first chronological 90% of the English Acquis data set. Each column pair gives a token and its probability.

rank | uni PLM λc = 0.05 | bi PLM λc = 0.001 | bi PLM λc = 0.1 | bi PLM λc = 0.3 | bi PLM λc = 0.5 | bi PLM λc = 0.9
1 decision 4.09e-02 approved zone 3.60e-03 directive eec 7.76e-03 directive eec 7.23e-03 directive eec 6.07e-03 of the 1.38e-02
2 health 3.18e-02 non approved 3.06e-03 this decision 6.98e-03 this decision 6.93e-03 this decision 5.88e-03 to the 7.52e-03
3 animal 3.02e-02 approved zones 2.54e-03 decision ec 6.90e-03 decision ec 5.74e-03 member states 4.89e-03 in the 5.38e-03
4 animals 2.42e-02 identification document 2.14e-03 animal health 6.12e-03 animal health 4.11e-03 decision ec 4.71e-03 member states 5.18e-03
5 meat 2.24e-02 approved farm 2.06e-03 commission decision 4.57e-03 in accordance 3.98e-03 in accordance 4.56e-03 for the 5.10e-03
6 veterinary 2.17e-02 ihn and 2.00e-03 council directive 4.20e-03 accordance with 3.98e-03 accordance with 4.55e-03 directive eec 4.69e-03
7 directive 2.01e-02 vhs and 1.92e-03 decision eec 3.37e-03 commission decision 3.84e-03 to the 4.48e-03 this decision 4.59e-03
8 disease 1.43e-02 approved farms 1.74e-03 fresh meat 3.29e-03 council directive 3.65e-03 of the 4.14e-03 in accordance 4.36e-03
9 approved 1.14e-02 fish farms 1.73e-03 third countries 3.24e-03 member states 3.39e-03 with the 3.62e-03 accordance with 4.36e-03
10 poultry 1.09e-02 bivalve molluscs 1.71e-03 eec of 3.16e-03 third countries 3.12e-03 for the 3.53e-03 with the 4.13e-03
11 swine 1.01e-02 haemorrhagic septicaemia 1.61e-03 the standing 2.93e-03 the competent 2.72e-03 regard to 3.43e-03 the commission 3.86e-03
12 fresh 9.95e-03 diagnostic manual 1.54e-03 competent authority 2.88e-03 eec of 2.69e-03 animal health 3.19e-03 on the 3.70e-03
13 bovine 9.92e-03 viral haemorrhagic 1.53e-03 fishery products 2.66e-03 decision eec 2.50e-03 commission decision 3.15e-03 regard to 3.61e-03
14 establishments 9.45e-03 marine gastropods 1.44e-03 animals and 2.49e-03 laid down 2.48e-03 council directive 3.02e-03 decision ec 3.56e-03
15 list 8.30e-03 equine animal 1.35e-03 health conditions 2.48e-03 regard to 2.42e-03 having regard 2.87e-03 the european 3.54e-03
16 pigs 7.89e-03 septicaemia vhs 1.32e-03 decision article 2.44e-03 competent authority 2.41e-03 the member 2.79e-03 by the 3.49e-03
17 eec 7.84e-03 and ihn 1.31e-03 the animal 2.33e-03 the measures 2.22e-03 third countries 2.64e-03 shall be 3.46e-03
18 third 7.65e-03 the diagnostic 1.27e-03 swine fever 2.32e-03 provided for 2.22e-03 laid down 2.58e-03 having regard 3.05e-03
19 fishery 7.57e-03 necrosis ihn 1.24e-03 of animal 2.30e-03 decision article 2.19e-03 the competent 2.47e-03 the community 3.04e-03


Figure 5: Word cloud of the 100 highest weight tokens in JEX' health control concept model.

Figure 6: Word cloud of the 100 highest probability tokens in the unigram PLM's health control concept model, with λc = 0.05.

Figure 7: Word cloud of the 100 highest probability tokens in the bigram PLM's health control concept model, with λc = 0.3.


4.4.2 Efficiency

Table 13 shows the speed of training and classification on the Dutch Acquis data set for the JEX and unigram/bigram PLM systems. The times are measured in seconds and are obtained on a server with 12 cores running at 2.80 GHz and 24 GB of memory. The PLM methods were programmed to make use of all 12 cores, while JEX utilized only 1 core. The training and classification times differ an order of magnitude between JEX and the unigram PLM system, and another order of magnitude between the unigram and bigram PLM systems (note that the PLM systems have not been optimized for speed). This correlates with the average number of non-zero tokens per concept model (as shown in Appendix B).

Table 13: Training and classification speeds of the JEX and PLM recommendation systems on the Dutch Acquis data set. Times reported in seconds.

             training (37 736 docs.)       classification (4193 docs.)
system       preprocessing   training      preprocessing   classification
JEX                    123         34                 14              205
unigram PLM            165        673                  1             4308
bigram PLM             233       6445                 21           46 342

5 Experiments and Results

Unless stated otherwise, experiments are carried out by training on the first chronological 90% of the data and testing on the remaining 10%. This mimics the way the text classification systems are used in practice, as they are trained on older texts and used to classify new texts (more on this in Section 5.3.1).
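As a minimal sketch, the chronological split can be expressed as follows (the document list and dates are illustrative, not taken from the thesis code):

    from datetime import date

    # Illustrative (date, document id) pairs; in the experiments this
    # would be the full Acquis or parliamentary questions data set.
    docs = [(date(1992, 3, 1), "doc-a"), (date(1997, 5, 4), "doc-b"),
            (date(2001, 7, 9), "doc-c"), (date(2004, 1, 2), "doc-d")]

    docs.sort(key=lambda pair: pair[0])          # oldest first
    cutoff = int(0.9 * len(docs))                # train on the oldest 90%
    train, test = docs[:cutoff], docs[cutoff:]   # test on the newest 10%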

The classification systems (JEX and PLM) output a ranked list of concepts given a query document from the test set. The original manually applied concepts (i.e., the relevant concepts) for each of these documents are known, which allows us to automatically measure the effectiveness of the classification. This is done using the information retrieval measures R-precision (Rprec) and mean average precision (MAP) [13]. R-precision is the precision at k, with k being the number of concepts with which the classified document has been manually labeled (e.g., if a document has been manually labeled with 6 concepts, then the precision at 6 will be used for this particular document). MAP is the average of several precision at k values, with the k's being the ranks of each relevant concept (e.g., if there are 3 relevant concepts which are returned at ranks 2, 3 and 5, then MAP is the average of the precision at k for k's 2, 3 and 5). This score is calculated for each document in the test set and then averaged, hence its name mean average precision. The reported R-precision scores are also the average over all documents' R-precision scores in the test set. These measures were chosen as they both take the ranking of the concepts into account. The higher the ranking of the relevant concepts, the higher the score.
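A minimal sketch of the two measures as described above (function and variable names are illustrative, not taken from the thesis code):

    def r_precision(ranked, relevant):
        # Precision at k, with k the number of manually assigned concepts.
        k = len(relevant)
        return len(set(ranked[:k]) & relevant) / k

    def average_precision(ranked, relevant):
        # Average of the precision at k over the ranks k of the relevant
        # concepts.
        hits, precisions = 0, []
        for k, concept in enumerate(ranked, start=1):
            if concept in relevant:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(relevant)

    # The worked example from the text: 3 relevant concepts returned at
    # ranks 2, 3 and 5 yield precisions 1/2, 2/3 and 3/5.
    ranked = ["c9", "c1", "c2", "c8", "c3"]
    relevant = {"c1", "c2", "c3"}
    print(r_precision(ranked, relevant))        # 2/3
    print(average_precision(ranked, relevant))  # (1/2 + 2/3 + 3/5) / 3 = 0.589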

The inter-rater reliability should be taken into account when interpreting these scores. When two human documentalists manually label a document, they will not always use the same concepts. This influences the maximum obtainable score for the automated classifiers. For example, [15] conducted an experiment in which they let documentalists blindly review the EuroVoc concepts assigned to European legislation labeled by their human colleagues and by the computer classifier. They found that the agreement between the humans was 78% in English and 87% in Spanish. It can thus be argued that perfect scores cannot be obtained.

To make the comparison with JEX as precise as possible we applied one of their default settings to our experiments, namely the requirement of at least 4 documents with at least 100 tokens labeled with a concept in order for that concept to be used.

The unigram and bigram PLM systems use λc = 0.05 and λc = 0.3 respectively, and both use λq = 0.5 unless noted otherwise. JEX' parameters were all kept at their default values as denoted in Section 3.1.1.

Some experiments had their scores tested for significance using two-tailed paired t-tests (unless stated otherwise). Significant improvements over the baseline at p < 0.01 and p < 0.05 are denoted ▲ and △ respectively; significant declines are denoted ▼ and ▽.
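A minimal sketch of such a test, assuming per-document score arrays for the same test documents (the numbers are made up for illustration):

    import numpy as np
    from scipy.stats import ttest_rel

    # Hypothetical per-document scores for a baseline and a compared system.
    baseline = np.array([0.52, 0.61, 0.47, 0.58, 0.55, 0.49])
    system = np.array([0.57, 0.63, 0.50, 0.62, 0.54, 0.53])

    t_stat, p_value = ttest_rel(system, baseline)  # two-tailed, paired
    if p_value < 0.01:
        marker = "▲" if t_stat > 0 else "▼"
    elif p_value < 0.05:
        marker = "△" if t_stat > 0 else "▽"
    else:
        marker = ""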

We will now present the experiments and their results, structured by research question.

5.1 Can a PLM Classification System Outperform JEX in the Effectiveness of Its Classifications?

In this section the effectiveness of the unigram PLM, bigram PLM and JEX classification systems is compared on the Acquis and parliamentary questions data sets. The unigram PLM system uses λq = 0.005 in these experiments.

5.1.1 Acquis Experiment

The Acquis data set features 22 languages. We managed to compare 19 languages, as there are, as yet unresolved, issues with getting the Greek, Bulgarian and Romanian data sets to work with the PLM systems. Table 14 shows the results of the JEX, unigram PLM and bigram PLM systems when trained and evaluated on these 19 languages. The unigram PLM system significantly outperforms JEX in half of the scores and performs significantly worse or similar in the other half. Almost all of the bigram PLM system's scores are significantly better than JEX's, with the rest being similar.

The results vary between languages, as the data sets, though quite similar, do differ from each other (as became clear in Section 4). Also, the parameters used, which are the same for each language within a system, might be more beneficial for one language than for another. While the scores between languages can differ significantly, they do fall within a certain range of each other. The lowest (fi) and highest (hu) scoring languages in JEX differ by 0.0465 in Rprec and 0.0356 in MAP. For the unigram PLM system the difference between the lowest (en) and highest (et) scoring languages is larger, with 0.0616 in Rprec and 0.0705 in MAP. The difference in the bigram PLM system between the lowest (fr) and highest (sk) scoring languages is the smallest, with 0.0232 in Rprec and 0.0337 in MAP.

Table 14: The R-precision and mean average precision scores for the JEX, unigram PLM and bigram PLM systems for 19 languages from the Acquis data set. Unigram PLM λq = 0.005.

       JEX (baseline)     unigram PLM          bigram PLM
lang.  Rprec    MAP       Rprec     MAP        Rprec     MAP
cs     0.5532   0.5924    0.5821▲   0.6049▲    0.5865▲   0.6132▲
da     0.5532   0.5797    0.5598△   0.5873▲    0.5652▲   0.5879▲
de     0.5482   0.5746    0.5632▲   0.5829▲    0.5771▲   0.5840▲
en     0.5552   0.5808    0.5405▼   0.5538▼    0.5732▲   0.5834
es     0.5612   0.5831    0.5454▼   0.5629▼    0.5704▲   0.5858
et     0.5592   0.5900    0.6021▲   0.6243▲    0.5947▲   0.6102▲
fi     0.5338   0.5710    0.5712▲   0.6043▲    0.5835▲   0.6012▲
fr     0.5514   0.5798    0.5434▼   0.5582▼    0.5691▲   0.5836
hu     0.5803   0.6066    0.5781    0.6112     0.5955▲   0.6099
it     0.5528   0.5713    0.5436▼   0.5595▼    0.5698▲   0.5871▲
lt     0.5478   0.5981    0.5730▲   0.6022     0.5855▲   0.6111▲
lv     0.5421   0.5870    0.5668▲   0.6002▲    0.5872▲   0.6143▲
mt     0.5528   0.5972    0.5683▲   0.5948     0.5909▲   0.6033
nl     0.5527   0.5770    0.5576    0.5762     0.5673▲   0.5906▲
pl     0.5529   0.5925    0.5691▲   0.5971     0.5797▲   0.6056▲
pt     0.5564   0.5815    0.5448▼   0.5572▼    0.5738▲   0.5875△
sk     0.5452   0.5758    0.5839▲   0.6132▲    0.5923▲   0.6173▲
sl     0.5521   0.5901    0.5867▲   0.6136▲    0.5864▲   0.6162▲
sv     0.5533   0.5855    0.5568    0.5886     0.5795▲   0.5983▲


5.1.2 Parliamentary Questions Experiment

We trained and evaluated JEX and the PLM systems on the parliamentary questions data set in order to compare their performance on a data set other than Acquis. For this experiment the JEX and PLM systems used the same parameters as in the Acquis experiment (including λq = 0.005 for the unigram PLM system). Table 15 shows the scores obtained by JEX and the unigram/bigram PLM systems. The unigram PLM system significantly outperforms JEX, and the bigram PLM system obtained even higher scores. The gap between the R-precision and MAP scores is larger compared to the Acquis scores, with the R-precision being lower across all systems and the MAP scores being higher in the case of the PLM systems. This is due to the lower average number of concepts (2.0 instead of 5.6) used to label the documents in the parliamentary questions data set. E.g., an R-precision of 0 is given if a document was manually labeled with 1 concept and the classifier returns this concept at rank 2, while the MAP score would be 0.5. If the classifier returned the correct concept at rank 1, then both measures would return a value of 1. This is thus an all-or-nothing situation for the R-precision measure, while the MAP measure only ends up with a score close to 0 if the relevant concept is ranked near the bottom. A higher average number of relevant concepts per document results in the R-precision becoming a more fine-grained measure like MAP.

Table 15: The R-precision and mean average precision scores for the JEX, unigram PLM and bigram PLM systems trained on the parliamentary questions data set. Unigram PLM λq = 0.005.

system            Rprec     MAP
JEX (baseline)    0.4120    0.5491
unigram PLM       0.4807▲   0.6197▲
bigram PLM        0.5175▲   0.6436▲

5.2 What Are the Effects of Different Document Representations on the PLM Method?

We experimented with lemmatized and bigram representations of the documents. Here we detail the effects of these representations.

5.2.1 Does Lemmatization Increase the Classification Scores?

We tested the influence of lemmatization on the unigram PLM system. Lemmatization maps different inflected forms of a word back to one version, e.g., the words walked, walking and walks would all be replaced by walk. [15] reports that JEX' F1 score increased by 0.02 points when lemmatizing the data. While lemmatization requires a language-dependent tool, we wanted to see its effect on our system. Given that JEX' scores improved, we expected ours to improve as well.

We used the Dutch Acquis data set which, after lemmatization, had its number of unique tokens (i.e., the vocabulary V) decreased by 21.75%, from 253 890 to 198 666. As shown in Figure 8, the lemmatized scores were lower than the normal unigram PLM scores. An explanation for these results might be that the PLM method works best with more distinguishing data. In other words, lemmatizing the data removes information which the PLM method might actually use to better distinguish the concept models from each other.
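For illustration, the English example above can be reproduced with NLTK's WordNet lemmatizer (an analogous sketch only; the Dutch experiments used a Dutch lemmatizer, which is not shown here):

    # Requires NLTK with the WordNet data downloaded
    # (nltk.download("wordnet")).
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(w, pos="v")
           for w in ["walked", "walking", "walks"]])
    # ['walk', 'walk', 'walk']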


Figure 8: Plots of the R-precision and mean average precision scores for the unigram PLM and lemmatized unigram PLM systems with differing λc values. The x-axis is in logarithmic scale.

5.2.2 Do Bigram Models Increase the Classification Scores?

The PLM system’s semantic knowledge of the text is based on statisticalevaluation of word occurrences. One way of incorporating more semanticknowledge into this statistical system is to look at the occurrences of mul-tiple adjacent words (n-grams) instead of just a single word (unigram). Wehypothesized that training our parsimonious language model using bigramsinstead of unigrams increases the quality of the classifications.

The bigrams were created as follows. With n being the total number of words in a text, for each word i in the range 2, . . . , n the bigram token consists of words i − 1 and i. For example, the text "He walks the dog" would in the case of the unigram PLM system simply result in one token for each of the four words. The bigram system ends up with three tokens: "he walks", "walks the" and "the dog". The number of tokens per text stays nearly the same, as it is always reduced by 1 compared to the unigram version of the text. The effect of using bigrams is most noticeable in that the total number of unique tokens increases significantly. Where the unigram PLM system, trained on the first chronological 90% of the Dutch Acquis data set, has a vocabulary of 241 290 unique tokens, the bigram PLM system uses 3 441 041 unique tokens (14 times more).
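A minimal sketch of this bigram tokenization (the function name is illustrative):

    def bigram_tokens(text):
        words = text.lower().split()
        # Token i pairs words i - 1 and i, so n words yield n - 1 bigrams.
        return [f"{words[i - 1]} {words[i]}" for i in range(1, len(words))]

    print(bigram_tokens("He walks the dog"))
    # ['he walks', 'walks the', 'the dog']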

Figure 9 shows the results of the bigram PLM system for different values of λc, compared to the unigram PLM system. The bigram system consistently outperforms the unigram system. Its best performance is obtained using a higher λc value (0.4) than the optimal λc value for the unigram system (0.1).


Figure 9: Plots of the R-precision and mean average precision scores for the unigram PLM and bigram PLM systems with differing λc values. The x-axis is in logarithmic scale.

5.3 How Robust Is the PLM Method?

The robustness of the PLM method is tested in several experiments, all using the unigram PLM approach. These include: training on data from different periods in time, training on a different data set with a different taxonomy, testing the system on documents coming from a source different from the training set, the effect of using fewer concepts, and an analysis of the effect of different parameter values.

5.3.1 Does Training Data from a Certain Time Frame Affect the Performance?

Language changes over time and we wanted to get an indication of this effect on our system. All other experiments were run with the oldest chronological 90% of the data as train set and the remaining, most recent, 10% of the data as test set (we call this trained old, evaluated recent). This was done to simulate the real-world conditions under which the classification system would operate. The system will be trained on all currently existing documents, after which it has to classify new documents created later in time.


The setup of this experiment was to train the classifier on the most recent 90% of the Dutch Acquis data and then evaluate it on the oldest 10% of the data (which we call trained recent, evaluated old). As shown in Figure 1, the data set does not contain many documents before 1994, with the result that the oldest 10% of all documents spans the period from 1952 to 1991. The unigram PLM classification system was used to train the classifiers.

As shown in Table 16, the trained recent, evaluated old experiment performed significantly worse than the normal trained old, evaluated recent version. This confirms our expectation that there is a significant difference in training and testing on different chronological parts of the data. It also supports our method of testing on the most recent data instead of using cross-validation. These two experiments constitute 2 out of 10 tests in 10-fold cross-validation. Considering these 2 results, cross-validation would most likely report lower scores than what the users of the recommender system might experience in practice. The scores of the trained old, evaluated recent method of evaluation are thus likely to be a better approximation of the real-life performance of the system.

Table 16: The R-precision and mean average precision scores for the chronological experiments using the unigram PLM system. trained old, evaluated recent is trained on the oldest chronological 90% of the Dutch Acquis data set and evaluated on the remaining 10%, while trained recent, evaluated old is trained on the most recent chronological 90% and evaluated on the oldest 10%.

experiment                                 Rprec     MAP
trained old, evaluated recent (baseline)   0.5510    0.5590
trained recent, evaluated old              0.2979▼   0.3092▼

5.3.2 How Does the PLM System Perform on Different Data Sets and Taxonomies?

Most experiments were carried out on the Acquis data set. We also tested our method on the parliamentary questions data set to find out whether it performs consistently across different data sets and taxonomies.

Figure 10 shows the R-precision and mean average precision scores of the unigram PLM system for different values of λc on both the Dutch Acquis data set and the parliamentary questions data set. The effect of λc is similar across both data sets, with the best classification scores for λc at 0.05 and 0.1. The R-precision scores are lower for the parliamentary questions data set, though its mean average precision scores are higher. This large gap, due to a lower average number of relevant concepts per document, is further described in Section 5.1.2. This difference between the data sets makes it difficult to compare the scores.


Figure 10: Plots of the R-precision and mean average precision scores for the unigram PLM system with differing λc values, trained on the parliamentary questions data set and the Dutch Acquis data set. The x-axis is in logarithmic scale.

5.3.3 How Does the System Perform When Train and Test Documents Stem from Different Sources?

Ideally a single system would be able to perfectly classify documents covering all kinds of topics. To do this, our system would need labeled data covering all topics, which is not the case. Still, in practice people might want to use our system trained on data coming from one source to classify documents from another source for which there are no labeled documents. We hypothesize that the more related the sources are, the better the classification results will be.

In order to test this we classified Dutch parliamentary questions with the unigram PLM system trained on the Dutch Acquis data set. Both data sets are related in that they contain political documents, but obviously differ as one focuses on European matters while the other deals with Dutch issues. Furthermore, the Acquis data set covers many kinds of legislation, while the other data set covers only parliamentary questions. We expected that some Dutch issues would overlap with the European ones and consequently the classifier would return accurate classifications, while issues specific to the Dutch parliamentary questions would be unknown to the classifier and thus result in worse classifications.

For this experiment we classified 60 parliamentary questions. 50 questions were randomly selected. We handpicked the other 10 questions to contain texts about local Dutch affairs, and we thus expected them to be classified worse. The evaluation of the 60 questions had to be done manually, as the parliamentary questions are labeled with concepts from the taxonomie beleidsagenda while the classifier returns concepts from the EuroVoc taxonomy. We hired a journalist to read each of the parliamentary questions (without knowing that some texts were randomly selected and some handpicked) and then select which of the top 100 concepts returned by our system were relevant.

It took 6 hours to do this, so on average the time to read and manually classify a parliamentary question using 100 recommended concepts was 6 minutes. In [19] it is said that expert documentalists, without the assistance of a recommender system, index between 30 and 35 documents per day, which (assuming an eight-hour workday) corresponds to about 13-16 minutes per document. While our (non-expert) documentalist did not retrieve relevant concepts beyond the top 100 recommended concepts and the average length of the parliamentary questions is half that of the Acquis documents, it does give an indication that the use of a recommender system might speed up the manual indexing process.

From the results shown in Table 17 it can be concluded that the questions about local Dutch issues were indeed classified significantly worse than the random sample. The 50 random questions most likely also included some local questions, which will have negatively affected the scores, so questions with topics similar to the Acquis training material will likely have higher R-precision and mean average precision scores than those shown in the table. The average numbers of relevant concepts as identified by our documentalist are also reported. These values are higher than the average number of concepts used to label the Acquis documents (5.6).

Table 17: The R-precision and mean average precision scores, calculated using the relevant concepts (identified by our documentalist) out of the top 100 concepts of 60 parliamentary questions. Significance tested with two-tailed unpaired t-tests.

experiment                       Rprec     MAP       average relevant concepts
50 random questions (baseline)   0.3263    0.3412    9.8
10 local questions               0.1604▽   0.1725▽   7.7

5.3.4 Does Classification Improve When Using More General and Thus Fewer Concepts?

The EuroVoc taxonomy is hierarchical, which can be used to train the system in different ways. It is conceivable that some users want to classify their documents using general concepts instead of specific concepts. In this experiment, the concepts of each document in the Dutch Acquis data set were replaced by their ancestor concepts from the microthesauri level (which contains 127 concepts). This results in a lower average number of concepts per document (4.2 instead of 5.4), as specific concepts sometimes map to the same microthesauri concept. The unigram PLM system's training process was then executed as normal. We expected that the system would perform better, as there is more training data for each concept and there are fewer concepts to choose from during the classification.
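A minimal sketch of this generalization step, assuming a mapping from each specific concept to its microthesaurus-level ancestor (the concept names below are made up for illustration):

    def generalize(doc_concepts, ancestor):
        # Distinct specific concepts may share a microthesaurus ancestor,
        # which is why the average number of concepts per document drops.
        return sorted({ancestor[c] for c in doc_concepts})

    ancestor = {"swine fever": "MT veterinary science",
                "animal disease": "MT veterinary science",
                "fresh meat": "MT foodstuffs"}
    print(generalize(["swine fever", "animal disease", "fresh meat"], ancestor))
    # ['MT foodstuffs', 'MT veterinary science']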

Figure 11 shows the R-precision and mean average precision scores with varying λc values for both the classifier trained with the general concepts and the classifier using the specific concepts (i.e., the normal Dutch Acquis data set). The general concepts classifier shows higher R-precision scores, and its improvement in mean average precision scores is even more drastic. This gap between R-precision and mean average precision could again be due to the lower average number of concepts per document (as detailed in Section 5.1.2).


Figure 11: Plots of the R-precision and mean average precision scores for the unigram PLM system with differing λc values, trained on general microthesauri concepts from the Dutch Acquis data set and the normal specific concepts. The x-axis is in logarithmic scale.

5.3.5 How Much Parameter Tuning Is Needed?

As detailed in Section 3, the PLM system requires two parameters to be optimized, λc and λq, which balance the mixture with the background language model for the concept model c and the query model q respectively. Ideally, these parameters are set empirically for each data set (and maybe even for each concept or query document). In the previous experiments we have not optimized these parameters for each data set and used fixed values instead. In the experiments it turns out that the optimal range of λc values is roughly between 0.005 and 0.6. Scores in this range differ, but not by large margins. Different document representations (and with them the size of the vocabulary) have an effect on the optimal λc value, as shown in the lemmatized, unigram and bigram experiments.
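As a reminder of the role these parameters play, here is a minimal sketch of the parsimonious EM re-estimation of [7] for a single model (the array names and toy numbers are illustrative; the background probabilities are assumed to be non-zero):

    import numpy as np

    def parsimonious_model(tf, p_background, lam, iterations=50):
        # EM re-estimation following [7]: tokens that the background
        # model already explains lose probability mass; the lower lam,
        # the stronger this parsimonization.
        p = tf / tf.sum()  # initialize with the maximum likelihood estimate
        for _ in range(iterations):
            # E-step: expected token counts attributed to this model
            # rather than to the background model.
            e = tf * (lam * p) / (lam * p + (1 - lam) * p_background)
            # M-step: renormalize the expected counts.
            p = e / e.sum()
        return p

    # Toy vocabulary (the, swine, fever); "the" dominates the background.
    tf = np.array([50.0, 10.0, 5.0])
    p_background = np.array([0.8, 0.01, 0.01])
    print(parsimonious_model(tf, p_background, lam=0.1))

With this low lam, most of the probability mass of the stop word the is reassigned to the more specific tokens, matching the behaviour of λc observed in Table 11.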

Figure 12 shows the effect of different λq values on the unigram PLM system trained on the Dutch Acquis data set. A value of 0.005 results in optimal performance. As with λc, the scores differ with different λq values, but not by large margins.

While optimization of the λc and λq values can improve the scores significantly, there seem to be certain ranges for which the scores do not differ by large margins.


Figure 12: Plots of the R-precision and mean average precision scores for the unigram PLM system trained with differing λq values on the Dutch Acquis data set. The x-axis is in logarithmic scale.

6 Conclusion and Discussion

6.1 Conclusion

We have shown that parsimonious language models can be used to create a robust text classification system. The unigram and bigram versions of the PLM system outperform JEX on both the Acquis data set and the parliamentary questions data set. Furthermore, the PLM system automatically deals with (also domain-specific) stop words and is easier to optimize.

The experiments show that a document representation of lemmatized unigrams decreases the PLM system's performance, while bigrams improve the performance compared to unigrams.

The PLM system has shown to be robust, as it performs well without specific optimization on data sets using different taxonomies and is able to acceptably classify documents from sources other than the one it has been trained on. The system needs optimization for 2 parameters, but the experiments indicate a stable performance across a certain range of values for these parameters. A trade-off can be made to improve performance by using more general (and thus fewer) concepts (which might be beneficial if more general concepts are required).

Altogether, the main research question can be answered as follows: parsimonious language models can be used to create language-independent, easy-to-optimize, multi-label text classification systems.

6.2 Discussion

The comparison with JEX would have been better if their software allowed for lemmatization and bigrams. They state that both these document representations slightly improved performance.

As reported, the results in Table 14 do not contain the scores for three languages which were not trained and evaluated successfully using the PLM method. We want to find out what goes wrong, fix this problem and report these scores in the future.

6.3 Future Work

An analysis could be done to find out if there are performance differences between individual concepts in the PLM system. If this is the case, then it might be related to the number of training documents available to train a concept, or to whether the concept is often used together with other concepts. This information could then be used to make better predictions of the performance of a specific concept given the amount and kind of training data. Currently, the λc parameter is the same for each concept model. The above-mentioned influences might also affect the optimal λc value. Knowledge of this effect might be used to adapt the λc value for each concept model. The same holds for λq.

The PLM system as implemented in this research does not remove tokens from a concept model if their probability falls below a certain threshold. The classification speed might improve if a threshold can be found which removes low-probability tokens while not significantly affecting the classification results.

The scores in Table 14 are not directly comparable between languages, as they are the results of different data sets, though these are all largely overlapping subsets of all Acquis documents. The languages could instead be trained on the 17 520 documents which occur in the data sets of all languages. These results should give a better indication of the PLM system's performance given different (types of) languages.

In the experiment in Section 5.3.1 it is shown that documents created during different periods in time have a large effect on the performance. As a follow-up, it could be researched whether there is a (moving) threshold which removes documents if they are too old (e.g., documents from before 1990 or documents older than 20 years), thereby potentially improving both the efficiency, due to smaller models, and the effectiveness, due to less irrelevant training data.


References

[1] Charu C. Aggarwal and ChengXiang Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222. 2012.

[2] Concha Bielza, Guangdi Li, and Pedro Larrañaga. Multi-dimensional classification with Bayesian networks. International Journal of Approximate Reasoning, 52(6):705–727, 2011.

[3] Guido Boella, Luigi Di Caro, Leonardo Lesmo, and Daniele Rispoli. Multi-label classification of legislative text into EuroVoc. In Legal Knowledge and Information Systems: JURIX 2012: The Twenty-Fifth Annual Conference, volume 250, pages 21–30, 2013.

[4] Luis M. de Campos and Alfonso E. Romero. Bayesian network models for hierarchical text classification from a thesaurus. International Journal of Approximate Reasoning, 50(7):932–944, 2009.

[5] André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. Advances in Neural Information Processing Systems, 14:681–687, 2002.

[6] Shantanu Godbole and Sunita Sarawagi. Discriminative methods for multi-labeled classification. In Advances in Knowledge Discovery and Data Mining, volume 3056 of Lecture Notes in Computer Science, pages 22–30. 2004.

[7] Djoerd Hiemstra, Stephen Robertson, and Hugo Zaragoza. Parsimonious language models for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–46, 2004.

[8] Rianne Kaptein, Rongmei Li, Djoerd Hiemstra, and Jaap Kamps. Using parsimonious language models on web data. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 763–764, 2008.

[9] Adam Kilgarriff. Which words are particularly characteristic of a text? A survey of statistical approaches. In Proceedings of the AISB Workshop on Language Engineering for Document Analysis and Recognition, pages 33–40, 1996.

[10] Victor Lavrenko and W. Bruce Croft. Relevance models in information retrieval. In Language Modeling for Information Retrieval, pages 11–56. 2003.


[11] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.

[12] Eneldo Loza Mencía and Johannes Fürnkranz. Efficient multilabel classification algorithms for large-scale problems in the legal domain. In Semantic Processing of Legal Texts, volume 6036 of Lecture Notes in Computer Science, pages 192–215. 2010.

[13] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. 2008.

[14] Edgar Meij, Wouter Weerkamp, Krisztian Balog, and Maarten de Rijke. Parsimonious relevance models. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 817–818, 2008.

[15] Bruno Pouliquen, Ralf Steinberger, and Camelia Ignat. Automatic annotation of multilingual text collections with a conceptual thesaurus. In Proceedings of the Workshop Ontologies and Information Extraction, pages 9–28, 2003.

[16] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135–168, 2000.

[17] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

[18] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 2142–2147, 2006.

[19] Ralf Steinberger, Mohamed Ebrahim, and Marco Turchi. JRC EuroVoc Indexer JEX - a freely available multi-label categorisation tool. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 798–805, 2012.

[20] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. 2010.

[21] Min-Ling Zhang and Zhi-Hua Zhou. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338–1351, 2006.


A Calculation of Log-likelihood

The log-likelihood score G^2 depicts the difference in relative frequency of token t in the current document d compared to the relative frequency of the token in the whole corpus D. The token frequencies as shown in Table 18 are used to calculate the score in Equation 13 [9].

Table 18: Contingency table.

              tf(d)   tf(D)   total
token t       a       b       a + b
other tokens  c       d       c + d
total         a + c   b + d   a + b + c + d

\begin{aligned}
G^2(t, d, D) = 2\big(&\,a \ln(a) + b \ln(b) + c \ln(c) + d \ln(d) \\
&- (a + b)\ln(a + b) - (a + c)\ln(a + c) \\
&- (b + d)\ln(b + d) - (c + d)\ln(c + d) \\
&+ (a + b + c + d)\ln(a + b + c + d)\big)
\end{aligned}
\tag{13}
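A direct transcription of Equation 13 as a small Python function, using the usual convention that x ln(x) = 0 when x = 0 (a sketch; the counts in the example are illustrative):

    import math

    def xlnx(x):
        # x * ln(x), with the convention that 0 * ln(0) = 0.
        return x * math.log(x) if x > 0 else 0.0

    def g2(a, b, c, d):
        # Log-likelihood score of Equation 13, computed from the
        # contingency table of Table 18.
        return 2 * (xlnx(a) + xlnx(b) + xlnx(c) + xlnx(d)
                    - xlnx(a + b) - xlnx(a + c)
                    - xlnx(b + d) - xlnx(c + d)
                    + xlnx(a + b + c + d))

    # Token t: 10 of the 200 tokens in d, 50 of the 100 000 tokens in D.
    print(g2(10, 50, 190, 99950))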


B Concept Models

Table 19: The average number of tokens with non-zero values per concept model and standard deviation for all languages, measured for the JEX, unigram PLM and bigram PLM systems.

lang.     JEX         unigram PLM    bigram PLM
cs        216 ± 597   4782 ± 5128    21 585 ± 39 821
da        271 ± 669   4376 ± 5397    23 144 ± 44 850
de        306 ± 718   4956 ± 6192    25 722 ± 50 292
en        118 ± 355   2802 ± 2680    19 387 ± 34 361
es        161 ± 495   3847 ± 3996    20 842 ± 36 945
et        227 ± 609   5616 ± 6737    22 122 ± 43 926
fi        353 ± 814   7004 ± 9537    27 577 ± 62 079
fr        233 ± 604   3475 ± 3444    20 680 ± 35 730
hu        143 ± 461   5836 ± 6916    22 186 ± 41 475
it        244 ± 639   3756 ± 3836    23 176 ± 41 639
lt        204 ± 582   5176 ± 5591    23 113 ± 43 600
lv        202 ± 575   4718 ± 5036    22 194 ± 41 679
mt        179 ± 535   4072 ± 4114    22 170 ± 33 448
nl        220 ± 575   3688 ± 4283    22 479 ± 42 251
pl        214 ± 598   5157 ± 5446    22 511 ± 41 190
pt        241 ± 626   3757 ± 3809    22 420 ± 39 443
sk        214 ± 604   5346 ± 5787    22 872 ± 41 707
sl        194 ± 545   4621 ± 4823    21 070 ± 37 492
sv        276 ± 673   4507 ± 5686    23 935 ± 46 597
average   222 ± 594   4605 ± 5181    22 589 ± 42 028
