Text Deconvolution Saliency (TDS): a deep tool box for linguistic analysis

HAL Id: hal-01804310
https://hal.archives-ouvertes.fr/hal-01804310
Submitted on 31 May 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Laurent Vanni, Mélanie Ducoffe, Damon Mayaffre, Frédéric Precioso, Dominique Longrée, et al. Text Deconvolution Saliency (TDS): a deep tool box for linguistic analysis. 56th Annual Meeting of the Association for Computational Linguistics, Jul 2018, Melbourne, Australia. hal-01804310


Text Deconvolution Saliency (TDS): a deep tool box for linguistic analysis

L. Vanni1, M. Ducoffe2, D. Mayaffre1, F. Precioso2

D. Longree3, V. Elango2, N. Santos2, J. Gonzalez2, L. Galdo2, C. Aguilar1

1 Univ. Nice Sophia Antipolis - BCL, UMR UNS-CNRS 7320 - France
2 Univ. Nice Sophia Antipolis - I3S, UMR UNS-CNRS 7271 - France

3 Univ. Liege - L.A.S.L.A - [email protected]

Abstract

In this paper, we propose a new strategy, called Text Deconvolution Saliency (TDS), to visualize linguistic information detected by a CNN for text classification. We extend Deconvolution Networks to text in order to present a new perspective on text analysis to the linguistic community. We empirically demonstrate the efficiency of our Text Deconvolution Saliency on corpora from three different languages: English, French, and Latin. For every tested dataset, our Text Deconvolution Saliency automatically encodes complex linguistic patterns based on co-occurrences and possibly on grammatical and syntactic analysis.

1 Introduction

As in many other fields of data analysis, Natural Language Processing (NLP) has been strongly impacted by the recent advances in Machine Learning, more particularly with the emergence of Deep Learning techniques. These techniques outperform all other state-of-the-art approaches on a wide range of NLP tasks, and so they have been quickly and intensively adopted in industrial systems. Such systems rely on end-to-end training on large amounts of data, making no prior assumptions about linguistic structure and focusing on statistically frequent patterns. Thus, they somehow step away from computational linguistics, as they learn implicit linguistic information automatically without aiming at explaining or even exhibiting the classic linguistic structures underlying the decision.

This is the question we raise in this article and that we intend to address by exhibiting the classic linguistic patterns which are indeed exploited implicitly in deep architectures to reach higher performances. Do neural networks make use of co-occurrences and other standard features considered in traditional Textual Data Analysis (TDA, or Text Mining)? Do they also rely on complementary linguistic structure which is invisible to traditional techniques? If so, projecting neural network features back onto the input space would highlight new linguistic structures, leading to an improved analysis of a corpus and a better understanding of where the power of Deep Learning techniques comes from.

Our hypothesis is that Deep Learning is sensitive to the linguistic units on which the computation of key statistical sentences is based, as well as to phenomena other than frequency and complex linguistic observables. TDA has more difficulty taking such elements into account, for instance linguistic patterns. Our contribution confronts Textual Data Analysis with Convolutional Neural Networks for text analysis. We take advantage of deconvolution networks for image analysis in order to present a new perspective on text analysis to the linguistic community, which we call Text Deconvolution Saliency (TDS). Our deconvolution saliency corresponds to the sum over the word embedding dimension of the deconvolution projection of a given feature map. Such a score provides a heat-map of words in a sentence that highlights the patterns relevant for the classification decision. We examine the z-test (see section 4.2) and TDS for three languages: English, French and Latin. For all our datasets, TDS highlights new linguistic observables, invisible with the z-test alone.

2 Related work

Convolutional Neural Networks (CNNs) are widely used in the computer vision community for a wide panel of tasks, ranging from image classification and object detection to semantic segmentation. They follow a bottom-up approach, in which an input image is passed through stacked layers of convolutions, non-linearities and sub-sampling.

Encouraged by this success on vision tasks, researchers applied CNNs to text-related problems Kalchbrenner et al. (2014); Kim (2014). The use of CNNs for sentence modeling traces back to Collobert and Weston (2008), who adapted CNNs for various NLP problems including Part-of-Speech tagging, chunking, Named Entity Recognition and semantic labeling. CNNs for NLP work by analogy between an image and a text representation: each word is embedded in a vector representation, and the concatenation of these vectors builds a matrix.

We first discuss our choice of architecture. While Recurrent Neural Networks (mostly GRU and LSTM) are known to perform well on a broad range of text tasks, recent comparisons have confirmed the advantage of CNNs over RNNs when the task at hand is essentially a keyphrase recognition task Yin et al. (2017).

In Text Mining, we aim at highlighting linguistic patterns in order to analyze their contrast: specificities and similarities in a corpus Feldman and Sanger (2007); Lebart, Salem and Berry (1998). It mostly relies on frequency-based methods such as the z-test. However, such existing methods have so far encountered difficulties in underlining more challenging linguistic knowledge which up to now has not been empirically observed, for instance syntactical motifs Mellet and Longree (2009).

In that context, supervised classification, especially CNNs, may be exploited for corpus analysis. Indeed, a CNN automatically learns parameters to cluster similar instances and drive away instances from different categories. Eventually, its predictions rely on features which infer specificities and similarities in a corpus. Projecting such features back into the word embedding space will reveal relevant spots and may automate the discovery of new linguistic structures, such as the previously cited syntactical motifs. Moreover, CNNs hold other advantages for linguistic analysis. They are static architectures that, under specific settings, are more robust to the vanishing gradient problem, and thus can also model long-term dependencies in a sentence Dauphin et al. (2017); Wen et al. (2017); Adel and Schutze (2017). Such a property may help to detect structures relying on different parts of a sentence.

All previous works converge to a shared assessment: both CNNs and RNNs provide relevant, but different, kinds of information for text classification. However, though several works have studied linguistic structures inherent in RNNs, to our knowledge, none of them has focused on CNNs. A first line of research has extensively studied the interpretability of word embeddings and their semantic representations Ji and Eisenstein (2014). When it comes to deep architectures, Karpathy et al. (2015) used LSTMs on character-level language modeling as a testbed. They demonstrate the existence of long-range dependencies on real-world data. Their analysis is based on gate activation statistics and is thus global. On another side, Li et al. (2015) provided new visualization tools for recurrent models. They use decoders, t-SNE and first-derivative saliency in order to shed light on how neural models work. Our perspective differs from their line of research, as we do not intend to explain how CNNs work on textual data, but rather to use their features to provide complementary information for linguistic analysis.

Although the usage of RNNs is more common, there are various visualization tools for CNN analysis, inspired by the computer vision field. Such works may help us to highlight the linguistic features learned by a CNN; consequently, our method takes inspiration from them. Visualization models in computer vision mainly consist in inverting hidden layers in order to spot active regions or features that are relevant to the classification decision. One can either train a decoder network or use backpropagation on the input instance to highlight its most relevant features. While those methods may hold accurate information in their input recovery, they have two main drawbacks: i) they are computationally expensive: the first method requires training a model for each latent representation, and the second relies on backpropagation for each submitted sentence; ii) they are highly hyperparameter-dependent and may require some finetuning depending on the task at hand. On the other hand, Deconvolution Networks, proposed by Zeiler and Fergus (2014), provide an off-the-shelf method to project a feature map into the input space. It consists in inverting each convolutional layer iteratively, back to the input space. The inverse of a discrete convolution is computationally challenging. In response, a coarse approximation may be employed, which consists of inverting channels and filter weights in a convolutional layer and then transposing their kernel matrix. More details of the deconvolution heuristic are provided in section 3. Deconvolution has several advantages. First, it induces minimal computational requirements compared to previous visualization methods. Also, it has been used with success for semantic segmentation on images: Noh et al. (2015) demonstrate the efficiency of deconvolution networks in predicting segmentation masks to identify pixel-wise class labels. Thus deconvolution is able to localize meaningful structure in the input space.

3 Model

3.1 CNN for Text Classification

We propose a deep neural model to capture linguistic patterns in text. This model is based on a Convolutional Neural Network with an embedding layer for word representations, one convolutional layer with pooling and non-linearities, and finally two fully-connected layers. The final output size corresponds to the number of classes. The model is trained by cross-entropy with an Adam optimizer. Figure 1 shows the global structure of our architecture. The input is a sequence of words w1, w2, ..., wn and the output contains class probabilities (for text classification).

The embedding is built on top of a Word2Vec architecture; here we consider a Skip-gram model. This embedding is also finetuned by the model to increase the accuracy. Notice that we do not use lemmatisation, as in Collobert and Weston (2008); thus the linguistic material which is automatically detected does not rely on any prior assumptions about the part of speech.

In computer vision, we consider images as 2-dimensional isotropic signals. A text representation may also be considered as a matrix: each word is embedded in a feature vector, and their concatenation builds a matrix. However, we cannot assume that both dimensions, the sequence of words and their embedding representation, are isotropic. Thus the filters of CNNs for text typically differ from their counterparts designed for images. Consequently, in text, the width of the filter is usually equal to the dimension of the embedding, as illustrated with the red, yellow, blue and green filters in figure 1.

Using CNNs has another advantage in our context: due to the convolution operators involved, they can be easily parallelized and may also be easily run on the CPU, which is a practical solution for avoiding the use of GPUs at test time.
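To make the filter geometry concrete, here is a minimal pure-Python sketch (not the authors' implementation) of a text convolution in which the filter spans the full embedding dimension, so it slides only along the word axis and yields one scalar per window; all vectors and filter values are toy numbers.

```python
def conv1d_text(embeddings, filt):
    """embeddings: list of word vectors (n_words x embed_dim);
    filt: filter of shape (height x embed_dim), i.e. its width equals
    the embedding dimension. Returns one activation per window of words."""
    h = len(filt)
    out = []
    for i in range(len(embeddings) - h + 1):
        s = 0.0
        for j in range(h):                    # rows of the filter (words)
            for k in range(len(filt[0])):     # full embedding width
                s += embeddings[i + j][k] * filt[j][k]
        out.append(max(s, 0.0))               # ReLU non-linearity
    return out

# toy example: 4 words, embedding dim 3, filter height 2
emb = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]
filt = [[1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0]]
feature_map = conv1d_text(emb, filt)  # one value per 2-word window
```

Because the filter consumes the whole embedding width at each step, the feature map is one-dimensional along the word axis, which is exactly why the deconvolution step of section 3.2 must recover the embedding dimension.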

Figure 1: CNN for Text Classification

3.2 Deconvolution

Extending Deconvolution Networks to text is not straightforward. Usually, in computer vision, the deconvolution is represented by a convolution whose weights depend on the filters of the CNN: we invert the weights of the channels and the filters and then transpose each kernel matrix. When considering deconvolution for text, transposing the kernel matrices is not realistic, since we are dealing with non-isotropic dimensions: the word sequence and the filter dimension. Eventually, the kernel matrix is not transposed.

Another concern is the dimension of the feature map. Here, feature map means the output of the convolution before applying max pooling. Its shape is actually the tuple (# words, # filters). Because the filters' width (red, yellow, blue and green in fig. 1) matches the embedding dimension, the feature maps cannot contain this information. To project the feature map into the embedding space, we need to convolve our feature map with the kernel matrices. To this aim, we upsample the feature map to obtain a 3-dimensional sample of size (# words, embedding dimension, # filters).

To analyze the relevance of a word in a sentence, we only keep one value per word, which corresponds to the sum along the embedding axis of the output of the deconvolution. We call this sum Text Deconvolution Saliency (TDS).
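The reduction step itself is simple; the following sketch (illustrative, with made-up numbers, not the paper's code) assumes the deconvolution has already projected a feature map back to one vector per word in the embedding space, and collapses it to one TDS score per word.

```python
def tds_scores(deconv_output):
    """deconv_output: one reconstructed vector per word
    (n_words x embed_dim). TDS = sum along the embedding axis."""
    return [sum(vec) for vec in deconv_output]

# toy projection for a 3-word sentence, embedding dim 4
deconv = [[0.1, 0.2, 0.0, 0.1],   # word 1: weak activation
          [0.9, 0.7, 0.8, 0.6],   # word 2: strongly activated
          [0.0, 0.1, 0.1, 0.0]]   # word 3: weak activation

scores = tds_scores(deconv)
# index of the most salient word for the classification decision
salient = max(range(len(scores)), key=scores.__getitem__)
```

A word's TDS is thus only meaningful relative to the other words of the same sentence, which is how the heat-maps in the figures are read.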

For the sake of clarity, we sum up our method in figure 2.


Figure 2: Textual Deconvolution Saliency (TDS)

Eventually, every word in a sentence has a unique TDS score whose value is relative to the others. In the next section, we analyze the relevance of TDS. We demonstrate empirically that TDS encodes complex linguistic patterns based on co-occurrences and possibly also on grammatical and syntactic analysis.

4 Experiments

4.1 Datasets

In order to understand what the linguistic markers found by the convolutional neural network approach are, we conducted several tests on different languages, and our model seems to exhibit the same behavior in all of them. To perform all the linguistic statistical tests, we used our own linguistic toolbox, Hyperbase, which allows the creation of databases from textual corpora, their analysis, and calculations such as z-test, co-occurrences, PCA and K-means distances. We use it to evaluate TDS against z-test scoring. We restrict our analysis to cases on which the z-test fails while TDS does not. Indeed, TDS subsumes the z-test: we did not find any sentence on which the z-test succeeds while TDS fails. Red words in the studied examples are those with the highest TDS.

The first dataset we used for our experiments is the well-known IMDB movie review corpus for sentiment classification. It consists of 25,000 reviews labeled with positive or negative sentiment, with around 230,000 words.

The second dataset targets French political discourse. It is a corpus of 2.5 million words of French Presidents from 1958 (with De Gaulle, the first President of the Fifth Republic) to 2018, with the first speeches by Macron. From this corpus we removed Macron's speech of 31 December 2017 to use it as a test set. The training task is to recognize each French president.

The last dataset is based on Latin. We assembled a contrastive corpus of 2 million words with 22 principal authors writing in classical Latin. As with the French dataset, the learning task here is to predict each author from new sequences of words. The next example is an excerpt of chapter 26 of the 23rd book of Livy:

[...] tutus tenebat se quoad multum ac diu obtestanti quattuor milia peditum et quingenti equites in supplementum missi ex Africa sunt . tum refecta tandem spe castra propius hostem mouit classemque et ipse instrui parari que iubet ad insulas maritimam que oram tutandam . in ipso impetu mouendarum de [...]

4.2 Z-test Versus Text Deconvolution Saliency

The z-test is one of the standard metrics used in linguistic statistics, in particular to measure the occurrences of word collocations Manning and Schutze (1999). Indeed, the z-test provides a statistical score for a sequence of words appearing more frequently than any other sequence of words of the same length. This score results from the comparison between the frequency of the observed word sequence and the frequency expected under a "normal" distribution. In the context of contrastive corpus analysis, this same calculation applied to single words can readily provide, for example, the most specific vocabulary of a given author: in this case, the words with the highest z-test are the most specific words of that author. This is a simple but strong method for analyzing features of text. It can also be used to classify word sentences according to the global z-test (sum of the scores) of all the words in a given sentence. We can thus use this global z-test as a very simple metric for authorship classification: the authorship of a given sentence is, for instance, given by the author corresponding to the highest global z-test on that sentence, compared to all other global z-tests obtained by summing up the z-test of each word of the same sentence but with the vocabulary specificity of another author. The mean accuracy of assigning the right author to the right sentence, on our dataset, is around 87%, which confirms that the z-test is indeed meaningful for contrast pattern analysis. On the other hand, most of the time a CNN reaches an accuracy greater than 90% for text classification (as shown in Table 1).

          z-test   Deep Learning
Latin     84%      93%
French    89%      91%
English   90%      97%

Table 1: Test accuracy with z-test and Deep Learning

This means that CNN approaches can also learn, on their own, some of the linguistic specificities useful in discriminating text categories. Previous work on image classification has highlighted the key role of convolutional layers, which learn different levels of abstraction of the data to make classification easier.

The question is: what is the nature of this abstraction for text?

We show in this article that the CNN approach automatically detects words with a high z-test, but this is obviously not the only linguistic structure detected.

To make the two values comparable, we normalize them. The values can be either positive or negative, and we distinguish two thresholds1 for the z-test: over 2, a word is considered specific, and over 5, strongly specific (and the opposite for negative values). For the TDS, it is just a matter of activation strength.

Figure 3 shows a comparison between z-test and TDS on a sentence extracted from our Latin corpus (Livy, Book XXIII, Chap. 26). This sentence is an example of specific words used by Livy2. As we can see, when the z-test is at its highest, the TDS is also at its highest, and the TDS values are also high for the neighboring words (for example around the word castra). However, this is not always the case: small words such as que or et are also high in z-test, but they do not impact the network at the same level. We can also see in Figure 3 that words like tenebat, multum or propius are totally uncorrelated. The Pearson correlation coefficient3 tells us that in this sentence there is no linear correlation between z-test and TDS (with a Pearson of 0.38). This example is one of the most correlated examples of our dataset; thus the CNN seems to learn more than a simple z-test.

Figure 3: z-test versus Text Deconvolution Saliency (TDS) - Example on Livy Book XXIII Chap. 26

1 The z-test can be approximated by a normal distribution. The score obtained by the z-test is a number of standard deviations; a low value indicates that the data points tend to be close to the mean (the expected value). Over 2, there is less than about a 2% chance of observing this distribution; over 5, less than 0.1%.

2 Titus Livius Patavinus (64 or 59 BC - AD 12 or 17) was a Roman historian.
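The correlation check above can be reproduced with a plain Pearson coefficient over the two per-word score vectors; the sketch below uses hypothetical score values, not the paper's actual data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# hypothetical normalized scores for the same five words of a sentence
z_scores = [5.0, 1.0, 0.5, 4.0, 0.2]   # z-test values
tds      = [4.5, 0.8, 3.0, 0.5, 0.1]   # TDS values
r = pearson(z_scores, tds)
```

A value of r near 1 would mean TDS merely reproduces the z-test; the weak correlations observed on the corpora are what suggest TDS captures additional structure.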

4.3 Dataset: English

For English, we used the IMDB movie review corpus for sentiment classification. With the default methods, we can easily show the specific vocabulary of each class (positive/negative) according to the z-test. There are, for example, the words too, bad, no or boring as most indicative of negative sentiment, and the words and, performance, powerful or best for positive. Is this enough to detect automatically whether a new review is positive or not? Let us look at an example excerpted from a review from December 2017 (not in the training set) of the latest American blockbuster:

[...] i enjoyed three moments in the film in total , and if i am being honest and the person next to me fell asleep in the middle and started snoring during the slow space chase scenes . the story failed to draw me in and entertain me the way [...]

In general, the z-test is sufficient to predict the class of this kind of comment. But in this case the CNN seems to do better. Why?

3 The Pearson correlation coefficient measures the linear relationship between two datasets. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.


If we sum all the z-tests (for negative and positive), the positive class obtains a greater score than the negative. The words film, and, honest and entertain (with scores 5.38, 12.23, 4 and 2.4) make this example positive. The CNN has activated different parts of this sentence (shown in bold/red in the example). If we take the sub-sequence and if i am being honest and, there are two occurrences of and, but the first one is followed by if, and our toolbox gives us 0.84 for and if in the negative class. This is far from the 12.23 in the positive one. And if we go further, we can do a co-occurrence analysis of and if on the training set. As we see in our co-occurrence analysis4 (Figure 4), honest is among the most specific adjectivals5 associated with and if, exactly what we found in our example.

Figure 4: co-occurrence analysis of and if (Hyperbase)

In addition, we observe the same behavior with the verb fall, with the word asleep next to it. Asleep alone is not really specific to negative reviews (z-test of 1.13), but the association of both words becomes highly specific to negative sentences (see the co-occurrence analysis in Figure 5).

The Text Deconvolution Saliency here confirms that the CNN seems to focus not only on high z-test words but on more complex patterns, and may detect the lemma or the part of speech linked to each word. We will now see that these observations remain valid for other languages and can even be generalized across different TDS.

Figure 5: co-occurrence analysis of fall (Hyperbase)

4 These figures show the major co-occurrences for a given word (or lemma or part of speech). There are two layers of co-occurrences: the first one (on top) shows the direct co-occurrences, and the second (on bottom) shows a second level of co-occurrence, given by the context of the two words taken together. The colors and the dotted lines are only used for readability (dotted lines mark the first level). The width of each line is proportional to the z-test score (the higher the z-test, the wider the line).

5 With our toolbox, we can focus on different parts of speech.

4.4 Dataset: French

From this corpus we removed Macron's speech of 31 December 2017 to use it as a test set. On this speech, the CNN primarily recognizes Macron (the training task was to predict the correct President). To achieve this, the CNN seems to succeed in finding really complex patterns specific to Macron, for example in this sequence:

[...] notre pays advienne à l'école pour nos enfants, au travail pour l'ensemble de nos concitoyens pour le climat pour le quotidien de chacune et chacun d'entre vous. Ces transformations profondes ont commencé et se poursuivront avec la même force le même rythme la même intensité [...]

The z-test gives a result statistically closer to De Gaulle than to Macron. The error in the statistical attribution can be explained by a Gaullist phraseology and the multiplication of linguistic markers strongly indexed to De Gaulle: De Gaulle had the specificity of making long and literary sentences articulated around coordinating conjunctions such as et (z-test = 28 for De Gaulle; two occurrences in the excerpt). His speech was also more conceptual than average, and this resulted in an over-use of the definite articles (le, la, l', les), very numerous in the excerpt (7 occurrences), especially in the feminine singular (la république, la liberté, la nation, la guerre, etc.; here we have la même force, la même intensité).

Figure 6: Deconvolution on a Macron speech.

The best results given by the CNN may be surprising for a linguist, but they match perfectly with what is known about the sociolinguistics of Macron's dynamic style of speech.

The part of the excerpt which most impacts the CNN classification is related to the nominal syntagm transformations profondes. Taken separately, neither of the phrase's two words is very Macronian from a statistical point of view (transformations = 1.9, profondes = 2.9). Better, the syntagm itself does not appear in the President's training corpus (0 occurrences). However, the co-occurrence of transformations and profondes amounts to 4.81 for Macron: so it is not the occurrence of one word alone, or the other, which is Macronian, but the simultaneous appearance of both in the same window. The second and complementary most impacting part of the excerpt is related to the two verbs advienne and poursuivront. From a semantic point of view, the two verbs perfectly contribute, after the phrase transformations profondes, to giving the necessary dynamism to a discourse that advocates change. However, it is the verb tenses (carried by the morphology of the verbs) that appear to be the determining factor in the analysis. The calculation of the grammatical codes co-occurring with the word transformations thus indicates that verbs in the subjunctive and verbs in the future (and also nouns) are the privileged codes for Macron (Figure 7).

Figure 7: Main part-of-speech co-occurrences for transformations (Hyperbase)

More precisely, the algorithm indicates that, for Macron, when transformations is associated with a verb in the subjunctive (here advienne), there is usually a verb in the future co-present (here poursuivront). transformations profondes, advienne in the subjunctive, poursuivront in the future: all these elements together form a speech promising action, from the mouth of a young and dynamic President. Finally, the graph indicates that transformations is especially associated with nouns in the President's speeches: in an extraordinary concentration, the excerpt lists 11 of them (pays, école, enfants, travail, concitoyens, climat, quotidien, transformations, force, rythme, intensité).

4.5 Dataset: Latin

As with the French dataset, the learning task here is to predict the identity of each author from a contrastive corpus of 2 million words with 22 principal authors writing in classical Latin.

The statistics here identify this sentence as Caesar's6, but Livy is not far off. As historians, Caesar and Livy share a number of specific words: for example, tool words like se (reflexive pronoun) or que (a coordinator) and prepositions like in, ad, ex. There are also nouns like equites (cavalry) or castra (fortified camp).

The attribution of the sentence to Caesar cannot rely only on the z-test: que, in or castra have deviations equivalent or inferior to Livy's.

6Gaius Julius Caesar, 100 BC - 44 BC, usually calledJulius Caesar, was a Roman politician and general and a no-table author of Latin prose.


On the other hand, the deviations of se and ex are greater, as is that of equites. Two very Caesarian terms undoubtedly make the difference: iubet (he orders) and milia (thousands).

The greater scores of quattuor (four), castra, hostem (the enemy) and impetu (the assault) for Livy are not enough to switch the attribution to this author.

On the other hand, the CNN activates several zones appearing at the beginning of sentences and corresponding to coherent syntactic structures (for Livy): tandem refecta spe castra propius hostem mouit (then, hope having finally returned, he moved the camp closer to the camp of the enemy), despite the fact that castra in hostem mouit is attested only in Tacitus7.

There is also in ipso metu (in fear itself), while in followed by metu is attested once in Caesar and once in Quintus Curtius8.

More complex structures are possibly also detected by the CNN: the structure tum + participle in the ablative absolute (tum refecta) is more characteristic of Livy (z-test 3.3, with 8 occurrences) than of Caesar (z-test 1.7, with 3 occurrences), even if it is even more specific to Tacitus (z-test 4.2, with 10 occurrences).

Finally, and most likely, the co-occurrence between castra, hostem and impetu may have played a major role (Figure 8).

Figure 8: Specific co-occurrences between impetu and castra (Hyperbase)

In Livy, impetu appears as a co-occurrent with the lemmas hostis (z-test 9.42) and castra (z-test 6.75), while hostis only has a deviation of 3.41 in Caesar and castra does not appear in the list of co-occurrents.

7 Publius (or Gaius) Cornelius Tacitus (c. AD 56 - c. 120) was a senator and a historian of the Roman Empire.

8 Quintus Curtius Rufus was a Roman historian, probably of the 1st century AD; his only known and only surviving work is the "Histories of Alexander the Great".

For castra, the first co-occurrent in Livy is hostis (z-test 22.72), before castra (z-test 10.18), ad (z-test 10.85), in (z-test 8.21), impetus (z-test 7.35) and que (z-test 5.86), while in Caesar, impetus does not appear and the scores of all other lemmas are lower except castra (z-test 15.15): hostis (8), ad (10.35), in (5.17), que (4.79).

Thus, our results suggest that CNNs manage to account for specificity, phrase structure, and co-occurrence networks.
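The z-tests quoted throughout this section are the usual lexicometric specificity scores. A minimal sketch under a normal (binomial) approximation follows; it is our own illustration of the standard calculation, not Hyperbase's exact implementation, and the function name and example counts are invented for the demonstration:

```python
import math

def specificity_z(k: int, n: int, K: int, N: int) -> float:
    """z-score of observing k occurrences of a word in a subcorpus of n
    tokens, given K occurrences in the whole corpus of N tokens.
    Positive z means the word is over-represented in the subcorpus."""
    p = K / N                    # expected relative frequency in the corpus
    mean = n * p                 # expected count in the subcorpus
    var = n * p * (1 - p)        # binomial variance
    return (k - mean) / math.sqrt(var)

# e.g. a word seen 30 times in a 10,000-token subcorpus,
# against 50 times in a 100,000-token corpus
print(round(specificity_z(30, 10_000, 50, 100_000), 2))  # → 11.18
```

A strongly positive score like this one is what the tables above report for, e.g., hostis in Livy; a negative score would mark under-representation.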

5 Conclusion

In a nutshell, Text Deconvolution Saliency is efficient on a wide range of corpora. By crossing statistical approaches with neural networks, we propose a new strategy for automatically detecting complex linguistic observables which, until now, were hardly detectable by frequency-based methods. Recall that the linguistic material and the topology recovered by our TDS cannot be due to chance: the zones of activation make it possible to obtain recognition rates of more than 91% on the French political speeches and 93% on the Latin corpus, both rates equivalent to or higher than those obtained by the statistical computation of key passages. Improving the model and understanding all the mathematical and linguistic outcomes remains an important goal. In future work, we intend to thoroughly study the impact of TDS given morphosyntactic information.

References

Heike Adel and Hinrich Schütze. 2017. Global normalization of convolutional neural networks for joint entity and relation classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1723–1729.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, New York, NY, USA. ACM.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941.

Ronen Feldman and James Sanger. 2007. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York.

Hyperbase. Web-based toolbox for linguistic analysis. http://hyperbase.unice.fr.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

L. Lebart, A. Salem, and L. Berry. 1998. Exploring Textual Data. Springer.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

S. Mellet and D. Longrée. 2009. Syntactical motifs and textual structures. Belgian Journal of Linguistics, 23, pages 161–173.

Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.

A Supplemental Material

In order to make our experiments reproducible, we detail here all the hyperparameters used in our architecture. The neural network is written in Python with the Keras library (with TensorFlow as backend).

The embedding uses a Word2Vec implementation provided by the gensim library. Here we use the SkipGram model with a window size of 10 words and output vectors of 128 values (the embedding dimension).

The textual data are tokenized by a home-made tokenizer (which works on English, Latin, and French). The corpus is split into sequences of 50 words (punctuation is kept), and each word is converted into a unique vector of 128 values.
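The sequence-splitting step can be sketched in plain Python as follows; the helper name is our own, not the authors' tokenizer, and dropping the trailing fragment is one possible choice (padding it would be another):

```python
def split_sequences(tokens, seq_len=50):
    """Cut a token stream into fixed-length sequences of seq_len words;
    any trailing fragment shorter than seq_len is dropped here."""
    return [tokens[i:i + seq_len]
            for i in range(0, len(tokens) - seq_len + 1, seq_len)]

tokens = ["w%d" % i for i in range(120)]   # 120 dummy tokens
seqs = split_sequences(tokens)
print(len(seqs), len(seqs[0]))  # 2 50
```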

The first layer of our model takes the text sequence and maps each word to its Word2Vec vector, using weights initialized from our Word2Vec values. These weights remain trainable during model training.

The second layer is the convolution, a Conv2D in Keras, with 512 filters of size 3 × 128 (filtering three words at a time) and a ReLU activation. It is followed by max pooling (MaxPooling2D).

(The deconvolution model is identical up to this point. We replace the rest of the classification model (the Dense layers) with a transposed convolution (Conv2DTranspose).)

The last layers of the model are Dense layers: one hidden layer of 100 neurons with a ReLU activation, and one final layer whose size equals the number of classes, with a softmax activation.
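Assembled in Keras, the classification branch described above can be sketched as follows. The vocabulary size, number of classes, and pooling window are illustrative assumptions on our part, and in practice the embedding matrix would be initialized from the Word2Vec weights:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, EMB_DIM = 50, 128
VOCAB_SIZE, N_CLASSES = 5000, 2   # placeholders, corpus-dependent

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    # embedding weights come from Word2Vec and stay trainable
    layers.Embedding(VOCAB_SIZE, EMB_DIM),
    layers.Reshape((SEQ_LEN, EMB_DIM, 1)),            # add a channel axis for Conv2D
    layers.Conv2D(512, (3, EMB_DIM), activation="relu"),  # 512 filters of size 3 x 128
    layers.MaxPooling2D(pool_size=(SEQ_LEN - 2, 1)),  # global max over time (our choice)
    layers.Flatten(),
    layers.Dense(100, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(model.output_shape)  # (None, 2)
```

For the deconvolution variant, the two Dense layers at the end would be replaced by a Conv2DTranspose layer, as noted above.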

All experiments in this paper share the same architecture and the same hyperparameters, and are trained with a cross-entropy loss (with an Adam optimizer), using 90% of the dataset for training and 10% for validation. All the tests in this paper are done with new data not included in the original dataset.