AINL 2016: Maraev
Transcript of AINL 2016: Maraev
Character-level Convolutional Neural Network for Sentence Paraphrase Detection
Vladislav Maraev, NLX-Group, Faculty of Sciences, University of Lisbon
Paraphrase detection for Russian workshop, AINL FRUCT 2016
Objective
What: Task 2 — binary classification (paraphrase/non-paraphrase).
How: Apply a convolutional neural network (CNN) architecture:

                       Standard   Non-standard
Word embeddings           ✓            ✓
Character embeddings      ✓
Vladislav Maraev AINL FRUCT 2016 Paraphrase detection for Russian workshop (10.11.2016) 2 / 16
Related work
Convolutional neural networks in NLP
• Detecting semantically equivalent questions with CNN and word embeddings (Bogdanova et al., 2015)
• Convolutional neural networks for sentence classification (Zhang and Wallace, 2015)
• Attention-based CNN for modeling sentence pairs (Yin et al., 2016)
• Character embeddings for text classification (Zhang et al., 2015)
• Word + character embeddings for sentiment analysis (dos Santos and Gatti, 2014)
How does the CNN work?
Pipeline: TR → CONV → POOL → cosine similarity
Steps:
1. Token representation (Embedding)
2. Convolution
3. Pooling
4. Pair similarity estimation
Convolutional Neural Network — 1. Token representation
Input

s = {t_1, t_2, ..., t_N}

Token representation

r_t = W_0 v_t ,    (1)

where
• W_0 ∈ R^(d×V) is an embedding matrix
• v_t ∈ R^V is a one-hot encoded vector of size V

Output

s_TR = {r_{t_1}, r_{t_2}, ..., r_{t_N}} , where r_{t_n} ∈ R^d
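The lookup in Eq. (1) can be sketched in NumPy. The sizes here are toy values, not the run's hyperparameters; the point is that multiplying W_0 by a one-hot vector just selects one column of the embedding matrix:

```python
import numpy as np

d, V = 4, 10                                # toy embedding size and vocabulary size
rng = np.random.default_rng(0)
W0 = rng.uniform(-0.1, 0.1, size=(d, V))    # embedding matrix W_0

def token_repr(token_id):
    # r_t = W_0 v_t, with v_t a one-hot vector of size V
    v = np.zeros(V)
    v[token_id] = 1.0
    return W0 @ v

# Multiplying by a one-hot vector selects a column of W_0:
assert np.allclose(token_repr(3), W0[:, 3])

sentence = [2, 5, 3]                                 # token ids t_1 .. t_N
s_TR = np.stack([token_repr(t) for t in sentence])   # N x d matrix
print(s_TR.shape)  # (3, 4)
```

In practice the lookup is done by indexing (e.g. a Keras Embedding layer) rather than an explicit matrix-vector product, but the result is the same.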
Convolutional Neural Network — 2. Convolution

Convolution
1. Form the concatenation z_n of each k-gram of token representations
2. Multiply by W_1, add bias b_1, and apply the tanh function:

r_{z_n} = tanh(W_1 z_n + b_1)

where:
• z_n ∈ R^(dk)
• W_1 ∈ R^(c_lu × dk)
• r_{z_n} ∈ R^(c_lu)
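A minimal sketch of the convolution step, again with toy sizes: each window of k consecutive token vectors is flattened into z_n and passed through the affine map plus tanh:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, c_lu = 4, 2, 6                     # toy sizes: embedding dim, k-gram size, filter size
N = 5                                    # sentence length
s_TR = rng.normal(size=(N, d))           # token representations r_t, one row per token
W1 = rng.normal(size=(c_lu, d * k))      # convolution weights
b1 = np.zeros(c_lu)                      # convolution bias

# For each position n, concatenate the k-gram into z_n in R^{dk}
# and compute r_{z_n} = tanh(W_1 z_n + b_1):
conv = []
for n in range(N - k + 1):
    z_n = s_TR[n:n + k].reshape(-1)      # concatenation of k token vectors
    conv.append(np.tanh(W1 @ z_n + b1))
conv = np.stack(conv)                    # (N - k + 1) x c_lu
print(conv.shape)  # (4, 6)
```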
Convolutional Neural Network — 3. Pooling

Sum (or max) over all the r_{z_n} element-wise and apply the tanh function:

r_s = tanh( Σ_n r_{z_n} )

which gives us the sentence representation r_s ∈ R^(c_lu).

* This means that the sentence representation doesn't depend on sentence length.
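The length-independence point can be checked directly: whatever the number of k-gram windows, pooling collapses them into a vector of fixed size c_lu. A small sketch (random stand-ins for the convolution outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
c_lu = 6
# Convolution outputs for two sentences of different lengths:
conv_short = rng.normal(size=(3, c_lu))    # 3 k-gram windows
conv_long = rng.normal(size=(12, c_lu))    # 12 k-gram windows

def pool(conv, mode="sum"):
    # Collapse all r_{z_n} into a fixed-size sentence vector r_s
    if mode == "sum":
        return np.tanh(conv.sum(axis=0))   # r_s = tanh(sum_n r_{z_n})
    return conv.max(axis=0)                # max-pooling variant (used in the runs)

# Both sentences map to vectors of the same size c_lu:
assert pool(conv_short).shape == pool(conv_long).shape == (c_lu,)
```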
Convolutional Neural Network — 4. Compute similarity

Pipeline: TR → CONV → POOL → cosine similarity

Estimate the similarity between the pair of sentence representations using the cosine measure:

similarity = (r_{s1} · r_{s2}) / (‖r_{s1}‖ ‖r_{s2}‖)
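The cosine measure is a one-liner; identical vectors score 1, opposite vectors −1, orthogonal vectors 0:

```python
import numpy as np

def cosine_similarity(r1, r2):
    # (r_{s1} . r_{s2}) / (||r_{s1}|| ||r_{s2}||)
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

a = np.array([1.0, 0.0, 1.0])
assert np.isclose(cosine_similarity(a, a), 1.0)    # identical -> 1
assert np.isclose(cosine_similarity(a, -a), -1.0)  # opposite -> -1
assert np.isclose(cosine_similarity(a, np.array([0.0, 1.0, 0.0])), 0.0)  # orthogonal -> 0
```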
Training the network
We train W_0, W_1 and b_1.
Steps
1. Compute the mean-squared error between the cosine similarity and the gold label
2. Use the backpropagation algorithm (SGD/RMSProp) to compute gradients of the network
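Putting the four steps together, one forward pass and the per-pair squared error can be sketched as below (toy sizes and sum pooling; an assumption of this sketch, since the actual runs used max pooling and RMSProp in Keras, which would also handle the gradient step):

```python
import numpy as np

rng = np.random.default_rng(3)
d, V, k, c_lu = 4, 20, 2, 6
W0 = rng.uniform(-0.1, 0.1, size=(d, V))         # trainable embedding matrix
W1 = rng.normal(scale=0.1, size=(c_lu, d * k))   # trainable convolution weights
b1 = np.zeros(c_lu)                              # trainable bias

def sentence_repr(token_ids):
    # TR -> CONV -> POOL for one sentence
    s_TR = W0[:, token_ids].T                    # embedding lookup, N x d
    windows = [s_TR[n:n + k].reshape(-1) for n in range(len(token_ids) - k + 1)]
    conv = np.tanh(np.stack(windows) @ W1.T + b1)
    return np.tanh(conv.sum(axis=0))             # sum pooling

def cosine(r1, r2):
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

# One training pair with a gold label (1 = paraphrase, 0 = not):
sim = cosine(sentence_repr([1, 4, 7]), sentence_repr([1, 4, 9]))
label = 1.0
loss = (sim - label) ** 2                        # squared error for this pair
print(round(loss, 4))
```

Backpropagation then pushes the gradient of this loss through the tanh, convolution, and embedding lookup to update W_0, W_1 and b_1.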
Several convolutional filters
Standard Run Hyperparameters — Word embeddings

Parameter        Value             Description
k                {3, 5, 8, 12}     Sizes of k-grams
c_lu             100               Size of each convolutional filter
d                300               Size of word representation
epochs           5                 Number of training epochs
pooling          MAX               Pooling layer function
optimiser        RMSProp           Keras's optimiser
word embeddings  Random (uniform)
Sentences were tokenised and lowercased using Keras.
Standard Run Hyperparameters — Character embeddings

Parameter         Value                  Description
k                 {2, 3, 5, 7, 9, 11}    Sizes of k-grams
c_lu              100                    Size of each convolutional filter
d                 100                    Size of character representation
epochs            20                     Number of training epochs
pooling           MAX                    Pooling layer function
optimiser         RMSProp                Keras's optimiser
char. embeddings  Random (uniform)
Characters were lowercased, non-word characters were removed.
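The character preprocessing can be sketched as below. The exact definition of "non-word characters" is not given in the slides, so the `\w` rule here is an assumption (it keeps Unicode letters and digits, including Cyrillic, and drops punctuation and whitespace):

```python
import re

def preprocess_chars(sentence):
    # Lowercase, then keep only word characters.
    # Assumption: "non-word" means anything \w does not match
    # (Python's re is Unicode-aware, so Cyrillic letters are kept).
    return "".join(re.findall(r"\w", sentence.lower()))

print(preprocess_chars("Привет, мир! 42"))  # приветмир42
```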
Non-Standard Run Hyperparameters
Parameter        Value     Description
k                3         Size of k-gram
c_lu             300       Size of convolutional filter
d                300       Size of word representation
epochs           5         Number of training epochs
pooling          MAX       Pooling layer function
optimiser        RMSProp   Keras's optimiser
word embeddings  RusVectores trained on the Russian National Corpus (Kutuzov and Andreev, 2015)

Input sentences were tokenised, lemmatised and PoS-tagged with MyStem (Segalovich, 2003).
Main results
Run           System            Accuracy   F1
Standard      NLX (characters)  72.74      78.80
Standard      NLX (words)       66.19      76.44
Non-standard  NLX (words)       69.94      76.80
              BASELINE          49.66      54.03
Discussion
1. The result for the standard run is competitive with the best system, and could be further improved by tuning the hyperparameters automatically and by picking the test epoch automatically, based on the validation results.
2. Surprisingly, the standard run outperformed the non-standard one, even though the non-standard run used external resources for lemmatisation and for initialising the word embeddings. (This is probably due to a higher focus on the standard run.)
Next? An attention-based CNN (Yin et al., 2016), and a combination of character and word embeddings (dos Santos and Gatti, 2014).
References
Dasha Bogdanova, Cícero dos Santos, Luciano Barbosa, and Bianca Zadrozny. Detecting semantically equivalent questions in online user forums. In Proceedings of CoNLL 2015, page 123, 2015.
Cícero dos Santos and Maíra Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics.
Andrey Kutuzov and Igor Andreev. Texts in, meaning out: neural language models in semantic similarity task for Russian. In Proceedings of the Dialog Conference, 2015.
Ilya Segalovich. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA, pages 273–280. Citeseer, 2003.
Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4:259–272, 2016. ISSN 2307-387X.
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.