
Transcript of arXiv:1809.00800v1 [cs.CL] 4 Sep 2018

    Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions

    Sho Yokoi 1,2  Sosuke Kobayashi 3  Kenji Fukumizu 4  Jun Suzuki 1,2  Kentaro Inui 1,2
    1 Tohoku University  2 RIKEN Center for Advanced Intelligence Project

    {yokoi,jun.suzuki,inui}@ecei.tohoku.ac.jp
    3 Preferred Networks, Inc.  4 The Institute of Statistical Mathematics

    [email protected] [email protected]

    Abstract

    In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). As well as deriving PMI from mutual information, we derive this new measure from the Hilbert–Schmidt independence criterion (HSIC); thus, we call the new measure the pointwise HSIC (PHSIC). PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics (e.g., sentence embeddings) to be plugged in as kernels. Moreover, PHSIC can be estimated by simple and fast (linear in the size of the data) matrix calculations regardless of whether we use linear or nonlinear kernels. Empirically, in a dialogue response selection task, PHSIC is learned thousands of times faster than an RNN-based PMI while outperforming PMI in accuracy. In addition, we also demonstrate that PHSIC is beneficial as a criterion of a data selection task for machine translation owing to its ability to give high (low) scores to a consistent (inconsistent) pair with other pairs.

    1 Introduction

    Computing the co-occurrence strength between two linguistic expressions is a fundamental task in natural language processing (NLP). For example, in collocation extraction (Manning and Schütze, 1999), word bigrams are collected from corpora and then strongly co-occurring bigrams (e.g., “New York”) are found. In dialogue response selection (Lowe et al., 2015), pairs comprising a context and its response sentence are collected from dialogue corpora and the goal is to rank the candidate responses for each given context sentence. In either case, a set of linguistic expression pairs D = {(x_i, y_i)}_{i=1}^{n} is first collected and then the co-occurrence strength of a (new) pair (x, y) is computed.

    Measure | Estimator | Defined in | Robustness to Sparsity | Learning Time
    PMI | log [ n · c(x, y) / ( Σ_{y'} c(x, y') · Σ_{x'} c(x', y) ) ] | Eq. 1 |  | ✓
    PMI | log [ P̂_RNN(y|x) / P̂_RNN(y) ] | Eq. 2 | ✓ |
    PHSIC | (φ(x) − φ̄)ᵀ ĈXY (ψ(y) − ψ̄) | Sec. 5.1 | ✓ | ✓
    PHSIC | (a − ā)ᵀ Ĉ_ICD (b − b̄) | Sec. 5.2 | ✓ | ✓

    Table 1: The proposed co-occurrence norm, PHSIC, eliminates the trade-off between robustness to data sparsity and learning time, which PMI has (Section 1).

    Pointwise mutual information (PMI) (Church and Hanks, 1989) is frequently used to model the co-occurrence strength of linguistic expression pairs. There are two typical types of PMI estimation (computation) method. One is a counting-based estimator using maximum likelihood estimation, sometimes with smoothing techniques, for example,

    P̂MI_MLE(x, y; D) = log [ n · c(x, y) / ( Σ_{y'} c(x, y') · Σ_{x'} c(x', y) ) ],   (1)

    where c(x, y) denotes the frequency of the pair (x, y) in given data D. This is easy to compute and is commonly used to measure co-occurrence between words, such as in collocation extraction¹; however, when data D is sparse, i.e., when x or y is a phrase or sentence, this approach is unrealistic. The second method uses recurrent neural networks (RNNs).

    1 In collocation extraction, simple counting c(x, y) ∝ P̂(x, y), rather than PMI, ranks undesirable function-word pairs (e.g., “of the”) higher (Manning and Schütze, 1999).


    Li et al. (2016) proposed to employ PMI to suppress dull responses for utterance generation in dialogue systems². They estimated P(y) and P(y|x) using RNN language models and estimated PMI as follows:

    P̂MI_RNN(x, y; D) = log [ P̂_RNN(y|x) / P̂_RNN(y) ].   (2)

    This way of estimating PMI is applicable to sparse language expressions; however, learning RNN language models is computationally costly.
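    To make the counting-based estimator in Equation (1) concrete, here is a minimal sketch (ours, not the authors' code), assuming the data D is simply a list of (x, y) string pairs; it returns None when any count is zero, which is exactly where the sparsity problem described above bites.

```python
import math
from collections import Counter

def pmi_mle(pair, data):
    """Counting-based PMI of Equation (1) from a list of observed (x, y) pairs."""
    n = len(data)
    joint = Counter(data)                       # c(x, y)
    x_marginal = Counter(x for x, _ in data)    # sum_{y'} c(x, y')
    y_marginal = Counter(y for _, y in data)    # sum_{x'} c(x', y)
    x, y = pair
    if joint[pair] == 0 or x_marginal[x] == 0 or y_marginal[y] == 0:
        return None                             # undefined for unseen events (data sparsity)
    return math.log(n * joint[pair] / (x_marginal[x] * y_marginal[y]))

bigrams = [("new", "york"), ("new", "york"), ("new", "car"), ("of", "the")]
print(pmi_mle(("new", "york"), bigrams))        # > 0: stronger than chance
```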

    To eliminate this trade-off between robustness to data sparsity and learning time, in this study we propose a new kernel-based co-occurrence measure, which we call the pointwise Hilbert–Schmidt independence criterion (PHSIC) (see Table 1). Our contributions are as follows:

    • We formalize PHSIC, which is derived from HSIC (Gretton et al., 2005), a kernel-based dependence measure, in the same way that PMI is derived from mutual information (Section 3).

    • We give an intuitive explanation of why PHSIC is robust to data sparsity. PHSIC is a “smoothed variant of PMI”, which allows various similarity metrics to be plugged in as kernels (Section 4).

    • We propose fast estimators of PHSIC, which reduce to a simple and fast matrix calculation regardless of whether we use linear or nonlinear kernels (Section 5).

    • We empirically confirmed the effectiveness of PHSIC, i.e., its robustness to data sparsity and learning time, in two different types of experiment: a dialogue response selection task and a data selection task for machine translation (Section 6).

    2 Problem Setting

    Let X and Y denote random variables on 𝒳 and 𝒴, respectively. In this paper, we deal with the tasks of taking a set of linguistic expression pairs

    D = {(x_i, y_i)}_{i=1}^{n} ~ i.i.d. P_XY,   (3)

    which is regarded as a set of i.i.d. samples drawn from a joint distribution P_XY, and then measuring the “co-occurrence strength” for each given pair (x, y) ∈ 𝒳 × 𝒴. Such tasks include collocation extraction and dialogue response selection (Section 1).

    2 In dialogue response selection or generation, a simple conditional probability P̂(y|x), rather than PMI, ranks dull responses (e.g., “I don’t know.”) higher (Li et al., 2016).

    3 Pointwise HSIC

    In this section, we give the formal definition of PHSIC, a new kernel-based co-occurrence measure. We show a summary of this section in Table 2. Intuitively, PHSIC is a “kernelized variant of PMI.”

    3.1 Dependence Measure

    As a preliminary step, we introduce the simple concept of dependence (see Dependence Measure in Table 2). Recall that random variables X and Y are independent if and only if the joint probability density P_XY and the product of the marginals P_X P_Y are equivalent. Therefore, we can measure the dependence between random variables X and Y via the difference between P_XY and P_X P_Y.

    Both the mutual information and the Hilbert–Schmidt independence criterion, to be described below, are such dependence measures.

    3.2 MI and PMI

    We briefly review the well-known mutual information and PMI (see MI & PMI in Table 2).

    The mutual information (MI)³ between two random variables X and Y is defined by

    MI(X, Y) := KL[P_XY ‖ P_X P_Y]   (4)

    (Cover and Thomas, 2006), where KL[·‖·] denotes the Kullback–Leibler (KL) divergence. Thus, MI(X, Y) is the degree of dependence between X and Y measured by the KL divergence between P_XY and P_X P_Y.

    Here, by definition of the KL divergence, MI can be represented in the form of the expectation over P_XY, i.e., the summation over all possible pairs (x, y) ∈ 𝒳 × 𝒴:

    MI(X, Y) = E_{(x,y)} [ log ( P_XY(x, y) / ( P_X(x) P_Y(y) ) ) ].   (5)

    The shaded part in Equation (5) (the term inside the expectation) is actually the pointwise mutual information (PMI) (Church and Hanks, 1989):

    PMI(x, y; X, Y) := log ( P_XY(x, y) / ( P_X(x) P_Y(y) ) ).   (6)

    Therefore, PMI(x, y) can be thought of as the contribution of (x, y) to MI(X, Y).

    3 Conventionally, mutual information is denoted by I(X; Y); in this paper, however, for notational consistency, mutual information is denoted by MI(X, Y).

    Dependence Measure: the dependence between X and Y (the difference between P_XY and P_X P_Y).
    Co-occurrence Measure: the contribution of (x, y) to the dependence between X and Y.

    MI & PMI:
      MI(X, Y) = KL[P_XY ‖ P_X P_Y] = E_{(x,y)} [ log ( P_XY(x, y) / ( P_X(x) P_Y(y) ) ) ]
      PMI(x, y; X, Y) = log ( P_XY(x, y) / ( P_X(x) P_Y(y) ) )

    HSIC & PHSIC:
      HSIC(X, Y; k, ℓ) = MMD²_{k,ℓ}[P_XY, P_X P_Y]
                       = E_{(x,y)} [ (φ(x) − m_X)ᵀ C_XY (ψ(y) − m_Y) ]
                       = E_{(x,y)} [ E_{(x',y')} [ k̃(x, x') ℓ̃(y, y') ] ]
      PHSIC(x, y; X, Y, k, ℓ) = (φ(x) − m_X)ᵀ C_XY (ψ(y) − m_Y)
                              = E_{(x',y')} [ k̃(x, x') ℓ̃(y, y') ]

    Table 2: Relationship between the mutual information (MI), the pointwise mutual information (PMI), the Hilbert–Schmidt independence criterion (HSIC), and the pointwise HSIC (PHSIC). As well as defining PMI as the contribution to MI, we define PHSIC as the contribution to HSIC. In short, PHSIC is a “kernelized PMI” (Section 3).

    3.3 HSIC and PHSIC

    As seen in the previous section, PMI can be derived from MI. Here, we consider replacing MI with the Hilbert–Schmidt independence criterion (HSIC). Then, in analogy with the relationship between PMI and MI, we derive PHSIC from HSIC (see HSIC & PHSIC in Table 2).

    Let k : 𝒳 × 𝒳 → R and ℓ : 𝒴 × 𝒴 → R denote positive definite kernels on 𝒳 and 𝒴, respectively (intuitively, they are similarity functions between linguistic expressions). The Hilbert–Schmidt independence criterion (HSIC) (Gretton et al., 2005), a kernel-based dependence measure, is defined by

    HSIC(X, Y; k, ℓ) := MMD²_{k,ℓ}[P_XY, P_X P_Y],   (7)

    where MMD[·, ·] denotes the maximum mean discrepancy (MMD) (Gretton et al., 2012), which measures the difference between random variables on a kernel-induced feature space. Thus, HSIC(X, Y; k, ℓ) is the degree of dependence between X and Y measured by the MMD between P_XY and P_X P_Y, whereas MI measures it by the KL divergence (Equation (4)).

    Analogous to MI in Equation (5), HSIC can be represented in the form of an expectation over P_XY by a simple rearrangement:

    HSIC(X, Y; k, ℓ)
      = E_{(x,y)} [ (φ(x) − m_X)ᵀ C_XY (ψ(y) − m_Y) ]   (8)
      = E_{(x,y)} [ E_{(x',y')} [ k̃(x, x') ℓ̃(y, y') ] ],   (9)

    where

    φ(x) := k(x, ·),   ψ(y) := ℓ(y, ·),   (10)
    m_X := E_x[φ(x)],   m_Y := E_y[ψ(y)],   (11)
    C_XY := E_{(x,y)} [ (φ(x) − m_X)(ψ(y) − m_Y)ᵀ ],   (12)
    k̃(x, x') := k(x, x') − E_{x'}[k(x, x')] − E_x[k(x, x')] + E_{x,x'}[k(x, x')].   (13)

    At first glance, these equations are somewhat complicated; however, the estimators of PHSIC we actually use are reduced to a simple matrix calculation in Section 5. Unlike MI in Equation (5), HSIC has two representations: Equation (8) is the representation in feature space and Equation (9) is the representation in data space.

    Similar to the relationship between MI and PMI (Section 3.2), we define the pointwise Hilbert–Schmidt independence criterion (PHSIC) by the shaded parts in Equations (8) and (9):

    PHSIC(x, y; X, Y, k, ℓ)
      := (φ(x) − m_X)ᵀ C_XY (ψ(y) − m_Y)   (14)
      = E_{(x',y')} [ k̃(x, x') ℓ̃(y, y') ].   (15)

    Namely, PHSIC(x, y) is defined as the contribution of (x, y) to HSIC(X, Y).

    In summary, we define PHSIC such that “MI : PMI = HSIC : PHSIC” holds (see Table 2).

    4 PHSIC as Smoothed PMI

    This section gives an intuitive explanation for the first feature of PHSIC, i.e., the robustness to data sparsity, using Table 3. In short, we show that PHSIC is a “smoothed variant of PMI.”

    Estimator | How the given (x, y) is matched against each observed (x_i, y_i) in D
    P̂MI(x, y; D), Equation (16) | exact matching: add scores when x = x_i and y = y_i; deduct scores when only one of the two sides matches
    P̂HSIC(x, y; D, k, ℓ), Equation (17) | smoothed matching: add scores when x ≈ x_i and y ≈ y_i; deduct scores when one side is similar and the other is not

    Table 3: Comparison of estimators of PMI and PHSIC in terms of methods of matching the given (x, y) and the observed (x_i, y_i) in D. PMI matches them in an exact manner, while PHSIC smooths the matching using kernels. Therefore, PHSIC is expected to be robust to data sparsity (Section 4).

    First, the maximum likelihood estimator of PMI in Equation (1) can be rewritten as

    P̂MI(x, y; D) = log [ n · Σ_i I[x = x_i ∧ y = y_i] / ( Σ_i I[x = x_i] · Σ_i I[y = y_i] ) ],   (16)

    where I[condition] = 1 if the condition is true and I[condition] = 0 otherwise. According to Equation (16), P̂MI(x, y) is the amount computed by repeating the following operation (see the first row in Table 3):

    collate the given (x, y) and the observed (x_i, y_i) in D in order, and add the scores if (x, y) and (x_i, y_i) match exactly, or deduct the scores if either the x side or the y side (but not both) matches.

    Moreover, an estimator of PHSIC in data space (Equation (15)) is

    P̂HSIC(x, y; D, k, ℓ) = (1/n) Σ_i k̂̃(x, x_i) ℓ̂̃(y, y_i),   (17)

    where k̂̃(·, ·) and ℓ̂̃(·, ·) are similarity functions centered on the data⁴. According to Equation (17), P̂HSIC(x, y) is the amount computed by repeating the following operation (see the second row in Table 3):

    collate the given (x, y) and the observed (x_i, y_i) in D in order, and add the scores if the similarities on the x and y sides are both high (both k̂̃(x, x_i) > 0 and ℓ̂̃(y, y_i) > 0 hold)⁵, or deduct the scores if the similarity is high on one side but not on the other.

    4 To be exact, k̂̃(x, x') := k(x, x') − (1/n) Σ_{j=1}^{n} k(x, x_j) − (1/n) Σ_{i=1}^{n} k(x_i, x') + (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} k(x_i, x_j), which is an estimator of the centered kernel k̃(x, x') in Equation (13).
    5 In addition, the scores are added if the similarity on the x side and that on the y side are both low, that is, if k̂̃(x, x_i) < 0 and ℓ̂̃(y, y_i) < 0 hold.

    As described above, when comparing the estimators of PMI and PHSIC from the viewpoint of “methods of matching the given (x, y) and the observed (x_i, y_i),” it is understood that PMI matches them in an exact manner, while PHSIC smooths the matching using kernels (similarity functions).

    With this mechanism, even for completely unknown pairs, it is possible to estimate the co-occurrence strength by referring to observed pairs through the kernels. Therefore, PHSIC is expected to be robust to data sparsity and can be applied to phrases and sentences.
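    As a concrete (if naive) illustration of Equation (17) and footnote 4, the following sketch, which is ours rather than code from the paper, scores a possibly unseen pair (x, y) by centering the two kernels on the data; averaging this score over the observed pairs themselves recovers the usual biased empirical HSIC, which is the sense in which PHSIC is a per-pair contribution to HSIC. This direct version needs O(n²) kernel evaluations; the estimators in Section 5 avoid that cost.

```python
import numpy as np

def phsic_data_space(x, y, xs, ys, k, l):
    """PHSIC estimate of Equation (17) for a (possibly unseen) pair (x, y).

    xs, ys are the observed pairs (x_i, y_i); k and l are kernel functions
    (e.g., cosine between sentence vectors).  O(n^2) kernel evaluations.
    """
    K = np.array([[k(a, b) for b in xs] for a in xs])   # K_ij = k(x_i, x_j)
    L = np.array([[l(a, b) for b in ys] for a in ys])
    kx = np.array([k(x, xi) for xi in xs])              # k(x, x_i)
    ly = np.array([l(y, yi) for yi in ys])

    # centered kernels of footnote 4 (empirical version of Equation (13))
    k_cent = kx - kx.mean() - K.mean(axis=0) + K.mean()
    l_cent = ly - ly.mean() - L.mean(axis=0) + L.mean()
    return float(np.mean(k_cent * l_cent))              # (1/n) sum_i ...
```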

    Available Kernels for PHSIC  In NLP, a variety of similarity functions (i.e., positive definite kernels) are available. We can freely utilize such resources, for example cosine similarity between sentence embeddings. For a more detailed discussion, see Appendix A.

    5 Empirical Estimators of PHSIC

    Recall that we have two types of empirical estimator of PMI, the maximum likelihood estimator (Equation (1)) and the RNN-based estimator (Equation (2)). In this section, we describe how to rapidly estimate PHSIC from data. When using the linear kernel or cosine similarity (e.g., cosine similarity between sentence embeddings), PHSIC can be efficiently estimated in feature space (Section 5.1). When using a nonlinear kernel such as the Gaussian kernel, PHSIC can also be estimated efficiently in data space via a simple matrix decomposition (Section 5.2).

    5.1 Estimation Using Linear Kernel or Cosine

    When using the linear kernel or cosine similarity, the estimator of PHSIC in feature space (14) is as follows:

    P̂HSIC_feature(x, y; D, k, ℓ) = (φ(x) − φ̄)ᵀ ĈXY (ψ(y) − ψ̄),   (18)

    where

    φ(x) = x        if k(x, x') = xᵀx' (linear kernel),
    φ(x) = x/‖x‖    if k(x, x') = cos(x, x'),   (19)

    φ̄ := (1/n) Σ_{i=1}^{n} φ(x_i),   ψ̄ := (1/n) Σ_{i=1}^{n} ψ(y_i),   (20)

    ĈXY := (1/n) Σ_{i=1}^{n} φ(x_i) ψ(y_i)ᵀ − φ̄ ψ̄ᵀ.   (21)

    Generally, in kernel methods, a feature map φ(·) induced by a kernel k(·, ·) is unknown or high-dimensional and it is difficult to compute estimated values in feature space⁶. However, when we use the linear kernel or cosine similarity, the feature maps can be explicitly determined (Equation (19)).

    Computational Cost  When learning Equation (18) with feature maps φ : 𝒳 → R^d and ψ : 𝒴 → R^d, computing the vectors φ̄, ψ̄ ∈ R^d and the matrix ĈXY ∈ R^{d×d} takes O(nd²) time and O(nd) space (linear in the size of the input, n). When estimating PHSIC(x, y), computing φ(x), ψ(y) ∈ R^d and Equation (18) takes O(d²) time (constant; it does not depend on the size of the input, n).
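    The following is a minimal sketch of the feature-space estimator (Equations (18)–(21)) with the cosine kernel, assuming the sentences have already been encoded as the rows of two n × d matrices (e.g., summed fastText vectors); it is our illustration rather than the authors' released code, but it has the O(nd²) learning / O(d²) scoring profile described above.

```python
import numpy as np

class PhsicCosine:
    """PHSIC estimated in feature space with the cosine kernel (Eq. 18-21)."""

    def fit(self, X, Y):
        # feature map of Eq. 19 for the cosine kernel: phi(x) = x / ||x||
        Phi = X / np.linalg.norm(X, axis=1, keepdims=True)
        Psi = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        self.phi_mean = Phi.mean(axis=0)                          # Eq. 20
        self.psi_mean = Psi.mean(axis=0)
        # cross-covariance estimate of Eq. 21: O(n d^2) time, O(n d) space
        self.C = Phi.T @ Psi / len(X) - np.outer(self.phi_mean, self.psi_mean)
        return self

    def score(self, x, y):
        # Eq. 18: O(d^2) per pair, independent of n
        phi = x / np.linalg.norm(x)
        psi = y / np.linalg.norm(y)
        return float((phi - self.phi_mean) @ self.C @ (psi - self.psi_mean))

rng = np.random.default_rng(0)                    # stand-in embeddings
X, Y = rng.normal(size=(1000, 300)), rng.normal(size=(1000, 300))
print(PhsicCosine().fit(X, Y).score(X[0], Y[0]))
```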

    5.2 Estimation Using Nonlinear Kernels

    When using a nonlinear kernel such as the Gaussian kernel, it is necessary to estimate PHSIC in data space. Using a simple matrix decomposition, this can be achieved with the same computational cost as the estimation in feature space. See Appendix B for a detailed derivation.

    6 Experiments

    In this section, we provide empirical evidence for the greater effectiveness of PHSIC than PMI, i.e., a very short learning time and robustness to data sparsity. Among the many potential applications of PHSIC, we choose two fundamental scenarios, (re-)ranking/classification and data selection.

    • In the ranking/classification scenario (measuring the co-occurrence strength of new data pairs with reference to observed pairs), PHSIC is applied as a criterion for the dialogue response selection task (Section 6.2).

    • In the data selection/filtering scenario (ordering the entire set of observed data pairs according to the co-occurrence strength), PHSIC is also applied as a criterion for data selection in the context of machine translation (Section 6.3).

    6 One of the characteristics of kernel methods is that an intractable estimation in feature space is replaced with an efficient estimation in data space.

    6.1 PHSIC Settings

    To take advantage of recent developments in representation learning, we used several pre-trained models for encoding sentences into vectors and several kernels between these vectors for PHSIC.

    Encoders  As sentence encoders, we used two pre-trained models without fine-tuning. First, the sum of the word vectors effectively represents a sentence (Mikolov et al., 2013a):

    x = Σ_{w∈x} vec(w),   y = Σ_{w∈y} vec(w).   (22)

    For vec(·), we used the pre-trained fastText model⁷, which is a high-accuracy and popular word embedding model (Bojanowski et al., 2017); models in 157 languages are publicly distributed (Grave et al., 2018). Second, we also used a DNN-based sentence encoder, called the universal sentence encoder (Cer et al., 2018), which utilizes the deep averaging network (DAN) (Iyyer et al., 2015). The pre-trained model for English sentences we used is publicly available⁸.
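    For the first encoder, the encoding step of Equation (22) is just a sum of pre-trained word vectors; a small sketch (ours), where `word_vecs` is a hypothetical word-to-vector lookup loaded from the distributed fastText files, and out-of-vocabulary words are simply skipped (a simplification: real fastText can back off to subword vectors).

```python
import numpy as np

word_vecs = {}  # hypothetical lookup: word -> 300-dim pre-trained fastText vector

def encode(sentence, dim=300):
    """Sentence vector as the sum of its word vectors (Equation (22))."""
    vecs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)
```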

    Kernels  As kernels between these vectors, we used cosine similarity (cos)

    k(x, x') = cos(x, x')   (23)

    and the Gaussian kernel (also known as the radial basis function kernel; RBF kernel)

    k(x, x') = exp( −‖x − x'‖²₂ / (2σ²) ),   (24)

    and similarly for ℓ(y, y'). The experiments were run with hyperparameter σ = 1.0 for the RBF kernel and d = 100 for incomplete Cholesky decomposition (for more detail, see Appendix B).
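    Written out as code (a sketch of ours, using the σ = 1.0 setting stated above), the two kernels of Equations (23) and (24) are:

```python
import numpy as np

def cos_kernel(x, y):
    """Cosine similarity kernel, Equation (23)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, Equation (24)."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))
```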

    6.2 Ranking: Dialogue Response Selection

    In the first experiment, we applied PHSIC as a ranking criterion for the task of dialogue response selection (Lowe et al., 2015); in this task, pairs comprising a context (previous utterance sequence) and its response are collected from dialogue corpora and the goal is to rank the candidate responses for each given context sentence.

    7 https://fasttext.cc/docs/en/english-vectors.html, https://fasttext.cc/docs/en/crawl-vectors.html
    8 https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1

    The task entails sentence sequences (very sparse linguistic expressions); moreover, Li et al. (2016) pointed out that (RNN-based) PMI has a positive impact on suppressing dull responses (e.g., “I don’t know.”) in dialogue systems. Therefore, PHSIC, another co-occurrence measure, is also expected to be effective for this. With this setting, where the validity of PMI is confirmed, we investigate whether PHSIC can replace RNN-based PMI in terms of both learning time and robustness to data sparsity.

    Experimental Settings

    Dataset  For the training data, we gathered approximately 5 × 10⁵ reply chains from Twitter, following Sordoni et al. (2015)⁹. In addition, we randomly selected {10³, 10⁴, 10⁵} reply chains from that dataset. Using these small subsets, we confirmed the effect of the difference in the size of the training set (data sparseness) on the learning time and predictive performance.

    For validation and test data, we used a small (approximately 2000 pairs each) but highly reliable dataset created by Sordoni et al. (2015)¹⁰, which consists only of conversations given high scores by human annotators. Therefore, this set was not expected to include dull responses.

    For each dataset, we converted each context-message-response triple into a context-response pair by concatenating the context and message, following Li et al. (2016). In addition, to convert the test set (positive examples) into ten-choice multiple-choice questions, we shuffled the combinations of context and response to generate pseudo-negative examples.

    Evaluation Metrics  We adopted the following evaluation metrics for the task: (i) ROC-AUC (the area under the receiver operating characteristic curve), (ii) MRR (the mean reciprocal rank), and (iii) Recall@{1,2}.

    9 We collected tweets after 2017 for our training set to avoid duplication with the test set, which contains tweets from the year 2012.
    10 https://www.microsoft.com/en-us/download/details.aspx?id=52375

    Model | Config | Component | n = 10³ | n = 10⁴ | n = 10⁵ | n = 5×10⁵
    RNN-PMI | Dim. 300, Init. fastText | Total | 20.6 | 99.2 | 634.3 | 4042.5
    RNN-PMI | Dim. 300, Init. fastText | P̂(y) | 8.0 | 23.6 | 294.6 | 1710.1
    RNN-PMI | Dim. 300, Init. fastText | P̂(y|x) | 12.6 | 75.6 | 339.7 | 2332.4
    RNN-PMI | Dim. 1200, Init. fastText | Total | 49.0 | 162.0 | 1751.3 | 13054.9
    RNN-PMI | Dim. 1200, Init. fastText | P̂(y) | 16.3 | 57.2 | 671.0 | 5512.1
    RNN-PMI | Dim. 1200, Init. fastText | P̂(y|x) | 32.7 | 104.8 | 1080.3 | 7542.8
    PHSIC | Encoder fastText, Kernel cos | — | 0.0 | 0.1 | 0.5 | 2.8
    PHSIC | Encoder DAN, Kernel cos | — | 0.0 | 0.1 | 0.4 | 1.8

    Table 4: Learning time [s] for each model and each size of training set for the dialogue response task. Each row denotes a model; each column denotes the number of training data n. The text appended to each baseline model denotes the number of dimensions of the hidden layers (Dim.) and the method of initializing the embedding layer (Init.). The text appended to each proposed model denotes the pre-trained model used to encode sentences into vectors (Encoder) and the kernel between these vectors (Kernel). The best result (the shortest learning time) in each column is in bold.

    Experimental Procedure  We used the following procedure: (i) train the model with a set of context-response pairs D = {(x_i, y_i)}_{i=1}^{n}; (ii) for each context sentence x in the test data, rank the candidate responses {y_j}_{j=1}^{10} by the model; and (iii) report the three evaluation metrics.

    Baseline Measures  As baseline measures, (1) an RNN language model P̂_RNN(y) (Mikolov et al., 2010) and (2) a conditional RNN language model P̂_RNN(y|x) (Sutskever et al., 2014) were trained, and (3) PMI based on these language models, RNN-PMI, was also used in the experiments (see Equation (2)). We trained these models with all combinations of the following settings: (a) the number of dimensions of the hidden layers being 300 or 1200 and (b) the initialization of the embedding layer being random (uniform on [−0.1, 0.1]) or fastText. For more detailed settings, see Appendix C.

    Experimental Results

    Learning Time  Table 4 shows the experimental results of the learning time¹¹. Regardless of the size of the training set n, the learning time of PHSIC is much shorter than that of the RNN-based method. For example, even when the size of the training set n is 5 × 10⁵, PHSIC is approximately 1400–4000 times faster than RNN-based PMI. This is because the estimators of PHSIC are reduced to a deterministic and efficient matrix calculation (Section 5), whereas neural network-based models involve the sequential optimization of parameters via gradient descent methods.

    11 The computing environment was as follows: (i) CPU: Xeon E5-1650-v3 (3.5 GHz, 6 Cores); (ii) GPU: GTX 1080 (8 GB).

    Robustness to Data Sparsity  Table 5 shows the experimental results of the predictive performance. When the size of the training data is small (n = 10³, 10⁴), that is, when the data is extremely sparse, the predictive performance of PHSIC hardly deteriorates, while that of PMI rapidly decays as the number of data decreases. This indicates that PHSIC is more robust to data sparsity than RNN-based PMI owing to the effect of the kernels. Moreover, PHSIC with the simple cosine kernel outperforms the RNN-based model regardless of the number of data, while the learning time of PHSIC is thousands of times shorter than those of the baseline methods (Section 6.2).

    Additionally, we report Spearman’s rank correlation coefficient between models to verify whether PHSIC shows similar behavior to PMI. See Appendix D for more detail.

    6.3 Data Selection for Machine Translation

    The aim of our second experiment was to demonstrate that PHSIC is also beneficial as a criterion of data selection. To achieve this, we attempted to apply PHSIC to a parallel corpus filtering task that has been intensively discussed in recent (neural) machine translation (MT, NMT) studies. This task was first adopted as a shared task at the third conference on machine translation (WMT 2018)¹².

    Several existing parallel corpora, especially those automatically gathered from large-scale text data such as the Web, contain unacceptable amounts of noisy (low-quality) sentence pairs that greatly affect the translation quality. Therefore, the development of an effective method for parallel corpus filtering would potentially have a large influence on the MT community; discarding such noisy pairs may improve the translation quality and shorten the training time.

    We expect PHSIC to give low scores to exceptional sentence pairs (misalignments or missing translations) during the selection process because PHSIC assigns low scores to pairs that are highly inconsistent with other pairs (see Section 4). Note that applying RNN-based PMI to a parallel corpus selection task is unprofitable: obtaining RNN-based PMI has the same computational cost as training a sequence-to-sequence model for MT, and thus we cannot expect a reduction of the total training time.

    12 http://www.statmt.org/wmt18/parallel-corpus-filtering.html

    Experimental Settings

    Dataset  We used the ASPEC-JE corpus¹³, which is an official dataset used for the MT-evaluation shared task held at the fourth workshop on Asian translation (WAT 2017)¹⁴ (Nakazawa et al., 2017). ASPEC-JE consists of approximately three million (3M) Japanese–English parallel sentences from scientific paper abstracts. As discussed by Kocmi et al. (2017), ASPEC-JE contains many low-quality parallel sentences that have the potential to significantly degrade the MT quality. In fact, they empirically revealed that using only the reliable part of the training parallel corpus significantly improved the translation quality. Therefore, ASPEC-JE is a suitable dataset for evaluating the data selection ability.

    Model  For our data selection evaluation, we selected the Transformer architecture (Vaswani et al., 2017) as our baseline NMT model, which is widely used in the NMT community and known as one of the current state-of-the-art architectures. We utilized fairseq¹⁵, a publicly available tool for neural sequence-to-sequence models, for building our models.

    Experimental Procedure  We used the following procedure for this evaluation: (1) rank all parallel sentences in a given parallel corpus according to each criterion, (2) extract the top K ranked parallel sentences, (3) train the NMT model using the extracted parallel sentences, and (4) evaluate the translation quality on the test data using a typical automatic MT evaluation measure, i.e., BLEU (Papineni et al., 2002)¹⁶. In our experiments, we evaluated PHSIC with K = 0.5M and 1M.

    13 http://lotus.kuee.kyoto-u.ac.jp/ASPEC/
    14 http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2017/
    15 https://github.com/pytorch/fairseq
    16 We used multi-bleu.perl in the Moses tool (https://github.com/moses-smt/mosesdecoder).

    Model | Config | n = 10³ | n = 10⁴ | n = 10⁵ | n = 5×10⁵
    Chance Level | — | .50; .29; .10, .20 | .50; .29; .10, .20 | .50; .29; .10, .20 | .50; .29; .10, .20
    P̂_RNN(y) | Dim. 1200, Init. fastText | .50; .29; .10, .21 | .50; .30; .11, .21 | .50; .30; .10, .21 | .50; .30; .13, .25
    P̂_RNN(y|x) | Dim. 1200, Init. fastText | .50; .29; .10, .21 | .50; .30; .10, .21 | .52; .31; .11, .23 | .54; .32; .13, .25
    RNN-PMI | Dim. 300, Init. random | .51; .30; .10, .21 | .51; .30; .11, .22 | .58; .35; .14, .29 | .69; .46; .25, .42
    RNN-PMI | Dim. 300, Init. fastText | .51; .29; .09, .20 | .56; .34; .15, .25 | .66; .41; .20, .36 | .76; .56; .36, .54
    RNN-PMI | Dim. 1200, Init. random | .50; .29; .11, .20 | .51; .30; .10, .19 | .57; .35; .14, .29 | .70; .47; .26, .44
    RNN-PMI | Dim. 1200, Init. fastText | .51; .30; .11, .20 | .52; .32; .13, .23 | .65; .42; .21, .36 | .75; .54; .34, .52
    PHSIC | Encoder fastText, Kernel cos | .61; .38; .17, .33 | .62; .40; .19, .34 | .62; .40; .19, .34 | .62; .40; .19, .34
    PHSIC | Encoder DAN, Kernel cos | .77; .58; .40, .56 | .78; .57; .39, .56 | .78; .58; .41, .57 | .78; .58; .40, .57

    Table 5: Predictive performance for each model and each training set size for the dialogue response selection task: ROC-AUC; MRR; Recall@1,2. The best result in each column is in bold. The other notation is the same as in Table 4.

    Selection Criterion | K = 0.5M | K = 1M | K = 3M
    (all the training set) | — | — | 41.02
    Random | 34.26 | 39.82 | —
    fast_align | 38.63 | 40.56 | —
    PHSIC (Encoder fastText, Kernel RBF) | 38.95 | 40.95 | —

    Table 6: BLEU scores with the Transformer for each data selection criterion and each size of selected data K for the parallel corpus filtering task. “Random” represents the baseline method of selecting sentences at random.

    Baseline Measure  As a baseline measure, we utilized a publicly available script¹⁷ of fast_align (Dyer et al., 2013), which is one of the state-of-the-art word aligners. We first ran fast_align on the training set D = {(x_i, y_i)}_i to obtain the word alignment between each sentence pair (x_i, y_i), i.e., a set of aligned word pairs with their probabilities. We then computed the co-occurrence score of (x_i, y_i) with sentence-length normalization, i.e., the average log probability of the aligned word pairs.
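    As we read it, this baseline score is a length-normalized sum of log alignment probabilities. A sketch under that reading (ours; it assumes the aligner output has already been parsed into (source word, target word, probability) triples per sentence pair, and the parsing of fast_align's actual output format is not shown):

```python
import math

def alignment_score(aligned_pairs):
    """Average log probability of aligned word pairs for one sentence pair.

    aligned_pairs: list of (src_word, tgt_word, prob) triples produced by a
    word aligner such as fast_align (output parsing omitted here).
    """
    if not aligned_pairs:
        return float("-inf")   # no alignment at all: rank the pair last
    return sum(math.log(p) for _, _, p in aligned_pairs) / len(aligned_pairs)
```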

    Experimental Results

    Table 6 shows the results of our data selection evaluation. It is common knowledge in NMT that more data gives better performance in general. However, we observed that PHSIC successfully extracted beneficial parallel sentences from the noisy parallel corpus; the result using 1M data extracted from the 3M corpus by PHSIC was almost the same as that using 3M data (the decrease in the BLEU score was only 0.07), whereas random extraction reduced the BLEU score by 1.20.

    17 https://github.com/clab/fast_align

    This was actually a surprising result because PHSIC utilizes only monolingual similarity measures (kernels) without any other language resources. This indicates that PHSIC can be applied to a language pair poor in parallel resources. In addition, the surface forms and grammatical characteristics of English and Japanese are extremely different¹⁸; therefore, we expect that PHSIC will work well regardless of the similarity of the language pair.

    7 Related Work

    Dependence Measures  Measuring independence or dependence (correlation) between two random variables, i.e., estimating dependence from a set of paired data, is a fundamental task in statistics and a very wide area of data science. To measure the complex nonlinear dependence that real data has, we have several choices.

    First, information-theoretic MI (Cover and Thomas, 2006) and its variants (Suzuki et al., 2009; Reshef et al., 2011) are the most commonly used dependence measures. However, to the best of our knowledge, there is no practical method of computing MIs for large multi-class, high-dimensional (having a complex generative model) discrete data, such as sparse linguistic data.

    18 For example, word order: English is an SVO (subject-verb-object) language and Japanese is an SOV (subject-object-verb) language.

    Second, several kernel-based dependence measures have been proposed for measuring nonlinear dependence (Akaho, 2001; Bach and Jordan, 2002; Gretton et al., 2005). The reason why kernel-based dependence measures work well for real data is that they do not explicitly estimate densities, which is difficult for high-dimensional data. Among them, HSIC (Gretton et al., 2005) is popular because it has a simple estimation method, which is used for various tasks such as feature selection (Song et al., 2012), dimensionality reduction (Fukumizu et al., 2009), and unsupervised object matching (Quadrianto et al., 2009; Jagarlamudi et al., 2010). We follow this line.

    Co-occurrence Measures  First, in NLP, PMI (Church and Hanks, 1989) and its variants (Bouma, 2009) are the de facto co-occurrence measures between dense linguistic expressions, such as words (Bouma, 2009) and simple narrative-event expressions (Chambers and Jurafsky, 2008). In recent years, positive PMI (PPMI) has played an important role as a component of word vectors (Levy and Goldberg, 2014).

    Second, there are several studies in which the pairwise ranking problem has been solved by using deep neural networks (DNNs) in NLP. Li et al. (2016) proposed a PMI estimation using RNN language models; this was used as a baseline model in our experiments (see Section 6.2). Several studies have used DNN-based binary classifiers modeling P(C = positive | (x, y)) to solve the given ranking problem directly (Hu et al., 2014; Yin et al., 2016; Mueller and Thyagarajan, 2016) (these networks are sometimes called Siamese neural networks). Our study focuses on comparing co-occurrence measures. It is unknown whether Siamese NNs capture the co-occurrence strength; therefore, we did not deal with Siamese NNs in this paper.

    Finally, to the best of our knowledge, Yokoi et al. (2017)’s paper is the first study that suggested converting HSIC to a pointwise measure. The present study was inspired by their suggestion; here, we have (i) provided a formal definition (population) of PHSIC; (ii) analyzed the relationship between PHSIC and PMI; (iii) proposed linear-time estimation methods; and (iv) experimentally verified the computation speed and robustness to data sparsity of PHSIC for practical applications.

    8 Conclusion

    The NLP community has commonly employed PMI to estimate the co-occurrence strength between linguistic expressions; however, existing PMI estimators have a high computational cost when applied to sparse linguistic expressions (Section 1). We proposed a new kernel-based co-occurrence measure, the pointwise Hilbert–Schmidt independence criterion (PHSIC). As well as defining PMI as the contribution to mutual information, PHSIC is defined as the contribution to HSIC; PHSIC is intuitively a “kernelized variant of PMI” (Section 3). PHSIC can be applied to sparse linguistic expressions owing to the mechanism of smoothing by kernels. Comparing the estimators of PMI and PHSIC, PHSIC can be interpreted as a smoothed variant of PMI, which allows various similarity metrics to be plugged in as kernels (Section 4). In addition, PHSIC can be estimated in linear time owing to the efficient matrix calculation, regardless of whether we use linear or nonlinear kernels (Section 5). We conducted a ranking task for dialogue systems and a data selection task for machine translation (Section 6). The experimental results show that (i) the learning of PHSIC was completed thousands of times faster than that of the RNN-based PMI while outperforming it in ranking accuracy (Section 6.2); and (ii) even when using a nonlinear kernel, PHSIC can be applied to a large dataset. Moreover, PHSIC reduces the amount of training data to one third without sacrificing the output translation quality (Section 6.3).

    Future Work  Using the PHSIC estimator in feature space (Equation (18)), we can generate the most appropriate ψ(y) for a given φ(x) (uniquely, up to scale). That is, if a DNN-based sentence decoder is used, y (a sentence) can be restored from ψ(y) (a feature vector) so that generative models of strongly co-occurring sentences can be realized.

    Acknowledgments

    We are grateful to the anonymous reviewers for their helpful comments. We also thank Weihua Hu for useful discussions, Kenshi Yamaguchi for collecting data, and Paul Reisert for proofreading. This work was supported in part by JSPS KAKENHI Grant Number JP15H01702 and JST CREST Grant Number JPMJCR1513, Japan.

    References

    Shotaro Akaho. 2001. A kernel method for canonical correlation analysis. In IMPS, pages 1–7.
    Francis R. Bach and Michael I. Jordan. 2002. Kernel Independent Component Analysis. JMLR, 3(Jul):1–48.
    Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. TACL, 5:135–146.
    Gerlof Bouma. 2009. Normalized (Pointwise) Mutual Information in Collocation Extraction. In GSCL, pages 31–40.
    Razvan C. Bunescu and Raymond J. Mooney. 2006. Subsequence Kernels for Relation Extraction. In NIPS, pages 171–178.
    Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. CoRR, abs/1803.1.
    Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised Learning of Narrative Event Chains. In ACL, pages 789–797.
    Kenneth Ward Church and Patrick Hanks. 1989. Word Association Norms, Mutual Information, and Lexicography. In ACL, pages 76–83.
    Michael Collins and Nigel Duffy. 2002. Convolution Kernels for Natural Language. In NIPS, pages 625–632.
    Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory. John Wiley & Sons.
    Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In NIPS, pages 3079–3087.
    Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In NAACL-HLT, pages 644–648.
    Shai Fine and Katya Scheinberg. 2001. Efficient SVM Training Using Low-Rank Kernel Representations. JMLR, 2(Dec):243–264.
    Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. 2009. Kernel dimension reduction in regression. Annals of Statistics, 37(4):1871–1905.
    Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In LREC, pages 3483–3487.
    Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A Kernel Two-Sample Test. JMLR, 13(Mar):723–773.
    Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. Measuring Statistical Dependence with Hilbert-Schmidt Norms. In ALT, pages 63–77.
    Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning Distributed Representations of Sentences from Unlabelled Data. In NAACL-HLT, pages 1367–1377.
    Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
    Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In NIPS, pages 2042–2050.
    Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. In ACL/IJCNLP, pages 1681–1691.
    Jagadeesh Jagarlamudi, Seth Juarez, and Hal Daumé III. 2010. Kernelized Sorting for Natural Language Processing. In AAAI, pages 1020–1025.
    Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In NIPS, pages 3294–3302.
    Tom Kocmi, Dušan Variš, and Ondřej Bojar. 2017. CUNI NMT System for WAT 2017 Translation Tasks. In WAT, pages 154–159.
    Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In NIPS, pages 2177–2185.
    Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In NAACL-HLT, pages 110–119.
    Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In SIGDIAL, pages 285–294.
    Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119.
    Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013b. Efficient Estimation of Word Representations in Vector Space. In Proc. of the Workshop on ICLR, pages 1–12.
    Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, pages 1045–1048.
    Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34(8):1388–1429.
    Alessandro Moschitti. 2006. Making Tree Kernels Practical for Natural Language Learning. In EACL, volume 6, pages 113–120.
    Jonas Mueller and Aditya Thyagarajan. 2016. Siamese Recurrent Architectures for Learning Sentence Similarity. In AAAI, pages 2786–2792.
    Toshiaki Nakazawa, Shohei Higashiyama, Chenchen Ding, Hideya Mino, Isao Goto, Hideto Kazawa, Yusuke Oda, Graham Neubig, and Sadao Kurohashi. 2017. Overview of the 4th Workshop on Asian Translation. In WAT, pages 1–54.
    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL, pages 311–318.
    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, pages 1532–1543.
    Novi Quadrianto, Le Song, and Alex J. Smola. 2009. Kernelized Sorting. In NIPS, pages 1289–1296.
    David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. 2011. Detecting Novel Associations in Large Data Sets. Science, 334(6062):1518–1524.
    J. Shawe-Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
    Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. 2012. Feature Selection via Dependence Maximization. JMLR, 13:1393–1434.
    Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In NAACL-HLT, pages 196–205.
    Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS, pages 3104–3112.
    Taiji Suzuki, Masashi Sugiyama, Takafumi Kanamori, and Jun Sese. 2009. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(Suppl 1):S52.
    Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a Next-Generation Open Source Framework for Deep Learning. In LearningSys.
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS, pages 5998–6008.
    John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards Universal Paraphrastic Sentence Embeddings. In ICLR, pages 1–19.
    Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. TACL, 4(1):259–272.
    Sho Yokoi, Daichi Mochihashi, Ryo Takahashi, Naoaki Okazaki, and Kentaro Inui. 2017. Learning Co-Substructures by Kernel Dependence Maximization. In IJCAI, pages 3329–3335.

    A Available Kernels for PHSIC

    Similarity between Sentence Vectors  A variety of vector representations of phrases and sentences based on the distributional hypothesis have recently been proposed, including sentence encoders (Kiros et al., 2015; Dai and Le, 2015; Iyyer et al., 2015; Hill et al., 2016; Cer et al., 2018) and the sum of word embeddings; it is known as additive compositionality (Mitchell and Lapata, 2010; Mikolov et al., 2013a; Wieting et al., 2015) that we can express the meaning of phrases and sentences well with the sum of word vectors (e.g., word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017)). Note that various pre-trained models of sentence encoders and word embeddings have also been made available.

    The cosine of these vectors, which is a positive definite kernel, can be used as a convenient and highly accurate similarity function between phrases or sentences. Other major kernels can also be used, such as the RBF kernel, the Laplacian kernel, and polynomial kernels.

    Structured Kernels  Various structured kernels for NLP, such as tree kernels, which capture fine structure of sentences such as syntax, were devised in the support vector machine era (Collins and Duffy, 2002; Bunescu and Mooney, 2006; Moschitti, 2006).

    Combinations  We can freely combine the previously mentioned kernels because the sum and the product of positive definite kernels are also positive definite kernels (Shawe-Taylor and Cristianini, 2004, Proposition 3.22).

    B Derivation of Fast PHSIC Estimation in Data Space

    Although estimators of HSIC and PHSIC depend on the kernels k, ℓ and the data D, hereinafter we use the following notation for the sake of simplicity:

    ĤSIC(X, Y) := ĤSIC(X, Y; D, k, ℓ),   (25)
    P̂HSIC(x, y) := P̂HSIC(x, y; D, k, ℓ).   (26)

    Naïve Estimation  First, an estimator of PHSIC in the data space (15) is

    P̂HSIC_kernel(x, y) = (k − k̄)ᵀ ( (1/n) H ) (ℓ − ℓ̄),   (27)

    where k := (k(x, x_1), . . . , k(x, x_n))ᵀ ∈ Rⁿ and likewise ℓ; the vector k̄ := (1/n) K 1 denotes the empirical mean of {k_i}_{i=1}^{n}, and likewise ℓ̄; and H := I_n − (1/n) 1 1ᵀ denotes the centering matrix. This estimation has a large computational cost. When learning, computing the vectors k̄, ℓ̄ takes O(n²) time and O(n) space. When estimating PHSIC, computing k, ℓ and multiplying by the matrix (1/n) H takes O(n) time.

    Fast Estimation via Incomplete Cholesky Decomposition  Equation (27) has a large computational cost because it is necessary to construct the Gram matrices K, L ∈ R^{n×n}. In kernel methods, several methods have been proposed for approximating Gram matrices at low cost without constructing them explicitly, such as incomplete Cholesky decomposition (Fine and Scheinberg, 2001).

    By incomplete Cholesky decomposition, from data points {x_1, . . . , x_n} ⊆ 𝒳 and a positive definite kernel k : 𝒳 × 𝒳 → R, a matrix A = (a_1, . . . , a_n)ᵀ ∈ R^{n×d} (d ≪ n) can be obtained with O(nd²) time complexity. This makes it possible to approximate the Gram matrix K by the vectors a_i ∈ R^d without constructing the whole of K:

    a_iᵀ a_j ≈ k(x_i, x_j),   (28)
    A Aᵀ ≈ K.   (29)

    Also, for HSIC, an efficient approximation method utilizing incomplete Cholesky decomposition has been proposed (Gretton et al., 2005, Lemma 2):

    ĤSIC_ICD(X, Y) = (1/n²) ‖(HA)ᵀ B‖²_F,   (30)

    where A = (a_1, . . . , a_n)ᵀ ∈ R^{n×d} is a matrix satisfying A Aᵀ ≈ K computed via incomplete Cholesky decomposition, and likewise B (B Bᵀ ≈ L). Equation (30) can be represented in the form of an expectation over the data points:

    ĤSIC_ICD(X, Y) = (1/n) Σ_{i=1}^{n} [ (a_i − ā)ᵀ Ĉ_ICD (b_i − b̄) ],   (31)
    Ĉ_ICD := (1/n) (HA)ᵀ B ∈ R^{d×d},   (32)

    where the vector ā := (1/n) Aᵀ 1 ∈ R^d denotes the empirical mean of {a_i}_{i=1}^{n}, and likewise b̄ := (1/n) Bᵀ 1.

    Recall that PHSIC(x, y) is the contribution of (x, y) to HSIC(X, Y) (see Section 3.3); PHSIC can then be efficiently estimated by the shaded part of Equation (31):

    P̂HSIC_ICD(x, y) = (a − ā)ᵀ Ĉ_ICD (b − b̄).   (33)

    Model | Config | | (A) | (B) | (C) | (D)
    RNN-PMI | Dim. 300, Init. fastText | (A) | – | .42 | .12 | .27
    RNN-PMI | Dim. 1200, Init. fastText | (B) | .42 | – | .12 | .26
    PHSIC | Encoder fastText, Kernel cos | (C) | .12 | .12 | – | .16
    PHSIC | Encoder DAN, Kernel cos | (D) | .27 | .26 | .16 | –

    Table 7: Spearman’s ρ between the co-occurrence scores computed by the models in the dialogue response selection task (Section 6.2). The size of the training set n is 5 × 10⁵. The other notation is the same as in Table 4.

    Here, the vector a ∈ R^d corresponding to the new x can be calculated by “performing from halfway” through the incomplete Cholesky decomposition algorithm. Let x_(1), . . . , x_(d) denote the dominant x_i adopted during the decomposition algorithm. The j-th element of a can be computed as follows:

    a[j] = [ k(x, x_(j)) − Σ_{m=1}^{j−1} a[m] A_{jm} ] / A_{jj},   (34)

    and likewise for the vector b ∈ R^d corresponding to the new y. The estimation via incomplete Cholesky decomposition (33) is extremely efficient compared with the naive estimation (27); Equation (33)’s computational complexity is equivalent to that of the estimation in feature space (18).
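    To make Equations (28)–(34) concrete, here is a compact end-to-end sketch (ours, not the authors' implementation): a pivoted incomplete Cholesky factorization, the Ĉ_ICD matrix of Equation (32), and the projection of a new point “from halfway” as in Equation (34).

```python
import numpy as np

def incomplete_cholesky(xs, k, d, tol=1e-9):
    """Pivoted incomplete Cholesky: A (n x d') with A A^T ~ K, plus pivot indices."""
    n = len(xs)
    A = np.zeros((n, d))
    diag = np.array([k(x, x) for x in xs], dtype=float)   # residual diagonal
    pivots = []
    for j in range(d):
        i = int(np.argmax(diag))
        if diag[i] <= tol:
            return A[:, :j], pivots
        pivots.append(i)
        col = np.array([k(x, xs[i]) for x in xs]) - A[:, :j] @ A[i, :j]
        A[:, j] = col / np.sqrt(diag[i])
        diag -= A[:, j] ** 2
    return A, pivots

def project(x, xs, k, A, pivots):
    """Eq. 34: the vector a for a new point x, computed 'from halfway'."""
    a = np.zeros(len(pivots))
    for j, i in enumerate(pivots):
        a[j] = (k(x, xs[i]) - a[:j] @ A[i, :j]) / A[i, j]
    return a

class PhsicICD:
    """PHSIC with nonlinear kernels via incomplete Cholesky (Eq. 30-34)."""

    def fit(self, xs, ys, k, l, d=100):
        self.xs, self.ys, self.k, self.l = xs, ys, k, l
        self.A, self.px = incomplete_cholesky(xs, k, d)
        self.B, self.py = incomplete_cholesky(ys, l, d)
        self.a_mean = self.A.mean(axis=0)
        self.b_mean = self.B.mean(axis=0)
        HA = self.A - self.a_mean            # centering, i.e. H A
        self.C = HA.T @ self.B / len(xs)     # C_ICD of Eq. 32
        return self

    def score(self, x, y):
        a = project(x, self.xs, self.k, self.A, self.px)
        b = project(y, self.ys, self.l, self.B, self.py)
        return float((a - self.a_mean) @ self.C @ (b - self.b_mean))   # Eq. 33
```

    With d fixed (d = 100 in the experiments above), both fitting and scoring here match the costs stated for the feature-space estimator.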

    C Detailed Settings for Learning RNNs

    Detailed settings for learning RNNs used in this research are as follows.
    • Hidden layers: single-layer LSTMs (Hochreiter and Schmidhuber, 1997)
    • Vocabulary: words with a frequency of 10 or more (n = 5 × 10⁵), 2 or more (otherwise)
    • Dropout rate: 0.1 (300-dim), 0.3 (1200-dim)
    • Batch size: 64
    • Max epoch number: 5 (n = 5 × 10⁵), 30 (otherwise)
    • Deep learning framework: Chainer (Tokui et al., 2015)

    D Correlation Between Models in Dialogue Response Selection Task

    Table 7 shows Spearman’s rank correlation coefficient (Spearman’s ρ) between the co-occurrence scores on the test set computed by the models in the dialogue response selection task (Section 6.2). It shows that the behaviors of RNN-based PMI and PHSIC are considerably different. Furthermore, interestingly, the behaviors of PHSICs using different kernels are also different. Possible reasons for these observations are as follows: (1) the difference in the dependence measures (MI or HSIC) on which each model is based; (2) the validity or numerical stability of estimating PMI with RNN language models; and (3) differences in the behavior of PHSIC originating from differences in the plugged-in kernels. A more detailed analysis of the compatibility between tasks and measures (or kernels) is attractive future work.