
Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1401–1411, November 7–11, 2021. ©2021 Association for Computational Linguistics


Modeling Concentrated Cross-Attention for Neural Machine Translation with Gaussian Mixture Model

Shaolei Zhang 1,2, Yang Feng 1,2∗

1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)

2 University of Chinese Academy of Sciences, Beijing, China
{zhangshaolei20z, fengyang}@ict.ac.cn

Abstract

Cross-attention is an important component of neural machine translation (NMT), and it is always realized by dot-product attention in previous methods. However, dot-product attention only considers the pair-wise correlation between words, resulting in dispersion when dealing with long sentences and neglect of source neighboring relationships. Inspired by linguistics, we argue that the above issues are caused by ignoring a type of cross-attention, called concentrated attention, which focuses on several central words and then spreads around them. In this work, we apply the Gaussian Mixture Model (GMM) to model the concentrated attention in cross-attention. Experiments and analyses we conducted on three datasets show that the proposed method outperforms the baseline and brings significant improvements on alignment quality, N-gram accuracy, and long sentence translation.

1 Introduction

Recently, Neural Machine Translation (NMT) has been greatly improved with Transformer (Vaswani et al., 2017), which mainly relies on the attention mechanism. The attention mechanism in Transformer consists of self-attention and cross-attention, where cross-attention has been shown to be more important to translation quality than self-attention (He et al., 2020; You et al., 2020). Even if the self-attention is modified to a fixed template, the translation quality does not significantly decrease (You et al., 2020), while cross-attention plays an irreplaceable role in NMT. Cross-attention in Transformer is realized by dot-product attention, which calculates the attention distribution based on the pair-wise similarity.

However, modeling cross-attention with dot-product attention still has some weaknesses due to the way it is calculated.

∗Corresponding author: Yang Feng.

Figure 1: An attention example of En→De translation when generating the target word "Mäppchen" (English meaning: pencil case), showing the difference and complementarity between distributed attention and concentrated attention.

First, when dealing with long sentences, the attention distribution given by dot-product attention tends to be dispersed (Yang et al., 2018), which has been shown to be unfavorable for translation (Zhang et al., 2019; Tang et al., 2019; He et al., 2020). Second, it is difficult for dot-product attention to explicitly consider the source neighboring relationship (Im and Cho, 2017; Sperber et al., 2018), so it tends to ignore words that have lower similarity but lie near an important word and determine phrase structure or morphology.

Research in linguistics and cognitive science suggests that human attention to language can be divided into two categories: distributed attention and concentrated attention (Jacob and Bruce, 1973; Ito et al., 1998; Brand and Johnson, 2018). Specifically, distributed attention is scattered over all source words, and the degree of attention is determined by correlation. On the contrary, concentrated attention only focuses on a few central words and then spreads to the words around them. Accordingly, we consider that cross-attention can be divided into these two types of attention as well, where distributed attention is well modeled by dot-product attention, but concentrated attention is ignored.


Figure 1 shows an attention example of En→De translation when generating the target word "Mäppchen" (English meaning: pencil case). With only distributed attention, some irrelevant words (such as 'desk') rob some attention weight, resulting in attention dispersion. Besides, the correlations to the function words (both 'a' and 'some') are low and similar, but these words are important for determining the singular/plural form of the target word. In concentrated attention, the attention distribution is more concentrated and can capture neighboring relationships, which makes up for the lack of distributed attention.

In this paper, to explicitly model the concentrated cross-attention, we apply the Gaussian mixture model (GMM) (Rasmussen, 1999) to construct Gaussian mixture attention. Specifically, Gaussian mixture attention first focuses on some central words, and then pays attention to the words around the central words, where the attention decreases as a word moves away from the central word. Since cross-lingual alignments are often one-to-many, Gaussian mixture attention can flexibly model multiple central words, which is not possible with a single Gaussian distribution. Then, Gaussian mixture attention and dot-product attention are fused to jointly determine the total attention.

Experiments we conducted on three datasets show that our method outperforms the baseline on translation quality. Further analyses show that our method enhances cross-attention, thereby improving alignment quality, N-gram accuracy, and long sentence translation.

Our contributions can be summarized as follows:

• We introduce concentrated attention to cross-attention, which successfully compensates for the weakness of dot-product attention.

• To the best of our knowledge, we are the first to apply GMM to model the attention distribution over a text sequence, which provides a method for modeling multi-center attention distributions.

2 Background

Our method is applied to the cross-attention in Transformer (Vaswani et al., 2017), so we first briefly introduce the architecture of Transformer with a focus on its dot-product attention. Then, we introduce the concept of the Gaussian Mixture Model.

2.1 Transformer

Encoder-Decoder Transformer consists of an encoder and a decoder, each of which contains N repeated independent structures. Each encoder layer contains two sub-layers: self-attention and a fully connected feed-forward network (FFN), while each decoder layer includes three sub-layers: self-attention, cross-attention, and FFN. We denote the input sentence as $x = (x_1, \cdots, x_J)$, where J is the length of the source sentence, $x_j \in \mathbb{R}^{d_{model}}$ is the sum of the token embedding and the position encoding of the source token, and $d_{model}$ represents the representation dimension. The encoder maps x to a sequence of hidden states $z = (z_1, \cdots, z_J)$. Given z and the previous target tokens, the decoder predicts the next output token $y_i$, and the entire output sequence is $y = (y_1, \cdots, y_I)$, where I is the length of the target sentence.

Dot-product attention Both self-attention and cross-attention in Transformer apply multi-head attention, which contains multiple heads, and each head performs scaled dot-product attention to process a set of queries (q), keys (k), and values (v). In the following, we focus on the specific form of the dot-product attention in cross-attention.

In cross-attention, the queries are the hidden states of the decoder $s = \{s_1, \cdots, s_I\}$, while the keys and values both come from the hidden states of the encoder $z = \{z_1, \cdots, z_J\}$. Dot-product attention first calculates the pairwise correlation score $e_{ij}$ between the i-th target token and the j-th source token, and normalizes it to obtain the dot-product attention weight $\alpha_{ij}$ of the i-th target token for each source token:

$$e_{ij} = \frac{Q(s_i)\,K(z_j)^{\top}}{\sqrt{d_k}} \qquad (1)$$

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{l=1}^{J} \exp e_{il}} \qquad (2)$$

where $Q(\cdot)$ and $K(\cdot)$ are the projection functions from the input space to the query space and the key space, respectively, and $d_k$ represents the dimension of the queries and keys. Then, for each i, the context vector $c_i$ is calculated as:

$$c_i = \sum_{j=1}^{J} \alpha_{ij}\, V(z_j) \qquad (3)$$

where $V(\cdot)$ is a projection function from the input space to the value space.
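To make Eqs.(1)–(3) concrete, the following is a minimal single-head sketch in PyTorch (no batching, no masking). The names `dot_product_cross_attention`, `W_q`, `W_k`, and `W_v` are ours: the weight matrices simply stand in for the projections Q(·), K(·), and V(·); this is an illustration of the standard computation, not the authors' released code.

```python
import torch

def dot_product_cross_attention(s, z, W_q, W_k, W_v):
    """Single-head dot-product cross-attention, following Eqs.(1)-(3).

    s: decoder hidden states, shape (I, d_model)
    z: encoder hidden states, shape (J, d_model)
    W_q, W_k: projections of shape (d_model, d_k); W_v: (d_model, d_v)
    """
    q = s @ W_q                              # Q(s_i), shape (I, d_k)
    k = z @ W_k                              # K(z_j), shape (J, d_k)
    v = z @ W_v                              # V(z_j), shape (J, d_v)
    d_k = q.size(-1)
    e = q @ k.transpose(0, 1) / d_k ** 0.5   # pairwise scores e_ij, Eq.(1)
    alpha = torch.softmax(e, dim=-1)         # attention weights alpha_ij, Eq.(2)
    c = alpha @ v                            # context vectors c_i, Eq.(3)
    return alpha, c
```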


Figure 2: The architecture of the proposed method.

2.2 Gaussian Mixture Model

Single Gaussian distribution A Gaussian distribution with mean µ and variance σ is calculated as:

$$x \sim \mathcal{N}(\mu, \sigma) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (4)$$

Gaussian mixture model (GMM) (Rasmussen, 1999) A GMM is composed of K single Gaussian distributions and is calculated as:

$$x \sim \sum_{k=1}^{K} a_k\, \mathcal{N}(\mu_k, \sigma_k) \qquad (5)$$

where $a_k$, $\mu_k$ and $\sigma_k$ are the weight, mean and variance of the k-th Gaussian distribution, respectively. During training, for unlabeled data, a GMM can be trained using the EM algorithm, and for labeled data, a GMM can be trained using methods such as maximum likelihood or gradient descent.
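As a quick illustration of Eqs.(4)–(5), the hypothetical snippet below evaluates a K-component GMM density at a point; the function and parameter names are ours and purely illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Single Gaussian density, Eq.(4)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gmm_pdf(x, weights, means, sigmas):
    """Mixture of K Gaussians with mixing weights a_k, Eq.(5)."""
    return sum(a * gaussian_pdf(x, mu, s) for a, mu, s in zip(weights, means, sigmas))

# e.g. a two-component mixture evaluated at x = 1.0
print(gmm_pdf(1.0, weights=[0.3, 0.7], means=[0.0, 2.0], sigmas=[1.0, 0.5]))
```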

3 The Proposed Method

To improve cross-attention, in addition to dot-product attention, we introduce Gaussian mixture attention into each head of cross-attention. As shown in Figure 2, we first calculate the dot-product attention and the Gaussian mixture attention, and then fuse them through a gating mechanism to determine the total attention distribution. The proposed Gaussian mixture attention is constructed from a mean, variance, and weight, all of which are predicted based on the target word, as shown in Figure 3. The specific details are introduced in the following.

3.1 Gaussian Mixture Attention

Figure 3: Calculation of Gaussian Mixture Attention.

Gaussian mixture attention consists of K independent Gaussian distributions, where each Gaussian distribution has a different weight. The general form of the Gaussian mixture attention $\beta_{ij}$ between the i-th target token and the j-th source token is defined as:

$$\beta_{ij} = \sum_{k=1}^{K} \omega_{ik} \cdot \frac{1}{Z_{ik}} \exp\left(-\frac{(j - \mu_{ik})^2}{2\sigma_{ik}^2}\right) \qquad (6)$$

where $\mu_{ik}$, $\sigma_{ik}$ and $\omega_{ik}$ are the mean, variance, and weight of the k-th Gaussian distribution of the i-th target word, respectively, $Z_{ik}$ represents the normalization factor of the k-th Gaussian distribution of the i-th target word, and K is a hyperparameter we set. Intuitively, the mean $\mu_{ik}$ represents the position of the central word, the variance $\sigma_{ik}$ represents how quickly attention decays with the offset from the central word, and the weight $\omega_{ik}$ represents the importance of the central word.
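Below is a minimal sketch of how Eq.(6) can be evaluated over all source positions at once. We assume the parameters are already available as tensors of shape (I, K), with one component per central word and target position; the function name and tensor layout are our own choices, not the paper's implementation.

```python
import torch

def gaussian_mixture_attention(omega, mu, sigma, Z, J):
    """Gaussian mixture attention over source positions, Eq.(6).

    omega, mu, sigma, Z: tensors of shape (I, K) holding the weight, mean,
    spread parameter sigma_ik, and normalization factor Z_ik of each of the
    K components for every target position i. J: source sentence length.
    Returns beta of shape (I, J), where beta[i, j] attends to source position j.
    """
    j = torch.arange(J, dtype=mu.dtype)                  # source positions 0..J-1
    # broadcast to (I, K, J): one Gaussian bump per component and target word
    dist = (j.view(1, 1, J) - mu.unsqueeze(-1)) ** 2
    comp = torch.exp(-dist / (2 * sigma.unsqueeze(-1) ** 2)) / Z.unsqueeze(-1)
    beta = (omega.unsqueeze(-1) * comp).sum(dim=1)       # mix the K components
    return beta
```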

As shown in Figure 3, the parameters for the Gaussian mixture attention of the i-th target word are $(\omega_i, \mu_i, \sigma_i, Z_i)$, where $\omega_i \in \mathbb{R}^{K \times 1}$ is the vector representation of $[\omega_{i1}, \cdots, \omega_{iK}]$, and the others follow the same convention. $(\omega_i, \mu_i, \sigma_i, Z_i)$ are converted from predicted intermediate parameters $(\hat{\omega}_i, \hat{\mu}_i, \hat{\sigma}_i)$. According to Yang et al. (2018) and Battenberg et al. (2020), it is more robust to use the target hidden state to predict the variables of the Gaussian distribution. Thus, the intermediate parameters are predicted through feed-forward networks (FFN):

$$\hat{\omega}_i = V_{\omega}^{\top} \tanh\left(W_{\omega}^{\top} Q(s_i) + b_{\omega 1}\right) + b_{\omega 2} \qquad (7)$$

$$\hat{\mu}_i = V_{\mu}^{\top} \tanh\left(W_{\mu}^{\top} Q(s_i) + b_{\mu 1}\right) + b_{\mu 2} \qquad (8)$$

$$\hat{\sigma}_i = V_{\sigma}^{\top} \tanh\left(W_{\sigma}^{\top} Q(s_i) + b_{\sigma 1}\right) + b_{\sigma 2} \qquad (9)$$

where $\{W_{\omega}, W_{\mu}, W_{\sigma}\} \in \mathbb{R}^{d_q \times d_q}$ and $\{V_{\omega}, V_{\mu}, V_{\sigma}\} \in \mathbb{R}^{d_q \times K}$ are learnable parameters, and $\{b_{\omega 1}, b_{\mu 1}, b_{\sigma 1}\} \in \mathbb{R}^{d_q \times 1}$ and $\{b_{\omega 2}, b_{\mu 2}, b_{\sigma 2}\} \in \mathbb{R}^{K \times 1}$ are learnable biases.


Note that $Q(\cdot)$ shares parameters with the projection function of dot-product attention in Eq.(1), and the parameters of the FFN are shared in each head.
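One possible realization of Eqs.(7)–(9) is a small module with one two-layer tanh FFN per parameter group, as sketched below. The class and attribute names are hypothetical; only the functional form follows the equations above.

```python
import torch
import torch.nn as nn

class IntermediateParamPredictor(nn.Module):
    """Predicts (omega_hat, mu_hat, sigma_hat) from Q(s_i), following Eqs.(7)-(9)."""

    def __init__(self, d_q: int, K: int):
        super().__init__()
        # each FFN maps d_q -> d_q -> K, i.e. V^T tanh(W^T q + b1) + b2
        self.ffn_omega = nn.Sequential(nn.Linear(d_q, d_q), nn.Tanh(), nn.Linear(d_q, K))
        self.ffn_mu    = nn.Sequential(nn.Linear(d_q, d_q), nn.Tanh(), nn.Linear(d_q, K))
        self.ffn_sigma = nn.Sequential(nn.Linear(d_q, d_q), nn.Tanh(), nn.Linear(d_q, K))

    def forward(self, q):
        # q = Q(s_i) for all target positions, shape (I, d_q)
        return self.ffn_omega(q), self.ffn_mu(q), self.ffn_sigma(q)
```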

Given the intermediate parameters $(\hat{\omega}_i, \hat{\mu}_i, \hat{\sigma}_i)$, our method predicts $(\omega_i, \mu_i, \sigma_i, Z_i)$ through a conversion layer. The weights of the single Gaussian distributions are normalized with:

$$\omega_i = \mathrm{Softmax}(\hat{\omega}_i) \qquad (10)$$

For the mean, considering the word order differences between languages, we directly predict its absolute position:

$$\mu_i = J \cdot \mathrm{Sigmoid}(\hat{\mu}_i) \qquad (11)$$

where J is the length of the source sentence.

Note that, since the source positions are discrete and the distribution is truncated at the sentence boundary, the attention weight sum is less than 1 without normalization, rather than being a fully normalized attention weight. Previous work applying Gaussian distributions (Luong et al., 2015; Yang et al., 2018; You et al., 2020) hardly normalized them, since the normalization of Gaussian attention weights results in unstable training. However, unnormalized attention weights lead to occasional spikes or dropouts in the attention and alignment (Battenberg et al., 2020). Therefore, to normalize Gaussian mixture attention while maintaining training stability, we propose an approximate normalization.

Approximate normalization adjusts the variance according to the mean position to ensure that most of the weight is within the source sentence, which not only avoids the spikes caused by a small weight sum but also ensures stable training. For approximate normalization, we calculate the value of $\sigma_i$ from $\mu_i$:

$$\sigma_i = \min\left\{\frac{J}{6} \cdot \mathrm{Sigmoid}(\hat{\sigma}_i),\; \frac{\mu_i}{3},\; \frac{J - \mu_i}{3}\right\} \qquad (12)$$

Approximate normalization ensures that the interval $[\mu_i - 3\sigma_i, \mu_i + 3\sigma_i]$ lies within the source sentence (Pukelsheim, 1994). Besides, $Z_i$ is set to $\sqrt{2\pi\sigma_i^2}$ to normalize each Gaussian distribution in the GMM. In general, although Gaussian mixture attention is not strictly normalized, approximate normalization guarantees coverage of more than 90% of the attention weight. In the experiments (Sec. 5.2), we additionally report the results of a strictly normalized Gaussian mixture attention as a variant of our method for comparison.

3.2 Fusion of Attention

Dot-product attention ($\alpha_{ij}$ in Eq.(2)) captures the distributed attention arising from pair-wise similarity, while Gaussian mixture attention ($\beta_{ij}$ in Eq.(6)) models the location-related concentrated attention. To balance the two types of attention, we calculate the total attention weight $\gamma_{ij}$ by fusing them through a gating mechanism:

$$\gamma_{ij} = (1 - g_i) \times \alpha_{ij} + g_i \times \beta_{ij} \qquad (13)$$

where $g_i$ is a gating factor, predicted through an FFN:

$$g_i = \mathrm{Sigmoid}\left(V_g^{\top} \tanh\left(W_g^{\top} Q(s_i) + b_{g1}\right) + b_{g2}\right) \qquad (14)$$

where $W_g \in \mathbb{R}^{d_q \times d_q}$ and $V_g \in \mathbb{R}^{d_q \times 1}$ are learnable parameters, and $b_{g1} \in \mathbb{R}^{d_q \times 1}$ and $b_{g2} \in \mathbb{R}$ are learnable biases. Finally, the context vector $c_i$ in Eq.(3) is calculated as:

$$c_i = \sum_{j=1}^{J} \gamma_{ij}\, V(z_j) \qquad (15)$$

4 Related Work

The attention mechanism is the most significant component of the Transformer (Vaswani et al., 2017) for Neural Machine Translation. Recently, some methods have modeled location in Transformer, most of which focus on self-attention.

Some methods improve the word representation. Transformer itself (Vaswani et al., 2017) introduced a position encoding to embed position information in the word representation. Shaw et al. (2018) introduced relative position encoding in self-attention. Wang et al. (2019) enhanced self-attention with structural positions from syntax dependencies. Ding et al. (2020a) utilized reordering information to learn position representations in self-attention.

Other methods directly consider location information in the attention mechanism, which is closely related to this work. Luong et al. (2015) proposed local attention, which only focuses on a small subset of the source positions. Yang et al. (2018) multiplied self-attention by a learnable Gaussian bias to model local information. Song et al. (2018) designed masks for self-attention to extract global/local information. Xu et al. (2019) proposed a hybrid attention mechanism to dynamically leverage both local and global information. You et al. (2020) applied a hard-coded Gaussian to replace dot-product attention in Transformer.


Models            | Z_i        | ω_i           | µ_i                 | σ_i              | Normalized
Synthesis Network | 1          | exp(ω̂_i)      | µ_{i−1} + exp(µ̂_i)  | √(exp(−σ̂_i)/2)   | None
Our Method        | √(2πσ_i²)  | Softmax(ω̂_i)  | J·Sigmoid(µ̂_i)      | Eq.(12)          | Approximate
Our Method+Norm.  | √(2πσ_i²)  | Softmax(ω̂_i)  | J·Sigmoid(µ̂_i)      | J·Sigmoid(σ̂_i)   | Strict

Table 1: Conversion methods of the synthesis network, our method, and the strict-normalized variant of our method.

There are three differences between the proposed method and previous methods. 1) Most previous methods only focus on self-attention, while we consider cross-attention, which has been shown to be more critical to translation quality (Voita et al., 2019; Tang et al., 2019; You et al., 2020; Ding et al., 2020b). 2) Previous methods usually multiply dot-product attention by a position-related bias or mask to model position. Our method instead introduces an additional concentrated attention to compensate for dot-product attention, rather than a simple bias. 3) The Gaussian distribution is widely used in previous position modeling, while we use the more flexible GMM for complex cross-attention.

5 Experiments

We conducted experiments on three datasets and compared with the baseline and previous methods to evaluate the performance of the proposed method.

5.1 Datasets

Experiments were conducted on the following three datasets of different sizes.

Nist Zh→En 1.25M sentence pairs from LDC corpora.1 We use MT02 as the validation set and MT03, MT04, MT05, MT06, MT08 as the test sets, each with 4 English references. Results are averaged over all test sets. We tokenize and lowercase English sentences with Moses2 and segment the Chinese sentences with the Stanford Segmenter3. We apply BPE (Sennrich et al., 2016) with 30K merge operations on all texts.

WMT14 En→De 4.5M sentence pairs from the WMT144 English-German task. We use news-test2013 (3000 sentence pairs) as the validation set and news-test2014 (3003 sentence pairs) as the test set. We apply BPE with 32K merge operations, and the vocabulary is shared across languages.

1 The corpora include LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06.
2 https://www.statmt.org/moses/
3 https://nlp.stanford.edu/
4 www.statmt.org/wmt14/


WMT17 Zh→En 20M sentence pairs from the WMT175 Chinese-English task, following the processing of Hassan et al. (2018). We use devtest2017 (2002 sentence pairs) as the validation set and news-test2017 (2001 sentence pairs) as the test set. We apply BPE with 32K merge operations on all texts.

5.2 System

We conducted experiments on the following systems.

Transformer The baseline of our method. The architecture of Transformer-Base/Big was implemented strictly following Vaswani et al. (2017).

Hard-Code Gaussian Uses a hard-coded Gaussian distribution to replace dot-product attention in cross-attention (You et al., 2020). The hard-coded Gaussian distribution is an artificially set Gaussian distribution with fixed mean and variance.

Localness Gaussian Our implementation of the localness modeling proposed by Yang et al. (2018). A learnable Gaussian bias is multiplied into the attention to model local information, originally for self-attention; we apply it to cross-attention.

Synthesis Network Following Graves (2013), we modify the proposed Gaussian mixture attention into a synthesis network, which consists of K unnormalized Gaussian distributions, as shown in Table 1.

Our Method The proposed method. A Gaussian mixture attention is applied to the cross-attention in Transformer. Refer to Sec. 3 for specific details.

Our Method+Norm. Considering that our method is approximately normalized, we additionally propose a strictly normalized Gaussian mixture attention as a variant of our method. Since the Gaussian mixture attention weights $\beta_{ij}$ are all positive, we directly use $\beta_{ij} / \sum_{l=1}^{J} \beta_{il}$ to normalize them, referred to as 'Our Method+Norm.' in Table 1.

5 www.statmt.org/wmt17/


Models             | Nist Zh→En       | WMT14 En→De      | WMT17 Zh→En
                   | BLEU    #Para.   | BLEU    #Para.   | BLEU    #Para.
Transformer-Base   | 44.02   79.7M    | 27.34   63.1M    | 24.03   83.6M
Hard-Code Gaussian | 36.10   79.6M    | 25.03   63.0M    | 20.95   83.5M
Localness Gaussian | 44.06   80.2M    | 27.41   63.6M    | 24.01   84.1M
Synthesis Network  | 43.34   79.8M    | 26.27   63.2M    | 23.69   83.7M
Our Method+Norm.   | 44.69↑  79.8M    | 27.50↑  63.2M    | 24.44↑  83.7M
Our Method         | 45.39↑  79.8M    | 28.09↑  63.2M    | 24.44↑  83.7M
Transformer-Big    | 44.20   247.5M   | 28.43   214.3M   | 24.46   255.2M
Our Method (Big)   | 45.45↑  247.6M   | 29.02↑  214.4M   | 24.82↑  255.3M

Table 2: BLEU scores of our method and existing NMT models on the test sets. "#Para.": the learnable parameter scale of the model (M = million). "↑": the improvement is significant compared to the baseline (p < 0.01).

Figure 4: BLEU scores with different K. (Curves: Baseline, Our Method+Norm., Our Method; x-axis: K, the number of Gaussian distributions.)

All implementations are adapted from the Fairseq library (Ott et al., 2019) with the same settings as Vaswani et al. (2017). SacreBLEU (Post, 2018) is applied to evaluate translation quality.

5.3 Effect of Hyperparameter K

Before the main experiments, as shown in Figure 4, we evaluate performance with various values of the hyperparameter K on the Nist Zh→En validation set, where K is the number of Gaussian distributions in the Gaussian mixture attention. When K = 1, since the cross-attention is not one-to-one, it is difficult for a single Gaussian distribution to fit the cross-attention. As K increases, the translation quality improves, performing best when K = 4. When K is too large, the many Gaussian distributions in the GMM-based attention complicate the model and may predict some inaccurate central words, resulting in a decrease in translation quality. Therefore, we set K = 4 in the following experiments.

5.4 Main Results

Table 2 shows the results of our method compared with the baseline and previous methods. Our method achieves the best results on all three datasets, improving about 1.37 BLEU on Nist Zh→En, 0.75 BLEU on WMT14 En→De, and 0.41 BLEU on WMT17 Zh→En compared with Transformer-Base. Besides, compared to Transformer-Big, our method still brings significant improvements. Our method only adds 0.1% more parameters than Transformer-Base and achieves performance similar to Transformer-Big. The performance improvement of our method thus comes not from simply increasing the model parameters, but from improving the cross-attention with the proposed Gaussian mixture attention.

'Hard-Code Gaussian' (You et al., 2020) and 'Localness Gaussian' (Yang et al., 2018) have been shown to be effective for self-attention, but the benefit is not obvious for cross-attention, since it is difficult to fit complex cross-attention with a single Gaussian bias. Gaussian mixture attention is more flexible and can fit arbitrarily complex distributions, especially multi-center attention distributions, which makes it more suitable for modeling cross-attention.

Considering normalization, 'Synthesis Network' is un-normalized, 'Our Method+Norm.' is strictly normalized, and 'Our Method' is approximately normalized, where our proposed approximate normalization performs best. In practice, un-normalization tends to cause attention spikes, while strict normalization leads to unstable training.


                 | BLEU  | ∆
Our Method       | 28.09 |
− Share Mean     | 27.58 | -0.51
− Share Variance | 27.67 | -0.42
− Share Weight   | 27.90 | -0.19

Table 3: Performance when each Gaussian distribution shares the mean, variance, and weight, respectively.


6 Analysis

We conducted extensive analyses to understand the specific improvements of our method in attention entropy, alignment quality, phrase fluency, and long sentence translation. Unless otherwise specified, all the results are reported on the WMT14 En→De test set with Transformer-Base.

6.1 Flexibility of Gaussian Mixture Attention

Compared with a single Gaussian distribution, a GMM is more flexible in three aspects: mean, variance, and weight. To evaluate the improvements brought by these three aspects, we respectively share the mean, variance, and weight among the Gaussian distributions in Gaussian mixture attention, and report the results in Table 3.

The performance decreases most obviously when the Gaussian distributions share the same central word (mean). The major superiority of a GMM over a single Gaussian distribution is that a GMM contains multiple centers, which is more in line with cross-attention. The variance allows each central word to have different attention coverage, and the weight controls the contribution of each Gaussian distribution. The flexibility in these three aspects makes Gaussian mixture attention more suitable for cross-attention.

6.2 Effect of Gating Mechanism

Our method applies a gating mechanism to fuse Gaussian mixture attention and dot-product attention. We conduct an ablation study of directly averaging the two types of attention or only using one of them in Table 4. Our method surpasses both using a single type of attention and directly averaging the two types of attention, which shows that the gating mechanism plays an important role and effectively fuses the two types of attention.

                           | BLEU  | ∆
Our Method                 | 28.09 |
− Average Gating           | 27.61 | -0.48
Dot-product Attention      | 27.34 | -0.75
Gaussian Mixture Attention | 26.32 | -1.77

Table 4: Ablation study of directly averaging the two types of attention or only using one type of attention.

Figure 5: The distribution of the gating factor g_i in the cross-attention of each layer. The x-axis is g_i, which represents the weight of Gaussian mixture attention in the total attention, and the y-axis is the frequency of g_i.


To analyze the relationship between the two types of attention in detail, we calculate the distribution of the gating factor ($g_i$ in Eq.(13)) in each decoder layer, and show the result in Figure 5. To our surprise, our method makes the cross-attention in each decoder layer present a different division of labor, which confirms the conclusions of previous work (Li et al., 2019). Specifically, the bottom layers (L1, L2) in the decoder emphasize dot-product attention and tend to capture global information; the middle layers (L3, L4) emphasize Gaussian mixture attention, which captures local information around the central word; and the two types of attention in the top layers (L5, L6) are more balanced and jointly determine the final output. Previous work (Li et al., 2019) pointed out that the cross-attention of each layer is different, where the lower layers tend to capture sentence information, while the upper layers tend to capture alignment and specific word information, since they are closer to the output. Our method also confirms this point: Gaussian mixture attention effectively models concentrated attention, so it occupies a larger proportion in the higher layers. With the gating mechanism, our method successfully fuses the two types of attention and learns the division of labor between different layers.


                    | BLEU  | ∆
Baseline            | 27.34 |
Our method:         |       |
  All 6 layers      | 28.09 | +0.75
  Bottom 2 layers   | 27.43 | +0.09
  Middle 2 layers   | 27.74 | +0.40
  Top 2 layers      | 27.89 | +0.55

Table 5: Results of using Gaussian mixture attention in different decoder layers.

Entropy     | All  | Short (0,20] | Mid [21,40] | Long [41,∞]
Baseline    | 2.67 | 2.40         | 2.80        | 2.94
Ours: GMA   | 0.57 | 0.59         | 0.55        | 0.50
Ours: DP    | 2.68 | 2.42         | 2.80        | 2.90
Ours: Total | 1.86 | 1.78         | 1.90        | 1.90

Table 6: Entropy of the attention distribution for varying source sentence lengths. 'GMA': Gaussian mixture attention. 'DP': Dot-product attention. 'Total': Fusion of the two types of attention.


Based on this, we apply our method to only a part of the decoder layers to verify its effect on different layers, and the results are reported in Table 5. When Gaussian mixture attention is used in only the top or middle 2 layers, the translation quality is still significantly improved without requiring many additional computations compared with Transformer-Base.

6.3 Entropy of Attention Distribution

We use Gaussian mixture attention to model concentrated attention to make up for the dispersion of dot-product attention, especially on long source sentences. Entropy is often used to measure the dispersion of a distribution, where higher entropy means that the distribution is more dispersed (He et al., 2020).

                 | AER   | P.    | R.
Transformer-Base | 53.79 | 47.75 | 47.09
Our method       | 48.21 | 53.94 | 52.83
Transformer-Big  | 46.50 | 54.46 | 56.15
Our method (Big) | 43.03 | 58.86 | 60.62

Table 7: Alignment quality of our method and the baselines.

We report the entropy of the attention distribution of our method for varying source sentence lengths in Table 6. The entropy of dot-product attention increases with the length of the sentence, showing that dot-product attention easily becomes dispersed as the source length increases. However, the entropy of the concentrated attention modeled by Gaussian mixture attention always remains at a low level, since it is unaffected by the source length. Overall, with the proposed Gaussian mixture attention, our method has lower entropy than the baseline, indicating that our method focuses more on some important words, which has been shown to be beneficial to translation (He et al., 2020; Zhang et al., 2019).
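For reference, one way to compute the kind of entropy reported in Table 6 is sketched below. We assume each attention row is (re)normalized per target word before measuring, and the natural logarithm is assumed since the paper does not state the log base; the function name is ours.

```python
import torch

def attention_entropy(attn, eps=1e-9):
    """Average entropy of an attention matrix of shape (I, J).

    Each row is renormalized before computing -sum(p * log p), so approximately
    normalized Gaussian mixture attention can be measured in the same way.
    """
    p = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    ent = -(p * torch.log(p + eps)).sum(dim=-1)   # entropy per target position
    return ent.mean().item()
```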

6.4 Alignment Quality

The Gaussian mixture attention we propose explicitly models the concentrated attention, so it has the potential to help cross-attention achieve more accurate alignment between the target and the source. To explore this conjecture, we evaluate the alignment accuracy of our method on the RWTH En→De alignment dataset6 (Liu et al., 2016; Ghader and Monz, 2017; Tang et al., 2019).

Following Luong et al. (2015) and Kuang et al. (2018), we force the models to produce the reference target words during inference to obtain the attention between source and target. We average the attention weights across all heads of the penultimate layer (Li et al., 2019; Ding et al., 2020a), and the source token with the highest attention weight is viewed as the alignment of the current target token. The alignment error rate (AER) (Och and Ney, 2003), precision (P.), and recall (R.) of our method are reported in Table 7.
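The alignment extraction described above can be sketched as follows, assuming the per-head attention of the penultimate decoder layer is already available; AER/precision/recall are then computed against the gold links (not shown). The function name and tensor layout are our assumptions.

```python
import torch

def extract_alignment(attn_heads):
    """Derives word alignment links from cross-attention.

    attn_heads: attention weights of the penultimate decoder layer,
                shape (H, I, J) for H heads, I target and J source tokens.
    Returns a list of (target_index, source_index) links.
    """
    attn = attn_heads.mean(dim=0)   # average over heads -> (I, J)
    src = attn.argmax(dim=-1)       # highest-weight source token per target token
    return [(i, int(j)) for i, j in enumerate(src)]
```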

Our method achieves better alignment accuracy than the baseline, improving AER by 5.58 with Transformer-Base and 3.47 with Transformer-Big, which shows that modeling concentrated attention indeed improves the alignment quality of cross-attention.

6 https://www-i6.informatik.rwth-aachen.de/goldAlignment/


Figure 6: Gap of the N-gram score between our method and the baseline with respect to various N, highlighting the difference in improvement on different n-grams. (Curves: Our Method, Our Method+Norm., Baseline.)

6.5 N-gram Accuracy

Our method uses Gaussian mixture attention to model concentrated attention, which intuitively should enhance the ability to capture neighboring structures and thereby yield more fluent translations. To evaluate the quality of phrase translation, we calculate the improvement of our method on various N-grams in Figure 6. We set Transformer-Base as the baseline, and report the gap between our method and the baseline on each N-gram score. Our method is superior to the baseline on all N-grams, especially on 2-grams and 3-grams, which shows that our method effectively captures nearby phrase structures. In the concentrated attention modeled by Gaussian mixture attention, the attention to the surrounding words increases along with the central word, resulting in better phrase translation.
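As a rough illustration of how per-N gaps like those in Figure 6 can be obtained, the snippet below computes a clipped n-gram precision for a tokenized hypothesis against a single reference; it is a simplified stand-in, not the internals of SacreBLEU, and the function name is ours.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a tokenized hypothesis against one reference."""
    hyp_ngrams = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    total = max(sum(hyp_ngrams.values()), 1)
    return matched / total

# e.g. the 2-gram gap between two systems on the same sentence pair:
# gap = ngram_precision(ours, ref, 2) - ngram_precision(baseline, ref, 2)
```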

6.6 Analysis on Sentence Length

To analyze the improvement of our method on sentences of different lengths, we group the sentences into 6 sets according to the source length (Bahdanau et al., 2014; Tu et al., 2016), and report the BLEU scores on each set in Figure 7.

Compared with the baseline, our method shows a more significant improvement on long sentences, with +1.36 BLEU on (30, 40], +1.06 BLEU on (40, 50], and +4.14 BLEU on (50, +∞]. Our method significantly improves long sentence translation by modeling concentrated cross-attention. When the source sentence is very long, dot-product attention pays attention to every source word and normalizes over all of them, causing cross-attention to become dispersed, which has been shown to be unfavorable for translation in previous work (Zhang et al., 2019; He et al., 2020; Tang et al., 2019).

Figure 7: BLEU scores of sentences with various lengths, for the baseline and our method.

In contrast, regardless of the length of the source, Gaussian mixture attention concentrates around some central words, effectively avoiding attention dispersion. Therefore, our method effectively improves the translation quality of long sentences by modeling the concentrated attention. Besides, our method drops slightly when the sentence length is small. Since we set K = 4 for source sentences of all lengths, when the sentence is very short it is difficult to accurately find 4 corresponding central words (means) for each target word.

7 Conclusion

Inspired by linguistics, we decompose cross-attention into distributed attention and concentrated attention. To model the concentrated attention, we apply a GMM to construct Gaussian mixture attention, which effectively resolves the weaknesses of dot-product attention. Experiments show that the proposed method outperforms the strong baseline on three datasets. Further analyses show the specific advantages of the proposed method in attention distribution, alignment quality, N-gram accuracy, and long sentence translation.

Acknowledgements

We thank all the anonymous reviewers for their insightful and valuable comments. This work was supported by the National Key R&D Program of China (No. 2017YFE0192900).


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Accepted at ICLR 2015 as an oral presentation.

E. Battenberg, R. J. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby. 2020. Location-relative attention mechanisms for robust long-form speech synthesis. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6194–6198.

John Brand and Aaron P. Johnson. 2018. The effects of distributed and focused attention on rapid scene categorization. Visual Cognition, 26(6):450–462.

Liang Ding, Longyue Wang, and Dacheng Tao. 2020a. Self-attention with cross-lingual position representation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1679–1685, Online. Association for Computational Linguistics.

Liang Ding, Longyue Wang, Di Wu, Dacheng Tao, and Zhaopeng Tu. 2020b. Context-aware cross-attention for non-autoregressive translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4396–4402, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Hamidreza Ghader and Christof Monz. 2017. What does attention in neural machine translation pay attention to? In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 30–39, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Hany Hassan, Anthony Aue, C. Chen, Vishal Chowdhary, J. Clark, C. Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, W. Lewis, M. Li, Shujie Liu, T. Liu, Renqian Luo, Arul Menezes, Tao Qin, F. Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and M. Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. ArXiv, abs/1803.05567.

Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie. 2020. RealFormer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747.


Jinbae Im and Sungzoon Cho. 2017. Distance-based self-attention network for natural language inference.

Minami Ito, Gerald Westheimer, and Charles D Gilbert. 1998. Attention and perceptual learning modulate contextual influences on visual perception. Neuron, 20(6):1191–1197.

Beck Jacob and Ambler Bruce. 1973. The effects of concentrated and distributed attention on peripheral acuity. Perception & Psychophysics, 14(2).

Shaohui Kuang, Junhui Li, António Branco, Weihua Luo, and Deyi Xiong. 2018. Attention focusing for neural machine translation by bridging source and target embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1767–1776, Melbourne, Australia. Association for Computational Linguistics.

Xintong Li, Guanlin Li, Lemao Liu, Max Meng, and Shuming Shi. 2019. On the word alignment from neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1293–1303, Florence, Italy. Association for Computational Linguistics.

Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3093–3102, Osaka, Japan. The COLING 2016 Organizing Committee.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Friedrich Pukelsheim. 1994. The three sigma rule. The American Statistician, 48(2):88–91.


Carl Edward Rasmussen. 1999. The infinite Gaussian mixture model. In NIPS, pages 554–560. The MIT Press.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Furong Peng, and Jianfeng Lu. 2018. Hybrid self-attention network for machine translation.

Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel. 2018. Self-attentional acoustic models. pages 3723–3727.

Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2019. Understanding neural machine translation by simplification: The case of encoder-free models. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1186–1193, Varna, Bulgaria. INCOMA Ltd.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Xing Wang, Zhaopeng Tu, Longyue Wang, and Shuming Shi. 2019. Self-attention with structural position representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1403–1409, Hong Kong, China. Association for Computational Linguistics.

Mingzhou Xu, Derek F. Wong, Baosong Yang, Yue Zhang, and Lidia S. Chao. 2019. Leveraging local and global patterns for self-attention networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3069–3075, Florence, Italy. Association for Computational Linguistics.

Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling localness for self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4449–4458, Brussels, Belgium. Association for Computational Linguistics.

Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-coded Gaussian attention for neural machine translation. pages 7689–7700.

J. Zhang, Y. Zhao, H. Li, and C. Zong. 2019. Attention with sparsity regularization for neural machine translation and summarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(3):507–518.