Simplify the Usage of Lexicon in Chinese NERNamed Entity Recognition (NER) is concerned with the...

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5951–5960July 5 - 10, 2020. c©2020 Association for Computational Linguistics

5951

Simplify the Usage of Lexicon in Chinese NER

Ruotian Ma1∗ , Minlong Peng1∗, Qi Zhang1,3, Zhongyu Wei2,3, Xuanjing Huang1

1Shanghai Key Laboratory of Intelligent Information Processing,School of Computer Science, Fudan University

2School of Data Science, Fudan University3Research Institute of Intelligent and Complex Systems, Fudan University

{rtma19,mlpeng16,qz,zywei,xjhuang}@fudan.edu.cn

Abstract

Recently, many works have tried to augmentthe performance of Chinese named entityrecognition (NER) using word lexicons. Asa representative, Lattice-LSTM (Zhang andYang, 2018) has achieved new benchmarkresults on several public Chinese NERdatasets. However, Lattice-LSTM has acomplex model architecture. This limits itsapplication in many industrial areas wherereal-time NER responses are needed.

In this work, we propose a simple but effectivemethod for incorporating the word lexicon intothe character representations. This methodavoids designing a complicated sequence mod-eling architecture, and for any neural NERmodel, it requires only subtle adjustment of thecharacter representation layer to introduce thelexicon information. Experimental studies onfour benchmark Chinese NER datasets showthat our method achieves an inference speedup to 6.15 times faster than those of state-of-the-art methods, along with a better perfor-mance. The experimental results also showthat the proposed method can be easily incor-porated with pre-trained models like BERT. 1

1 Introduction

Named Entity Recognition (NER) is concernedwith the identification of named entities, suchas persons, locations, and organizations, in un-structured text. NER plays an important rolein many downstream tasks, including knowledgebase construction (Riedel et al., 2013), informationretrieval (Chen et al., 2015), and question answer-ing (Diefenbach et al., 2018). In languages wherewords are naturally separated (e.g., English), NERhas been conventionally formulated as a sequence

∗Equal contribution.1The source code of this paper is publicly

available at https://github.com/v-mipeng/LexiconAugmentedNER.

labeling problem, and the state-of-the-art resultshave been achieved using neural-network-basedmodels (Huang et al., 2015; Chiu and Nichols,2016; Liu et al., 2018).

Compared with NER in English, Chinese NERis more difficult since sentences in Chinese are notnaturally segmented. Thus, a common practice forChinese NER is to first perform word segmentationusing an existing CWS system and then apply aword-level sequence labeling model to the seg-mented sentence (Yang et al., 2016; He and Sun,2017b). However, it is inevitable that the CWSsystem will incorrectly segment query sentences.This will result in errors in the detection of entityboundary and the prediction of entity categoryin NER. Therefore, some approaches resort toperforming Chinese NER directly at the characterlevel, which has been empirically proven to beeffective (He and Wang, 2008; Liu et al., 2010; Liet al., 2014; Liu et al., 2019; Sui et al., 2019; Guiet al., 2019b; Ding et al., 2019).

A drawback of the purely character-based NERmethod is that the word information is not fully ex-ploited. With this consideration, Zhang and Yang,(2018) proposed Lattice-LSTM for incorporatingword lexicons into the character-based NER model.Moreover, rather than heuristically choosing a wordfor the character when it matches multiple wordsin the lexicon, the authors proposed to preserveall words that match the character, leaving thesubsequent NER model to determine which wordto apply. To realize this idea, they introduced anelaborate modification to the sequence modelinglayer of the LSTM-CRF model (Huang et al., 2015).Experimental studies on four Chinese NER datasetshave verified the effectiveness of Lattice-LSTM.

However, the model architecture of Lattice-LSTM is quite complicated. In order to introducelexicon information, Lattice-LSTM adds severaladditional edges between nonadjacent characters

https://github.com/v-mipeng/LexiconAugmentedNER

https://github.com/v-mipeng/LexiconAugmentedNER

5952

in the input sequence, which significantly slowsits training and inference speeds. In addition,it is difficult to transfer the structure of Lattice-LSTM to other neural-network architectures (e.g.,convolutional neural networks and transformers)that may be more suitable for some specific tasks.

In this work, we propose a simpler method torealize the idea of Lattice-LSTM, i.e., incorporat-ing all the matched words for each character to acharacter-based NER model. The first principleof our model design is to achieve a fast inferencespeed. To this end, we propose to encode lexiconinformation in the character representations, andwe design the encoding scheme to preserve asmuch of the lexicon matching results as possible.Compared with Lattice-LSTM, our method avoidsthe need for a complicated model architecture, iseasier to implement, and can be quickly adapted toany appropriate neural NER model by adjustingthe character representation layer. In addition,ablation studies show the superiority of our methodin incorporating more complete and distinct lexiconinformation, as well as introducing a more effectiveword-weighting strategy. The contributions of thiswork can be summarized as follows:

• We propose a simple but effective method forincorporating word lexicons into the characterrepresentations for Chinese NER.

• The proposed method is transferable to differ-ent sequence-labeling architectures and can beeasily incorporated with pre-trained modelslike BERT (Devlin et al., 2018).

We performed experiments on four public ChineseNER datasets. The experimental results show thatwhen implementing the sequence modeling layerwith a single-layer Bi-LSTM, our method achievesconsiderable improvements over the state-of-the-art methods in both inference speed and sequencelabeling performance.

2 Background

In this section, we introduce several previous worksthat influenced our work, including the Softwordtechnique and Lattice-LSTM.

2.1 Softword Feature

The Softword technique was originally used forincorporating word segmentation information intodownstream tasks (Zhao and Kit, 2008; Peng and

Dredze, 2016). It augments the character repre-sentation with the embedding of its correspondingsegmentation label:

xcj ← [xcj ; eseg(seg(cj))]. (1)

Here, seg(cj) ∈ Yseg denotes the segmentationlabel of the character cj predicted by the wordsegmentor, eseg denotes the segmentation labelembedding lookup table, and typically Yseg ={B,M,E,S}.

However, gold segmentation is not provided inmost datasets, and segmentation results obtainedby a segmenter can be incorrect. Therefore,segmentation errors will inevitably be introducedthrough this approach.

2.2 Lattice-LSTMLattice-LSTM designs to incorporate lexicon in-formation into the character-based neural NERmodel. To achieve this purpose, lexicon matchingis first performed on the input sentence. If the sub-sequence {ci, · · · , cj} of the sentence matches aword in the lexicon for i < j, a directed edge isadded from ci to cj . All lexicon matching resultsrelated to a character are preserved by allowingthe character to be connected with multiple othercharacters. Intrinsically, this practice converts theinput form of a sentence from a chain into a graph.

In a normal LSTM layer, the hidden state hi andthe memory cell ci of each time step is updated by:

hi, ci = f(hj−1, cj−1, xcj), (2)

However, in order to model the graph-based input,Lattice-LSTM introduces an elaborate modificationto the normal LSTM. Specifically, let s<∗,j>denote the list of sub-sequences of sentence s thatmatch the lexicon and end with cj , h<∗,j> denotethe corresponding hidden state list {hi,∀s<i,j> ∈s<∗,j>}, and c<∗,j> denote the correspondingmemory cell list {ci,∀s<i,j> ∈ s<∗,j>}. InLattice-LSTM, the hidden state hj and memorycell cj of cj are now updated as follows:

hj , cj = f(hj−1, cj−1, xcj , s<∗,j>, h<∗,j>, c<∗,j>),

(3)where f is a simplified representation of thefunction used by Lattice-LSTM to perform memoryupdate.

From our perspective, there are two main advan-tages to Lattice-LSTM. First, it preserves all thepossible lexicon matching results that are related to

5953

a character, which helps avoid the error propagationproblem introduced by heuristically choosing asingle matching result for each character. Second,it introduces pre-trained word embeddings to thesystem, which greatly enhances its performance.

However, efficiency problems exist in Lattice-LSTM. Compared with normal LSTM, Lattice-LSTM needs to additionally model s<∗,j>, h<∗,j>,and c<∗,j> for memory update, which slows thetraining and inference speeds. Additionally, due tothe complicated implementation of f , it is difficultfor Lattice-LSTM to process multiple sentencesin parallel (in the published implementation ofLattice-LSTM, the batch size was set to 1). Theseproblems limit its application in some industrialareas where real-time NER responses are needed.

3 Approach

In this work, we sought to retain the merits ofLattice-LSTM while overcoming its drawbacks. Tothis end, we propose a novel method in which lexi-con information is introduced by simply adjustingthe character representation layer of an NER model.We refer to this method as SoftLexicon. As shownin Figure 1, the overall architecture of the proposedmethod is as follows. First, each character of theinput sequence is mapped into a dense vector. Next,the SoftLexicon feature is constructed and addedto the representation of each character. Then, theseaugmented character representations are put intothe sequence modeling layer and the CRF layer toobtain the final predictions.

3.1 Character Representation LayerFor a character-based Chinese NER model, theinput sentence is seen as a character sequence s ={c1, c2, · · · , cn} ∈ Vc, where Vc is the charactervocabulary. Each character ci is represented usinga dense vector (embedding):

xci = ec(ci), (4)

where ec denotes the character embedding lookuptable.

Char + bichar. In addition, Zhang and Yang,(2018) has proved that character bigrams are usefulfor representing characters, especially for thosemethods not using word information. Therefore,it is common to augment the character representa-tions with bigram embeddings:

xci = [ec(ci); eb(ci, ci+1)], (5)

Match in the lexicon

中国语⾔学

语⾔ (Language)

语⾔学 (Linguistic)

国语 (National language)

中国语⾔ (Chinese language)

中国语 (Chinese language)

语 (Language; Say)

C3

语⾔ f3,4

语⾔学 f3,5

中国语⾔ f1,4

国语 f2,3

中国语 f1,3

语 f3,3

weight

B

M

E

S

⨁

⨁

⨁

⨁

C4

B

M

E

S

⨁

⨁

⨁

⨁

C5

B

M

E

S

⨁

⨁

⨁

⨁

C2

B

M

E

S

⨁

⨁

⨁

⨁

C1

B

M

E

S

⨁

⨁

⨁

⨁

Bi-LSTM / CNN / Transformer layer

B-LOC E-LOC O O O

Char emb

SoftLexicon feature

Concatenation

Sequence encoding

layer

CRF layer

Predictions

Input sequence

…

…

…

Figure 1: The overall architecture of the proposedmethod.

where eb denotes the bigram embedding lookuptable.

3.2 Incorporating Lexicon InformationThe problem with the purely character-based NERmodel is that it fails to exploit word information.To address this issue, we proposed two methods, asdescribed below, to introduce the word informationinto the character representations. In the following,for any input sequence s = {c1, c2, · · · , cn}, wi,j

denotes its sub-sequence {ci, ci+1, · · · , cj}.

ExSoftword FeatureThe first conducted method is an intuitive exten-sion of the Softword method, called ExSoftword.Instead of choosing one segmentation result foreach character, it proposes to retain all possiblesegmentation results obtained using the lexicon:

xcj ← [xc

j ; eseg(segs(cj)], (6)

where segs(cj) denotes all segmentation labelsrelated to cj , and eseg(segs(cj)) is a 5-dimensionalmulti-hot vector with each dimension correspond-ing to an item of {B, M, E, S, O}.

As an example presented in Figure 2, thecharacter c7 (“西”) occurs in two words, w5,8

(“中山西路”) and w6,7 (“山西”), that match thelexicon, and it occurs in the middle of “中山西路” and the end of “山西”. Therefore, itscorresponding segmentation result is {M,E}, andits character representation is enriched as follows:

xc7 ← [xc

7; eseg({M,E})]. (7)

5954

中Zhong

山Hill

西East

路Road

在On

住Live

明Ming

李Li

中山Zhongshan City

中山西路West Zhongshan Road

李明Ming Li (person name)

c𝟏 𝒄2 𝒄8𝒄7𝒄6𝒄5𝒄4𝒄3

𝒘𝟔,𝟕𝒘𝟓,𝟔 𝒘𝟓,𝟖𝒘𝟏,𝟐

{ 𝐁 } { 𝐁,𝐌, 𝐄 } { 𝐌, 𝐄 } { 𝐄 }

中山西East Zhongshan City

山西路Shanxi Road


山西Shanxi Province

𝒘𝟓,𝟔 𝒘𝟓,𝟕 𝒘𝟔,𝟖

Cannot restore!

ExSoftword method

Figure 2: The ExSoftword method.

Here, the second and third dimensions of eseg(·)are set to 1, and the rest dimensions are set to 0.

The problem of this approach is that it cannotfully inherit the two merits of Lattice-LSTM. First,it fails to introduce pre-trained word embeddings.Second, it still losses information of the matchingresults. As shown in Figure 2, the constructedExSoftword feature for characters {c5, c6, c7, c8} is{{B}, {B,M,E}, {M,E}, {E}}. However, giventhis constructed sequence, there exists more thanone corresponding matching results, such as {w5,6

(“中山”), w5,7 (“中山西”), w6,8 (“山西路”)} and{w5,6 (“中山”), w6,7 (“山西”), w5,8 (“中山西路”)}. Therefore, we cannot tell which is thecorrect result to be restored.

SoftLexiconBased on the analysis on Exsoftword, we furtherdeveloped the SoftLexicon method to incorporatethe lexicon information. The SoftLexicon featuresare constructed in three steps.

Categorizing the matched words. First, to re-tain the segmentation information, all matchedwords of each character ci is categorized into fourword sets “BMES”, which is marked by the foursegmentation labels. For each character ci in theinput sequence = {c1, c2, · · · , cn}, the four set isconstructed by:

B(ci) = {wi,k,∀wi,k ∈ L, i < k ≤ n},M(ci) = {wj,k,∀wj,k ∈ L, 1 ≤ j < i < k ≤ n},E(ci) = {wj,i,∀wj,i ∈ L, 1 ≤ j < i},S(ci) = {ci, ∃ci ∈ L}.

(8)

Here, L denotes the lexicon we use in this work.Additionally, if a word set is empty, a specialword “NONE” is added to the empty word set. Anexample of this categorization approach is shownin Figure 3. Noted that in this way, not only we

李Li

𝐁 = "山西"𝐌 = {"中山西路"}𝐄 = "中山"𝐒 = "山"

𝐁 = "𝑵𝒐𝒏𝒆"𝐌 = "𝑵𝒐𝒏𝒆"𝐄 = "中山西路"𝐒 = "路"

Soft-lexicon method

中Zhong

山Hill

西East

路Road

在On

住Live

明Ming


中山西路West Zhongshan Road

李明Ming Li (person name)

c𝟏 𝒄2 𝒄8𝒄7𝒄6𝒄5𝒄4𝒄3

𝒘𝟔,𝟕𝒘𝟓,𝟔 𝒘𝟓,𝟖𝒘𝟏,𝟐

山西Shanxi Province

Figure 3: The SoftLexicon method.

can introduce the word embedding, but also noinformation loss exists since the matching resultscan be exactly restored from the four word sets ofthe characters.

Condensing the word sets. After obtaining the“BMES” word sets for each character, each wordset is then condensed into a fixed-dimensionalvector. In this work, we explored two approachesfor implementing this condensation.

The first implementation is the intuitive mean-pooling method:

vs(S) = 1

|S|∑w∈S

ew(w). (9)

Here, S denotes a word set and ew denotes theword embedding lookup table.

However, as shown in Table 8, the resultsof empirical studies revealed that this algorithmdoes not perform well. Therefore, a weightingalgorithm is introduced to further leverage theword information. To maintain computationalefficiency, we did not opt for a dynamic weightingalgorithm like attention. Instead, we propose usingthe frequency of each word as an indication of itsweight. Since the frequency of a word is a staticvalue that can be obtained offline, this can greatlyaccelerate the calculation of the weight of eachword.

Specifically, let z(w) denote the frequency thata lexicon word w occurs in the statistical data,the weighted representation of the word set S isobtained as follows:

vs(S) =4

Z

∑w∈S

z(w)ew(w), (10)

whereZ =

∑w∈B∪M∪E∪S

z(w).

5955

Here, weight normalization is performed on allwords in the four word sets to make an overallcomparison.

In this work, the statistical data set is constructedfrom a combination of training and developing dataof the task. Of course, if there is unlabelled datain the task, the unlabeled data set can serve asthe statistical data set. In addition, note that thefrequency of w does not increase if w is coveredby another sub-sequence that matches the lexicon.This prevents the problem in which the frequencyof a shorter word is always less than the frequencyof the longer word that covers it.

Combining with character representation.The final step is to combine the representations offour word sets into one fix-dimensional feature,and add it to the representation of each character.In order to retain as much information as possible,we choose to concatenate the representations ofthe four word sets, and the final representation ofeach character is obtained by:

es (B,M,E,S) = [vs(B);vs(M);vs(E);vs(S)],

xc ← [xc; es(B,M,E, S)].(11)

Here, vs denotes the weighting function above.

3.3 Sequence Modeling LayerWith the lexicon information incorporated, thecharacter representations are then put into thesequence modeling layer, which models the depen-dency between characters. Generic architecturesfor this layer including the bidirectional long-short term memory network(BiLSTM), the Con-volutional Neural Network(CNN) and the trans-former(Vaswani et al., 2017). In this work, weimplemented this layer with a single-layer Bi-LSTM.

Here, we precisely show the definition of theforward LSTM:

itftotc̃t

=

σσσ

tanh

(W [xctht−1

]+ b

),

ct = c̃t � it + ct−1 � ft,ht = ot � tanh(ct).

(12)

where σ is the element-wise sigmoid function and� represents element-wise product. W and bare trainable parameters. The backward LSTMshares the same definition as the forward LSTM

Datasets Type Train Dev Test

OntoNotesSentence 15.7k 4.3k 4.3k

Char 491.9k 200.5k 208.1k

MSRASentence 46.4k - 4.4k

Char 2169.9k - 172.6k

WeiboSentence 1.4k 0.27k 0.27k

Char 73.8k 14.5 14.8k

ResumeSentence 3.8k 0.46 0.48k

Char 124.1k 13.9k 15.1k

Table 1: Statistics of datasets.

yet model the sequence in a reverse order. Theconcatenated hidden states at the ith step of theforward and backward LSTMs hi = [

−→h i;←−h i]

forms the context-dependent representation of ci.

3.4 Label Inference LayerOn top of the sequence modeling layer, it is typicalto apply a sequential conditional random field(CRF) (Lafferty et al., 2001) layer to perform labelinference for the whole character sequence at once:

p(y|s;θ) =∏n

t=1 φt(yt−1, yt|s)∑y′∈Ys

∏nt=1 φt(y

′t−1, y

′t|s)

. (13)

Here, Ys denotes all possible label sequencesof s, and φt(y

′, y|s) = exp(wTy′,yht + by′,y),

where wy′,y and by′,y are trainable parameterscorresponding to the label pair (y′, y), and θdenotes model parameters. For label inference, itsearches for the label sequence y∗ with the highestconditional probability given the input sequence s:

y∗ =y p(y|s;θ), (14)

which can be efficiently solved using the Viterbialgorithm (Forney, 1973).

4 Experiments

4.1 Experiment SetupMost experimental settings in this work followedthe protocols of Lattice-LSTM (Zhang and Yang,2018), including tested datasets, compared base-lines, evaluation metrics (P, R, F1), and so on.To make this work self-completed, we conciselyillustrate some primary settings of this work.

DatasetsThe methods were evaluated on four Chinese NERdatasets, including OntoNotes (Weischedel et al.,2011), MSRA (Levow, 2006), Weibo NER (Peng

5956

Models OntoNotes MSRA Weibo ResumeLattice-LSTM 1× 1× 1× 1×LR-CNN (Gui et al., 2019) 2.23× 1.57× 2.41× 1.44×BERT-tagger 2.56× 2.55× 4.45× 3.12×BERT + LSTM + CRF 2.77× 2.32× 2.84× 2.38×SoftLexicon (LSTM) 6.15× 5.78× 6.10× 6.13×SoftLexicon (LSTM) + bichar 6.08× 5.95× 5.91× 6.45×SoftLexicon (LSTM) + BERT 2.74× 2.33× 2.85× 2.32×

Table 2: Inference speed (average sentences per second,the larger the better) of our method with LSTM layercompared with Lattice-LSTM, LR-CNN and BERT.

and Dredze, 2015; He and Sun, 2017a), andResume NER (Zhang and Yang, 2018). OntoNotesand MSRA are from the newswire domain, wheregold-standard segmentation is available for trainingdata. For OntoNotes, gold segmentation is alsoavailable for development and testing data. WeiboNER and Resume NER are from social media andresume, respectively. There is no gold standardsegmentation in these two datasets. Table 1 showsstatistic information of these datasets. As for thelexicon, we used the same one as Lattice-LSTM,which contains 5.7k single-character words, 291.5ktwo-character words, 278.1k three-character words,and 129.1k other words. In addition, the pre-trained character embeddings we used are also thesame with Lattice-LSTM, which are pre-trained onChinese Giga-Word using word2vec.

Implementation DetailIn this work, we implement the sequence-labelinglayer with Bi-LSTM. Most implementation de-tails followed those of Lattice-LSTM, includingcharacter and word embedding sizes, dropout,embedding initialization, and LSTM layer number.Additionally, the hidden size was set to 200 forsmall datasets Weibo and Resume, and 300 forlarger datasets OntoNotes and MSRA. The initiallearning rate was set to 0.005 for Weibo and 0.0015for the rest three datasets with Adamax (Kingmaand Ba, 2014) step rule 2.

4.2 Computational Efficiency Study

Table 2 shows the inference speed of the Soft-Lexicon method when implementing the sequencemodeling layer with a bi-LSTM layer. The speedwas evaluated based on the average number ofsentences processed by the model per secondusing a GPU (NVIDIA TITAN X). From the

2Please refer to the attached source code for moreimplementation detail of this work and access https://github.com/jiesutd/LatticeLSTM for pre-trainedword and character embeddings.

20 40 60 80 100Sentence length

0

25

50

75

100

125

150

175

Infe

renc

e sp

eed

(st/s

)

Soft-lexiconLattice-LSTMLR_CNN

Figure 4: Inference speed against sentence length. Weuse a same batch size of 1 for a fair speed comparison.

table, we can observe that when decoding withthe same batch size (=1), the proposed methodis considerably more efficient than Lattice-LSTMand LR-CNN, performing up to 6.15 times fasterthan Lattice-LSTM. The inference speeds of Soft-Lexicon(LSTM) with bichar are close to thosewithout bichar, since we only concatenate anadditional feature to the character representation.The inference speeds of the BERT-Tagger andSoftLexicon (LSTM) + BERT models are limiteddue to the deep layers of the BERT structure.However, the speeds of the SoftLexicon (LSTM) +BERT model are still faster than those of Lattice-LSTM and LR-CNN on all datasets.

To further illustrate the efficiency of the Soft-Lexicon method, we also conducted an experimentto evaluate its inference speed against sentencesof different lengths, as shown in Table 4. Fora fair comparison, we set the batch size to 1in all of the compared methods. The resultsshow that the proposed method achieves significantimprovement in speed over Lattice-LSTM andLR-CNN when processing short sentences. Withthe increase of sentence length, the proposedmethod is consistently faster than Lattice-LSTMand LR-CNN despite the speed degradation dueto the recurrent architecture of LSTM. Overall,the proposed SoftLexicon method shows a greatadvantage over other methods in computationalefficiency.

4.3 Effectiveness Study

Tables 3−63 show the performances of our methodagainst the compared baselines. In this study,the sequence modeling layer of our method was

3In Table 3−5, ∗ indicates that the model uses externallabeled data for semi-supervised learning. † means that themodel also uses discrete features.

https://github.com/jiesutd/LatticeLSTM

https://github.com/jiesutd/LatticeLSTM

5957

Input Models P R F1

Gold seg

Yang et al., 2016 65.59 71.84 68.57Yang et al., 2016∗† 72.98 80.15 76.40Che et al., 2013∗ 77.71 72.51 75.02Wang et al., 2013∗ 76.43 72.32 74.32Word-based (LSTM) 76.66 63.60 69.52

+ char + bichar 78.62 73.13 75.77

Auto segWord-based (LSTM) 72.84 59.72 65.63

+ char + bichar 73.36 70.12 71.70

No seg

Char-based (LSTM) 68.79 60.35 64.30+ bichar + softword 74.36 69.43 71.89+ ExSoftword 69.90 66.46 68.13+ bichar + ExSoftword 73.80 71.05 72.40

Lattice-LSTM 76.35 71.56 73.88LR-CNN (Gui et al., 2019) 76.40 72.60 74.45SoftLexicon (LSTM) 77.28 74.07 75.64SoftLexicon (LSTM) + bichar 77.13 75.22 76.16BERT-Tagger 76.01 79.96 77.93BERT + LSTM + CRF 81.99 81.65 81.82SoftLexicon (LSTM) + BERT 83.41 82.21 82.81

Table 3: Performance on OntoNotes. A model followedby (LSTM) (e.g., Proposed (LSTM)) indicates that itssequence modeling layer is LSTM-based.

implemented with a single layer bidirectionalLSTM.

OntoNotes. Table 3 shows results 4 on theOntoNotes dataset, where gold word segmentationis provided for both training and testing data.The methods of the “Gold seg” and the “Autoseg” groups are all word-based, with the formerinput building on gold word segmentationresults and the latter building on automatic wordsegmentation results by a segmenter trained onOntoNotes training data. The methods used inthe “No seg” group are character-based. From thetable, we can make several observations. First,when gold word segmentation was replaced byautomatically generated word segmentation, theF1 score decreases from 75.77% to 71.70%. Thisreveals the problem of treating the predictedword segmentation result as the true result in theword-based Chinese NER. Second, the F1 scoreof the Char-based (LSTM)+ExSoftword modelis greatly improved from that of the Char-based(LSTM) model. This indicates the feasibility ofthe naive ExSoftword method. However, it stillgreatly underperforms relative to Lattice-LSTM,which reveals its deficiency in utilizing wordinformation. Lastly, the proposed SoftLexiconmethod outperforms Lattice-LSTM by 1.76%with respect to the F1 score, and obtains a greaterimprovement of 2.28% combining the bichar

4A result in boldface indicates that it is statisticallysignificantly better (p < 0.01 in pairwise t−test) than theothers in the same box.

Models P R F1Chen et al., 2006 91.22 81.71 86.20Zhang et al. 2006∗ 92.20 90.18 91.18Zhou et al. 2013 91.86 88.75 90.28Lu et al. 2016 - - 87.94Dong et al. 2016 91.28 90.62 90.95Char-based (LSTM) 90.74 86.96 88.81+ bichar+softword 92.97 90.80 91.87+ ExSoftword 90.77 87.23 88.97+ bichar+ExSoftword 93.21 91.57 92.38


Table 4: Performance on MSRA.

Models NE NM OverallPeng and Dredze, 2015 51.96 61.05 56.05Peng and Dredze, 2016∗ 55.28 62.97 58.99He and Sun, 2017a 50.60 59.32 54.82He and Sun, 2017b∗ 54.50 62.17 58.23Char-based (LSTM) 46.11 55.29 52.77

+ bichar+softword 50.55 60.11 56.75+ ExSoftword 44.65 55.19 52.42+ bichar+ExSoftword 58.93 53.38 56.02


Table 5: Performance on Weibo. NE, NM and Overalldenote F1 scores for named entities, nominal entities(excluding named entities) and both, respectively.

feature. It even performs comparably with theword-based methods of the “Gold seg” group,verifying its effectiveness on OntoNotes.

MSRA/Weibo/Resume. Tables 4, 5 and 6 showresults on the MSRA, Weibo and Resume datasets,respectively. Compared methods include thebest statistical models on these data set, whichleveraged rich handcrafted features (Chen et al.,2006; Zhang et al., 2006; Zhou et al., 2013),character embedding features (Lu et al., 2016; Pengand Dredze, 2016), radical features (Dong et al.,2016), cross-domain data, and semi-superviseddata (He and Sun, 2017b). From the tables, wecan see that the performance of the proposed Soft-lexion method is significant better than that ofLattice-LSTM and other baseline methods on allthree datasets.

5958

Models P R F1Word-based (LSTM) 93.72 93.44 93.58

+char+bichar 94.07 94.42 94.24Char-based (LSTM) 93.66 93.31 93.48

+ bichar+softword 94.53 94.29 94.41+ ExSoftword 95.29 94.42 94.85+ bichar+ExSoftword 96.14 94.72 95.43


Table 6: Performance on Resume.

Models OntoNotes MSRA Weibo ResumeSoftLexicon (LSTM) 75.64 93.66 61.42 95.53ExSoftword (CNN) 68.11 90.02 53.93 94.49SoftLexicon (CNN) 74.08 92.19 59.65 95.02ExSoftword (Transformer) 64.29 86.29 52.86 93.78SoftLexicon (Transformer) 71.21 90.48 61.04 94.59

Table 7: F1 score with different implementations of thesequence modeling layer. ExSoftword is the shorthandof Char-based+bichar+ExSoftword.

4.4 Transferability Study

Table 7 shows the performance of the SoftLexiconmethod when implementing the sequence modelinglayer with different neural architecture. Fromthe table, we can first see that the LSTM-basedarchitecture performed better than the CNN- andtransformer- based architectures. In addition, ourmethod with different sequence modeling layersconsistently outperformed their corresponding Ex-Softword baselines. This confirms the superiorityof our method in modeling lexicon information indifferent neural NER models.

4.5 Combining Pre-trained Model

We also conducted experiments on the four datasetsto further verify the effectiveness of SoftLexicon incombination with pre-trained model, the resultsof which are shown in Tables 3−6. In theseexperiments, we first use a BERT encoder to obtainthe contextual representations of each sequenc,and then concatenated them into the characterrepresentations. From the table, we can see thatthe SoftLexicon method with BERT outperformsthe BERT tagger on all four datasets. Theseresults show that the SoftLexicon method canbe effectively combined with pre-trained model.Moreover, the results also verify the effectivenessof our method in utilizing lexicon information,

Models OntoNotes MSRA Weibo ResumeSoftLexicon (LSTM) 75.64 93.66 61.42 95.53

- “M” group 75.06 93.09 58.13 94.72- Distinction 70.29 92.08 54.85 94.30- Weighted pooling 72.57 92.76 57.72 95.33- Overall weighting 74.28 93.16 59.55 94.92

Table 8: An ablation study of the proposed model.

which means it can complement the informationobtained from the pre-trained model.

4.6 Ablation Study

To investigate the contribution of each componentof our method, we conducted ablation experimentson all four datasets, as shown in table 8.

(1) In Lattice-LSTM, each character receivesword information only from the words that beginor end with it. Thus, the information of thewords that contain the character inside is ignored.However, the SoftLexicon prevents the loss of thisinformation by incorporating the “Middle” groupof words. In the “ - ‘M’ group” experiment, weremoved the ”Middle” group in SoftLexicon, asin Lattice-LSTM. The degradation in performanceon all four datasets indicates the importance of the“M” group of words, and confirms the advantage ofour method.

(2) Our method proposed to draw a clear dis-tinction between the four “BMES” categories ofmatched words. To study the relative contributionof this design, we conducted experiments to removethis distinction, i.e., we simply added up all theweighted words regardless of their categories. Thedecline in performance verifies the significance ofa clear distinction for different matched words.

(3) We proposed two strategies for pooling thefour word sets in Section 3.2. In the “- Weightedpooling” experiment, the weighted pooling strategywas replaced with mean-pooling, which degradesthe performance. Compared with mean-pooling,the weighting strategy not only succeeds in weigh-ing different words by their significance, but alsointroduces the frequency information of each wordin the statistical data, which is verified to be helpful.

(4) Although existing lexicon-based methodslike Lattice-LSTM also use word weighting, un-like the proposed Soft-lexion method, they failto perform weight normalization among all thematched words. For example, Lattice-LSTM onlynormalizes the weights inside the “B” group or the”E” group. In the “- Overall weighting” experiment,we performed weight normalization inside each

5959

“BMES” group as Lattice-LSTM does, and foundthe resulting performance to be degraded. Thisresult shows that the ability to perform overallweight normalization among all matched wordsis also an advantage of our method.

5 Conclusion

In this work, we addressed the computational effi-ciency of utilizing word lexicons in Chinese NER.To obtain a high-performing Chinese NER systemwith a fast inference speed, we proposed a novelmethod to incorporate the lexicon information intothe character representations. Experimental studieson four benchmark Chinese NER datasets revealthat our method can achieve a much faster inferencespeed and better performance than the comparedstate-of-the-art methods.

Acknowledgements

The authors wish to thank the anonymousreviewers for their helpful comments. Thiswork was partially funded by China NationalKey RD Program (No. 2018YFB1005104,2018YFC0831105, 2017YFB1002104), NationalNatural Science Foundation of China (No.61976056, 61532011, 61751201), ShanghaiMunicipal Science and Technology Major Project(No.2018SHZDZX01), Science and TechnologyCommission of Shanghai Municipality Grant(No.18DZ1201000, 16JC1420401, 17JC1420200).

ReferencesWanxiang Che, Mengqiu Wang, Christopher D Man-

ning, and Ting Liu. 2013. Named entity recognitionwith bilingual constraints. In NAACL, pages 52–62.

Aitao Chen, Fuchun Peng, Roy Shan, and GordonSun. 2006. Chinese named entity recognitionwith conditional probabilistic models. In SIGHANWorkshop on Chinese Language Processing.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, andJun Zhao. 2015. Event extraction via dynamicmulti-pooling convolutional neural networks. InACL—IJCNLP, volume 1, pages 167–176.

Jason Chiu and Eric Nichols. 2016. Namedentity recognition with bidirectional lstm-cnns.Transactions of the Association of ComputationalLinguistics, 4(1):357–370.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. Bert: Pre-trainingof deep bidirectional transformers for languageunderstanding. arXiv preprint arXiv:1810.04805.

Dennis Diefenbach, Vanessa Lopez, Kamal Singh, andPierre Maret. 2018. Core techniques of questionanswering systems over knowledge bases: a survey.KAIS, 55(3):529–569.

Ruixue Ding, Pengjun Xie, Xiaoyan Zhang, Wei Lu,Linlin Li, and Luo Si. 2019. A neural multi-digraph model for chinese ner with gazetteers. InProceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages1462–1467.

Chuanhai Dong, Jiajun Zhang, Chengqing Zong,Masanori Hattori, and Hui Di. 2016. Character-based lstm-crf with radical-level features for chinesenamed entity recognition. In Natural LanguageUnderstanding and Intelligent Applications, pages239–250. Springer.

G David Forney. 1973. The viterbi algorithm.Proceedings of the IEEE, 61(3):268–278.

Tao Gui, Ruotian Ma, Qi Zhang, Lujun Zhao, Yu-GangJiang, and Xuanjing Huang. 2019a. Cnn-basedchinese ner with lexicon rethinking. In Proceedingsof the 28th International Joint Conference onArtificial Intelligence, pages 4982–4988. AAAIPress.

Tao Gui, Yicheng Zou, Qi Zhang, Minlong Peng, JinlanFu, Zhongyu Wei, and Xuan-Jing Huang. 2019b.A lexicon-based graph neural network for chinesener. In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference onNatural Language Processing (EMNLP-IJCNLP),pages 1039–1049.

Hangfeng He and Xu Sun. 2017a. F-score driven maxmargin neural network for named entity recognitionin chinese social media. In Proceedings of the15th Conference of the European Chapter of theAssociation for Computational Linguistics: Volume2, Short Papers, pages 713–718.

Hangfeng He and Xu Sun. 2017b. A unified modelfor cross-domain and semi-supervised named entityrecognition in chinese social media. In AAAI.

Jingzhou He and Houfeng Wang. 2008. Chinesenamed entity recognition and word segmentationbased on character. In Proceedings of theSixth SIGHAN Workshop on Chinese LanguageProcessing.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirec-tional lstm-crf models for sequence tagging. arXivpreprint arXiv:1508.01991.

Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

John Lafferty, Andrew McCallum, and Fernando CNPereira. 2001. Conditional random fields: Prob-abilistic models for segmenting and labeling se-quence data.

5960

Gina-Anne Levow. 2006. The third internationalchinese language processing bakeoff: Word segmen-tation and named entity recognition. In SIGHANWorkshop on Chinese Language Processing, pages108–117.

Haibo Li, Masato Hagiwara, Qi Li, and Heng Ji. 2014.Comparison of the impact of word segmentation onname tagging for chinese and japanese. In LREC,pages 2532–2536.

Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Xu, HuanGui, Jian Peng, and Jiawei Han. 2018. Empowersequence labeling with task-aware neural languagemodel. AAAI Conference on Artificial Intelligence.

Wei Liu, Tongge Xu, Qinghua Xu, Jiayu Song, andYueran Zu. 2019. An encoding strategy based word-character lstm for chinese ner. In Proceedings ofthe 2019 Conference of the North American Chapterof the Association for Computational Linguistics:Human Language Technologies, Volume 1 (Longand Short Papers), pages 2379–2389.

Zhangxun Liu, Conghui Zhu, and Tiejun Zhao. 2010.Chinese named entity recognition with a sequencelabeling approach: based on characters, or basedon words? In Advanced intelligent computingtheories and applications. With aspects of artificialintelligence, pages 634–640. Springer.

Yanan Lu, Yue Zhang, and Dong-Hong Ji. 2016. Multi-prototype chinese character embedding. In LREC.

Nanyun Peng and Mark Dredze. 2015. Named entityrecognition for chinese social media with jointlytrained embeddings. In EMNLP.

Nanyun Peng and Mark Dredze. 2016. Improvingnamed entity recognition for chinese social mediawith word segmentation representation learning. InACL, page 149.

Sebastian Riedel, Limin Yao, Andrew McCallum, andBenjamin M Marlin. 2013. Relation extractionwith matrix factorization and universal schemas. InProceedings of the 2013 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 74–84.

Dianbo Sui, Yubo Chen, Kang Liu, Jun Zhao,and Shengping Liu. 2019. Leverage lexicalknowledge for chinese named entity recognition viacollaborative graph network. In Proceedings of the2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 3821–3831.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention isall you need. In Advances in Neural InformationProcessing Systems, pages 5998–6008.

Mengqiu Wang, Wanxiang Che, and Christopher DManning. 2013. Effective bilingual constraints forsemi-supervised learning of named entity recogniz-ers. In AAAI.

Ralph Weischedel, Sameer Pradhan, Lance Ramshaw,Martha Palmer, Nianwen Xue, Mitchell Marcus,Ann Taylor, Craig Greenberg, Eduard Hovy, RobertBelvin, et al. 2011. Ontonotes release 4.0.LDC2011T03, Philadelphia, Penn.: Linguistic DataConsortium.

Jie Yang, Zhiyang Teng, Meishan Zhang, and YueZhang. 2016. Combining discrete and neuralfeatures for sequence labeling. In CICLing.Springer.

Suxiang Zhang, Ying Qin, Juan Wen, and XiaojieWang. 2006. Word segmentation and named entityrecognition for sighan bakeoff3. In SIGHANWorkshop on Chinese Language Processing, pages158–161.

Yue Zhang and Jie Yang. 2018. Chinese ner usinglattice lstm. Proceedings of the 56th AnnualMeeting of the Association for ComputationalLinguistics (ACL), 1554-1564.

Hai Zhao and Chunyu Kit. 2008. Unsupervisedsegmentation helps supervised learning of charactertagging for word segmentation and named entityrecognition. In Proceedings of the Sixth SIGHANWorkshop on Chinese Language Processing.

Junsheng Zhou, Weiguang Qu, and Fen Zhang.2013. Chinese named entity recognition via jointidentification and categorization. Chinese journal ofelectronics, 22(2):225–230.

Simplify the Usage of Lexicon in Chinese NERNamed Entity Recognition (NER) is concerned with the...

Documents

Transcript of Simplify the Usage of Lexicon in Chinese NERNamed Entity Recognition (NER) is concerned with the...