AttentionMeSH: Simple, Effective and Interpretable ...

10
Proceedings of the 2018 EMNLP Workshop BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering, pages 47–56 Brussels, Belgium, November 1st, 2018. c 2018 Association for Computational Linguistics 47 AttentionMeSH: Simple, Effective and Interpretable Automatic MeSH Indexer Qiao Jin* University of Pittsburgh [email protected] Bhuwan Dhingra* Carnegie Mellon University [email protected] William W. CohenGoogle, Inc. [email protected] Xinghua Lu University of Pittsburgh [email protected] Abstract There are millions of articles in PubMed database. To facilitate information retrieval, curators in the National Library of Medicine (NLM) assign a set of Medical Subject Head- ings (MeSH) to each article. MeSH is a hierarchically-organized vocabulary, contain- ing about 28K different concepts, covering the fields from clinical medicine to informa- tion sciences. Several automatic MeSH index- ing models have been developed to improve the time-consuming and financially expensive manual annotation, including the NLM official tool – Medical Text Indexer, and the winner of BioASQ Task5a challenge – DeepMeSH. However, these models are complex and not interpretable. We propose a novel end-to-end model, AttentionMeSH, which utilizes deep learning and attention mechanism to index MeSH terms to biomedical text. The attention mechanism enables the model to associate tex- tual evidence with annotations, thus providing interpretability at the word level. The model also uses a novel masking mechanism to en- hance accuracy and speed. In the final week of BioASQ Chanllenge Task6a, we ranked 2nd by average MiF using an on-construction model. After the contest, we achieve close to state-of-the-art MiF performance of 0.684 using our final model. Human evaluations show AttentionMeSH also provides high level of interpretability, retrieving about 90% of all expert-labeled relevant words given an MeSH- article pair at 20 output. 1 Introduction MEDLINE is a database containing more than 24 million biomedical journal citations by 2018 1 . *These authors contribute equally to the paper. This work was done while the author was at CMU. 1 https://www.nlm.nih.gov/pubs/ factsheets/medline.html PubMed provides free access to MEDLINE for worldwide researchers. To facilitate information storage and retrieval, curators at the National Li- brary of Medicine (NLM) assign a set of Med- ical Subject Headings (MeSH) to each article. MeSH 2 is a hierarchically-organized terminology developed by NLM for indexing and cataloging biomedical texts like MEDLINE articles. MeSH has about 28 thousand terms by 2018 3 , covering the fields from clinical medicine to information sciences. Indexers examine the full article and annotate it with MeSH terms according to rules set by NLM 4 . Its estimated that indexing an ar- ticle costs $9.4 on average (Mork et al., 2013), and there are more than 813,500 citations added to MEDLINE in 2017 5 . Indexing all citations manu- ally would cost several million dollars in one year. Thus, several automatic annotation models have been developed to improve the time-consuming and financially expensive manual annotation. We will discuss these models in section 2.1. Automatical annotating PubMed abstracts with MeSH terms is hard in several aspects: There are 28 thousand possible classes and even more of their combinations. The frequencies of dif- ferent MeSH terms also vary a lot: The most frequent MeSH term is ‘Humans’ and it is an- notated to more than 8 million articles in the MEDLINE database; while the 20,000th frequent MeSH ‘Hypnosis, Anesthetic’ is indexed to only about 200 articles (Peng et al., 2016). It causes se- vere class imbalance problems. Above difficulties are further complicated by the fact that indexers at the NLM usually inspect the whole articles to 2 https://www.nlm.nih.gov/mesh 3 https://www.nlm.nih.gov/pubs/ factsheets/mesh.html 4 https://www.nlm.nih.gov/bsd/indexing/ training/TIP_010.html 5 https://www.nlm.nih.gov/bsd/bsd_key. html

Transcript of AttentionMeSH: Simple, Effective and Interpretable ...

Proceedings of the 2018 EMNLP Workshop BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering, pages 47–56Brussels, Belgium, November 1st, 2018. c©2018 Association for Computational Linguistics

47

AttentionMeSH: Simple, Effective and Interpretable Automatic MeSHIndexer

Qiao Jin*University of [email protected]

Bhuwan Dhingra*Carnegie Mellon [email protected]

William W. Cohen†Google, Inc.

[email protected]

Xinghua LuUniversity of [email protected]

Abstract

There are millions of articles in PubMeddatabase. To facilitate information retrieval,curators in the National Library of Medicine(NLM) assign a set of Medical Subject Head-ings (MeSH) to each article. MeSH is ahierarchically-organized vocabulary, contain-ing about 28K different concepts, coveringthe fields from clinical medicine to informa-tion sciences. Several automatic MeSH index-ing models have been developed to improvethe time-consuming and financially expensivemanual annotation, including the NLM officialtool – Medical Text Indexer, and the winnerof BioASQ Task5a challenge – DeepMeSH.However, these models are complex and notinterpretable. We propose a novel end-to-endmodel, AttentionMeSH, which utilizes deeplearning and attention mechanism to indexMeSH terms to biomedical text. The attentionmechanism enables the model to associate tex-tual evidence with annotations, thus providinginterpretability at the word level. The modelalso uses a novel masking mechanism to en-hance accuracy and speed. In the final weekof BioASQ Chanllenge Task6a, we ranked2nd by average MiF using an on-constructionmodel. After the contest, we achieve close tostate-of-the-art MiF performance of ∼ 0.684using our final model. Human evaluationsshow AttentionMeSH also provides high levelof interpretability, retrieving about 90% of allexpert-labeled relevant words given an MeSH-article pair at 20 output.

1 Introduction

MEDLINE is a database containing more than24 million biomedical journal citations by 20181.

*These authors contribute equally to the paper.†This work was done while the author was at CMU.

1https://www.nlm.nih.gov/pubs/factsheets/medline.html

PubMed provides free access to MEDLINE forworldwide researchers. To facilitate informationstorage and retrieval, curators at the National Li-brary of Medicine (NLM) assign a set of Med-ical Subject Headings (MeSH) to each article.MeSH2 is a hierarchically-organized terminologydeveloped by NLM for indexing and catalogingbiomedical texts like MEDLINE articles. MeSHhas about 28 thousand terms by 20183, coveringthe fields from clinical medicine to informationsciences. Indexers examine the full article andannotate it with MeSH terms according to rulesset by NLM4. Its estimated that indexing an ar-ticle costs $9.4 on average (Mork et al., 2013),and there are more than 813,500 citations added toMEDLINE in 20175. Indexing all citations manu-ally would cost several million dollars in one year.Thus, several automatic annotation models havebeen developed to improve the time-consumingand financially expensive manual annotation. Wewill discuss these models in section 2.1.

Automatical annotating PubMed abstracts withMeSH terms is hard in several aspects: Thereare 28 thousand possible classes and even moreof their combinations. The frequencies of dif-ferent MeSH terms also vary a lot: The mostfrequent MeSH term is ‘Humans’ and it is an-notated to more than 8 million articles in theMEDLINE database; while the 20,000th frequentMeSH ‘Hypnosis, Anesthetic’ is indexed to onlyabout 200 articles (Peng et al., 2016). It causes se-vere class imbalance problems. Above difficultiesare further complicated by the fact that indexersat the NLM usually inspect the whole articles to

2https://www.nlm.nih.gov/mesh3https://www.nlm.nih.gov/pubs/

factsheets/mesh.html4https://www.nlm.nih.gov/bsd/indexing/

training/TIP_010.html5https://www.nlm.nih.gov/bsd/bsd_key.

html

48

do the annotation, but the challenge only providesPubMed abstracts and titles, which might not beenough to find all MeSH terms. We will discuss itmore detailedly in section 5.

Deep learning is a subtype of machine learningthat arranges the computational models in multipleprocessing layers to learn the representations ofdata with multiple levels of abstractions as well asthe mapping from these features to the output (Le-Cun et al., 2015). Attention is a strategy for deeplearning models to learn both the mapping frominput to output and the relevance between inputparts and output parts (Bahdanau et al., 2014). Thelearnt relevance helps improve the mapping per-formance as well as provide interpretability. Wewill discuss relevant works of deep learning in au-tomatic annotations in section 2.2.

Here we propose a novel model, Attention-MeSH, which utilizes deep learning and attentionmechanism to index MeSH terms to biomedicaltexts and provides interpretation at the word level.Each abstract, together with title and journal name,is tokenized to words, then the model feeds wordvectors to a bidirectional gated recurrent unit (Bi-GRU) to derive word representations with con-textual information (Schuster and Paliwal, 1997;Cho et al., 2014). We narrow down the MeSHterm vocabulary for each abstract using a mask-ing mechanism. Then for each candidate MeSHterm, the model calculates the attention attributionover words. Next, each MeSH term gets a spe-cific document representations by MeSH-specificattention-weighted sum of the word vectors. Fi-nally, the model uses nonlinear layers to classifyeach MeSH term using the learnt MeSH-specificdocument representation.

We participated in BioASQ Challenge Task6Awhile developing the model. We achieveclose to state-of-the-art performance with an on-construction model in the final week of the contestand with our final model after the contest. Themodel also achieves high level of interpretabilityevaluated by human experts.

The main contributions of this work are summa-rized as follows:

1. To the best of our knowledge, Attention-MeSH is the first end-to-end deep learningmodel with soft-attention mechanism to in-dex MeSH terms in such a large scale (mil-lions of training data). With this relativelysimple model, we achieved close to state-

of-the-art performance without any sophisti-cated feature engineering or preprocessing.

2. We develop a novel masking mechanism,which is aimed to handle multi-class clas-sification problems with a large number ofclasses, like indexing MeSH. We also con-duct extensive experiments on how the mask-ing layer settings influence classification per-formance.

3. We believe AttentionMeSH is the first MeSHannotation model that is capable of providingtextual evidence and interpretations of its pre-dictions. We argue that interpretability mat-ters because humans are needed to completethe annotation task.

2 Related Work

2.1 Automatic MeSH IndexingNLM developed Medical Text Indexer (MTI), asoftware for providing human indexers with au-tomatic MeSH recommendations (Aronson et al.,2004). MTI takes as input a title and correspond-ing abstract to generate a set of recommendedMeSH terms. MTI has two steps: the first is togenerate MeSH candidates for recommendation,and the second is to filter and rank the candi-dates to give a final output. MTI uses MetaMapand nearest neighbor methods. MetaMap is an-other NLM-developed tool, which maps mentionsin biomedical texts to Unified Medical LanguageSystem concepts (Aronson, 2001).

BioASQ is an European Union-funded projectthat organizes tasks on biomedical semantic in-dexing and question answering (Tsatsaronis et al.,2015). In the task A of BioASQ, participantsare asked to annotate un-indexed PubMed arti-cles with MeSH terms using their models, be-fore they are annotated by the human indexers.The manual annotations are taken as ground truthto evaluate the participating models. DeepMeSH(Peng et al., 2016) is the winner of the latest chal-lenge, BioASQ task 5a, held in 2017. DeepMeSHalso uses a two-step strategy: the first step is togenerate MeSH candidates and predict the num-ber of output MeSH terms, and the second stepis to rank the candidates and take the highest-ranked predicted number of MeSH terms as out-put. DeepMeSH uses Term Frequency InverseDocument Frequency (TFIDF) and document tovector (D2V) schemes to represent each abstract

49

and generate MeSH candidates using binary clas-sifiers and k-nearest neighbor (KNN) methodsover using these features. TFIDF is a traditionalweighted bag of word sparse representation of thetext and D2V learns a deep semantic representa-tion of the text.

Because state-of-the-art models have less than0.7 Micro-F, automatic MeSH indexing systemscan just serve to assist human indexers. Since hu-man indexers usually add or delete MeSH termsbased on the recommendations, interpretability ofthe automatic annotations is very important forthem. In this paper we adopt a local explanationview of model interpretability (Lipton, 2016), andargue that a good system, in addition to being ac-curate, should also be able to tell which part ofthe input supports the indexed MeSH term. Thiswould allow human indexers to be more effectiveat annotating the article.

2.2 Deep Learning for Text Classification

Automatic indexing of MeSH terms to PubMedarticles is a multi-label text classification prob-lem. FastText (Joulin et al., 2016) is a simpleand effective method for classifying texts basedon n-gram embeddings. (Kim, 2014) used Con-volutional Neural Networks (CNNs) for sentence-level classification tasks with state-of-the-art per-formance on 4 out of 7 tasks they tried. Very deeparchitectures such as that of (Conneau et al., 2017)have also been proposed for text classification.Motivated by these works we use an RNN-basedmodel for classifying each MeSH term as being apositive label for a given article. We further use at-tention mechanism to boost performance and pro-vide word-level interpretability.

Recently, there has been work on automatic an-notation of International Classification of Diseasescodes from clinical texts. (Shi et al., 2017) usedcharacter-level and word-level Long Short-TermMemory netowrks to get the document represen-tations and (Mullenbach et al., 2018) used word-level 1-D CNN to get the document representa-tions. Both these works utilized a soft attentionstrategy where each class gets a specific documentrepresetation by weighted sum of the attentionover words or phrases. Mullenbach et al. (2018)also highlighted the need for interpretability whenannotating medical texts – in this work we applysimilar ideas to the domain of MeSH indexing.

3 Methods

The model architecture is visualized in Figure 1.Starting from an input abstract, title and journalname, words in the document are embedded andfed to BiGRU to derive context-aware represen-tations; KNN-derived articles from training cor-pus are identified and frequent MeSH terms inthem are included as candidate annotations for thedocument. MeSH terms are embedded, and onlythose candidates are further considered in atten-tion mechanism. We call it a masking mecha-nism. We apply an attention mechanism to as-sign attention weights to each word with respectto each candidate MeSH term, which leads to aMeSH-specific document representation. Finally,we use MeSH-specific document representationsas input to perform classifications. For each candi-date MeSH term of a document, the model outputsa probability. Binary cross-entropy loss is used fora gradient-based method to optimize the parame-ters. At inference time, the sigmoid outputs areconverted to binary variables by thresholding.

3.1 Document RepresentationFor each article to be indexed, we first tokenize thejournal name, title and abstract to words. In orderto use the pre-trained word embeddings6 providedby BioASQ organizer, we use the same tokenizeras they did. The pre-trained word embeddings aredenoted as E ∈ R|V|×de1 , where |V| is the vocab-ulary size and de1 is the embedding size.

We can represent each article by a sequence ofword embeddings corresponding to the tokenizedtext. The word embeddings are initialized by theBioASQ pre-trained word embeddings.

D =[w1 ... wL

]T ∈ RL×de1 ,

where L is the number of words in the journalname, title and abstract, and wi is a vector forword at position i.

For each document representation D, we feedthis sequence of word vectors to an BiGRU to de-rive a context-aware sequence of word vectors:

D = BiGRU (D) =[w1 ... wL

]T ∈ RL×2dh ,

where wi is the corresponding concatenated for-ward and backward hidden states of each word,and dh is the hidden size of BiGRU.

6http://participants-area.bioasq.org/tools/BioASQword2vec/

50

m1

m2

...

mN

mi1

...

miM

...

...

...

...

X

X

Lin

ear

yi1 2 [0, 1]

ˆyiM 2 [0, 1]...

...

Mask

.

Tokenize & Embed

Attention Matrix

yi1, ..., yiM 2 {0, 1}

Binary Encoding

Embed

Optimization

BiGRU

Article

wi1 wi2

gwi1 gwi2 gwiLi

wiLi

Pediatric DiabetesJournal Name:

i-thGradient

Based

Binary Cross-Entropy Loss

Training

LabelPrediction

Inference

Threshold-ing

Abstract:___________________________MeSH:___, ____,___, …

Abstract:___________________________MeSH:___, ____,___, …

Abstract:___________________________MeSH:___, ____,___, …

...

...

KNN

Frequent

MeSH

Figure 1: Model Architecture. BiGRU: Bi-directional recurrent gated unit. The example abstract is from (Karaaand Goldstein, 2015).

3.2 MeSH Representation and Masking

We learn the MeSH embedding matrix H ∈RN×de2 , where N is the number of all MeSHterms (28,340), and de2 is the embedding size. Foreach article, we consider only a subset of all 28kMeSH terms for two reasons: 1. For each MeSHterm, there are far more negative samples than thepositive ones. We achieve down-sampling of thenegative samples by considering only a subset ofall MeSH terms as candidate for each article, sothat the classifier only concentrate on choosing amost suitable MeSH among a set of plausible an-notations; 2. It’s more time efficient than trainingall the MeSH terms or training the MeSH classi-fiers one by one. We call it a masking layer.

We use KNN strategy to choose a specific sub-set of MeSH terms to train for each article:

Each abstract can be represented by IDF-weighted sum of word vectors:

d =

∑ni=1 IDF i ×wi∑n

i=1 IDF i∈ Rde1,

where wi is the corresponding word vector, andIDF i is the inverse document frequency of thisword.

We then calculate cosine similarity of represen-

tations between the abstracts:

Similarity(i, j) =dTi dj

||di|| × ||dj||For each article, we find itsK nearest neighbors

based on cosine similarity. And then we countthe MeSH term frequency in these neighbors. Themost frequent M MeSH terms are trained for eacharticle. We denote the masked MeSH embeddingas H′,

H′ =[m1 m2 ... mM

]∈ RM×de2 ,

where we make de2 = dh so that we could directlyget the dot product of each MeSH representationand word vector.

3.3 Attention MechanismAfter getting the document representation andmasked MeSH representations, we calculate thedot products between each context-aware wordvector and each MeSH embedding, which repre-sents the similarity within each pair:

S = H′ DT =[Dm1 ... DmM

]T∈ RM×L,

We then uses SoftMax function to normalize overthe word axis to get attention weights attributionfor each MeSH term:

SoftMax (Sim)

51

=[SoftMax(Dm1) ... SoftMax(DmM)

]T

=[α1 ... αM

]T ∈ [0, 1]M×L,

where αj ∈ [0, 1]L is the attention weights overwords for MeSH term j, and

∑Lk=1 αjk = 1.

3.4 ClassificationFor each MeSH term, we can have a MeSH-specific representation of document by sum ofword vectors weighted by attention weights:

Rj = αjD ∈ R2dh ,

where Rj is MeSH term j specific document rep-resentation. We apply a linear projection layer andsigmoid activatin function to each MeSH term, fi-nally getting the output probability:

yj = σ(RTj m

′j + bj) ∈ [0, 1],

where m′j and bj are learnable linear projectionparameters for MeSH term j. We model

P (MeSH j indexed | Journal, Title, Abstract) = yj .

3.5 TrainingAfter get the conditioned probability we model,we can calculate the binary cross-entropy loss foreach MeSH term:

Lj = −(yjlog(yj) + (1− yj)log(1− yj)),

where yj ∈ {0, 1} is the ground-truth label ofMeSH j. yj = 0 means MeSH j is not annotatedto the article by human indexers, while yj = 1means MeSH j is annotated. We can get the totalloss by summing them up:

L =1

M

M∑

j=1

Lj

The model is trained end-to-end from word andMeSH embedding to the final projection layer bya gradient-based optimization algorithm to mini-mize L.

3.6 InferenceAt inference time, we will predict the MeSH termswhose predicted probability is larger than a tunedthreshold:

(predict MeSH j) = 1(yj > pj),

where pj is the tuned threshold for MeSH term j.The thresholds are tuned to maximize MiF:

p1, ..., pN = argmaxp1,...,pN

MiF(Model, p1, ..., pN )

We tune p by the the micro-F optimization algo-rithm described in (Pillai et al., 2013), which theyproved to be able to achieve the global maximum.

4 Experiments

4.1 DatasetWe use the dataset provided by BioASQ7, whichcontains about 13.5 million manually annotatedPubMed articles. The dataset covers 28,340MeSH terms in total, and each article is annotated12.69 MeSH terms on average. We selected 3 mil-lion articles from 2012 to 2017 for training.

The results reported in this paper are derivedfrom two test sets: BioASQ Test Sets: During thechallenge, BioASQ provides a test set of severalthousands articles each week. Ours: we use 100thousand latest articles to test our model, and allother results are calculated by this dataset. Sinceour test set is very large, the results will be precise.

4.2 ConfigurationThe model is implemented using PyTorch (Paszkeet al., 2017). The parameter settings are shownin Table 1. We use Adam optimizer and batchsize of 32. We train 2 epochs of each model onthe 3M article training set, and apply hyperboliclearning rate decay and early stopping strategies(Yao et al., 2007). The training takes 4 days on2 GPUs (GeForce GTX TITAN X). Before tuningthe thresholds for all individual MeSH term, weuse a global threshold of 0.35 due to the highlyimbalanced dataset.

4.3 Evaluation MetricThe major metric for performance evaluation isMicro-F, which is a harmonic mean of micro-precision (MiP) and micro-recall (MiR) , and iscalculated as follows:

Micro-F =2 ·MiP ·MiRMiP + MiR

,

where

MiP =

∑Nai=1

∑Nj=1 yij · yij∑Na

i=1

∑Nj=1 yij

7http://participants-area.bioasq.org/general_information/Task6a/

52

Parameter Value(s)

|V| 1.7Mde1 256de2 512dh 256N 28,340L ≤512 (truncated if longer)Na BioASQ 5,833∼10,488Na Ours 100,000K 0.1k, 0.5k, 1k, 3MM 128, 256, 512, 1,024Learning Rate 0.002, 0.001, 0.0005BiGRU Layer(s) 1, 2, 3, 4

Table 1: Parameter Values. For hyperparameters, wehighlight the optimal ones among all tried values.

0 200 400 600 800 1000Mask Size

0.2

0.4

0.6

0.8

1.0

Mic

ro-R

ecal

l

0.1k nn0.5k nn1k nn3M nn

Figure 2: The micro-recall of MeSH terms versus dif-ferent mask sizes for different numbers of neighbor ar-ticles.

MiR =

∑Nai=1

∑Nj=1 yij · yij∑Na

i=1

∑Nj=1 yij

In these equations, i is indexed for articles and j isindexed for MeSH terms, so Na is the number ofarticles in the test set, and N is the number of allMeSH terms. yij and yij are both binary encodedvariables to denote whether MeSH term j is in ar-ticle i in ground-truth and prediction, respectively.

4.4 Evaluation of Masking LayerSelecting relevant MeSH terms from neighbor ar-ticles can be regarded as a weak classifier itself,and high-recall setting is favored in this step. Wemeasure the micro-recall for different maskinglayer settings, and the results are shown in Fig-ure 2. Basically, there are two hyperparameters forit: the number of neighbor articlesK and the num-ber of highest ranking MeSH terms selectedM . Anon-trivial baseline for K is 3M, i.e. the numberof all training articles. Under this circumstance,the ranked MeSH list is determined by global fre-quency, thus is non-specific to any article.

We choose the number of nearest articles K =

Mask Setting MiP MiR MiF

1,024 rd. 0.5891 0.0173 0.03371,024 freq. 0.6863 0.4257 0.5262

128 n.n. 0.6354 0.5880 0.6108256 n.n. 0.6690 0.5975 0.6312512 n.n. 0.6663 0.6116 0.63781,024 n.n. 0.6698 0.6262 0.6472

Table 2: Model Performance with Different Mask Set-tings. n.n.: MeSH mask selected from nearest neighborarticles (K = 1000); freq.: MeSH mask selected fromglobally frequent MeSH terms; rd.: MeSH mask ran-domly selected. All results are averaged over modelstrained by 3 random seeds.

1000 for it gives the highest recalls with the in-crease of mask size. In fact, micro-recall at M =1024 and K = 1000 is about 0.97, which almostguarantees that all true annotations are included ascandidate for a document. Before fine-tuning onother hyperparameters and the thresholds of mak-ing predictions, we first train the model with dif-ferent M , and report the results in Table 2.

4.5 Evaluation of PerformanceWhile we were developing the model, we partic-ipated in the BioASQ Task6a challenge. Duringthe challenge, there is a test set available eachweek. Each test set contains several thousands ofun-indexed PubMed citations. Each citation hasjournal name, title, abstract information. Partici-pants will run their models on the test set and up-load their predictions of MeSH annotations withina given time. The organizers will then evaluate ev-ery participants’ predictions and make the resultsavailable. The results of the whole Challenges areshowed in Figure 3. Furthermore, the results of thelast week of the Challenge are showed in Table 3.

Model Average MiF Maximum MiF

Access Inn MAIstro 0.2788 0.2788MeSHmallow 0.3161 0.3161UMass Amherst T2T 0.4988 0.4988iria 0.4992 0.5161MTI First Line Index 0.6332 0.6332DeepMeSH 0.6451 0.6637Default MTI 0.6474 0.6474AttentionMeSH 0.6635 0.6635xgx 0.6862 0.6880

Table 3: Model Performance of the Final BioASQ TestSet. The models are ranked top-down from the lowestaverage MiF to the highest one. Our on-constructionAttentionMeSH ranked second by average MiF.

It should be noted that the models we used in

53

B1W1

B1W2

B1W3

B1W4

B1W5

B2W1

B2W2

B2W3

B2W4

B2W5

B3W1

B3W2

B3W3

B3W4

B3W5

NOW

Test Set

0.54

0.56

0.58

0.60

0.62

0.64

0.66

0.68A

vera

ge M

icro

-F

Best of OthersAttnMeSHDeepMeSHXGX

Figure 3: BioASQ Challenge Task6A results. BaWb:Test week b of batch a. From B1W1 to B3W5,we show the average MiF of different models: At-tentionMeSH, DeepMeSH, XGX, and the best per-formance of all other models. Results are retrievedfrom http://participants-area.bioasq.org/results/6a/ on June 10th, 2018. And NOWshows the most up-to-date results from our test set.

BioASQ are not our final model. We include theup-to-date results in ‘NOW’ in Figure 3 and reportthe ablation test results in Table 4.

Model MiP MiR MiF

AttentionMeSH (AM) 0.6698 0.6262 0.6472

AM w/o BiGRU 0.6362 0.5848 0.6093AM w/o learning w.e. 0.6657 0.6106 0.6369AM w/o attention 0.6807 0.5519 0.6095

AM w/ t.t. 0.7048 0.6393 0.6704AM w/ ensemble & t.t. 0.7172 0.6543 0.6844

Table 4: Model Performance with Ablations and FinerTuning. w/o: without; w/: with; t.t.: MeSH termthreshold tuning; w.e.: word embeddings. Ensemblingtakes the average prediction of 8 models trained by dif-ferent seeds. All results, except the ensemble one, areaveraged over models trained by 3 random seeds.

4.6 Evaluation of Interpretability

At inference time, attention matrix provides word-level interpretation: For each MeSH prediction,the model shows which words are given high at-tention. It helps the indexers to evaluate and proof-read the indexing results of our model. Figure 4shows an example of attention for interpretation.

To qualitatively evaluate the interpretability ofdifferent models, the best way would be to mea-sure the time efficiency of manual indexing withthe assistance of different models. However, thismight require well-trained NLM indexers to eval-uate. Instead, we asked two independent re-searchers with Ph.D. degrees in related fields to

label relevant words for 100 MeSH-article pairs.Their intersected labels are regarded as ground-truth. We model the interpretability evaluation asan information retrieval task, and evaluate eachmethod’s recall at different numbers of outputs inTable 5. Since other models like DeepMeSH andMTI don’t report how to interpret their model out-puts, we use string-matching as a non-trivial base-line.

Model R@5 R@10 R@20

String-Matching 0.3890 0.4180 0.4336

AM Embeddings 0.6180† 0.7486† 0.8088†

AM Whole Model 0.6929†‡ 0.8389†‡ 0.8993†‡

Table 5: Interpretability Evaluation. R@n: The aver-age recall of ground-truth relevant words if the modeloutputs n words. AM Whole Model: The whole modelof AttentionMeSH is used to get the attention matrix,and n words with highest attention weights will be theoutput; AM Embeddings: We only use the trained wordand MeSH embeddings of AttentionMeSH model, andwe output n words that have highest dot products witheach specific MeSH. String-Matching: A string match-ing method that takes all words in the abstracts that aresame to any word in the MeSH name. †: Significantdifferences with String-Matching; ‡: Significant differ-ences with AM Embeddings. Significance is definedby p < 0.05 in paired t tests.

5 Discussion

One intrinsic limitation of all present automaticMeSH indexing models, including us, is that thesemodels just annotate MeSH terms from the ab-stract, title, journal name etc, but they don’t lookinto the article bodies. However, the human index-ers in NLM do need to look into the bodies to an-notate each article, and thus the textual evidencefor certain annotations is missed during training.As such, all present models won’t have enough in-formation to do the annotation, and certain per-cent of false negatives is inevitable, and the per-formance is upbounded by them. For example,MeSH terms ‘Humans’, ‘Males’, ‘Females’ areannotated to our demo article in Figure 4. How-ever, the abstract doesn’t contain any relevant in-formation. 35 articles in our 100 MeSH-articlepairs evaluated by experts don’t have any wordsrelevant to the MeSH term.

We noted that AttentionMeSH predicted manyMeSH terms to documents that were not annotatedby NLM indexers, which appears to be ”false pos-

54

PLoS OneAssociation of SNP rs80659072 in the ZRS with polydactyly in Beijing You chickens

AbstractThe Beijing You chicken is a Chinese native breed with superior meat quality and a unique appearance. The G/T mutation of SNP rs80659072in the Shh long-range regulator of GGA2 is highly associated with the polydactyly phenotype in some chicken breeds. In the present study, thisSNP was genotyped using the TaqMan detection method, and its association with the number of toes was analyzed in a flock of 158 birds of theBeijing You population maintained at the Beijing Academy of Agriculture and Forestry Sciences. Furthermore, the skeletal structure of thedigits was dissected and assembled in 113 birds. The findings revealed that the toes of Beijing You chickens were rich and more complex thanexpected. The plausible mutation rs80659072 in the zone of polarizing activity regulatory sequence (ZRS) in chickens was an essential but notsufficient condition for polydactyly and polyphalangy in Beijing You chickens. Several individuals shared the T allele but showed normal four-digit conformations. However, breeding trials demonstrated that the T allele could serve as a strong genetic marker for five-toe selection inBeijing You chickens.

True Positive MeSH: Toes• …with the number of toes was analyzed in a…• …findings revealed that the toes of Beijing You chickens…• …skeletal structure of the digits was dissected and assembled…

True Positive MeSH: Polymorphism, Single Nucleotide• Association of SNP rs80659072 in the ZRS…• …The G/T mutation of SNP rs80659072 in the Shh…• …this SNP was genotyped using the TaqMan…

False Positive MeSH: China• …ZRS with polydactyly in BeijingYou chickens…• …BeijingYou chicken is a Chinese native breed with…

False Negative MeSH: Meat• …native breed with superior meat quality and…• …The Beijing You chicken is a Chinese native breed…• …in Beijing You chickens

Figure 4: Attention Display. In a randomly-selected test article (Chu et al., 2017), we show the 3 words that aregiven highest attention weights for 4 MeSH terms, including two true positive, one false positive and one falsenegative predictions.

itives”. However, after manual inspection, we no-ticed that many of our predictions are semanticallysensible. For example, both the articles in Figure 4and Figure 5 discuss genotype-phenotype relation-ship in Beijing You chickens. However, MeSHterm China is annotated to the article in Figure 5,but not the one in Figure 4. We conjecture that thismay be due to inconsistency among indexers andthat automatic indexing may assign more semanti-cally sensible annotations to enhance the coverageof concepts in a document.

In consideration of the limitations and problemsmentioned above, some false positive and falsenegative MeSH terms are unavoidable. We arguethat human experts’ performance on test datasetbased on the same input as given in BioASQ isneeded to provide better evaluation and compari-son of performance of current methods.

Concerning how the explanations will help, wejust perform a preliminary study by human evalua-tors, where we model the interpretability as an in-formation retrieval (IR) task. However, the poten-tial users don’t regard the annotation task as an IRtask. Thus, it would be more convincing to recruitsome indexers at NLM and conduct a user study,measuring the annotation efficiency and accuracywith and without the help of AttentionMeSH.

Animal BiotechnologyThe effect of a mutation in the 3-UTR region of the HMGCR gene on cholesterol in Beijing-you chickens.

AbstractThe 3-hydroxyl-3-methylglutaryl Coenzyme A reductase(HMGCR) gene was examined for polymorphisms in Beijing-you chickens. A "T" base insert was detected at nucleotide 2749of the 3-UTR region of the HMGCR gene and was used as thebasis for distinguishing a B allele, distinct from the A. Serum andmuscle contents of total cholesterol. LDL-cholesterol in serumwas significantly lower in AB birds and lowest in BB birds. Real-time PCR showed that the same trends across genotypes occurredin an abundance of HMGCR transcripts in liver, but there was nodifference in contents of HMGCR mRNA in breast or thighmuscles. Hepatic expression and serum LDL-cholesterol weremeaningfully correlated (partial, with total serum cholesterol heldconstant, r = 0.923). In muscle, similar genotypic differenceswere found for the abundance of the LDL receptor (LDLR)transcript. Cholesterol content in breast muscle related to LDLRexpression (partial correlation with serum LDL-cholesterol heldconstant, r = 0.719); the equivalent partial correlation in thighmuscle was not significant. The results indicated that the B allelesignificantly reduces hepatic abundance of HMGCR transcripts,probably accounting for genotypic differences in serumcholesterol. In muscle, the cholesterol content appeared to reflectdifferences in LDLR expression with apparent mechanisticdifferences between breast and thigh.

Figure 5: A Contradictorily Indexed Article (Cui et al.,2010). MeSH term China is annotated to this article,while not to a similar one at Figure 4.

55

6 Conclusions

We present AttentionMeSH, an automatic MeSHindexer, which is simple and interpretable. Italso achieves comparable performance to the cur-rent state-of-the-art. Since even the state-of-the-art model has only about 0.69 by MiF metric,manual annotations are still required. Thus, in-terpretability of the models is vital. We evaluatethe interpretability of AttentionMeSH by retriev-ing capability of experts-labeled relevant words.Our model achieves high performance by this task.To the best of our knowledge, AttentionMeSH isthe only interpretable model for indexing MeSHwhich has close to state-of-the-art performance.

7 Acknowledgement

This work has been supported by NIH grant No.5R01LM012011. We thank Dr. Chunhui Cai andDr. Lujia Chen for their independent evaluationsof model interpretability. We are also grateful forthe anonymous reviewers who gave us very in-sightful suggestions. The content is solely the re-sponsibility of the authors and does not necessarilyrepresent the official views of the National Insti-tutes of Health.

ReferencesAlan R Aronson. 2001. Effective mapping of biomed-

ical text to the umls metathesaurus: the metamapprogram. In Proceedings of the AMIA Symposium,page 17. American Medical Informatics Associa-tion.

Alan R Aronson, James G Mork, Clifford W Gay, Su-sanne M Humphrey, Willie J Rogers, et al. 2004.The nlm indexing initiative’s medical text indexer.Medinfo, 89.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2014. Neural machine translation by jointlylearning to align and translate. arXiv preprintarXiv:1409.0473.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gul-cehre, Dzmitry Bahdanau, Fethi Bougares, HolgerSchwenk, and Yoshua Bengio. 2014. Learningphrase representations using rnn encoder-decoderfor statistical machine translation. arXiv preprintarXiv:1406.1078.

Qin Chu, Zhixun Yan, Jian Zhang, Tahir Usman, YaoZhang, Hui Liu, Haihong Wang, Ailian Geng, andHuagui Liu. 2017. Association of snp rs80659072in the zrs with polydactyly in beijing you chickens.PloS one, 12(10):e0185953.

Alexis Conneau, Holger Schwenk, Loıc Barrault, andYann Lecun. 2017. Very deep convolutional net-works for text classification. In Proceedings of the15th Conference of the European Chapter of the As-sociation for Computational Linguistics: Volume 1,Long Papers, volume 1, pages 1107–1116.

HX Cui, SY Yang, HY Wang, JP Zhao, RR Jiang,GP Zhao, JL Chen, MQ Zheng, XH Li, and J Wen.2010. The effect of a mutation in the 3-utr region ofthe hmgcr gene on cholesterol in beijing-you chick-ens. Animal biotechnology, 21(4):241–251.

Armand Joulin, Edouard Grave, Piotr Bojanowski, andTomas Mikolov. 2016. Bag of tricks for efficient textclassification. arXiv preprint arXiv:1607.01759.

Amel Karaa and Amy Goldstein. 2015. The spectrumof clinical presentation, diagnosis, and managementof mitochondrial forms of diabetes. Pediatric dia-betes, 16(1):1–9.

Yoon Kim. 2014. Convolutional neural net-works for sentence classification. arXiv preprintarXiv:1408.5882.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.2015. Deep learning. nature, 521(7553):436.

Zachary C Lipton. 2016. The mythos of model inter-pretability. arXiv preprint arXiv:1606.03490.

James G Mork, Antonio Jimeno-Yepes, and Alan RAronson. 2013. The nlm medical text indexer sys-tem for indexing biomedical literature. In BioASQ@CLEF.

James Mullenbach, Sarah Wiegreffe, Jon Duke, JimengSun, and Jacob Eisenstein. 2018. Explainable pre-diction of medical codes from clinical text. arXivpreprint arXiv:1802.05695.

Adam Paszke, Sam Gross, Soumith Chintala, Gre-gory Chanan, Edward Yang, Zachary DeVito, Zem-ing Lin, Alban Desmaison, Luca Antiga, and AdamLerer. 2017. Automatic differentiation in pytorch.

Shengwen Peng, Ronghui You, Hongning Wang,Chengxiang Zhai, Hiroshi Mamitsuka, and Shan-feng Zhu. 2016. Deepmesh: deep semantic repre-sentation for improving large-scale mesh indexing.Bioinformatics, 32(12):i70–i79.

Ignazio Pillai, Giorgio Fumera, and Fabio Roli. 2013.Threshold optimisation for multi-label classifiers.Pattern Recognition, 46(7):2055–2065.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirec-tional recurrent neural networks. IEEE Transactionson Signal Processing, 45(11):2673–2681.

Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang,and Eric P Xing. 2017. Towards automatedicd coding using deep learning. arXiv preprintarXiv:1711.04075.

56

George Tsatsaronis, Georgios Balikas, ProdromosMalakasiotis, Ioannis Partalas, Matthias Zschunke,Michael R Alvers, Dirk Weissenborn, AnastasiaKrithara, Sergios Petridis, Dimitris Polychronopou-los, et al. 2015. An overview of the bioasqlarge-scale biomedical semantic indexing and ques-tion answering competition. BMC bioinformatics,16(1):138.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto.2007. On early stopping in gradient descent learn-ing. Constructive Approximation, 26(2):289–315.