
Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 85–93, November 19, 2020. ©2020 Association for Computational Linguistics


Comparison of Machine Learning Methods for Multi-label Classification of Nursing Education and Licensure Exam Questions

John T. Langton, Krishna Srihasam, Junlin Jiang
Wolters Kluwer Health, 230 3rd Avenue, Waltham, MA 02451
{john.langto, krishna.srihasam, junlin.jiang}@wolterskluwer.com

Abstract

In this paper, we evaluate several machine learning methods for multi-label classification of text questions. Every nursing student in the United States must pass the National Council Licensure Examination (NCLEX) to begin professional practice. NCLEX defines a number of competencies on which students are evaluated. By labeling test questions with NCLEX competencies, we can score students according to their performance in each competency. This information helps instructors measure how prepared students are for the NCLEX, as well as which competencies they may need help with. A key challenge is that questions may be related to more than one competency. Labeling questions with NCLEX competencies, therefore, equates to a multi-label, text classification problem where each competency is a label. Here we present an evaluation of several methods to support this use case along with a proposed approach. While our work is grounded in the nursing education domain, the methods described here can be used for any multi-label, text classification use case.

1 Introduction

All nurses within the United States must pass the National Council Licensure Examination (NCLEX®) to begin professional practice. A nursing curriculum will typically cover a wide range of topics related to the theory and practice of nursing. However, the NCLEX measures students against a specific set of competencies comprising the activities that entry-level nurses are most commonly expected to perform. These activities are identified by the National Council of State Boards of Nursing (NCSBN) through analysis of nursing practice.

Figure 1 shows a subset of NCLEX competencies called "activity statements" with descriptions. Activity statements are grouped into primary topics and sub-topics, as shown in the image. Nursing education content may be related to one or more competencies; they are not mutually exclusive.

Figure 1: Sample of NCLEX competencies or activity statements.

Passage of the NCLEX has significance not only for students but also for learning institutions. Nursing school accreditation is partially based on how well their student body performs on the NCLEX. If their performance drops below a certain threshold for too many consecutive years, the school risks losing its accreditation. It is, therefore, paramount for instructors to gauge student preparedness for the NCLEX and course correct where necessary. One way to do this is by repeatedly testing students with simulated exams. This approach may reveal that gaps exist; however, it does not necessarily identify which competencies are deficient or what content may address those deficiencies. By labeling both questions and educational content with the competencies that they relate to, instructors can more precisely identify where students are struggling and what content may help with remediation. This approach enables coursework to be tailored for individual students based on their performance in a manner that maximizes the likelihood of passing the NCLEX. For instance, if a student incorrectly answers questions related to "Provide pulmonary hygiene" (activity statement shown in Figure 1), the instructor may assign the student additional content (e.g., simulations and practice problems) related to that competency. This general approach is called formative testing.

PrepU is a Wolters Kluwer product for nursing education that features several types of content including books, simulations, videos, audio, and quizzes. To support formative testing in PrepU, quiz questions are tagged with the NCLEX competency related to them. When students take quizzes, their scores can be aggregated according to NCLEX competency. Figure 2 shows an example of the interface displaying this information. The image shows that the student achieved a score of 66.7% for the competency "Prioritize the delivery of client care". This is calculated based on the student answering 4 out of 6 questions correctly that were labeled with that competency. There is also a tab that shows class performance so that instructors can see if there is a pattern of multiple students struggling with a particular competency. Instructors can use this information to make changes to the curriculum to address problem areas. Corrective actions may include assigning students additional content or practice materials related to the competencies they are struggling with.

Figure 2: PrepU screenshot showing quiz results broken down according to NCLEX competencies.

Figure 3: Editorial platform where editors manually tag questions with associated NCLEX competencies.

To aggregate scores according to NCLEX competencies as shown in Figure 2, each question needs to be labeled according to which competencies it relates to. Prior to our work, editors would manually label questions using the editorial platform shown in Figure 3. A drop-down menu shows a selection of NCLEX competencies. The editor must scroll through this list, identify which are appropriate, and select them to add them to the question. This process was costly and time-consuming. One challenge is that each question can belong to more than one competency. Further, different editors may have differing opinions as to which competencies a question relates to. Reconciling these differences and maintaining consistency across editors and content is a huge challenge.

To streamline the labeling of nursing education questions, we integrated a machine learning model for automated tagging into the current workflow. As editors review each question, the model makes suggestions about which NCLEX competencies are related to that question. Rather than scrolling through a long list of options, editors can rapidly click to accept or reject suggestions (though they still have the ability to scroll through all possibilities if they believe none of the suggestions are applicable). This approach has greatly streamlined the process of labeling questions and added additional consistency in the application of labels. The following sections detail the data involved, the modeling techniques evaluated, and the chosen solution.


2 The Data

In this paper, we focus on NCLEX competencies related to what are called "activity statements". Activity statements are presented in a hierarchical structure with two levels as shown in Figure 1. We consider only leaf nodes to simplify the problem. Given this consideration, there are 138 activity statements or labels in total.

41,125 questions were manually labeled with one or more of the 138 possible activity statements related to them. This data was used for both training and testing of our machine learning models. The distribution of questions across activity statements was non-uniform and presented a class imbalance challenge. The majority of activity statements were assigned to 5 or fewer questions. However, there was a small set of activity statements that were commonly used, and two that were associated with nearly 3000 questions. The distribution of questions to activity statements is shown in Figure 4. Each bar corresponds to one of the 138 activity statements and its height represents the number of questions assigned to it.

Fewer than 100 questions were assigned more than one activity statement label. However, there was a desire to accommodate multiple activity statements per question for future labeling efforts. Therefore, we maintained an approach using multi-label classification techniques.

3 Related Research

The task of tagging questions with relevant activity statements can be considered a multi-label document classification task where each question is a document. There are several well-known methods for this type of task. Many of them represent a document as a vector of numbers. We can use similarity and/or distance metrics between document vectors to perform several operations such as clustering and classification. A key set of decisions is how to represent documents as vectors, and what distance metrics to use for comparing them. The following sections describe a number of approaches for document vectorization as well as methods for multi-label classification.

3.1 Text Vectorization and Classification

Bag of words approaches for document vectorization are quite common and have been used with a number of different algorithms (McCallum and Nigam, 2001). These approaches use word frequency to determine vector representations for documents and may employ a number of feature selection and normalization techniques (Xu et al., 2009). One dominant technique is called Term Frequency – Inverse Document Frequency (TF-IDF).

While bag of words methods have proven quite effective, they suffer a number of weaknesses. When paired with algorithms such as naïve Bayes classifiers, there is no consideration of word order, proximity, or co-occurrence within a document. This can be somewhat mitigated using n-gram techniques (i.e., considering n consecutive words as one element in document vectors). Synonyms can also confound bag of words approaches since two or more words may appear as unique elements in a document vector despite being semantically equivalent. For instance, "water" and "H2O" may show up as distinct vocabulary terms in a TF-IDF vector. When computing the cosine similarity between the vector for a document that discusses "water" and one that discusses "H2O", the result would inaccurately indicate they were dissimilar.
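The minimal sketch below illustrates this synonym effect with scikit-learn (the example sentences are illustrative and not drawn from the data set): the two documents differ only in "water" versus "H2O", yet their TF-IDF vectors share no weight for that concept, so the cosine similarity is lower than the semantics would suggest.

# A minimal sketch of the synonym problem with TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "give the client water to maintain hydration",
    "give the client H2O to maintain hydration",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(docs)

# "water" and "h2o" are separate vocabulary terms, so they contribute
# nothing to the overlap between the two document vectors.
print(cosine_similarity(vectors[0], vectors[1]))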

Word embeddings using neural networks are a more recent and popular method for vectorizing text (Kim, 2014). Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) are recurrent neural network (RNN) models that leverage connections between adjacent nodes in a single layer to better address word order and context. Huang, Xu, and Yu (Huang et al., 2015) compare several ensembles of bidirectional LSTMs and Conditional Random Fields (CRF) for sentence classification. Neural network models have specifically been used for multi-label document classification (Baumel et al., 2017; Lenc and Kral, 2017).

One of the most recent advances in natural language processing with neural networks is the use of pretrained, deep transformer models such as BERT (Devlin et al., 2018). BERT has outperformed many competing methods in standard language understanding tasks and has been used specifically for document classification (Adhikari et al., 2019). There is a great deal of research combining these different approaches for multiple use cases.

3.2 Multi-label Classification

Multi-label classification refers to a classification problem where each item being classified can belong to more than one class (or label) at the same time. This contrasts with standard classification, where each item is assigned to only one class. A trivial example would be classifying geometric shapes, where a square could be both a square and a rectangle.

Figure 4: Sample distributions of the number of questions per NCLEX activity statement.

There are a few standard techniques for dealing with multi-label classification. Many transform the problem into a standard classification task. One approach is to train a binary classifier for every label independently. Each classifier is then executed on the same input to predict whether its associated label should be applied (Read et al., 2015). In this scenario, each classifier only has knowledge of one label and only makes predictions for membership or non-membership in that label group or class. This strategy is similar to "one-versus-rest" approaches; however, it often employs techniques more analogous to one-class classification or anomaly detection. An extension to this method is to chain multiple binary classifiers together in a sequence. The predictions from one classifier are passed as a feature to the next classifier until a final set of predictions is output. Probabilistic methods can be used to optimize the order of classifiers.
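As a minimal sketch (scikit-learn with toy data, not the paper's pipeline), both transformations can be expressed directly: a one-vs-rest wrapper trains one independent binary classifier per label, while a classifier chain also feeds each classifier's prediction to the next one in the sequence.

# Minimal sketch with toy data: X holds document vectors and Y is a
# multi-hot label matrix (one column per label).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.random((200, 50))                      # toy document vectors
Y = (rng.random((200, 5)) > 0.7).astype(int)   # toy multi-hot labels

# One independent binary classifier per label.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Chained binary classifiers: each one also sees the previous predictions.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random",
                        random_state=0).fit(X, Y)

print(ovr.predict(X[:2]))
print(chain.predict(X[:2]))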

Another common approach for multi-label classification is to take the power set of label combinations and treat each as an independent class. This approach transforms the problem into a standard multi-class classification task. Spolaôr et al. compare a number of methods for such problem transformations (Spolaôr et al., 2013). For instance, we can transform a set of 3 labels, (A, B, C), into a power set of classes: {(A), (B), (C), (A,B), (A,C), (B,C), (A,B,C)}.
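A minimal sketch of this label power-set transformation follows (toy label sets, not the paper's data): each distinct label combination observed in the data is encoded as a single multi-class target that any standard classifier can predict.

# Minimal sketch: encode each observed label combination as one class id.
from sklearn.preprocessing import LabelEncoder

label_sets = [("A",), ("B",), ("C",), ("A", "B"), ("A", "C"), ("B", "C")]

# Join each combination into a single string key, e.g. ("A", "B") -> "A+B".
combo_encoder = LabelEncoder()
combo_ids = combo_encoder.fit_transform(["+".join(s) for s in label_sets])

# A standard multi-class classifier can now be trained on combo_ids;
# predictions are decoded back into label sets by splitting on "+".
print(list(zip(["+".join(s) for s in label_sets], combo_ids)))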

4 Evaluating Vectorization Methods and Similarity Metrics for Clustering and Classification

We began our analysis by evaluating how different vectorization techniques and similarity metrics perform at differentiating questions related to one label (i.e., activity statement) from another. The ability to differentiate questions in this manner directly affects the performance of classification and clustering algorithms. The results helped establish a baseline of how much overlap there was between questions in different label groups. It also informed decisions on which vectorization methods and similarity metrics to use with what algorithms for evaluation.

To perform this analysis, we leveraged techniques often used in clustering. The nursing education questions were grouped into clusters based on the activity statements they were associated with. This resulted in 138 clusters, one for each of the activity statements. Questions associated with more than one activity statement were included in the groups for each. We experimented with several vectorization methods (techniques for transforming the questions into numeric vectors) as well as similarity metrics for comparing vectors. We converged on using cosine similarity to compare vectors because of its ability to deal with both sparse and dense vectors when normalized. The vectorizations evaluated included the following (a brief sketch of embedding-based question vectorization appears after the list):

• term frequency – inverse document frequency (TF-IDF)

• word embeddings pretrained on Google News (Mikolov et al., 2013)

• word embeddings pretrained on PubMed (Pyysalo et al., 2013)

• word embeddings pretrained on PubMed and updated on text content from Wolters Kluwer nursing education
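The paper does not spell out how individual word embeddings are pooled into a single question vector; the sketch below assumes simple averaging over in-vocabulary words and uses a placeholder file name for the pretrained word2vec-format embeddings.

# Minimal sketch, assuming averaged word vectors; the embedding file name
# below is a placeholder, not the paper's exact resource.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pubmed_embeddings.bin", binary=True)

def question_vector(text):
    # Average the vectors of the words that exist in the embedding vocabulary.
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[w] for w in words], axis=0)

vec = question_vector("Provide pulmonary hygiene to the client")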

For each vectorization method, we computed the silhouette score across our manually constructed clusters. The silhouette score measures the similarity of questions within a cluster (cohesion) versus the dissimilarity of questions in one cluster as compared to those in other clusters (separation). Higher silhouette scores indicate better cohesion within clusters and separation between clusters. In clustering, this measure can help inform the number of clusters to use. For our analysis, we were more interested in what vectorization methods achieved better separation of questions assigned to different NCLEX labels. The vectorizations achieving the best silhouette scores could be expected to perform better in classification tasks. We therefore fixed the number of clusters to the number of activity statements, i.e., 138.
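A minimal sketch of this computation with scikit-learn follows; the questions and label ids are toy stand-ins, with each question's manually assigned activity statement serving as its cluster assignment.

# Minimal sketch: silhouette score with cosine distance, where the cluster
# assignment of each question is its manually assigned activity statement.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

questions = [
    "provide pulmonary hygiene to the client",
    "perform chest physiotherapy and suctioning",
    "prioritize the delivery of client care",
    "determine which client should be assessed first",
]
label_ids = [0, 0, 1, 1]  # toy activity-statement ids (0..137 in the real data)

X = TfidfVectorizer().fit_transform(questions)
print(silhouette_score(X, label_ids, metric="cosine"))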

Table 1 shows the different vectorization methods evaluated along with their respective silhouette scores. We also include metrics based on the cohesion component of the silhouette score. Specifically, the binary relevance scores measure the distance between question vectors that are all tagged with the same label. The table reports the mean, minimum, maximum, and standard deviation of binary relevance scores across all 138 label clusters.

Figure 5 shows the silhouette scores for every pair of activity statement clusters using TF-IDF vectorization of questions. TF-IDF resulted in the highest silhouette score of -0.02. However, the scores for all vectorization methods were relatively low. This result indicated two things: 1) there is a great degree of similarity between questions assigned to different activity statement labels, and 2) no vectorization method performed much better than the others. It also suggested that algorithms may need further grouping and sampling of questions to better differentiate them during classification.

Figure 5: Distribution of pairwise silhouette scores across activity statements.

We hypothesized that ignoring the current labels and clustering questions may achieve better separation for classification algorithms. To evaluate this hypothesis, we performed a standard clustering of questions using the various vectorization methods. Normally we would optimize the number of clusters based on the silhouette score or other related metrics. However, in the interest of time, we used a fixed number of 512 clusters. This number was estimated from the number of questions and their distribution across activity statements. The resulting silhouette scores improved by 0.02 on average but did not reflect a significant change. On average, each cluster contained questions from five different activity statement label groups. This result motivated some of the modeling experiments described in the following section.
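A minimal sketch of this label-free clustering check is shown below (toy questions and a small k stand in for the 512 clusters used in the experiment).

# Minimal sketch: cluster question vectors with a fixed k and re-check the
# silhouette score (the paper used k = 512 on the full question set).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

questions = [
    "provide pulmonary hygiene to the client",
    "perform chest physiotherapy and suctioning",
    "prioritize the delivery of client care",
    "determine which client should be assessed first",
    "administer blood products to the client",
    "monitor the client during a transfusion",
]

X = TfidfVectorizer().fit_transform(questions)
k = 3  # stand-in for the 512 clusters used in the paper
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, clusters, metric="cosine"))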

5 Modeling

We use the micro-averaged area under the receiver operating characteristic curve (AUC-ROC) to compare multiple algorithms for the described use case. This metric is able to address class imbalance and is used canonically for benchmarking models (Harutyunyan et al., 2019). Note that the metrics reported here only reference the AUC score of the first label predicted. In production, the top five labels with the highest confidence values are shown to users, resulting in an accuracy of 95% in predicting all relevant labels. This is discussed further in Section 5.6. Nonetheless, the initial AUC score of the first prediction was a good benchmark for comparing models. The following sections provide details on each algorithm evaluated.
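For reference, the comparison metric can be computed with scikit-learn as in the minimal sketch below (toy label and score matrices): averaging "micro" pools every (question, label) decision into a single ROC curve, which keeps rare labels from being ignored.

# Minimal sketch: micro-averaged ROC AUC over a multi-hot label matrix and
# the per-label prediction scores produced by a model.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.7],
                    [0.1, 0.4, 0.9],
                    [0.8, 0.6, 0.2]])

print(roc_auc_score(y_true, y_score, average="micro"))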

5.1 One versus Rest Support Vector Machines (SVM) with TF-IDF Vectorization

The first model employed a variant of TF-IDF vectorization referred to as Term Frequency – Inverse Label Frequency (TF-ILF). The primary difference is in what is regarded as a "document". Instead of individual questions being treated as a document, questions are grouped according to their labels and then the group is treated as a document.


Vectorization Method | Silhouette Score | Mean Binary Relevance | Max Binary Relevance | Min Binary Relevance | Std Dev Binary Relevance
TF-IDF | -0.02 | 0.0 | 0.01 | -0.003 | 0.0001
Google | -0.21 | -0.02 | 0.09 | -0.14 | 0.04
PMC | -0.22 | -0.02 | 0.17 | -0.13 | 0.05
PrepU | -0.91 | -0.07 | 0.3 | -0.27 | 0.08

Table 1: Silhouette scores for different vectorization methods.

This document specification results in a slight difference in how document vectors are normalized. The vocabulary for the vectorization included the use of bi-grams and tri-grams (i.e., 2 and 3 word sequences). After eliminating stop words (e.g., "a", "and", "the"), stemming, and performing synonym replacement, the total vocabulary was constrained to 30,000 features. Specifically, we kept only the 30,000 features with the highest TF-ILF values.
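A minimal sketch of this TF-ILF idea is shown below (toy questions and labels; the paper's stemming and synonym-replacement steps are not reproduced): all questions sharing a label are concatenated into one "label document", TF-IDF is fit on those documents so the inverse document frequency becomes an inverse label frequency, and individual questions are then transformed with the resulting vocabulary.

# Minimal sketch of TF-ILF: fit TF-IDF on one concatenated document per label,
# then vectorize individual questions with that vocabulary.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "provide pulmonary hygiene to the client",
    "perform chest physiotherapy and suctioning",
    "prioritize the delivery of client care",
]
labels = [["pulmonary"], ["pulmonary"], ["prioritization"]]  # toy label sets

label_docs = defaultdict(list)
for text, labs in zip(questions, labels):
    for lab in labs:
        label_docs[lab].append(text)

# One document per label, so document frequency becomes label frequency.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english",
                             max_features=30000)
vectorizer.fit(" ".join(texts) for texts in label_docs.values())

question_vectors = vectorizer.transform(questions)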

All questions were first vectorized. Each vector was then labeled according to the manually assigned labels using the LabelEncoder python class from the SciKit Learn package (Pedregosa et al., 2011). To address the multi-label issue (each question could have more than one activity statement label), we employed a one-versus-rest approach using support vector machines. A binary classifier was trained for each activity statement label with SciKit Learn's LinearSVC algorithm. This approach resulted in 138 models. The hyperparameters for the algorithm included an L2 penalty and a maximum of 1,000 iterations. Models were evaluated using cross-validation with the CalibratedClassifierCV python class. Cross-validation provides a more robust evaluation and can reveal variability between multiple executions of the algorithm.

Given a question as input, each classifier would predict whether a single activity statement label should be assigned to the question, without regard to any other labels. We computed confidence thresholds for each classifier as to whether to accept its prediction or not. Thresholds were established using evaluation metrics such as recall and precision. Questions were then fed into each classifier and any label predictions that met the required thresholds were assigned. In this manner, questions could be assigned more than one label provided more than one model prediction met the required threshold.
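The sketch below outlines this per-label setup under stated assumptions: question_vectors is a TF-ILF matrix like the one from the previous step, Y is a multi-hot label matrix, and the per-label thresholds are illustrative placeholders rather than the tuned values used in production.

# Minimal sketch: one calibrated LinearSVC per activity statement label,
# with per-label confidence thresholds applied at prediction time.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

def train_per_label_svms(X, Y):
    """Train one calibrated binary classifier per label column of Y."""
    models = []
    for j in range(Y.shape[1]):
        base = LinearSVC(penalty="l2", max_iter=1000)
        clf = CalibratedClassifierCV(base, cv=3)  # cross-validated calibration
        clf.fit(X, Y[:, j])
        models.append(clf)
    return models

def predict_labels(models, X, thresholds):
    """Return a boolean multi-hot matrix of label assignments."""
    # Probability of the positive class from each binary classifier.
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return probs >= thresholds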

The micro-averaged AUC-ROC of the SVM model was 0.968.

5.2 Convolutional Neural Network

A convolutional neural network (CNN) model was trained using a Keras tokenizer and word embeddings pretrained on articles from PubMed (Pyysalo et al., 2013). LabelEncoder was used again to encode question labels. The model used a softmax activation function and categorical cross-entropy for the loss function. The output of the model was structured as per-label probabilities between 0 and 1. The softmax output enabled more than one label to have a non-zero probability for a given question input, thereby addressing the multi-label problem. Details of the network architecture are shown in Figure 6.
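A minimal Keras sketch in the spirit of this description is shown below; the layer sizes are placeholders rather than the exact architecture in Figure 6, and the pretrained PubMed embedding matrix would normally be loaded into the Embedding layer's weights.

# Minimal sketch of a CNN text classifier with a softmax output over 138 labels.
from tensorflow.keras import layers, models

vocab_size, embed_dim, n_labels = 30000, 200, 138

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),        # pretrained weights loadable here
    layers.Conv1D(256, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_labels, activation="softmax"),   # per-label probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])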

The micro-averaged AUC-ROC of the CNN model was 0.972.

5.3 Bidirectional LSTM

A bidirectional LSTM was trained using many of the same parameters as the CNN, including a softmax activation function, categorical cross-entropy loss function, and word embeddings pretrained on PubMed. The important difference in this neural network is that layer nodes were connected in a sequential manner, both forward and backward. Attention was also used to bias more important weights in the network architecture. Details of the network architecture are shown in Figure 7.
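A corresponding minimal Keras sketch follows; the attention mechanism mentioned above is omitted for brevity, and the sizes are placeholders rather than the exact architecture in Figure 7.

# Minimal sketch of a bidirectional LSTM classifier with a softmax output.
from tensorflow.keras import layers, models

vocab_size, embed_dim, n_labels = 30000, 200, 138

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.Bidirectional(layers.LSTM(128)),         # forward and backward passes
    layers.Dense(n_labels, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")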

The micro-averaged AUC-ROC of the bidirectional LSTM model was 0.940.

5.4 Random Forest Ensemble of SVM, CNN, and LSTM Models

All models had similar AUC metrics. We hypothesized that different models may perform well on different subsets of labels. If this were true, it would be possible to combine the models in an ensemble to increase overall performance across all labels. We trained an ensemble classifier to evaluate this hypothesis.


Figure 6: Architecture of convolutional neural network for multi-label text classification.

Figure 7: Architecture of LSTM recurrent neural network for multi-label text classification.

A random forest model was trained on the outputs of the previously described models to weight the predictions of each and make a final prediction. Questions were first vectorized and input to each component model (i.e., the SVMs, CNN, and LSTM). The output probabilities of each model were then fed into the random forest. Specifically, the inputs of the random forest were 414 values between 0 and 1 consisting of:

• a probability from each of 138 binary SVMs

• a probability for each of 138 output nodes ofthe CNN

• a probability for each of 138 output nodes ofthe LSTM

The output of the random forest was a binary vector of 138 elements. Each element corresponded to an activity statement label. A value of 1 indicated the input question should have that label and a value of 0 indicated that it should not.
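A minimal sketch of this stacking step is shown below with randomly generated stand-ins for the component-model outputs: the three 138-dimensional probability vectors are concatenated into one 414-dimensional feature vector per question, and a random forest is fit against the multi-hot targets.

# Minimal sketch of the ensemble: stack per-label probabilities from the
# SVMs, CNN, and LSTM and train a random forest on multi-hot targets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_questions, n_labels = 1000, 138
svm_probs = rng.random((n_questions, n_labels))    # stand-in for SVM outputs
cnn_probs = rng.random((n_questions, n_labels))    # stand-in for CNN outputs
lstm_probs = rng.random((n_questions, n_labels))   # stand-in for LSTM outputs
Y = (rng.random((n_questions, n_labels)) > 0.95).astype(int)

X_stack = np.hstack([svm_probs, cnn_probs, lstm_probs])  # shape (1000, 414)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_stack, Y)                    # multi-output binary targets
predicted = forest.predict(X_stack[:5])   # one 138-element 0/1 vector per question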

Modeling Method | AUC-ROC (micro-avg)
TF-ILF+SVM | 0.968
CNN | 0.972
LSTM | 0.940
Ensemble | 0.937

Table 2: AUC of the different methods.

The micro-averaged AUC-ROC of the random forest ensemble combining the output of the other models was 0.937.

5.5 Model Comparison and Discussion

Table 2 shows the micro-averaged AUC-ROC of the models evaluated. The best performing model was the CNN. However, none of the algorithms performed dramatically differently from one another. We believe that several confounding factors in the data were equally challenging for the various methods.

Class imbalance likely complicated classification attempts and may also indicate other issues with the manual labeling process. Editors, pressed for time, may choose labels that are higher in the drop-down list of the editorial platform. They may also choose labels that are less precise but more general and, therefore, likely to be acceptable. These behaviors could explain why a small set of activity statement labels were associated with thousands of questions whereas the rest of the labels were only associated with a few questions each.

We also found that editors sometimes disagree about question labels. To address this issue, there is a manual process for label reconciliation. Senior editors can be consulted to make final decisions where necessary. Editors also pointed out that questions could be assigned far more activity statements than is currently the case. To optimize the adaptive quizzing experience for users, editors limit labeling to one or two labels that best fit the question.

5.6 Final Model Evaluation

The best performing model was the CNN, though there was not a significant difference between the methods evaluated. While we limited the time spent on hyper-parameter tuning of the ensemble approach, it was interesting that it fared the worst in our evaluation. The AUC-ROC score enabled us to compare modeling approaches but does not reflect the performance in production.


Number of Tags Shown | Accuracy
Top 5 labels | 0.95
Top 3 labels | 0.76
Top 1 label | 0.47

Table 3: TF-ILF+SVM model accuracy.

When deployed, the model shows users the top five label predictions. Users can pick any subset of those labels to apply them to the question being reviewed. To get a sense of accuracy in production, we log how many times we cover all relevant labels in the top N predictions, as shown in Table 3.
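The sketch below shows one way to compute this coverage metric (illustrative arrays, not the production logs): a question counts as correct when every one of its true labels appears among the model's top-N suggestions.

# Minimal sketch: fraction of questions whose true labels all fall inside
# the model's top-N highest scoring suggestions.
import numpy as np

def top_n_coverage(y_true, y_score, n):
    top_n = np.argsort(-y_score, axis=1)[:, :n]   # indices of the n highest scores
    hits = [set(np.flatnonzero(t)).issubset(set(row))
            for t, row in zip(y_true, top_n)]
    return float(np.mean(hits))

y_true = np.array([[1, 0, 0, 0], [0, 1, 1, 0]])
y_score = np.array([[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.4, 0.1]])
print(top_n_coverage(y_true, y_score, n=3))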

6 Impact Analysis

We are currently logging editor activity and calculating metrics to perform a thorough impact analysis. Initial estimates show that the time spent labeling a question with NCLEX tags went from a few minutes before machine learning to less than one minute after our solution was deployed. There are tens of thousands of questions in Wolters Kluwer products like PrepU and CoursePoint, and more content is being generated every year. This impact is therefore significant, amounting to several hours of editorial time and potentially $100,000 or more in savings annually.

Editors have responded very positively and regularly use machine learning label suggestions in their current workflow. That said, it will take some time for them to accept a completely automated process. Perhaps more importantly, subject matter experts have assessed that the consistency and quality of labels assigned to questions increase with the model suggestions. Nursing content editors often apply labels based on their personal understanding of content, which is sometimes subjective. There may also be biases toward selecting "convenient" labels when having to choose from a lengthy list in a complicated workflow. The predictive model provides consistent label suggestions, which in turn results in more consistent labels being assigned.

7 Future Work

The class imbalance of this task motivates the potential use of active machine learning. Some labels have only been assigned to a handful of questions. For these labels, we may work with editors to find more exemplar questions or create new ones. These new questions can then be merged with the training data and the model retrained to ameliorate the effects of class imbalance. In active learning, this process is typically repeated iteratively to target problem areas for a model. By selectively labeling new questions and down-sampling over-represented labels, we can fine-tune the data for retraining models to improve overall accuracy. Active machine learning has specifically been used for multi-label text classification problems (Yang et al., 2009).

Another area for further study is the evaluation of more recent, deep, transformer models. Because there is a great deal of semantic similarity between questions, these models may not fare better than more traditional vectorization and classification techniques. We intend to evaluate this hypothesis in future work.

There are many different tag sets and taxonomies that can be used to label nursing education content. Tagging both content and questions supports more advanced features such as dynamic remediation and adaptive learning. For instance, when a student answers a question incorrectly, learning software can automatically provide links to learning materials that are related to the topics addressed in that question. We are actively investigating how tagging and organizing content can support various use cases for adaptive learning.

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification. arXiv e-prints, page arXiv:1904.08398.

Tal Baumel, Jumana Nassour-Kassis, Michael Elhadad, and Noemie Elhadad. 2017. Multi-label classification of patient notes: a case study on ICD code assignment. CoRR, abs/1709.09587.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, Greg Ver Steeg, and Aram Galstyan. 2019. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1).

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

Ladislav Lenc and Pavel Kral. 2017. Word embeddings for multi-label document classification. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 431–437, Varna, Bulgaria. INCOMA Ltd.

Andrew McCallum and Kamal Nigam. 2001. A comparison of event models for naive Bayes text classification. Work Learn Text Categ, 752.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751. The Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Sampo Pyysalo, F. Ginter, Hans Moen, T. Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. Proceedings of Languages in Biology and Medicine.

Jesse Read, Luca Martino, Pablo M. Olmos, and David Luengo. 2015. Scalable multi-output label prediction: From classifier chains to classifier trellises. Pattern Recognition, 48(6):2096–2109.

Newton Spolaôr, Everton Alvares Cherman, Maria Carolina Monard, and Huei Diana Lee. 2013. A comparison of multi-label feature selection methods using the problem transformation approach. Electron. Notes Theor. Comput. Sci., 292:135–151.

Jinzhong Xu, Jie Liu, and Xiaoming Liu. 2009. Research on topic relevancy of sentences based on HowNet semantic computation. In 9th International Conference on Hybrid Intelligent Systems (HIS 2009), August 12-14, 2009, Shenyang, China, pages 195–198. IEEE Computer Society.

Bishan Yang, Jian-Tao Sun, Tengjiao Wang, and Zheng Chen. 2009. Effective multi-label active learning for text classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 917–926.