Syndromic Surveillance through Measuring Lexical Shift in Emergency Department Chief Complaint Texts

Hafsah Aamer, Bahadorreza Ofoghi, Karin Verspoor
Department of Computing and Information Systems

The University of Melbourne, Parkville, Victoria 3010, Australia

[email protected] [email protected] [email protected]

Abstract

Syndromic surveillance has been performed using machine learning and other statistical methods to detect disease outbreaks. These methods are largely dependent on the availability of historical data to train the machine learning-based surveillance system. However, relevant training data may differ from region to region due to geographical and seasonal trends, meaning that a syndromic surveillance system designed for one area may not be effective for another. We propose and analyse a semi-supervised method for syndromic surveillance from emergency department chief complaint textual notes that avoids the need for large training data. Our new method is based on the identification of lexical shifts in the language of chief complaints of patients, as recorded by triage nurses, which we believe can be used to monitor disease distributions and possible outbreaks over time. The results we obtained demonstrate that effective lexical syndromic surveillance can be approached when distinctive lexical items are available to describe specific syndromes.

1 Introduction

The increase in new emerging pathogenic diseases like SARS, Ebola, and the Zika virus requires an ongoing, effective syndromic surveillance system. A syndromic surveillance system keeps track of the frequency of patients experiencing specific syndromes over time. Any abnormality in the normal trend of syndromes with respect to time may imply a proximal disease outbreak.

There are many data sources that can be used to perform syndromic surveillance, such as data from hospital emergency departments. Chief complaints provide one such rich data source for syndromic surveillance. A chief complaint is the set of signs and symptoms that the triage nurse registers upon a patient's arrival at the emergency department of a hospital. The symptoms are usually described in terse, ungrammatical, and abbreviated language. Once the chief complaint is registered at the initial assessment, the patient sees the doctor and receives a detailed check-up. This may include sending the patient's specimen for laboratory testing, especially when dealing with a potentially dangerous infectious disease. The results from laboratory tests may take days before the end result is available, and by then there is a large risk of spreading the disease to other people who come in contact with the infected person.

Chief complaints are readily available in a digital format and can therefore be easily processed using natural language processing algorithms. In the past, various work has been done to perform syndromic surveillance using supervised machine learning and statistical algorithms (Tsui et al., 2003; Espino et al., 2007; Chapman et al., 2005; Bradley et al., 2005). The major drawback of machine learning methods is the requirement for historical data that can be used to train the system, and a sensitivity to the characteristics of specific text types (Baldwin et al., 2013). In the case of syndromic surveillance, there is evidence of a need for localized training data; as Ofoghi and Verspoor (2015) found, a machine learning classifier trained on an American data set may not be effective on an Australian data set. The authors found that the American off-the-shelf syndromic classifier (CoCo) achieved a lower F-score on the Australian data set compared with another classifier (SyCo) that was trained on the Australian data set. Moreover, there may be a lack of resources to collect ongoing chief complaint data, especially in remote areas. Therefore, there is a need for a surveillance system that can work without the availability of historical data over a long period of time.

In this work, rather than using machine learning algorithms to classify the new chief complaint of a patient into pre-defined syndromic groups, we explored the lexical content of chief complaints over time, specifically the changes in the distribution of terms, that may indicate an impending event. We hypothesise that when the probability distribution of terms in chief complaints of consecutive time-frames exhibits a large divergence, then there has been a measurable change in the trend of syndromes, which can be used for detecting an outbreak. The Jensen-Shannon Divergence (JSD) (Lin, 1991), also known as information radius, between consecutive time-frames over a chief complaint corpus was used in combination with CUSUM (Cumulative Sum) algorithms (Fricker Jr et al., 2008) to detect any aberrancy in the data. We also experiment with the text segmentation algorithm Link Set Median (LSM) (Hoey, 1991) to find segmentations in the chief complaint texts, as a proxy for lexical shift.

The remainder of this paper is organized as follows. We first describe the SynSurv data set used for our experiments. Then, the three different types of time-frames necessary to perform our lexical analyses will be introduced. This is followed by a discussion of how we modeled chief complaints for textual analysis using statistical methods. Finally, we discuss the utilization of abnormality detection algorithms and the results obtained in our experiments.

2 The SynSurv Data Set

The Syndromic Surveillance (SynSurv) data we used for our analyses was collected from two of the main hospitals in Melbourne, Australia: the Royal Melbourne Hospital and the Alfred Hospital. The data was collected on behalf of the Victorian Department of Health, initially to enable monitoring during the 2006 Commonwealth Games held in Melbourne. The data contains 314,630 chief complaints labeled with a syndromic group as well as with disease codes in the ICD-10 and SNOMED terminologies. The original SynSurv data set contained data with respect to eight syndromes: Flu-Like Illness, Diarrhea, Septic Shock, Acute Flaccid Paralysis, Acute Respiratory, Radiation Injury, Fever with CNS, and Rash with Fever. Due to the sparsity of data for many of the syndromes, we chose the top three with the highest numbers of positive cases in the SynSurv data, i.e., Flu-Like Illness, Diarrhea, and Acute Respiratory.

Chief complaints differ from the majority of other free text that can be found in textual documents or social media posts in that they contain medical acronyms and abbreviations, as well as numeric values representing body temperature, blood pressure, and the like. They are notes entered by triage nurses and lack a well-defined linguistic structure. Therefore, some preprocessing was carried out on the set of chief complaints.

We approached the chief complaint preprocessing task first by lowercasing, lemmatizing using the Stanford Lemmatizer, and removing stop words and the already assigned ICD-10 and SNOMED disease codes from the end of the chief complaint strings. These are the disease codes assigned to each chief complaint upon a patient's discharge; we removed them to retain the chief complaint texts in their original format. We also removed all non-alphanumeric symbols except "/", which plays a meaningful role in the medical domain. For instance, blood pressure is recorded as two numbers, such as 120/80, the first number representing systolic blood pressure and the second representing diastolic blood pressure. If we remove "/", this numeric reading becomes "12080" or "120 80", which results in a loss of information; neither choice may define a good feature to represent a chief complaint with a specific syndrome. In addition, the same symbol "/" is used for shorthand notations while making notes; for example, the chief complaint texts heavily contained "o/a" meaning "overall pale". Other notations included "r/a", "p/s", "c/o", and "r/t". We also observed that the use of such notation depended on the nurse's own preference; while there were many chief complaints with "o/a", there were also many occurrences of "o a" without the symbol "/" between the characters. This meant we could not target all the notations consistently.

However, we did not remove any numeric values from the chief complaints because numeric values may be relevant for diagnosing a syndrome. For example, body temperature is one of the main recordings that a doctor (or a nurse) takes when a patient visits the emergency department to observe if there is a serious illness.
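A minimal sketch of this preprocessing is given below, with NLTK's WordNetLemmatizer and English stop-word list standing in for the Stanford Lemmatizer and stop-word list used in our pipeline (those substitutions, and the function name, are assumptions for illustration only):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK resources used here are stand-ins for the tools named in the text.
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def preprocess_chief_complaint(text):
    """Lowercase, strip symbols other than '/', lemmatize, and drop stop words."""
    text = text.lower()
    # Keep letters, digits, whitespace, and '/', which carries meaning (e.g. 120/80, o/a).
    text = re.sub(r"[^a-z0-9/\s]", " ", text)
    tokens = [LEMMATIZER.lemmatize(tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Example:
# preprocess_chief_complaint("C/O fever & cough, BP 120/80, o/a pale")
# -> ['c/o', 'fever', 'cough', 'bp', '120/80', 'o/a', 'pale']
```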

3 The Choice of Time-Frame

The number of patients that visit emergency departments varies between weekdays and weekends and also seasonally. To cater for such day-of-the-week and seasonal trends, we used different time-frames to accumulate chief complaints, some of which normalize the frequencies of chief complaints over longer (seven-day) periods. We experimented with the following three time-frames:

Intersecting seven-day windows
The chief complaints over seven days were accumulated as one time-frame window; then, this time-frame window was advanced by one day. The seven-day time-frame windows cater for the varying frequency of patients visiting over weekdays and weekends. The one-day shifting time-frame is popular for supervised syndromic surveillance as it allows for real-time syndromic surveillance at the end of each day (Hutwagner et al., 2003). Note that for lexical shift analysis with seven-day windows shifting by one day, there is a large vocabulary overlap for the six intersecting days; however, the variance would be significant enough to be captured by the aberrancy detection algorithms that will be discussed later.

Disjoint seven-day windows
The chief complaints over seven days were accumulated as one time-frame corpus; then, the time-frame was advanced by seven days so that two consecutive windows were disjoint.

Disjoint one-day windows
In this case, all of the chief complaints in a day formed one time-frame corpus of chief complaints. The time shift was one day; therefore, the chief complaints of the next day formed the next time-frame corpus. Although this time-frame does not normalize day-of-the-week variances, we incorporated this type in order to find some daily patterns in the SynSurv data set.
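The three window types above can be viewed as one generic rolling-window routine parameterised by width and step. The sketch below assumes the chief complaints are available as a date-keyed dictionary of preprocessed token lists (that data layout, and the function name, are assumptions):

```python
from datetime import date, timedelta


def build_windows(records, start, end, width_days, step_days):
    """Accumulate chief complaints into time-frame corpora.

    records: dict mapping date -> list of preprocessed chief-complaint token lists
    width_days=7, step_days=1 -> intersecting seven-day windows
    width_days=7, step_days=7 -> disjoint seven-day windows
    width_days=1, step_days=1 -> disjoint one-day windows
    """
    windows = []
    day = start
    while day + timedelta(days=width_days - 1) <= end:
        corpus = []
        for offset in range(width_days):
            for complaint in records.get(day + timedelta(days=offset), []):
                corpus.extend(complaint)  # flatten the tokens of each complaint
        windows.append((day, corpus))
        day += timedelta(days=step_days)
    return windows


# e.g. build_windows(records, date(2005, 7, 1), date(2009, 8, 31), width_days=7, step_days=1)
```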

4 Textual Modeling of Chief Complaints

We examine the distribution of lexical items in chief complaints to find differences in the terminology used in consecutive time-frames. For this, statistical methods were utilized, as will be discussed in the following sections. Such statistical methods have previously been used for corpus analysis and comparison (Verspoor et al., 2009; Rayson and Garside, 2000). We follow that prior work here, considering chief complaints in consecutive time-frames as separate corpora to be compared with each other.

4.1 Jensen-Shannon Divergence

The Jensen-Shannon divergence, also known as Information Radius, is a symmetric measure of the similarity between two probability distributions P and Q over the same event space. It is a symmetrized and smoothed version of the Kullback-Leibler Divergence (KLD) and is calculated using Equation 1.

$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D(P \,\|\, M) + \frac{1}{2} D(Q \,\|\, M) \quad (1)$$

where $M = \frac{1}{2}(P + Q)$ and $D$ represents the KLD distance between two probability distributions, which on a finite set $\chi$ is calculated using Equation 2:

$$D(P \,\|\, Q) = \sum_{x \in \chi} P(x) \ln \frac{P(x)}{Q(x)} \quad (2)$$

When modeling the chief complaint corpora, the union set of the vocabularies of the two corpora to be compared was constructed ($V = V_{c_1} \cup V_{c_2}$), where $c_1$ and $c_2$ were the chief complaint corpora belonging to the two consecutive time-frames. $P$ and $Q$ represent the probability distributions of the corpus terms. Since JSD inherently performs probability smoothing, to find the probability distribution of each term $t_k$ over $V_{c_i}$ for each corpus, the conditional probability of term $t_k$ in each corpus was calculated using Equation 3, based on the raw term frequencies ($tf$) of the terms in $c_i$:

$$P(t_k \mid c_i) = \frac{tf(t_k, c_i)}{\sum_{t_x \in V_{c_i}} tf(t_x, c_i)} \quad (3)$$
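A small sketch of Equations 1 to 3 over two corpora represented as token lists follows; it is a straightforward implementation of the formulas above, not our original code, and the helper names are illustrative:

```python
import math
from collections import Counter


def term_distribution(tokens, vocab):
    """Conditional probability of each vocabulary term given a corpus (Equation 3)."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab)
    return {t: (counts[t] / total if total else 0.0) for t in vocab}


def kld(p, q, vocab):
    """Kullback-Leibler divergence D(P||Q); terms with P(x) = 0 contribute nothing (Equation 2)."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocab if p[t] > 0)


def jsd(tokens_a, tokens_b):
    """Jensen-Shannon divergence between two chief-complaint corpora (Equation 1)."""
    vocab = set(tokens_a) | set(tokens_b)        # V = Vc1 U Vc2
    p = term_distribution(tokens_a, vocab)
    q = term_distribution(tokens_b, vocab)
    m = {t: 0.5 * (p[t] + q[t]) for t in vocab}  # M = (P + Q) / 2
    return 0.5 * kld(p, m, vocab) + 0.5 * kld(q, m, vocab)
```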

4.2 Log-Likelihood for Term-Level Filtering

When two corpora are lexically compared with each other, especially in the case of overlapping time-frames, they may share a large number of terms. Therefore, calculating probability distributions over the entire union set of terms contained in the text of the two corpora may not be an effective method, as all the terms will have equal importance. In this case, there is a need for filtering out the terms that do not distinguish the two corpora well (i.e., terms that are common in both corpora).

The log-likelihood score of a term represents the relative frequency difference of that term in the two different corpora under comparison (Rayson and Garside, 2000). This measure is calculated based on the expected value for term $t_k \in V$, using the total frequency of all terms in the corpus and the actual frequency (or the sum of occurrences) of term $t_k$ in the same corpus. Equation 4 shows how the expected frequency is calculated for $t_k \in V$, where $N_{c_i}$ is the total frequency of all terms in corpus $c_i$, and $O_{t_k,c_i}$ represents the observed frequency of term $t_k$ in the same corpus $c_i$:

$$E_{t_k, c_i} = N_{c_i} \times \frac{\sum_i O_{t_k, c_i}}{\sum_i N_{c_i}} \quad (4)$$

The expected frequency of term $t_k$ in corpus $c_i$, denoted by $E_{t_k,c_i}$, is the frequency that would be expected for $t_k$ if its occurrences were evenly distributed across the two corpora (Verspoor et al., 2009). The log-likelihood of the term is therefore a measure that tells us how different the actual frequency of the term is from its expected frequency. It is calculated using Equation 5:

$$LL = 2 \sum_i O_{t_k, c_i} \ln\!\left(\frac{O_{t_k, c_i}}{E_{t_k, c_i}}\right) \quad (5)$$

An alternative to the log-likelihood measure for statistical analysis of textual corpora is Pearson's $\chi^2$ statistic. This measure assumes a normal distribution of terms in the corpora and has been shown to be less reliable, especially in the case of small textual corpora with rare terms (Dunning, 1993). Given the relatively small size of the chief complaint corpora for each time-frame, the log-likelihood analysis was preferred here.

Once the log-likelihood of each term was calculated, all of the terms with a log-likelihood below a set threshold were filtered out. The texts of the two corpora then contained only the most important terms, which participated in the calculation of the probability distributions used by JSD.
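A sketch of this term-level filtering step (Equations 4 and 5) for two corpora given as token lists is shown below; the function names and data layout are illustrative assumptions:

```python
import math
from collections import Counter


def log_likelihood_scores(tokens_1, tokens_2):
    """Log-likelihood score of every term in the union vocabulary (Equations 4 and 5)."""
    counts = [Counter(tokens_1), Counter(tokens_2)]
    totals = [sum(c.values()) for c in counts]                        # N_c1, N_c2
    scores = {}
    for term in set(counts[0]) | set(counts[1]):
        observed = [c[term] for c in counts]                          # O_{tk,c1}, O_{tk,c2}
        expected = [n * sum(observed) / sum(totals) for n in totals]  # Equation 4
        scores[term] = 2 * sum(o * math.log(o / e)                    # Equation 5
                               for o, e in zip(observed, expected) if o > 0)
    return scores


def filter_terms(tokens, scores, threshold):
    """Keep only the terms whose log-likelihood meets the threshold before computing JSD."""
    return [t for t in tokens if scores.get(t, 0.0) >= threshold]
```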

To estimate the best log-likelihood threshold, we calculated the JSD between consecutive time-frames when filtering terms based on different log-likelihood thresholds ranging from 0 to 20. Since we were not comparing two distinct corpora (cf. Rayson and Garside, 2000; Baldwin et al., 2013) but consecutive time-frames over a single corpus, we calculated the JSD values between all of the consecutive time-frames and then took the mean of all JSD values for each possible log-likelihood threshold value. It can be seen from Figure 1 that as the threshold increases, more terms are filtered out and eventually hardly any terms remain, and the divergence between consecutive time-frames approaches zero. A threshold this high is not ideal. We therefore applied the elbow method (Kodinariya and Makwana, 2013) to set the log-likelihood thresholds to 1.75, 3.0, and 1.25 for the one-day, disjoint seven-day, and intersecting seven-day time-frames, respectively, when filtering out the non-important terms.

Figure 1: The analysis of the effect of different log-likelihood threshold values on the changes in the JSD distances between consecutive chief complaint corpora in the SynSurv data set
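The threshold estimation described above amounts to a simple sweep; the sketch below reuses the jsd(), log_likelihood_scores(), and filter_terms() helpers sketched earlier, and the elbow point of the resulting curve is then chosen by inspection (as in Figure 1):

```python
from statistics import mean


def mean_jsd_for_threshold(window_corpora, threshold):
    """Mean JSD between consecutive windows after log-likelihood filtering at one threshold."""
    values = []
    for prev, curr in zip(window_corpora, window_corpora[1:]):
        scores = log_likelihood_scores(prev, curr)
        values.append(jsd(filter_terms(prev, scores, threshold),
                          filter_terms(curr, scores, threshold)))
    return mean(values)


# Sweep thresholds from 0 to 20 (step 0.25) and pick the elbow of the curve by inspection:
# curve = [mean_jsd_for_threshold(corpora, t / 4) for t in range(0, 81)]
```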

5 Lexical Shift Analysis with Link Set Median

In a separate set of experiments from the statistical JSD distance analysis of the SynSurv data set, we used the Link Set Median (LSM) algorithm (Hoey, 1991) to find segments in the set of chief complaints that suggest lexical shifts in the data set.

The LSM algorithm is based on the idea that a typical text has a cohesive format. Therefore, lexical overlaps can be used to measure the similarity between two sentences: if two sentences share similar terms, then the sentences may be on the same or similar topics. The LSM method identifies lexical repetitions across sentences within a corpus to find textual segments of similar topics. The algorithm represents each sentence in the corpus based on its links with other sentences in the same corpus. A lexical link is a repetition between a pair of sentences in the corpus. The link set of a sentence records the number of times that sentence has lexical overlaps with each other sentence in the corpus. For instance, if sentence 1 shares two terms with sentence 2 and one term with sentence 4, then "2" is added twice and "4" once to the link set of sentence 1, giving {2, 2, 4}. After the link sets have been created for all sentences in the text, the median of each link set is calculated. This median represents the lexical span of that sentence in the text; if a sentence has a median of "5", for instance, the sentence has a lexical span from i − 5 to i + 5, where i is the original position of the sentence. Once each sentence in the text has a corresponding median, the average median of the entire text is calculated. This average median is used as a threshold to find segment boundaries: if the median difference between two consecutive sentences is larger than the threshold, then a segment boundary is placed between the two sentences.
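A rough sketch of this procedure follows, built directly from the description above (link sets over shared terms, per-sentence medians, and the average median as a boundary threshold); it follows this section's description rather than Hoey's original formulation, so details may differ:

```python
import statistics


def lsm_segments(sentences):
    """Link Set Median segmentation, roughly following the description above.

    Each "sentence" is a token list (in our setting, the chief complaints of one time-frame).
    Returns the indices i after which a segment boundary is placed (between i and i + 1).
    """
    term_sets = [set(s) for s in sentences]
    medians = []
    for i, terms in enumerate(term_sets):
        link_set = []
        for j, other in enumerate(term_sets):
            if i == j:
                continue
            overlap = len(terms & other)    # number of shared terms with sentence j
            link_set.extend([j] * overlap)  # add j once per shared term
        medians.append(statistics.median(link_set) if link_set else i)
    threshold = statistics.mean(medians)    # average median over the whole text
    return [i for i in range(len(medians) - 1)
            if abs(medians[i + 1] - medians[i]) > threshold]
```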

In our experiments, we modeled a "sentence" as the set of chief complaints over a specific time-frame and assumed a "segment" boundary to be indicative of a lexical shift. We applied the LSM algorithm to the SynSurv data set to understand whether the algorithm would place segment boundaries where the lexical contents of chief complaints deviate, and whether such lexical shifts are tied to changes in the frequencies of syndromic groups.

6 Aberrancy Detection

When using the JSD values to find changes in the term probability distributions of consecutive time-frames, there was a need for a method to detect abnormalities in the probability deviations over time. The Early Aberration Reporting System (EARS) C algorithms (Hutwagner et al., 2003) were designed to detect such changes in syndromic surveillance data. The CUSUM algorithms are based on a statistically detectable change in counts of relevant events over specified windows of time. EARS implemented three CUSUM algorithms known as C1:mild, C2:medium, and C3:ultra for detecting large, sudden deviations of occurrences of specific events that significantly depart from the norm over time. These algorithms are named after their level of sensitivity and represent an unsupervised monitoring approach that omits the requirement for long historical data for system training. Since syndromic surveillance systems require real-time analysis of chief complaints, the EARS CUSUM algorithms are ideal. The algorithms have recently been used in another study to find possible disease outbreaks along with machine learning classification techniques (Aamer et al., 2016).

The C1 algorithm calculates the sample mean and the sample standard deviation over rolling windows of samples for days $t-7$ to $t-1$, where $t$ is the current day. C2 adds a two-day lag to the calculation of the mean and the standard deviation, and C3 is calculated on the basis of the previous two C2 values; details can be found in Hutwagner et al. (2003). In our work, we used the pre-set threshold values for the C algorithms, i.e., the thresholds C1 = 3, C2 = 3, and C3 = 2.
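The general shape of the C1/C2 tests can be sketched as follows. This is a simplified illustration of a baseline mean/standard-deviation test over a series of per-window JSD values, not the exact EARS implementation; see Hutwagner et al. (2003) for the precise formulas, including C3:

```python
import statistics


def c_statistic_alerts(values, lag=0, threshold=3.0, baseline=7):
    """Simplified C1/C2-style test over a series of per-window JSD values.

    For each position t, compare the current value against the mean and standard
    deviation of a 7-day baseline ending `lag` days before t (lag=0 for C1, lag=2 for C2)
    and raise an alert when the standardised value exceeds the threshold.
    """
    alerts = []
    for t in range(baseline + lag, len(values)):
        history = values[t - baseline - lag:t - lag]
        mu = statistics.mean(history)
        sigma = statistics.stdev(history) or 1e-9  # guard against a flat baseline
        if max(0.0, (values[t] - mu) / sigma) > threshold:
            alerts.append(t)
    return alerts


# e.g. c1_alerts = c_statistic_alerts(daily_jsd, lag=0, threshold=3)
#      c2_alerts = c_statistic_alerts(daily_jsd, lag=2, threshold=3)
```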

Note that the LSM algorithm has an internal method to calculate the threshold based on which textual segment boundaries are identified. Therefore, when using the LSM algorithm, we did not apply the CUSUM algorithms.

7 Results and Discussion

We applied the JSD and log-likelihood filtering algorithms on the chief complaints in the SynSurv data set with the different time-frames described in Section 3. Then, the aberrancy detection algorithms were utilized over the JSD measures of consecutive time-frames to retrospectively find any disease outbreaks in the data set. We separately applied the LSM method on the same data set to find likely aberrancies.

Table 1 shows the results obtained (i.e., the number of dates for which an aberrancy was detected) when the aberrancy detection algorithms were applied based on the JSD lexical distribution values, derived from the chief complaints in the SynSurv data set over the different time-frames. As can be seen in Table 1, the C2 and C3 algorithms, with higher levels of sensitivity, set off a large number of signals. Algorithms that are overly sensitive to fluctuations in disease frequencies are not ideal; they may alert health practitioners too frequently, resulting in alert fatigue and mistrust of the algorithm. On the other hand, algorithms that miss viable shifts corresponding to meaningful events are also not desirable. The C1 algorithm is the least sensitive of the three algorithms while still raising alerts; it appears to balance the two criteria most effectively.

Time-frame           Method    #Aberrancies with ADA
                               C1           C2           C3
1-day                JSD+LL    64 (2.4%)    67 (4.0%)    205 (13.2%)
7-day disjoint       JSD+LL    10 (4.6%)    14 (6.5%)    30 (13.8%)
7-day intersecting   JSD+LL    36 (2.4%)    61 (4.0%)    201 (13.3%)

Table 1: The number of aberrancy dates detected by different algorithms in the chief complaints over different time-frames. The total number of windows differed for each time-frame: the 1-day time-frame consisted of 1522 windows starting from July 2005 to August 2009, the 7-day intersecting time-frame had 1516 windows, and the 7-day disjoint time-frame had 217 windows. Note: ADA = aberrancy detection algorithm, JSD = Jensen-Shannon Divergence, and LL = log-likelihood.

Note that to apply the aberrancy detection algorithms on the resulting JSD values over the disjoint seven-day windows, there was a need for 7 to 11 weeks of baseline data, which would mean a 7 to 11 week wait before any aberrancy can be detected. For this reason, in practice, it may not be useful to apply the C algorithms on the JSD values for disjoint seven-day time-frames.

Further analysis of the C1 algorithm, however, showed that the weeks for which C1 detected outbreaks were spread throughout the years in the data set, indicating that this approach is highly sensitive and produces too many noisy signals, which is not desirable. Since the texts of chief complaints are all in a single text corpus assigned to various syndromes, a single chief complaint may correspond to more than one syndrome. For example, due to the very similar sets of symptoms that Flu-Like Illness and Acute Respiratory share, a single chief complaint may be labelled Flu-Like Illness, Acute Respiratory, and No Diarrhea at the same time. Therefore, a large lexical shift in the SynSurv data set may correspond to any of the syndromes (including the original set of syndromes introduced in Section 2). As a result, there was a need for a more informed method to distinguish between the aberrancies related to each syndrome. For this, instead of using the log-likelihood of terms, we collected all the chief complaints signaling the presence of a syndrome into one corpus and extracted the terms with the highest term frequencies for each syndrome. Table 2 shows the top 10 terms for the three syndromic groups; we can see a large intersection between the lists for Flu-Like Illness and Acute Respiratory.

We calculated JSD values over the SynSurv data set with the original chief complaint terms (no filtering, removing stop words only), as well as with the top 10 terms based on term frequencies for each syndromic group. Figure 2 shows the resulting diagrams for Flu-Like Illness and Diarrhea over the different time-frames. The results for Acute Respiratory are not shown since this syndromic group is most similar to Flu-Like Illness.

TF rank   Diarrhea           FLI           AR
1         pain               sob           sob
2         vomiting           hr            hr
3         abdo               cough         cough
4         diarrhea           pain          chest
5         hr                 o/a           pain
6         nausea             chest         o/a
7         nil                nil           hx
8         o/a                hx            respiratory
9         gastrointestinal   throat        nil
10        hx                 respiratory   phx

Table 2: The top TF-ranked terms related to each syndrome in the SynSurv data set. Note: TF = term frequency, FLI = Flu-Like Illness, and AR = Acute Respiratory. The terms include many medical abbreviations such as "sob" for shortness of breath, "hr" for heart rate, etc.
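A small sketch of this term selection follows, assuming the labelled chief complaints are available as (token list, syndrome labels) pairs; that data layout and the function name are assumptions for illustration:

```python
from collections import Counter


def top_terms_per_syndrome(labelled_complaints, k=10):
    """Pool the chief complaints labelled with each syndrome into one corpus and
    return the k most frequent terms per syndrome (as in Table 2)."""
    corpora = {}
    for tokens, syndromes in labelled_complaints:  # (token list, set of syndrome labels)
        for syndrome in syndromes:
            corpora.setdefault(syndrome, Counter()).update(tokens)
    return {s: [term for term, _ in counts.most_common(k)]
            for s, counts in corpora.items()}
```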

When comparing the effect of the different time-frames, as shown in Figure 2, the results with the one-day window time-frames seem much noisier than those of the other two types of time-frames. It is hardly possible to detect any aberrancies in the one-day time-frame output data, whereas with the seven-day windows, both disjoint and intersecting, some clear trends of peaks can be detected around a few dates in the data set.

More importantly, the utilisation of the top terms significantly changed the outputs of the aberrancy detection, as indicated by the different patterns in the diagrams in Figure 2 for the original versus the top 10 terms. However, it is noticeable that the peaks form at very similar dates for both the Flu-Like Illness and Diarrhea conditions when the top 10 terms are retained in the analysis. We suspect this is largely due to the significant intersection between the lists of top 10 terms for these two syndromic groups, as seen in Table 2. Indeed, there is a 50% overlap between the lists of top frequency terms, including "pain", "hr", "o/a", "nil", and "hx". These overlapping terms are general; they seem non-specific to any syndrome.

Figure 2: Variation of JSD values in the SynSurv data set with different time-frames when the original and top 10 terms are retained

We then experimented with the top frequency terms that were not shared between the lists of the two syndromic groups. We took the top 3 such terms from the two lists in Table 2 and calculated the JSD values over the disjoint one-day windows of chief complaints. The results are shown in Figure 3, in which the patterns of peaks are different, with drastic increases of JSD values at different and separate dates. This suggests that if effective and descriptive sets of terms are found to represent each syndromic group (hence semi-supervised), the lexical shift as measured with deviating JSD values can be utilised as an indication of possible outbreaks of the corresponding syndromes.

Figure 3: Variation of daily JSD values in the SynSurv data set when only the top 3 terms are retained

As a proof of concept, to understand whether JSD is in fact sensitive to the raw frequencies of positive cases of syndromic groups, we cross-checked deviations of JSD values (over all terms) against the actual positive cases of each syndrome in the SynSurv data set. The results are shown in Figure 4, where JSD is demonstrated to be sensitive especially to the cases where there are no positively labelled chief complaints for any of the syndromes. In such cases, as highlighted in the diagrams of Figure 4, JSD values deviate drastically, indicating a rapid change in the frequencies of the syndromes of interest.

In the last round of experiments, we applied the LSM algorithm over the different time-frames of the SynSurv data set. The algorithm detected 77 segments with the disjoint seven-day time-frame, 178 with the one-day window, and 268 with the intersecting seven-day time-frame. When analysing the dates of the segmentations, it was observed that although the dates were spread throughout the years, the segments found with the disjoint seven-day and one-day time-frames were mostly in the years 2007, 2008, and 2009, similar to the times when the peaks were observed with JSD. However, the 268 segments found using the intersecting seven-day time-frame were distributed over the four years. This may be due to the large vocabulary overlaps where the windows intersect over six days of chief complaints. The intersecting seven-day time-frame, therefore, may not facilitate an effective process of outbreak detection using lexical shift analysis with LSM.

Time-frame           Method   # of Segments
1-day                LSM      178 (11.7%)
7-day disjoint       LSM      77 (35.5%)
7-day intersecting   LSM      268 (17.7%)

Table 3: The number of segments detected in the chief complaints over different time-frames. The total number of windows differed for each time-frame: the 1-day time-frame consisted of 1522 windows starting from July 2005 to August 2009, the 7-day intersecting time-frame had 1516 windows, and the 7-day disjoint time-frame had 217 windows. Note: LSM = Link Set Median.

Figure 4: Analysing the sensitivity of lexical shifts measured with JSD to actual deviations in the number of daily syndromic chief complaints. Vertical lines point to the dates when drastic disease frequency changes align with large JSD values.

Based on our results, the large number of deviations/outbreaks detected by the LSM algorithm (in its current form) seems to be an impediment to using the algorithm for syndromic surveillance.

8 Conclusion

We proposed a new semi-supervised method to perform syndromic surveillance over the SynSurv data set, which contains a large number of emergency department chief complaints from the state of Victoria, Australia. This new method is based on locating significant lexical divergences in the texts of chief complaints accumulated over consecutive periods of time. We analysed the lexical shifts using the Jensen-Shannon Divergence (JSD), a probability distribution divergence measure, and Link Set Median (LSM), a text segmentation algorithm, for the three syndromes Flu-Like Illness, Acute Respiratory, and Diarrhea. The aim was to find whether lexical shifts are tied to possible disease outbreaks in the historical SynSurv data set. We evaluated the lexical shifts with three types of time-frames: i) one-day windows, ii) disjoint seven-day windows, and iii) intersecting seven-day windows advancing by one day.

We found that all three time-frames had some limitations: the disjoint seven-day time-frame, when used in combination with the EARS C algorithms, is not efficient and inherently results in longer waiting periods before a likely outbreak is signalled. The seven-day intersecting time-frame, on the other hand, resulted in noisy and frequent signals with the LSM algorithm, mostly as a result of large textual overlaps between consecutive overlapping time-frames. The one-day time-frame produced interesting results but suffers from the day-of-the-week effect.

Our results also demonstrate that if each syndromic group (i.e., disease) is represented by its corresponding distinguishable high-frequency terms, then the JSD measure provides evidence for lexical shifts that is aligned with drastic changes in the frequency of syndrome-labelled chief complaints. The need for distinguishable terms for each syndrome under consideration means that our methods are semi-supervised. Based on our experiments, therefore, the JSD method with syndrome-specific term sets analysed over the one-day time-frames resulted in the most promising outcomes.

In future work, we plan to expand the idea of using high-frequency terms into the utilisation of related semantic representations, such as health-related synonyms. In addition, we are planning to analyse lexical shifts in chief complaints using other natural language processing techniques, such as textual novelty detection with Topic Detection and Tracking (TDT) algorithms. TDT algorithms can track changes in text and find event-level shifts (or first stories) in a corpus. We would like to experiment with such unsupervised algorithms and find whether first-story boundaries match drastic changes in the frequencies of chief complaints related to specific syndromic groups.

Acknowledgments

We acknowledge the Australian Defence Science and Technology Group (DST Group) for the support of our work on syndromic surveillance.

References

Hafsah Aamer, Bahadorreza Ofoghi, and Karin Verspoor. 2016. Syndromic surveillance on the Victorian chief complaint data set using a hybrid statistical and machine learning technique. In K. Barbuto, L. Schaper, and K. Verspoor, editors, Proceedings of the Health Data Analytics Conference.

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356-364, Nagoya, Japan, October. Asian Federation of Natural Language Processing.

Colleen A. Bradley, H. Rolka, D. Walker, and J. Loonsk. 2005. BioSense: implementation of a national early event detection and situational awareness system. MMWR Morb Mortal Wkly Rep, 54(Suppl):11-19.

Wendy W. Chapman, Lee M. Christensen, Michael M. Wagner, Peter J. Haug, Oleg Ivanov, John N. Dowling, and Robert T. Olszewski. 2005. Classifying free-text triage chief complaints into syndromic categories with natural language processing. Artificial Intelligence in Medicine, 33(1):31-40.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Jeremy U. Espino, John Dowling, John Levander, Peter Sutovsky, Michael M. Wagner, and Gregory F. Cooper. 2007. SyCo: A probabilistic machine learning method for classifying chief complaints into symptom and syndrome categories. Advances in Disease Surveillance, 2(5).

Ronald D. Fricker Jr, Benjamin L. Hegler, and David A. Dunfee. 2008. Comparing syndromic surveillance detection methods: EARS versus a CUSUM-based methodology. Statistics in Medicine, 27(17):3407-3429.

M. Hoey. 1991. Patterns of Lexis in Text. Oxford University Press.

Lori Hutwagner, William Thompson, G. Matthew Seeman, and Tracee Treadwell. 2003. The bioterrorism preparedness and response Early Aberration Reporting System (EARS). Journal of Urban Health, 80(1):i89-i96.

Trupti M. Kodinariya and Prashant R. Makwana. 2013. Review on determining number of cluster in k-means clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6):90-95.

Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151.

Bahadorreza Ofoghi and Karin Verspoor. 2015. Assessing the performance of American chief complaint classifiers on Victorian syndromic surveillance data. In Proceedings of Australia's Big Data in Biomedicine & Healthcare Conference, Sydney, Australia.

P. Rayson and R. Garside. 2000. Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, held in conjunction with ACL 2000, pages 1-6.

Fu-Chiang Tsui, Jeremy U. Espino, Virginia M. Dato, Per H. Gesteland, Judith Hutman, and Michael M. Wagner. 2003. Technical description of RODS: a real-time public health surveillance system. Journal of the American Medical Informatics Association, 10(5):399-408.

Karin Verspoor, K. Bretonnel Cohen, and Lawrence Hunter. 2009. The textual characteristics of traditional and Open Access scientific journals are similar. BMC Bioinformatics, 10:183.
