ECAI 2016 - International Conference – 8th Edition
Electronics, Computers and Artificial Intelligence
30 June -02 July, 2016, Ploiesti, ROMÂNIA
Development of Thai Text-Mining Model for
Classifying ICD-10 TM
Pornrat Jatunarapit
Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University
Bangkok, Thailand
Pornrat.j@student.chula.ac.th
Krerk Piromsopa
Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University
Bangkok, Thailand
Krerk.p@chula.ac.th
Chris Charoeanlap
Department of Orthopaedics
Faculty of Medicine, Chulalongkorn University
Bangkok, Thailand
Chris.cha@chula.ac.th
Abstract – This paper presents a model for classifying ICD-10 TM using machine learning and information retrieval. We propose a systematic approach for translating diagnoses from medical records into ICD-10 TM. First, information retrieval is used to find similar words in Thai and English diagnoses. Then, a machine learning approach is applied to classify ICD-10 TM by training models with the Naïve Bayes algorithm. The results show that the proposed approach classifies ICD-10 TM from Thai-English diagnoses with an accuracy of 81.41%.
Keywords-ICD-10 TM; Text mining; IR; machine
learning; Thai diagnosis;
I. INTRODUCTION
ICD-10 TM codes diseases, signs, abnormal findings, symptoms, circumstances, complaints, and external causes of injury as classified by the WHO (World Health Organization) [1]. Every medical note must be associated with an ICD-10 code. In several countries, hospitals are required to submit ICD-10 codes to the government.
To code ICD-10 TM, most hospitals employ medical coders [2], ICD-10 TM specialists who manually map a doctor's diagnosis to ICD-10 TM by comparing it with the codebook. This manual process introduces (1) a delay in ICD-10 TM recording, (2) a requirement for highly educated staff, and (3) a complex process, as shown in Figure 1.
The short-term goal of this research is to provide an initial machine learning model that classifies ICD-10 TM using text mining to assist medical coders. The long-term goal is to extract ICD-10 TM automatically and directly from medical records. This research presents a new method for classifying ICD-10 TM from Thai free text, which is useful for medical record keeping.
Section II explains background knowledge related to ICD-10 TM, text mining, and information retrieval. Section III covers related work. Section IV presents our design. Finally, the evaluation and conclusions are given in Section V and Section VI, respectively.
Figure 1. The process of classifying ICD-10 TM at the outpatient department (OPD) of King Chulalongkorn Memorial Hospital.
II. BACKGROUND
This section provides background necessary for
understanding our work. We first discuss ICD-10 TM
and its complexity. Later, we discuss text mining,
Thai language tools, and machine learning models.
A. ICD-10 TM
The International Classification of Diseases (ICD) was initially developed by the WHO for classifying diseases by cause of death. At present, it is in its 10th edition. Because each country has different diseases and health problems, countries localize their own ICD. For example, the United States has ICD-9 CM (CM: Clinical Modification). Likewise, Thailand created ICD-10 TM (TM: Thai Modification) in 2001. ICD-10 TM has been approved by the WHO. In this version, experts in each medical specialty from the Ministry of Public Health, colleges, and the associations of medical specialties have added more diseases to ICD-10 TM.
A code in ICD-10 TM generally begins with a capital letter A-Z followed by 2-4 Arabic digits. A “.” (dot) separates the third and fourth characters of the code. Fig. 2 shows an example of an ICD-10 TM code.
Figure 2. ICD-10 TM Code.
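The code format described above can be checked mechanically. The following is a minimal sketch in Python, assuming that the dot follows the third character of the code (as in M17.1 or S73.09 in Table VI). The name `is_icd10tm_format` is hypothetical, and the check covers only the shape of a code, not whether the code exists in the codebook.

```python
import re

# One capital letter, two digits, then optionally a dot and one or two more digits.
ICD10TM_PATTERN = re.compile(r"[A-Z][0-9]{2}(\.[0-9]{1,2})?")


def is_icd10tm_format(code: str) -> bool:
    """Return True if `code` has the letter-digits-dot-digits shape."""
    return ICD10TM_PATTERN.fullmatch(code) is not None
```

A shape check of this kind could reject obviously malformed codes before any lookup against the codebook.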
ICD-10 TM is grouped by principal diagnosis, comorbidity, complication, other diagnosis, external cause, and non-OR procedures [3]. To ease understanding, we first explain the terminology.
1) Principal diagnosis is the disease that is present before treatment, or the main disease that causes the patient to receive treatment.
2) Comorbidity is a disorder that is present before treatment and has less effect than the principal disease.
3) Complication is a disease that occurs while receiving treatment.
4) External cause is the cause that led to treatment, such as an injury from an accident.
5) Other diagnosis is any lesser disease that may affect the treatment result.
6) Non-OR procedure is a treatment that uses medical tools on the patient.
B. Text Mining [4][5]
Text mining is a complex procedure in information analysis that covers the study of word frequency, word classification, word meaning, natural language processing, machine learning, and information retrieval. There are 3 methods in text mining: 1) clustering, 2) question answering, and 3) concept linking.
C. Information Retrieval [4][6]
Information retrieval searches for similarity by using an index of the data subjects in the document files and an index of the query, comparing the two to find the closest meaning. To perform information retrieval across languages, the search process is divided into 3 sub-processes:
1) Indexing phase: In this phase, the system creates a document representative to decrease the time spent searching the data. An inverted index must gather all required data and then parse them before language processing, deleting unneeded words before indexing. The process mostly involves sorting alphabetically and counting repetition and frequency.
2) Translation phase [7]: This phase can be divided into 2 types:
2.1) Direct translation: dictionary-based translation.
2.2) Indirect translation: latent semantic analysis and explicit semantic analysis. Latent semantic analysis (LSA) searches without translation; instead, it uses an index, a matrix, and cosine correlation to find the similarity of a document or query. This process is highly efficient and popular for measuring the similarity between two pieces of data.
3) Matching phase: This phase compares or matches the index and keywords using conventional methods.
Lucene [8] is open-source information retrieval software written in the Java programming language. It can extract data, store data, and create keyword indexes that reference documents.
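To make the indexing and matching phases concrete, the following is a minimal sketch in Python of an inverted index with term frequencies and a cosine-based ranking. It is an illustration only: it uses plain term-frequency vectors, not latent semantic analysis and not Lucene's own scoring, and the documents, tokens, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return index


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_tokens, docs, index):
    """Rank the documents that share at least one term with the query."""
    query_vec = Counter(query_tokens)
    candidates = {d for t in query_vec for d in index.get(t, {})}
    scored = [(cosine(query_vec, Counter(docs[d])), d) for d in candidates]
    return sorted(scored, reverse=True)


# Hypothetical, already-tokenized documents.
docs = {
    "d1": ["bone", "cancer"],
    "d2": ["hip", "dislocation"],
    "d3": ["malignant", "bone", "tumor"],
}
index = build_index(docs)
ranking = search(["bone", "tumor"], docs, index)
```

The index restricts scoring to documents that contain a query term, which is the time saving the indexing phase aims for; here `d2` is never scored because it shares no term with the query.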
D. Machine Learning [9]
Machine learning is a branch of artificial intelligence. There are at least 3 types of machine learning.
1) Supervised learning learns from training data labeled with the correct result.
2) Unsupervised learning learns directly from input data without any hint.
3) Reinforcement learning learns from the environment. The game of Tic-Tac-Toe is a good example of reinforcement learning.
E. Word Segmentation
Word segmentation is the process of dividing statements and sentences into words. Nowadays, there are plenty of open-source tools for word segmentation, such as the standard Java library and LexTo [10].
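Section IV notes that LexTo relies on longest matching. As a rough illustration of that idea, the following Python sketch performs greedy longest-matching segmentation against a word list. It is not LexTo's implementation; the function name and the four-word dictionary ("not", "pain", "back", "back pain") are hypothetical, and a real dictionary may segment the same phrase differently.

```python
def segment(text, dictionary):
    """Greedy longest-matching segmentation against a word list."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit the character on its own.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical mini dictionary: "not", "pain", "back", "back pain".
dictionary = {"ไม่", "ปวด", "หลัง", "ปวดหลัง"}
tokens = segment("ไม่ปวดหลัง", dictionary)
```

With this toy dictionary the phrase splits into ไม่ and ปวดหลัง, because the longer entry ปวดหลัง is preferred over ปวด followed by หลัง.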
III. RELATED WORK
A. Cross-Language Information Retrieval (CLIR)
Several related works address cross-language information retrieval. T. Z. and Y.-J. Zhang [11] perform information retrieval for English-Chinese languages by using a bilingual dictionary with synonyms. P. Akewaranukulsiri [12] uses a bilingual dictionary with a vector model and cosine similarity in Thai to find meaning in the Thai herb domain. That work makes 3 suggestions. 1) The number of related entries from semantic analysis of keywords between Thai herbs and modern medicine is related to the size of the matrix. 2) Information retrieval is more effective when specific words are used. 3) Information retrieval is less efficient when the vector model is built from query expansion and keyword index creation. These suggestions make cross-language information retrieval more accurate.
B. Machine Learning
In 2009, K. Phosai [13] demonstrated latent semantic analysis and machine learning for a Thai question answering system. Semantic analysis is used for answering in Thai because the Thai language is more complex than English (e.g., there are no spaces between words). This work used CRFs as a learning model for grouping queries and choosing text. Finally, Naïve Bayes averages the results by using the highest percentage of accuracy.
In 2014, N. Chirawichitchai [14] presented emotion classification of Thai text using Boolean weighting with a support vector machine. The result shows an accuracy of 77.86% for the support vector machine algorithm.
C. Medical Record System
In 2008, Farkas and Szarvas [17] presented coding support systems (CSSs) for the automatic assignment of ICD-9-CM codes (with a limited number of possible ICD-9-CM codes) using a support vector machine. In 2014, Chiaravalloti et al. [16] implemented Java-based tools on top of a CSS framework for natural language. This work shows an accuracy of 92% on short medical texts (3,000 samples). The paper describes 3 steps in assigning ICD-9-CM codes:
1) Text preprocessing: transform the text into a standardized form.
2) Query generation: expand synonyms and increase the probability of retrieving the correct ICD-9-CM code.
3) Code selection: run a query that identifies all the candidate codes in the knowledge base.
In 2010, Chen et al. [15] used semantic analysis of free text (English) to assign ICD-9-CM codes to 978 patient records. This work shows an average precision of 67.0% with semantic features and 75.6% with matching features. The semantic analysis (semantic graph) method includes dependency parsing of clinical records and calculation of a semantic matching score to classify the ICD code. The experimental results cover three domains.
1) Semantic features: digestive 67%, neural 70%, and respiratory 63.3%.
2) Matching features: digestive 78.8%, neural 71.0%, and respiratory 75.9%.
Recent research classifies ICD-10 from medical records by using WEKA [19] to compare the accuracy of C4.5 and Naïve Bayes. This is then applied with the Apriori algorithm [18]. The result shows that the system can classify 115 diagnosis samples into 3 types of diseases, each with 7 diseases. The research achieves an accuracy of 86%.
This paper develops a model for classifying ICD-10 TM from free text using information retrieval and a text classifier algorithm. The classifier is selected by validating the accuracy of Naïve Bayes, support vector machine, and decision tree against 3,000 diagnosis notes.
IV. PROPOSED APPROACH
Our aim is to create a model for classifying ICD-10 TM by using information retrieval technology and machine learning techniques. Our design methodology has 3 steps: 1) data preparation, 2) modeling, and 3) tool creation. The steps are shown in Fig. 3.
Figure 3. Overview of approach.
A. Data Preparation
Diagnosis data were collected from King Chulalongkorn Memorial Hospital, with confidential patient information blinded. The data include the code and expansion of ICD-10 TM in both Thai and English. The data are separated into 2 data sets (a test set and a training set).
B. Model
The overview of our model is shown in Fig. 4. The model consists of the steps listed below, which are the basis of our tool.
[Figure content: a control-word entry that maps the Thai disease name มะเร็งกระดูก to the English terms "Bone cancer", "Malignant bone tumor", and "CA", together with Thai and English records that hold the fields Disease, Symptoms (e.g., คลำพบก้อนเนื้อบริเวณกระดูก / mass), Disease area, and ICD-10 TM.]
Figure 4. Overview of the model for classifying ICD-10 TM.
1) Input: take the diagnosis data and separate the words using LexTo. (LexTo is word segmentation software developed by the HLT Lab (NECTEC); it uses a longest-matching approach.) In this study, the diagnosis data are assumed to contain no misspellings.
2) Cleaning & keybase: clean the words and extract keywords from the diagnosis.
3) Control word: contains a vocabulary list of diseases in both English and Thai. The meanings and the acronyms used by doctors [20] are shown in Fig. 5 and Table I.
Figure 5. Example control word
TABLE I. ACRONYMS FROM THE BOOK “อักษรย่อที่หมอใช้”.
| Acronym | Full name |
|---|---|
| PTA | Prior to admission |
| THA | Total hip arthroplasty |
| TKA | Total knee arthroplasty |
| TLIF | Transforaminal lumbar interbody fusion |
| TMT | Tarsometatarsal |
| yr | Year |
4) Indexing: create an index from the diagnosis and the ICD-10 TM code by using the Lucene library to assign weights before the semantic search.
5) Latent semantic: search for similar words to increase the accuracy of the model.
6) Classifying: use a classification algorithm to assign the ICD-10 TM code from the output of step 5.
7) Machine learning: if the tool shows a wrong ICD-10 TM code, the user can correct the code. The newly corrected code is used to retrain the system, as shown in Fig. 6.
Figure 6. Machine learning process
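The control-word step (step 3) can be pictured as a dictionary lookup over the segmented tokens. The following Python sketch is an assumption about how such a lookup might work, not the tool's actual Java code: it uses a subset of the acronyms in Table I, and the function name and the case-insensitive matching are hypothetical choices.

```python
# A subset of the acronym entries in Table I; lowercase keys are an assumption.
ACRONYMS = {
    "pta": "prior to admission",
    "tha": "total hip arthroplasty",
    "tka": "total knee arthroplasty",
    "tlif": "transforaminal lumbar interbody fusion",
    "yr": "year",
}


def expand_acronyms(tokens, acronyms=ACRONYMS):
    """Replace each token that is a known acronym with its full name."""
    return [acronyms.get(token.lower(), token) for token in tokens]


expanded = expand_acronyms(["5", "yr", "PTA"])
```

Tokens that are not in the list, including Thai words, pass through unchanged, so the expansion can run before indexing without affecting the rest of the diagnosis text.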
C. Tool Creating
The ICD-10 TM classification model was developed in the Java programming language, with MongoDB [21] (a NoSQL database) for data storage.
D. Sample Cases
Table II shows the percentage of ICD-10 TM codes from 3,000 samples from the orthopedic department. In addition, Table III presents an example input for our information retrieval model.
TABLE II. PERCENTAGE OF ICD-10 TM CODE USE IN THE ANALYZED SAMPLE PER SECTION OF DISEASES.
| Chapter | ICD-10 TM | Percentage of code usage |
|---|---|---|
| II | Neoplasms | 5.4 % |
| VI | Diseases of the nervous system | 1.2 % |
| XIII | Diseases of the musculoskeletal system and connective tissue | 67.1 % |
| XVII | Congenital malformations, deformations and chromosomal abnormalities | 3.1 % |
| XIX | Injury, poisoning and certain other consequences of external causes | 23.2 % |
TABLE III. EXAMPLE CASE.
Input: 5 yr PTA ปวดร้าวลงขาสองข้าง ซ้าย>ขวา ไม่ปวดหลัง ปวดเวลาเดินไกล
Word segment (LexTo): 5 | yr | PTA | ปวดร้าว | ลงขา | สองข้าง | ซ้าย | > | ขวา | ไม่ปวดหลัง | ปวด | เวลา | เดินไกล
Semantic & control word: 5 | year | prior to admission | ปวดร้าว | ลงขา | สองข้าง | ซ้าย | ไป | ขวา | ไม่ปวดหลัง | ปวด | เวลา | เดินไกล
Output: ปวดร้าว | ลงขา | สองข้าง |
V. EVALUATION
We validate our model using 3,000 samples. The assessment compares the results with those from a medical coder to evaluate accuracy, precision, and recall. The definitions and equations are shown in Table IV and Equations 1-3.
TABLE IV. RESULT OF EVALUATION.
| | Predicted condition positive | Predicted condition negative |
|---|---|---|
| The system displays ICD-10 TM | TP (True Positive): correct result | FP (False Positive): unexpected result |
| The system does not display ICD-10 TM | FN (False Negative): missing result | TN (True Negative): correct absence of result |
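\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1} \]
Recall is the fraction of relevant instances that are retrieved.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2} \]
Precision is the fraction of retrieved instances that are relevant.
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{3} \]
Accuracy is the agreement between the value produced by the tool and the actual value.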
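The three measures can be computed directly from the four counts in Table IV. The following Python sketch uses the standard definitions of recall, precision, and accuracy; the function name is hypothetical and the counts are made-up illustration values, not results from this study.

```python
def evaluate(tp, fp, fn, tn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy


# Hypothetical counts, not results from this study.
recall, precision, accuracy = evaluate(tp=8, fp=2, fn=2, tn=8)
```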
TABLE V. RESULT FROM WEKA (10-FOLD CROSS VALIDATION)
| Algorithm | Predicted condition positive: correct | Predicted condition positive: incorrect | Precision | Recall |
|---|---|---|---|---|
| Support vector machines | 80.21 % | 19.79 % | 0.802 | 0.839 |
| Naïve Bayes | 81.41 % | 17.59 % | 0.814 | 0.835 |
| Decision Tree (C4.5) | 73.63 % | 26.37 % | 0.727 | 0.800 |
The preliminary results in Table V show that the decision tree gives the worst result and Naïve Bayes yields the best precision, at 81.41%. SVM, however, gives the best recall. Overall, Naïve Bayes gives the best result. Therefore, we choose Naïve Bayes for developing our tool.
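For readers unfamiliar with the chosen classifier, the following is a minimal sketch in Python of a multinomial Naïve Bayes text classifier with add-one smoothing. It is not the WEKA implementation used in the evaluation; the function names, the training tokens, and the placeholder labels `CODE_A` and `CODE_B` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(tokens, model):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data with placeholder labels, not real ICD-10 TM codes.
samples = [
    (["hip", "pain", "fall"], "CODE_A"),
    (["hip", "dislocation"], "CODE_A"),
    (["spine", "curve"], "CODE_B"),
    (["back", "spine", "pain"], "CODE_B"),
]
model = train(samples)
label = predict(["spine", "pain"], model)
```

Because the model is just a set of counts, a code corrected by the user can be folded in by adding one more sample and retraining, which fits the feedback step described in Section IV.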
An example result from our tool is shown in Table VI.
TABLE VI. EXAMPLE RESULT FROM TOOL.
| Diagnosis | Result (ICD-10 TM): medical coder | Result (ICD-10 TM): model |
|---|---|---|
| ลื่นล้มสะโพกซ้ายกระแทกพื้น ปวดสะโพกซ้าย ยืนลงน้ำหนักไม่ได้ | M17.1 | M17.1 |
| ญ 58 yr กระดูกสันหลังคด ไม่มีอาการอื่น ไม่ปวดหลัง | M41.1(5) | M41.1 |
| นิ้วเท้าที่ 2 ด้านซ้ายขาด | S98.1 | S98.1 |
| Chronic Lt Hip dislocation Operation : Open reduction and Hip spica Lt. | S73.09 | S73.09 |
| Myelogram, Post myelogram no complication | Z09.8 | Z09.8 |
| มีรถตัดหน้า มอร์เตอร์ไซด์ล้ม ไหล่ขวากระแทกพื้น Prominent Rt distal clavicle ขยับลำบาก ไม่ชา | S49.8 | S48.9 |
VI. CONCLUSIONS AND FUTURE WORK
A machine learning model is developed to classify ICD-10 TM using text mining. The results show that the proposed approach yields an accuracy of up to 81.41% on the initial data. The advantage of this research is that the model can assist medical coders and reduce their working time. However, this analysis is a preliminary result, so the accuracy can be improved further. The work can also be applied to other medical departments. We aim to develop fully automated ICD-10 TM extraction from follow-up notes.
ACKNOWLEDGMENT
We would like to thank King Chulalongkorn Memorial hospital for providing the support.
REFERENCES
[1] World Health Organization. [Online]. Available: http://www.who.int/about/en/. [Accessed 5 October 2015].
[2] Advancing the business of Healthcare [Online]. Available: http://www.aapc.com/medical-coding/medical-coding.aspx. [Accessed 3 February 2016].
[3] K. Sangkhawasi, Introduction to ICD-10 [Online]. Available: http://www.slideserve.com/kerry/icd-10. [Accessed 10 December 2016].
[4] W. Wongwilaisakun, “Data Warehouse and Data Mining for Management,” Panyapiwat Journal, vol. 2, no. 2, pp. 157-165, special issue, May.
[5] R. Feldman and J. Sanger, The Text Mining Handbook.
[6] K. Kesorn, “Cross language (Thai-English) Information Retrieval: Concepts and Challenges,” KKU Sci. J. 41(1), pp.121-133, 2013.
[7] K. Kesorn, “Semantic Search: The New Idea of Search Engine and The Way for Future Development,” Valaya Alongkorn Review, vol.2.
[8] D.Cutting [Online]. Available: http://lucene.apache.org/. [Accessed 3 February 2016].
[9] D. Zhang and J. J. P. Tsai, “Machine Learning and Software Engineering,” Kluwer Academic, 2003.
[10] National Electronics and Computer Technology Center [Online]. Available: http://www.sansarn.com/lexto/. [Accessed 20 December 2016].
[11] T. Z. and Y.-J. Zhang, “Research on Chinese-English Cross-Language Information Retrieval,” Machine Learning and Cybernetics, 2008 International Conference, vol. 5, pp. 2591-2596, 2008.
[12] P.Akewaranukulsiri, “Semantic and Cross-Language Information Retrieval for Thai Herbal and Medicine Using Latent Semantic Analysis,” International Conference on Information Science and Applications, 2013.
[13] K.Phosai, “Latent semantic analysis and machine learning for Thai question answering system,” 2009.
[14] N. Chirawichitchai, “Emotion Classification of Thai Text based Using Term weighting and Machine Learning Techniques,” 2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2014.
[15] P.Chen, A.Barrera, C.Rhodes, “Semantic analysis of free text and its application on automatically assigning ICD-9-CM code to patient records,” Cognitive Informatics (ICCI),2010 9th IEEE International Conference, pp.68-74, 2010.
[16] M. Teresa Chiaravalloti, R. Guarasci, V.Lagani, E.Pasceri, R.Trunfio, “A Coding Support System for the ICD-9-CM standard,” International Conference on Healthcare Informatics, pp.71-78, 2014.
[17] R. Farkas and G. Szarvas, “Automatic construction of rule-based ICD-9-CM coding systems,” BMC Bioinformatics, vol.9, no. 3:S10, pp.1-9, 2008.
[18] S.Monthasuwan, P.Tantasanawong, N.Ruangrit, “Development system for searching code ICD-10 for medical record,” Science and Technology Silpakorn University, pp.74-88, 2015.
[19] Weka version 3.6.13 1999-2008, The university of Waikato [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/. [Accessed 16 August 2015].
[20] K. Sangkhawasi, “อักษรย่อที่หมอใช้”.
[21] MongoDB Inc., [Online]. Available: https://www.mongodb.org/. [Accessed 20 March 2016].