DISEASE PREDICTION
A Final Year Project
Submitted To
Department of Computer Science and Information Technology, Institute of
Science and Technology
Tribhuvan University
In Partial Fulfillment of the Requirements for the degree of
Bachelor of Science in Computer Science and Information Technology
Submitted By
Aashis Khanal (Roll no.: 725/067)
Nasib Thakuri (Roll no.: 735/067)
Nitish Shrestha (Roll no.: 737/067)
Prabhat Dhar Sharma (Roll no.: 739/067)
February 2015
Under the Supervision of
Ashok Kumar Pant (Sr. Biometrics Software Engineer, Tek Tak Nepal Pvt. Ltd.)
Tribhuvan University
Institute of Science and Technology
Central Department of Computer Science and Information Technology
We certify that we have read this project work report and in our opinion it is satisfactory in the
scope and quality as a final year project report in the partial fulfillment for the requirement of
Bachelor of Science in Computer Science and Information Technology.
EVALUATION COMMITTEE
----------------------------
Ashok Kumar Pant
Sr. Biometrics Software Engineer
Tek Tak Nepal Pvt. Ltd.
(Supervisor)
------------------------------
Sushant Poudel, Head of Department
Department of CSIT
Kathford International College
-------------------------
(External Examiner)
------------------------------
(Internal Examiner)
Date:---------------------------
LETTER OF APPROVAL
This is to certify that the work embodied in this final year project entitled “Disease Prediction”
submitted by Mr. Ashish Khanal (725/067), Mr. Naseeb Thakuri (735/067), Ms. Prabhat Dhar
Sharma (739/067) and Ms. Nitish Shrestha (737/067), to the Central Department of Computer
Science and Information Technology, is carried out under my supervision.
The project work has been prepared as per the regulations of Tribhuvan University, and I have read and
recommend that this project work be accepted in partial fulfillment of the requirements for the
Bachelor’s degree in Computer Science and Information Technology.
---------------------------------
Mr. Ashok Kumar Pant
Sr. Biometrics Software Engineer
Tek Tak Nepal Pvt. Ltd.
(Supervisor)
--------------------------------
External Examiner
--------------------------------
Susant Poudel
Head of Department
Department of Computer Science and Information Technology
Kathford International College
COPYRIGHT
The author has agreed that the Library and Department of Computer Science and Information
Technology of Kathford International College of Engineering and Management may make this report
freely available for inspection. Moreover, the author has agreed that permission for extensive copying
of this project report for scholarly purposes may be granted by the supervisors who supervised the
project work recorded herein or, in their absence, by the Head of the Department wherein the project
report was done. It is understood that recognition will be given to the author of this report and to the
Department of Computer Science and Information Technology, Kathford International College of
Engineering and Management in any use of the material of this project report. Copying, publication,
or other use of this report for financial gain without the approval of the Department of Computer
Science and Information Technology, Kathford International College of Engineering and Management
and the author’s written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in whole or in
part should be addressed to:
Head of Department
Department of Computer Science and Information Technology
Kathford International College of Engineering and Management,
Lalitpur, Nepal
ABSTRACT
Medical reports are rich in information about diseases and their symptoms. In this paper, we work out a way, based on statistical document classification, to predict the most likely disease given a person's medical problems.
Although a large volume of disease and symptom data may be available, we prefer the Naïve Bayes classifier. It assumes that the probability of each term (symptom) occurring in a document is independent of the occurrences of other terms; it is fast and has a good practical reputation in document classification. A vocabulary V of disease symptom terms is maintained. The feature vector of each document has size |V|, and its ith dimension takes a Boolean value: 1 if the ith vocabulary term is present in the document, 0 if it is not. For this reason we follow the Bernoulli distribution. The variant of Naïve Bayes that follows the Bernoulli distribution is Bernoulli Naïve Bayes [1]. The project is completed in two important phases: learning and testing.
ACKNOWLEDGEMENT
We would like to take this opportunity to express our sincere thanks to the Department of Computer
Science and Information Technology, Kathford International College of Engineering and Management
for providing us this opportunity to explore our interest and ideas in the field of Computer Science
through this project.
We would like to thank our Supervisor Mr. Ashok Kumar Pant (Sr. Biometrics Software Engineer) for
his kind support, coordination and valuable supervision for this project.
We would also like to acknowledge and extend our gratitude to everyone for his/her support and
encouragement for this project.
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Chapter I
Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope
1.6 Applications
Chapter II
Planning and Analysis
2.1 Planning
2.2 Requirement Analysis
2.3 Feasibility Study
2.4 Data Collection
Chapter III
Methodology
3.1. Literature review
3.2. System Design
3.3. Project Methodology
Chapter IV
Implementation and demonstration
4.1 Technology Used
4.2 Knowledge Base design
4.3. Testing
Conclusion and Future Enhancement
Limitations
Future Enhancements
Appendix
Sample Code
Screenshots
Figure 15: Symptoms
Bibliography
List of Figures
Figure 3.1 System Architecture
Figure 3.2 Use case diagram
Figure 3.4 NB algorithm (Bernoulli model), training and testing
Figure 4.1 Knowledge Base Design
Figure 4.2 A snapshot of Knowledge base
Figure 4.3 Disease training implementation
Figure 4.4 Stemming result
Figure 4.5 Chunking result for training document01
Figure 4.6 Chunking result for training document02
Figure 4.7 Phrasing result for training document03
Figure 4.8 Final training data for Tuberculosis
List of Tables
Table 2.1 The source for the documents
Table 2.2 Diseases and its no. of documents
Table 3.1 Stop Words Removal
Table 3.2 Stemming Concept
Table 3.3 Tokenization of input
Table 3.4 Chunking of words
Table 3.5 Training matrix for jaundice
Table 5.1 Test result for the two sets of symptoms
List of Abbreviations
S.N. | Abbreviation | Description
1. | ANN | Artificial Neural Network
2. | CSS | Cascading Style Sheets
3. | HTML | HyperText Markup Language
4. | HSQLDB | HyperSQL DataBase
5. | JVM | Java Virtual Machine
6. | KNN | K-nearest neighbor
7. | KB | Knowledge Base
8. | NLP | Natural Language Processing
9. | SVM | Support Vector Machines
10. | Tf-idf | Term frequency–inverse document frequency
11. | MVC | Model View Controller
Chapter I
Introduction
1.1 Introduction
A health-conscious person is likely to feel anxious when he has symptoms associated with
some dreaded disease, even when he may not have any disease at all. Others, however,
may be unaware of how fatal a disease can be. Such ignorance can let a disease reach a severe
stage, which may cost a lot of money and possibly even lead to death. We are developing a web-based
system that a person can use over the internet. The system takes symptoms from the user via
a web form. On the basis of the given symptoms, the system recommends the likely disease.
The system uses a widely used approach called document classification. In this
project, we use one of the most popular document classification methods, the Naïve Bayes
classifier, to match diseases against the given symptoms and predict the most probable disease. In this
system, the diseases are treated as classes, each of which has a set of documents containing
its symptoms. The classification process is based on matching the given input (symptoms) against these documents.
One of the merits of this system is that a person can save a lot of time by identifying the probable
disease early. Moreover, the system can be used as a support system to refer patients to the
respective departments. A general user can likewise use it to find out which specialist he should
consult. Overall, the system can be a useful tool for developing a medical diagnosis
support system.
1.2 Motivation
If we know that our symptoms point towards a disease such as diabetes ("sugar") or blood cancer,
we can proceed with early treatment, which might help in maintaining our health. In some cases,
it can save us from spending a lot of money on treatment. With these concerns in mind,
we concluded that a computer application that predicts the most probable disease from given
symptoms would be very helpful.
1.3 Problem Statement
Most of the time, when we feel something unusual in our body, we start searching on
Google: “what are the symptoms of Sugar/Blood cancer/Jaundice?” and the like. We are all
afraid that we might be getting one of those deadly diseases. It would be helpful if we
could describe how we are feeling right now and be told the possible diseases for these
conditions. A lot of disease-related information is available on the internet. In this project we, a group
of students, have tried to process that information to train our system. Once the system is trained
and adjusted for errors, it is able to find the most probable disease when anyone provides a list of
symptoms.
1.4 Objectives
The main objectives of this project are:
• To predict the most probable diseases for given symptoms.
• To develop a doctor assistant system.
1.5 Scope
Although this system has a wide scope, some major areas are:
• Anyone who can access the internet can benefit from the disease prediction system.
• Online doctor appointment:
Upon integration with a hospital’s website, this system can automatically book an appointment with a doctor based on the user's symptoms.
• Front desk assistant in hospitals:
Doctor appointment and time slot management in the hospital.
• Recommendation system based on the predicted disease, economic condition, etc.
1.6 Applications
The applications of this project can be listed as:
• From anywhere with internet access, one can check one's health condition instantly.
• People in remote areas can be made aware of deadly diseases early, based on symptoms. This can someday save someone’s life.
• This system can also be used as a real-world medical assistant.
• This system can help a user to get information about hospitals.
Chapter II
Planning and Analysis
2.1 Planning
In the planning phase, a study of the required data is done. This system is data driven, so
collecting reliable data and pre-processing it carefully is our main objective. We also
need a huge amount of data. We concluded that there is no better alternative than the internet for data
collection.
A web-based application has a greater impact than a desktop application. A system like the
Disease Prediction system, which needs further enhancement in the future, should be modular and
scalable. Taking this into account, we decided to use a JVM-based framework.
2.2 Requirement Analysis
The requirements are to be collected before the start of the project's development life
cycle. The initial requirements are the ones that give the project's development a head start.
2.2.1 Software Requirement Specification
i. Functional Requirement
The main goal of the system is to suggest a disease based on the given symptoms. The
application can be used from anywhere with internet access. The system takes natural English phrases
separated by commas as input, so the user interface is simple to use.
ii. Non Functional Requirements
It includes features such as:
Availability:- This system is web based, so anyone with internet connectivity can access it.
Maintainability:- The system can be trained with new diseases from the UI, which makes it highly
maintainable.
Portability:- This application is JVM based, so any machine running a JVM can operate it.
The use of the Bootstrap framework makes it responsive on all sorts of devices, such as Android
phones, iPhones, etc.
iii. Technical Specifications
• JVM 1.7 or higher
• Apache Tomcat 7.0+
• Spring Framework 4.1.3.RELEASE
• Apache Maven 3.2.5
• Java Web module 3.0+
2.3 Feasibility Study
The following are taken into consideration in the feasibility study:-
2.3.1 Economical feasibility
Economic analysis is the most frequently used method for evaluating the effectiveness of
a proposed system. It is more commonly known as cost-benefit analysis: the procedure of
determining the benefits and savings that are expected from the proposed system and comparing them
with its costs. If the benefits outweigh the costs, the decision is made to design and
implement the system; otherwise, alterations are made to the proposed system.
Costs associated with the development of computer-based systems are as follows.
i. Procurement costs such as consultation, equipment purchase, installation, furnishing the
site etc.
ii. Start-up costs such as operating system cost, personnel search cost etc. Project-related costs
such as software purchase, personnel training, data collection, document preparation
costs etc.
iii. Ongoing costs such as hardware and software maintenance, rental, depreciation of hardware
etc. The proposed system is easy to install and free of cost to use.
2.3.2 Operational feasibility
The user interface is designed to be user friendly. Any user with basic knowledge of the Internet
can use it. Besides this, the app gives the user a simple way to input symptoms and
navigate to other pages for further information about diseases.
2.4 Data Collection
Both primary data and secondary data are used in the project.
a. List of diseases
b. List of symptoms
In this phase, we collect disease reports that consist of disease symptoms and the
respective diseases. As per our objective, for 14 diseases we search the internet for their
symptoms and related medical terms. These are used as the training data.
The following web resources are used to collect training data.
Table 2.1 The source for the documents
Web Site | No. of documents collected for each disease
http://www.webmd.com/ | 4
http://www.nhs.uk/ | 5
http://www.mayoclinic.org/ | 3
http://www.medicinenet.com/ | 4
http://www.celiac.org | 5
The present system is trained on the following diseases and documents.
Table 2.2 Diseases and its no. of documents
Disease Name | Number of Documents Trained
Asthma | 3
Cholera | 3
Diabetes | 3
Epilepsy | 2
Flu | 2
Jaundice | 3
Malaria | 3
Pneumonia | 3
Rabies | 3
Sinus | 3
Smallpox | 2
Tetanus | 3
Tuberculosis | 3
Typhoid | 2
Chapter III
Methodology
3.1. Literature review
Document classification or document categorization is a problem in library science,
information science and computer science. The task is to assign a document to one or more
classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The
intellectual classification of documents has mostly been the province of library science, while the
algorithmic classification of documents is mainly in information science and computer science.
The problems are overlapping, however, and there is therefore interdisciplinary research on
document classification.
In this section, we review past work on disease prediction systems. A
few of the disease prediction systems proposed in the past are described below.
A model for the prediction of different dermatological conditions, built with the aid of Naïve
Bayesian classification, was proposed by Manjusha K. K., K. Sankaranarayanan and Seena P.
Data were collected from various tertiary
health care centers in the Kottayam and Alappuzha districts of Kerala through a survey filled out by doctors, and the
research was developed on the basis of this survey. The model
was capable of calculating the chances of occurrence of the various diseases presented. The
demonstration was carried out for eight diseases. This model could answer complex queries,
each with its own strength with respect to ease of model interpretation, access to detailed
information and accuracy.
Another model, the Heart Disease Prediction System, developed using Naïve Bayes and
Jelinek-Mercer smoothing and proposed by Ms. Rupali R. Patil, is implemented as an application in
MATLAB that can answer user queries. It can discover and extract hidden knowledge (patterns
and relationships) associated with heart disease from a historical heart disease database. The
record set with medical attributes was obtained from the Cleveland Heart Disease database.
The proposed system used 13 attributes of medical diagnosis. The accuracy of the result using
Naïve Bayes was 78%, whereas after applying the smoothing the accuracy rose to 86%.
One of the earliest and most popular expert systems for disease prediction is MYCIN. It was
used for the selection of antibiotics for patients with serious infections. Medical decision making,
particularly in clinical medicine, is regarded as an "art form" rather than a "scientific discipline";
this knowledge must be systemized for practical day-to-day use and for teaching and learning
clinical medicine.
3.1.1 Data Mining Techniques Used For Prediction
Classification is a very important data mining task, and the purpose of classification is to
propose a classification function or classification model (called a classifier). The classification
model can map the data in the database to a specific class. Classification construction methods
include: Decision Tree, Naive Bayes, ANN, KNN, Support Vector Machine, Rough Set, Logistic
Regression, Genetic Algorithms (GAs) / Evolutionary Programming (EP), Clustering etc.
i. Decision Tree: The decision tree is a structure that includes a root node, branches and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each
leaf node holds a class label. The topmost node in the tree is the root node. The decision tree
approach is well suited to classification problems. There are two steps in this technique:
building a tree and applying the tree to the dataset. Popular decision tree
algorithms include CART, ID3, C4.5, CHAID, and J48.
ii. Artificial Neural Network (ANN): An ANN is a collection of neuron-like processing units with weighted
connections between the units. It maps a set of input data onto a set of appropriate output data. It
consists of 3 layers: input layer, hidden layer and output layer. There are connections between the
layers, and a weight is assigned to each connection. The primary function of the neurons of the input layer
is to distribute the input xi to the neurons in the hidden layer. A neuron of the hidden layer sums the input signals xi
weighted by the weights wji of the respective connections from the input layer. The output is Yj = f(Σ
wji xi), where f is a simple activation function such as the sigmoid or hyperbolic tangent function.
iii. Naive Bayes: The Naive Bayes classifier is based on Bayes' theorem. This classifier uses conditional independence; that is, it assumes that an attribute value on a given class is independent of the values of the other attributes.
Bayes' theorem is as follows: Let X = {x1, x2, ..., xn} be a set of n attributes. In Bayesian terms, X is
considered as evidence and H is some hypothesis,
meaning that the data of X belongs to a specific class C. We have to determine P(H|X), the probability
that the hypothesis H holds given the evidence, i.e. the data sample X. According to Bayes' theorem,
P(H|X) is expressed as
P(H|X) = P(X|H) P(H) / P(X) (3.1)
iv. K-Nearest Neighbor: The k-nearest neighbors algorithm (k-NN) is a method for classifying
objects based on the closest training examples in the feature space. k-NN is a type of instance-based
learning and is amongst the simplest of all machine learning
algorithms. However, the accuracy of the k-NN algorithm can be severely degraded by the presence of
noisy or irrelevant features, or if the feature scales are not consistent with their importance.
v. Logistic Regression: Regression can be defined as measuring and analyzing the
relation between one or more independent variables and a dependent variable. Regression falls
into two categories: linear regression and logistic regression. Logistic regression
is a generalization of linear regression. It is mainly used for estimating binary or multi-class
dependent variables. Because the response variable is discrete, it cannot be modeled directly by linear
regression; the discrete variable is instead mapped to a continuous value (a probability). Logistic regression is basically
used to classify low-dimensional data having non-linear boundaries. It also provides the
difference in the percentage of the dependent variable and ranks the individual variables
according to their importance. So, the main aim of logistic regression is to determine the result
of each variable correctly.
vi. Rough Sets: A rough set is determined by a lower and an upper bound of a set. Every member
of the lower bound is a certain member of the set. Every non-member of the upper bound is a
certain non-member of the set. The upper bound of a rough set is the union of the lower
bound and the so-called boundary region. A member of the boundary region is possibly (but not
certainly) a member of the set. Therefore, rough sets may be viewed as having a three-valued
membership function (yes, no, perhaps). Rough sets are a mathematical concept dealing with
uncertainty in data. They are usually combined with other methods such as rule induction or
clustering methods.
vii. Support Vector Machine (SVM): Support vector machine (SVM) is an algorithm that
attempts to find a linear separator (hyper-plane) between the data points of two classes in
multidimensional space. SVMs are well suited to dealing with interactions among features and
redundant features.
viii. Genetic Algorithms (GAs)/Evolutionary Programming (EP): Genetic algorithms and
evolutionary programming are used in data mining to formulate hypotheses about dependencies
between variables, in the form of association rules or some other internal formalism.
ix. Clustering: Clustering is the process of grouping similar elements. This technique may be
used as a preprocessing step before feeding the data to the classifying model. The attribute values
need to be normalized before clustering to avoid high value attributes dominating the low value
attributes. Further, classification is performed based on clustering.
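As a worked illustration of Bayes' theorem (Equation 3.1) as described under Naive Bayes above, the following minimal Python sketch computes P(H|X) for two candidate classes under the conditional independence assumption. The class names, priors and likelihoods are hypothetical numbers chosen only for illustration; they are not taken from the project's data, and the function name `posterior` is our own.

```python
from math import prod

def posterior(priors, likelihoods, evidence):
    """Return P(H | X) for each hypothesis H, assuming the attributes in X
    are conditionally independent given H (the naive Bayes assumption)."""
    # Unnormalised score: P(H) * product of P(x_i | H) over observed attributes
    scores = {
        h: priors[h] * prod(likelihoods[h][x] for x in evidence)
        for h in priors
    }
    total = sum(scores.values())  # P(X), the normalising constant
    return {h: s / total for h, s in scores.items()}

# Hypothetical numbers, for illustration only
priors = {"flu": 0.6, "malaria": 0.4}
likelihoods = {
    "flu": {"fever": 0.8, "chills": 0.3},
    "malaria": {"fever": 0.9, "chills": 0.7},
}
result = posterior(priors, likelihoods, ["fever", "chills"])
print(result)
```

With these made-up numbers the less common class ends up with the higher posterior, because its likelihoods for the observed evidence outweigh its lower prior.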
3.1.2 Feature Extraction and Selection
Feature extraction and feature selection are processes of dimensionality reduction:
extraction transforms the existing features into a lower-dimensional space, while
selection picks a subset of the existing features without a transformation. A few of the
methods used to achieve feature extraction and selection are discussed below.
i. Tf-idf: Short for term frequency–inverse document frequency, tf-idf is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus. It is often
used as a weighting factor in information retrieval and text mining. The tf-idf value increases
proportionally to the number of times a word appears in the document, but is offset by the
frequency of the word in the corpus, which helps to adjust for the fact that some words appear
more frequently in general.
ii. Information Gain: In decision tree learning, the information gain ratio is the ratio of information
gain to the intrinsic information. It is used to reduce a bias towards multi-valued attributes by
taking the number and size of branches into account when choosing an attribute. We want to
determine which attribute in a given set of training feature vectors is most useful for
discriminating between the classes to be learned. Information gain tells us how important a given
attribute of the feature vectors is.
iii. Maximum Entropy: In his famous 1957 paper, E. T. Jaynes wrote: “Information theory
provides a constructive criterion for setting up probability distributions on the basis of partial
knowledge, and leads to a type of statistical inference which is called the maximum entropy
estimate.”
It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal
with regard to missing information. That is to say, when characterizing some unknown events
with a statistical model, we should always choose the one that has maximum entropy. Maximum
entropy modeling has been successfully applied to Computer Vision, Spatial Physics, Natural
Language Processing and many other fields. The focus here is on applying Maxent to Natural
Language Processing (NLP).
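To make the tf-idf weighting described above concrete, here is a small Python sketch. It assumes one common variant (raw term frequency multiplied by idf = log(N / df)); other variants exist, and the three toy documents are hypothetical. It illustrates the statistic only and is not the feature representation the project itself uses, which is Boolean.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.
    Uses raw term frequency and idf = log(N / df), one common variant."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus of symptom tokens
docs = [
    ["fever", "cough", "fever"],
    ["fever", "rash"],
    ["cough", "headache"],
]
weights = tf_idf(docs)
print(weights[1])
```

In the second document, "rash" receives a higher weight than "fever" because "fever" also appears in another document, which is the offsetting effect described in the text.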
3.2. System Design
The initial design of a system determines the conceptual overview of the system. System
design is the framework for system development. The design processes involved in this project are
included in this section.
3.2.1 System Architecture
The conceptual overview of the system is as follows:-
Figure 3.1 System Architecture
3.2.2 Use Case Diagram
The actors involved in this system and the actions they perform are depicted in the diagram
below. The admin performs authentication and training of the system. The user only has permission
to provide input and get the predicted result.
Figure 3.2 Use case diagram (actors: Admin, User; use cases: Login, Add disease, Train disease, Provide symptom, Prediction result)
3.3. Project Methodology
The methods and theory applied in this project are included in this section.
3.3.1. Document preprocessing
i. Stop word removal
The common terms (words) that are not of interest are removed in this phase. Terms like is,
the, have, etc. do not contribute to the result of classification.
Table 3.1 Stop Words Removal
S.N.  Sentence                                                                          Stop words
1.    coughing that lasts longer than 2 weeks with green mucus                          That, than, 2
2.    People with hypothyroidism may experience aches and pains in muscles and joints   With, may, and, in
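The stop word removal step can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual code: the class name StopWordFilter and the short stop-word list are assumptions made for the example, and splitting on non-letter characters also discards digits such as "2".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Illustrative stop-word list (an assumption); the project's real list is not shown.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "is", "the", "have", "that", "than", "with", "may", "and", "in"));

    /** Lower-cases the sentence, splits it on non-letter characters and drops stop words. */
    static List<String> removeStopWords(String sentence) {
        List<String> kept = new ArrayList<>();
        for (String token : sentence.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                "People with hypothyroidism may experience aches and pains in muscles and joints"));
    }
}
```

A set is used for the stop words so that each lookup stays constant-time regardless of how long the list grows.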
ii. Stemming
A single root word has a variety of derived words. The common practice is to do all the
processing on root words. Stemming is used to extract root words from derived words.
Table 3.2 Stemming Concept
S.N.  Derived Word  Root Word
1.    Pain          Pain
2.    Paining       Pain
3.    Pained        Pain
4.    Fast          Fast
5.    Faster        Fast
6.    Dizziness     Dizzy
7.    Weakness      Weak
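To make the idea concrete, the rows of the stemming table can be reproduced with a toy suffix stripper such as the one below. This is only a sketch under stated assumptions: ToyStemmer and its four suffix rules are hypothetical, it is not the English stemmer used by the project, and it over-stems many real words (for example "fever" would lose its "er").

```java
class ToyStemmer {
    /** Strips a few common suffixes; a toy rule set, not a full stemming algorithm. */
    static String stem(String word) {
        String w = word.toLowerCase();
        // "iness" -> "y", e.g. dizziness -> dizzy
        if (w.endsWith("iness") && w.length() > 6) {
            return w.substring(0, w.length() - 5) + "y";
        }
        for (String suffix : new String[] {"ness", "ing", "ed", "er"}) {
            // Keep at least three letters of the root so very short words are left alone.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Paining", "Pained", "Faster", "Dizziness", "Weakness"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```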
iii. Tokenization
The input is tokenized (the medical terms are separated) for further processing. The concept of
tokenization is depicted in the following table.
Table 3.3 Tokenization of input
iv. Chunking
The tokens generated from the previous phases may not be meaningful unless they come in chunks
(single phrases). The concept of chunking used in this project is depicted in the following table.
Table 3.4 Chunking of words
S.N.  Sentence                                                                          Tokens                         Chunks
1     coughing that lasts longer than 2 weeks with green mucus                          Cough, green, mucus            greenmucuscough
2     People with hypothyroidism may experience aches and pains in muscles and joints   Aches, pain, muscles, joints   Muscleache, musclepain, jointache, jointpain
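One way to read the second row of the chunking table is as a pairing of every organ token with every condition token. The sketch below illustrates that reading only; ChunkBuilder, the two word lists, and the assumption that tokens are already stemmed to forms such as "muscle" and "ache" are all hypothetical and are not taken from the project's knowledge base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkBuilder {
    /** Joins every organ token found in the input with every condition token found in it. */
    static List<String> chunk(List<String> tokens, List<String> organs, List<String> conditions) {
        List<String> chunks = new ArrayList<>();
        for (String organ : organs) {
            if (!tokens.contains(organ)) {
                continue;
            }
            for (String condition : conditions) {
                if (tokens.contains(condition)) {
                    chunks.add(organ + condition);
                }
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical, already-stemmed tokens and vocabulary lists.
        List<String> tokens = Arrays.asList("ache", "pain", "muscle", "joint");
        List<String> organs = Arrays.asList("muscle", "joint");
        List<String> conditions = Arrays.asList("ache", "pain");
        System.out.println(chunk(tokens, organs, conditions));
    }
}
```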
3.3.2. Feature extraction
Initially we have a list of diseases and a list of their symptoms as vocabulary. We use this data to
train our application. The training data for each disease class Di is a matrix: each column is a term
in the vocabulary and each row is the jth document of class Di. Each cell of the training matrix has
either 0 or 1 as entry (1 if the term is present in the document, 0 otherwise).
More clearly,
C = {Jaundice, Pneumonia, …}.
Say in Jaundice class there are k documents.
S.N.  Input                                                                             Tokens
1.    coughing that lasts longer than 2 weeks with green mucus                          Cough, green, mucus
2.    People with hypothyroidism may experience aches and pains in muscles and joints   Aches, pain, muscles, joints
The vocabulary has |V| entries; let V = {FEVER, COUGH, HIGH_FEVER, CHEST_PAIN, …}.
The training matrix for Jaundice is of size k * |V|, and so on for all diseases in C.
Table 3.5 Training matrix for jaundice
Once sufficient training is done, the system is ready for testing. In the testing phase the user gives
a textual input with his/her symptoms. From that text, we extract the vocabulary terms that are
present and prepare an input vector Vi as:
Vi = [1, 0, 0, 1, . . .]
It indicates that the user has FEVER, has no COUGH, has no HIGH_FEVER, and has
CHEST_PAIN.
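The construction of the input vector can be sketched as below. The class FeatureVectorBuilder and its method are hypothetical names chosen for this illustration; the sketch assumes the vocabulary terms present in the user's text have already been extracted into a set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FeatureVectorBuilder {
    /** Returns one entry per vocabulary term: 1 if the term is present, 0 otherwise. */
    static int[] toFeatureVector(List<String> vocabulary, Set<String> presentTerms) {
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = presentTerms.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("FEVER", "COUGH", "HIGH_FEVER", "CHEST_PAIN");
        Set<String> present = new HashSet<>(Arrays.asList("FEVER", "CHEST_PAIN"));
        System.out.println(Arrays.toString(toFeatureVector(vocabulary, present)));
    }
}
```

The order of the vector follows the order of the vocabulary list, so the same vocabulary ordering has to be used for training and for testing.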
Now, we calculate P(Jaundice | Vi), and we compute the same term for every class in C.
Where,
C = Diseases
Vi = feature vector
3.3.3 Naïve Bayes Document Classification
The Naïve Bayes classifiers are a family of probabilistic classifiers based on Bayes’
Theorem. The key concept of the prediction is computing the probability P(Di | f1, f2, . . . , f|V|) for all
Di in C = {D1, …, DC} and predicting the one with the highest value.
Where,
• Di = ith Disease in C
• f1, f2, . . . , f|V| is the feature vector of the user-given document D.
• The size of the feature vector depends on the number of terms in the vocabulary V, i.e. |V|.
        FEVER  COUGH  HIGH_FEVER  CHEST_PAIN  . . .
JAND1   1      0      1           1
JAND2   1      0      0           0
JAND3   1      1      1           1
The probability that a term (feature) fi occurs in document D is assumed to be independent of the
presence or absence of other features; this is the “Naïve” assumption of the Naïve Bayes classifier.
The user is expected to give input as plain text, from which vocabulary terms are extracted and a
feature vector is prepared.
The classification is done as:
$$P(D_i \mid f_1, f_2, \ldots, f_{|V|}) \propto P(C = D_i) \prod_{t=1}^{|V|} P(f_t \mid C = D_i) \tag{3.2}$$
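3.3.3.1 Bernoulli Model
The presence or absence of a symptom in the input is assumed to follow a Bernoulli distribution. The
likelihood for the Bernoulli classifier is,
$$P(f_1, \ldots, f_{|V|} \mid C) = \prod_{t=1}^{|V|} \left[ f_t \, P(w_t \mid C) + (1 - f_t)\,(1 - P(w_t \mid C)) \right] \tag{3.3}$$
Where,
ft is 1 or 0, based on the feature vector.
Term P(wt | C) = probability of term wt occurring in a document of class C = Di.
Term 1 - P(wt | C) = probability of term wt not occurring in a document of class C = Di.
Taking log on both sides of equation 3.3,
$$\log P(f_1, \ldots, f_{|V|} \mid C) = \sum_{t=1}^{|V|} \left[ f_t \log P(w_t \mid C) + (1 - f_t) \log\left(1 - P(w_t \mid C)\right) \right] \tag{3.4}$$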
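As a worked illustration of the log-likelihood in equation 3.4, take the input vector Vi = [1, 0, 0, 1] and assume conditional probabilities of 0.8, 0.5, 0.25 and 0.5 for the four vocabulary terms. These values are hypothetical, chosen only for the arithmetic, and the logarithm is the natural logarithm. Present terms contribute the log of their probability and absent terms contribute the log of its complement:

```latex
\begin{aligned}
\log P(f_1, \ldots, f_4 \mid C)
  &= \log 0.8 + \log(1 - 0.5) + \log(1 - 0.25) + \log 0.5 \\
  &= \log(0.8 \times 0.5 \times 0.75 \times 0.5) = \log 0.15 \approx -1.897
\end{aligned}
```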
Conditional probability is calculated as
$$P(w_t \mid C = c) = \frac{N_{ct} + 1}{N_c + 2}$$
Where,
Nct = number of documents of class c containing the tth word,
Nc = total number of documents of class c,
and 2 is added in the denominator since there are two possibilities:
- Either the tth word is present in a document of class c
- Or it is absent
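The smoothed estimate (Nct + 1) / (Nc + 2) can be sketched in Java as follows. SmoothedLikelihood is a hypothetical name for this example, and the sketch assumes the training matrix of a single class is given as rows of 0/1 entries; the three rows used in main are the JAND1 to JAND3 rows of the Jaundice training matrix.

```java
import java.util.Arrays;

class SmoothedLikelihood {
    /** Computes (Nct + 1) / (Nc + 2) for every vocabulary column of one class's matrix. */
    static double[] estimate(int[][] trainingMatrix) {
        int nc = trainingMatrix.length;          // number of documents of the class
        int vocabSize = trainingMatrix[0].length;
        double[] probabilities = new double[vocabSize];
        for (int t = 0; t < vocabSize; t++) {
            int nct = 0;                         // documents of the class containing term t
            for (int[] document : trainingMatrix) {
                nct += document[t];
            }
            probabilities[t] = (nct + 1.0) / (nc + 2.0);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        int[][] jaundice = {
                {1, 0, 1, 1},   // JAND1
                {1, 0, 0, 0},   // JAND2
                {1, 1, 1, 1}    // JAND3
        };
        System.out.println(Arrays.toString(estimate(jaundice)));
    }
}
```

With smoothing, a term that never appears in a class still gets the probability 1 / (Nc + 2) instead of zero, so a single unseen symptom cannot wipe out the whole product.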
In some cases, if Nct happens to be zero then the whole result tends to zero. So, to prevent such
scenarios, we use Laplacian smoothing as in the figure.
The generalized algorithm that we implement in this project is as follows,
Figure 3.4 NB Algorithm (Bernoulli model), training and testing [8]
The training phase has two tasks.
Calculating Prior
Calculating Likelihood
Prior
Prior, P(C = k) = Nk / N;
Where,
Nk = number of documents labeled as class C = k, i.e. disease Di in our case.
N = total number of documents.
Likelihood
Likelihood, P(wt | C = k) = nk(wt) / Nk
Where,
nk(wt) = number of documents of class C = k (i.e. Di) containing word wt
The disease with the highest value of P(C | F) is the most likely to occur.
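Putting the prior and the likelihood together, the ranking of diseases can be sketched as below. This is a minimal illustration under stated assumptions, not the project's implementation: DiseaseRanker is a hypothetical class, the Pneumonia figures are made up for the example, and the conditional probabilities are assumed to be already smoothed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DiseaseRanker {
    /** log P(C = k) plus the Bernoulli log-likelihood of the feature vector. */
    static double score(double prior, int[] features, double[] condProb) {
        double s = Math.log(prior);
        for (int t = 0; t < features.length; t++) {
            s += Math.log(features[t] == 1 ? condProb[t] : 1.0 - condProb[t]);
        }
        return s;
    }

    /** Returns the disease names ordered from most to least probable. */
    static List<String> rank(Map<String, Integer> docCounts,
                             Map<String, double[]> condProbs, int[] features) {
        int total = 0;
        for (int n : docCounts.values()) {
            total += n;
        }
        Map<String, Double> scores = new HashMap<>();
        for (String disease : docCounts.keySet()) {
            double prior = docCounts.get(disease) / (double) total;   // Nk / N
            scores.put(disease, score(prior, features, condProbs.get(disease)));
        }
        List<String> ranked = new ArrayList<>(docCounts.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Integer> docCounts = new LinkedHashMap<>();
        Map<String, double[]> condProbs = new HashMap<>();
        docCounts.put("Pneumonia", 2);                                  // hypothetical class
        condProbs.put("Pneumonia", new double[] {0.5, 0.75, 0.25, 0.75});
        docCounts.put("Jaundice", 3);
        condProbs.put("Jaundice", new double[] {0.8, 0.4, 0.6, 0.6});
        System.out.println(rank(docCounts, condProbs, new int[] {1, 0, 0, 1}));
    }
}
```

Scores are compared in log space, which keeps the ordering of the original probabilities while avoiding underflow when the vocabulary is large.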
Chapter IV
Implementation and demonstration
4.1 Technology Used
The tools used in the development process are:
4.1.1 Application layer
HTML/CSS
Bootstrap framework
jQuery
4.1.2 Business Layer
JVM
Spring Framework
Spring MVC
4.1.3 Data Layer
HSQLDB for testing
MySQL for deployment
4.2 Knowledge Base design
The knowledge base consists of training data and vocabulary words. The vocabulary
words are kept in four different text files. Each of the symptom phrases is entered in these text files
as follows:-
Figure 4.1 Knowledge Base Design
The final KB design is depicted in the following figure. It consists of four separate text files
named
1. Organs.txt
2. Condition.txt
3. Level.txt
4. Symptoms.txt
Figure 4.2 A snapshot of the knowledge base.
4.3. Demonstration
In this section we present how our system is implemented. The system was tested on the
following configuration:-
• 2 GB RAM
• Core 2 Duo 2.3 GHz processor
4.3.1. Sample input
As a demonstration, training for Tuberculosis was done. The training input is given to the
system as shown below:
Figure 4.3 Disease training implementation
4.3.2. Pre-processing
Stemming: As pre-processing, the first step is to find root words (stemming). Once this is done,
vocabulary words are extracted and phrasing is done. The output of this step is as follows:-
Figure 4.4 Stemming result
Phrasing: The stemmed results are merged together to get meaningful and complete symptoms
(which are the actual vocabulary we work on).
The result of phrasing/chunking is as follows:-
Figure 4.5 Chunking result for training document01
Figure 4.6 Chunking result for training document02
Figure 4.7 Phrasing result for training document03
4.3.3 Feature vector
The feature vector is prepared based on how many documents a particular symptom phrase is present
in. The following figure is the composite result of stemming, phrasing and feature vector preparation:-
Figure 4.8 Final training data for Tuberculosis
4.3. Testing
Performance testing of the system was done and the following result was obtained:-
Table 5.1 Test result for the two sets of symptoms
Symptoms provided                                            Predicted result
Shaking, moderate chills, headache, nausea, vomiting,        Jaundice, Cholera, Malaria, Tuberculosis, Typhoid
high fever, sweating, anemia, coma
Weakness and fatigue, dry cough, loss of appetite,           Typhoid, Malaria, Jaundice, Cholera, Tuberculosis
abdominal pain, diarrhea or constipation, rash
Conclusion and Future Enhancement
A decision support system for disease prediction is developed using Naïve Bayesian
classification together with the Laplacian smoothing technique. Using Naïve Bayesian
classification as the data mining technique, the system gives the possibilities of 14 diseases from
patient symptoms. This is an effective model to predict diseases based on the symptoms given by
a user. The user input is parsed and preprocessed through NLP steps such as tokenization,
stemming, etc. The system can extract hidden knowledge from the database. The system is
expandable in the sense that more records or attributes can be incorporated and new significant
rules can be generated using the underlying technique. We can extend this work with other data
mining techniques and other medical measurements besides the above list.
Limitations
• This system predicts the top five probable diseases, hence one cannot be fully assured by the
results.
• This system is trained for only 14 diseases.
• This system is highly sensitive to spelling mistakes.
• The user should provide input separated by the comma character.
Future Enhancements
The Disease Prediction System helps to identify possible diseases, so it can serve as a
useful tool for early health care. Most diseases are curable if identified early. Besides
disease prediction, this system has the following scope:
• Upgrade the NLP engine so that it identifies complex semantics from user input in
Natural Language.
• Dispense medicines based on statistics of predicted diseases of a particular area.
• Recommend a hospital based on the predicted result, financial status, location, service, etc.
Appendix
Sample Code
A. Input pre-processing model
package org.aacish.disease_prediction.nlp;

import java.util.ArrayList;
import java.util.List;

import org.aacish.disease_prediction.DAO.VocabDAO;

public class InputTextProcessor implements TextProcessor {

    private VocabDAO vocabDAO = null;
    private List<String> organs;
    private List<String> conditions;
    private List<String> levels;

    public VocabDAO getVocabDAO() {
        return vocabDAO;
    }

    // Stores the DAO and caches the organ, condition and level word lists.
    public void setVocabDAO(VocabDAO vocabDAO) {
        this.vocabDAO = vocabDAO;
        organs = this.vocabDAO.getOrgansList();
        conditions = this.vocabDAO.getConditionList();
        levels = this.vocabDAO.getLevelList();
    }

    // Stems every word of every document; phrases stay separated by commas.
    public List<String> stemmer(List<String> stemDocs) {
        ArrayList<String> stemmed = new ArrayList<String>();
        for (String doc : stemDocs) {
            String[] tknByDel;
            String[] tknBySpace;
            String wholeDoc = "";
            tknByDel = doc.split(", +|\\. +");
            for (String tDel : tknByDel) {
                String sentence = "";
                tknBySpace = tDel.split(" +");
                for (String tSp : tknBySpace) {
                    sentence = sentence + EnglishStemmer.stemmer(tSp) + " ";
                }
                sentence = sentence.trim() + ",";
                wholeDoc = wholeDoc + sentence;
            }
            stemmed.add(wholeDoc);
        }
        return stemmed;
    }

    // Reduces each comma-separated phrase of each document to its keyword chunk.
    public List<String> extractKeyword(List<String> fromDocs) {
        List<String> result = new ArrayList<String>();
        for (String doc : fromDocs) {
            String singleDocOnlyKeywords = "";
            String[] tkByDel = doc.split(",");
            for (String phrase : tkByDel) {
                String keyword = "";
                keyword = extractFromSinglePhrase(phrase);
                singleDocOnlyKeywords = singleDocOnlyKeywords + keyword + " ";
            }
            result.add(singleDocOnlyKeywords);
        }
        System.out.println(result);
        return result;
    }

    // Concatenates the first organ, first condition and first level word found in the phrase.
    private String extractFromSinglePhrase(String phrase) {
        String keyword = "";
        String[] tokens = phrase.split(" +");
        outOrg:
        for (String s : tokens) {
            String s1 = s.toLowerCase().trim().replace(",", "");
            for (String o : organs) {
                if (o.equals(s1)) {
                    keyword = keyword + s1;
                    break outOrg;
                }
            }
        }
        outCd:
        for (String s : tokens) {
            String s1 = s.toLowerCase().trim().replace(",", "");
            for (String c : conditions) {
                if (c.equals(s1)) {
                    keyword = keyword + s1;
                    break outCd;
                }
            }
        }
        outSym:
        for (String s : tokens) {
            String s1 = s.toLowerCase().trim().replace(",", "");
            for (String l : levels) {
                if (l.equals(s1)) {
                    keyword = keyword + s1;
                    break outSym;
                }
            }
        }
        return keyword;
    }
}
B. Prepare feature vector
package org.aacish.disease_prediction.classifier;

import java.util.ArrayList;
import java.util.List;

import org.aacish.disease_prediction.DAO.VocabDAO;

public class PrepareInputParameter implements InputParameter {

    private VocabDAO vocabDAO = null;
    private ArrayList<Integer> featureVector;
    String[] tknDoc;

    public VocabDAO getVocabDAO() {
        return vocabDAO;
    }

    public void setVocabDAO(VocabDAO vocabDAO) {
        this.vocabDAO = vocabDAO;
    }

    public PrepareInputParameter() {
    }

    // For each vocabulary symptom, counts the documents that contain it.
    public ArrayList<Integer> prepareFeaturevector(List<String> docs) {
        List<String> symptoms = this.vocabDAO.getSymptomsList();
        featureVector = new ArrayList<Integer>();
        for (String s : symptoms) {
            int tokenInNOofDocs = 0;
            for (String doc : docs) {
                /* Each doc prepares a feature vector. */
                tknDoc = doc.split(" +");
                for (String tkD : tknDoc) {
                    if (s.equals(alphaOnly(tkD))) {
                        tokenInNOofDocs++;
                        break;
                    }
                }
            }
            featureVector.add(tokenInNOofDocs);
        }
        return featureVector;
    }

    // Lower-cases the input and keeps only its letters.
    public String alphaOnly(String ip) {
        String formatted = "";
        ip = ip.toLowerCase();
        for (int i = 0; i < ip.length(); i++) {
            if (Character.isLetter(ip.charAt(i))) {
                formatted += ip.charAt(i);
            }
        }
        return formatted;
    }
}
C. Classification algorithm
package org.aacish.disease_prediction.classifier;

import java.util.List;

public class BernoulliNaiveBayesClassifier implements ClassificationAlgorithm {

    // Returns log(prior) plus the Bernoulli log-likelihood of the input vector.
    public Double classify(Double prior, List<Integer> inputVector, List<Double> conditionalProb) {
        Double score = Math.log(prior);
        for (int i = 0; i < inputVector.size(); i++) {
            if (inputVector.get(i) == 1) {
                // Term present: add log P(w_i | C).
                score += Math.log(conditionalProb.get(i));
            } else {
                // Term absent: add log(1 - P(w_i | C)).
                score += Math.log(1 - conditionalProb.get(i));
            }
        }
        return score;
    }
}
Screenshots
Figure 7.a Home page
Figure 7.b Set disease
Figure 7.c Training interface
Figure 13. Disease database snapshot
Figure 14. Database about
Figure 15: Symptoms
Bibliography
1. Apache Maven, http://maven.apache.org/
2. Caruana, R. and Niculescu-Mizil, A.: "An empirical comparison of supervised learning algorithms".
3. Craig Walls, Spring in Action, 2005.
4. Hiroshi Shimodaira, Text Classification using Naive Bayes, 11 February 2014.
5. Jaynes, E. T., 1957, “Information Theory and Statistical Mechanics”, Phys. Rev., 106, 620.
6. Jaynes, E. T., 1957, “Information Theory and Statistical Mechanics II”, Phys. Rev., 108, 171.
7. J. Han and M. Kamber, “Data mining: concepts and techniques”, 2nd Ed., The Morgan Kaufmann Series, 2006.
8. Manjusha K. K, K. Sankaranarayanan and Seena P, “Prediction of Different Dermatological Conditions Using Naïve Bayesian Classification,” International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 4, Issue 2, January 2014.
9. McCallum, Andrew; Nigam, Kamal (1998). "A comparison of event models for Naive Bayes text classification".
10. Mitchell, Tom M., 1997. Machine Learning. New York: McGraw-Hill.
11. Ms. Rupali R. Patil, “Heart Disease Prediction System using Naive Bayes and Jelinek-Mercer smoothing,” International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, Issue 5, May 2014.
12. Naive Bayes Classifier example, Eric Meisner, November 22, 2003.
13. Rish, Irina (2001). "An empirical study of the naive Bayes classifier".
14. Spring Framework, http://projects.spring.io/spring-framework/
15. The Bernoulli Model, http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
16. Tom Mitchell, Machine Learning (book), Chapter 6.