DISEASE PREDICTION

A Final Year Project

Submitted To

Department of Computer Science and Information Technology, Institute of

Science and Technology

Tribhuvan University

In Partial Fulfillment of the Requirements for the degree of

Bachelor of Science in Computer Science and Information Technology

Submitted By

Aashis Khanal (Roll no.: 725/067)
Nasib Thakuri (Roll no.: 735/067)
Nitish Shrestha (Roll no.: 737/067)
Prabhat Dhar Sharma (Roll no.: 739/067)

February 2015

Under the Supervision of

Ashok Kumar Pant (Sr. Biometrics Software Engineer, Tek Tak Nepal Pvt. Ltd)

Tribhuvan University

Institute of Science and Technology

Central Department of Computer Science and Information Technology

We certify that we have read this project work report and in our opinion it is satisfactory in the

scope and quality as a final year project report in the partial fulfillment for the requirement of

Bachelor of Science in Computer Science and Information Technology.

EVALUATION COMMITTEE

----------------------------
Ashok Kumar Pant
Sr. Biometrics Software Engineer
Tek Tak Nepal Pvt. Ltd
(Supervisor)

------------------------------
Sushant Poudel, Head of Department
Department of CSIT
Kathford International College

-------------------------
(External Examiner)

------------------------------
(Internal Examiner)

Date:---------------------------

LETTER OF APPROVAL

This is to certify that the work embodied in this final year project entitled “Disease Prediction”

submitted by Mr. Ashish Khanal (725/067), Mr. Naseeb Thakuri (735/067), Ms. Prabhat Dhar

Sharma (739/067) and Ms. Nitish Shrestha (737/067), to the Central Department of Computer

Science and Information Technology, is carried out under my supervision.

The project work has been prepared as per the regulations of Tribhuvan University, and I have read and

recommend that this project work be accepted in partial fulfillment of the requirements for the

Bachelor’s degree in Computer Science and Information Technology.

---------------------------------

Mr. Ashok Kumar Pant

Sr. Biometrics Software Engineer

Tek Tak Nepal Pvt.Ltd

(Supervisor)

--------------------------------
External Examiner

--------------------------------
Susant Poudel
Head of Department
Department of Computer Science and Information Technology
Kathford International College

COPYRIGHT

The author has agreed that the Library and Department of Computer Science and Information

Technology of Kathford International College of Engineering and Management may make this report

freely available for inspection. Moreover the author has agreed that permission for extensive copying

of this project report for scholarly purpose may be granted by the supervisors who supervised the

project work recorded herein, or in their absence, by the Head of Department wherein the project

report was done. It is understood that the recognition will be given to the author of this report and to the

Department of Computer Science and Information Technology, Kathford International College of

Engineering and Management in any use of the material of this project report. Copying or publication

or other use of this report for financial gain without the approval of the Department of Computer

Science and Information Technology, Kathford International College of Engineering and Management

and the author’s written permission is prohibited.

Request for permission to copy or to make any other use of the material in this report in whole or in

part should be addressed to:

Head of Department

Department of Computer Science and Information Technology

Kathford International College of Engineering and Management,

Lalitpur, Nepal

ABSTRACT

Medical reports are rich in information about diseases and their symptoms. In this paper we describe an approach, based on statistical document classification, that predicts the most likely disease from the medical problems a person reports.

Although the volume of disease and symptom data can be large, we prefer the Naïve Bayes classifier. It assumes that the probability of each term (symptom) occurring in a document is independent of the occurrence of the other terms; it is fast and has a good practical reputation in document classification. A vocabulary V of disease symptom terms is maintained. The feature vector of each document has size |V|, and its ith dimension takes a Boolean value: 1 if the ith vocabulary term is present in the document and 0 if it is not. For this reason we follow the Bernoulli distribution; the variant of Naïve Bayes that does so is the Bernoulli Naïve Bayes model [1]. The project is carried out in two main phases: learning and testing.

ACKNOWLEDGEMENT

We would like to take this opportunity to express our sincere thanks to the Department of Computer

Science and Information Technology, Kathford International College of Engineering and Management

for providing us this opportunity to explore our interest and ideas in the field of Computer Science

through this project.

We would like to thank our Supervisor Mr. Ashok Kumar Pant (Sr. Biometrics Software Engineer) for

his kind support, coordination and valuable supervision for this project.

We would also like to acknowledge and extend our gratitude to everyone for his/her support and

encouragement for this project.

Table of Contents

List of Figures
List of Tables
List of Abbreviations
Chapter I
    Introduction
        1.1 Introduction
        1.2 Motivation
        1.3 Problem Statement
        1.4 Objectives
        1.5 Scope
        1.6 Applications
Chapter II
    Planning and Analysis
        2.1 Planning
        2.2 Requirement Analysis
        2.3 Feasibility Study
        2.4 Data Collection
Chapter III
    Methodology
        3.1 Literature review
        3.2 System Design
        3.3 Project Methodology
Chapter IV
    Implementation and demonstration
        4.1 Technology Used
        4.2 Knowledge Base design
        4.3 Testing
Conclusion and Future Enhancement
    Limitations
    Future Enhancements
Appendix
    Sample Code
    Screenshots
        Figure 15: Symptoms
    Bibliography

List of Figures

Figure 3.1 System Architecture
Figure 3.2 Use case diagram
Figure 3.4 NB algorithm (Bernoulli model), training and testing
Figure 4.1 Knowledge Base Design
Figure 4.2 A snapshot of Knowledge base
Figure 4.3 Disease training implementation
Figure 4.4 Stemming result
Figure 4.5 Chunking result for training document01
Figure 4.6 Chunking result for training document02
Figure 4.7 Phrasing result for training document03
Figure 4.8 Final training data for Tuberculosis

List of Tables

Table 2.1 The source for the documents
Table 2.2 Diseases and its no. of documents
Table 3.1 Stop Words Removal
Table 3.2 Stemming Concept
Table 3.3 Tokenization of input
Table 3.4 Chunking of words
Table 3.5 Training matrix for jaundice
Table 5.1 Test result for the two set of symptoms

List of Abbreviations

S.N. | Abbreviation | Description
1. | ANN | Artificial Neural Network
2. | CSS | Cascading Style Sheets
3. | HTML | Hyper Text Markup Language
4. | HSQLDB | HyperSQL DataBase
5. | JVM | Java Virtual Machine
6. | KNN | K-Nearest Neighbor
7. | KB | Knowledge Base
8. | NLP | Natural Language Processing
9. | SVM | Support Vector Machines
10. | Tf-idf | Term frequency–inverse document frequency
11. | MVC | Model View Controller

Chapter I

Introduction

1.1 Introduction

A health-conscious person may become anxious when he notices symptoms associated with a dreaded disease, even when he does not actually have any disease. Other people may be unaware of how dangerous a disease can be. Such ignorance can allow a disease to reach a severe stage, which may cost a lot of money and possibly even lead to death. We are developing a web-based system that a person can use over the internet. The system takes symptoms from the user through a web form and, on the basis of the given symptoms, recommends the likely disease.

The system uses a widely used classification approach called document classification. In this project we use one of the most popular document classification methods, the Naïve Bayes classifier, to match the given symptoms against diseases and predict the most probable disease. In this system the diseases are treated as classes, each of which has a set of documents containing its symptoms. Classification is based on matching the given input (symptoms) against these classes.

One of the merits of this system is that a person can save a lot of time by identifying the probable disease early. Moreover, the system can be used as a support system to refer patients to the respective departments. A general user can likewise use it to find out which specialist to consult. Overall, the system can be a useful tool for developing a medical diagnosis support system.

1.2 Motivation

If we know that our symptoms point towards diseases such as diabetes ("sugar") or blood cancer, we can seek early treatment, which may help us stay healthy. In some cases it can save us from spending a lot of money on treatment. With these concerns in mind, we concluded that a computer application that predicts the most probable disease from given symptoms would be very helpful.

1.3 Problem Statement

Most of the time, when we feel something unusual in our body, we start searching on Google with queries such as “what are the symptoms of Sugar/Blood cancer/Jaundice?”. We are all afraid that we might be getting one of those deadly diseases. It would be helpful if we could describe how we are feeling right now and be told which diseases are possible for those conditions. A great deal of disease-related information is available on the internet. In this project we, a group of students, have tried to process that information to train our system. Once the system is trained and adjusted for errors, it is able to find the most probable disease when anyone provides a list of symptoms.

1.4 Objectives

The main objectives of this project are:

• To predict the most probable diseases for given symptoms.

• To develop a doctor assistant system.

1.5 Scope

Although this system has a wide scope, some major areas are:

• Anyone who can access the internet can benefit from the disease prediction system.

• Online doctor appointment: upon integration with a hospital’s website, this system can automatically assign a doctor based on the user's symptoms.

• Front desk assistant in hospitals: doctor appointment and time slot management in a hospital.

• Recommendation system based on the predicted disease, economic condition, etc.

1.6 Applications

The applications of this project can be listed as:

• From anywhere with internet access, one can check a health condition instantly.
• People in remote areas can be made aware of deadly diseases early, based on symptoms. This can someday save someone’s life.
• This system can also be used as a real-world medical assistant.
• This system can help a user to get information about hospitals.

Chapter II

Planning and Analysis

2.1 Planning

In the planning phase, a study of the required data was carried out. The system is data driven, so collecting reliable data and pre-processing it carefully is our main objective. We also need a large amount of data, and we concluded that there is no better alternative than the internet for data collection.

A web-based application has a greater impact than a desktop application. A system like the Disease Prediction system, which needs further enhancement in the future, should be modular and scalable. Taking this into account, we decided to use a JVM-based framework.

2.2 Requirement Analysis

The requirements are collected before the start of the project's development life cycle. The initial requirements are the ones that kick-start the project's development.

2.2.1 Software Requirement Specification

i. Functional Requirement

The main goal of the system is to suggest a disease based on the given symptoms. The application can be used from anywhere with internet access. The system takes natural English phrases separated by commas as input, so the user interface is simple to use.

ii. Non Functional Requirements

It includes features such as:

Availability: The system is web based, so anyone with internet connectivity can access it.

Maintainability: The system can be trained with new diseases from the UI, which makes it highly maintainable.

Portability: The application is JVM based, so any machine running a JVM can operate it. The use of the Bootstrap framework makes it responsive on all sorts of devices such as Android phones, iPhones, etc.

iii. Technical Specifications

• JVM 1.7 or higher

• Apache Tomcat 7.0+

• Spring Framework 4.1.3.RELEASE

• Apache Maven 3.2.5

• Java Web module 3.0+

2.3 Feasibility Study

The following aspects were considered in the feasibility study:

2.3.1 Economical feasibility

Economic analysis is the most frequently used method for evaluating the effectiveness of a proposed system. It is more commonly known as cost-benefit analysis: the procedure of determining the benefits and savings expected from a proposed system and comparing them with its costs. If the benefits outweigh the costs, the decision is made to design and implement the system; otherwise, alterations are made to the proposed system.

The costs associated with the development of computer-based systems are as follows.

i. Procurement costs such as consultation, equipment purchase, installation, furnishing the site, etc.

ii. Start-up costs such as operating system cost, personnel search cost, etc., and project-related costs such as software purchase, personnel training, data collection and document preparation costs.

iii. Ongoing costs such as hardware and software maintenance, rental and depreciation of hardware.

The present system is easy to install and free of cost to use.

2.3.2 Operational feasibility

The user interface is designed to be user friendly. Any user with basic knowledge of the internet can use it. In addition, the application lets the user enter symptoms easily and navigate to other pages for further information about diseases.

2.4 Data Collection

Both primary data and secondary data are used in the project.

a. List of Diseases.

b. List of symptoms

In this phase, we collect disease reports that consist of disease symptoms and the respective diseases. In line with our objective of covering 14 diseases, we search the internet for their symptoms and related medical terms. This material is used as the training data.

The following web resources were used to collect the training data.

Table 2.1 The source for the documents

Web Site | No. of Documents collected for each Disease
http://www.webmd.com/ | 4
http://www.nhs.uk/ | 5
http://www.mayoclinic.org/ | 3
http://www.medicinenet.com/ | 4
http://www.celiac.org | 5

The following diseases and documents are trained in the present system.

Table 2.2 Diseases and its no. of documents

Disease Name | Number of Documents Trained
Asthma | 3
Cholera | 3
Diabetes | 3
Epilepsy | 2
Flu | 2
Jaundice | 3
Malaria | 3
Pneumonia | 3
Rabies | 3
Sinus | 3
Smallpox | 2
Tetanus | 3
Tuberculosis | 3
Typhoid | 2

Chapter III

Methodology

3.1. Literature review

Document classification or document categorization is a problem in library science,

information science and computer science. The task is to assign a document to one or more

classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The

intellectual classification of documents has mostly been the province of library science, while the

algorithmic classification of documents is mainly in information science and computer science.

The problems are overlapping, however, and there is therefore interdisciplinary research on

document classification.

In this section we review past work on disease prediction systems. A few of the disease prediction systems proposed in the past are described below.

A model for the Prediction of Different Dermatological Conditions, built with the aid of Naïve Bayesian classification, was proposed by Manjusha K. K., K. Sankaranarayanan and Seena P. Data were collected through a survey at various tertiary health care centers in the Kottayam and Alappuzha districts of Kerala, with the survey filled out by doctors, and the research was developed on the basis of this survey. The model was capable of calculating the chances of occurrence of the various diseases presented, and the demonstration was carried out for eight diseases. This model could answer complex queries, each with its own strength with respect to ease of model interpretation, access to detailed information and accuracy.

Another model, the Heart Disease Prediction System, developed using Naïve Bayes and Jelinek-Mercer smoothing, was proposed by Ms. Rupali R. Patil. It is implemented as a MATLAB application that can answer user queries, and it can discover and extract hidden knowledge (patterns and relationships) associated with heart disease from a historical heart disease database. The record set with medical attributes was obtained from the Cleveland Heart Disease database. The proposed system used 13 attributes of medical diagnosis. The accuracy of the result using Naïve Bayes was 78%, whereas after applying the smoothing the accuracy rose to 86%.

One of the earliest and most popular expert systems for disease prediction is MYCIN. It was used for the selection of antibiotics for patients with serious infections. Medical decision making, particularly in clinical medicine, is regarded as an "art form" rather than a "scientific discipline"; this knowledge must be systematized for practical day-to-day use and for teaching and learning clinical medicine.

3.1.1 Data Mining Techniques Used For Prediction

Classification is a very important data mining task. Its purpose is to build a classification function or classification model (called a classifier) that maps the data in a database to a specific class. Methods for constructing classifiers include Decision Tree, Naive Bayes, ANN, KNN, Support Vector Machine, Rough Set, Logistic Regression, Genetic Algorithms (GAs) / Evolutionary Programming (EP), Clustering, etc.

i. Decision Tree: A decision tree is a structure that consists of a root node, branches and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes an outcome of the test, and each leaf node holds a class label. The topmost node in the tree is the root node. The decision tree approach is widely used for classification problems. It involves two steps: building the tree and applying the tree to the dataset. Popular decision tree algorithms include CART, ID3, C4.5, CHAID and J48.

ii. Artificial Neural Network (ANN): An ANN is a collection of neuron-like processing units with weighted connections between the units. It maps a set of input data onto a set of appropriate output data. It consists of three layers: an input layer, a hidden layer and an output layer. There are connections between the layers, and a weight is assigned to each connection. The primary function of the neurons of the input layer is to distribute the input x_i to the neurons in the hidden layer. A neuron j of the hidden layer sums the input signals x_i weighted by the weights w_ji of the respective connections from the input layer. Its output is Y_j = f(Σ_i w_ji x_i), where f is a simple threshold (activation) function such as the sigmoid or the hyperbolic tangent function.
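The neuron output Y_j = f(Σ w_ji x_i) described above can be illustrated with a minimal Java sketch. The class name `NeuronSketch`, the weights and the inputs are assumptions chosen for illustration; they are not taken from the project.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch of one hidden-layer neuron: Y_j = f(sum_i w_ji * x_i). */
final class NeuronSketch {

    /** Logistic sigmoid, one common choice for the function f. */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Weighted sum of the inputs followed by the function f. */
    static double output(double[] weights, double[] inputs, DoubleUnaryOperator f) {
        if (weights.length != inputs.length) {
            throw new IllegalArgumentException("weights and inputs must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return f.applyAsDouble(sum);
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.25, 0.1};   // hypothetical weights w_ji
        double[] x = {1.0, 2.0, 0.0};     // hypothetical inputs x_i
        // The weighted sum is 0.5 - 0.5 + 0.0 = 0.0, so the sigmoid output is 0.5.
        System.out.println(output(w, x, NeuronSketch::sigmoid));
        // The same weighted sum passed through the hyperbolic tangent gives 0.0.
        System.out.println(output(w, x, Math::tanh));
    }
}
```

Passing f as a parameter keeps the weighted sum separate from the choice of sigmoid or hyperbolic tangent.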

iii. Naive Bayes: The Naive Bayes classifier is based on Bayes' theorem. The algorithm assumes conditional independence; that is, it assumes that the value of an attribute for a given class is independent of the values of the other attributes.

Bayes' theorem is as follows. Let X = {x1, x2, ..., xn} be a set of n attributes. In Bayesian terms, X is considered as the evidence and H is some hypothesis, meaning that the data X belongs to a specific class C. We have to determine P(H|X), the probability that the hypothesis H holds given the evidence, i.e. the data sample X. According to Bayes' theorem, P(H|X) is expressed as

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (3.1)$$

iv. K-Nearest Neighbor: The k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training data in the feature space. k-NN is a type of instance-based learning and is among the simplest of all machine learning algorithms. However, its accuracy can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance.

v. Logistic Regression: Regression can be defined as measuring and analyzing the relation between one or more independent variables and a dependent variable. Regression falls into two categories: linear regression and logistic regression. Logistic regression is a generalization of linear regression. It is mainly used for estimating binary or multi-class dependent variables. Because the response variable is discrete, it cannot be modeled directly by linear regression; the discrete variable is therefore transformed into a continuous value. Logistic regression is basically used to classify low-dimensional data. It also indicates how much each independent variable changes the dependent variable and ranks the individual variables according to their importance. The main aim of logistic regression is thus to determine the effect of each variable correctly.
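Bayes' theorem in Equation (3.1), combined with the conditional independence assumption of Naive Bayes, gives P(C|X) proportional to P(C) multiplied by the per-attribute probabilities. The following is a minimal sketch for Boolean attributes; the class name `NaiveBayesSketch`, the disease names used as classes and all probabilities are made up for illustration and are not results from this project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch: Naive Bayes posterior over classes for Boolean attributes. */
final class NaiveBayesSketch {

    /**
     * Returns P(class | x) for each class, where x[i] says whether attribute i is present.
     * prior.get(c) is P(c); likelihood.get(c)[i] is P(attribute i present | c).
     */
    static Map<String, Double> posterior(Map<String, Double> prior,
                                         Map<String, double[]> likelihood,
                                         boolean[] x) {
        Map<String, Double> joint = new LinkedHashMap<>();
        double evidence = 0.0;                      // P(X), the denominator of Equation (3.1)
        for (Map.Entry<String, Double> e : prior.entrySet()) {
            double[] p = likelihood.get(e.getKey());
            double score = e.getValue();            // start from the prior P(H)
            for (int i = 0; i < x.length; i++) {
                // Conditional independence: multiply the per-attribute probabilities.
                score *= x[i] ? p[i] : (1.0 - p[i]);
            }
            joint.put(e.getKey(), score);           // P(X | H) * P(H)
            evidence += score;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : joint.entrySet()) {
            result.put(e.getKey(), e.getValue() / evidence);
        }
        return result;
    }

    public static void main(String[] args) {
        // All numbers below are made up for illustration only.
        Map<String, Double> prior = new LinkedHashMap<>();
        prior.put("Flu", 0.5);
        prior.put("Malaria", 0.5);
        Map<String, double[]> likelihood = new LinkedHashMap<>();
        likelihood.put("Flu", new double[] {0.8, 0.2});      // P(cough | Flu), P(chills | Flu)
        likelihood.put("Malaria", new double[] {0.2, 0.8});  // P(cough | Malaria), P(chills | Malaria)
        boolean[] x = {true, false};                          // cough present, chills absent
        System.out.println(posterior(prior, likelihood, x));
    }
}
```

With many attributes the product of probabilities becomes very small, so a practical implementation would usually add logarithms instead of multiplying; the sketch multiplies directly only to stay close to Equation (3.1).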

vi. Rough Sets: A rough set is determined by a lower and an upper bound of a set. Every member of the lower bound is a certain member of the set. Every non-member of the upper bound is a certain non-member of the set. The upper bound of a rough set is the union of the lower bound and the so-called boundary region. A member of the boundary region is possibly (but not certainly) a member of the set. Therefore, rough sets may be viewed as having a three-valued membership function (yes, no, perhaps). Rough sets are a mathematical concept for dealing with uncertainty in data. They are usually combined with other methods such as rule induction or clustering methods.

vii. Support Vector Machine (SVM): Support vector machine (SVM) is an algorithm that

attempts to find a linear separator (hyper-plane) between the data points of two classes in

multidimensional space. SVMs are well suited to dealing with interactions among features and

redundant features.

viii. Genetic Algorithms (GAs)/Evolutionary Programming (EP): Genetic algorithms and

evolutionary programming are used in data mining to formulate hypotheses about dependencies

between variables, in the form of association rules or some other internal formalism.

ix. Clustering: Clustering is the process of grouping similar elements. This technique may be

used as a preprocessing step before feeding the data to the classifying model. The attribute values

need to be normalized before clustering to avoid high value attributes dominating the low value

attributes. Further, classification is performed based on clustering.

3.1.2 Feature Extraction and Selection

Feature extraction and feature selection are both forms of dimensionality reduction: extraction transforms the existing features into a lower-dimensional space, while selection picks a subset of the existing features without a transformation. A few of the methods used for feature extraction and selection are discussed below.

i. Tf-idf: Short for term frequency–inverse document frequency, tf-idf is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
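As an illustration of the weighting just described, the sketch below computes one common tf-idf variant, tf(t, d) × ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term. Several other variants exist; the class name `TfIdfSketch` and the tiny corpus are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of one common tf-idf variant: tf(t, d) * ln(N / df(t)). */
final class TfIdfSketch {

    /** Number of times the term occurs in the tokenized document. */
    static int termFrequency(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** tf-idf of the term in doc relative to the corpus; 0 if the term occurs nowhere. */
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        int docsWithTerm = 0;
        for (List<String> d : corpus) {
            if (d.contains(term)) {
                docsWithTerm++;
            }
        }
        if (docsWithTerm == 0) {
            return 0.0;
        }
        double idf = Math.log((double) corpus.size() / docsWithTerm);
        return termFrequency(term, doc) * idf;
    }

    public static void main(String[] args) {
        // Tiny made-up corpus of already tokenized documents.
        List<String> d1 = Arrays.asList("fever", "cough", "fever");
        List<String> d2 = Arrays.asList("fever", "rash");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        System.out.println(tfIdf("cough", d1, corpus)); // 1 * ln(2 / 1)
        System.out.println(tfIdf("fever", d1, corpus)); // 2 * ln(2 / 2) = 0.0
    }
}
```

In this toy corpus "fever" appears in every document, so its weight is zero even though it is the most frequent term in the first document, which is exactly the offsetting effect described above.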

ii. Information Gain: In decision tree learning, information gain ratio is a ratio of information

gain to the intrinsic information. It is used to reduce a bias towards multi-valued attributes by

taking the number and size of branches into account when choosing an attribute. We want to

determine which attribute in a given set of training feature vectors is most useful for

discriminating between the classes to be learned. Information gain tells us how important a given

attribute of the feature vectors is.
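To make the idea of information gain concrete, the following sketch computes it for a single Boolean attribute in a two-class problem, as the entropy of the class labels before the split minus the weighted entropy after the split. It covers plain information gain only, not the gain ratio. The class name `InformationGainSketch` and the counts are assumptions chosen for illustration.

```java
/** Illustrative sketch: information gain of a Boolean attribute for a two-class problem. */
final class InformationGainSketch {

    /** Shannon entropy (in bits) of a two-class set with the given class counts. */
    static double entropy(int positives, int negatives) {
        int total = positives + negatives;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int count : new int[] {positives, negatives}) {
            if (count > 0) {
                double p = (double) count / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /**
     * Information gain of splitting on a Boolean attribute.
     * posWith/negWith: class counts where the attribute is present;
     * posWithout/negWithout: class counts where it is absent.
     */
    static double informationGain(int posWith, int negWith, int posWithout, int negWithout) {
        int with = posWith + negWith;
        int without = posWithout + negWithout;
        int total = with + without;
        if (total == 0) {
            return 0.0;
        }
        double before = entropy(posWith + posWithout, negWith + negWithout);
        double after = ((double) with / total) * entropy(posWith, negWith)
                     + ((double) without / total) * entropy(posWithout, negWithout);
        return before - after;
    }

    public static void main(String[] args) {
        // Made-up counts: the attribute separates the two classes perfectly (gain of 1 bit).
        System.out.println(informationGain(4, 0, 0, 4));
        // Made-up counts: the attribute carries no information about the class (gain of 0).
        System.out.println(informationGain(2, 2, 2, 2));
    }
}
```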

iii. Maximum Entropy: In his famous 1957 paper, E. T. Jaynes wrote: “Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate.”

It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information. That is to say, when characterizing some unknown events with a statistical model, we should always choose the one that has maximum entropy. Maximum entropy modeling has been successfully applied to Computer Vision, Spatial Physics, Natural Language Processing (NLP) and many other fields.

3.2. System Design

The initial design of a system determines its conceptual overview. System design is the framework for system development. The design steps involved in this project are described in this section.

3.2.1 System Architecture

The conceptual overview of the system is as follows:-

Figure 3.1 System Architecture

3.2.2 Use Case Diagram

The actors involved in this system and the actions they perform are depicted in the diagram below. The admin performs authentication and training of the system. The user only has permission to provide input and get the predicted result.

Figure 3.2 Use case diagram (actors: Admin, User; use cases: Login, Add disease, Train disease, Provide symptom, Prediction result)

3.3. Project Methodology

The methods and theory applied in this project are included in this section.

3.3.1. Document preprocessing

i. Stop word removal

Common terms (words) that are not of interest are removed in this phase. Terms like "is", "the", "have", etc. do not contribute to the result of classification.

Table 3.1 Stop Words Removal

S.N. | Sentence | Stop words
1. | coughing that lasts longer than 2 weeks with green mucus | that, than, 2
2. | People with hypothyroidism may experience aches and pains in muscles and joints | with, may, and, in
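The stop word removal step can be sketched in Java as follows. This is a minimal illustration, not the project's actual code: the class name `StopWordFilter` and the contents of the stop-word list are assumptions, since the report does not give the list that was used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop-word list for illustration; the project's real list is not given.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "is", "the", "have", "that", "than", "with", "may", "and", "in"));

    // Returns the lower-cased words of the sentence that are not stop words.
    static List<String> removeStopWords(String sentence) {
        List<String> kept = new ArrayList<>();
        for (String word : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Keeping the list in a `HashSet` makes each lookup constant time, which matters little for a single sentence but keeps the sketch usable on larger training documents.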

ii. Stemming

A single root word has many derived forms. The common practice is to do all processing on root words, so stemming is used to extract the root word from each derived word.

Table 3.2 Stemming Concept

S.N. | Derived Word | Root Word
1. | Pain | Pain
2. | Paining | Pain
3. | Pained | Pain
4. | Fast | Fast
5. | Faster | Fast
6. | Dizziness | Dizzy
7. | Weakness | Weak
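To make the idea concrete, here is a toy suffix-stripping stemmer in Java that reproduces the rows of Table 3.2. It is only a sketch under simplifying assumptions: the class `ToyStemmer` and its rules are hypothetical, and it is not the `EnglishStemmer` used in the appendix. It over-stems some words (for example, "fever" would lose its "er"), which a full stemming algorithm such as Porter's handles more carefully.

```java
class ToyStemmer {
    // Strips a few common English suffixes; illustrative rules only.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ness") && w.length() > 5) {
            w = w.substring(0, w.length() - 4);
            // "dizzi" -> "dizzy": restore the y that became i before the suffix
            if (w.endsWith("i")) {
                w = w.substring(0, w.length() - 1) + "y";
            }
        } else if (w.endsWith("ing") && w.length() > 5) {
            w = w.substring(0, w.length() - 3);
        } else if (w.endsWith("ed") && w.length() > 4) {
            w = w.substring(0, w.length() - 2);
        } else if (w.endsWith("er") && w.length() > 4) {
            w = w.substring(0, w.length() - 2);
        }
        return w;
    }
}
```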


iii. Tokenization

The input is tokenized (the medical terms are separated) for further processing. The concept of tokenization is shown in the following table.

Table 3.3 Tokenization of input

iv. Chunking

The tokens generated in the previous phases may not be meaningful unless they are combined into chunks (single phrases). The concept of chunking used in this project is shown in the following table.

Table 3.4 Chunking of words

S.N. | Sentence | Tokens | Chunks
1 | coughing that lasts longer than 2 weeks with green mucus | Cough, green, mucus | greenmucuscough
2 | People with hypothyroidism may experience aches and pains in muscles and joints | Aches, pain, muscles, joints | Muscleache, musclepain, jointache, jointpain
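One way to read row 2 of Table 3.4 is that every organ token is paired with every condition token. The Java sketch below illustrates only that reading. The class `Chunker`, its method and the example word lists are assumptions, and it does not reproduce row 1, where three tokens are merged into a single phrase.

```java
import java.util.ArrayList;
import java.util.List;

class Chunker {
    // Pairs each organ found in the tokens with each condition found in the tokens.
    // Tokens are assumed to be already stemmed and lower-cased.
    static List<String> chunk(List<String> tokens, List<String> organs, List<String> conditions) {
        List<String> foundOrgans = new ArrayList<>();
        List<String> foundConditions = new ArrayList<>();
        for (String token : tokens) {
            if (organs.contains(token)) {
                foundOrgans.add(token);
            } else if (conditions.contains(token)) {
                foundConditions.add(token);
            }
        }
        List<String> chunks = new ArrayList<>();
        for (String organ : foundOrgans) {
            for (String condition : foundConditions) {
                chunks.add(organ + condition);
            }
        }
        return chunks;
    }
}
```

For the stemmed tokens ache, pain, muscle, joint, with muscle and joint listed as organs and ache and pain as conditions, this yields muscleache, musclepain, jointache, jointpain.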

3.3.2. Feature extraction

Initially we have a list of diseases and a list of their symptoms as the vocabulary. We use this data to train our application. The training data for each disease class Di is a 0/1 matrix: each column is a term in the vocabulary and each row is the jth training document of class Di. A cell is 1 if the term occurs in the document and 0 otherwise.

More clearly,

C = {Jaundice, Pneumonia, …}.

Say in Jaundice class there are k documents.

S.N. | Input | Tokens
1. | coughing that lasts longer than 2 weeks with green mucus | Cough, green, mucus
2. | People with hypothyroidism may experience aches and pains in muscles and joints | Aches, pain, muscles, joints


The vocabulary has |V| entries; let V = {FEVER, COUGH, HIGH_FEVER, CHEST_PAIN, ...}.

The training matrix for Jaundice is of size k × |V|, and similarly for every other disease in C.

Table 3.5 Training matrix for jaundice

Once sufficient training has been done, the system is ready for testing. In the testing phase, the user gives a textual input describing his/her symptoms. From that text, we extract the vocabulary terms that are present and prepare an input vector Vi as:

Vi = [1, 0, 0, 1, . . .]

It indicates that the user has FEVER, has no COUGH, has no HIGH_FEVER, and has CHEST_PAIN.
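Building the input vector from the extracted terms can be sketched as below. The class and method names are illustrative assumptions; only the vocabulary terms come from the example above.

```java
import java.util.ArrayList;
import java.util.List;

class FeatureVectorBuilder {
    // Returns one 0/1 entry per vocabulary term, in vocabulary order:
    // 1 if the term was extracted from the user's text, 0 otherwise.
    static List<Integer> build(List<String> vocabulary, List<String> extractedTerms) {
        List<Integer> vector = new ArrayList<>();
        for (String term : vocabulary) {
            vector.add(extractedTerms.contains(term) ? 1 : 0);
        }
        return vector;
    }
}
```

With the vocabulary FEVER, COUGH, HIGH_FEVER, CHEST_PAIN and the extracted terms FEVER and CHEST_PAIN, the result is [1, 0, 0, 1]. The order of the extracted terms does not matter, because the vector always follows vocabulary order.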

Now, we calculate P(Jaundice | Vi), and compute the same term for every class in C.

Where,

C = set of diseases

Vi = feature vector of the user input

3.3.3 Naïve Bayes Document Classification

Naïve Bayes classifiers are a family of probabilistic classifiers based on Bayes’ Theorem. The key concept of the prediction is computing the probability P(Di | f1, f2, . . . , f|V|) for all Di in C = {D1, …, D|C|} and predicting the disease with the highest value.

Where,

• Di = ith disease in C

• f1, f2, . . . , f|V| is the feature vector of the user-given document D.

• The size of the feature vector depends on the number of terms in the vocabulary V, i.e. |V|.

Document | FEVER | COUGH | HIGH_FEVER | CHEST_PAIN | . . .
JAND1 | 1 | 0 | 1 | 1 |
JAND2 | 1 | 0 | 0 | 0 |
JAND3 | 1 | 1 | 1 | 1 |


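The probability that a term (feature) ft occurs in document D is assumed to be independent of the presence or absence of the other features; this is the “naïve” assumption of the Naïve Bayes classifier. The user is expected to give input as plain text, from which vocabulary terms are extracted and a feature vector is prepared.

The classification is done as follows, where the normalizing constant P(f1, . . . , f|V|) is dropped because it is the same for every disease:

$$P(C = D_i \mid f_1, f_2, \ldots, f_{|V|}) \propto P(C = D_i) \prod_{t=1}^{|V|} P(f_t \mid C = D_i) \tag{3.2}$$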

3.3.3.1 Bernoulli Model

The presence or absence of a symptom in the input is modelled as a Bernoulli random variable. The likelihood for the Bernoulli classifier is,

$$P(f_1, \ldots, f_{|V|} \mid C) = \prod_{t=1}^{|V|} \left[ f_t \, P(w_t \mid C) + (1 - f_t)\,\bigl(1 - P(w_t \mid C)\bigr) \right] \tag{3.3}$$

Where,

ft is 1 or 0, based on the feature vector.

P(wt | C) = probability of term wt occurring in a document of class C = Di.

1 - P(wt | C) = probability of term wt not occurring in a document of class C = Di.

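Taking the log on both sides of equation 3.3, and noting that each feature value is either 0 or 1 so that only one term inside the bracket is non-zero,

$$\log P(f_1, f_2, \ldots, f_{|V|} \mid C) = \sum_{t=1}^{|V|} \left[ f_t \log P(w_t \mid C) + (1 - f_t) \log\bigl(1 - P(w_t \mid C)\bigr) \right] \tag{3.4}$$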
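As a quick numerical check that equations 3.3 and 3.4 describe the same quantity, the following Java sketch evaluates both forms. The class name and the probability values in the usage note are illustrative assumptions. Here `f[t]` is the 0/1 feature value and `p[t]` stands for the probability of the tth word given the class.

```java
class BernoulliLikelihood {
    // Equation 3.3: product over all vocabulary terms.
    static double likelihood(int[] f, double[] p) {
        double product = 1.0;
        for (int t = 0; t < f.length; t++) {
            product *= f[t] * p[t] + (1 - f[t]) * (1.0 - p[t]);
        }
        return product;
    }

    // Equation 3.4: sum of logs. Assumes every p[t] lies strictly between 0 and 1,
    // which holds once smoothing is applied.
    static double logLikelihood(int[] f, double[] p) {
        double sum = 0.0;
        for (int t = 0; t < f.length; t++) {
            sum += (f[t] == 1) ? Math.log(p[t]) : Math.log(1.0 - p[t]);
        }
        return sum;
    }
}
```

For f = [1, 0, 0, 1] and p = [0.8, 0.4, 0.6, 0.6] the product form gives 0.8 × 0.6 × 0.4 × 0.6 = 0.1152, and exponentiating the log form gives the same value. The log form is preferred in practice because a product of many small probabilities can underflow.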

The conditional probability is calculated as

$$P(w_t \mid C = c) = \frac{N_{ct} + 1}{N_c + 2}$$

Where,

Nct = number of documents of class c containing the tth word,

Nc = total number of documents of class c,

and 2 is added in the denominator since there are two possibilities:

- Either the tth word is present in a document of class c

- Or it is absent
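The smoothed estimate can be computed directly from a class's 0/1 training matrix. The sketch below is illustrative (the class and method names are assumptions) and uses the three Jaundice rows of Table 3.5 in its usage note.

```java
class LaplaceSmoothing {
    // For each vocabulary column t, returns (Nct + 1) / (Nc + 2), where Nct is the
    // number of rows with a 1 in column t and Nc is the number of rows.
    static double[] conditionalProbabilities(int[][] trainingMatrix) {
        int nC = trainingMatrix.length;
        int vocabSize = trainingMatrix[0].length;
        double[] p = new double[vocabSize];
        for (int t = 0; t < vocabSize; t++) {
            int nCt = 0;
            for (int[] doc : trainingMatrix) {
                nCt += doc[t];
            }
            p[t] = (nCt + 1.0) / (nC + 2.0);
        }
        return p;
    }
}
```

For the rows JAND1 = [1, 0, 1, 1], JAND2 = [1, 0, 0, 0] and JAND3 = [1, 1, 1, 1], the counts per column are 3, 1, 2 and 2, so the estimates are 4/5, 2/5, 3/5 and 3/5. No estimate is ever exactly 0 or 1, so the logarithms used during classification stay finite.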


If Nct happens to be zero, the unsmoothed estimate becomes zero and the whole product tends to zero. To prevent such scenarios we use Laplacian smoothing, which is the reason for the added 1 and 2 in the conditional probability above.

The generalized algorithm that we implement in this project is as follows,

Figure 3.4 NB Algorithm (Bernoulli model), training and testing [8]

The training phase has two tasks.

Calculating Prior

Calculating Likelihood


Prior

Prior, P(C = k) = Nk / N

Where,

Nk = number of documents labeled as class C = k, i.e. disease Di in our case.

N = total number of documents.

Likelihood

Likelihood, P(wt | C = k) = nk(wt) / Nk

Where,

nk(wt) = number of documents of class C = k (i.e. Di) containing the word wt. With Laplacian smoothing this becomes (nk(wt) + 1) / (Nk + 2).

The disease with the highest value of P(C | F) is the most likely one.
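Putting the prior, the smoothed likelihood and the final ranking together, a compact Bernoulli Naïve Bayes classifier might look like the Java sketch below. It is an illustration under stated assumptions rather than the project's implementation: the class `MiniBernoulliNB` and its methods are hypothetical, and the Pneumonia rows in the usage note are made-up data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MiniBernoulliNB {
    private final Map<String, Double> logPrior = new LinkedHashMap<>();
    private final Map<String, double[]> condProb = new LinkedHashMap<>();

    // trainingData maps each disease to its 0/1 training matrix (rows = documents).
    void train(Map<String, int[][]> trainingData) {
        int totalDocs = 0;
        for (int[][] docs : trainingData.values()) {
            totalDocs += docs.length;
        }
        for (Map.Entry<String, int[][]> entry : trainingData.entrySet()) {
            int[][] docs = entry.getValue();
            // Prior: Nk / N
            logPrior.put(entry.getKey(), Math.log((double) docs.length / totalDocs));
            // Likelihood with Laplacian smoothing: (nk(wt) + 1) / (Nk + 2)
            double[] p = new double[docs[0].length];
            for (int t = 0; t < p.length; t++) {
                int count = 0;
                for (int[] doc : docs) {
                    count += doc[t];
                }
                p[t] = (count + 1.0) / (docs.length + 2.0);
            }
            condProb.put(entry.getKey(), p);
        }
    }

    // Log of (prior x Bernoulli likelihood) for one disease and one input vector.
    double score(String disease, int[] f) {
        double s = logPrior.get(disease);
        double[] p = condProb.get(disease);
        for (int t = 0; t < f.length; t++) {
            s += (f[t] == 1) ? Math.log(p[t]) : Math.log(1.0 - p[t]);
        }
        return s;
    }

    // Diseases ordered from most to least probable for the given input vector.
    List<String> rank(int[] f) {
        List<String> diseases = new ArrayList<>(condProb.keySet());
        diseases.sort((a, b) -> Double.compare(score(b, f), score(a, f)));
        return diseases;
    }
}
```

Trained on the three Jaundice rows of Table 3.5 plus two hypothetical Pneumonia rows [0, 1, 0, 1] and [0, 1, 0, 0], the input [1, 0, 0, 1] scores 0.6 × 0.1152 = 0.06912 for Jaundice and 0.4 × 0.0234375 = 0.009375 for Pneumonia, so Jaundice is ranked first. Taking the first five entries of the ranking would give a top-five list.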


Chapter IV

Implementation and demonstration

4.1 Technology Used

The tools used in the development process are:

4.1.1 Application layer

HTML/CSS

Bootstrap framework

jQuery

4.1.2 Business Layer

JVM

Spring Framework

Spring MVC

4.1.3 Data Layer

HSQLDB for testing

MySQL for deployment


4.2 Knowledge Base design

The knowledge base consists of training data and vocabulary words. The vocabulary words are kept in four different text files. Each symptom phrase is entered in these text files as follows:-

Figure 4.1 Knowledge Base Design


The final KB design is shown in the following figure. It consists of four separate text files named

1. Organs.txt

2. Condition.txt

3. Level.txt

4. Symptoms.txt

Figure 4.2 A snapshot of the knowledge base.


4.3. Demonstration

In this section we present how our system is implemented. The system was tested on the following configuration:-

• 2 GB RAM

• Core 2 Duo 2.3 GHz processor

4.3.1. Sample input

As a demonstration, training for Tuberculosis was done. The training input is given to the system as shown below:

Figure 4.3 Disease training implementation


4.3.2. Pre-processing

Stemming: The first pre-processing step is to find root words (stemming). Once this is done, vocabulary words are extracted and phrasing is done. The output of this step is as follows:-

Figure 4.4 Stemming result

Phrasing: The stemmed words are merged together to get a meaningful and complete symptom (which is the actual vocabulary we work on).

The result of phrasing/chunking is as follows:-

Figure 4.5 Chunking result for training document01

Figure 4.6 Chunking result for training document02

Figure 4.7 Phrasing result for training document03


4.3.3 Feature vector

The feature vector records, for each symptom phrase, the number of training documents in which that phrase is present. The following figure is the composite result of stemming, phrasing and feature vector preparation:-

Figure 4.8 Final training data for Tuberculosis

4.4. Testing

Performance testing of the system was done and the following results were obtained:-

Table 5.1 Test results for the two sets of symptoms

Symptoms provided | Predicted result
Shaking, moderate chills, headache, nausea, vomiting, high fever, sweating, anemia, coma | Jaundice, Cholera, Malaria, Tuberculosis, Typhoid
Weakness and fatigue, dry cough, loss of appetite, abdominal pain, diarrhea or constipation, rash | Typhoid, Malaria, Jaundice, Cholera, Tuberculosis


Conclusion and Future Enhancement

The Disease Prediction System is developed using Naïve Bayesian classification with the Laplacian smoothing technique. Using the symptoms given by a patient, it estimates the probabilities of 14 diseases. It is an effective model for predicting diseases based on the symptoms given by a user. The user input is parsed and preprocessed through NLP steps such as tokenization and stemming. The system can extract hidden knowledge from the database. The system is expandable in the sense that more records or attributes can be incorporated and new significant rules can be generated using the underlying technique. We can extend this work with other data mining techniques and other medical measurements besides those listed above.

Limitations

• The system returns the five most probable diseases rather than a single diagnosis, so the result cannot be taken as conclusive.

• The system is trained for only 14 diseases.

• The system is highly sensitive to spelling mistakes.

• The user must provide input separated by commas.

Future Enhancements

The Disease Prediction System helps to identify possible diseases, so it can serve as a useful tool for early health care. Most diseases are curable if identified early. Besides disease prediction, this system has the following scope for enhancement:

• Upgrade the NLP engine so that it identifies complex semantics from user input in natural language.

• Dispense medicines based on statistics of predicted diseases in a particular area.

• Recommend a hospital based on predicted result, financial status, location, service, etc.


Appendix

Sample Code

A. Input pre-processing model

package org.aacish.disease_prediction.nlp;

import java.util.ArrayList;
import java.util.List;

import org.aacish.disease_prediction.DAO.VocabDAO;

public class InputTextProcessor implements TextProcessor {

    private VocabDAO vocabDAO = null;
    private List<String> organs;
    private List<String> conditions;
    private List<String> levels;

    public VocabDAO getVocabDAO() {
        return vocabDAO;
    }

    public void setVocabDAO(VocabDAO vocabDAO) {
        this.vocabDAO = vocabDAO;
        organs = this.vocabDAO.getOrgansList();
        conditions = this.vocabDAO.getConditionList();
        levels = this.vocabDAO.getLevelList();
    }

    // Stems every word of every document; phrases stay comma-separated.
    public List<String> stemmer(List<String> stemDocs) {
        ArrayList<String> stemmed = new ArrayList<String>();
        for (String doc : stemDocs) {
            String[] tknByDel;
            String[] tknBySpace;
            String wholeDoc = "";
            tknByDel = doc.split(", +|\\. +");
            for (String tDel : tknByDel) {
                String sentence = "";
                tknBySpace = tDel.split(" +");
                for (String tSp : tknBySpace) {
                    sentence = sentence + EnglishStemmer.stemmer(tSp) + " ";
                }
                sentence = sentence.trim() + ",";
                wholeDoc = wholeDoc + sentence;
            }
            stemmed.add(wholeDoc);
        }
        return stemmed;
    }

    // Reduces each comma-separated phrase of each document to a single keyword.
    public List<String> extractKeyword(List<String> fromDocs) {
        List<String> result = new ArrayList<String>();
        for (String doc : fromDocs) {
            String singleDocOnlyKeywords = "";
            String[] tkByDel = doc.split(",");
            for (String phrase : tkByDel) {
                String keyword = "";
                keyword = extractFromSinglePhrase(phrase);
                singleDocOnlyKeywords = singleDocOnlyKeywords + keyword + " ";
            }
            result.add(singleDocOnlyKeywords);
        }
        System.out.println(result);
        return result;
    }

    // Concatenates the first organ, first condition and first level found in the phrase.
    private String extractFromSinglePhrase(String phrase) {
        String keyword = "";
        String[] tokens = phrase.split(" +");

        outOrg:
        for (String s : tokens) {
            String s1 = s.toLowerCase().trim().replace(",", "");
            for (String o : organs) {
                if (o.equals(s1)) {
                    keyword = keyword + s1;
                    break outOrg;
                }
            }
        }

        outCd:
        for (String s : tokens) {
            String s1 = s.toLowerCase().trim().replace(",", "");
            for (String c : conditions) {
                if (c.equals(s1)) {
                    keyword = keyword + s1;
                    break outCd;
                }
            }
        }

        outSym:
        for (String s : tokens) {
            String s1 = s.toLowerCase().trim().replace(",", "");
            for (String l : levels) {
                if (l.equals(s1)) {
                    keyword = keyword + s1;
                    break outSym;
                }
            }
        }

        return keyword;
    }
}

B. Prepare feature vector

package org.aacish.disease_prediction.classifier;

import java.util.ArrayList;
import java.util.List;

import org.aacish.disease_prediction.DAO.VocabDAO;

public class PrepareInputParameter implements InputParameter {

    private VocabDAO vocabDAO = null;
    private ArrayList<Integer> featureVector;
    String[] tknDoc;

    public VocabDAO getVocabDAO() {
        return vocabDAO;
    }

    public void setVocabDAO(VocabDAO vocabDAO) {
        this.vocabDAO = vocabDAO;
    }

    public PrepareInputParameter() {
    }

    // For each vocabulary symptom, counts the documents that contain it.
    public ArrayList<Integer> prepareFeaturevector(List<String> docs) {
        List<String> symptoms = this.vocabDAO.getSymptomsList();
        featureVector = new ArrayList<Integer>();
        for (String s : symptoms) {
            int tokenInNOofDocs = 0;
            for (String doc : docs) {
                /* Each doc prepares a feature vector. */
                tknDoc = doc.split(" +");
                for (String tkD : tknDoc) {
                    if (s.equals(alphaOnly(tkD))) {
                        tokenInNOofDocs++;
                        break;
                    }
                }
            }
            featureVector.add(tokenInNOofDocs);
        }
        return featureVector;
    }

    // Lower-cases the input and keeps only its letters.
    public String alphaOnly(String ip) {
        String formatted = "";
        ip = ip.toLowerCase();
        for (int i = 0; i < ip.length(); i++) {
            if (Character.isLetter(ip.charAt(i))) {
                formatted += ip.charAt(i);
            }
        }
        return formatted;
    }
}

C. Classification algorithm

package org.aacish.disease_prediction.classifier;

import java.util.List;

public class BernoulliNaiveBayesClassifier implements ClassificationAlgorithm {

    // Returns log(prior) plus the Bernoulli log-likelihood of the input vector.
    public Double classify(Double prior, List<Integer> inputVector, List<Double> conditionalProb) {
        Double score = Math.log(prior);
        for (int i = 0; i < inputVector.size(); i++) {
            if (inputVector.get(i) == 1) {
                score += Math.log(conditionalProb.get(i));
            } else {
                score += Math.log(1 - conditionalProb.get(i));
            }
        }
        return score;
    }
}


Screenshots

Figure 7.a Home page

Figure 7.b Set disease

Figure 7.c Training interface

Figure 13. Disease database snapshot

Figure 14. Database about

Figure 15: Symptoms

Bibliography

1. Apache Maven, http://maven.apache.org/

2. Caruana, R. and Niculescu-Mizil, A.: "An empirical comparison of supervised learning algorithms"

3. Craig Walls, Spring in Action, 2005

4. Hiroshi Shimodaira, Text Classification using Naive Bayes, 11 February 2014.

5. Jaynes, E. T., 1957, “Information Theory and Statistical Mechanics”, Phys. Rev., 106, 620.

6. Jaynes, E. T., 1957, “Information Theory and Statistical Mechanics II”, Phys. Rev., 108, 171.

7. J. Han and M. Kamber, “Data mining: concepts and techniques”, 2nd Ed., The Morgan Kaufmann Series, 2006.

8. Manjusha K. K, K. Sankaranarayanan and Seena P, “Prediction of Different Dermatological Conditions Using Naïve Bayesian Classification,” International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 4, Issue 2, January 2014.

9. McCallum, Andrew; Nigam, Kamal (1998). "A comparison of event models for Naive Bayes text classification".

10. Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.

11. Ms. Rupali R. Patil, “Heart Disease Prediction System using Naive Bayes and Jelinek-Mercer smoothing,” International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, Issue 5, May 2014.

12. Naive Bayes Classifier example, Eric Meisner, November 22, 2003.

13. Rish, Irina (2001). "An empirical study of the naive Bayes classifier".

14. Spring Framework, http://projects.spring.io/spring-framework/

15. The Bernoulli Model, http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

16. Tom Mitchell, Machine Learning (book), Chapter 6.
