In this presentation, you will be introduced to data mining and the...

In this presentation, you will be introduced to data mining and its relationship

with Meaningful Use.

Data mining refers to the art and science of intelligent data analysis. It is the

application of machine learning algorithms to large data sets with the primary

aim of discovering meaningful insights and knowledge from that data.

Data mining essentially is the construction of data models that instantiate a

machine learning algorithm on specific data elements. The model captures the

essence of the discovered knowledge and helps us in our understanding of the

world. Often, these models are predictive. For instance, data mining

models have been applied to healthcare data to predict readmissions, risk of

disease, and efficacy of medications.

Modeling is the process of turning all that data into some structured form or

model that reflects the supplied data in a useful way. The aim of modeling is to

explore the data to address a specific problem by modeling or mimicking the

real world. For instance, a lot of research has been done in modeling the way in

which we make decisions. Machine learning algorithms that use artificial

intelligence develop models that closely represent how a human would make a

decision. The same methods can be applied to healthcare data where we

attempt to model decision making. For instance, we might want to develop a

model to predict drug relapse in patients with a history of drug addiction. The

machine learning algorithms, using artificial intelligence, would look at all of the

data elements to come up with a decision on the likelihood of whether a

patient will relapse. Unfortunately, no model can perfectly represent the world.

For instance, we might find that our model predicts a patient will relapse even

if the patient does not have a history of drug addiction. In the real world, we

would never make this mistake, but due to the rules governing the machine

learning algorithm, such mistakes are possible.

To ensure that the model is constructed in such a way to limit such mistakes

and represent the real world as closely as possible, there is a set of eight steps

that can be followed. First, you must have a clear understanding of the data

and the business of healthcare. If you do not know what the data mean, it is

likely that your model will not make sense. Second, you must partition your

data into training, validation, and testing datasets when building, tuning, and

evaluating your model. This way, three different sets of data are used to

validate your model. Third, build multiple models and compare their

performance. You may find that you favor one model, such as a neural

network, but that model may not be the most effective. Therefore, comparing

the performance of multiple models will yield the most effective end product.

Fourth, if you end up developing a perfect model, something went wrong.

Healthcare data is messy and complex. It’s unlikely that you will develop a

model that makes perfect decisions. The laws of probability suggest instead

that your model will at times make mistakes. Fifth, don’t overlook how the

model is to be deployed. Some of the algorithms are very difficult to employ.

For instance, neural networks are a black box and difficult to automate into a

system. However, rule-based algorithms such as decision trees are very simple

to deploy. Sixth, the models you construct should be repeatable and

efficient. That is, if you were to take a different set of data and apply your

model, you should get similar results. Also, your model shouldn’t take 3 days

to run. It should be almost instantaneous; otherwise, it’s unlikely that it can be

implemented in a healthcare setting where everything is fast-paced. Seventh,

let the data talk to you, but do not let it mislead you. If the results of

your analysis seem doubtful, you should question them. Don’t assume that

the results are the truth. Test it, test it again, and again. Lastly, after you

have constructed your model and tested it, communicate your discoveries effectively

and visually.
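
To make the third and fourth steps concrete, here is a minimal sketch in Python. It scores several candidate models on the same validation data, picks the best performer, and flags any model with a perfect score for a closer look. The function names (accuracy, compare), the three rule-based models, and the blood pressure readings are all hypothetical and made up purely for illustration.

```python
def accuracy(model, dataset):
    """Fraction of observations where the model's prediction matches the label."""
    correct = sum(1 for features, label in dataset if model(features) == label)
    return correct / len(dataset)


def compare(models, validation):
    """Score every candidate on the same validation data and pick the best."""
    scores = {name: accuracy(model, validation) for name, model in models.items()}
    # A perfect score on messy healthcare data deserves a second look.
    suspicious = [name for name, score in scores.items() if score == 1.0]
    best = max(scores, key=scores.get)
    return scores, best, suspicious


# Hypothetical validation data: (systolic, diastolic) -> hypertension diagnosis
validation = [
    ((118, 76), False),
    ((152, 95), True),
    ((135, 88), False),
    ((160, 102), True),
    ((142, 85), True),
    ((128, 92), True),
    ((138, 89), True),
]

# Hypothetical candidate models, each written as a simple rule
models = {
    "systolic_rule": lambda f: f[0] >= 140,
    "diastolic_rule": lambda f: f[1] >= 90,
    "either_rule": lambda f: f[0] >= 140 or f[1] >= 90,
}

scores, best, suspicious = compare(models, validation)
print(scores)
print("best:", best, "| perfect scores to investigate:", suspicious)
```

In practice the candidates would be trained models rather than hand-written rules, but the comparison logic stays the same: every model is judged on identical data that it did not see during training.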

There are many tools available for data mining and constructing models. One

of the most popular tools is SAS Enterprise Miner. The platform is

powerful and relatively easy to use. Weka is an open-source platform that

supports the development of a variety of different algorithms. Rattle is a

package available in the open-source analytics environment R and is also very

powerful and diverse. Rattle also supports the Predictive Model Markup

Language (PMML) for deploying data mining models. There are many other

applications available.

Data mining has some terminology that should be understood. A dataset is a

collection of data. Often, a dataset will have multiple columns and many

rows. In mathematical terms, this is referred to as a matrix while in database

terms this would be referred to as a table. The observations make up the rows

of data while the variables make up the columns. The dimension of a dataset

is the number of observations, or rows, by the number of variables, or

columns.

Input variables include the measured data items. These can take on many

different forms, such as text or numbers, and may be measured on a nominal,

ordinal, interval, or ratio scale. Other names for input variables include

predictors, covariates, independent variables, observed variables, or

descriptive variables. Examples would be

systolic blood pressure, diastolic blood pressure, medications, weight, age,

gender, and so on.

Output variables are those that are influenced by the input variables. They are

also known as target, response, or dependent variables. An example might be

a diagnosis of hypertension.

We build models to predict the output variables in terms of the input variables.

So if we were given data that includes systolic blood pressure, diastolic blood

pressure, medications, weight, age, and gender, we could use that data as

inputs for predicting the output of a diagnosis of hypertension.
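
As a rough illustration of these terms, the sketch below lays out a tiny dataset in Python, reports its dimension, and separates the input variables from the output variable. The column names and the four observations are hypothetical values invented for this example, not real patient data.

```python
# Hypothetical dataset: each inner list is one observation (row),
# and each position corresponds to one variable (column).
columns = ["systolic_bp", "diastolic_bp", "weight_kg", "age", "gender", "hypertension"]
rows = [
    [118, 76, 70.5, 34, "F", "no"],
    [152, 95, 88.0, 61, "M", "yes"],
    [135, 88, 79.2, 47, "F", "no"],
    [160, 102, 95.3, 58, "M", "yes"],
]

# Dimension = number of observations (rows) by number of variables (columns).
dimension = (len(rows), len(columns))

# The output (target) variable is what we want to predict;
# every other column is treated as an input variable here.
target = "hypertension"
input_names = [name for name in columns if name != target]

target_index = columns.index(target)
inputs = [[row[columns.index(name)] for name in input_names] for row in rows]
outputs = [row[target_index] for row in rows]

print("dimension:", dimension)
print("input variables:", input_names)
print("output values:", outputs)
```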

There’s one caveat. Some data mining models may not have any output

variables. These are referred to as descriptive models and an example is

clustering. We will get to these in a moment.

Identifiers are unique variables for a particular observation. They may include

a patient’s name or a patient ID. Categorical variables are ones that take on

one of a limited set of discrete values. They can be nominal, where there is no

order to them (for example, eye color), or ordinal, where there is a natural order

(for example, age groups). Numeric variables, also known as continuous variables,

are values that are integers or real numbers (for example, weight).

There are three datasets that are used when constructing a model: training,

validation, and testing datasets. The training dataset is the data that you use to

build the initial models. The validation dataset is used to assess the

performance of the model that you develop using the training dataset. This step

helps fine-tune the model as appropriate. The testing dataset is then used to

apply the refined model and assess its expected performance on future datasets.

When developing a data mining model, you start with one large dataset and

partition that into training, validation, and testing datasets. The partitioning is

done by randomly assigning each observation to one of the three datasets. The

training set typically has more data than the other datasets. For instance, if we

take a large dataset we can partition our three datasets as follows: 70% of the

observations go to the training dataset, 15% to the validation, and 15% to the

testing dataset.
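
The 70/15/15 partition described here could be sketched in Python as follows. This is only one simple way to do it, using the standard library; the function name partition, the fixed seed, and the 1,000 placeholder record IDs are assumptions made for the example.

```python
import random


def partition(observations, train=0.70, validation=0.15, seed=42):
    """Randomly assign observations to training, validation, and testing sets."""
    shuffled = list(observations)           # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validating = shuffled[n_train:n_train + n_valid]
    testing = shuffled[n_train + n_valid:]  # whatever remains, about 15%
    return training, validating, testing


# Hypothetical dataset: 1,000 placeholder record IDs standing in for observations.
records = list(range(1000))
training, validating, testing = partition(records)
print(len(training), len(validating), len(testing))
```

Fixing the random seed is a design choice that supports repeatability: rerunning the analysis produces the same three datasets, so differences in results come from the model and not from the split.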

The data mining process that is widely accepted is known as CRISP-DM, or

Cross-Industry Standard Process for Data Mining. The process includes six

steps from understanding the business all the way to deploying a model.

The slide on your screen shows a description of the six steps. The first step

emphasizes the business understanding for planning your data mining project

so that it aligns with the organization’s goals. The second is data understanding

so that you can assess the quality of the data and define each data element.

Data preparation is next where you select the relevant data, clean the data up,

carry out basic descriptive statistics, and reformat the data as necessary.

Modeling is next where you construct a data model or several models.

Evaluation is the step where you evaluate the performance of each of the

models constructed and choose the best performing model. Last is

deployment where you determine how you will deploy your model and present

the findings to the necessary parties.

The CRISP-DM process relates very well to specific data mining tasks. For

instance, business understanding relates to developing questions about the

data and data selection. The data understanding step is where we explore the

data. The data preparation step is where the data is transformed. The

modeling step is where we choose and build a model. The evaluation step is

where we validate and test our model. Finally, deployment is where we export

the model.

When building a model, there are two main categories. The first is descriptive

models, also known as unsupervised learning. These are models that are

constructed when we do not have a target variable. They provide a

representation of the knowledge discovered without necessarily modeling a

specific outcome. An example of a descriptive model is a clustering analysis.

Predictive models, or supervised learning, are those that can be developed

when we have a target variable. We can predict the target variable with our

given set of input variables. The goal of a predictive model is to extract

knowledge from historic data and represent it in such a form that we can apply

the resulting model to new situations. In that way, we are predicting the

occurrence of an event of interest. The historic data will already be associated

with the outcome, and we can learn to make this association on future data.

Common predictive algorithms include decision trees, boosting, and neural

networks.
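
The difference between the two categories can be sketched with two very small Python functions. The first, fit_stump, is a predictive model: a one-level decision tree that uses the target variable to learn a cut-off on a single input. The second, two_means, is a descriptive model: a simplified one-dimensional clustering that groups the same values without ever seeing the target. Both function names and the blood pressure readings are hypothetical, and real projects would normally rely on an established tool for this.

```python
def fit_stump(values, labels):
    """Predictive: learn the threshold on one input that best matches the target."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(values)):
        # Predict True when the value is at or above the candidate threshold.
        correct = sum((v >= threshold) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def two_means(values, iterations=10):
    """Descriptive: split values into two clusters; no target variable is used.

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return sorted(near_low), sorted(near_high)


# Hypothetical historic data: systolic readings with a known diagnosis.
systolic = [110, 118, 122, 125, 148, 152, 160, 165]
hypertension = [False, False, False, False, True, True, True, True]

threshold = fit_stump(systolic, hypertension)  # supervised: uses the labels
clusters = two_means(systolic)                 # unsupervised: ignores the labels

print("learned threshold:", threshold)
print("clusters:", clusters)


def predict(reading):
    """Apply the learned threshold to a new, unseen reading."""
    return reading >= threshold
```

The predict function shows the point of a predictive model: once the threshold is learned from historic data, it can be applied to new situations. The clusters, by contrast, only describe structure already present in the data.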

If the model is found effective and ready for use in real time, the next step is

deployment. One method to deploy models is through the use of a language

called Predictive Model Markup Language (PMML). It is an XML-based

standard that is supported by many major commercial data mining vendors

and many open-source data mining tools.
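
To give a feel for what an XML-based model file looks like, the sketch below uses Python's standard library to write only the outer skeleton of a PMML document: a header and a data dictionary describing three fields. It is an assumption-laden illustration, not an export from any real tool. The version number and namespace, the field names, and the application name "example-exporter" are all chosen for the example, and the element and attribute names should be checked against the PMML specification before use. A deployable file would also contain a model element, which is omitted here.

```python
import xml.etree.ElementTree as ET

# Assumption: PMML 4.4 and its namespace; adjust to the version your tools support.
PMML_NAMESPACE = "http://www.dmg.org/PMML-4_4"

root = ET.Element("PMML", {"version": "4.4", "xmlns": PMML_NAMESPACE})

# Header: free-text description plus the (hypothetical) application that wrote the file.
header = ET.SubElement(root, "Header", {"description": "Hypothetical hypertension model"})
ET.SubElement(header, "Application", {"name": "example-exporter"})

# Data dictionary: one entry per variable, with its measurement type and data type.
fields = [
    ("systolic_bp", "continuous", "double"),
    ("age", "continuous", "integer"),
    ("hypertension", "categorical", "string"),
]
dictionary = ET.SubElement(root, "DataDictionary", {"numberOfFields": str(len(fields))})
for name, optype, data_type in fields:
    ET.SubElement(dictionary, "DataField",
                  {"name": name, "optype": optype, "dataType": data_type})

# The model element itself (for example, a decision tree) would be added here.
pmml_text = ET.tostring(root, encoding="unicode")
print(pmml_text)
```

Because the result is plain XML, a system that understands PMML can read the model description without needing the software that originally built it, which is what makes this kind of deployment comparatively simple.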

Descriptive and/or predictive models can be used on specific datasets.

Different models and algorithms have advantages and disadvantages.

Therefore, it is recommended to construct multiple models and choose the

best. Deployment of a successful model can be simple using PMML.

When considering the role of Health IT and Meaningful Use and the

implications for data mining, the use of data mining techniques can have great

potential for the development of clinical decision support systems and

outbreak detection to foster better patient outcomes. Also, as the government

invests more into health IT, the adoption of data mining approaches will

become more of a priority. New ways of analyzing and interpreting the data will

be sought after and it is anticipated that data mining will be center stage.
