Business Intelligence NAME: Tutorial for Rapid Miner...
Transcript of Business Intelligence NAME: Tutorial for Rapid Miner...
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-1
Business Intelligence NAME: _______________
Professor Chen Due Date: _____________
Tutorial for Rapid Miner
(Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)
Tutorial Summary Objective: Richard would like to figure out which customers he could expect to buy the new eReader and on what time
schedule, based on the company’s last release of a high-profile digital reader.
How:
The decision tree has enabled him to predict that and to determine how reliable the predictions are. He has also
been able to determine which attributes are the most predictive of eReader adoption, and to find greater
granularity (by switching Decision Tree’s criteria from “gain_ratio” to “gini_index”
PART I. Decision Trees with Two Datasets
1. Import two csv data set (training and scoring datasets)
a. Be sure change from Semi-colon to Comma (“,”)
2. Training dat aset with an additional attribute (i.e., eReader_Adoption) in comparison with Scoring data set.
3. Add “Set Role” operator
a. Add a “Set Role” operator to both datasets and change from “User_ID” to “id”
b. Add another “Set role” operator to Training dataset and change the attribute of “eReader_Adoption” to
“label” (target attribute)
4. Add “Decision Tree” operator to Training dataset. Run it and obtain the solution of Graphic View (with
three good predictors –nodes with gray oval shapes; and four label attributes – leaves with multicolored end
points).
PART II. MARKET SEGMENTATION by Combining Two DataSets
1. Add a new “Apply Model” operator and combining two datasets
2. Run again.
3. The solution of Graphic View remains the same; however, other results are obtained – tree has been applied
to the scoring data. It means that confidence attributes have been added.
4. From the Meta Data View: with logistic regression, four (4) confidence attributes have been created by
RapidMiner, along with a prediction attribute (i.e., eReader_Adoption) (Fig. 10-a on the Tutorial or Figure
10-9 in the North’s text)
5. Switch to Data View and examine Row. 14:
a. RapidMiner is very (but not 100%) convinced that person 77373 (Row 14, Fig. 10-10) is going to be a
member of the “early majority (88.9%).
b. Despite some uncertainly, RapidMiner is completely sure that this person is not going to be an early
adopter (0%).
* Information is from Data Mining for the Masses by Matthew North (chapter 10)
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-2
CRISP-DM Methodology
• CRISP-DM stands for cross-industry process for data mining. The CRISP-DM methodology provides a
structured approach to planning a data mining project. It is a robust and well-proven methodology.
• We are evangelists of its powerful practicality, its flexibility and its usefulness when using analytics to
solve thorny business issues. It is the golden thread than runs through almost every client engagement.
The CRISP-DM model is shown below.
Six Phases of CRISP-DM Model
We will explore this data mining example (with market segmentation) using CRIPS-DM methodology.
[1] [2]
[3]
[4]
[5]
[6]
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-3
Phase 1. Business Scenario (Business Understanding)
• Richard works for a large online retailer. His company is launching a next-generation eReader soon, and
they want to maximize the effectiveness of their marketing. They have many customers, some of whom
purchased one of the company’s previous generation digital readers. Richard has noticed that certain
types of people were the most anxious to get the previous generation device, while other folks seemed to
content to wait to buy the electronic gadget later. He’s wondering what makes some people motivated to
buy something as soon as it comes out, while others are less driven to have the product.
• Richard’s employer helps to drive the sales of its new eReader by offering specific products and
services for the eReader through its massive web site - for example, eReader owners can use the
company’s web site to buy digital magazines, newspapers, books, music, and so forth.
• The company also sells thousands of other types of media, such as traditional printed books and
electronics of every kind. Richard believes that by mining the customers’ data regarding general
consumer behaviors on the web site, he’ll be able to figure out which customers will buy the new
eReader early, which ones will buy next, and which ones will buy later on.
• He hopes that by predicting when a customer will be ready to buy the next-gen eReader, he’ll be able to
time his target marketing (i.e., market segmentation) to the people most ready to respond to
advertisements and promotions.
Organizational Understanding and Diffusion of Innovation Theory **
Diffusion of Innovation Theory
Definition: The process by which an innovation is communicated through certain channels over time among the
members of a social system.
Four elements:
• Innovation: an idea, practice, or object that is perceived as new by an individual or other unit of adoption.
• Communication channels: the means by which messages get from one individual to another.
• Time: a) innovation-decision process
b) relative time with which an innovation is adopted
c) innovation’s rate of adoption
Social system: a set of interrelated units that are engaged in joint problem solving to accomplish a common goal.
** By Everett Rogers in his book “Diffusion of Innovations.”
The diffusion of innovations according to Rogers. With successive groups of consumers adopting the new technology
(shown in blue), its market share (yellow) will eventually reach the saturation level. In mathematics the S curve is known as
the logistic function.
(Late
Adopters)
Numbers of adopters by group.
Cumulative number of adopters
over time.
Join
when it
is new
Join when
they
perceive a
benefit
Join when there
is a productivity
gain
Join when there
is a plenty of help
and support
Join
when
they have
to
Critical Mass
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-4
Phase 2. Data Understanding
Two Data Sets are listed below:
1. Scoring Data Set
– Online_Retailer_DataSet_Scoring.csv
2. Training Data Set
– Online_Retailer_DataSet_Training.csv
Data file: Online_Retailer_DataSet_Scoring.csv (473 records)
Data file: Online_Retailer_DataSet_Training.csv (661 records)
Q: What is the difference (attributes) between these two data sets?
Four possible values of eReader_Adoption:
[1] Innovator: purchase within the 1st week
[2] Early Adopter: purchase within 2 or 3 weeks
[3] Early Majority: purchase > 3weeks but <= 2 months
[4] Late Majority: purchase after the first two months
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-5
Meta Data on Data Sets
Business Scenario – Summary and Goal
• Richard has a list of customers and their probable adoption timings for the next-gen eReader. These
customers are identifiable by the User_ID that was retained in the results perspective data but not used as a
predictor in the model.
• Goal:
– He wants to ________ these customers and begin a process of target marketing that is timely and
relevant to each individual.
• The criteria and suggestions for segmenting customers are as follows:
[1] Those who are most likely to purchase immediately (predicted innovators) can be contacted and
encouraged to go ahead and buy as soon as the new product comes out. They may even want the option
to pre-order the new device.
On the other hand, perhaps very little marketing is needed to the predicted innovators, since
they are predicted to be the most likely to buy the eReader in the first place.
[2] Those who are probably likely to purchase earlier (often are opinion leaders, predicted early adopter)
should be …
[3] Those who are less likely (predicted early majority) might need some persuasion, perhaps a free
digital book or two with eReader purchase or a discount on digital music playable on the new eReader.
[4] The least likely (predicted late majority), can be marketed to passively, or perhaps not at all if
marketing budgets are tight and those dollars need to be spent incentivizing the most likely customers to
buy.
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-6
Phase 3. Data Preparation - Steps to Achieve the Goal
• Step A: Decision Tree
• Step B: Market Segmentation and HOW? (details please see Phase 4)
Step A: Decision Trees
• Decision trees are excellent predictive models when the target attributes is categorical in nature (e.g.,
eRader_Adoption with five possible values), and when the data set is of mixed types.
• In some cases, decision trees are better than more statistics-based approaches at handling attributes that
have missing or inconsistent values that are not handled as decision trees will work around such data
and still generate usable results.
Decision Trees – with Training and Scoring Data Sets
• Decision trees are made of nodes and leaves (connected by labeled branch arrows), representing the best
predictor attributes in a data set.
• The nodes and leaves lead to confidence percentages in the training data set, and can then be applied to
similarly structured scoring data in order to generate predictions for the scoring observation.
• Decision tress provides us:
a) What is predicted,
b) How confident we can be in the prediction, and
c) How we arrived at the prediction? shown in a graphical view.
1. Import the first data set (i.e., Online_Retailer_DataSet_Training.csv) into RapidMiner repository as shown in
Fig. 1-a thru Fig. 1-g.
a) Import Data Read CSV (since they both are *.csv files)
b) Read CSV operator is created (by double click or Drag and drop).
c) Make sure on Step 2 (of 4) change from Semicolon “;” to Comma “,” on Column Separation box
since the file is with csv format.
d) Change the name of the operator (Read CSV) to “Training” as shown in Fig 1-f and 1-g.
Fig 1-a
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-7
Fig 1-b
Fig 1-c
Fig 1-d
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-8
Fig 1-e
Fig 1-f
Fig 1-g
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-9
2. Repeat the step to import the second data set (i.e., Online_Retailer_DataSet_Scoring.csv) into RapidMiner
repository and the result is shown in Fig. 2
Fig 2
3. RUN the model to examine the data and familiarize with the attributes.
Hint: make sure connect the “out” port to the “res” port on the process area; otherwise, the result can’t
be produced.
Fig 3
4. Data View for the Two Data Sets are illustrated in Fig 4-a and 4-b
Fig 4-a
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-10
Fig 4-b
5. SAVE the two data sets and process into the “Local Repository” as shown in Fig 5-a and Fig 5-b. Why?
a) Return to “Design” perspective
b) Click on “Training” operator
c) Select “File” then “Save Process As”
d) Click “data” under “Local Repository” on the Repository Browser box
d) Enter “Online Retailer Training Data Set”
e) Repeat the same step for entering another data set of “Online Retailer Scoring Data Set”
Note that saving the data or process is under same option of “Save Process As” you then select one of the choice
of “data” or “process”
Fig 5-a
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-11
Fig 5-b
6. Finally SAVE the process into the Local Repository (Fig 6)
a) Return to “Design” perspective
b) Click anywhere on the “Process” area
c) Select “File” then “Save Process As”
d) Select “process” under “Local Repository” on the Repository Browser box
e) Enter “Online Retailer eReader Adoption Project”
Fig 6
7. Our first goal (Step A) is to find a solution using “Decision Tree” as shown in Fig 7-a to Fig 7-c.
a) While there are no missing or apparently inconsistent values in the data set, there is still some data
preparation yet to do. First of all, the User_ID is an arbitrarily assigned value for each customer. Usesr_ID
should not be included in the model as an independent variable.
b) Rather than removing the attribute (User_ID) we will try a new way of handling a non-predictive attribute.
This is accomplished using the Set Role operator. Using the search field in the Operators tab, find and add Set
Role operators to both your training and scoring streams. Be sure to re-connect the ports from Set Role
operators to “res”.
c) In the Parameters area on the right hand side of the screen, set the role of the User_ID attribute to ‘id’. This
will leave the attribute in the data set throughout the model, but it won’t consider the attribute as a predictor
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-12
for the label attribute. Be sure to do this for both the training and scoring data sets, since the User_ID attribute
is found in both of them.
Fig 7-a
Fig 7-b
Fig 7-c
7. cont.
d) One of the nice side-effects of setting an attribute’s role to ‘id’ rather than removing it using a Select
Attributes operator is that it makes each record easier to match back to individual people later, when viewing
predictions in results perspective.
e) Before adding a Decision Tree operator, we still need to do another data preparation step by adding another
Set Role operator. The Decision Tree operator, as with other predictive model operators we’ve used to this
point, expects the training stream to supply a ‘label’ attribute. For this example, we want to predict which
adopter group Richard’s next-gen eReader customers are likely to be in. So our label will be
eReader_Adoption and it should be “reset” in the Set Role operator (Fig 7-d)
f) Next add ‘Decision Tree” operator to your training stream as it is in Figure 7-e
g) Finally, Run the model and switch to the Tree (Decision Tree) tab in results perspective. You will see our
preliminary tree in Figure 8-a.
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-13
Fig 7-d
Fig 7-e
8. Interpretation of the Decision Tree view
• In the Decision Tree view (Figure 8-a) we can see what are referred to as nodes and leaves. The nodes are
the gray oval shapes. They are attributes which serve as good predictors for our label attribute. The leaves
are the multicolored end points that show us the distribution of categories from the label attribute that follow
the branch of the tree to the point of that leaf. We can see in this tree that Website_Activity (on the top of the
decision tree) is our best predictor of whether or not a customer is going to adopt (buy) the company’s new
eReader. If the person’s activity is frequent or regular, we see that they are likely to be an Innovator or
Early Adopter, respectively.
• If however, they seldom use the web site, then whether or not they’ve bought digital books becomes the
next best predictor of their eReader adoption category. If they have not bought digital books through the
web site in the past, Age is another predictive attribute which forms a node, with younger folks adopting
sooner than older ones. This is seen on the branches for the two leaves coming from the Age node in Fig 8-a.
• Those who seldom use the company’s website, have never bought digital books on the site, and are older
than 25 ½ are most likely to land in the Late Majority category, while those with the same profile but are
under 25 ½ are bumped to the Early Majority prediction. In this example you can see how you read the
nodes, leaves and branch labels as you move down through the tree.
Fig 8-a
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-14
8. Cont.
What information is still missed in Fig 8-b? This implies that we need to move to Phase 4.
Fig 8-b
Phase 4. Modeling
Step B of achieving the Goal of Market Segmentation
1. With our predictor attributes prepared, we are now ready to move on to Step 4, Modeling. Save the process if
needed.
2. Return to design perspective. In the Operators tab search for and add an Apply Model operator, bringing the
training and scoring streams together. Ensure that both the lab (labelled attribute) and mod ports are connected
to res ports in order to generate our desired outputs (Figure 9). Furthermore, exa (example) port in Set Role (2)
operator and unl (unlabelled) port in Apply Model should be reconnected.
Fig 9
3. Run the model. You will see familiar results—the tree remains the same, for now. Click on the ExampleSet
tab next to the Tree tab. Our tree has been applied to our scoring data. As was the case with logistic regression,
confidence attributes (shown in Meta Data View) have been created by RapidMiner, along with a prediction
attribute (Figure 10-a and 10-b)
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-15
Fig 10-a
Fig 10-b
Phase 5. Evaluation
1. Switch to Data View using the radio button. We see in Figure 10-b the prediction for each customer’s
adoption group, along with confidence percentages for each prediction. There are four confidence attributes,
corresponding to the four possible values in the label (eReader_Adoption). We interpret these the same way that
we did with the other models though - the percentages add to 100%, and the prediction is whichever category
yielded the highest confidence percentage. RapidMiner is very (but not 100%) convinced that person 77373
(Row 14, Figure 11) is going to be a member of the early majority (88.9%). Despite some uncertainty,
RapidMiner is completely sure that this person is not going to be an early adopter (0%).
• Row no. 14 can be interpreted as: person 77373 is going to be a member of the early majority (88.9%).
Despite some uncertainty, RapidMiner is completely sure that this person is not going to be an early adopter
(0%).
Question: What is the business “implication”?
Answer: _______ ____________ • However, other persons are difficult to categorize accurately which segmentation they belong to? Why?
• How to resolve this issue? See further process after ‘Deployment’.
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-16
Fig 11
Phase 6. Deployment
• Richard’s original desire was to be able to figure out which customers he could expect to buy the new
eReader and on what time schedule, based on the company’s last release of a high-profile digital reader.
• The decision tree has enabled him to predict that and to determine how reliable the predictions are and the
likelihood of buying for each group. He’s also been able to determine which attributes are the most
predictive of eReader adoption.
• In order to produce better results for market segmentation we need to find greater detail, or greater
granularity in the model by using gini_index as the tree’s underlying algorithm.
Revisiting … We are now re-visiting the following phases:
• Phase 4 Modeling
• Phase 5 Evaluation
• Phase 6 Deployment
Phase 4. Modeling – Revisiting
1. Remember that CRISP-DM is cyclical in nature, and that in some modeling techniques, especially those with
less structured data, some back and forth trial-and-error can reveal more interesting patterns in data.
2. Switch back to design perspective, click on the Decision Tree operator, and in the Parameters area, change the
‘criterion’ parameter from ‘gain_ratio’ to ‘gini_index’, as shown in Figure 12.
3. Re-RUN the model.
Fig. 12
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-17
Phase 5. Evaluation – Revisiting
1. We see in this tree (Figure 13-a) that there is much more detail, more granularity in using the Gini algorithm
as our parameter for our decision tree. We could further modify the tree by going back to design view and
changing the minimum number of items to form a node (size for split) or the minimum size for a leaf. Even
accepting the defaults for those parameters though, we can see that the Gini algorithm alone is much more
sensitive than is the Gain Ratio algorithm in identifying nodes and leaves.
2. Take a minute to explore around this new tree model. We will find that it is extensive, and that we will to use
both the Zoom and Mode tools to see it all. We should find that most of our other independent variables
(predictor attributes) are now being used, and the granularity with which Richard can identify each customer’s
likely adoption category is much greater.
3. How active the person is on Richard’s employer’s web site is still the single best predictor, but gender, and
multiple levels of age have now also come into play. We will also find that a single attribute is sometimes used
more than once in a single branch of the tree. Decision trees are a lot of fun to experiment with, and with a
sensitive algorithm like Gini generating them, they can be tremendously interesting as well.
Fig. 13-a
4. Switch to the ExampleSet tab in Data View. We see here (Figure 13-b) that changing our tree’s underlying
algorithm has, in some cases, also changed our confidence in the prediction.
5. Remember in the previous discussion that most of persons other than 77373 (Row.14) are difficult to
categorize accurately which segmentation they belong to since many of them (e.g., those are in ‘Early
Adopter’) are calculated as having at least some percentage chance of landing in any one of the four adopter
categories.
Fig. 13-b
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation; Page-18
• Interpretation: Let’s take the person on Row 1 (ID 56031) as an example.
• Under the Gain Ratio algorithm, we were 41% sure he’d be an early adopter, but almost 32% sure he
might also turn out to be an innovator.
• In other words, we feel confident he’ll buy the eReader early on, but we’re not sure how early.
Gain_Ratio algorithm
Fig. 14-a
• Maybe that matters to Richard, maybe not. However, he will have to decide during the deployment
phase. But perhaps using Gieni_Index, we can help him decide.
Gini_Index algorithm
Fig. 14-b
• In Figure 14-b, this same man is now shown to have a 60% chance of being an early adopter and only a
20% chance of being an innovator. The odds of him becoming part of the late majority crowd under the
Gini model have dropped to zero.
• We know he will adopt (or at least we are predicting with 100% confidence that he will adopt), and that
he will adopt early. Why?
• While he may not be at the top of Richard’s list when deployment rolls around, he’ll probably be higher
than he otherwise would have been under gain_ratio.
• Note that while Gini has changed some of our predictions, it hasn’t affected all of them. Re-check
person ID 77373 briefly. There is no difference in this person’s predictions under either algorithm -
RapidMiner is quite certain in its predictions for this young man.
Phase 6. Deployment – Revisiting
• Richard now has a tree that shows him which attributes matter most in determining the likelihood of
buying for each group.
• New marketing campaigns can use this information to focus more on increasing web site activity level,
or on connecting general electronics that are for sale on the company’s web site with the eReaders and
digital media more specifically.
• These types of cross-categorical promotions can be further honed to appeal to buyers of a specific
gender or in a given age range.
• Richard has much that he can use in this rich data mining output as he works to promote the next-gen
eReader.