OLAP Exercise

Transcript of OLAP Exercise

  • 1. On-Line Analytical Processing (OLAP)
      OLAP is a query-based methodology that supports data analysis in a multidimensional environment. It is a valuable tool for verifying or refuting human-generated hypotheses and for performing manual data mining. An OLAP engine logically structures multidimensional data in the form of a cube like the one shown in Figure 1. The cube displays three dimensions: purchase category, time in months, and region.
      Because an OLAP cube is designed for a specific purpose, it is not unusual to have several cubes structured from the data in a single data warehouse. The design of a data cube includes decisions about which attributes to include in the cube as well as the granularity of each attribute. A well-designed cube is configured to avoid situations where data cells do not contain useful information. For example, a cube with two time dimensions, one for month and a second for fiscal quarter (Q1, Q2, Q3, Q4), is a poor choice because cell combinations such as (January, Q4) or (December, Q1) will always be empty.
      A useful OLAP system interface allows the user to display the data from different perspectives, perform statistical computations and tests, query the data at successively higher and/or lower levels of detail, cross-tabulate the data, and view the data with graphs and charts.
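The point about the month and fiscal quarter dimensions can be checked with a small sketch. It assumes, purely for illustration, that fiscal quarters follow the calendar year (January to March is Q1, and so on); the names `MONTHS`, `QUARTERS` and `quarter_of` are hypothetical and not part of the exercise.

```python
from itertools import product

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]


def quarter_of(month_index):
    """Assumption: the fiscal year matches the calendar year."""
    return QUARTERS[month_index // 3]


# Every (month, quarter) combination the two dimensions would create.
all_cells = list(product(MONTHS, QUARTERS))

# Only combinations where the month actually falls in the quarter can hold data.
reachable = {(m, quarter_of(i)) for i, m in enumerate(MONTHS)}
always_empty = [cell for cell in all_cells if cell not in reachable]

print(len(all_cells), len(reachable), len(always_empty))  # 48 12 36
print(("Jan", "Q4") in always_empty)                      # True
```

Under this assumption 36 of the 48 month and quarter combinations can never hold data, which is why the quarter is better modeled as a higher level of the time hierarchy than as a separate dimension.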
  • 2. Figure 1. A multidimensional cube for credit card purchases.
      [Figure 1 axes: Month (Jan to Dec), Region (One to Four), Category (Retails, Travel, Supermarket, Entertainment). Highlighted cell: Category = Retails, Month = January, Region = Four, Amount = 67090, Count = 120.]
      OLAP Example
      In Figure 1, the cube contains 12 x 4 x 4 = 192 cells. Stored within each cell is the total amount spent within a given category by all credit card customers for a specific month and region. If an average purchase amount is to be computed, the cube will also contain a count representing the total number of purchases for each month, category and region. The arrow in Figure 1 points to the cell holding the total amount and the total number of retail purchases in region four for the month of January.
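As a rough sketch of the cube described for Figure 1, the cells can be modeled as a dictionary keyed by (month, category, region). The dictionary layout and the helper `average_purchase` are illustrative assumptions; only the highlighted cell's amount and count come from the figure, and every other cell is left empty.

```python
from itertools import product

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
CATEGORIES = ["Retails", "Travel", "Supermarket", "Entertainment"]
REGIONS = ["One", "Two", "Three", "Four"]

# One cell per (month, category, region); each stores a total amount and a count.
cube = {key: {"amount": 0.0, "count": 0}
        for key in product(MONTHS, CATEGORIES, REGIONS)}

# The cell the arrow points to in Figure 1.
cube[("Jan", "Retails", "Four")] = {"amount": 67090.0, "count": 120}


def average_purchase(cell):
    """Average purchase amount for a cell, or None if the cell has no purchases."""
    return cell["amount"] / cell["count"] if cell["count"] else None


print(len(cube))                                                     # 192
print(round(average_purchase(cube[("Jan", "Retails", "Four")]), 2))  # 559.08
```

Storing the count next to the amount is what makes the average recoverable after cells are later combined, since averages themselves cannot simply be added.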
  • 3. Each attribute of an OLAP cube may have one or more associated concept hierarchies. A concept hierarchy defines a mapping that allows the attribute to be viewed at varying levels of detail. Figure 1.2 displays a concept hierarchy for the attribute location. As you can see, region holds the highest level of generality within the hierarchy. The second level of the hierarchy tells us that one or more states make up a region. The third and fourth levels show us that one or more cities are contained in a state and one or more addresses are found within a city.
      [Figure 1.2: A concept hierarchy for location. Levels from most to least general: Region, State, City, Street Address.]
      Let's create a scenario where our OLAP cube, together with the concept hierarchy of Figure 1.2, will be of assistance in a decision-making process. Suppose we wish to determine the best situation for offering a luggage and handbag promotion for travel. Our goal is to determine when and where the promotional offering will have its greatest impact on customer response. We can do this by finding those times and locations where relatively large amounts have previously been spent on travel. Once these are determined, we designate the best regions and times for the promotional offering so as to take advantage of the likelihood of ensuing travel purchases.
      Here is a list of common OLAP operations together with a few examples for our travel promotion problem:
      1. The SLICE operation selects data on a single dimension of an OLAP cube. For the cube in Figure 1, a slice operation leaves two of the three dimensions intact, while a selection on the remaining dimension creates a sub-cube from the original cube. Two queries for the slice operator are:
         a. Provide a spreadsheet of month and region information for cells pertaining to travel.
         b. Select all cells where purchase category = retail or supermarket.
      2. The DICE operation extracts a sub-cube from the original cube by performing a select operation on two or more dimensions. Here are three queries requiring one or more dice operations:
         a. Identify the month of peak travel expenditure for each region.
         b. Is there a significant variation in total dollars spent for travel and entertainment by customers in each region?
         c. Which month shows the greatest amount of total dollars spent on travel and entertainment for all regions?
      3. ROLL-UP, or aggregation, is the combining of cells for one of the dimensions defined within a cube. One form of roll-up uses the concept hierarchy associated with a dimension to achieve a higher level of generalization. For our example, this is illustrated in Figure 1.3, where the roll-up is on the time dimension. The cell pointed
  • 4. to in the figure contains region one supermarket data for the months of October, November and December. A second type of roll-up operator actually eliminates an entire dimension. For our example, suppose we choose to eliminate the location dimension. The end result is a spreadsheet of total purchases delineated by month and category type.
      [Figure 1.3: the cube rolled up on the time dimension. Axes: Time (Q1 to Q4), Region (One to Four), Category (Retails, Travel, Supermarket, Entertainment). Highlighted cell label: Category = Supermarket, Month = Jan, Feb, March, Region = One.]
      4. DRILL-DOWN is the reverse of a roll-up and involves examining data at a greater level of detail. Drilling down on region in Figure 1 results in a new cube where each cell highlights a specific category, month and state.
      5. ROTATION, or pivoting, allows us to view the data from a new perspective. For our example, we may find it easier to view the OLAP cube in Figure 1 by having months displayed on the horizontal axis and purchase category on the vertical axis.
      General Considerations
      a. Most useful strategies for analyzing a cube require a sequence of two or more operations.
      b. MS Excel provides an interface that allows us to view OLAP cubes created from data stored in a relational database.
      c. The information contained in the cube can be displayed and manipulated in Excel as a pivot table.
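The SLICE, DICE and ROLL-UP operations described above can be sketched on a small in-memory cube. This is a minimal illustration, not how an OLAP engine is implemented: the amounts are made up, the helpers `select` and `roll_up` are hypothetical names, and fiscal quarters are assumed to follow the calendar year.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Assumption: fiscal quarters follow the calendar year.
QUARTER = {m: f"Q{i // 3 + 1}" for i, m in enumerate(MONTHS)}
AXES = {"month": 0, "category": 1, "region": 2}

# Hypothetical total amounts keyed by (month, category, region).
cube = {
    ("Jan", "Travel", "One"): 100,
    ("Feb", "Travel", "One"): 300,
    ("Jan", "Travel", "Two"): 250,
    ("Feb", "Entertainment", "Two"): 50,
    ("Apr", "Retails", "One"): 400,
}


def select(cells, **allowed):
    """Keep the cells whose coordinates fall within the allowed value sets."""
    return {key: amount for key, amount in cells.items()
            if all(key[AXES[dim]] in values for dim, values in allowed.items())}


def roll_up(cells, coarser_key):
    """Add together all cells that map to the same, coarser key."""
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[coarser_key(key)] += amount
    return dict(totals)


# SLICE: a selection on one dimension (slice query a).
travel = select(cube, category={"Travel"})

# DICE: a selection on two dimensions at once.
jan_leisure = select(cube, month={"Jan"}, category={"Travel", "Entertainment"})

# ROLL-UP along the time hierarchy: month -> quarter.
by_quarter = roll_up(cube, lambda k: (QUARTER[k[0]], k[1], k[2]))

# ROLL-UP that eliminates the location dimension entirely.
by_month_category = roll_up(cube, lambda k: (k[0], k[1]))

# Dice query a: the month of peak travel expenditure for each region.
peak_month = {}
for (month, _category, region), amount in travel.items():
    best = peak_month.get(region)
    if best is None or amount > travel[(best, "Travel", region)]:
        peak_month[region] = month

print(by_quarter[("Q1", "Travel", "One")])  # 400
print(peak_month)                           # {'One': 'Feb', 'Two': 'Jan'}
```

Drill-down is the reverse direction and needs the finer-grained data to be available, so it cannot be computed from rolled-up totals alone.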
  • 5. Excel Pivot Table for Data Analysis
      Creating a Simple Pivot Table
      We start with a simple example that uses the credit card promotion database to show how pivot tables summarize data for the attribute income range.
      1. To begin, load the CreditCardPromotion.xls file into an Excel spreadsheet.
      2. Delete the first and second rows of the spreadsheet data, as they are not relevant to our analysis.
      3. Make sure the cursor is positioned in one of the cells containing data. Proceed to the Data dropdown menu and select PivotTable and PivotChart Report.
      4. Select the Microsoft Excel list or database radio button. This indicates that the data to be analyzed is housed within an Excel spreadsheet. Select the PivotTable option and click Next to continue.
      5. In step 2 of the wizard we are asked for the data range to be used for creating the pivot table. As we initially placed the cursor in a cell containing data, the data range should be correct. Click Next to continue to step 3.
      6. In step 3 we specify the location of the pivot table. Select the New worksheet radio button and click Finish to continue.
      Figure 2: A Pivot Table Template
  • 6. Let us use the toolbar together with the data drop areas to generate a summary report for the attribute income range.
      1. Use your mouse to drag income range from the toolbar into the area specified by Drop Field Here. Next, return to the toolbar and drag income range into the area specified by Drop Data Items Here.
      Figure 3: A summary report for income range
      The report tells us, among other things, that the majority of credit card customers have an income between $30,000 and $40,000.
      Now let's change the output format for the total column (currently a count) to a percent:
      1. Single click on Count of income range.
      2. Single click on the Field Settings square located in the top right portion of the pivot table toolbar. A PivotTable Field box will appear.
      3. Single click on Options >> and examine the options in the Show data as: dropdown menu.
      4. Select % of column and click OK. The data in the total column will now appear as a percent.
      Finally, let's make a pie chart to complement the table output:
      1. Begin by highlighting the percentage scores for the four income range values.
  • 7. 2. Next, single click on the Chart Wizard located in the top left portion of the pivot table toolbar. A bar chart representing the four income range values will appear. However, we wish to have a pie chart showing the values. To accomplish this, single click on the Chart Wizard a second time.
      3. Choose one of the pie chart types and click on Finish.
      Figure 4: A pie chart for income range
      Next we use the pivot table drill-down feature to display the records of those individuals in a particular salary range:
      1. Click on Sheet4 in the bottom tray to display the pivot table.
      2. Double click on the cell containing the percent for the desired salary range (20-30K). All instances within the chosen salary range will appear in a new spreadsheet.
      3. To return to the pivot table, click on Sheet4.
      This completes our first example.
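The summary report, the % of column setting and the drill-down in this first example boil down to three small computations. The sketch below uses a handful of made-up records rather than the actual CreditCardPromotion.xls data; the field names and the income range labels other than 20-30K are assumptions.

```python
from collections import Counter

# Hypothetical stand-in for the spreadsheet rows.
records = [
    {"id": 1, "income_range": "40-50K"},
    {"id": 2, "income_range": "30-40K"},
    {"id": 3, "income_range": "30-40K"},
    {"id": 4, "income_range": "20-30K"},
    {"id": 5, "income_range": "50-60K"},
]

# "Count of income range": one row per distinct value.
counts = Counter(r["income_range"] for r in records)

# "% of column": each count as a share of the column total.
total = sum(counts.values())
percent = {value: 100 * n / total for value, n in counts.items()}

# Drill-down: the underlying records behind one summary cell.
drill_down = [r for r in records if r["income_range"] == "20-30K"]

print(counts["30-40K"], percent["30-40K"])  # 2 40.0
print([r["id"] for r in drill_down])        # [4]
```

The percentages are what the pie chart plots, and the drill-down list corresponds to the new sheet Excel opens when a summary cell is double-clicked.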
  • 8. Pivot Table for Hypothesis Testing
      The ACME Credit Card Company has decided to solicit by telephone select cardholders who received their credit cards within the last year and who did not purchase credit card insurance with their initial mail-in application. Their data analyst believes that there is a relationship between a cardholder's age and whether the cardholder has credit card insurance. Specifically, the analyst wishes to test the hypothesis that younger cardholders purchase credit card insurance whereas more senior cardholders do not. If the hypothesis is true, only those cardholders under a certain age will be selected for the telephone solicitation.
      To test the hypothesis we will use a pivot table and our imagination, and assume that the credit card promotion database contains a much larger sampling of cardholders. The following steps test the hypothesis claiming a relationship between age and credit card insurance status:
      1. To begin, make sure the cursor is positioned in one of the cells of Sheet1 that contains data. Proceed to the Data dropdown menu, select PivotTable and PivotChart Report, and select Finish.
      2. Move age to the area labeled Drop Row Fields Here. Move credit card insurance to the area labeled Drop Column Fields Here.
      3. Move credit card insurance to the area labeled Drop Data Items Here. The resultant pivot table is given in Figure 5.
      Figure 5: A pivot table showing age and credit card insurance choice
      The pivot table is informative in that it tells us that very few individuals currently have credit card insurance. However, the distribution of ages is such that it is difficult to draw
  • 9. any conclusions about a relationship between age and credit card insurance. We can use the group function to develop a clearer picture of any possible relationship between the two attributes. The method is as follows:
      1. Single click on the age attribute within the pivot table.
      2. Single click on the Data dropdown menu.
      3. Mouse to Group and Outline and then to Group. Single click on Group. A grouping box that allows you to select a Starting at, Ending at, and By value will appear.
      4. Click OK to select the default values.
      Figure 6: Grouping the credit card promotion data by age
      The new pivot table is displayed in Figure 6. Although our data set is too small to draw valid conclusions, grouping the data by age gives us a clearer picture of the relationship between the two attributes.
      A second method for determining whether a relationship between age and credit card insurance exists computes the average age of those individuals with and without credit card insurance. Instead of starting with the original credit card promotion database, we'll modify the current pivot table by invoking the PivotTable Wizard from the toolbar as follows:
      1. Locate the PivotTable Wizard in the top row of the toolbar.
      2. Single click on the wizard. This action invokes the step 3 display of the PivotTable Wizard.
      3. Locate and left click on Layout. The current pivot table layout is displayed within the PivotTable Wizard. Figure 7 shows the current layout.
      4. Use your mouse to remove the attribute age from the Row area and drag it to the age button located on the far right of the layout display window. Next, drag credit card insurance from the Column area to the Row area.
      5. Remove Count of Credit Card Insurance from the data area and place age in the data area.
  • 10. 6. Double click on Sum of Age within the data area. A PivotTable Field box will appear.
      7. Single click on Average within the Summarize by: box. Click on OK. This returns you to the PivotTable Layout Wizard.
      8. Click on OK from within the PivotTable Layout Wizard. Finally, click on Finish within the step 3 display of the PivotTable Wizard.
      Figure 7: PivotTable Layout Wizard
      The resultant pivot table shows that the average age for credit card insurance = no is approximately 41.42, whereas the average age for credit card insurance = yes is approximately 32.33.
      Figure 8: Age Summary
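Both ways of testing the age hypothesis, grouping ages into ranges and comparing average ages, can be sketched in a few lines. The records below are invented for illustration, so the averages will not match the 41.42 and 32.33 reported for the real data; the grouping start of 20 and width of 10 are likewise assumptions rather than Excel's actual defaults for this data set.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (age, credit card insurance) pairs.
records = [(25, "Yes"), (29, "Yes"), (38, "Yes"),
           (35, "No"), (43, "No"), (55, "No"), (41, "No")]


def age_group(age, start=20, width=10):
    """Label the age range an age falls into, e.g. 25 -> '20-29'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


# Method 1: cross-tabulate grouped age against insurance status.
crosstab = Counter((age_group(age), insured) for age, insured in records)

# Method 2: average age for each insurance status.
ages_by_status = defaultdict(list)
for age, insured in records:
    ages_by_status[insured].append(age)
average_age = {status: mean(ages) for status, ages in ages_by_status.items()}

print(crosstab[("20-29", "Yes")], crosstab[("40-49", "No")])  # 2 2
print(average_age["No"])                                      # 43.5
```

Neither summary is a statistical test by itself; as the exercise notes, a data set this small cannot support a valid conclusion either way.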
  • 11. Creating a Multidimensional Pivot Table
      For this example, we will use a pivot table to investigate relationships between the magazine, watch and life insurance promotions relative to customer gender and income range. We will do this by creating a three-dimensional cube like the one shown in Figure 9. Each cell of the cube contains a count of the number of customers who either did or did not take part in the promotional offerings.
      Figure 9: A credit card promotion cube
      [Figure 9 axes: Watch Promo (Yes, No), Life Insurance Promo (Yes, No), Magazine Promo (Yes, No). Highlighted cell: Watch Promo = No, Life Insurance Promo = Yes, Magazine Promo = Yes.]
      The arrow in Figure 9 points to the cell holding the total number of customers who took advantage of the life insurance promotion and the magazine promotion, but who did not take advantage of the watch promotion. We include sex and income range in our analysis by designating these attributes as page variables. Here's the procedure:
      1. To begin, make sure the cursor is positioned in one of the cells of Sheet1 that contains data. Proceed to the Data dropdown menu and select PivotTable and PivotChart Report and then Finish.
      2. Use the mouse to drag watch promotion and life insurance promotion to the area labeled Drop Row Fields Here. Drag magazine promotion to the area labeled Drop Column Fields Here.
      3. Drag life insurance promotion, watch promotion and magazine promotion to the area labeled Drop Data Items Here.
      4. Finally, drag sex and income range to the area labeled Drop Page Fields Here.
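The promotion cube of Figure 9 amounts to counting customers by their three yes/no answers. The following sketch builds such a cube from a few invented customer records; the field names and values are hypothetical and do not reproduce the credit card promotion database.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"sex": "Female", "income_range": "20-30K", "watch": "No",  "life": "Yes", "magazine": "Yes"},
    {"sex": "Male",   "income_range": "30-40K", "watch": "No",  "life": "Yes", "magazine": "Yes"},
    {"sex": "Female", "income_range": "30-40K", "watch": "Yes", "life": "Yes", "magazine": "Yes"},
    {"sex": "Male",   "income_range": "40-50K", "watch": "No",  "life": "No",  "magazine": "No"},
    {"sex": "Female", "income_range": "20-30K", "watch": "No",  "life": "No",  "magazine": "Yes"},
]


def cell_of(customer):
    """Cube coordinates of a customer: (watch, life insurance, magazine)."""
    return (customer["watch"], customer["life"], customer["magazine"])


# Each of the 2 x 2 x 2 = 8 possible cells holds a customer count.
cube = Counter(cell_of(c) for c in customers)

# The cell the arrow points to in Figure 9.
target = ("No", "Yes", "Yes")
print(cube[target])  # 2

# Drill-down: the individual records counted in that cell.
members = [c for c in customers if cell_of(c) == target]
print([c["sex"] for c in members])  # ['Female', 'Male']
```

Sex and income range stay on each record but are not part of the cell key, which mirrors their role as page variables rather than cube dimensions.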
  • 12. The resultant pivot table appears in Figure 10. The 24 highlighted cells correspond to the cells of the cube shown in Figure 9: the cube has 2 x 2 x 2 = 8 cells, and each appears once for each of the three count fields. In addition to the 24 cells representing the cube, the pivot table also shows total yes and no counts for each promotion together with a summary total. Let's use the pivot table to help us determine relationships among the three promotions.
      Figure 10: A pivot table with page variables for credit card promotions
      First we'll use the table to find the customer count for the cell designated in Figure 10:
      1. Find the area to the far left within the pivot table that shows life insurance promotion = yes. This is given in Figure 10 by rows 15 through 20.
      2. Within this same area, locate the sub-region that has watch promotion = no.
      3. Finally, follow this sub-region to the right until you reach the column for magazine promotion = yes.
      The contents of the cell show a 2 for all three promotions. This tells us that a total of two customers took advantage of the life insurance and magazine promotions but did not purchase the watch promotion. We can drill down to examine the individual records represented by the cell. Simply double click on any of the cells containing the value 2. By default, the records will be displayed in Sheet5.
      Next, let's look at the paging feature. In the upper left corner of the pivot table, you will see the paging variables sex and income range specified with the table definition. We can
  • 13. use the page feature to answer questions about the relationship between the attributes given as page variables and the promotional offerings. For example, let's say we wish to examine the relationship between income range and promotional offerings for female customers. The procedure is as follows:
      1. Single click on the dropdown menu for sex, highlight female, and click OK.
      2. Single click on the dropdown menu for income range, highlight the $20,000 to $30,000 range, and click OK.
      The pivot table displays the promotional summary data for females making between $20,000 and $30,000. The table shows two female customers within the specified income range. Neither female took advantage of the watch or magazine promotions, but one female did purchase the life insurance promotional offering. By examining the remaining income range data, you will see that females with an annual salary between $30,000 and $40,000 have traditionally been the best candidates for promotional offerings. The paging feature thus adds further dimensions to the analysis capabilities of an Excel pivot table.
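The paging feature can be thought of as a filter applied to the records before the promotion counts are computed. The sketch below illustrates that reading with invented records, arranged so that the female, 20-30K page happens to mirror the outcome described above; the helpers `page` and `promo_counts` and all field names are hypothetical.

```python
from collections import Counter

PROMOS = ("watch", "life", "magazine")

# Hypothetical customer records.
records = [
    {"sex": "Female", "income_range": "20-30K", "watch": "No",  "life": "Yes", "magazine": "No"},
    {"sex": "Female", "income_range": "20-30K", "watch": "No",  "life": "No",  "magazine": "No"},
    {"sex": "Female", "income_range": "30-40K", "watch": "Yes", "life": "Yes", "magazine": "Yes"},
    {"sex": "Male",   "income_range": "20-30K", "watch": "Yes", "life": "No",  "magazine": "Yes"},
    {"sex": "Male",   "income_range": "40-50K", "watch": "No",  "life": "No",  "magazine": "Yes"},
]


def page(rows, **filters):
    """Keep only the rows matching every page variable setting."""
    return [r for r in rows if all(r[f] == v for f, v in filters.items())]


def promo_counts(rows):
    """Yes/No counts for each promotion over the given rows."""
    return {promo: Counter(r[promo] for r in rows) for promo in PROMOS}


selected = page(records, sex="Female", income_range="20-30K")
summary = promo_counts(selected)

print(len(selected))            # 2
print(summary["life"]["Yes"])   # 1
print(summary["watch"]["Yes"])  # 0
```

Changing the keyword arguments passed to `page` plays the same role as picking a different value from a page dropdown in the pivot table.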