LAB 1 INSTRUCTIONS - ualberta.ca

LAB 1 INSTRUCTIONS

DESCRIBING AND DISPLAYING DATA

This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to obtain frequency and contingency tables for categorical data and display the data with bar charts and pie charts. You will also learn how to obtain the appropriate measures of center and spread for quantitative data and display the data with histograms and boxplots. Finally, you will study how to display data over time with a time plot. This document should be used as a reference in your work on the Lab 1 assignment.

1. Summarizing and Displaying Categorical Data

Categorical variables such as gender (possible values: male, female) or marital status (possible values: never married, married, divorced) can be summarized by providing the counts (frequencies) or proportions (relative frequencies) of observations falling into each category (distinct value of the categorical variable).

In order to demonstrate the graphical and numerical tools in StatCrunch, we will use the Framingham Heart Study data file introduced in the Introductory Lab. However, we will add one more column, Smoker (column 4), to the introlabdata.txt data file. The new variable is defined below. For your convenience, we also provide the definitions of the other three variables in the data file:

Column  Variable  Description of Variable
1       Gender    M-Male, F-Female
2       Age       30-64 years
3       Systolic  Systolic blood pressure (82-300 mm Hg)
4       Smoker    0 if not a current smoker, 1 if current smoker

The extended data file is given in the table below:

Gender  Age  Systolic  Smoker
F       59   170       1
M       35   130       0
M       46   136       0
F       43   96        0
M       53   120       0
M       50   110       0
M       33   100       0
M       57   145       1
F       41   132       0
F       40   112       0
M       54   140       0
M       53   148       1
F       53   165       1
M       49   100       0

Add the entries in the last column (Smoker) to the introlabdata.txt data file used in the Introductory Lab.

(a) Summaries for Categorical Data: Frequency and Contingency Tables

Select the Tables option in the Stat menu.

Frequency and Relative Frequency Table

Then click the Frequency option. In its default setting, the feature provides the frequency and relative frequency for each distinct value within the selected columns. One frequency and relative frequency table will be produced for each column (variable) selected. For example, to obtain the frequencies and relative frequencies of females and males with systolic blood pressure exceeding 135, you should fill in the dialog box as follows:

Select the columns to be used in the analysis

Specify the data rows to be included in the analysis
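The request described above (Gender as the selected column, restricted to rows with Systolic above 135) can be cross-checked outside StatCrunch. The following is a minimal sketch in Python using the fourteen rows listed in this lab; the names rows and frequency_table are illustrative and are not part of StatCrunch.

```python
from collections import Counter

# Framingham sample from the table in this lab: (Gender, Age, Systolic, Smoker)
rows = [
    ("F", 59, 170, 1), ("M", 35, 130, 0), ("M", 46, 136, 0), ("F", 43, 96, 0),
    ("M", 53, 120, 0), ("M", 50, 110, 0), ("M", 33, 100, 0), ("M", 57, 145, 1),
    ("F", 41, 132, 0), ("F", 40, 112, 0), ("M", 54, 140, 0), ("M", 53, 148, 1),
    ("F", 53, 165, 1), ("M", 49, 100, 0),
]


def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}


# Keep only the rows that satisfy the condition Systolic > 135,
# then tabulate the Gender column for those rows.
selected = [gender for gender, age, systolic, smoker in rows if systolic > 135]
table = frequency_table(selected)

for cat, (freq, rel) in table.items():
    print(f"{cat}  {freq}  {rel:.4f}")
```

Six rows satisfy the condition, so the relative frequencies are computed out of six rather than out of all fourteen observations.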

The frequency and relative frequency table will be displayed for the two gender groups. Notice that if you skip the Next button in the above dialog box and click the Calculate button directly, the frequency table for the default options will be obtained.

Contingency Table

The association between two categorical variables can be summarized with a contingency table. The rows in the table list the categories of one variable and the columns list the categories of the other variable. Each cell in the table is the frequency of observations for the particular combination of values of the two variables. The contingency table can be obtained using raw data (Contingency Table with data) or summary data (Contingency Table with summary). Select the Contingency with data option in the Tables menu. Fill in the corresponding dialog box as follows:

Click the Next button to specify the information to be displayed in each cell of the contingency table: check Row Percent and Column Percent and leave the remaining boxes unchecked. Click Next. The following output will be generated:

Select the column whose values will be categorized across the rows

Select the column whose values will be categorized across the columns

Specify the data rows to be included in the computation in the Where entry box (optional)

Select an optional Group By column. A separate contingency table will be obtained for each distinct value of the column

The contingency table of “Gender” and “Smoker” variables
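To make the meaning of the cell entries concrete, the sketch below recomputes the counts, row percents, and column percents for Gender and Smoker from the fourteen rows of the lab data. It is a plain Python illustration under that assumption, not StatCrunch output; the helper name cell is hypothetical.

```python
from collections import Counter

# (Gender, Smoker) pairs taken row by row from the data table in this lab
pairs = [
    ("F", 1), ("M", 0), ("M", 0), ("F", 0), ("M", 0), ("M", 0), ("M", 0),
    ("M", 1), ("F", 0), ("F", 0), ("M", 0), ("M", 1), ("F", 1), ("M", 0),
]

counts = Counter(pairs)
genders = sorted({g for g, _ in pairs})   # row categories
smokers = sorted({s for _, s in pairs})   # column categories

row_totals = {g: sum(counts[(g, s)] for s in smokers) for g in genders}
col_totals = {s: sum(counts[(g, s)] for g in genders) for s in smokers}


def cell(gender, smoker):
    """Return (count, row percent, column percent) for one cell."""
    n = counts[(gender, smoker)]
    return n, 100 * n / row_totals[gender], 100 * n / col_totals[smoker]


for g in genders:
    for s in smokers:
        n, row_pct, col_pct = cell(g, s)
        print(f"Gender={g} Smoker={s}: count={n}, "
              f"row %={row_pct:.2f}, column %={col_pct:.2f}")
```

A row percent divides the cell count by the total for that gender, while a column percent divides it by the total for that smoking status, which is why the two percentages in a cell generally differ.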

In order to obtain a contingency table when summaries are available (the above instructions apply to the situation when raw data are available), select the columns that contain the summary counts (0 and 1 in the above example) and the column that contains the row labels (Gender). Then enter the name for the column variable (Smoker). Click the Next button, specify the additional information (row, column, and total percents) and click the Calculate button to display the results.

(b) Graphs for Categorical Data

Bar Plot

A bar chart uses vertical bars to display the frequency or relative frequency for all distinct values (categories) of the selected columns. The length of each bar is equal to the frequency or relative frequency for the corresponding value (category). Bar charts can be used to examine the association between two categorical variables like gender and smoking status. You may obtain a bar chart either when data are available (bar plot with data) or when counts for each category are provided (bar plot with summary). For example, in order to explore the association between the Gender and Smoker variables in the Framingham Heart Study data file, we can obtain a bar plot for the variable Smoker for each gender category.

A separate bar plot will be generated for each column selected

Select an optional Group By column. The frequency or relative frequency of each distinct value of the selected column will be displayed with a bar.

Use an optional Where clause to specify the data rows to be included in the analysis

You may choose Split bars if you wish to obtain two bar graphs back-to-back, one for each gender.

If you skip the Next button and click the Create Graph! button directly, the default options will be applied to your graph. Click the Next button to obtain the following dialog box:

Choose between plotting the frequency or the relative frequency for each distinct value.

Click the Next button. The following dialog box, which allows you to specify the axis labels and the title of the bar plot, will appear:

Click the Next button to obtain a dialog box that allows you to customize the appearance of the bar plot. This dialog box is common to all graphical procedures in StatCrunch and usually appears as the last dialog screen before the graph is produced. In particular, you can specify the number of rows and columns per page. A page is defined here by the visible width and height of a browser window. By default, the number of rows and the number of columns per page are both one, so one graph per page is produced.

You may obtain two graphs side by side by entering “1” as the number of rows per page and “2” as the number of columns. Similarly, you can obtain two graphs in one column (one below the other) by entering the two parameters as “2” and “1”, respectively.

With these settings, one graph per page will be produced

You may also change the colour scheme if required. Now click the Create Graph! button to obtain the following graph:

Click the Options button in the above output window.

Click the option to edit the above graph (change the graph layout, change the axes labels or the graph title)

To copy the graph into the Clipboard so it can be easily pasted into your report without saving

To save the graph as a GIF file

Like most of the graphics in StatCrunch, the bar plot is interactive. To interact with the bar plot (or, in general, with any StatCrunch graph), draw a rectangle within a desired object (for example, a bar in the bar chart) or around the desired object (a point in a scatterplot) by clicking and dragging the mouse. The objects will be highlighted in the graph as well as in all other interactive graphs obtained for the data. Moreover, the corresponding observations in the data table will also be highlighted. Draw a small rectangle in any of the four bars in the above plot to explore the interactivity.

Now we will demonstrate how to use bar charts to compare the proportions of smokers among females and males. Click the Bar plot with data option and fill in the dialog box as follows:

The Relative Frequency option should be selected in the subsequent dialog box. Moreover, “2” columns per page and “1” row per page should be requested in the graph layout dialog box. The following output will be obtained:

Pie Chart

A pie chart consists of several slices corresponding to all distinct values of a categorical variable; the size of each slice corresponds to the percentage (relative frequency) of observations in the category. You may obtain a pie chart either when data are available (pie chart with data) or when counts for each category are provided (pie chart with summary). Select the Pie Chart with data option in the menu and fill in the corresponding dialog box as follows:

A separate chart will be obtained for each column selected; the slices in each chart correspond to the distinct values of the column

You may enter an optional Where statement to specify the data rows to be included in the analysis

Select an optional Group By column to obtain a separate pie chart for each distinct value of this column

Click Next> button. The following dialog box will appear:

For each category (displayed as a slice in the pie chart), the following three numbers will be provided, separated by commas: the category name, the number of observations in the category, and the percentage of observations falling into the category. If you skip the Next> button and click the Create Graph! button directly, the default options will be applied to your pie chart.

In the “0” smoker category (nonsmokers), there are 3 females and they constitute 60% of females (three out of five).

If you click the Next> button, you will obtain a pie chart for the variable Smoker for males.

2. Summarizing and Displaying Quantitative Data

Now you will learn how to obtain the measures of center and spread for quantitative data and how to display the data with histograms and boxplots.

(a) Summaries for Quantitative Data

StatCrunch provides several descriptive statistics for single variables (the columns selected) as well as measures that indicate the extent to which two variables co-vary (tend to rise or fall together). The former are produced by the Columns and Rows options, the latter by the Correlation and Covariance options in the Summary Stats submenu.

Columns

The Columns option provides the following descriptive statistics for the columns selected: sample size, mean, variance, standard deviation (Std. Dev.), standard error (Std. Err.), median, range, minimum, maximum, first quartile (Q1) and third quartile (Q3). Additional percentiles can also be requested by the user. Click the Columns option. The following dialog box will be displayed:

Select the columns for which summary statistics will be computed

Select an optional Group By column to group results.

Enter an optional Where clause to specify the data rows to be included in the computation

If a Group By column is selected, the output will be displayed in separate tables for each column selected (default). If you wish to have the output displayed for each group, choose the other radio button.

Notice that the two radio buttons in the dialog box are provided to allow the user to organize the output in the most desirable way; they do not affect the data analysis process.

Suppose that we wish to obtain the summaries for the variable Systolic for non-smokers for each gender. In this case, the dialog box should be filled in as shown below:
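As a cross-check on such a request, the same kind of summary can be computed outside StatCrunch. The sketch below uses Python's statistics module on the fourteen rows of the lab data, keeps only non-smokers, and groups Systolic by Gender. It assumes the sample (n - 1) definitions of variance and standard deviation and takes the standard error as the standard deviation divided by the square root of n. Quartiles are left out because software packages use different quartile conventions.

```python
import math
import statistics

# Framingham sample from the table in this lab: (Gender, Age, Systolic, Smoker)
rows = [
    ("F", 59, 170, 1), ("M", 35, 130, 0), ("M", 46, 136, 0), ("F", 43, 96, 0),
    ("M", 53, 120, 0), ("M", 50, 110, 0), ("M", 33, 100, 0), ("M", 57, 145, 1),
    ("F", 41, 132, 0), ("F", 40, 112, 0), ("M", 54, 140, 0), ("M", 53, 148, 1),
    ("F", 53, 165, 1), ("M", 49, 100, 0),
]


def summarize(values):
    """Descriptive statistics for one group (sample variance, n - 1 divisor)."""
    n = len(values)
    sd = statistics.stdev(values)
    return {
        "n": n,
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),
        "std_dev": sd,
        "std_err": sd / math.sqrt(n),  # assumed definition: sd / sqrt(n)
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "min": min(values),
        "max": max(values),
    }


# Row condition: non-smokers only. Grouping column: Gender.
groups = {}
for gender, age, systolic, smoker in rows:
    if smoker == 0:
        groups.setdefault(gender, []).append(systolic)

summary = {gender: summarize(values) for gender, values in sorted(groups.items())}

for gender, stats in summary.items():
    print(gender, {name: round(value, 3) for name, value in stats.items()})
```

The row condition is applied before the grouping, so each group summary is based only on the non-smokers of that gender.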

Notice that as “Table groups for each column” is selected, the summaries for males and females will be provided in separate tables. Click the Next button. The following dialog box will appear:

Enter the requested percentiles separated by commas or spaces, e.g. 90, 99

Check the option to have the output placed in the data table

All summary statistics to be computed are selected by default (all entries in the left pane are selected). If you do not wish to compute some of the statistics, click the statistics to be removed in the left pane and they will be dropped from the list in the right pane. The statistics in the right pane will be displayed in the output (from right to left) in the order in which they are listed in the right pane. Finally, click the Calculate button to obtain the summaries.

Rows

The Rows option can be very useful when the entries in the columns of the data table for each row refer to the same object or subject. Examples include the sales data of each of four salespersons in a sales department over the last six months (the columns represent the sales figures for each of the six months), or the number of customers on each of 30 days for several postal outlets in a city. Consider the sales data example. Copy the following data into your StatCrunch data table.

Suppose now that we wish to obtain the summaries of sales for each of the four salespersons over the January-March period. We may wish to compare the summaries with those for the April-June period for each of the four salespersons. Click the Rows option in the Summary Stats submenu. The following dialog box will appear:

Now click the Next> button. The next dialog box will allow you to specify the summary statistics to be computed for each row. The default summaries are: count, sum, mean, variance, standard deviation, minimum, median and maximum. Finally, the output will be displayed in the following form:

(b) Displaying Quantitative Data: Histograms and Boxplots

Now we will discuss the graphical tools to display quantitative data.

Histogram

The histogram is the most important statistical tool for displaying quantitative data. In order to obtain a histogram, we divide the range of the data into non-overlapping intervals of equal width (called class intervals), count the number of observations falling into each class interval, and erect a bar with height equal to the frequency (frequency histogram) or relative frequency (relative frequency histogram) over each class interval. The heights of the bars in the density histogram are calculated by dividing the relative frequency by the class width so that the total area of the bars equals 1. We assume that the left endpoint of each class interval is included and the right endpoint is excluded. The endpoints of class intervals are called bins. The bins uniquely specify the class intervals if the starting bin and the common class interval width are provided.

Suppose we would like to compare the distributions of systolic blood pressure for the two gender groups in the Framingham Heart Study example. Click the Histogram option in the Graphics menu and fill in the dialog box as follows:

Select the columns (variables) to be displayed in the plot. A separate histogram will be produced for each column selected.

Specify the data rows to be included in the analysis. The clause is optional: if you do not enter anything into the box, the histogram will be obtained for all rows.

Select an optional Group By column to obtain a histogram for each distinct value of the column (variable)

Click the Next> button. In the next dialog box you will specify your class intervals by specifying the starting bin and the class interval width.

Select the Frequency, Relative Frequency or Density histogram

The two entries are optional. If you don’t enter anything, the bins will be generated automatically
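The binning rule described in this section (equal-width class intervals, left endpoint included, right endpoint excluded, density equal to relative frequency divided by class width) can be sketched in a few lines of Python. The starting bin of 80 and the width of 20 are illustrative choices for the Systolic column, not values prescribed by the lab, and the function name histogram is hypothetical.

```python
import math

# Systolic blood pressure values from the data table in this lab
systolic = [170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100]


def histogram(values, start, width):
    """Bin values into intervals [start + k*width, start + (k + 1)*width)."""
    if min(values) < start:
        raise ValueError("the starting bin must not exceed the smallest value")
    n_bins = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_bins
    for x in values:
        # Floor division puts a value equal to a left endpoint into that bin
        counts[int((x - start) // width)] += 1
    total = len(values)
    return [
        {
            "left": start + k * width,
            "right": start + (k + 1) * width,
            "frequency": c,
            "relative": c / total,
            "density": c / total / width,
        }
        for k, c in enumerate(counts)
    ]


bins = histogram(systolic, start=80, width=20)  # assumed start and width

for b in bins:
    print(f"[{b['left']}, {b['right']}): frequency={b['frequency']}, "
          f"relative={b['relative']:.3f}, density={b['density']:.4f}")
```

Note that the two observations equal to 100 fall into the interval [100, 120) and not into [80, 100), which is the effect of excluding the right endpoint.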

Click the Next> button to open the dialog box displayed below. This dialog box allows you to superimpose one of the well-known statistical density functions upon the histogram of the data. For example, you might wish to see how well your data fits the density of a normal distribution. If you select the option, you will be required to enter the parameters of the density function. Leave “optional” in our case and proceed to the next dialog box to specify histogram layout options.

Leave “optional” in this entry box for our data

In order to obtain the two histograms (for females and males) displayed side by side, “2” columns per page and “1” row per page should be requested in the graph layout dialog box. The histograms are displayed below.

Boxplot

A boxplot is a graph of the five-number summary: the minimum value, first quartile Q1, median, third quartile Q3, and maximum value. The distance from Q1 to Q3 is called the interquartile range (IQR). We will demonstrate the feature using the Framingham Heart Study data. Click the Boxplot option in the Graphics menu. Suppose we wish to obtain side-by-side boxplots of systolic blood pressure for males and females.

To obtain side-by-side boxplots for males and females

To obtain separate boxplots for the two gender groups

Click the Next> button. You may choose to use fences when constructing the boxplots (optional). The inner fences are located at a distance of 1.5 times the IQR below Q1 and above Q3. The outer fences are located at a distance of 3 times the IQR below Q1 and above Q3.

A point beyond an inner fence on either side is considered an outlier. A point beyond an outer fence is considered an extreme outlier.
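The fence rule can be illustrated with a short Python sketch applied to all fourteen Systolic values from the lab data. One assumption should be stated loudly: the quartiles are computed with the "inclusive" method of Python's statistics.quantiles, and StatCrunch may use a different quartile convention, so its Q1 and Q3 (and therefore its fences) can differ slightly. The function name classify is hypothetical.

```python
import statistics

# Systolic blood pressure values from the data table in this lab
systolic = [170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100]

# Assumption: "inclusive" quartile method; other software may use another rule.
q1, median, q3 = statistics.quantiles(systolic, n=4, method="inclusive")
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fences


def classify(x):
    """Label a value relative to the inner and outer fences."""
    if x < outer[0] or x > outer[1]:
        return "extreme outlier"
    if x < inner[0] or x > inner[1]:
        return "outlier"
    return "regular"


labels = {x: classify(x) for x in systolic}

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"inner fences: {inner}, outer fences: {outer}")
print(labels)
```

Under this quartile convention none of the fourteen observations lies beyond an inner fence. A hypothetical reading of 200 would be flagged as an outlier and a reading of 250 as an extreme outlier.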

You may choose to have the boxes corresponding to two genders (groups) displayed vertically (default option) or horizontally.

Remember that you may always change the appearance of your graph by clicking the Options button and editing the appropriate dialog boxes.

3. Displaying Data over Time: Time Plots

A data set collected over time is called a time series. A time plot is a graph of a time series. It plots each observation, on the vertical scale, against the time at which it was measured, on the horizontal scale. Time series show trends or changes in data over a period of time. The information obtained by examining a time plot is especially meaningful when the time points at which the variable of interest is measured are equally spaced. In this case we may label them with the consecutive integers 1, 2, 3, ….

(a) Index Plots

An index plot displays the values of a column (variable) versus the corresponding row index number. The row index numbers usually reflect the order in which the data have been collected. Consecutive points in the plot are connected with lines. In order to illustrate the tool, we consider the following example. The sales for two department stores (in millions) from 1998 to 2005 are shown in the following table:

Year  Department 1  Department 2
1998  64            60
1999  66            64
2000  69            67
2001  70            72
2002  71            74
2003  75            77
2004  81            80
2005  85            84
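The pairing that an index plot draws, each value against its row index, can be sketched in Python with the sales table above. The variable and function names are illustrative only; the sketch lists the points and does not draw the graph.

```python
# Sales (in millions) from the table above
years = [1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005]
department_1 = [64, 66, 69, 70, 71, 75, 81, 85]
department_2 = [60, 64, 67, 72, 74, 77, 80, 84]


def index_points(values):
    """Pair each value with its 1-based row index, as an index plot does."""
    return list(enumerate(values, start=1))


points_1 = index_points(department_1)
points_2 = index_points(department_2)

# The years are consecutive, so row index i corresponds to year 1997 + i.
for (i, sales_1), (_, sales_2), year in zip(points_1, points_2, years):
    print(f"index={i}  year={year}  dept1={sales_1}  dept2={sales_2}")
```

Because the observations are equally spaced in time, replacing the years by the row indices 1 to 8 changes only the labels on the horizontal axis and leaves the shape of the plot unchanged.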

We will compare the sales for the two department stores with a time plot using the Index Plot feature. Click the Index Plot option in the Graphics pull-down menu.

As we wish to compare the sales data for the two department stores, we leave the Separate graph for each column check box unchecked. Click the Next> button and define labels for the axis, assign the title and specify the axis options.

In the next dialog box you will specify the graph layout options. Finally, you will obtain the following graph:

Check the box if each column is to be displayed in its own separate plot.

Select the columns (variables) to be displayed in the plot. If more than one is selected, each of them will be color-coded and displayed in a single plot (default)

The first tick on the horizontal axis is at the point 1. Notice that you cannot change the labels (consecutive integers) below the ticks. You will be able to specify any axis labels you wish by using the Scatter Plot tool discussed below.

(b) Scatter Plot with Lines

The Scatter Plot tool available in the Graphics pull-down menu allows the user to obtain a plot of one quantitative variable versus another quantitative variable. We will discuss scatterplots in StatCrunch in detail in the Lab 2 Instructions. Here we will use the Scatter Plot tool in the special case when the variable plotted on the horizontal axis is time (in units such as minutes, hours, days, months, or years) and the variable plotted on the vertical axis is any quantitative variable varying over time. The points in this kind of scatterplot are connected by lines. The axis labels below the tick marks on the horizontal axis correspond to the values (numerical or categorical) specified in the appropriate column in the data.

We will demonstrate how to construct a time plot using the sales data for the two departments. However, to obtain a single time plot with the two variables, we have to rearrange the data as follows:

Click the Scatter Plot option in the Graphics pull-down menu.

Time is plotted on the horizontal (X) axis

Time series values are plotted on the vertical (Y) axis.

Enter an optional Where statement to specify the data rows to be included in the time plot. You may exclude some observations using the option.

Click the Next> button and select the “Lines” option (the consecutive points in the plot will be connected by straight lines). If you wish to have the points in the plot marked clearly, you may select both the “Points” and “Lines” options. Click the Next> button again and specify the graph layout, the title and the axis titles.

Finally the following plot will be obtained:

(c) Multi Plot

The Multi Plot tool available in the Graphics menu allows you to plot multiple pairs of points on the same graph or on separate graphs. Pairs may be plotted as points, connected with lines, or both plotted as points and connected with lines.

Click the Add button to add the pairing to the plot. The pairing will then be displayed in the selection box. To delete the pairing, select it and click Delete.

In the next dialog box you will specify the graph layout options. Finally, you will obtain a graph similar to the time plot obtained with the Scatter Plot tool in part (b).
