Post on 11-Feb-2016
Data and Introduction to Probability
Krannert School of Management, Purdue University
Pre-Session
2
Using Data for Decisions
We deal with data every day to make decisions.
• Dow Jones averages for investment decisions
• Sales and inventory data for production planning
• Customer preference survey for new product introduction
Very large data sets are not uncommon.
• Wal-Mart: over 20M transactions per day in a 10-terabyte (10^12 bytes) database.
• AT&T: 100M customers and 300M long-distance calls per day.
3
Statistics: The Science of Data
Collecting Data
• Surveys
• Experiments
Presenting Data
• Charts & Tables
Characterizing Data
• Averages
• Variances
Data Analysis
Decision-Making
© 1984-1994 T/Maker Co.
Why?
4
Categorical and Numerical Variables
Categorical variables:
• divide data into two or more categories
• also called qualitative variables
Examples:
• gender (male/female)
• hair color (brown, blonde, brunette, …)
• economic status (low, medium, high)
• satisfaction (1-10)
Numerical variables:
• indicate how much or how many using numbers
• also called quantitative variables
Examples:
• temperature
• age
• income
Why distinguish?
• To determine what computation makes sense
Examples:
• What is average hair color?
• Is 80°F twice as hot as 40°F?
5
Cross-Sectional and Time Series Data
Cross-sectional data are collected at the same or approximately the same point in time.
• Example: data detailing the number of building permits issued in June 2004 in each of the counties of Indiana
Time series data are collected over several time periods.
• Example: data detailing the number of building permits issued in Tippecanoe County, Indiana in each of the last 36 months
6
Figure: Industrial Production versus Year, 1995-2004, for Japan, France, Germany, Italy, UK, U.S., and Canada (vertical axis from 90 to 125).
Source: Department of Commerce and Council of Economic Advisors. Adjusted so that 1997=100.
7
Example: Firestone Tires
Reported accidents that involved Firestone tires
DATEA      STATE  INJURIES  TIRE_MODEL     TIRE_SIZE      VEH_MFR  VEH_MODEL
2/28/2000  TX     NA        WILDERNESS AT  NA             FORD     EXPLORER
3/30/2000  TX     3         ATX            P235/75R15     FORD     EXPLORER
3/22/2000  TX     1         ATX            P235/75R15     FORD     EXPLORER
5/22/2000  CA     3         ATX            P235/75R15     FORD     EXPLORER
8/15/2000  LA     0         WILDERNESS     P225/70R14     FORD     RANGER
7/11/2000  FL     0         ATX            P235/75R15     FORD     EXPLORER
7/17/2000  MI     0         WILDERNESS     P235/75R15     FORD     EXPLORER
7/25/2000  FL     1         WILDERNESS     P235/75R15     FORD     EXPLORER
8/2/2000   FL     2         ATX            NA             FORD     EXPLORER
8/1/2000   FL     0         WILDERNESS AT  UKN            FORD     EXPLORER
8/2/2000   FL               FIRESTONE      N/A            NISSAN   FRONTIER
8/2/2000   IL     1         FIRESTONE      NA             NISSAN   4X4
8/2/2000   CA/AR  1         ATX            P235/75R15     FORD     EXPLORER
8/2/2000   TX     1         ATX            P235/75R15     FORD     EXPLORER
8/2/2000   CO     4         ATX            30XR9.50R15LT  FORD     BRONCO II
8
Frequency Distribution
Tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes
Provides insights about the data that cannot be quickly obtained by looking only at the original data.
Use COUNTIF function in Excel.
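As a rough Python counterpart to the COUNTIF approach, the sketch below tallies the tire models from the 15 sample accident records listed earlier. The variable names are illustrative, and `collections.Counter` is a swapped-in technique, not the Excel method the slides use.

```python
from collections import Counter

# TIRE_MODEL column of the 15 sample accident records, in listed order.
tire_models = [
    "WILDERNESS AT", "ATX", "ATX", "ATX", "WILDERNESS",
    "ATX", "WILDERNESS", "WILDERNESS", "ATX", "WILDERNESS AT",
    "FIRESTONE", "FIRESTONE", "ATX", "ATX", "ATX",
]

# Frequency distribution: number of records in each non-overlapping class.
counts = Counter(tire_models)
total = sum(counts.values())

# Print count and percent frequency, most frequent class first.
for model, count in counts.most_common():
    print(f"{model:<15}{count:>3}{100 * count / total:>7.1f}%")
```

Each class here is one distinct tire model, so every record falls in exactly one class.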
9
Example: Firestone Tires
Frequency Distribution
Tire Model            Count  Percent
ATX                     554     18.7
Firehawk                 38      1.3
Firestone                29      1.0
Firestone ATX           106      3.6
Firestone Wilderness    131      4.4
Radial ATX               48      1.6
Wilderness             1246     42.0
Wilderness AT           709     23.9
Wilderness HT           108      3.6
Total                  2969    100
ATX, Wilderness, and Wilderness AT together: 84.6%
10
Bar Graph
A graphical device for depicting categorical data that have been summarized in a frequency distribution.
On the horizontal axis we specify the labels that are used for each of the classes.
A frequency, relative frequency, or percent frequency scale can be used for the vertical axis.
11
Example: Firestone Tires
Pie Chart
12
Pareto Chart
A bar graph whose categories are ordered from most frequent to least frequent.
Identifies “vital few” categories that contain most of the observations.
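The ordering behind a Pareto chart can be sketched in a few lines of Python. The counts are the tire-model frequencies from the Firestone frequency distribution; the variable names are illustrative, and the cutoff of three "vital few" categories is an assumption made only for this example.

```python
# Tire-model counts from the Firestone frequency distribution.
counts = {
    "ATX": 554, "Firehawk": 38, "Firestone": 29,
    "Firestone ATX": 106, "Firestone Wilderness": 131,
    "Radial ATX": 48, "Wilderness": 1246,
    "Wilderness AT": 709, "Wilderness HT": 108,
}
total = sum(counts.values())

# Pareto order: categories sorted from most frequent to least frequent.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Running (cumulative) count shows how quickly the top categories add up.
running = 0
cumulative = []
for model, count in ordered:
    running += count
    cumulative.append((model, count, running))
    print(f"{model:<22}{count:>6}{100 * running / total:>8.1f}%")

# Share of all observations held by the three most frequent categories
# (three is an illustrative cutoff, not a rule from the slides).
top_three_share = 100 * cumulative[2][2] / total
```

Computed from the raw counts, the top three categories hold about 84.5% of the observations; summing the rounded percentages in the table gives 84.6%.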
13
Example: Unemployment Rates
Unemployment rates for each of the 50 states and Puerto Rico in December 2000
14
Frequency Distribution
Divide the range of data into classes of equal width. The number of classes is usually between 5 and 20.
Count the number of elements in each class.
Use FREQUENCY function in Excel.
Unemployment in States
Class        Frequency
1.0 to 1.9    2
2.0 to 2.9   13
3.0 to 3.9   21
4.0 to 4.9   10
5.0 to 5.9    3
6.0 to 6.9    1
7.0 to 7.9    0
8.0 to 8.9    1
15
Histogram
A graphical presentation of numerical data.
The variable of interest is placed on the horizontal axis and the frequency is placed on the vertical axis.
16
Cumulative Distribution
Shows the number of items with values less than or equal to the upper limit of each class.
Cumulative percent frequency distribution
Unemployment rate in states
Class        Frequency  Cumulative Frequency  Cumulative Percent Frequency
1.0 to 1.9    2          2                      3.9%
2.0 to 2.9   13         15                     29.4%
3.0 to 3.9   21         36                     70.6%
4.0 to 4.9   10         46                     90.2%
5.0 to 5.9    3         49                     96.1%
6.0 to 6.9    1         50                     98.0%
7.0 to 7.9    0         50                     98.0%
8.0 to 8.9    1         51                    100.0%
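The cumulative columns follow directly from the class frequencies. A minimal Python sketch, using the unemployment class frequencies from the slides (variable names are illustrative):

```python
from itertools import accumulate

# Class labels and frequencies from the unemployment-rate distribution.
classes = ["1.0 to 1.9", "2.0 to 2.9", "3.0 to 3.9", "4.0 to 4.9",
           "5.0 to 5.9", "6.0 to 6.9", "7.0 to 7.9", "8.0 to 8.9"]
frequencies = [2, 13, 21, 10, 3, 1, 0, 1]

# Cumulative frequency: running total up to and including each class.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Cumulative percent frequency: running total as a share of all items.
cumulative_pct = [100 * c / total for c in cumulative]

for label, f, c, p in zip(classes, frequencies, cumulative, cumulative_pct):
    print(f"{label}  {f:>3}  {c:>3}  {p:>6.1f}%")
```

Plotting `cumulative_pct` against the upper class limits would give the ogive for these data.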
17
Ogive
A graph of a cumulative distribution.
The data values are shown on the horizontal axis.
Shown on the vertical axis are the:
• cumulative frequencies, or
• cumulative percent frequencies
Figure: Unemployment Rates. Frequency (axis from 0 to 25) and cumulative percent frequency (axis from 0.0% to 120.0%) by class, 1.0 to 1.9 through 8.0 to 8.9.
18
Time Plot
Plots a variable against the time at which it was measured.
Always put time on the horizontal axis.
Connecting the data points by lines helps emphasize any change over time.
Figure: Johnson & Johnson EPS by year, 1980-1990 (vertical axis: EPS, from -0.5 to 1.5).
19
Summary of Tabular and Graphical Methods
Data
Categorical Data
Tabular Methods
• Frequency Distribution
• Rel. Freq. Dist.
Graphical Methods
• Bar Graph
• Pie Chart
• Pareto Chart
Numerical Data
Tabular Methods
• Frequency Distribution
• Rel. Freq. Dist.
• Cum. Freq. Dist.
• Cum. Rel. Freq. Distribution
• Stem-and-Leaf Display
Graphical Methods
• Histogram
• Ogive
• Time Plot
20
The Five-Number Summary and Box-and-Whisker Plot
The five-number summary of a distribution consists of:
Minimum Q1 Median Q3 Maximum
A box-and-whisker plot is a graph of the five-number summary.
• A central box spans the quartiles.
• A line in the box marks the median.
• Lines extend from the box out to the smallest and largest observations.
21
Example: Apartment Rents
For the following data on 70 apartment rents, compute the five-number summary.
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Five-Number Summary
Lowest Value = 425
First Quartile = 450
Median = 475
Third Quartile = 525
Largest Value = 615
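A five-number summary for the 70 rents can be sketched with Python's standard library. Note the assumption: `statistics.quantiles` with its default "exclusive" method is only one of several quartile conventions, so the computed quartiles need not match the values quoted on the slide, which may come from a different rule.

```python
import statistics

# The 70 apartment rents from the example, already in ascending order.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

# quantiles(n=4) returns the three cut points Q1, median, Q3.
# The default method is "exclusive"; other conventions can differ slightly.
q1, median, q3 = statistics.quantiles(rents, n=4)

five_number_summary = (min(rents), q1, median, q3, max(rents))
print(five_number_summary)
```

The minimum, median, and maximum do not depend on the quartile convention; only Q1 and Q3 can shift between methods.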
22
Coefficient of Variation
Measure of relative dispersion
Always a %
Shows variation relative to mean
Used to compare two or more groups
Sample coefficient of variation:
$CV = \frac{s}{\bar{x}}(100)$
Population coefficient of variation:
$CV = \frac{\sigma}{\mu}(100)$
23
Example: Coefficient of Variation
Data
Group 1: 1 2 3
Group 2: 100 200 300
Coefficients of Variation
Group 1: $CV = \frac{s}{\bar{x}}(100) = \frac{1}{2}(100) = 50\%$
Group 2: $CV = \frac{s}{\bar{x}}(100) = \frac{100}{200}(100) = 50\%$
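The two results can be checked with a short Python sketch. The function name is illustrative; `statistics.stdev` computes the sample standard deviation s, matching the sample formula.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample coefficient of variation in percent: (s / x-bar) * 100."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

group_1 = [1, 2, 3]
group_2 = [100, 200, 300]

cv_1 = coefficient_of_variation(group_1)
cv_2 = coefficient_of_variation(group_2)

print(f"Group 1: CV = {cv_1:.1f}%")
print(f"Group 2: CV = {cv_2:.1f}%")
```

Although the standard deviations differ by a factor of 100, the relative dispersion of the two groups is the same.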
24
Using Minitab: Descriptive Statistics Tool
25
Chebyshev’s Theorem
At least $(1 - 1/k^2)$ of the items in any data set will be within k standard deviations of the mean, where k > 1.
Examples:
• At least 75% of the items must be within k = 2 standard deviations of the mean.
• At least 89% of the items must be within k = 3 standard deviations of the mean.
• At least 94% of the items must be within k = 4 standard deviations of the mean.
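The three percentages above come straight from the bound. A minimal Python sketch (the function name is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of items within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k**2

# Reproduce the bounds for k = 2, 3, 4.
for k in (2, 3, 4):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.0f}%")
```

The bound holds for any data set, whatever its shape, which is why it is often much lower than the share actually observed.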
26
Example: Apartment Rents
Chebyshev’s Theorem
Let k = 2.0 with $\bar{x}$ = 490.80 and s = 54.74.
At least $(1 - 1/2^2)$ = 0.75 or 75% of the rent values must be between
$\bar{x} - ks$ = 490.80 - 2.0(54.74) = 381.32 and $\bar{x} + ks$ = 490.80 + 2.0(54.74) = 600.28
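To see how conservative the bound is for these data, the sketch below counts how many of the 70 rents actually fall within two standard deviations of the mean. It takes the mean and standard deviation as given on the slide; the variable names are illustrative.

```python
# The 70 apartment rents from the example.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

# Summary statistics as quoted on the slide.
x_bar, s, k = 490.80, 54.74, 2.0

lower = x_bar - k * s
upper = x_bar + k * s

# Count the rents inside [x_bar - k*s, x_bar + k*s].
inside = sum(lower <= r <= upper for r in rents)
share = inside / len(rents)

print(f"Interval: {lower:.2f} to {upper:.2f}")
print(f"{inside} of {len(rents)} rents ({100 * share:.1f}%) fall inside")
```

Only the two rents of 615 lie outside the interval, so the observed share is well above the guaranteed 75%.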
27
Examining Relationships
Thus far we have focused on methods that are used to summarize the data for a single variable.
Often a manager is interested in the relationship between two or more variables.
Methods of examining relationships
• Scatter Diagrams
• Correlation and Covariance
• Least-Squares Regression
• Crosstabulations
28
Example: Heating a House
One degree-day is accumulated for each degree a day’s average temperature falls below 65°F; e.g., an average temperature of 20°F corresponds to 45 degree-days.
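The degree-day rule can be written as a one-line function. This is a minimal sketch; the function and constant names are illustrative, and treating days at or above 65°F as contributing zero degree-days is an assumption consistent with the definition above.

```python
BASE_TEMP_F = 65  # reference temperature from the definition

def degree_days(avg_temp_f):
    """Degree-days accumulated on a day with the given average temperature."""
    # Days at or above the base temperature add nothing (assumed).
    return max(0, BASE_TEMP_F - avg_temp_f)

print(degree_days(20))  # the slide's example day
```

Summing `degree_days` over the days in a billing period would give the total used on the horizontal axis of the scatter diagram.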
29
Example: Heating a House
Scatter Diagram
30
Adding Categorical Variables
Natural-gas consumption against degree-days before (red dots) and after (blue dots) installing solar panels
31
Lurking Variables
Lurking in the background, if not measured, they can
• falsely suggest a strong relationship between two variables, or
• hide a relationship that is really there.
32
Example: False Association
Studies show that men who complain of chest pain are more likely to get detailed tests and aggressive treatment than are women with similar complaints.
Is this association due to discrimination?
Perhaps not. On average, women develop heart problems 10 to 15 years later in life than men do. Aggressive treatments are more risky for older patients, so doctors may hesitate to recommend them.
Lurking variables (the patient’s age and condition) may explain the false association.
33
Example: Hidden Association
Correlation between x and y = 0.08
34
Crosstabulations
Jointly show observations of two categorical variables (or categorized numerical variables).
Often referred to as contingency tables.
Help us to find the relationship between the two variables.
35
Examining Relationship in Crosstabulations
No single graph (such as a scatter plot) or single numerical measure (such as the correlation) summarizes the strength of the association between categorical variables.
Describing relationships
• Marginal Distribution: Distribution of each variable alone.
• Conditional Distribution: Distribution of one variable for a given class of the other variable
36
Example: Marital Status and Job Level
Marginal distribution of the job level
37
Example: Marital Status and Job Level
Conditional distributions of the job level
38
Pivot Table in Excel
Allows us to construct crosstabulations in a variety of ways.
Provides more variety and flexibility than most other statistical software packages.
39
Example: Movie Stars
Amount the stars ask for a movie
Note: Rows 9-67 are not shown.
     A                  B       C                                      D
1    Name               Gender  Amount asking for a movie ($ million)
2    Angela Bassett     F       2.5
3    Jessica Lange      F       2.5
4    Winona Ryder       F       4
5    Michelle Pfeiffer  F       10
6    Whoopi Goldberg    F       10
7    Emma Thompson      F       3
8    Julia Roberts      F       12
40
Example: Movie Stars
PivotTable
Count of Amount asking for a movie ($ million), by Gender
Amount asking for a movie ($ million)   F    M    Grand Total
2-5                                     10    9   19
5-8                                      1   17   18
8-11                                     4    8   12
11-14                                    3    3    6
14-17                                         3    3
17-20                                         8    8
Grand Total                             18   48   66
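Marginal and conditional distributions can be read off this crosstabulation with a few lines of Python. The counts are the PivotTable values; treating the blank cells as zero is an assumption, and all names are illustrative.

```python
# PivotTable counts: asking amount ($ million) by gender.
# Blank cells in the table are assumed to mean zero.
counts = {
    "2-5":   {"F": 10, "M": 9},
    "5-8":   {"F": 1,  "M": 17},
    "8-11":  {"F": 4,  "M": 8},
    "11-14": {"F": 3,  "M": 3},
    "14-17": {"F": 0,  "M": 3},
    "17-20": {"F": 0,  "M": 8},
}
genders = ("F", "M")

# Column totals and grand total.
gender_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
grand_total = sum(gender_totals.values())

# Marginal distribution of the asking amount (gender ignored).
marginal_amount = {
    cls: sum(row.values()) / grand_total for cls, row in counts.items()
}

# Conditional distribution of the asking amount, given each gender.
conditional_given_gender = {
    g: {cls: row[g] / gender_totals[g] for cls, row in counts.items()}
    for g in genders
}

for cls in counts:
    print(f"{cls:<6}"
          f"{100 * conditional_given_gender['F'][cls]:>7.1f}%"
          f"{100 * conditional_given_gender['M'][cls]:>7.1f}%")
```

Comparing the two conditional columns, not the raw counts, is what shows whether asking amounts differ by gender, since there are far more men than women in the data.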
41
Simpson’s Paradox
An association that holds for all of several groups can reverse direction when the data are combined to form a single group.
This reversal is called Simpson’s paradox.
42
Promotion of Managers
Manager promotions in MDP Inc.
Who gets more promotions?
                       MBA Degree  No MBA Degree
Promoted                125         155
Not Promoted            875         845
Total                  1000        1000
Promoted (in percent)  12.5%       15.5%
43
Lurking Variable: Experience
Data broken down by fresh hire/experienced
MBA promotions are higher in each category
Experience is the lurking variable
             MBA Degree                          Without MBA Degree
             Promoted  Not Promoted  % Promoted  Promoted  Not Promoted  % Promoted
Fresh        90        810           10.0%       15        285            5.0%
Experienced  35         65           35.0%       140       560           20.0%
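The reversal can be verified numerically. The sketch below uses the promotion counts from the two tables; the dictionary layout and names are illustrative.

```python
# (promoted, not promoted) counts by degree and experience level.
promotions = {
    "MBA":    {"Fresh": (90, 810), "Experienced": (35, 65)},
    "No MBA": {"Fresh": (15, 285), "Experienced": (140, 560)},
}

def rate(promoted, not_promoted):
    """Promotion rate as a fraction of the group."""
    return promoted / (promoted + not_promoted)

# Promotion rate within each experience group.
within = {
    degree: {level: rate(*cell) for level, cell in groups.items()}
    for degree, groups in promotions.items()
}

# Promotion rate after combining the experience groups.
overall = {
    degree: rate(sum(p for p, _ in groups.values()),
                 sum(n for _, n in groups.values()))
    for degree, groups in promotions.items()
}

for level in ("Fresh", "Experienced"):
    print(level, f"{within['MBA'][level]:.1%}", f"{within['No MBA'][level]:.1%}")
print("Overall", f"{overall['MBA']:.1%}", f"{overall['No MBA']:.1%}")
```

Most MBA holders are fresh hires, a group with low promotion rates for everyone, so the combined MBA rate is pulled down even though MBAs do better within each group.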
44
Example: Flight Arrivals
One month’s data for flights from several western cities for two airlines:
Which airline is more often on time?
           Alaska Airlines  America West
On time    3274             6438
Delayed     501              787
Total      3775             7225
% Delayed  13.3%            10.9%
45
Example: Flight Arrivals
Data broken down by cities
Alaska Airlines beats America West in all 5 cities.
City of origin is a lurking variable.
               Alaska Airlines             America West
               On time  Delayed  % Del.    On time  Delayed  % Del.
Los Angeles     497      62      11.1%      694     117      14.4%
Phoenix         221      12       5.2%     4840     415       7.9%
San Diego       212      20       8.6%      383      65      14.5%
San Francisco   503     102      16.9%      320     129      28.7%
Seattle        1841     305      14.2%      201      61      23.3%
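The same check applies to the flight data: compare delay rates city by city, then after pooling. A minimal sketch using the counts from the table (names are illustrative):

```python
# (on time, delayed) flights by city for each airline.
flights = {
    "Los Angeles":   {"Alaska": (497, 62),   "America West": (694, 117)},
    "Phoenix":       {"Alaska": (221, 12),   "America West": (4840, 415)},
    "San Diego":     {"Alaska": (212, 20),   "America West": (383, 65)},
    "San Francisco": {"Alaska": (503, 102),  "America West": (320, 129)},
    "Seattle":       {"Alaska": (1841, 305), "America West": (201, 61)},
}

def delay_rate(on_time, delayed):
    """Fraction of flights that were delayed."""
    return delayed / (on_time + delayed)

# Is Alaska's delay rate lower in every single city?
alaska_better_everywhere = all(
    delay_rate(*row["Alaska"]) < delay_rate(*row["America West"])
    for row in flights.values()
)

# Pool the cities and compare again.
def pooled_delay_rate(airline):
    on_time = sum(row[airline][0] for row in flights.values())
    delayed = sum(row[airline][1] for row in flights.values())
    return delay_rate(on_time, delayed)

overall = {a: pooled_delay_rate(a) for a in ("Alaska", "America West")}

print(alaska_better_everywhere)
print({a: f"{r:.1%}" for a, r in overall.items()})
```

America West flies mostly from Phoenix, where delays are rare for both airlines, while Alaska flies mostly from Seattle, where delays are common; the mix of cities drives the pooled comparison.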
Introduction to Probability
47
Experiment, Outcome, and Sample Space
Experiment: process of obtaining an observation, outcome, or information.
• Example: Roll a die
Outcome: result of an experiment
• Example: Number that appears on top
Sample space: collection (or set) of all possible outcomes, defined by experimenter
• Example: {1, 2, 3, 4, 5, 6}
Event: any collection (or set) of outcomes
• Example: Get an odd number
48
Examples of Experiments and Their Sample Spaces
Experiment                                                              Sample Space
Toss a coin, and note a face.                                           {Head, Tail}
Inspect a part, and note quality.                                       {Defective, Good}
Toss a coin twice, and note faces.                                      {HH, HT, TH, TT}
Purchase a soft drink twice, and note brand, C (Coke), O (all others).  {CC, CO, OC, OO}
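Sample spaces for repeated experiments like the two-toss and two-purchase cases can be enumerated mechanically. A small Python sketch using `itertools.product` (variable names are illustrative):

```python
from itertools import product

# Toss a coin twice: every ordered pair of faces.
two_tosses = ["".join(pair) for pair in product("HT", repeat=2)]

# Purchase a soft drink twice: C (Coke) or O (all others) each time.
two_purchases = ["".join(pair) for pair in product("CO", repeat=2)]

print(two_tosses)
print(two_purchases)
```

Order matters in both cases, which is why HT and TH (or CO and OC) are listed as separate outcomes.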
49
Probability of An Event
Numerical measure of the likelihood that an event will occur.
Lies between 0 (impossible) and 1 (certain).
A probability of 0.5 indicates the occurrence of the event is just as likely as it is unlikely.
P(Sample Space) = 1
50
Event Properties
Mutually exclusive events:
• The events cannot occur at the same time
Example: Suppose we want to assess the probability that the Federal Reserve Bank will change the prime rate this month. Define the following events:
A: the prime rate is raised
B: the prime rate is lowered
C: the prime rate is unchanged
Events A, B, and C are mutually exclusive.
51
Collectively exhaustive events:
• One out of a set of events must occur
• Constitute the sample space
Example: In the prime rate example, Events A, B, and C are collectively exhaustive.
Example: Consider two tosses of a fair coin:
A: at least one head
B: at least one tail
Are A and B collectively exhaustive? Yes.
Are they mutually exclusive? No. Consider HT or TH.
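Both properties can be checked with set operations. The sketch below works through the two-toss example; representing events as Python sets of outcome strings is an illustrative choice.

```python
from itertools import product

# Sample space for two coin tosses.
sample_space = {"".join(pair) for pair in product("HT", repeat=2)}

# A: at least one head; B: at least one tail.
A = {outcome for outcome in sample_space if "H" in outcome}
B = {outcome for outcome in sample_space if "T" in outcome}

# Collectively exhaustive: together the events cover the sample space.
collectively_exhaustive = (A | B) == sample_space

# Mutually exclusive: the events share no outcome.
mutually_exclusive = A.isdisjoint(B)

print(collectively_exhaustive, mutually_exclusive, sorted(A & B))
```

The shared outcomes HT and TH are exactly the cases that make A and B overlap.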
52
Complement of an Event
The complement of event A is the event consisting of all outcomes that are not in A.
The complement of A is denoted by $A^c$.
The Venn diagram below illustrates the concept of a complement.
Figure: Venn diagram of the sample space showing event A and its complement $A^c$.
53
Union of Two Events
The union of events A and B is the event containing all sample points that are in A or B or both.
The union is denoted by $A \cup B$.
The union of A and B is illustrated below.
Figure: Venn diagram of the sample space showing event A and event B.
54
Intersection of Two Events
The intersection of events A and B is the set of all sample points that are in both A and B.
The intersection is denoted by $A \cap B$.
The intersection of A and B is illustrated below.
Figure: Venn diagram of the sample space showing event A, event B, and their intersection.
55
Addition Law: Relating Probabilities of Unions and Intersections
Provides a way to compute the probability of the union of events. The law is written as:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
When the events are mutually exclusive:
$P(A \cup B) = P(A) + P(B)$
In rolling a die, what is the probability of getting an odd number or a number less than 5?
Odd: {1,3,5}
< 5: {1,2,3,4}
Odd and < 5: {1,3}
Odd or < 5: {1,2,3,4,5}
P(Odd or < 5) = 0.5 + 2/3 - 1/3 = 5/6
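The die example can be checked by counting outcomes directly. This sketch assumes a fair die, so each of the six faces is equally likely; the helper name `prob` is illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Sample space for one roll of a die (assumed fair).
sample_space = set(range(1, 7))

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

odd = {x for x in sample_space if x % 2 == 1}
less_than_5 = {x for x in sample_space if x < 5}

# Left side: probability of the union, counted directly.
direct = prob(odd | less_than_5)

# Right side: addition law, subtracting the double-counted intersection.
by_addition_law = prob(odd) + prob(less_than_5) - prob(odd & less_than_5)

print(direct, by_addition_law)
```

Without subtracting the intersection, the outcomes 1 and 3 would be counted twice and the sum would exceed the true probability.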