Topic1_Summarizing_and_Visualizing_Data.pdf

Summarizing and Visualizing Data

Ravindra S. Gokhale, IIM Indore


Some Examples of Business Situations addressed using Statistics

Ø A business school A claims that its average starting offers are higher than those of another business school B. Is the claim true?

Ø A plant X has two assembly lines. Employees face one of three kinds of accidents – sprains, cuts, burns. Do the accident patterns with respect to type differ between the two assembly lines?

Ø A bank wants to assess the creditworthiness of a loan applicant. Should it approve the loan or reject it?

Ø A company is considering two price points (Rs. 25000 versus Rs. 35000) for its software. It plans two different promotion campaigns (speed of the product versus computational power of the product). It proposes to offer one of two types of customer incentives (30 day free trial versus a free gift of related software). What strategy should be adopted in order to maximize sales?

Ø An HR manager wants to verify the claim that increased compensation increases the motivation of employees irrespective of their age, gender, experience level, etc. Is the claim true?

Approaches for Statistical Problems

Ø Two approaches – they are used jointly in most cases

• Descriptive statistics

• Inferential statistics

Ø Descriptive statistics

• Used to describe the main features of a collection of data quantitatively

• Aims to summarize a data set quantitatively without employing a probabilistic formulation

Ø Inferential statistics

• Aims to draw conclusions from data that are subject to random variation

• Used for: Estimation; Hypothesis testing; Predicting/forecasting

Some Basic Concepts

Ø Types of Variables

Ø Scales of Measurement

Ø Measures of central tendency

Ø Measures of dispersion

Ø Skewness and Kurtosis

Ø Quartiles

Ø Histogram

Ø Box plot

Ø Scatter plot

Ø Stem and Leaf Display

Ø Pie Chart and Bar Graph


Types of Variables

Ø Quantitative Variables

A quantitative variable can be described by a number for which arithmetic operations such as averaging make sense.

Ø Qualitative Variables

A qualitative (or categorical) variable simply records a quality.

If a number is used for distinguishing members of different categories of a qualitative variable, the number assignment is arbitrary.

Scales of Measurement

Ø Nominal Scale

e.g. North = 1, East = 2, South = 3, West = 4

Ø Ordinal Scale

e.g. Very good = 4, Good = 3, Fair = 2, Unacceptable = 1

Ø Interval Scale

• the value of zero is assigned arbitrarily and therefore we cannot take ratios of two measurements.

• but we can take ratios of intervals.

• e.g. 100 deg C is not twice as hot as 50 deg C.

Ø Ratio Scale

• we can take ratios of those measurements.

• the zero in this scale is an absolute zero.

• e.g. money - a sum of Rs. 100 is twice as large as Rs. 50.
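The difference between the interval and ratio scales can be checked numerically. The following is a minimal Python sketch; the helper names and the conversions to Kelvin and Fahrenheit are illustrative additions, not part of the slides.

```python
# Celsius is an interval scale: its zero is arbitrary, so ratios of readings mislead.
def celsius_to_kelvin(c):
    return c + 273.15

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

naive_ratio = 100.0 / 50.0  # 2.0, but this depends on where zero was placed

# Kelvin has an absolute zero (ratio scale), so this ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(100.0) / celsius_to_kelvin(50.0)

# Ratios of intervals do not depend on the choice of zero or unit.
interval_ratio_c = (100.0 - 50.0) / (50.0 - 25.0)
interval_ratio_f = (
    (celsius_to_fahrenheit(100.0) - celsius_to_fahrenheit(50.0))
    / (celsius_to_fahrenheit(50.0) - celsius_to_fahrenheit(25.0))
)

print(naive_ratio, round(kelvin_ratio, 3), interval_ratio_c, interval_ratio_f)
```

On the Kelvin scale the ratio is only about 1.155, while the ratio of intervals is the same in Celsius and Fahrenheit.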

Measures of Central Tendency

A statistician had his head in an oven and his feet in ice, and he said that on the average he feels fine.

Ø Relates to the way in which quantitative data tend to cluster around some value – the central value

Ø Most commonly used measures: Mean (μ), Median, Mode

Ø All have the same unit as that of the data.

What is the purpose of different measures of central tendency?

Ø An ‘outlier’ can shift the mean substantially but has little or no effect on the median
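A small illustration of how one outlier moves the mean but barely moves the median. This is a sketch using Python's standard `statistics` module; the salary figures are made up for the example.

```python
from statistics import mean, median

# Hypothetical starting offers (in thousands); not data from the slides.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]  # one unusually large offer

print(mean(salaries), median(salaries))          # mean 34.6, median 35
print(mean(with_outlier), median(with_outlier))  # mean 95.5, median 35.5
```

The mean nearly triples while the median moves by only half a unit.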


Measures of Dispersion

Ø Dispersion indicates the ‘variability’ or ‘spread’ in a variable

Ø Most commonly used measures – Variance; Standard deviation; Inter-quartile range

Ø Variance - describes how far the values lie from the mean

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Ø Standard deviation – square root of the variance

• Denoted by σ

• Standard deviation is given by:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

• Low σ → data points tend to be very close to the mean; High σ → data are spread out over a large range of values

• Sample standard deviation (s) is used as an estimator of σ:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}}$$

[What is a ‘sample’? What is an ‘estimator’? Why ‘N-1’ in the denominator instead of ‘N’?]
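The variance and standard deviation formulas can be checked on a small data set. This is a minimal sketch: the data values and function names are illustrative, and the results are compared against Python's standard `statistics` module.

```python
import math
from statistics import pstdev, pvariance, stdev

def population_variance(xs):
    # sigma^2: mean squared deviation from the population mean mu (divide by N)
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_std(xs):
    # s: deviations from the sample mean x-bar, divided by N - 1
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

sigma_squared = population_variance(data)  # 4.0
sigma = math.sqrt(sigma_squared)           # 2.0
s = sample_std(data)                       # about 2.138

print(sigma_squared, sigma, round(s, 3))
print(pvariance(data), pstdev(data), round(stdev(data), 3))
```

Dividing by N - 1 makes s slightly larger than σ computed from the same numbers.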

Skewness and Kurtosis

Ø Skewness – a measure of asymmetry of the data

• Can be positive or negative or undefined

• Negative skew → tail is to the left

• Positive skew → tail is to the right

Ø Kurtosis - a measure of the "peakedness" of the data

• High kurtosis → sharper peak and longer fatter tails

• Low kurtosis → rounded peak and shorter thinner tails
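The slides give no formula for skewness or kurtosis. The sketch below assumes one common moment-based definition (third and fourth central moments scaled by powers of the second); statistical packages may apply different corrections. The data are made up.

```python
def central_moment(xs, order):
    # Average of (x - mean)^order over all observations
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)

def skewness(xs):
    # Assumed definition: m3 / m2^(3/2)
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    # Assumed definition: m4 / m2^2 - 3 (zero for a normal distribution)
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail to the right
left_tailed = [-x for x in right_tailed]    # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)   # True: positive skew
print(skewness(left_tailed) < 0)    # True: negative skew
print(skewness(symmetric))          # 0.0
print(excess_kurtosis(symmetric))   # negative: flatter than a normal curve
```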


Quartiles

Ø Quartiles are the three values which divide the sorted data set into four equal parts

• Each part then represents one fourth of the sampled data

Ø The three quartiles:

• First quartile (Q1) = lower quartile → cuts off lowest 25% of data

• Second quartile (Q2) = median → cuts the data set into half

• Third quartile (Q3) = upper quartile → cuts off highest 25% of data

Ø No universal method for calculating quartiles

Ø One method:

Lk = N x (k / 4); where k = 1 for Q1, 2 for Q2, 3 for Q3

• If Lk is a whole number, then Qk = average of the values at positions Lk and Lk + 1 in the sorted data

• If Lk is not a whole number, then Qk = value at the position obtained by rounding Lk up to the next whole number

Ø Other method:

• Median of the data set gives Q2

• Divide the sorted data set into two halves. [If the original set has an odd number of data points, include the median in both halves.] The medians of the lower and upper halves give Q1 and Q3 respectively
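The two quartile methods described above can be sketched in Python as follows. The function names and the seven data points are illustrative; the example is chosen so that the two methods disagree, which shows why no single method is universal.

```python
import math
from statistics import median

def quartile_by_position(data, k):
    # First method: Lk = N * k / 4, using 1-based positions in the sorted data
    xs = sorted(data)
    pos = len(xs) * k / 4
    if pos == int(pos):
        i = int(pos)
        return (xs[i - 1] + xs[i]) / 2   # average of positions Lk and Lk + 1
    return xs[math.ceil(pos) - 1]        # round the position up

def quartiles_by_halves(data):
    # Second method: medians of the lower and upper halves
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half + n % 2]  # includes the median when n is odd
    upper = xs[half:]          # includes the median when n is odd
    return median(lower), median(xs), median(upper)

data = [3, 7, 8, 5, 12, 14, 21]  # made-up values, N = 7

print([quartile_by_position(data, k) for k in (1, 2, 3)])  # [5, 8, 14]
print(quartiles_by_halves(data))                           # (6.0, 8, 13.0)
```

Both methods agree on the median here but give different values for Q1 and Q3.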


Ø Interquartile range → IQR = Q3 – Q1

IQR is a more robust measure of variability than the variance or standard deviation

It is not affected much by skewness or outliers
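A quick check of the robustness claim, comparing how the IQR and the sample standard deviation react to one extreme value. This is a sketch on made-up data; `statistics.quantiles` with `method="inclusive"` is the standard library's own interpolation rule, which is not necessarily one of the quartile methods in these slides.

```python
from statistics import quantiles, stdev

def iqr(xs):
    # Q3 - Q1 using the standard library's "inclusive" quartile rule
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return q3 - q1

base = [10, 12, 13, 15, 16, 18, 19, 21]  # hypothetical readings
with_outlier = base + [100]              # one extreme value added

print(iqr(base), iqr(with_outlier))      # 5.5 versus 6.0
print(round(stdev(base), 2), round(stdev(with_outlier), 2))
```

The IQR changes by about 9 percent, while the standard deviation grows more than sevenfold.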


Histogram

Ø A graphical display of tabular frequencies

• Shown in the form of adjacent rectangles

• Provides a good visual representation of the distribution of data

Ø Important factor to be considered during construction:

• What is the ideal number of bins (denoted by ‘k’) and the corresponding (equal) width of the interval (denoted by ‘h’)?

  • Small bin width → Too many bins → Difficult to interpret

  • Large bin width → Fewer bins → Loss of information

• Some rules of thumb:

  • Sturges’ formula: k = ceiling[log2(N) + 1] = ceiling[(ln N / ln 2) + 1]

  • Scott’s choice: h = (3.5 x sample standard deviation) / (cube root of N)
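Both rules of thumb are easy to evaluate directly. The sketch below uses N = 804, the size of the car data set described later in these slides, for Sturges' formula, and a small made-up sample for Scott's choice; the function names are illustrative.

```python
import math
from statistics import stdev

def sturges_bins(n):
    # k = ceiling[log2(N) + 1]
    return math.ceil(math.log2(n) + 1)

def scott_width(xs):
    # h = 3.5 * s / N^(1/3), with s the sample standard deviation
    return 3.5 * stdev(xs) / len(xs) ** (1 / 3)

print(sturges_bins(804))  # 11 bins for 804 observations

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
print(round(scott_width(sample), 2))
```

Sturges' formula depends only on N, whereas Scott's choice also takes the spread of the data into account.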

Ø Relation between the mean, the median, and the shape of a histogram

• Symmetric: mean ≈ median

• Left (or negatively) skewed: mean < median (generally)

• Right (or positively) skewed: mean > median (generally)


Box Plot

Ø Also known as ‘box-and-whisker’ plot

Ø Provides a five number summary:

• The smallest observation (minimum)

• Lower quartile (Q1)

• Median (Q2)

• Upper quartile (Q3)

• Largest observation (maximum)

Ø Also indicates ‘outlier’ observations, if any

Ø Spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers

Box plots are very effective for comparing values in two or more categories

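Box Plot (cont…)

[Figure: box plot with the box spanning Q1 to Q3 and the median Q2 marked inside. The upper whisker ends at min(Xmax, {Q3 + [1.5 x IQR]}); the lower whisker ends at max(Xmin, {Q1 - [1.5 x IQR]}). Points beyond the whiskers are marked as outliers*.]

*Note: Generally this is the value beyond which readings are considered as outliers. However, there is no universal definition.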
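The whisker ends and outliers of a box plot can be computed directly from the 1.5 x IQR rule. This is a minimal sketch: the quartiles are taken as medians of the two halves of the sorted data (one of several possible conventions), and the readings and function names are made up for illustration.

```python
from statistics import median

def quartiles_by_halves(data):
    # Q1 and Q3 as medians of the lower and upper halves of the sorted data
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half + n % 2]  # includes the median when n is odd
    upper = xs[half:]
    return median(lower), median(xs), median(upper)

def whiskers_and_outliers(data):
    q1, _, q3 = quartiles_by_halves(data)
    iqr = q3 - q1
    lower_end = max(min(data), q1 - 1.5 * iqr)  # lower whisker end
    upper_end = min(max(data), q3 + 1.5 * iqr)  # upper whisker end
    outliers = [x for x in data if x < lower_end or x > upper_end]
    return lower_end, upper_end, outliers

readings = [3, 5, 7, 8, 12, 14, 21, 40]  # hypothetical data
lower_end, upper_end, outliers = whiskers_and_outliers(readings)
print(lower_end, upper_end, outliers)  # 3 34.75 [40]
```

Here Q1 = 6 and Q3 = 17.5, so the upper limit is 17.5 + 1.5 x 11.5 = 34.75 and the reading 40 is flagged as an outlier.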

Scatter Plot

Ø Displays values for two variables for a set of data

Ø Provides a visual representation of relationship between two variables

Stem and Leaf Display

Ø Contains features of a histogram.

Ø More informative than a histogram, since the individual data values remain visible.

Ø e.g. The stem and leaf display for 11, 12, 12, 13, 15, 15, 15, 16, 17, 20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29, 30, 31, 32, 34, 35, 37, 41, 41, 42, 45, 47, 50, 52, 53, 56, 60, 62 will be given as:

1 | 122355567
2 | 0111222346777899
3 | 012457
4 | 11257
5 | 0236
6 | 02
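The display can be generated mechanically by splitting each value into a stem (tens digit) and a leaf (units digit). This is a minimal Python sketch for two-digit data such as the values listed on this slide; the function name is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    # Group the units digits (leaves) under their tens digits (stems)
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(rows.items())]

values = [11, 12, 12, 13, 15, 15, 15, 16, 17,
          20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29,
          30, 31, 32, 34, 35, 37, 41, 41, 42, 45, 47,
          50, 52, 53, 56, 60, 62]

lines = stem_and_leaf(values)
print("\n".join(lines))
```

Turned on its side, the rows have the shape of a histogram with a bin width of 10, while every original value can still be read off.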


Pie Chart and Bar Graph

Ø A pie chart is the most illustrative way of displaying quantities as percentages of a given total.

Ø Pie charts are used to present frequencies for categorical data.

Ø The scale of measurement may be nominal or ordinal.

Ø Bar graphs are often used to display categorical data where there is no emphasis on the percentage of a total represented by each category.

Ø The scale of measurement is nominal or ordinal.
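Both charts start from the same frequency table of a categorical variable: a pie chart shows each category's share of the total, while a bar graph shows the counts themselves. The sketch below builds both summaries as plain text; the accident records are made up and only borrow the category names from the assembly line example earlier in the slides.

```python
from collections import Counter

# Hypothetical accident records (nominal scale)
accidents = ["sprain", "cut", "cut", "burn", "sprain", "cut", "cut", "sprain"]

counts = Counter(accidents)
total = sum(counts.values())

# Pie chart view: each category as a percentage of the total
percentages = {kind: 100 * n / total for kind, n in counts.items()}

# Bar graph view: one text bar per category, with length equal to the count
for kind, n in counts.most_common():
    print(f"{kind:<7} {'#' * n} ({percentages[kind]:.1f}%)")
```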


An Illustration

Ø NAME: Car Data

Ø TYPE: Multiple Regression

Ø SIZE: 804 observations, 12 variables

Ø DESCRIPTIVE ABSTRACT:

Data collected from Kelley Blue Book for several hundred 2005 used GM cars allow students to develop a multivariate regression model to determine car value based on a variety of characteristics such as miles driven, make, model, engine size, interior style, cruise control, etc.

Ø SOURCES:

For this data set, a representative sample of over eight hundred 2005 GM cars was selected, then an algorithm was developed following the 2005 Central Edition of the Kelley Blue Book to estimate retail price.

Ø VARIABLE DESCRIPTIONS:

• Price: suggested retail price of the used 2005 GM car in excellent condition. The condition of a car can greatly affect price. All cars in this data set were less than one year old when priced and considered to be in excellent condition.

• Miles: number of miles the car has been driven

• Make: manufacturer of the car such as Saturn, Pontiac, and Chevrolet

• Model: specific models for each car manufacturer such as Ion, Vibe, Cavalier

• Trim (of car): specific type of car model such as SE Sedan 4D, Quad Coupe 2D

• Type: body type such as sedan, coupe, etc.

• Cylinder: number of cylinders in the engine

• Liter: a more specific measure of engine size

• Doors: number of doors

• Cruise: indicator variable representing whether the car has cruise control (1 = cruise)

• Sound: indicator variable representing whether the car has upgraded speakers (1 = upgraded)

• Leather: indicator variable representing whether the car has leather seats (1 = leather)