Topic1_Summarizing_and_Visualizing_Data.pdf
vikas-sarangal -
Transcript of Topic1_Summarizing_and_Visualizing_Data.pdf
Some Examples of Business Situations addressed using Statistics
Ø Business school A claims that its average starting offer is higher
than that of business school B. Is the claim true?
Ø Plant X has two assembly lines. Employees face one of three kinds
of accident – sprain, cut, or burn. Do the accident patterns with
respect to type differ between the two assembly lines?
Ø A bank wants to assess the creditworthiness of a loan applicant.
Should it approve the loan or reject it?
Ø A company is considering two prices (Rs. 25000 versus Rs. 35000)
for its software. It plans two different promotion campaigns
(emphasizing the speed of the product versus its computational
power). It proposes two types of customer incentive (a 30-day free
trial versus a free gift of related software). Which combination of
price, campaign, and incentive should it choose to maximize sales?
Ø An HR manager wants to verify the claim that increased
compensation increases the motivation of employees irrespective
of their age, gender, experience level, etc. Is the claim true?
Approaches for Statistical Problems
Ø Two approaches – they are used jointly in most cases
• Descriptive statistics
• Inferential statistics
Ø Descriptive statistics
• Used to describe the main features of a collection of data quantitatively
• Aims to summarize a data set quantitatively without employing a
probabilistic formulation
Ø Inferential statistics
• Aims to draw conclusions from data that are subject to random
variation
• Used for: Estimation; Hypothesis testing; Predicting/forecasting
Some Basic Concepts
Ø Types of Variables
Ø Scales of Measurement
Ø Measures of central tendency
Ø Measures of dispersion
Ø Skewness and Kurtosis
Ø Quartiles
Ø Histogram
Ø Box plot
Ø Scatter plot
Ø Stem and Leaf Display
Ø Pie Chart and Bar Graph
Types of Variables
Ø Quantitative Variables
A quantitative variable can be described by a number for
which arithmetic operations such as averaging make sense.
Ø Qualitative Variables
A qualitative (or categorical) variable simply records a
quality.
If a number is used for distinguishing members of different
categories of a qualitative variable, the number assignment
is arbitrary.
Scales of Measurement
Ø Nominal Scale
e.g. North = 1, East = 2, South = 3, West = 4
Ø Ordinal Scale
e.g. Very good = 4, Good = 3, Fair = 2, Unacceptable = 1
Ø Interval Scale
v the value of zero is assigned arbitrarily, and therefore we
cannot take ratios of two measurements.
v but we can take ratios of intervals.
v e.g. 100 deg C is not twice as hot as 50 deg C.
Ø Ratio Scale
v we can take ratios of two measurements.
v the zero in this scale is an absolute zero.
v e.g. money - a sum of Rs. 100 is twice as large as Rs. 50.
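The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the Kelvin conversion is not part of the slides and is brought in here as a scale with an absolute zero. It shows that the ratio of two Celsius readings is not meaningful, while the ratio of two intervals is the same on both scales.

```python
# Temperatures in degrees Celsius (interval scale: the zero is arbitrary).
c_hot, c_warm, c_cool = 100.0, 50.0, 25.0

# Naive ratio of two Celsius measurements suggests "twice as hot".
celsius_ratio = c_hot / c_warm

# Kelvin has an absolute zero, so ratios of measurements are meaningful there.
KELVIN_OFFSET = 273.15  # assumption for illustration: K = deg C + 273.15
k_hot, k_warm, k_cool = (t + KELVIN_OFFSET for t in (c_hot, c_warm, c_cool))
kelvin_ratio = k_hot / k_warm

# Ratios of intervals agree on both scales.
interval_ratio_c = (c_hot - c_warm) / (c_warm - c_cool)
interval_ratio_k = (k_hot - k_warm) / (k_warm - k_cool)

print("ratio of measurements (Celsius):", celsius_ratio)
print("ratio of measurements (Kelvin): ", round(kelvin_ratio, 3))
print("ratio of intervals (Celsius):   ", interval_ratio_c)
print("ratio of intervals (Kelvin):    ", round(interval_ratio_k, 6))
```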
Measures of Central Tendency
A statistician had his head in an oven and
his feet in ice, and he said that on the
average he feels fine.
Ø Relates to the way in which quantitative data tend to cluster
around some value – the central value
Ø Most commonly used measures: Mean (μ), Median, Mode
Ø All have the same unit as that of the data.
What is the purpose of different measures of central tendency?
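One way to approach this question is to compute all three measures on a small data set with a single extreme value. The figures below are hypothetical and chosen only for illustration; the sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical starting offers (in thousands); one value is extreme.
offers = [30, 32, 35, 35, 38, 40, 250]

mean = statistics.mean(offers)      # pulled upwards by the extreme value
median = statistics.median(offers)  # middle value of the sorted data
mode = statistics.mode(offers)      # most frequent value

print("mean:  ", round(mean, 2))
print("median:", median)
print("mode:  ", mode)
```

Here the mean lies well above most of the observations, while the median and mode stay with the bulk of the data, which is one reason more than one measure is reported.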
Measures of Dispersion
Ø Dispersion indicates the ‘variability’ or ‘spread’ in a variable
Ø Most commonly used measures – Variance; Standard deviation;
Inter-quartile range
Ø Variance - describes how far the values lie from the mean
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Ø Standard deviation – square root of the variance
• Denoted by σ
• Standard deviation is given by:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
• Low σ → data points tend to be very close to the mean;
High σ → data is spread out over a large range of values
• Sample standard deviation (s) is used as an estimator of σ:
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
[What is a ‘sample’? What is an ‘estimator’? Why ‘N-1’ in the denominator
instead of ‘N’?]
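The variance and the two standard deviations can be computed directly from their definitions. The following is a minimal sketch; the function names and the data are illustrative and not taken from the slides.

```python
import math

def population_variance(xs):
    """Average squared deviation from the mean (divides by N)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def population_std(xs):
    """Square root of the population variance (sigma)."""
    return math.sqrt(population_variance(xs))

def sample_std(xs):
    """Sample standard deviation s (divides by N - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean = 5

print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
print(round(sample_std(data), 4))
```

Because s divides by N - 1 rather than N, it is always slightly larger than σ computed on the same values; the standard library's `statistics.pstdev` and `statistics.stdev` implement the same two quantities.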
Skewness and Kurtosis
Ø Skewness – a measure of asymmetry of the data
• Can be positive or negative or undefined
• Negative skew → tail is to the left
• Positive skew → tail is to the right
Ø Kurtosis - a measure of the "peakedness" of the data
• High kurtosis → sharper peak and longer fatter tails
• Low kurtosis → rounded peak and shorter thinner tails
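The slides do not give formulas for these two measures. The sketch below assumes the common moment-based definitions (skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the k-th central moment); other definitions exist, and statistical software may apply bias corrections or subtract 3 from the kurtosis.

```python
def central_moment(xs, k):
    """k-th central moment: average of (x - mean)^k."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def skewness(xs):
    # Raises ZeroDivisionError if all values are equal (skewness undefined).
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    # Under this definition a normal distribution has kurtosis 3.
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # illustrative data, long right tail
left_tailed = [-x for x in right_tailed]   # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

print("right-tailed skewness:", round(skewness(right_tailed), 3))
print("left-tailed skewness: ", round(skewness(left_tailed), 3))
print("symmetric skewness:   ", skewness(symmetric))
print("symmetric kurtosis:   ", round(kurtosis(symmetric), 3))
```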
Quartiles
Ø Quartiles are the three values which divide the sorted data set
into four equal parts
• Each part then represents one fourth of the sampled data
Ø The three quartiles:
• First quartile (Q1) = lower quartile → cuts off lowest 25% of data
• Second quartile (Q2) = median → cuts the data set into half
• Third quartile (Q3) = upper quartile → cuts off highest 25% of data
Ø No universal method for calculating quartiles
Ø One method:
Lk = N × (k / 4); where k = 1 for Q1, 2 for Q2, 3 for Q3
v If Lk is a whole number, then Qk = average of the values at
positions Lk and Lk+1 of the sorted data
v If Lk is not a whole number, then Qk = value at the position obtained
by rounding Lk up to the next whole number
Ø Other method:
v Median of the data set gives Q2
v Divide the data set into two halves. [If the original set has an odd
number of data points, include the median in both halves.] The medians
of the upper and lower halves give Q3 and Q1 respectively
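Both methods can be sketched in a few lines of Python. The function names and the seven-value data set are illustrative only; the example is chosen so that the two methods disagree, which shows why no single quartile value should be treated as universal.

```python
import statistics

def quartile_by_position(data, k):
    """Method 1: Lk = N * k / 4, with positions counted from 1."""
    xs = sorted(data)
    whole, remainder = divmod(len(xs) * k, 4)
    if remainder == 0:
        # Lk is a whole number: average the values at positions Lk and Lk + 1.
        return (xs[whole - 1] + xs[whole]) / 2
    # Lk is fractional: round up to position whole + 1 (index whole).
    return xs[whole]

def quartiles_by_halves(data):
    """Method 2: medians of the lower half, the full set, and the upper half."""
    xs = sorted(data)
    half = len(xs) // 2
    if len(xs) % 2 == 0:
        lower, upper = xs[:half], xs[half:]
    else:
        # Odd number of points: the median is included in both halves.
        lower, upper = xs[:half + 1], xs[half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

data = [2, 4, 6, 8, 10, 12, 14]

print([quartile_by_position(data, k) for k in (1, 2, 3)])  # [4, 8, 12]
print(quartiles_by_halves(data))                           # (5.0, 8, 11.0)
```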
Ø Interquartile range → IQR = Q3 – Q1
IQR is a more robust measure of variability than the standard deviation
It is not affected much by skewness or outliers
Histogram
Ø A graphical display of tabular frequencies
• Shown in the form of adjacent rectangles
• Provides a good visual representation of the distribution of data
Ø Important factor to consider during construction:
• What is the ideal number of bins (denoted by ‘k’) and the corresponding
(equal) width of the interval (denoted by ‘h’)?
v Small bin width → Too many bins → Difficult to interpret
v Large bin width → Too few bins → Loss of information
• Some rules of thumb:
v Sturges’ formula: k = ceiling[log2(N) + 1] = ceiling[(ln N / ln 2) + 1]
v Scott’s choice: h = (3.5 × sample standard deviation) / (cube root of N)
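The two rules of thumb translate directly into code. This is a small sketch with illustrative function names. N = 804 is the size of the car data set described later in this transcript, and N = 42 is the number of values in the stem and leaf example.

```python
import math
import statistics

def sturges_bins(n):
    """Number of bins k = ceiling(log2(N) + 1)."""
    return math.ceil(math.log2(n) + 1)

def scott_width(data):
    """Bin width h = 3.5 * s / N^(1/3), with s the sample standard deviation."""
    return 3.5 * statistics.stdev(data) / len(data) ** (1 / 3)

print(sturges_bins(804))  # 11
print(sturges_bins(42))   # 7

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(round(scott_width(sample), 3))
```

Sturges' formula depends only on the number of observations, while Scott's choice also uses their spread, so the two rules can suggest quite different histograms for the same data.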
Ø Identifying relation between mean, median, and shape of a
histogram
• Symmetric: mean ≈ median
• Left (or negatively) skewed: mean < median (generally)
• Right (or positively) skewed: mean > median (generally)
Box Plot
Ø Also known as ‘box-and-whisker’ plot
Ø Provides a five number summary:
• The smallest observation (minimum)
• Lower quartile (Q1)
• Median (Q2)
• Upper quartile (Q3)
• Largest observation (maximum)
Ø Also indicates ‘outlier’ observations, if any
Ø Spacings between the different parts of the box help indicate: the
degree of dispersion (spread) and skewness in the data, and identify
outliers
Box plots are very effective for comparing values across two or more
categories.
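[Figure: box plot. The box spans Q1 to Q3 with a line at Q2 (the median). The upper whisker ends at min(Xmax, Q3 + [1.5 × IQR]) and the lower whisker ends at max(Xmin, Q1 − [1.5 × IQR]). Points beyond the whiskers are marked as outliers*.]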
*Note: Generally this is the value beyond which readings are considered as
outliers. However, there is no universal definition.
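The whisker limits and the outlier rule can be computed without drawing the plot. The sketch below assumes the "median of halves" quartile method described earlier and the 1.5 × IQR convention from the figure; the function name and the data are illustrative, and other quartile methods would shift the limits slightly.

```python
import statistics

def box_plot_summary(data):
    xs = sorted(data)
    half = len(xs) // 2
    # For an odd number of points the median is included in both halves.
    lower_half = xs[:half + len(xs) % 2]
    upper_half = xs[half:]
    q1 = statistics.median(lower_half)
    q2 = statistics.median(xs)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    # Whisker ends as given in the figure.
    lower_whisker = max(xs[0], q1 - 1.5 * iqr)
    upper_whisker = min(xs[-1], q3 + 1.5 * iqr)
    outliers = [x for x in xs if x < lower_whisker or x > upper_whisker]
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 30]  # illustrative values
summary = box_plot_summary(data)
for name, value in summary.items():
    print(name, value)
```

For this data the only reading flagged as an outlier is 30, which lies above Q3 + 1.5 × IQR = 18.5.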
Scatter Plot
Ø Displays values of two variables for a set of data
Ø Provides a visual representation of the relationship between two variables
Stem and Leaf Display
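Ø Contains features of a histogram: the lengths of the rows show the
shape of the distribution.
Ø More informative than a histogram, because the individual data values
are retained.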
Ø e.g. The stem and leaf display for 11, 12, 12, 13, 15, 15, 15, 16, 17,
20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29, 30, 31,
32, 34, 35, 37, 41, 41, 42, 45, 47, 50, 52, 53, 56, 60, 62 will be given
as:
1 | 122355567
2 | 0111222346777899
3 | 012457
4 | 11257
5 | 0236
6 | 02
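A display like this can be generated mechanically. The sketch below assumes non-negative integers, with the tens part as the stem and the units digit as the leaf; the function name is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one 'stem | leaves' string per stem, in ascending order."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens part and units digit
        rows[stem].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(rows.items())]

values = [11, 12, 12, 13, 15, 15, 15, 16, 17,
          20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29,
          30, 31, 32, 34, 35, 37, 41, 41, 42, 45, 47,
          50, 52, 53, 56, 60, 62]

display = stem_and_leaf(values)
print("\n".join(display))
```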
Pie Chart and Bar Graph
Ø A pie chart is the most illustrative way of displaying quantities as
percentages of a given total.
Ø Pie charts are used to present frequencies for categorical data.
Ø The scale of measurement may be nominal or ordinal
Ø Bar graphs are often used to display categorical data where there is
no emphasis on the percentage of a total represented by each
category.
Ø The scale of measurement is nominal or ordinal.
An Illustration
Ø NAME: Car Data
Ø TYPE: Multiple Regression
Ø SIZE: 804 observations, 12 variables
Ø DESCRIPTIVE ABSTRACT:
Data collected from Kelley Blue Book for several hundred 2005
used GM cars allows students to develop a multivariate
regression model to determine their car value based on a variety
of characteristics such as miles driven, make, model, engine
size, interior style, cruise control, etc.
Ø SOURCES:
For this data set, a representative sample of over eight hundred
2005 GM cars was selected; then an algorithm was developed
following the 2005 Central Edition of the Kelley Blue Book to
estimate retail price.
Ø VARIABLE DESCRIPTIONS:
v Price: suggested retail price of the used 2005 GM car in excellent
condition. The condition of a car can greatly affect price. All cars in
this data set were less than one year old when priced and considered
to be in excellent condition.
v Miles: number of miles the car has been driven
v Make: manufacturer of the car such as Saturn, Pontiac, and Chevrolet
v Model: specific models for each car manufacturer such as Ion, Vibe,
Cavalier
v Trim (of car): specific type of car model such as SE Sedan 4D, Quad
Coupe 2D
v Type: body type such as sedan, coupe, etc.
v Cylinder: number of cylinders in the engine
v Liter: a more specific measure of engine size
v Doors: number of doors
v Cruise: indicator variable representing whether the car has cruise
control (1 = cruise)
v Sound: indicator variable representing whether the car has upgraded
speakers (1 = upgraded)
v Leather: indicator variable representing whether the car has leather
seats (1 = leather)