Chapter 1 Overview and Descriptive Statisticsbaek.math.umbc.edu/stat355/ch1.pdfSeungchul Baek STAT...
Chapter 1 Overview and Descriptive Statistics
Seungchul Baek
STAT 355 Introduction to Probability and Statistics for Scientists and Engineers
Seungchul Baek STAT 355 Introduction to Probability and Statistics for Scientists and Engineers 1
Have you ever learned statistics?
Learned, and still remember some
Learned, but have forgotten everything
Heard but never learned
What is statistics?
What Is Statistics?
Statistics measures uncertainty in real life.
Statistics is the science of data: how to interpret data, analyze data, and design studies to collect data.
Statistics is used in all disciplines, not just in engineering and the sciences.
Statistics Examples
In a reliability (time to failure) study, engineers are interested in describing the time until failure for an electronic device.
In an agricultural experiment, researchers want to know which of four fertilizers produces the highest corn yield.
In a clinical trial, physicians want to determine which of two drugs is more effective for treating HIV in the early stages of the disease.
In a social network analysis, researchers want to know the group patterns among all the users.
What Statisticians Do Is. . .
Statisticians use their skills in mathematics and computing to formulate statistical models and analyze data for a specific problem at hand.
Models are then used to estimate important quantities of interest, to test the validity of proposed conjectures, and to predict future behavior.
Being able to identify and model sources of variability is an important part of statistics.
Definitions
Subject: entities that we measure in a study
Population: the total set of subjects in which we are interested
Sample: the subset of the population for whom we have data, often randomly selected
Variable: any characteristic that is observed for the subject
Statistic: numerical summary of a sample (we know it) ex. mean, median, etc.
Parameter: numerical summary of a population (we don’t know it) ex. mean, median, variance, etc.
Example
Old McDonald’s farm has 5000 turkeys and we’re interested in estimating the average weight of all the turkeys. Instead of weighing all 5000, we only weigh 100 randomly selected turkeys.
What are the subject, population, sample, and variable in this example?
Example
Last semester there were 243 STAT 355 students. We wanted to approximate the average height of a STAT 355 student, so we measured the height of 40 students. The average height of the 40 students was 165 cm. After that, we found the mandatory physicals record of all students, in which the average height of all 243 STAT 355 students was 172 cm.
What are the subject, population, sample, variable, statistic, and parameter?
Major Components to Statistics
Descriptive Statistics
What summary can help us answer the question?
Inferential Statistics (or Statistical Inference)
Can we predict or draw conclusions based on the data we have?
Types of Variables
Variable: any characteristic that is observed for the subject. There are two types of variables: categorical and quantitative.
Categorical: Observations that belong to a set of categories.
ex. hair color, gender, zip code, etc.
Quantitative: Observations that take on numerical values
ex. height, weight, income, etc.
Types of Variables
Quantitative: Observations that take on numerical values
Discrete: measured by a whole number
ex. number of books, children, money, etc
Continuous: measured on an interval
ex. time, weight, distances
How to Compare Discrete and Continuous
If you think of time: going from 1 min to 2 min, we have to hit all of the times in between, e.g. 1.5 min or 1 min 30 sec.
If you think of weight: going from 150 lbs to 140 lbs, we have to pass through every weight between 140 and 150, e.g. 144 lbs.
If you think of the number of books or children: we jump from one number to the next; 2.5 books or 1.5 children means nothing.
Time and weight are continuous variables. Books and children are discrete variables.
Example
Let’s consider a random sample of five residents of Ellicott City.
  Days Piercings Gym    Type Age Gender
1    2         0  No Neither  46      F
2    3         1 Yes     Run  21      F
3    1         0 Yes     Run  64      M
4    6         2 Yes    Both  18      M
5    0         0  No Neither  19      F
Days: Number of days spent on workout weekly
Piercings: Number of body piercings
Gym: Do they go to the gym or not?
Type: Do they lift, run, neither or both?
Example
Which variables are categorical?
Which variables are quantitative (discrete)?
Which variables are quantitative (continuous)?
Categorical Summary: Frequency Table
Let’s say we had 160 people in our sample instead of the 5 in the previous example, and we want a better look at the type of workout that residents of Ellicott City do.
     Type Frequency
1    Lift        32
2     Run        64
3    Both        16
4 Neither        48
5   Total       160
Categorical Summary: Frequency Table
     Type Frequency Relative.Frequency
1    Lift        32                0.2
2     Run        64                0.4
3    Both        16                0.1
4 Neither        48                0.3
5   Total       160                1.0
The relative frequency is the proportion of the total sample (here, 160) that falls in the category we’re looking at.
$$\text{Relative Frequency} = \frac{\text{\# of subjects in each case}}{\text{total \# of subjects}}$$
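The relative-frequency calculation can be sketched in a few lines of Python. The raw list `workouts` and the helper `frequency_table` are hypothetical; the list is built only so that its counts match the table in the slides.

```python
from collections import Counter

# Hypothetical raw data: one workout type per person, constructed to match
# the counts in the slide's table (32 Lift, 64 Run, 16 Both, 48 Neither).
workouts = ["Lift"] * 32 + ["Run"] * 64 + ["Both"] * 16 + ["Neither"] * 48


def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


table = frequency_table(workouts)
for category, (freq, rel) in table.items():
    print(f"{category:<8}{freq:>5}{rel:>8.2f}")
```

The relative frequencies always sum to 1, which is a quick sanity check on any frequency table.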
Graphs
Stem-and-Leaf Plot: Suitable when the data set is small; retains the actual data values
Dot Plot: Suitable when the data set is small and there are relatively few distinct data values
Pie Chart: Useful when there are a small number of categories
Bar Graph: Useful when there are many categories of the variable, and useful for comparing groups
Histogram: Good for large data sets and for showing the shape of the distribution
Shape of Distributions
Symmetric if the right and left sides of the histogram are approximately mirror images of each other
Skewed to the right (positively skewed) if the right “tail” extends much farther out than the left tail
Skewed to the left (negatively skewed) if the left “tail” extends much farther out than the right tail
Uniform if all bars are the same height
Bimodal if the histogram has two distinct peaks (two bars clearly higher than those around them)
Measures of Central Tendency: Mean
The sample mean $\bar{X}$ of observations $x_1, \ldots, x_n$ is given by
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Measures of Central Tendency: Median
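The median $\tilde{X}$ is the midpoint of the observations when they are ordered from smallest to largest.
$$\tilde{X} = \begin{cases} X_{(m)}, & \text{if } n \text{ is odd} \\ \dfrac{X_{(m)} + X_{(m+1)}}{2}, & \text{if } n \text{ is even,} \end{cases}$$
where $m = (n + 1)/2$ when $n$ is odd and $m = n/2$ when $n$ is even. $X_{(m)}$ stands for the $m$-th observation in the ordered sample.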
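The case-by-case definition of the median translates directly into code. The following is a minimal sketch; the function name `median_by_definition` and the two small test samples are made up for illustration.

```python
def median_by_definition(observations):
    """Median computed with the odd/even rule applied to the ordered sample."""
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        m = (n + 1) // 2                       # 1-based position of the middle value
        return ordered[m - 1]                  # X_(m); Python lists are 0-based
    m = n // 2
    return (ordered[m - 1] + ordered[m]) / 2   # (X_(m) + X_(m+1)) / 2


print(median_by_definition([5, 1, 3]))      # odd n: the middle ordered value
print(median_by_definition([4, 1, 3, 2]))   # even n: average of the two middle values
```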
Measures of Central Tendency: Mode
The mode is the value that shows up the most in the data set. A unique mode doesn’t necessarily exist when there is a tie.
Example
We have a data set whose size is 14.
2, 7, 7, 11, 12, 15, 14, 20, 5, 6, 15, 12, 12, 20
Mean?
Median?
Mode?
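One way to check hand calculations for this example is Python's standard `statistics` module. This is only a sketch for verification; `multimode` returns every most-frequent value, so a tie would show up as a list with more than one entry.

```python
import statistics

data = [2, 7, 7, 11, 12, 15, 14, 20, 5, 6, 15, 12, 12, 20]

mean = statistics.mean(data)         # sum of the values divided by n
median = statistics.median(data)     # n is even, so the two middle values are averaged
modes = statistics.multimode(data)   # list of all most frequent values

print(f"n = {len(data)}, mean = {mean:.2f}, median = {median}, mode(s) = {modes}")
```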
Measures of Variability: Range
The range is the difference between the maximum and minimum observations.
It is easy to calculate but relies on only two values, which may be outliers.
$$\text{Range} = \max_i (x_i) - \min_i (x_i)$$
Measures of Variability: Variance
The sample variance $s^2$ is, roughly, the average squared deviation of each observation from the mean (the sum of squared deviations is divided by $n - 1$ rather than $n$).
The idea is that it measures the spread of the data about the mean.
It is difficult to interpret because it’s in squared units. It cannot be negative, and it is zero only when all data points are equal.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$
The sample standard deviation $s$ is the positive square root of the variance,
$$s = \sqrt{s^2}$$
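The variance and standard deviation formulas can be sketched as follows. The function `sample_variance` and the five-value sample are hypothetical and chosen only for illustration; `statistics.variance` from the standard library is printed as a cross-check, since it also divides by n - 1.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


xs = [1, 2, 3, 4, 5]   # hypothetical data, not from the slides
s2 = sample_variance(xs)
s = math.sqrt(s2)      # standard deviation: positive square root of the variance

print(s2, s)
print(statistics.variance(xs), statistics.stdev(xs))   # library cross-check
```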
Computing Formula for $s^2$
It is not hard to show that
$$\sum_{i=1}^{n} (x_i - \bar{X})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{X}^2$$
We will encounter a similar thing later:
$$\mathrm{var}(X) = E\{X - E(X)\}^2 = E(X^2) - \{E(X)\}^2$$
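For readers who want the missing steps, the sum-of-squares identity on this slide can be verified by expanding the square and using the fact that the observations sum to $n\bar{X}$. This is a standard derivation, written out here as a sketch:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{X})^2
  &= \sum_{i=1}^{n} \left( x_i^2 - 2\bar{X}x_i + \bar{X}^2 \right) \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{X}\sum_{i=1}^{n} x_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{X}(n\bar{X}) + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{X}^2 .
\end{align*}
```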
Proposition
We have a sample $x_1, \ldots, x_n$, and let $c$ be a constant.
If $y_i = x_i + c$ for $i = 1, \ldots, n$, then $s_y^2 = s_x^2$.
If $y_i = c x_i$ for $i = 1, \ldots, n$, then $s_y^2 = c^2 s_x^2$.
These follow from the properties below, which we will learn in later chapters.
$$\mathrm{var}(X + c) = \mathrm{var}(X)$$
$$\mathrm{var}(cX) = c^2 \mathrm{var}(X)$$
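A quick numerical check can make the proposition concrete. The sample `xs` and the constant `c` below are hypothetical values picked for illustration; this demonstrates the two properties on one data set and is not a proof.

```python
import statistics

xs = [2.0, 4.0, 4.0, 5.0, 7.0]   # hypothetical sample
c = 3.0                          # hypothetical constant

shifted = [x + c for x in xs]    # y_i = x_i + c
scaled = [c * x for x in xs]     # y_i = c * x_i

var_x = statistics.variance(xs)            # sample variance (divides by n - 1)
var_shifted = statistics.variance(shifted)
var_scaled = statistics.variance(scaled)

print(var_x, var_shifted)          # shifting leaves the variance unchanged
print(var_scaled, c**2 * var_x)    # scaling multiplies the variance by c squared
```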
Example
We have the following data:
0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7
What kind of data type is this?
Draw a dot plot.
Draw a stem-and-leaf plot.
Draw a histogram.
Mean, median, and mode?
Range, variance, and standard deviation?
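The numerical summaries asked for in this example can be checked with a short script. This sketch uses Python's standard `statistics` module; every value in this data set occurs exactly once, so there is no single most frequent value to report as a mode.

```python
import statistics

data = [0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7]

mean = statistics.mean(data)
median = statistics.median(data)
data_range = max(data) - min(data)
variance = statistics.variance(data)   # divides by n - 1
std_dev = statistics.stdev(data)       # square root of the sample variance

print(f"mean={mean:.3f} median={median:.3f} range={data_range:.3f}")
print(f"variance={variance:.3f} std_dev={std_dev:.3f}")
```

Comparing the mean with the median is a simple way to see how much the two largest observations pull the mean upward.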
Percentiles
Percentile: the p-th percentile is a value such that p percent of the observations fall at or below the value.
Consider an ordered population of 10 data values:
3, 6, 7, 8, 8, 10, 13, 15, 16, 20
What are the 70th and 15th percentiles?
70th percentile = (0.7 * 10)th position = 7th position = 13
15th percentile = (0.15 * 10)th position = 1.5th position, rounded up to the 2nd position = 6
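The position rule used in this worked example can be sketched as a small function. The name `percentile_by_position` is hypothetical, and taking `p` as an integer percent is an assumption made here so that the position is computed with integer arithmetic, because a floating-point product such as 0.7 * 10 can land slightly off a whole number. Statistical software often uses other conventions (for example, interpolation), so results may differ.

```python
def percentile_by_position(ordered, p):
    """p-th percentile via the position rule from the worked example.

    `ordered` must already be sorted ascending; `p` is an integer percent
    from 1 to 100. The position p * n / 100 is rounded up when it is not
    a whole number.
    """
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    position = -(-p * n // 100)    # integer ceiling of p * n / 100 (1-based)
    return ordered[position - 1]   # convert to a 0-based index


values = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
print(percentile_by_position(values, 70))   # 7th position
print(percentile_by_position(values, 15))   # position 1.5, rounded up to the 2nd
```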
Percentile and Quartile
The term quartile is used because the quartiles divide the data into quarters.
Q1: the observation at the 25th percentile
Q2: the observation at the 50th percentile (Median)
Q3: the observation at the 75th percentile
IQR (interquartile range) = Q3 - Q1: another measure of variability.
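As a rough illustration, quartiles and the IQR can be computed with `statistics.quantiles` from Python's standard library. Note the assumption: its default method interpolates between observations, so Q1 and Q3 may not match values found by a simple position rule; the data below are the ten values from the percentile example.

```python
import statistics

values = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
# The default method interpolates between neighboring observations.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```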
Example
We have two samples, one of which is 3, 6, 7, 8, 8, 10, 13, 15, 16, 20 and the other is 3, 6, 7, 8, 8, 10, 13, 15, 16, 20, 40.
[Figure: side-by-side boxplots of sample 1 and sample 2; vertical axis from 10 to 40]
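To compare the two samples numerically, the sketch below computes the range and the IQR for each. The helper `spread_summary` is hypothetical, and the quartiles come from `statistics.quantiles` with its default interpolating method, which is an assumption rather than the method used to draw the boxplots.

```python
import statistics

sample_1 = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
sample_2 = sample_1 + [40]   # same values plus one large observation


def spread_summary(xs):
    """Return (range, IQR) for a sample."""
    q1, _, q3 = statistics.quantiles(xs, n=4)   # default interpolating method
    return max(xs) - min(xs), q3 - q1


for name, xs in [("sample 1", sample_1), ("sample 2", sample_2)]:
    value_range, iqr = spread_summary(xs)
    print(f"{name}: median={statistics.median(xs)}, range={value_range}, IQR={iqr}")
```

Adding the single value 40 raises the range from 17 to 37, while the IQR moves far less, which is why the IQR is often preferred when outliers are present.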
Shape of Distributions and Boxplots
Bell shape
Skewed to the right
Skewed to the left
Location of Mean, Median, and Mode
Bell shape
Skewed to the right
Skewed to the left