Transcript of Statistics

Page 1: Statistics

the statistical analysis of data

by Dr. Dang Quang A & Dr. Bui The Hong

Hanoi Institute of Information Technology

Page 2: Statistics

Preface

• Statistics is the science of collecting, organizing and interpreting numerical and nonnumerical facts, which we call data.
• The collection and study of data are important in the work of many professions, so training in the science of statistics is valuable preparation for a variety of careers, for example for economists, financial advisors, business people, engineers and farmers.
• Knowledge of probability and statistical methods is also useful for informatics specialists in various fields such as data mining, knowledge discovery, neural networks and fuzzy systems.
• Whatever else it may be, statistics is, first and foremost, a collection of tools used for converting raw data into information that helps decision makers in their work. The science of data - statistics - is the subject of this course.

Page 3: Statistics

Audience and objectives

• Audience: This tutorial, an introductory course in statistics, is intended mainly for users such as engineers, economists and managers who need to apply statistical methods in their work, and for students. In many respects it will also be useful for computer trainers.
• Objectives:
– Understanding statistical reasoning
– Mastering basic statistical methods for analyzing data, such as descriptive and inferential methods
– Ability to use statistical methods in practice with the help of statistical computer software
• Entry requirements: a high school algebra course (plus elements of calculus) and skill in working with a computer

Page 4: Statistics

Contents

Preface
Chapter 1 Introduction
Chapter 2 Data presentation
Chapter 3 Data characteristics: descriptive summary statistics
Chapter 4 Probability: basic concepts
Chapter 5 Basic probability distributions
Chapter 6 Sampling distributions
Chapter 7 Estimation
Chapter 8 General concepts of hypothesis testing
Chapter 9 Applications of hypothesis testing
Chapter 10 Categorical data analysis and analysis of variance
Chapter 11 Simple linear regression and correlation
Chapter 12 Multiple regression
Chapter 13 Nonparametric statistics
References
Appendix A
Appendix B
Appendix C
Appendix D
Index

Page 5: Statistics

Chapter 1 Introduction

1.1 What is statistics?
– Whatever else it may be, statistics is, first and foremost, a collection of tools used for converting raw data into information that helps decision makers in their work.

1.2 Populations and samples
– A population is a whole, and a sample is a fraction of the whole.
– A population is a collection of all the elements we are studying and about which we are trying to draw conclusions. Such a population is often referred to as the target population.
– A sample is a collection of some, but not all, of the elements of the population.

1.3 Descriptive and inferential statistics
– Descriptive statistics is devoted to the summarization and description of data (a population or a sample).
– Inferential statistics uses sample data to make an inference about a population.

1.4 Brief history of statistics
1.5 Computer software for statistical analysis

Page 6: Statistics

Chapter 2 Data presentation

2.1 Introduction
• The objective of data description is to summarize the characteristics of a data set. Ultimately, we want to make the data set more comprehensible and meaningful. In this chapter we will show how to construct charts and graphs that convey the nature of a data set. The procedure we use to accomplish this objective depends on the type of data.

2.2 Types of data
• Quantitative data are observations measured on a numerical scale. Nonnumerical data that can only be classified into categories are said to be qualitative data.

2.3 Qualitative data presentation
• Category frequency = the number of observations that fall in that category.
• Relative frequency = the proportion of the total number of observations that fall in that category.
• Percentage for a category = relative frequency for the category x 100%.

2.4 Graphical description of qualitative data
• Bar graphs and pie charts
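
As a small illustration of the frequency definitions in Section 2.3, the sketch below tabulates a set of qualitative data in Python; the category labels and counts are invented for the example.

```python
from collections import Counter

# Hypothetical qualitative data: preferred statistical software of 10 students
data = ["R", "SPSS", "R", "Excel", "R", "SPSS", "Excel", "R", "SPSS", "R"]

counts = Counter(data)                      # category frequencies
n = len(data)
for category, freq in counts.items():
    rel_freq = freq / n                     # relative frequency
    print(f"{category:6s} frequency={freq}  relative={rel_freq:.2f}  "
          f"percentage={rel_freq * 100:.0f}%")
```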

Page 7: Statistics

Chapter 2 (continued 1)

2.5 Graphical description of quantitative data: Stem and Leaf displays

• A stem and leaf display is widely used in exploratory data analysis when the data set is small.
• Steps to follow in constructing a stem and leaf display
• Advantages and disadvantages of a stem and leaf display

2.6 Tabulating quantitative data: relative frequency distributions

• A frequency distribution is a table that organizes data into classes.
• Class frequency = the number of observations that fall into the class.
• Class relative frequency = class frequency / total number of observations.
• Relative class percentage = class relative frequency x 100%.

2.7 Graphical description of quantitative data: histogram and polygon
• Frequency histogram, relative frequency histogram and percentage histogram
• Frequency polygon, relative frequency polygon and percentage polygon

2.8 Cumulative distributions and cumulative polygons
2.9 Exercises
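
A minimal sketch of Sections 2.6 to 2.8 using NumPy, on invented data values: class frequencies, relative frequencies and cumulative frequencies for a small quantitative data set.

```python
import numpy as np

data = np.array([12, 15, 17, 21, 22, 24, 24, 27, 30, 31, 35, 38])  # hypothetical data

# Group the observations into 4 equal-width classes
freq, edges = np.histogram(data, bins=4)
rel_freq = freq / data.size           # class relative frequencies
cum_freq = np.cumsum(freq)            # cumulative frequencies (for a cumulative polygon)

for i in range(len(freq)):
    print(f"[{edges[i]:5.1f}, {edges[i+1]:5.1f})  "
          f"freq={freq[i]}  rel={rel_freq[i]:.3f}  cum={cum_freq[i]}")
```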

Page 8: Statistics

Chapter 3 Data characteristics: descriptive summary statistics

3.1 Introduction
3.2 Types of numerical descriptive measures
3.3 Measures of central tendency
3.4 Measures of data variation
3.5 Measures of relative standing
3.6 Shape
3.7 Methods for detecting outliers
3.8 Calculating some statistics from grouped data
3.9 Computing descriptive summary statistics using computer software
3.10 Exercises

Page 9: Statistics

Chapter 3 (continued 1)

3.2 Types of numerical descriptive measures: Location, Dispersion, Relative standing and Shape

3.3 Measures of location (or central tendency)
3.3.1 Mean
3.3.2 Median
3.3.3 Mode
3.3.4 Geometric mean

3.4 Measures of data variation
3.4.1 Range
3.4.2 Variance and standard deviation
– Uses of the standard deviation: Chebyshev's Theorem, the Empirical Rule
3.4.3 Relative dispersion: the coefficient of variation
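
The central-tendency and variation measures of Sections 3.3 and 3.4 can be computed directly with Python's statistics module; a sketch on an invented sample:

```python
import statistics as st

data = [4.2, 4.8, 5.1, 5.6, 5.6, 6.0, 6.3, 7.1, 7.4, 8.0]   # hypothetical sample

mean   = st.mean(data)
median = st.median(data)
mode   = st.mode(data)                 # most frequent value (5.6 here)
gmean  = st.geometric_mean(data)       # geometric mean
rng    = max(data) - min(data)         # range
var    = st.variance(data)             # sample variance (divisor n - 1)
sd     = st.stdev(data)                # sample standard deviation
cv     = sd / mean * 100               # coefficient of variation, in percent

print(mean, median, mode, round(gmean, 3), rng, round(var, 3), round(sd, 3), round(cv, 1))
```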

Page 10: Statistics

Chapter 3 (continued 2)

3.5 Measures of relative standing
• Descriptive measures that locate the relative position of an observation in relation to the other observations are called measures of relative standing.
• The pth percentile is a number such that p% of the observations of the data set fall below it and (100 - p)% of the observations fall above it.
– Lower quartile = 25th percentile, mid-quartile = 50th percentile, upper quartile = 75th percentile
– Interquartile range, z-score

3.6 Shape
3.6.1 Skewness
3.6.2 Kurtosis

3.7 Methods for detecting outliers
3.8 Calculating some statistics from grouped data
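
For the measures of relative standing and shape in Sections 3.5 and 3.6, a short sketch using NumPy and SciPy on invented data (the last value is deliberately extreme, to show how z-scores flag a candidate outlier):

```python
import numpy as np
from scipy import stats

data = np.array([3, 5, 6, 7, 8, 8, 9, 11, 12, 15, 18, 40.0])   # hypothetical data

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # lower quartile, median, upper quartile
iqr = q3 - q1                                    # interquartile range
z = stats.zscore(data, ddof=1)                   # z-scores relative to the sample mean and s

print("quartiles:", q1, q2, q3, "IQR:", iqr)
print("largest z-score:", z.max())               # |z| much larger than the rest suggests an outlier
print("skewness:", stats.skew(data))             # > 0: skewed to the right
print("kurtosis:", stats.kurtosis(data))         # excess kurtosis (0 for a normal distribution)
```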

Page 11: Statistics

Chapter 4. Probability: Basic concepts

4.1 Experiment, Events and Probability of an Event
4.2 Approaches to probability
4.3 The field of events
4.4 Definitions of probability
4.5 Conditional probability and independence
4.6 Rules for calculating probability
4.7 Exercises

Page 12: Statistics

Chapter 4 (continued 1)

4.1 Experiment, Events and Probability of an Event
• The process of making an observation or recording a measurement under a given set of conditions is a trial or experiment.
• Outcomes of an experiment are called events. We denote events by capital letters A, B, C, …
• The probability of an event A, denoted by P(A), is, in general, the chance that A will happen.

4.2 Approaches to probability
• Definitions of probability as a quantitative measure of the "degree of certainty" of the observer of an experiment.
• Definitions that reduce the concept of probability to the more primitive notion of "equal likelihood" (the so-called "classical" definition).
• Definitions that take as their point of departure the "relative frequency" of occurrence of the event in a large number of trials (the "statistical" definition).

Page 13: Statistics

Chapter 4 (continued 2)

4.3 The field of events
• Definitions of and relations between events: A implies B, A and B are equivalent (A = B), product or intersection of the events A and B (AB), sum or union of A and B (A + B), difference of A and B (A - B or A\B), certain (or sure) event, impossible event, complement of A, mutually exclusive events, simple (or elementary) events, sample space.
• Venn diagrams
• Field of events

4.4 Definitions of probability
4.4.1 The classical definition of probability
4.4.2 The statistical definition of probability
4.4.3 Axiomatic construction of the theory of probability (optional)

4.5 Conditional probability and independence
• Definition, formula, multiplicative theorem, independent and dependent events

Page 14: Statistics

Chapter 4 (continued 3)

4.5 Conditional probability and independence
4.6 Rules for calculating probability

4.6.1 The addition rule
– for pairwise mutually exclusive events:
P(A1 + A2 + ... + An) = P(A1) + P(A2) + ... + P(An)
– for two non-mutually exclusive events A and B:
P(A + B) = P(A) + P(B) - P(AB)

4.6.2 Multiplicative rule
P(AB) = P(A) P(B|A) = P(B) P(A|B)

4.6.3 Formula of total probability
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An)
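
A small numerical check of the rules in Section 4.6, with invented probabilities: three mutually exclusive and exhaustive events A1, A2, A3 and a conditional event B.

```python
# Hypothetical probabilities for illustration
P_A = [0.5, 0.3, 0.2]             # P(A1), P(A2), P(A3): mutually exclusive, exhaustive
P_B_given_A = [0.10, 0.40, 0.70]  # P(B|A1), P(B|A2), P(B|A3)

# Addition rule for mutually exclusive events: P(A1 + A2 + A3) = sum of the P(Ai)
print("P(A1 + A2 + A3) =", sum(P_A))             # = 1.0

# Multiplicative rule: P(Ai B) = P(Ai) P(B|Ai)
P_AB = [pa * pb for pa, pb in zip(P_A, P_B_given_A)]

# Formula of total probability: P(B) = sum of P(Ai) P(B|Ai)
P_B = sum(P_AB)
print("P(B) =", P_B)                              # 0.05 + 0.12 + 0.14 = 0.31
```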

Page 15: Statistics

Chapter 5 Basic probability distributions

5.1 Random variables
5.2 The probability distribution for a discrete random variable
5.3 Numerical characteristics of a discrete random variable
5.4 The binomial probability distribution
5.5 The Poisson distribution
5.6 Continuous random variables: distribution function and density function
5.7 Numerical characteristics of a continuous random variable
5.8 The normal distribution
5.9 Exercises

Page 16: Statistics

Chapter 5 (continued 1)

5.1 Random variables
• A random variable is a variable that assumes numerical values associated with events of an experiment.
• Classification of random variables: discrete random variables and continuous random variables

5.2 The probability distribution for a discrete random variable

• The probability distribution for a discrete random variable x is a table, graph, or formula that gives the probability of observing each value of x.

• Properties of the probability distribution

Page 17: Statistics

Chapter 5 (continued 2)

5.3 Numerical characteristics of a discrete random variable
5.3.1 Mean or expected value: μ = E(X) = Σ x p(x)
5.3.2 Variance and standard deviation: σ² = E[(X - μ)²]

5.4 The binomial probability distribution
• Model (or characteristics) of a binomial random variable
• The probability distribution
• Mean and variance of a binomial random variable

5.5 The Poisson distribution
• Model (or characteristics) of a Poisson random variable
• The probability distribution
• Mean and variance of a Poisson random variable
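
The binomial and Poisson models of Sections 5.4 and 5.5 are available in scipy.stats; a sketch with made-up parameters (n = 10, p = 0.3 for the binomial, λ = 2 for the Poisson):

```python
from scipy import stats

# Binomial random variable: n = 10 trials, success probability p = 0.3
n, p = 10, 0.3
X = stats.binom(n, p)
print("P(X = 3) =", X.pmf(3))
print("mean =", X.mean(), " variance =", X.var())    # np = 3,  np(1 - p) = 2.1

# Poisson random variable with mean lambda = 2
lam = 2
Y = stats.poisson(lam)
print("P(Y = 0) =", Y.pmf(0))
print("mean =", Y.mean(), " variance =", Y.var())    # both equal lambda
```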

Page 18: Statistics

Chapter 5 (continued 3)

5.6 Continuous random variables: distribution function and density function

• Cumulative distribution function F(x)=P(X<x)

• Probability density function f(x) = F'(x)

5.7 Numerical characteristics of a continuous random variable
• Mean or expected value: μ = E(X) = ∫ x f(x) dx
• Variance and standard deviation

5.8 The normal distribution
• The density function, mean and variance of a normal random variable; the σ, 2σ and 3σ rules
• The normal distribution as an approximation to the binomial probability distribution
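
For Section 5.8, a sketch using scipy.stats.norm with illustrative values (μ = 100, σ = 15), covering the σ, 2σ and 3σ rules and the normal approximation to a binomial probability:

```python
from scipy import stats

mu, sigma = 100, 15                       # hypothetical normal population
Z = stats.norm(mu, sigma)

# The "sigma rules": P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3
for k in (1, 2, 3):
    prob = Z.cdf(mu + k * sigma) - Z.cdf(mu - k * sigma)
    print(f"within {k} sigma: {prob:.4f}")          # about 0.68, 0.95, 0.997

# Normal approximation to a binomial probability, with continuity correction
n, p = 50, 0.4
exact  = stats.binom(n, p).cdf(25)                  # P(X <= 25) exactly
approx = stats.norm(n * p, (n * p * (1 - p)) ** 0.5).cdf(25.5)
print(exact, approx)
```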

Page 19: Statistics

Chapter 6 Sampling Distributions

6.1 Why the method of sampling is important
6.2 Obtaining a random sample
6.3 Sampling distributions
6.4 The sampling distribution of the sample mean: the Central Limit Theorem
6.5 Summary
6.6 Exercises

Page 20: Statistics

Chapter 6 (continued 1)

6.1 Why the method of sampling is important
• Two samples from the same population can provide contradictory information about the population.
• Random sampling eliminates the possibility of bias in selecting a sample and, in addition, provides a probabilistic basis for evaluating the reliability of an inference.

6.2 Obtaining a random sample
• A random sample of n experimental units is one selected in such a way that every different sample of size n has an equal probability of selection.
• Procedures for generating a random sample

Page 21: Statistics

Chapter 6 (continued 2)

6.3 Sampling distributions
• A numerical descriptive measure of a population is called a parameter. A quantity computed from the observations in a random sample is called a statistic.
• The sampling distribution of a sample statistic (based on n observations) is the relative frequency distribution of the values of the statistic theoretically generated by taking repeated random samples of size n and computing the value of the statistic for each sample.
• Examples of computer-generated random samples

6.4 The sampling distribution of the sample mean: the Central Limit Theorem
• If the sample size is sufficiently large, the mean of a random sample from a population has a sampling distribution that is approximately normal, regardless of the shape of the relative frequency distribution of the target population.
• Mean and standard deviation of the sampling distribution (illustrated by the simulation below)

6.5 Summary
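
The Central Limit Theorem of Section 6.4 can be illustrated by simulation; this sketch draws repeated samples from a deliberately non-normal (exponential) population and looks at the mean and standard deviation of the resulting sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
pop_mean, n, n_samples = 2.0, 40, 5000            # exponential population with mean 2

# Sampling distribution of the sample mean, generated by repeated sampling
sample_means = rng.exponential(pop_mean, size=(n_samples, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())          # close to the population mean 2
print("std of sample means:", sample_means.std(ddof=1))      # close to sigma / sqrt(n)
print("theoretical sigma/sqrt(n):", pop_mean / np.sqrt(n))   # for the exponential, sigma = mean
```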

Page 22: Statistics

Chapter 7. Estimation

7.1 Introduction
7.2 Estimation of a population mean: large-sample case
7.3 Estimation of a population mean: small-sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population means: independent samples
7.6 Estimation of the difference between two population means: matched pairs
7.7 Estimation of the difference between two population proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary

Page 23: Statistics

Chapter 7 (continued 1)

7.2 Estimation of a population mean: Large-sample case

• Point estimate of a population mean: the sample mean x̄
• Large-sample (1 - α)100% confidence interval for a population mean (uses the fact that for a sufficiently large sample size, n >= 30, the sampling distribution of the sample mean x̄ is approximately normal)

7.3 Estimation of a population mean: small-sample case (n < 30)
• Problems arising for small sample sizes; assumption: the population has an approximately normal distribution
• (1 - α)100% confidence interval using the t-distribution

7.4 Estimation of a population proportion
• For sufficiently large samples, the sampling distribution of the sample proportion p̂ is approximately normal.
• Large-sample (1 - α)100% confidence interval for a population proportion
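
A sketch of the interval estimates in Sections 7.2 to 7.4, with invented sample values and a 95% confidence level: a t-based interval for a mean and a large-sample normal interval for a proportion.

```python
import numpy as np
from scipy import stats

# (1 - alpha)100% confidence interval for a population mean (small sample, t-distribution)
sample = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.6, 9.7, 10.3])   # hypothetical measurements
conf = 0.95
mean, se = sample.mean(), stats.sem(sample)
lo, hi = stats.t.interval(conf, df=sample.size - 1, loc=mean, scale=se)
print(f"mean: {mean:.3f},  {conf:.0%} CI: ({lo:.3f}, {hi:.3f})")

# Large-sample confidence interval for a population proportion
x, n = 130, 400                       # 130 successes in 400 trials (hypothetical)
p_hat = x / n
z = stats.norm.ppf(1 - (1 - conf) / 2)
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"p-hat: {p_hat:.3f},  {conf:.0%} CI: ({p_hat - half:.3f}, {p_hat + half:.3f})")
```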

Page 24: Statistics

Chapter 7 (continued 2)

7.5 Estimation of the difference between two population means: Independent samples

• For sufficiently large sample sizes (n1 and n2 >= 30), the sampling distribution of x̄1 - x̄2, based on independent random samples from two populations, is approximately normal.
• Small sample sizes: under some assumptions on the populations

7.6 Estimation of the difference between two population means: matched pairs
• Assumption: the population of paired differences is normally distributed
• Procedure

7.7 Estimation of the difference between two population proportions
• For sufficiently large sample sizes (n1 and n2 >= 30), the sampling distribution of p̂1 - p̂2, based on independent random samples from two populations, is approximately normal.
• (1 - α)100% confidence interval for p1 - p2

Page 25: Statistics

Chapter 8. General Concepts of Hypothesis Testing

8.1 Introduction
The procedures to be discussed are useful in situations where we are interested in making a decision about a parameter value rather than obtaining an estimate of its value.

8.2 Formulation of hypotheses
• A null hypothesis H0 is the hypothesis against which we hope to gather evidence. The hypothesis for which we wish to gather supporting evidence is called the alternative hypothesis Ha.
• One-tailed (directional) tests and two-tailed tests

8.3 Conclusions and consequences for a hypothesis test
• The goal of any hypothesis test is to make a decision, based on sample information, whether to reject H0 in favor of Ha. In doing so, we may make one of two types of error.
• A Type I error occurs if we reject H0 when it is true. The probability of committing a Type I error is denoted by α (also called the significance level).
• A Type II error occurs if we do not reject H0 when it is false. The probability of committing a Type II error is denoted by β.

Page 26: Statistics

Chapter 8 (continued 1)

8.4 Test statistics and rejection regions
• The test statistic is a sample statistic upon which the decision concerning the null and alternative hypotheses is based.
• The rejection region is the set of possible values of the test statistic for which the null hypothesis will be rejected.
• Steps for testing a hypothesis (a worked sketch follows below)
• Critical value = boundary value of the rejection region

8.5 Summary
8.6 Exercises
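
To make the steps of Section 8.4 concrete, here is a minimal sketch of a large-sample z-test about a population mean, with invented numbers: H0: μ = 50 against Ha: μ ≠ 50 at significance level α = 0.05.

```python
from math import sqrt
from scipy import stats

# Hypothetical summary data
n, xbar, s = 36, 52.1, 6.0
mu0, alpha = 50.0, 0.05

z = (xbar - mu0) / (s / sqrt(n))              # test statistic
z_crit = stats.norm.ppf(1 - alpha / 2)        # critical value of the two-tailed rejection region
p_value = 2 * stats.norm.sf(abs(z))           # two-tailed p-value

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, p-value = {p_value:.4f}")
if abs(z) > z_crit:
    print("Reject H0 at the", alpha, "level")
else:
    print("Do not reject H0")
```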

Page 27: Statistics

Chapter 9. Applications of Hypothesis Testing

9.1 Diagnosing a hypothesis test
9.2 Hypothesis test about a population mean
9.3 Hypothesis test about a population proportion
9.4 Hypothesis tests about the difference between two population means
9.5 Hypothesis tests about the difference between two proportions
9.6 Hypothesis test about a population variance
9.7 Hypothesis test about the ratio of two population variances
9.8 Summary
9.9 Exercises

Page 28: Statistics

Chapter 9 (continued 1)

9.2 Hypothesis test about a population mean
• Large-sample test (n >= 30):
– The sampling distribution of x̄ is approximately normal and s is a good approximation of σ.
– Procedure for the large-sample test
• Small-sample test:
– Assumption: the population has an approximately normal distribution.
– Procedure for the small-sample test (using the t-distribution)

9.3 Hypothesis test about a population proportion
• Large-sample test

9.4 Hypothesis tests about the difference between two population means
• Large-sample test:
– Assumptions: n1 >= 30, n2 >= 30; the samples are selected randomly and independently from the two populations
• Small-sample test
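
The small-sample and two-sample procedures of Sections 9.2 and 9.4 correspond to the t-tests in scipy.stats; a sketch with invented data:

```python
from scipy import stats

# Small-sample test about a population mean (t-distribution), H0: mu = 5.0
sample = [5.3, 4.9, 5.6, 5.1, 4.8, 5.4, 5.7, 5.2]          # hypothetical data
t_stat, p_val = stats.ttest_1samp(sample, popmean=5.0)
print("one-sample t:", round(t_stat, 3), "p-value:", round(p_val, 3))

# Test about the difference between two population means (independent samples)
group1 = [23, 25, 28, 22, 26, 27, 24, 25]                  # hypothetical samples
group2 = [20, 22, 21, 23, 19, 24, 22, 21]
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=True)
print("two-sample t:", round(t_stat, 3), "p-value:", round(p_val, 3))
```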

Page 29: Statistics

Chapter 9 (continued 2)

9.5 Hypothesis tests about the difference between two proportions
– Assumptions, procedure

9.6 Hypothesis test about a population variance
– Assumption: the population has an approximately normal distribution.
– Procedure using the chi-square distribution

9.7 Hypothesis test about the ratio of two population variances (optional)
– Assumptions: the populations have approximately normal distributions; the random samples are independent.
– Procedure using the F-distribution

Page 30: Statistics

Chapter 10. Categorical Data Analysis and Analysis of Variance

10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises

Page 31: Statistics

Chapter 10 (continued 1)

10.1 Introduction
10.2 Tests of goodness of fit
– Purpose: to test for a dependence on a qualitative variable that allows more than two categories for a response. Namely, it tests whether there is a significant difference between an observed frequency distribution and a theoretical frequency distribution.
– Procedure for a chi-square goodness-of-fit test

10.3 The analysis of contingency tables
– Purpose: to determine whether a dependence exists between two qualitative variables
– Procedure for a chi-square test for independence of two directions of classification

10.4 Contingency tables in statistical software packages
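
Both chi-square procedures of Sections 10.2 and 10.3 are available in scipy.stats; the counts below are invented for illustration.

```python
from scipy import stats

# Goodness-of-fit: observed counts in 4 categories vs. a hypothesized uniform distribution
observed = [18, 30, 24, 28]
expected = [25, 25, 25, 25]
chi2, p = stats.chisquare(observed, f_exp=expected)
print("goodness of fit: chi2 =", round(chi2, 2), "p =", round(p, 3))

# Test for independence of two qualitative variables from a contingency table
table = [[20, 30, 10],          # rows and columns are hypothetical categories
         [25, 15, 20]]
chi2, p, dof, exp = stats.chi2_contingency(table)
print("independence: chi2 =", round(chi2, 2), "df =", dof, "p =", round(p, 3))
```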

Page 32: Statistics

Chapter 10 (continued 2)

10.5 Introduction to analysis of variance
• Purpose: comparison of more than two means

10.6 Design of experiments
• Concepts of experiment, design of the experiment, response variable, factor, treatment
• Concepts of between-sample variation and within-sample variation

10.7 Completely randomized designs
• This design involves a comparison of the means of k treatments, based on independent random samples of n1, n2, ..., nk observations drawn from the k populations.
• Assumptions: all k populations are normal and have equal variances
• F-test for comparing k population means (sketched below)

10.8 Randomized block designs
• Concept of a randomized block design
• Tests to compare the k treatment and b block means

10.9 Multiple comparisons of means and confidence regions
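
For the completely randomized design of Section 10.7, the F-test comparing k treatment means can be sketched with scipy.stats.f_oneway; the three treatment samples are invented.

```python
from scipy import stats

# Independent random samples for k = 3 treatments (hypothetical data)
treatment1 = [14, 16, 15, 17, 15]
treatment2 = [18, 19, 17, 20, 21]
treatment3 = [13, 12, 15, 14, 13]

F, p = stats.f_oneway(treatment1, treatment2, treatment3)
print("F =", round(F, 2), "p-value =", round(p, 4))
# A small p-value is evidence that at least two of the treatment means differ.
```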

Page 33: Statistics

Chapter 11. Simple Linear regression and correlation

11.1 Introduction: bivariate relationships
11.2 Simple linear regression: assumptions
11.3 Estimating A and B: the method of least squares
11.4 Estimating σ²
11.5 Making inferences about the slope B
11.6 Correlation analysis
11.7 Using the model for estimation and prediction
11.8 Simple linear regression: an overview example
11.9 Exercises

Page 34: Statistics

Chapter 11 (continued 1)

11.1 Introduction: bivariate relationships
• The subject is to determine the relationship between two variables.
• Types of relationships: direct and inverse
• Scattergram

11.2 Simple linear regression: assumptions
• A simple linear regression model: y = A + Bx + e
• Assumptions required for the linear regression model: E(e) = 0, e is normal, and σ² = Var(e) is constant for all values of x

11.3 Estimating A and B: the method of least squares
• The least-squares estimators a and b; formulas for a and b

11.4 Estimating σ²
• Formula for s², an estimator of σ²
• Interpretation of s, the estimated standard deviation of e
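
A sketch of the least-squares formulas of Sections 11.3 and 11.4 on invented (x, y) data, computing the estimators a and b and the estimator s² of σ²:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)          # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

# Least-squares estimators of A (intercept) and B (slope)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)
s2 = np.sum(residuals ** 2) / (x.size - 2)    # estimator of sigma^2 (n - 2 degrees of freedom)

print("a =", round(a, 3), " b =", round(b, 3), " s^2 =", round(s2, 4))
```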

Page 35: Statistics

Chapter 11 (continued 2)

11.5 Making inferences about the slope B
• Problem of making an inference about the population regression line E(y) = A + Bx based on the sample regression line ŷ = a + bx
• Sampling distribution of the least-squares estimator of the slope, b
• Test of the utility of the model: H0: B = 0 against Ha: B ≠ 0 (or B > 0, B < 0)
• A (1 - α)100% confidence interval for B

11.6 Correlation analysis
• Correlation analysis is the statistical tool for describing the degree to which one variable is linearly related to another.
• The coefficient of correlation r is a measure of the strength of the linear relationship between two variables.
• The coefficient of determination r²

11.7 Using the model for estimation and prediction
• A (1 - α)100% confidence interval for the mean value of y for x = xp
• A (1 - α)100% confidence interval for an individual y for x = xp

11.8. Simple Linear Regression: An Example
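
Sections 11.5 and 11.6 (the test of H0: B = 0 and the correlation coefficient) correspond to scipy.stats.linregress; a sketch reusing invented data like that above:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2]                 # hypothetical data

res = stats.linregress(x, y)
print("slope b =", round(res.slope, 3), " intercept a =", round(res.intercept, 3))
print("r =", round(res.rvalue, 4), " r^2 =", round(res.rvalue ** 2, 4))
print("p-value for H0: B = 0 :", res.pvalue)      # two-sided test of zero slope
print("standard error of b:", round(res.stderr, 4))
```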

Page 36: Statistics

Chapter 12. Multiple regression

12.1 Introduction: the general linear model
12.2 Model assumptions
12.3 Fitting the model: the method of least squares
12.4 Estimating σ²
12.5 Estimating and testing hypotheses about the B parameters
12.6 Checking the utility of a model
12.7 Using the model for estimating and prediction
12.8 Multiple linear regression: an overview example
12.9 Model building: interaction models
12.10 Model building: quadratic models
12.11 Exercises

Page 37: Statistics

Chapter 12 (continued 1)

12.1 Introduction: the general linear model
y = B0 + B1x1 + ... + Bkxk + e, where y is the dependent variable, x1, x2, ..., xk are independent variables, and e is a random error.

12.2 Model assumptions
– For any given set of values x1, x2, ..., xk, the random error e has a normal probability distribution with mean equal to 0 and variance equal to σ².
– The random errors are independent.

12.3 Fitting the model: the method of least squares
Least-squares prediction equation: ŷ = b0 + b1x1 + ... + bkxk

12.4 Estimating σ²

12.5 Estimating and testing hypotheses about the B parameters
• Sampling distributions of b0, b1, ..., bk
• A (1 - α)100% confidence interval for Bi (i = 0, 1, ..., k)

• Test of an individual parameter coefficient Bi
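
A minimal sketch of fitting the general linear model of Sections 12.1 to 12.4 by least squares with NumPy; the two independent variables and the response values are invented.

```python
import numpy as np

# Hypothetical data: y depends on two independent variables x1 and x2
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y  = np.array([6.1, 6.8, 10.2, 10.9, 14.8, 15.2, 19.1, 18.8])

X = np.column_stack([np.ones_like(x1), x1, x2])        # design matrix with intercept column
b, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates b0, b1, b2

n, k = X.shape[0], X.shape[1] - 1
s2 = np.sum((y - X @ b) ** 2) / (n - (k + 1))          # estimator of sigma^2
print("b0, b1, b2 =", np.round(b, 3), " s^2 =", round(s2, 4))
```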

Page 38: Statistics

Chapter 12 (continued 2)

12.6 Checking the utility of a model
• Finding a measure of how well a linear model fits a set of data: the multiple coefficient of determination
• Testing the overall utility of the model

12.7 Using the model for estimating and prediction
• A (1 - α)100% confidence interval for the mean value of y for a given x
• A (1 - α)100% confidence interval for an individual y for a given x

12.8 Multiple linear regression: an overview example

12.9 Model building: interaction models
• Interaction model with two independent variables: E(y) = B0 + B1x1 + B2x2 + B3x1x2
• procedure to build an interaction model

12.10 Model building: quadratic models
• Quadratic model in a single variable: E(y) = B0 + B1x + B2x²

• procedure to build a quadratic model

Page 39: Statistics

Chapter 13. Nonparametric statistics

13.1 Introduction
• Situations where the t and F tests are unsuitable
• What do nonparametric methods use?

13.2 The sign test for a single population
• Purpose: to test hypotheses about the median of any population
• Procedure for the sign test for a population median
• Sign test based on a large sample (n >= 10)

13.3 Comparing two populations based on independent random samples: the Wilcoxon rank sum test
• A nonparametric test about the difference between two populations is a test to detect whether distribution 1 is shifted to the right of distribution 2, or vice versa.
• Wilcoxon rank sum test for a shift in population locations
• The case of large samples (n1 >= 10, n2 >= 10)
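
The sign test and the Wilcoxon rank sum test of Sections 13.2 and 13.3 can be sketched with scipy on invented data; note that scipy's Mann-Whitney U test is the equivalent of the Wilcoxon rank sum test.

```python
from scipy import stats

# Sign test for a population median, H0: median = 10
sample = [12, 9, 14, 11, 13, 8, 15, 12, 16, 11]             # hypothetical data
above = sum(x > 10 for x in sample)
n = sum(x != 10 for x in sample)                             # ties with the median are dropped
res = stats.binomtest(above, n, p=0.5)                       # number of "+" signs ~ Binomial(n, 0.5) under H0
print("sign test p-value:", round(res.pvalue, 3))

# Wilcoxon rank sum test (Mann-Whitney U) for a shift between two distributions
sample1 = [25, 32, 28, 40, 35, 30, 27]                       # hypothetical independent samples
sample2 = [18, 22, 24, 20, 26, 19, 23]
u, p = stats.mannwhitneyu(sample1, sample2, alternative="two-sided")
print("rank sum test p-value:", round(p, 4))
```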

Page 40: Statistics

Chapter 13 (continued 1)

13.4 Comparing two populations based on matched pairs
• Wilcoxon signed ranks test for a shift in population locations
• Wilcoxon signed ranks test for large samples (n >= 25)

13.5 Comparing populations using a completely randomized design: the Kruskal-Wallis H test
• The Kruskal-Wallis H test is the nonparametric equivalent of the ANOVA F-test when the assumptions that the populations are normally distributed with a common variance are not satisfied.
• The Kruskal-Wallis H test for comparing k population probability distributions

13.6 Rank correlation: Spearman's rs statistic
• A statistic developed to measure and to test for correlation between two random variables
• Formula for computing Spearman's rank correlation coefficient rs
• Spearman's nonparametric test for rank correlation

13.7 Exercises
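
The remaining procedures of Sections 13.4 to 13.6 are also in scipy.stats; a sketch on invented data:

```python
from scipy import stats

# Wilcoxon signed ranks test for matched pairs (Section 13.4)
before = [72, 80, 65, 90, 75, 85, 70, 78]                    # hypothetical paired measurements
after  = [70, 76, 66, 85, 72, 80, 68, 75]
w, p = stats.wilcoxon(before, after)
print("signed ranks p-value:", round(p, 3))

# Kruskal-Wallis H test for k = 3 population distributions (Section 13.5)
g1, g2, g3 = [14, 16, 15, 17], [18, 19, 17, 20], [13, 12, 15, 14]
h, p = stats.kruskal(g1, g2, g3)
print("Kruskal-Wallis p-value:", round(p, 3))

# Spearman's rank correlation (Section 13.6)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
rs, p = stats.spearmanr(x, y)
print("Spearman rs =", round(rs, 3), " p-value:", round(p, 3))
```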
