Analysis of Individual Variables Descriptive – –Measures of Central Tendency Mean – Average...

  • date post

    22-Dec-2015

Page 1

Analysis of Individual Variables

• Descriptive
  – Measures of Central Tendency
    • Mean – Average score of the distribution (1st moment)
    • Median – Middle score (50th percentile) of the distribution
  – Measures of Variation (used to measure the spread of the distribution around the measures of central tendency)
    • Range – Distance between the lowest and highest data points
    • Mean Deviation – Average absolute distance between the mean and the data points
    • Variance – Average squared distance from the mean (2nd central moment); the sample variance divides the sum of squared distances by n − 1
    • Standard Deviation – Square root of the variance

Page 2

Analysis of Individual Variables

Obs   Income
1     20.50
2     31.50
3     47.70
4     26.20
5     44.00
6     8.28
7     30.80
8     17.20
9     19.90
10    9.96
11    55.80
12    25.20
13    29.00
14    85.50
15    15.10
16    28.50
17    21.40
18    17.70
19    6.42
20    84.90

Mean       31.28
Median     25.70
Variance   500.68
Stdev      22.38
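The summary figures for the Income column can be reproduced with a short script. This is a minimal sketch using Python's standard `statistics` module; it assumes the slide's Variance and Stdev are the sample versions (dividing by n − 1), and the variable names are illustrative only.

```python
import statistics

# Income observations copied from the table on this slide.
income = [20.50, 31.50, 47.70, 26.20, 44.00, 8.28, 30.80, 17.20, 19.90, 9.96,
          55.80, 25.20, 29.00, 85.50, 15.10, 28.50, 21.40, 17.70, 6.42, 84.90]

# Measures of central tendency.
mean = statistics.mean(income)
median = statistics.median(income)

# Measures of variation.
value_range = max(income) - min(income)
# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - mean) for x in income) / len(income)
# Sample variance and standard deviation (assumed n - 1 divisor).
variance = statistics.variance(income)
stdev = statistics.stdev(income)

print(f"Mean            {mean:.2f}")
print(f"Median          {median:.2f}")
print(f"Range           {value_range:.2f}")
print(f"Mean deviation  {mean_deviation:.2f}")
print(f"Variance        {variance:.2f}")
print(f"Stdev           {stdev:.2f}")
```

Range and mean deviation are not shown on the slide; they are included only because they are defined on the previous page.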

Page 3

Analysis of Relationship among Variables

• Correlation
• Regression
  – Two Variable Models
  – Multiple Variable Models
  – Discrete Dependent Variable Models

Page 4

Scatter Plot of Money Supply Growth and Inflation

Page 5

Correlation

• A scatter plot is a graph that shows the relationship between the observations for two data series in two dimensions
• Correlation analysis expresses this numerically
  – In contrast to a scatter plot, which graphically depicts the relationship between two data series, correlation analysis expresses this same relationship using a single number
  – The correlation coefficient is a measure of how closely related two data series are
  – The correlation coefficient measures the linear association between two variables

Page 6

Correlation

• Determines the association between 2 variables
• Measured on a scale from +1 to -1
  – values close to +1.0 indicate a strong positive relationship
  – values close to -1.0 indicate a strong negative relationship
  – values close to 0 indicate little or no linear relationship

Page 7

Variables with Perfect Positive Correlation

Page 8

Variables with Perfect Negative Correlation

Page 9

Variables with a Correlation of 0

Page 10

Variables with a Non-Linear Association
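Because the correlation coefficient only measures linear association, two variables can be perfectly related and still have a correlation of 0. The sketch below illustrates this with made-up data (not taken from the slides): y = x² on a range of x values symmetric around zero, compared with a straight line. The helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient; the (n - 1) divisors cancel out."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Illustrative data only: x is symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_linear = [2 * x + 1 for x in xs]   # exact straight-line relationship
ys_nonlinear = [x ** 2 for x in xs]   # exact, but non-linear, relationship

r_linear = pearson_r(xs, ys_linear)
r_nonlinear = pearson_r(xs, ys_nonlinear)

print(f"linear:     r = {r_linear:.2f}")
print(f"non-linear: r = {r_nonlinear:.2f}")
```

Here y is fully determined by x in both cases, yet only the straight line yields a correlation of 1; the parabola yields 0, which is why a scatter plot should be inspected alongside the number.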

Page 11

Calculating correlations

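• The sample correlation coefficient ‘r’ is,

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X\, s_Y}$$

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}, \qquad s_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}}, \qquad s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}}$$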

Page 12

Calculating correlations

• E.g.: Is it true that higher education leads to higher compensation?
  – To answer this question, we need to look at the data and calculate the correlation

Years of Education   Compensation (000)
17.97                163.30
22.86                142.05
17.25                100.00
13.35                103.55
14.97                90.00
15.87                97.50
13.17                90.00
11.10                80.00
13.86                90.25
8.97                 49.50

Page 13

Calculating correlations

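• The sample correlation coefficient ‘r’ is,

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X\, s_Y}$$

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}, \qquad s_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}}, \qquad s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}}$$

so we need to calculate

$$\bar{X} \ (\text{average of } X), \qquad \bar{Y} \ (\text{average of } Y), \qquad (X_i-\bar{X})^2,\ (Y_i-\bar{Y})^2, \qquad (X_i-\bar{X})(Y_i-\bar{Y})$$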

Page 14

Calculating correlations

Years of Ed.   Comp (000)   (X-XBar)^2   (Y-YBar)^2   (X-XBar)(Y-YBar)
17.97          163.30       9.20         3929.41      190.12
22.86          142.05       62.77        1716.86      328.29
17.25          100.00       5.35         0.38         -1.42
13.35          103.55       2.52         8.61         -4.66
14.97          90.00        0.00         112.68       -0.35
15.87          97.50        0.87         9.70         -2.91
13.17          90.00        3.12         112.68       18.76
11.10          80.00        14.72        424.98       79.10
13.86          90.25        1.16         107.43       11.16
8.97           49.50        35.61        2612.74      305.00
Sums:                       135.32       9035.48      923.10

Calculations

XBar         14.94
YBar         100.62
n - 1        9.00
Covariance   102.57
SX           3.88
SY           31.69
r            0.83
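The calculation table can be retraced step by step in code. The following is a minimal sketch in plain Python that follows the formula for r term by term, using the education and compensation figures from the slides; the variable names are chosen for illustration.

```python
import math

# Data from the slides: years of education (X) and compensation in thousands (Y).
years_of_education = [17.97, 22.86, 17.25, 13.35, 14.97, 15.87, 13.17, 11.10, 13.86, 8.97]
compensation = [163.30, 142.05, 100.00, 103.55, 90.00, 97.50, 90.00, 80.00, 90.25, 49.50]

n = len(years_of_education)
x_bar = sum(years_of_education) / n
y_bar = sum(compensation) / n

# Column sums of the calculation table.
sum_xx = sum((x - x_bar) ** 2 for x in years_of_education)
sum_yy = sum((y - y_bar) ** 2 for y in compensation)
sum_xy = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(years_of_education, compensation))

# Sample covariance and standard deviations (n - 1 divisor).
covariance = sum_xy / (n - 1)
s_x = math.sqrt(sum_xx / (n - 1))
s_y = math.sqrt(sum_yy / (n - 1))

r = covariance / (s_x * s_y)

print(f"XBar {x_bar:.2f}  YBar {y_bar:.2f}")
print(f"Covariance {covariance:.2f}  SX {s_x:.2f}  SY {s_y:.2f}")
print(f"r {r:.2f}")
```

Small differences in the last decimal place relative to the slide can arise because the table entries on the slide are rounded before being summed.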

Page 15

Calculating correlations (EXCEL)

Years of Ed.   Comp (000)
17.97          163.30
22.86          142.05
17.25          100.00
13.35          103.55
14.97          90.00
15.87          97.50
13.17          90.00
11.10          80.00
13.86          90.25
8.97           49.50

Correlation   =CORREL(array1, array2)
Correlation   0.83

Page 16

Correlation Matrix

         US Eqt   UK      US FI   Japan   Korea   Mexico  China   HK      S'pore  India
US Eqt   1.00
UK       0.27     1.00
US FI    -0.13    -0.27   1.00
Japan    0.20     -0.15   0.08    1.00
Korea    -0.13    -0.17   0.28    -0.01   1.00
Mexico   -0.10    0.28    -0.35   -0.38   -0.01   1.00
China    0.17     -0.12   0.29    0.09    0.19    0.00    1.00
HK       0.22     0.24    -0.38   -0.23   -0.55   0.32    -0.08   1.00
S'pore   0.52     0.24    0.00    0.08    -0.02   0.30    0.35    -0.01   1.00
India    0.30     0.57    0.17    -0.12   -0.11   -0.17   0.24    0.01    0.35    1.00

Page 17

Correlations Among Stock Return Series

Page 18

Regression

• Most of the time it's not enough to just say whether 2 variables are correlated
  – We would like to define a relationship between the two variables
  – E.g. when the economy grows 1%, how much will the S&P500 increase?
• To do this, we use a technique called Regression

Page 19

Regression

• How the term Regression came to be applied to the subject of statistical models:
• The 19th century scientist Sir Francis Galton, studying human subjects, found in all things "regression toward mediocrity"
  – E.g. if your parents are very smart, you are likely to be closer to average than they are - so it's really not your fault!!

Page 20

Regression

• In modern times, when we talk of Regression analysis, we make an implicit assumption of a ‘mean’ relationship between variables and we try to determine that relationship.
• Regression analysis is concerned with
  – the study of the dependence of one variable (the dependent variable)
  – on one or more other variables (the explanatory variables)
  – with a view to estimating and/or predicting the mean or average value of the former
  – in terms of the fixed values of the latter.

Page 21

Two Variable Regression Model

• Regression analysis is concerned with the relationship between 2 variables, say ‘y’ and ‘x’, which can be written as

$$y_i = f(x_i)$$

  – All this means is that the value of ‘y’ is a function of the value of ‘x’
  – Another way of saying it is that ‘y’ doesn’t independently get its value, but somehow depends on ‘x’ to get its value
  – Thus ‘y’ can somehow be derived from ‘x’
  – Thus ‘y’ is a dependent variable and ‘x’ is an independent variable
• Regression is thus the study of a relationship between the dependent and independent variables

Page 22

Regression

x      y
1.0    2.0
1.5    3.0
2.0    4.0
2.5    5.0
3.0    6.0
3.5    7.0
4.0    8.0
4.5    9.0
5.0    10.0
5.5    11.0
6.0    12.0
6.5    13.0
7.0    14.0
7.5    15.0
25     ?

Q: What is y when x = 25?
A: y = f(x) = 2 * x, so if x = 25, y = 50

$$y = f(x) = \beta x, \quad \text{where } \beta = 2$$

[Figure: plot of y against x for x from 1.0 to 7.5]

Page 23

Regression

x      y1     y2     y3
0.0    0.0    0.0    0.0
0.5    0.5    1.0    1.5
1.0    1.0    2.0    3.0
1.5    1.5    3.0    4.5
2.0    2.0    4.0    6.0
2.5    2.5    5.0    7.5
3.0    3.0    6.0    9.0
3.5    3.5    7.0    10.5
4.0    4.0    8.0    12.0
4.5    4.5    9.0    13.5
5.0    5.0    10.0   15.0
5.5    5.5    11.0   16.5
6.0    6.0    12.0   18.0
6.5    6.5    13.0   19.5
7.0    7.0    14.0   21.0
7.5    7.5    15.0   22.5
10

Q: What are y1, y2, y3 when x = 10?
A: y1 = x, so y1 = 10
   y2 = 2 * x, so y2 = 20
   y3 = 3 * x, so y3 = 30

[Figure: plot of y1, y2 and y3 against x for x from 0.0 to 7.5]

Page 24

Regression

x      y
0.0    1.0
0.5    2.0
1.0    3.0
1.5    4.0
2.0    5.0
2.5    6.0
3.0    7.0
3.5    8.0
4.0    9.0
4.5    10.0
5.0    11.0
5.5    12.0
6.0    13.0
6.5    14.0
7.0    15.0
7.5    16.0
10.0

Q: What is y when x = 10?
A: y = f(x) = 1 + (2 * x), so if x = 10, y = 21

$$y = f(x) = \alpha + \beta x, \quad \text{where } \alpha = 1,\ \beta = 2$$

[Figure: plot of y against x for x from 0.0 to 7.5]

Page 25

Two Variable Regression Model

• Regression analysis is concerned with
  – the study of a relationship between the dependent and independent variables

$$y_i = f(x_i)$$

  – In reality, we are estimating a relationship, so that we can calculate the value of a random variable

$$y_i = \alpha + \beta x_i$$

Page 26

Two Variable Regression Model

• Real data from which we estimate the relationship is never very clean, because we deal with random variables
  – What we end up having is something like this

$$y_i = \alpha + \beta x_i + \text{error}$$

  – What we try to do in regression is estimate the “Line of Best Fit”, so that we can come up with this equation

$$y_i = \alpha + \beta x_i$$

  – This is also the equation of a line, so this form of regression is called “Linear regression”

Page 27

Two Variable Regression Model

[Figure: scatter plot with fitted line y = 0.841 + 0.3909x, R² = 0.7247]

Page 28

Two Variable Regression Model

• Regression Model – Equation of a Line

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

• Terminology – ‘y’
  – Dependent Variable, or
  – Left-Hand Side Variable, or
  – Explained Variable

Page 29

Two Variable Regression Model

• Terminology – ‘x’
  – Independent Variable, or
  – Right-Hand Side Variable, or
  – Explanatory Variable, or
  – Regressor, Covariate, Control Variable
• Terminology – ‘ε’
  – Error
  – Disturbance

Page 30

Two Variable Regression Model

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

• Terminology
  – ‘α’ – Intercept
  – ‘β’ – Slope
  – ‘ε’ – Error

Page 31

Assumptions of the Linear Regression Model

• The relationship between the dependent variable, Y, and the independent variable, X, is linear
• The independent variable, X, is not random
• About the error
  – The expected value (remember: the average) of the error term is 0
  – The error term is normally distributed
  – The variance of the error term is the same for all observations
  – The error term is uncorrelated across observations

Page 32

Regression Relationship estimation

• The model is estimated by the “Least Squares Estimation” method

Page 33

Two Variable Regression Model

$$y_i = \alpha + \beta x_i$$

$$\beta = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$

$$\alpha = \bar{Y} - \beta \bar{X}$$
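The two estimation formulas are easy to apply by hand. As an illustration, the sketch below fits a line to the education and compensation data used earlier for the correlation example; applying the regression to that data set is an assumption made here for demonstration, not something the slides do, and the names `alpha`, `beta` and `predict` are illustrative.

```python
# Data reused from the correlation example: years of education (x)
# and compensation in thousands (y).
years = [17.97, 22.86, 17.25, 13.35, 14.97, 15.87, 13.17, 11.10, 13.86, 8.97]
comp = [163.30, 142.05, 100.00, 103.55, 90.00, 97.50, 90.00, 80.00, 90.25, 49.50]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(comp) / n

# Sample covariance and variance (n - 1 divisor; it cancels in the ratio).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, comp)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in years) / (n - 1)

beta = cov_xy / var_x            # slope
alpha = y_bar - beta * x_bar     # intercept

def predict(x):
    """Fitted value on the least-squares line."""
    return alpha + beta * x

print(f"slope {beta:.2f}, intercept {alpha:.2f}")
print(f"fitted compensation at 16 years of education: {predict(16):.2f}")
```

One property worth noting: because the intercept is defined as YBar minus slope times XBar, the fitted line always passes through the point (XBar, YBar).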

Page 34

Inferences from Regression

$$y_i = \alpha + \beta_1 x_{1i} + \varepsilon_i$$

• Inferences from Regression can be made about
  – Model – how well does the specified model perform, i.e., are the specified independent variables, taken together, a good predictor of the dependent variable (R²)
  – Independent Variables – the contribution of each independent variable in predicting the dependent variable (hypothesis test)

Page 35

Model power

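$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\text{Total variation} - \text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}}$$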

Page 36

Inference about Model

• Coeff. of Determination (R²)

[Figure: diagram with labels (x1, y1), xm, x1, (x1 - xm), ym, yp, y1, SST, SSR and SSE]

$$SST = SSR + SSE$$

$$1 = \frac{SSR}{SST} + \frac{SSE}{SST}$$

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• So, higher the R² – better model (Yes? That would be too easy!)
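The decomposition can be checked numerically. The sketch below assumes SST is the total sum of squares around the mean, SSR the regression (explained) sum of squares and SSE the error (unexplained) sum of squares, which is the reading consistent with R² = SSR/SST. It reuses the education and compensation data from the correlation example purely as an illustration.

```python
# Illustrative data: years of education (x) and compensation in thousands (y).
years = [17.97, 22.86, 17.25, 13.35, 14.97, 15.87, 13.17, 11.10, 13.86, 8.97]
comp = [163.30, 142.05, 100.00, 103.55, 90.00, 97.50, 90.00, 80.00, 90.25, 49.50]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(comp) / n

# Least-squares slope and intercept.
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(years, comp))
        / sum((x - x_bar) ** 2 for x in years))
alpha = y_bar - beta * x_bar
fitted = [alpha + beta * x for x in years]

sst = sum((y - y_bar) ** 2 for y in comp)               # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)             # explained variation
sse = sum((y - f) ** 2 for y, f in zip(comp, fitted))   # unexplained variation

r_squared = ssr / sst

print(f"SST {sst:.2f} = SSR {ssr:.2f} + SSE {sse:.2f}")
print(f"R^2 {r_squared:.3f}")
```

For a two variable model, R² computed this way equals the square of the correlation coefficient r between x and y.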

Page 37

Inference about Model

• If the model is correctly specified, R² is an ideal measure
• Adding a variable to a regression never decreases R², and almost always increases it (by construction)
• This fact can be exploited to get regressions with R² ~ 100% by adding variables, but this doesn’t mean that the model is any good
• Adj-R², which penalizes additional variables, should be reported
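The slides do not state the adjustment formula. A common definition, assumed here, is Adj-R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. The sketch below applies it to the R Square and Observations values reported in the Excel summary output at the end of this deck; the function name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R^2 under the assumed formula 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Values taken from the Excel summary output in this deck:
# R Square = 0.164151419, Observations = 60, one X variable.
adj = adjusted_r_squared(0.164151419, 60, 1)
print(f"Adjusted R Square {adj:.9f}")
```

Unlike R², this quantity can fall when a variable that adds little explanatory power is included, because the (n − k − 1) divisor shrinks as k grows.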

Page 38

Inference about Parameters

• Coefficients are estimated with a confidence interval
• To know if a specific independent variable (xi) is influential in predicting the dependent variable (y), we test whether the corresponding coefficient is statistically different from 0 (i.e. we test the null hypothesis βi = 0).
• We do so by calculating the t-statistic for the coefficient
• If the t-stat is sufficiently large in absolute value, it indicates that the estimate bi is significantly different from 0, indicating that βi * xi plays a role in determining y

Page 39

Inference about parameters

• We can test to see if the slope coefficient is significant by using a t-test.

$$t = \frac{\hat{b}_1 - 0}{s_{\hat{b}_1}}$$
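As a worked instance of the t-test formula, the sketch below plugs in the slope coefficient and its standard error from the Excel summary output at the end of this deck. The function name `t_statistic` and its default hypothesized value of 0 are illustrative choices.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """t = (estimate - hypothesized value) / standard error of the estimate."""
    return (estimate - hypothesized) / std_error

# Slope ("X Variable 1") and its standard error from the Excel output.
slope = 0.939398568
slope_se = 0.278341127

t_slope = t_statistic(slope, slope_se)
print(f"t Stat for the slope: {t_slope:.6f}")

# The same function can test a different null hypothesis, e.g. slope = 1.
t_vs_one = t_statistic(slope, slope_se, hypothesized=1.0)
print(f"t Stat against a slope of 1: {t_vs_one:.6f}")
```

The first value matches the t Stat column of the Excel output; whether it counts as "sufficiently large" is read off from the P-value column or a t-table with n − 2 degrees of freedom.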

Page 40

In Excel

Page 41

In Excel

Page 42

In Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.405156042
R Square            0.164151419
Adjusted R Square   0.149740236
Standard Error      0.05350165
Observations        60

ANOVA
             df   SS            MS            F             Significance F
Regression   1    0.032604637   0.032604637   11.39055864   0.001321732
Residual     58   0.166020739   0.002862427
Total        59   0.198625377

               Coefficients    Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept      -9.72076E-05    0.007438982      -0.01306732   0.98961893    -0.014987948   0.014793533
X Variable 1   0.939398568     0.278341127      3.374990169   0.001321732   0.382238272    1.496558865
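Several entries in this output are derived from the ANOVA sums of squares. The sketch below recomputes them from the SS and df columns alone, as a way of reading the table; it assumes the usual definitions (mean square = SS/df, F = regression MS over residual MS, Standard Error = square root of the residual MS), and the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, df_regression = 0.032604637, 1
ss_residual, df_residual = 0.166020739, 58
ss_total = 0.198625377

# R Square: explained share of the total variation.
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)

# Mean squares and the F statistic.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# Standard error of the regression.
standard_error = math.sqrt(ms_residual)

print(f"R Square       {r_square:.9f}")
print(f"Multiple R     {multiple_r:.9f}")
print(f"F              {f_stat:.6f}")
print(f"Standard Error {standard_error:.8f}")
```

With a single independent variable, the F statistic is the square of the slope's t Stat, which is why the Significance F and the slope's P-value coincide in the output.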