S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

© Willett, Harvard University Graduate School of Education, 07/04/22


Page 1: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

More details can be found in the “Course Objectives and Content” handout on the course webpage.

Multiple Regression Analysis (MRA):

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

• Do your residuals meet the required assumptions?
  • Test for residual normality.
  • Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis.
  • Use individual growth modeling.
  • Specify a multi-level model.
• If your sole predictor is continuous, MRA is identical to correlational analysis.
• If your sole predictor is dichotomous, MRA is identical to a t-test.
• If your several predictors are categorical, MRA is identical to ANOVA.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use…
  • Binomial logistic regression analysis (dichotomous outcome).
  • Multinomial logistic regression analysis (polytomous outcome).
  • Discriminant Analysis.
• If you have more predictors than you can deal with,
  • Create taxonomies of fitted models and compare them.
  • Form composites of the indicators of any common construct.
  • Conduct a Principal Components Analysis.
  • Use Cluster Analysis.
• If your outcome vs. predictor relationship is non-linear,
  • Transform the outcome or predictor.
  • Use non-linear regression analysis.

Today’s Topic Area

Page 2: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Printed Syllabus – What Is Today’s Topic?

Please check inter-connections among the Roadmap, the Daily Topic Area, the Printed Syllabus, and the content of today’s class when you pre-read the day’s materials.

Today, in Sections III.1(b) & (c) on Using Principal Components Analysis, I will:
• Introduce the idea of forming a general weighted linear composite from multiple indicators.
• Show how Principal Components Analysis can create an optimal weighted linear composite.
• Identify the 1st principal component as the required optimal composite.
• Interpret the 1st principal component by inspecting weights in the 1st eigenvector.
• Determine the variance of the 1st principal component by examining the 1st eigenvalue.
• Connect the dimensionality of a set of indicators to the eigenvalues produced by PCA.
• Determine the dimensionality of a set of indicators using the Rule of One & Scree Plot.
• Appendix 1: Assessing the magnitudes of the elements of the eigenvector.

Page 3: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Recalling the TSUCCESS Dataset

A dataset in which the investigators measured multiple indicators of what they thought was a single underlying construct that represented Teacher Job Satisfaction. The data are described in TSUCCESS_info.pdf.

Dataset: TSUCCESS.txt

Overview: Responses of a national sample of teachers to six questions about job satisfaction.

Source: Administrator and Teacher Survey of the High School and Beyond (HS&B) dataset, 1984 administration, National Center for Education Statistics (NCES).

Sample Size: 5269 teachers (4955 of them with complete data).

More Info: HS&B was established to study the educational, vocational, and personal development of young people beginning in their elementary or high school years and following them over time as they began to take on adult responsibilities. The HS&B survey included two cohorts: (a) the 1980 senior class, and (b) the 1980 sophomore class. Both cohorts were surveyed every two years through 1986, and the 1980 sophomore class was also surveyed again in 1992.

Page 4: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Extending The Basic “Equally-Weighted” Standardized Composite

I have argued that, when forming composites, a standardized composite is preferred, to compensate for the heterogeneous metrics and variances of the multiple indicators being composited, as follows …

Cronbach Coefficient Alpha
Variables         Alpha
----------------------------
Raw               0.696594
Standardized      0.735530

A standardized composite is formed by first standardizing each indicator to zero mean & standard deviation 1:

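$$
\begin{aligned}
X^*_{1i} &= \frac{X_{1i} - \bar{X}_1}{s_1}, &
X^*_{2i} &= \frac{X_{2i} - \bar{X}_2}{s_2}, &
X^*_{3i} &= \frac{X_{3i} - \bar{X}_3}{s_3}, \\
X^*_{4i} &= \frac{X_{4i} - \bar{X}_4}{s_4}, &
X^*_{5i} &= \frac{X_{5i} - \bar{X}_5}{s_5}, &
X^*_{6i} &= \frac{X_{6i} - \bar{X}_6}{s_6}
\end{aligned}
$$

[Figure: path diagram linking the six indicators $X_{1i}, \dots, X_{6i}$ to the composite $C_i$.]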

Then, the standardized indicator scores are added together to form composite $C_i$, as follows:

$$C_i = 1X^*_{1i} + 1X^*_{2i} + 1X^*_{3i} + 1X^*_{4i} + 1X^*_{5i} + 1X^*_{6i}$$

Or, if you prefer to use “normed” weights, $C_i$ would be†:

$$C_i = \frac{1}{\sqrt{6}}X^*_{1i} + \frac{1}{\sqrt{6}}X^*_{2i} + \frac{1}{\sqrt{6}}X^*_{3i} + \frac{1}{\sqrt{6}}X^*_{4i} + \frac{1}{\sqrt{6}}X^*_{5i} + \frac{1}{\sqrt{6}}X^*_{6i}$$

† Weights are said to be “normed” when the sum of their squares is equal to one.
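The two versions of the equally-weighted composite can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response values are made up (they are not taken from TSUCCESS.txt), and the names `responses`, `standardize`, `composite_unit` and `composite_normed` are hypothetical.

```python
import math
import statistics

# Hypothetical responses of five teachers (rows) to six indicators X1..X6 (columns).
# These numbers are made up for illustration; they are not from TSUCCESS.txt.
responses = [
    [4, 4, 3, 5, 5, 3],
    [5, 3, 3, 2, 4, 3],
    [3, 5, 2, 4, 3, 2],
    [6, 4, 4, 6, 6, 4],
    [4, 2, 3, 3, 4, 3],
]

def standardize(values):
    """Rescale scores to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return [(v - mean) / sd for v in values]

# Standardize each indicator (column), then regroup the scores by teacher (row).
columns = [standardize(col) for col in zip(*responses)]
standardized_rows = [list(row) for row in zip(*columns)]

k = len(columns)  # number of indicators, six here

# Unit weights: add the six standardized scores for each teacher.
composite_unit = [sum(row) for row in standardized_rows]

# Normed weights: each weight is 1/sqrt(k), so the squared weights sum to one.
weight = 1 / math.sqrt(k)
composite_normed = [weight * c for c in composite_unit]
sum_of_squared_weights = k * weight ** 2
```

Norming only rescales the composite: every teacher's normed score is the unit-weight score divided by the square root of six, so the two versions rank teachers identically.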

Page 5: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
An Extension Of The Basic Standardized Composite To Include “Weights”?

More generally, a weighted linear composite can be formed by weighting and adding standardized indicators together to form a composite measure of teacher job satisfaction …

[Figure: path diagram in which the six indicators $X_{1i}, \dots, X_{6i}$ contribute to the composite $C_i$ with weights $w_1, \dots, w_6$.]

By choosing weights that differ from unity and differ from each other, we can create an infinite number of potential composites, as follows:

$$C_i = w_1X^*_{1i} + w_2X^*_{2i} + w_3X^*_{3i} + w_4X^*_{4i} + w_5X^*_{5i} + w_6X^*_{6i}$$

Where we typically use “normed” weights, such that:

$$w_1^2 + w_2^2 + w_3^2 + w_4^2 + w_5^2 + w_6^2 = 1$$

Among all such weighted linear composites, are there some that are “optimal”?

How would we define such “optimal” composites? Does it make sense, for instance, to seek a composite with maximum variance, given the original standardized indicators?

Perhaps I can also choose weights that take account of the differing inter-correlations among the indicators, and “pull” the composite “closer” to the more highly-correlated indicators?
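One way to make “maximum variance” concrete is to write out the variance of a weighted composite of standardized indicators. The sketch below applies the standard rule for the variance of a weighted sum; the symbol $r_{jk}$, standing for the sample correlation between indicators $j$ and $k$, is notation assumed here and does not appear on the slide.

```latex
\begin{aligned}
\operatorname{Var}(C)
  &= \sum_{j=1}^{6} w_j^2 \operatorname{Var}(X^*_j)
     + 2 \sum_{j<k} w_j w_k \operatorname{Cov}(X^*_j, X^*_k) \\
  &= \sum_{j=1}^{6} w_j^2 + 2 \sum_{j<k} w_j w_k \, r_{jk}
     && \text{(each } X^*_j \text{ has variance 1)} \\
  &= 1 + 2 \sum_{j<k} w_j w_k \, r_{jk}
     && \text{(normed weights)}
\end{aligned}
```

With normed weights, then, the composite's variance can rise above 1 only through the inter-correlations, and it rises most when the larger weights sit on the more highly-correlated indicators.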

Page 6: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Programming Principal Components Analysis In PC-SAS

In Data-Analytic Handout III.1(b).1 …

*-----------------------------------------------------------------------------*
 Input the dataset, name and label the six indicators of teacher satisfaction
*-----------------------------------------------------------------------------*;
DATA TSUCCESS;
   INFILE 'C:\DATA\S052\TSUCCESS.txt';
   INPUT X1-X6;
   LABEL X1 = 'Have high standards of teaching'
         X2 = 'Continually learning on job'
         X3 = 'Successful in educating students'
         X4 = 'Waste of time to do best as teacher'
         X5 = 'Look forward to working at school'
         X6 = 'Time satisfied with job';

PROC FORMAT;
   VALUE AFMT 1='Strongly disagree' 2='Disagree' 3='Slightly disagree'
              4='Slightly agree' 5='Agree' 6='Strongly agree';
   VALUE BFMT 1='Strongly agree' 2='Agree' 3='Slightly agree'
              4='Slightly disagree' 5='Disagree' 6='Strongly disagree';
   VALUE CFMT 1='Not successful' 2='Somewhat successful'
              3='Successful' 4='Very Successful';
   VALUE DFMT 1='Almost never' 2='Sometimes' 3='Almost always' 4='Always';

*-----------------------------------------------------------------------------*
 Carry out the principal components analysis
*-----------------------------------------------------------------------------*;
PROC PRINCOMP DATA=TSUCCESS OUT=TSUCCESS PREFIX=PC_;
   VAR X1-X6;
RUN;

PROC PRINCOMP is the PC-SAS procedure for conducting Principal Components Analysis (PCA):
• By choosing sets of weights, PCA seeks out optimal weighted linear composites of the original standardized indicators.
• These composites are called the “principal components.”
• The first principal component is that weighted linear composite that has maximum variance, given the indicator-indicator inter-correlations.

After a PCA, scores on the new composite(s) are output, for each person, and you must:
• Specify a dataset into which they are output (here, I used the original dataset, TSUCCESS).
• Provide a prefix to be used for labeling the new composite variable(s) (here, PC_).

The VAR statement specifies the indicators to include in the PCA.

Page 7: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
PCA Provides Univariate Output …

The PRINCOMP Procedure
Observations   4955
Variables         6

Simple Statistics
          X1       X2       X3       X4       X5       X6
Mean    4.3294   3.8736   3.1550   4.2270   4.4244   2.8365
StD     1.0882   1.2427   0.6693   1.6660   1.3289   0.5714

Notice the sample size used in the PCA:
• Recall that the total sample size was 5269 teachers.
• PCA has used the list-wise deleted sample of 4955 teachers in order to avoid violations of positive definiteness.

As a first step, PCA estimates the sample mean and standard deviation of the indicators and standardizes each of them automatically, as follows:

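$$
\begin{aligned}
X^*_{1i} &= \frac{X_{1i} - 4.33}{1.09}, &
X^*_{2i} &= \frac{X_{2i} - 3.87}{1.24}, &
X^*_{3i} &= \frac{X_{3i} - 3.15}{0.67}, \\
X^*_{4i} &= \frac{X_{4i} - 4.23}{1.67}, &
X^*_{5i} &= \frac{X_{5i} - 4.42}{1.33}, &
X^*_{6i} &= \frac{X_{6i} - 2.84}{0.57}
\end{aligned}
$$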

So, we begin with a total of six units of original standardized indicator variance that PCA then seeks to disperse maximally into new composites.
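As a small worked example of this first step, the sketch below standardizes one teacher's responses using the rounded means and standard deviations from the Simple Statistics output. The teacher's raw responses are hypothetical, as are the variable names.

```python
# Means and standard deviations reported by PROC PRINCOMP (rounded as on the slide).
means = [4.33, 3.87, 3.15, 4.23, 4.42, 2.84]
sds = [1.09, 1.24, 0.67, 1.67, 1.33, 0.57]

# Hypothetical raw responses of one teacher to X1..X6 (not taken from the dataset).
raw = [5, 4, 3, 6, 5, 3]

# Standardize: subtract the indicator's mean and divide by its standard deviation.
standardized = [(x - m) / s for x, m, s in zip(raw, means, sds)]

# Each standardized indicator has variance 1, so six indicators carry six units in total.
total_standardized_variance = len(means) * 1.0
```

A response above the indicator's mean gives a positive standardized score and a response below it gives a negative one, whatever the indicator's original metric.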

The output has several important features, starting with ...

Page 8: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
PCA Provides Bivariate Output …

The PRINCOMP Procedure
Observations   4955
Variables         6

Correlation Matrix
                                             X1       X2       X3       X4       X5       X6
X1  Have high standards of teaching        1.0000   0.5548   0.1610   0.2127   0.2531   0.1921
X2  Continually learning on job            0.5548   1.0000   0.1663   0.2313   0.2697   0.2225
X3  Successful in educating students       0.1610   0.1663   1.0000   0.2990   0.3557   0.4326
X4  Waste of time to do best as teacher    0.2127   0.2313   0.2990   1.0000   0.4478   0.3993
X5  Look forward to working at school      0.2531   0.2697   0.3557   0.4478   1.0000   0.5529
X6  Time satisfied with job                0.1921   0.2225   0.4326   0.3993   0.5529   1.0000

PCA assumes that all bivariate inter-relationships among the indicators are linear, so, in a thorough PCA, you must:
• Check the bivariate linearity assumption by inspecting a complete set of bivariate scatter-plots (not included here).
• Fix any curvilinearity by applying suitable transformations to the indicators before proceeding (not needed here).

PCA seeks a set of ideal composites that are mutually uncorrelated and with decreasing maximum variance, given the set of original indicators:
• Of course, it may not succeed in piling all six units of original standardized variance into a single “ideal” composite because the original indicators are not all perfectly inter-correlated to begin with!
• So, in creating its “ideal composites,” PCA accounts for inter-correlations among the indicators, putting as much of the original variance into the 1st component as it can, before moving on to the creation of the 2nd, etc.
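To see how far a single composite falls short of the full six units, the following sketch computes the variance of the equally-weighted normed composite directly from the reported correlation matrix. It relies on the fact that the variance of a weighted sum of standardized indicators is the quadratic form w'Rw; the function name `composite_variance` is hypothetical.

```python
import math

# Correlation matrix of X1..X6 as reported in the PROC PRINCOMP output.
R = [
    [1.0000, 0.5548, 0.1610, 0.2127, 0.2531, 0.1921],
    [0.5548, 1.0000, 0.1663, 0.2313, 0.2697, 0.2225],
    [0.1610, 0.1663, 1.0000, 0.2990, 0.3557, 0.4326],
    [0.2127, 0.2313, 0.2990, 1.0000, 0.4478, 0.3993],
    [0.2531, 0.2697, 0.3557, 0.4478, 1.0000, 0.5529],
    [0.1921, 0.2225, 0.4326, 0.3993, 0.5529, 1.0000],
]

def composite_variance(weights, corr):
    """Variance of a weighted sum of standardized indicators: w' R w."""
    n = len(weights)
    return sum(weights[j] * weights[k] * corr[j][k]
               for j in range(n) for k in range(n))

# Total standardized variance is the sum of the diagonal of R.
total_variance = sum(R[j][j] for j in range(6))

# Equally-weighted composite with normed weights 1/sqrt(6).
equal_weights = [1 / math.sqrt(6)] * 6
equal_weight_variance = composite_variance(equal_weights, R)
```

With these correlations the equally-weighted composite holds roughly 2.58 of the six units, which gives a baseline against which the first principal component can be judged.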

Page 9: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
PCA Provides Multivariate Output …

Eigenvectors
                                            PC_1     PC_2     PC_3     PC_4     PC_5     PC_6
X1  Have high standards of teaching        0.3472   0.6182   0.0896   0.0264   0.6261   0.3108
X2  Continually learning on job            0.3617   0.5950   0.0543   -.0217   -.6685   -.2548
X3  Successful in educating students       0.3778   -.3021   0.7555   0.4028   0.0503   -.1746
X4  Waste of time to do best as teacher    0.4144   -.1807   -.5972   0.6510   -.0493   0.1129
X5  Look forward to working at school      0.4727   -.2067   -.2418   -.4501   0.3022   -.6176
X6  Time satisfied with job                0.4591   -.3117   0.0558   -.4584   -.2548   0.6433

The row labels list the original indicators.

This “ideal” composite is called the first principal component. It is:

$$\mathit{PC\_1}_i = 0.35X^*_{1i} + 0.36X^*_{2i} + 0.38X^*_{3i} + 0.41X^*_{4i} + 0.47X^*_{5i} + 0.46X^*_{6i}$$

Where:

$$
\begin{aligned}
X^*_{1i} &= \frac{X_{1i} - 4.33}{1.09}, &
X^*_{2i} &= \frac{X_{2i} - 3.87}{1.24}, &
X^*_{3i} &= \frac{X_{3i} - 3.15}{0.67}, \\
X^*_{4i} &= \frac{X_{4i} - 4.23}{1.67}, &
X^*_{5i} &= \frac{X_{5i} - 4.42}{1.33}, &
X^*_{6i} &= \frac{X_{6i} - 2.84}{0.57}
\end{aligned}
$$

[Figure: path diagram in which the six indicators $X_{1i}, \dots, X_{6i}$ contribute to the first principal component $\mathit{PC\_1}_i$ with weights .35, .36, .38, .41, .47 and .46.]

The PC_1 column is called the “first eigenvector.” It contains the weights that PCA has determined will provide a linear composite of the six original standardized indicators with maximum possible variance, given the inter-correlations among the indicators.
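The maximum-variance weights can be approximated from the correlation matrix alone. The sketch below uses power iteration, a simple textbook method swapped in here for illustration; it is not a description of how PROC PRINCOMP computes its eigenvectors, and the helper names `mat_vec` and `normalize` are hypothetical.

```python
import math

# Correlation matrix of X1..X6 as reported in the PROC PRINCOMP output.
R = [
    [1.0000, 0.5548, 0.1610, 0.2127, 0.2531, 0.1921],
    [0.5548, 1.0000, 0.1663, 0.2313, 0.2697, 0.2225],
    [0.1610, 0.1663, 1.0000, 0.2990, 0.3557, 0.4326],
    [0.2127, 0.2313, 0.2990, 1.0000, 0.4478, 0.3993],
    [0.2531, 0.2697, 0.3557, 0.4478, 1.0000, 0.5529],
    [0.1921, 0.2225, 0.4326, 0.3993, 0.5529, 1.0000],
]

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def normalize(vector):
    """Rescale a vector so that its squared elements sum to one (normed weights)."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector]

# Power iteration: start from equal weights, repeatedly apply R and re-norm.
# The weights converge to the eigenvector with the largest eigenvalue.
w = normalize([1.0] * 6)
for _ in range(200):
    w = normalize(mat_vec(R, w))

# The variance of the resulting composite is w' R w, the first eigenvalue.
first_eigenvalue = sum(wi * rwi for wi, rwi in zip(w, mat_vec(R, w)))
```

Because the correlation matrix is entered to only four decimal places, the resulting weights and variance should agree with the PC_1 column and the first eigenvalue in the SAS output to about two or three decimal places rather than exactly.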

Page 10: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
What Is The First Principal Component Measuring?

Each standardized indicator is approximately equally weighted in the First Principal Component:
• This suggests that the first principal component is an even-handed synthesis of information originally contained equally in the six original standardized indicators.

Teachers who score highly on the first principal component:
• Have high standards of teaching performance.
• Feel that they are continually learning on the job.
• Believe that they are successful in educating students.
• Feel that it is not a waste of time to be a teacher.
• Look forward to working at school.
• Are always satisfied on the job.

Perhaps we can define the first principal component as an overall index of teacher enthusiasm?

Page 11: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(b): Using PCA To Form An “Ideal” Composite From Multiple Indicators
How Much Of The Original Standardized Variance Does The 1st Component Contain?

Eigenvalues of the Correlation Matrix
     Eigenvalue    Difference    Proportion    Cumulative
1    2.60599489    1.39439026        0.4343        0.4343
2    1.21160463    0.49880170        0.2019        0.6363
3    0.71280293    0.11761825        0.1188        0.7551
4    0.59518468    0.14741881        0.0992        0.8543
5    0.44776587    0.02111886        0.0746        0.9289
6    0.42664701                      0.0711        1.0000

The Eigenvalue column contains the eigenvalues.

The eigenvalue for the first principal component is its estimated variance:
• In this example, where indicator-indicator correlations were low, the best that PCA has been able to do is to form an “optimal” composite that contains 2.61 units of standardized variance, out of the original 6 units.
• That’s 43.43% of the total original standardized variance.

This implies that the remaining 3.39 units of the original standardized variance may measure something else!

Perhaps we can form other substantively-interesting composites from these same six indicators, by choosing different sets of weights:
• Maybe there are other “dimensions” of information still hidden within the data?
• Let’s inspect the other “principal components” that PCA has formed in these data …
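The Proportion and Cumulative columns follow from the eigenvalues by simple arithmetic, as this short sketch shows. The eigenvalues are copied from the SAS output; the variable names are hypothetical.

```python
from itertools import accumulate

# Eigenvalues of the correlation matrix, copied from the PROC PRINCOMP output.
eigenvalues = [2.60599489, 1.21160463, 0.71280293,
               0.59518468, 0.44776587, 0.42664701]

# The eigenvalues sum to the total standardized variance (six units).
total = sum(eigenvalues)

# Proportion of the total variance held by each component, and the running total.
proportions = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportions))

# Variance left over once the first principal component has been formed.
remaining_after_first = total - eigenvalues[0]
```

The first proportion is 2.61 / 6, or about 0.4343, and the left-over variance is 6 − 2.61, or about 3.39 units.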

Page 12: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Are There Other Principal Components That Are Worth Paying Attention To?

There’s more to be learned about the dimensionality of the data by inspecting the remaining eigenvalues.

What’s going on here is that PCA has piled as much of the 6 units of original standardized variance as it can into the first principal component (43%) and created a composite with maximum possible variance (2.61 units), given these indicators.

Then, it has taken the remaining variance (3.39 units or 57%) and used as much of it as possible to form a second principal component, that:
• Is uncorrelated with the first component,
• Has the next largest possible variance (1.21 units or an additional 20.19%).

Then, it’s done the same again with a third principal component … and a fourth … and a fifth … and a sixth.

Here’s the “dimensionality” argument:
• If all six original indicators were very strongly inter-correlated, with inter-correlations close to 1, and there were no measurement error in any indicator, then:
  • All six indicators would be high-quality measures of an underlying uni-dimensional construct.
  • All six units of original standardized variance would end up in the 1st component, with none left over.
• If there were two independent dimensions of information in the 6 original indicators, then the bulk of the original variability would end up in the first two components.

Conclusion: You can determine the dimensionality of the original indicators by examining the eigenvalues.
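The two limiting cases in the dimensionality argument can be checked numerically. The sketch below is built on hypothetical correlation matrices (neither is estimated from TSUCCESS), with a hypothetical helper `composite_variance` that evaluates w'Rw.

```python
import math

def composite_variance(weights, corr):
    """Variance of a weighted sum of standardized indicators: w' R w."""
    n = len(weights)
    return sum(weights[j] * weights[k] * corr[j][k]
               for j in range(n) for k in range(n))

# Hypothetical case 1: all six indicators perfectly inter-correlated (one dimension).
R_one_dim = [[1.0] * 6 for _ in range(6)]
w_all = [1 / math.sqrt(6)] * 6
var_one_dim = composite_variance(w_all, R_one_dim)

# Hypothetical case 2: two independent blocks of three perfectly correlated
# indicators (two dimensions); correlations are 1 within a block, 0 across blocks.
R_two_dim = [[1.0 if (j < 3) == (k < 3) else 0.0 for k in range(6)]
             for j in range(6)]
w_block_a = [1 / math.sqrt(3)] * 3 + [0.0] * 3
w_block_b = [0.0] * 3 + [1 / math.sqrt(3)] * 3
var_block_a = composite_variance(w_block_a, R_two_dim)
var_block_b = composite_variance(w_block_b, R_two_dim)
```

In the first case a single composite absorbs all six units; in the second, two composites absorb three units each and nothing is left for the remaining four components.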

Page 13: S052/III.1(b): Applied Data Analysis Roadmap of the Course – What Is Today’s Topic Area?

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Using The “Rule of One” To Determine The Dimensionality Of The Data

Use the “Rule of One”: “The only principal components worth paying attention to are those whose variances (eigenvalues) are bigger than that of any one of the original indicators on its own (i.e., bigger than 1).”

Conclusion:
• In the teacher job satisfaction example, the Rule of One suggests that there are two major dimensions of information present in the original six indicators.
• The third, fourth, fifth and sixth components then simply become “trash cans” into which trivial left-over amounts of independent information, or the measurement error, end up (it does have to go somewhere, after all!).
• Be careful if you use the Rule of One to determine the dimensionality of a large number of indicators because, then, it tends to over-estimate the number of dimensions.

Use the “Rule of One”Use the “Rule of One”“The only principal components worth

paying attention to, are those whose variances (eigenvalues) are bigger than any one of the original indicators on its

own (i.e., bigger than 1)”

The Rule of the One, RingThe Rule of the One, Ring

How can we decide how many dimensions of independent information were present in the original indicators?How can we decide how many dimensions of independent information were present in the original indicators?
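The Rule of One can be applied mechanically. The following is a minimal sketch, assuming the six eigenvalues have been typed in by hand from the PROC PRINCOMP output; the dataset RULE1 and the flag KEEP_FLAG are hypothetical names introduced only for illustration.

```sas
/* Illustrative sketch only: RULE1 and KEEP_FLAG are assumed names. */
DATA RULE1;
    INPUT COMPONENT EIGENVALUE;
    KEEP_FLAG = (EIGENVALUE > 1);   /* 1 = passes the Rule of One, 0 = does not */
    DATALINES;
1 2.60599489
2 1.21160463
3 0.71280293
4 0.59518468
5 0.44776587
6 0.42664701
;
RUN;

/* The SUM line at the foot of the listing counts the retained components */
PROC PRINT DATA=RULE1 NOOBS;
    SUM KEEP_FLAG;
RUN;
```

With these eigenvalues only the first two components are flagged, which agrees with the two dimensions identified in the teacher job satisfaction example.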

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Using A “Scree Plot” To Determine The Dimensionality Of The Data

How do we decide how many dimensions of independent information were present in the original indicators?

Use a Scree-Plot …

Component Number    Eigenvalue
1                   2.606
2                   1.212
3                   0.713
4                   0.595
5                   0.448
6                   0.426

[Scree plot: Eigenvalue (vertical axis, 0 to 3) plotted against Component # (horizontal axis, 0 to 8)]

How many eigenvalues rise above the scree?
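One way to draw such a plot yourself is sketched below, assuming the rounded eigenvalues are typed in by hand. The dataset name SCREE is illustrative, and the dashed reference line at 1 is an addition that overlays the Rule of One on the scree plot; it is not part of the original figure.

```sas
/* Illustrative sketch only: SCREE is an assumed dataset name. */
DATA SCREE;
    INPUT COMPONENT EIGENVALUE;
    DATALINES;
1 2.606
2 1.212
3 0.713
4 0.595
5 0.448
6 0.426
;
RUN;

PROC SGPLOT DATA=SCREE;
    SERIES X=COMPONENT Y=EIGENVALUE / MARKERS;     /* connected eigenvalues      */
    REFLINE 1 / AXIS=Y LINEATTRS=(PATTERN=DASH);   /* added: Rule-of-One cut-off */
    XAXIS LABEL='Component #' VALUES=(1 TO 6 BY 1);
    YAXIS LABEL='Eigenvalue' MIN=0 MAX=3;
RUN;
```

The components to keep are those sitting clearly above the point where the line flattens out into the “scree”.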

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Interpreting The Second Principal Component In The Teacher Job Satisfaction Data

We’ve already interpreted the 1st principal component as measuring TEACHER ENTHUSIASM.

Eigenvectors

                                            PC_1       PC_2
X1  Have high standards of teaching         0.3472     0.6182
X2  Continually learning on job             0.3617     0.5950
X3  Successful in educating students        0.3778    -0.3021
X4  Waste of time to do best as teacher     0.4144    -0.1807
X5  Look forward to working at school       0.4727    -0.2067
X6  Time satisfied with job                 0.4591    -0.3117

[Path diagram: each standardized indicator X*1 … X*6 contributes to PC_2 with the weight given in the PC_2 column above]

$$PC\_2_i = 0.62X^*_{1i} + 0.60X^*_{2i} - 0.30X^*_{3i} - 0.18X^*_{4i} - 0.21X^*_{5i} - 0.31X^*_{6i}$$

Teachers who score high on the second component …

• Have high standards of teaching performance.
• Feel that they are continually learning on the job.

But also …

• Believe they are not successful in educating students.
• Feel that it is a waste of time to be a teacher.
• Don’t look forward to working at school.
• Are never satisfied on the job.

The 2nd principal component is measuring TEACHER FRUSTRATION.
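To see what the weights in the second eigenvector actually do, the sketch below builds a second-component score by hand. It assumes the TSUCCESS dataset with indicators X1-X6 used throughout this example; the dataset names COMPLETE, ZSCORES and BYHAND and the variable PC2_HAND are hypothetical. Incomplete cases are dropped first, on the assumption that this mirrors how PROC PRINCOMP handles missing indicators.

```sas
/* Illustrative sketch only: COMPLETE, ZSCORES, BYHAND, PC2_HAND are assumed names. */

/* Keep only teachers with scores on all six indicators */
DATA COMPLETE;
    SET TSUCCESS;
    IF NMISS(OF X1-X6) = 0;
RUN;

/* Standardize each indicator to mean 0, standard deviation 1 */
PROC STANDARD DATA=COMPLETE MEAN=0 STD=1 OUT=ZSCORES;
    VAR X1-X6;
RUN;

/* Weight the standardized indicators by the PC_2 eigenvector and add them up */
DATA BYHAND;
    SET ZSCORES;
    PC2_HAND = 0.6182*X1 + 0.5950*X2 - 0.3021*X3
             - 0.1807*X4 - 0.2067*X5 - 0.3117*X6;
RUN;

PROC PRINT DATA=BYHAND(OBS=10);
    VAR PC2_HAND;
RUN;
```

The hand-built scores should agree closely with the PC_2 scores that PROC PRINCOMP outputs, apart from small differences caused by rounding the weights to four decimal places.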

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
You Can Obtain 1st & 2nd Principal Components Scores For Each Teacher

Finally, the principal components scores can be examined and used in subsequent analysis …

    *---------------------------------------------------------------------------------*
     Input the dataset, name and label the six indicators of teacher satisfaction
    *---------------------------------------------------------------------------------*;
    DATA TSUCCESS;
        INFILE 'C:\DATA\S052\TSUCCESS.txt';
        INPUT X1-X6;
        LABEL X1 = 'Have high standards of teaching'
              X2 = 'Continually learning on job'
              X3 = 'Successful in educating students'
              X4 = 'Waste of time to do best as teacher'
              X5 = 'Look forward to working at school'
              X6 = 'Time satisfied with job';

    PROC FORMAT;
        VALUE AFMT 1='Strongly disagree' 2='Disagree' 3='Slightly disagree'
                   4='Slightly agree' 5='Agree' 6='Strongly agree';
        VALUE BFMT 1='Strongly agree' 2='Agree' 3='Slightly agree'
                   4='Slightly disagree' 5='Disagree' 6='Strongly disagree';
        VALUE CFMT 1='Not successful' 2='Somewhat successful'
                   3='Successful' 4='Very Successful';
        VALUE DFMT 1='Almost never' 2='Sometimes' 3='Almost always' 4='Always';

    *---------------------------------------------------------------------------------*
     Carry out the principal components analysis and inspect the first two components
    *---------------------------------------------------------------------------------*;
    PROC PRINCOMP DATA=TSUCCESS OUT=TSUCCESS PREFIX=PC_;
        VAR X1-X6;

    PROC PRINT DATA=TSUCCESS(OBS=35);
        VAR PC_1 PC_2;

    PROC CORR NOPROB NOMISS DATA=TSUCCESS;
        VAR PC_1 PC_2;
    RUN;

• PROC PRINCOMP: Output the principal components into the TSUCCESS dataset, and label them with the prefix PC_.
• PROC PRINT: Print a few cases for inspection.
• PROC CORR: Check the inter-correlation of the first two principal components.

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Inspecting 1st & 2nd Principal Components Scores For Each Teacher

Obs      PC_1        PC_2
  1    -0.67402     1.64567
  2    -3.70420     1.25497
  3    -2.80870     1.46971
  4      .           .
  5    -0.72933     0.16173
  6      .           .
  7     0.68828    -0.66211
  8    -1.64624     1.96727
  9     1.84142     1.66606
 10    -0.11813    -0.20596
 11    -3.70653     0.85507
 12     2.11717     0.84820
 13    -0.66466    -0.47258
 14    -1.09068    -0.99362
 15    -0.89365     0.42894
 16     1.61503     0.55299
 17    -1.95180    -2.33192
 18    -1.40406    -0.25084
 19     1.18572    -0.87904
 20    -2.05647    -1.88495
 21    -0.36685    -0.09749
 22    -2.64324    -0.21207
 23     2.21446     1.10305
 24    -2.55062    -0.75701
  •    -0.03442    -2.97280

(cases deleted)

Pearson Correlation Coefficients, N = 4955

          PC_1       PC_2
PC_1    1.00000    0.00000
PC_2    0.00000    1.00000

As promised, scores on the first & second principal components are completely uncorrelated …

Notice that anyone missing a score on any indicator is also missing a score on any composite (see observations 4 and 6 above).
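A quick way to inspect the scores further is sketched below. It assumes PROC PRINCOMP has already written PC_1 and PC_2 into the TSUCCESS dataset, as in the course code; the choice of PROC MEANS and of these particular statistics is illustrative rather than part of the original analysis.

```sas
/* Illustrative check, assuming PC_1 and PC_2 already exist in TSUCCESS.      */
/* N and NMISS show how many teachers have, or lack, a composite score.       */
/* VAR is the sample variance of each set of component scores.                */
PROC MEANS DATA=TSUCCESS N NMISS MEAN VAR;
    VAR PC_1 PC_2;
RUN;
```

Because each eigenvalue is the variance of its component, the reported variances should come out close to 2.61 for PC_1 and 1.21 for PC_2, with means of essentially zero, and NMISS counts the teachers who lost their composite scores because an indicator was missing.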

But, wait, there is still more to come …

S052/III.1(c): Using PCA To Form An “Ideal” Composite From Multiple Indicators
Appendix 1: Interesting Aside On Evaluating The Size Of Principal Component Loadings

Eigenvectors

                                            PC_1       PC_2
X1  Have high standards of teaching         0.3472     0.6182
X2  Continually learning on job             0.3617     0.5950
X3  Successful in educating students        0.3778    -0.3021
X4  Waste of time to do best as teacher     0.4144    -0.1807
X5  Look forward to working at school       0.4727    -0.2067
X6  Time satisfied with job                 0.4591    -0.3117

The estimated correlation between any indicator and any component can be found by multiplying the corresponding component loading by the square root of the eigenvalue.

This is sometimes useful in interpretation.

Correlation of X1 and PC_1: $0.347 \times \sqrt{2.61} = 0.347 \times 1.62 = 0.561$

Correlation of X1 and PC_2: $0.618 \times \sqrt{1.212} = 0.618 \times 1.101 = 0.680$

Correlation of X1 and PC_3: = …
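The same multiplication can be carried out for every indicator at once. The sketch below is one way to do it, assuming the loadings and the first two eigenvalues are typed in by hand from the output above; the dataset LOADCORR and the variables INDICATOR, LOAD1, LOAD2, CORR_PC1 and CORR_PC2 are hypothetical names used only for illustration.

```sas
/* Illustrative sketch only: LOADCORR and its variable names are assumptions. */
DATA LOADCORR;
    INPUT INDICATOR $ LOAD1 LOAD2;
    CORR_PC1 = LOAD1 * SQRT(2.60599489);   /* loading x sqrt(1st eigenvalue) */
    CORR_PC2 = LOAD2 * SQRT(1.21160463);   /* loading x sqrt(2nd eigenvalue) */
    DATALINES;
X1 0.3472  0.6182
X2 0.3617  0.5950
X3 0.3778 -0.3021
X4 0.4144 -0.1807
X5 0.4727 -0.2067
X6 0.4591 -0.3117
;
RUN;

PROC PRINT DATA=LOADCORR NOOBS;
    FORMAT CORR_PC1 CORR_PC2 6.3;
RUN;
```

The X1 row should reproduce the hand calculations above (roughly 0.56 with PC_1 and 0.68 with PC_2), and the remaining rows give the corresponding correlations for X2 through X6.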