
CONSTRUCT VALIDITY OF ASSESSMENT CENTRES:

LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS

31st Annual ACSG Conference • March 2011

What is known about the construct validity currently:

Over the last 50 years, ACs have been popular in the assessment of individual differences for managerial development purposes

Multi-occupation, multi-company investigation with high face validity

In AC post-exercise dimension ratings (PEDRs), exercise effects are more pervasive than cross-situational stability in candidate ratings

Bowler, M. C., & Woehr, D. J. (2006). A meta-analytic evaluation of the impact of dimension and exercise factors on assessment center ratings. Journal of Applied Psychology, 91, 1114–1124.

Lance, Lambert, Gewin, Lievens, & Conway (2004) found in a meta-analysis that exercise effects explain almost three times more variance than dimension effects

Problematic for construct validity: PEDRs are a function of exercise design and not of person competencies.

What is known about the construct validity currently:

Recently there have been two schools of thought on assessing the construct validity of ACs: Confirmatory Factor Analysis (CFA) [MTMM] and Generalizability Theory

FOUR basic models within the CFA tradition:
• Correlated dimension-correlated exercises (CDCE) model: MTMM
• One-dimension-correlated exercises (1DCE) model
• Uncorrelated dimensions, correlated exercises, plus g (UDCE + g) model
• Correlated dimension-correlated uniqueness (CDCU) model

Lance, Woehr & Meade (2007). A Monte Carlo Investigation of Assessment Center Construct Validity Models. Organizational Research Methods, 10(3), 430-448

Advantages of CFA approach:
• Partitions out error variance
• ALSO partitions out exercise effects

Thus PEDRs are a function of both exercise and dimension effects

However, the CTCE model is technically difficult to estimate (empirical under-identification)

Prerequisite is construct validity before partitioning out exercise effects

Thus critical first step was to assess construct validity of dimensions with actual DAC data

An Example: Achievement motivation and Financial Perspective

An Example: Achievement motivation (AM)

DIMENSION: ACHIEVEMENT MOTIVATION

TRAITS            EXERCISES
                  ANALYSIS PROBLEM (AP)    SIMULATED IN BASKET (SIB)
INNOVATION        IN_AP                    -
ENERGY            EN_AP                    -
PROCESS SKILLS    PS_AP                    PS_SIB

Correlation Matrix

AM: Option 1 CDCE model would be preferable: WHY? Differentiate sources of variance:

SCENARIO 2:

[Path diagram: Achievement Motivation dimension factor with its four indicators, plus the Analysis Problem and Simulated In Basket exercise factors]

Empirical under-identification:
• We have 13 parameters to estimate in the model, yet only 10 pieces of information in the covariance matrix
• Thus we have too many model parameters to estimate with too little information (-3 df)
• Similar to the equation X + Y = 6: there are unlimited possible combinations that solve the equation
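The degrees-of-freedom count above can be written out as a worked equation. This is a sketch: p (number of indicators) and t (number of free parameters) are symbols introduced here, and the breakdown of the 13 parameters into 8 loadings, 4 error variances and 1 exercise-factor correlation is an assumption read off the path diagram labels, not something the slide states.

```latex
\[
df \;=\; \frac{p(p+1)}{2} - t
   \;=\; \frac{4 \cdot 5}{2} - \underbrace{(8 + 4 + 1)}_{t \,=\, 13}
   \;=\; 10 - 13 \;=\; -3
\]
```

With df < 0 there are more unknowns than unique variances and covariances, which is the same situation as solving X + Y = 6 for two unknowns.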

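[Path diagram: CDCE model for Achievement Motivation. Indicators x1 to x3 load on the Analysis Problem exercise factor (lx11, lx21, lx31), x4 loads on the Simulated In Basket exercise factor (lx42), and the indicators also load on the Achievement Motivation dimension factor (lx13, lx23, lx43). Measurement errors are labelled δ and the exercise factor correlation φ21]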

AM: Technical Problems

Simulated in Basket only measures one dimension (trait): Process Skills

Whereas Innovation, Energy and Process Skills are gauged with the Analysis Problem exercise

For basic CFA we need at least three indicators for each dimension.

However, if we have a single dimension and a single exercise effect we need a minimum of five indicators

This has DAC design implications if we want to gauge the measurement effect in addition to the dimension effects

A literature review by Lievens and Conway (2001) suggests a median of three exercises and five dimensions

AM: Option 2

[Path diagram: Achievement Motivation dimension factor and a Global Method Effect factor, both loading on the four indicators]

• Still not enough degrees of freedom; we need at least 5 indicators (10 possible sources of information, yet 12 parameters must be estimated, thus -2 df)

• SOLUTION: Include more exercises per dimension
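The "three indicators" and "five indicators" rules of thumb can be reproduced with a small counting sketch. It assumes factor variances fixed to 1, uncorrelated factors, and one loading per factor plus one error variance per indicator; the function names are illustrative and not part of the original analysis.

```python
def unique_moments(k):
    """Number of unique variances and covariances among k indicators."""
    return k * (k + 1) // 2


def degrees_of_freedom(k, factors_per_indicator):
    """df when every indicator loads on `factors_per_indicator` factors.

    Free parameters: one loading per factor per indicator, plus one
    error variance per indicator (assumed parameterisation).
    """
    free_parameters = k * factors_per_indicator + k
    return unique_moments(k) - free_parameters


def minimum_indicators(factors_per_indicator):
    """Smallest number of indicators for which df is not negative."""
    k = 1
    while degrees_of_freedom(k, factors_per_indicator) < 0:
        k += 1
    return k


if __name__ == "__main__":
    # Dimension factor only (basic CFA)
    print("dimension only:", minimum_indicators(1))
    # Dimension factor plus a single global method (exercise) factor
    print("dimension + method:", minimum_indicators(2))
    # Option 2 with the four available indicators
    print("df with 4 indicators:", degrees_of_freedom(4, 2))
```

Under these assumptions the count is necessary but not sufficient: a model with df of zero or more can still fail to converge, as the Financial Perspective example shows.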

Financial Perspective (FP)

DIMENSION: FINANCIAL PERSPECTIVE

TRAITS                    EXERCISES
                          ANALYSIS PROBLEM (AP)    GROUP DISCUSSION (GD)    ONE:ONE (ONE)    SIMULATED IN BASKET (SIB)
BROKER MARKET (BM)        BM_AP                    BM_GD                    BM_ONE           BM_SIB
CROSS UP SELLING (CUS)    CUS_AP                   CUS_GD                   CUS_ONE          CUS_SIB
PROFIT (PROF)             PROF_AP                  PROF_GD                  PROF_ONE         PROF_SIB

Correlation Matrix

LARGE CORRELATIONS BETWEEN EXERCISES

FP: CTCE

[Path diagram: Broker Market, Cross Up Selling and Profit dimension factors and Analysis Problem, Group Discussion, One:One and Simulated In Basket exercise factors, with indicators BM_AP, BM_GD, BM_ONE, BM_SIB, CUS_AP, CUS_GD, CUS_ONE, CUS_SIB, PROF_AP, PROF_GD, PROF_ONE, PROF_SIB]

FP:
• CDCE model did not converge although there were enough df (78 - 44 = 34 df)
• Singularity problems: chiefly because of multicollinearity
• Go back to the dimension level only, without exercise effects
• Thus only Broker Market, Cross Up Selling and Profit individually

CFA: Broker Market: FIT

CHI-SQUARE = 2.600 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.27247
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 2.428.

FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.942
BENTLER-BONETT NON-NORMED FIT INDEX = 0.954
COMPARATIVE FIT INDEX (CFI) = 0.985
BOLLEN'S (IFI) FIT INDEX = 0.986
MCDONALD'S (MFI) FIT INDEX = 0.997
JORESKOG-SORBOM'S GFI FIT INDEX = 0.988
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.938
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.013
STANDARDIZED RMR = 0.035
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.056
90% CONFIDENCE INTERVAL OF RMSEA ( 0.000, 0.217)

RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.607
RELIABILITY COEFFICIENT RHO = 0.613
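As a rough cross-check of the reported fit output, the chi-square probability and the RMSEA point estimate can be recomputed by hand. The sketch below is not EQS code. It uses the closed form of the chi-square survival function for 2 degrees of freedom, and the usual RMSEA point-estimate formula with N - 1 in the denominator. The sample size is not reported on these slides, so N = 97 is an assumption borrowed from the person summary later in the deck.

```python
import math


def chi_square_p_value_df2(chi_square):
    """Upper-tail probability of a chi-square statistic with exactly 2 df."""
    return math.exp(-chi_square / 2.0)


def rmsea(chi_square, df, n):
    """RMSEA point estimate; n is the sample size (an assumption here)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


if __name__ == "__main__":
    assumed_n = 97  # ASSUMPTION: not stated for the CFA sample
    reported = {"Broker Market": 2.600, "Cross Up Selling": 1.456, "Profit": 0.634}
    for name, chi_square in reported.items():
        p = chi_square_p_value_df2(chi_square)
        r = rmsea(chi_square, 2, assumed_n)
        print(f"{name}: p = {p:.3f}, RMSEA = {r:.3f}")
```

Whenever the chi-square is smaller than its degrees of freedom, the RMSEA point estimate is zero regardless of sample size, which is why a reported RMSEA of 0.000 says little on its own with only 2 df.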

CFA: BM: Parameter estimates

[Path diagram: Broker Market factor (1.0) with indicators BM_AP, BM_GD, BM_ONE, BM_SIB. Loadings as extracted: 0.41*, 0.53*, 0.64*, 0.61*. Error terms E1 to E4: 0.91, 0.85, 0.77, 0.80]

Figure X: EQS 6 broker market trait only. Chi Sq. = 2.60, P = 0.27, CFI = 0.98, RMSEA = 0.06

• Thus BM showed good fit and parameter estimates
• Broker Market Simulated in Basket was the best predictor of Broker Market
• All factor loadings were statistically significant (p<0.05)
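The loadings and error terms in the Broker Market diagram appear to be linked by the standardized-solution identity error path = sqrt(1 - loading squared). The sketch below checks that relation on the values extracted from the diagram. It assumes the figures are standardized estimates listed in matching order, which the slide does not state explicitly.

```python
import math

# Values as extracted from the Broker Market diagram (order as extracted)
loadings = [0.41, 0.53, 0.64, 0.61]
reported_error_paths = [0.91, 0.85, 0.77, 0.80]


def error_path(loading):
    """Standardized error path implied by a standardized loading."""
    return math.sqrt(1.0 - loading ** 2)


def explained_variance(loading):
    """Share of indicator variance explained by the factor."""
    return loading ** 2


if __name__ == "__main__":
    for loading, reported in zip(loadings, reported_error_paths):
        implied = error_path(loading)
        share = explained_variance(loading)
        print(f"loading {loading:.2f}: implied error {implied:.3f}, "
              f"reported {reported:.2f}, explained variance {share:.2f}")
```

Read this way, even the strongest indicator shares well under half of its variance with the Broker Market factor.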

CFA: CUS: FIT
• Problems with fit: BBNFI, IFI and reliability.
• ERROR MESSAGE IN EQS DUE TO SINGULARITY OF COVARIANCE MATRIX

CHI-SQUARE = 1.456 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.48280
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 1.391.

FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.972
BENTLER-BONETT NON-NORMED FIT INDEX = 1.036
COMPARATIVE FIT INDEX (CFI) = 1.000
BOLLEN'S (IFI) FIT INDEX = 1.011
MCDONALD'S (MFI) FIT INDEX = 1.003
JORESKOG-SORBOM'S GFI FIT INDEX = 0.993
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.964
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.010
STANDARDIZED RMR = 0.026
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.000
90% CONFIDENCE INTERVAL OF RMSEA ( 0.000, 0.183)

RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.630
RELIABILITY COEFFICIENT RHO = 0.635
MAXIMAL WEIGHTED INTERNAL CONSISTENCY RELIABILITY = 0.684

CFA: CUS: Parameter estimates

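[Path diagram: Cross Up Selling factor (1.00) with indicators CUS_AP, CUS_GD, CUS_ONE, CUS_SIB. Estimates as extracted from the diagram: first set 0.27, 0.36, 0.39, 0.29; second set 0.36, 0.28, 0.24, 0.10]

Figure X: EQS 6 cross up selling trait only. Chi Sq. = 1.46, P = 0.48, CFI = 1.00, RMSEA = 0.00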

• Indicators did not do that well this time.
• Best predictor was Group Discussion

CFA: PROFIT: FIT

CHI-SQUARE = 0.634 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.72820
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 0.621.

FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.988
BENTLER-BONETT NON-NORMED FIT INDEX = 1.090
COMPARATIVE FIT INDEX (CFI) = 1.000
BOLLEN'S (IFI) FIT INDEX = 1.027
MCDONALD'S (MFI) FIT INDEX = 1.007
JORESKOG-SORBOM'S GFI FIT INDEX = 0.997
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.984
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.007
STANDARDIZED RMR = 0.018
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.000
90% CONFIDENCE INTERVAL OF RMSEA ( 0.000, 0.143)

RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.633
RELIABILITY COEFFICIENT RHO = 0.642
MAXIMAL WEIGHTED INTERNAL CONSISTENCY RELIABILITY = 0.688
MAXIMAL RELIABILITY CAN

CFA: PROFIT: Parameter estimates

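[Path diagram: Profit factor (1.00) with indicators PROF_AP, PROF_GD, PROF_ONE, PROF_SIB. Estimates as extracted from the diagram: first set 0.26, 0.42, 0.34, 0.31; second set 0.37, 0.25, 0.26, 0.11]

Figure X: EQS 6 profit trait only. Chi Sq. = 0.63, P = 0.73, CFI = 1.00, RMSEA = 0.00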

• Group discussion is once again the best predictor

CFA: Three dimensions no Exercise effects

[Path diagram: Broker Market, Cross Up Selling and Profit factors, each measured by its Analysis Problem, Group Discussion, One:One and Simulated In Basket indicators, with error terms E1 to E12]

• The model did not work, and neither did a model with a single universal dimension

Conclusion

The Broker Market sub-dimension worked individually but not the Cross up Selling or Profit sub-dimensions

For this reason we cannot expect the combined CFA model, which incorporates all three dimensions, to work

Have to work out problems on sub-scale level first before moving on to global level

Because construct validity is lacking at the subscale level it does not make sense to look at the exercise effects

Must sort out construct validity on sub-scale level first

G-theory

Generalizability theory (G-theory) extends the framework of classical test theory in order to take into account the multiple sources of variability that can have an effect on test scores (Lynch & McNamara, 1999)

In a DAC the following sources of variance are often considered:
• Person
• Exercise
• Dimension
• Person*Dimension interaction (cross-situational specificity)
• Person*Exercise interaction (low construct validity)
• Dimension*Exercise interaction (observability of a particular dimension)

A G-study is then designed to estimate the relative effects of these facets on test performance data.

The overall index of reliability (similar to Cronbach's coefficient alpha) is expressed as the phi (Φ) coefficient, which is also generally referred to as “an index of dependability”
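For reference, the Φ coefficient can be written out for one common design. The sketch below assumes a fully crossed person × dimension × exercise design with both facets random, which the slides do not specify. The symbols n_d and n_e (number of dimensions and exercises averaged over) are introduced here for illustration.

```latex
\[
\Phi \;=\; \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{\Delta}},
\qquad
\sigma^2_{\Delta} \;=\;
\frac{\sigma^2_{d}}{n_d} + \frac{\sigma^2_{e}}{n_e}
+ \frac{\sigma^2_{pd}}{n_d} + \frac{\sigma^2_{pe}}{n_e}
+ \frac{\sigma^2_{de}}{n_d\, n_e}
+ \frac{\sigma^2_{pde,\varepsilon}}{n_d\, n_e}
\]
```

Here the absolute error variance collects every component other than the person effect, so large exercise or person*exercise components pull Φ down directly.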

Meaning of different sources of variance in DAC

• Dimension effect: variance in ratings attributed to certain dimensions, i.e. certain dimensions receiving higher/lower ratings compared to others
• Person effect: general performance factor of persons
• Exercise effect: certain exercises overall receive higher/lower ratings in comparison with others
• Person*Dimension effect: amount of variance attributed to a person's performance on a dimension across exercises; this is indicative of cross-situational specificity
• Person*Exercise effect: amount of variance attributed to a person receiving a high/low rating on certain exercises regardless of the dimension being measured
• Dimension*Exercise effect: amount of variance attributed to a specific dimension being measured in a specific exercise; referred to as observability of a particular dimension

Construct Validity

G-study construct validity: person, dimension and person*dimension variance must collectively exceed exercise and person*exercise variance
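The decision rule above can be sketched as a simple comparison of summed variance components. The component values in the example are purely illustrative and are not taken from the DAC data; the function and key names are hypothetical.

```python
def dimension_side(components):
    """Sum of person, dimension and person*dimension variance."""
    return (components["person"]
            + components["dimension"]
            + components["person*dimension"])


def exercise_side(components):
    """Sum of exercise and person*exercise variance."""
    return components["exercise"] + components["person*exercise"]


def supports_construct_validity(components):
    """True when dimension-related variance exceeds exercise-related variance."""
    return dimension_side(components) > exercise_side(components)


if __name__ == "__main__":
    # ILLUSTRATIVE values only, not results from the presentation
    example = {
        "person": 0.20,
        "dimension": 0.05,
        "person*dimension": 0.10,
        "exercise": 0.15,
        "person*exercise": 0.30,
    }
    print("dimension side:", round(dimension_side(example), 2))
    print("exercise side:", round(exercise_side(example), 2))
    print("supports construct validity:", supports_construct_validity(example))
```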

Consider a practical DAC example with G-theory

N=372

Nine dimensions, mostly with two exercises each: Simulated In-Basket and Role Play

A Practical example

Dimension (exercise columns: SIB, Role Play, Interview)
Change Orientation ✓ ✓
Communication ✓
Customer Service Orientation ✓ ✓
Interpersonal Interaction ✓ ✓
Planning & Organizing ✓ ✓
Problem Analysis & Decision-making ✓ ✓
Self-Management ✓ ✓
Team Management ✓ ✓

A Practical example: Variance Components for entire DAC

A Practical example: Important note

In SPSS: For the ANOVA and MINQUE methods, negative variance component estimates may occur. Some possible reasons for their occurrence are: (a) the specified model is not the correct model, or (b) the true value of the variance equals zero

In light of the foregoing example:

Variance attributed to exercise effects (.322) > variance attributed to person effects (.108)

This finding seems to be in line with Lance et al.'s (2004) contention that method effects are three times larger than trait effects

In the current example, 2.9 times more variance was explained by exercise effects than by dimension effects.

A Practical example: Variance Components for selected dimensions

However, could it be that the G-study on the entire DAC ironed out some of the robust dimension effects on the sub-dimension level?

I.e. are we throwing out the good with the bad?

To investigate the relative contribution of each dimension to the overall G-coefficient – one could conduct forward G-analysis on the individual dimension level

However, when we calculate the Φ coefficient on subscale level, there will be no variance component for dimension, dimension*exercise, dimension*person, or dimension*person*exercise effect

The biggest problem with the approach is that it cannot compare person*dimension variance with person*exercise variance, since no person*dimension variance component is generated

However it is still possible to compare person variance with person*exercise variance

A Practical example: Variance of communication

A Practical example: Variance of Team Management

Final Verdict: G-study and DAC

Investigate dimensions individually to assess contribution of different sources

Poorly designed dimensions may inflate observed variance attributed to exercise, exercise by dimension, and exercise by person effects

The way G-studies are conducted has design implications for DAC: all vs some approach to design

IRT ANALYSIS

Previously we noted:

Recently there have been two schools of thought on assessing the construct validity of ACs: Confirmatory Factor Analysis (CFA) [MTMM] and Generalizability Theory

Fairly new area: IRT modeling with interval data

Consider Achievement Motivation discussed earlier

IRT Approach

The logistic model dictates that a respondent's response to an item should depend on two parameters only:
• Difficulty of endorsing the item (item location parameter)
• Standing of the respondent on the latent trait (person location parameter)

The expectation is that persons with a higher standing on the latent trait should have a higher probability of endorsing a particular item than persons with a lower standing on the same trait

This is a key requirement for DAC since the central aim is to discriminate between persons who are low and high on the trait (dimension).

Deviations from these expectations might suggest that the DAC exercises are not operating as expected
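The two-parameter dependence described above can be illustrated with the simplest logistic IRT model. The sketch uses the dichotomous Rasch form as a simplification; the DAC ratings themselves are polytomous, so this is an illustration of the principle and not the model fitted in the presentation. Function and argument names are hypothetical.

```python
import math


def endorsement_probability(person_location, item_location):
    """Rasch probability of endorsing an item.

    Depends only on the difference between the person location and the
    item location, both expressed in logits.
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


if __name__ == "__main__":
    item_location = 0.0  # illustrative item difficulty in logits
    for person_location in (-2.0, -1.0, 0.0, 1.0, 2.0):
        p = endorsement_probability(person_location, item_location)
        print(f"person at {person_location:+.1f} logits: P(endorse) = {p:.2f}")
```

The probability rises monotonically with the person location, which is the ordering that an exercise must reproduce if it is to discriminate between candidates low and high on the dimension.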

Rating scale

The current DAC was rated on a 5-point response scale with non-integer values (i.e. decimal values)

Common wisdom: more response categories = a more reliable measure that resembles interval data.

However, it remains to be seen whether raters actually distinguish between the response categories.

It is expected that the thresholds between the 5 response categories will be sequentially ordered along the latent trait

We can examine the graphed category response functions to see whether the 4 thresholds are ordered, i.e. whether each response category becomes the modal category at some point on the latent trait continuum
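The modal-category check can be sketched numerically. The example below assumes an Andrich rating scale parameterisation, which the slides do not name, and uses made-up thresholds to contrast an ordered and a disordered case. Categories are indexed 0 to 4 and all names are illustrative.

```python
import math


def category_probabilities(theta, item_location, thresholds):
    """Category probabilities under a rating scale model (assumed form)."""
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - item_location - tau))
    peak = max(cumulative)  # subtract the peak for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(item_location, thresholds, grid):
    """Categories that are the most probable somewhere on the grid."""
    modal = set()
    for theta in grid:
        probs = category_probabilities(theta, item_location, thresholds)
        modal.add(probs.index(max(probs)))
    return sorted(modal)


if __name__ == "__main__":
    grid = [x / 10 for x in range(-100, 101)]  # -10 to +10 logits
    # ILLUSTRATIVE thresholds, not estimates from the DAC data
    print("ordered:", modal_categories(0.0, [-2.0, -0.5, 0.5, 2.0], grid))
    print("disordered:", modal_categories(0.0, [-2.0, 1.0, -1.0, 2.0], grid))
```

With disordered thresholds one category is never the most probable response anywhere on the continuum, which is the pattern to look for in the empirical category plots.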

Empirical response categories for IN_AP

Empirical response categories for EN_AP

Empirical response categories for PS_AP

Empirical response categories for PS_SIB

Empirical response categories

ITEM DIFFICULTY MEASURE OF -1.13 ADDED TO MEASURES
-------------------------------------------------------------------
|CATEGORY   OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY|
|LABEL SCORE COUNT %|AVRGE EXPECT|  MNSQ  MNSQ||CALIBRATN| MEASURE|
|-------------------+------------+------------++---------+--------|
|  1   1       1   1| -4.95 -4.41|   .29   .10||  NONE   |( -9.04)| 1
|  2   2      27  54|  -.60  -.43|   .55   .62||  -6.82  |  -3.31 | 2
|  3   3      15  30|  1.89  1.67|   .72   .42||   2.46  |   2.28 | 3
|  4   4       7  14|  3.55  3.31|   .68   .60||   4.36  |(  4.43)| 4
|  5   1   1        |            |            ||  NONE   |        | 5
-------------------------------------------------------------------
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.

Response Scales

What we see here is that although there are supposed to be 5 response categories, raters effectively make use of three response categories when rating PEDRs

Furthermore, person reliability is not very good. This index estimates the confidence we have that people will be placed in the same rank order when exposed to the Achievement Motivation DAC again

This is similar to the person*dimension effect in G-studies

Fit Statistics

SUMMARY OF 97 MEASURED (NON-EXTREME) PERSON
-------------------------------------------------------------------------------
|          TOTAL                         MODEL         INFIT        OUTFIT    |
|          SCORE     COUNT     MEASURE   ERROR      MNSQ   ZSTD   MNSQ   ZSTD |
|-----------------------------------------------------------------------------|
| MEAN      34.6       8.0         .66     .64      1.01    -.2   1.05    -.1 |
| S.D.       6.0        .0        2.32     .16       .94    1.3   1.29    1.3 |
| MAX.      48.0       8.0        5.09    1.29      5.40    3.4   8.32    4.2 |
| MIN.      19.0       8.0       -8.77     .45       .10   -2.5    .08   -2.7 |
|-----------------------------------------------------------------------------|
| REAL RMSE    .76 TRUE SD    2.19  SEPARATION  2.88  PERSON RELIABILITY  .89 |
|MODEL RMSE    .66 TRUE SD    2.22  SEPARATION  3.39  PERSON RELIABILITY  .92 |
| S.E. OF PERSON MEAN = .24                                                   |
-------------------------------------------------------------------------------
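The separation and person reliability figures in the summary follow from the TRUE SD and RMSE columns. The sketch below uses the standard Rasch relations (separation = true SD / RMSE, reliability = separation squared / (1 + separation squared)). Small differences from the printed values are to be expected because the inputs are rounded to two decimals.

```python
def separation(true_sd, rmse):
    """Person separation index: 'true' spread relative to measurement error."""
    return true_sd / rmse


def reliability(true_sd, rmse):
    """Person reliability: true variance as a share of observed variance."""
    g = separation(true_sd, rmse)
    return g * g / (1.0 + g * g)


if __name__ == "__main__":
    # (TRUE SD, RMSE) pairs as printed in the person summary table
    rows = {"REAL": (2.19, 0.76), "MODEL": (2.22, 0.66)}
    for label, (true_sd, rmse) in rows.items():
        print(f"{label}: separation = {separation(true_sd, rmse):.2f}, "
              f"reliability = {reliability(true_sd, rmse):.2f}")
```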

Fit Statistics: PERSON AND ITEM PARAMETERS

ITEM STATISTICS: MISFIT ORDER
--------------------------------------------------------------------------------------------------
|ENTRY   TOTAL  TOTAL           MODEL|   INFIT  |  OUTFIT  |PT-MEASURE |EXACT MATCH|             |
|NUMBER  SCORE  COUNT  MEASURE  S.E. |MNSQ  ZSTD|MNSQ  ZSTD|CORR.  EXP.| OBS%  EXP%| ITEM      G |
|------------------------------------+----------+----------+-----------+-----------+-------------|
|     8    590     98     -.24   .17 |1.48   2.7|2.45   4.3|A .61   .74| 53.6  64.1| TRANS_SIB 0 |
|     6    632     98     1.63   .14 |1.24   1.4|1.14    .6|C .80   .81| 62.9  62.0| TRANS_EN  0 |
|     7    545     98    -1.53   .14 | .95   -.3| .95   -.1|D .80   .78| 57.7  56.6| TRANS_PS  0 |
|     5    545     98    -1.53   .14 | .92   -.5| .83   -.6|d .81   .78| 58.8  56.6| TRANS_IN  0 |
|------------------------------------+----------+----------+-----------+-----------+-------------|
| MEAN   426.4   98.0      .00   .18 |1.00   -.1|1.05   -.1|           | 65.7  64.4|             |
| S.D.   154.5     .0     1.52   .03 | .29   1.9| .58   2.0|           |  8.2   5.4|             |
--------------------------------------------------------------------------------------------------

THUS from this Table we can see from the high ZSTD infit statistics that PS_SIB underestimates expected item scores

Expected Item Characteristic Curves: PS-SIB

Expected Item Characteristic Curves: EN_AP

Expected Item Characteristic Curves: PS_SIB

Validation problems of DAC’s

If the SEM approach is to be preferred: Empirical Considerations

At least 5 exercises per dimension for a uni-dimensional construct and a single exercise effect

If the 1DCE approach is used with multiple sub-dimensions, then at least 3 exercises per sub-dimension are needed

Multiple raters for each dimension

Sample size > 150

Minimum of a 5-point rating scale

Validation problems of DAC’s Substantive considerations:

Theoretical underpinnings of DAC dimensions

Are we really measuring more than fluid intelligence (g) in DAC’s?

Have we considered discriminant and convergent validity outside the MTMM doctrine: Cross-validation with paper & pencil measures?

Rater calibration: Higher inter-rater agreement at the expense of restriction of range and construct validity

PEDRs lie at the heart of the problem: What are we rating?

[Diagram: Competency potential, Competence and Observable Behaviour, each linked to PEDRs with a question mark]

PEDRs lie at the heart of the problem: What are we rating?

If we are proposing to measure competency potential, would it not be better to use paper & pencil measures with more control (standardisation) and objectivity?

When designing exercises to measure AC dimensions – what is the constitutive meaning of the proposed dimensions? “Creative thinking & Entrepreneuric Energy”

Why not cross-validate AC constructs with “known constructs”?

For example: Empowering Leadership (DAC) – Transformational leadership (Bass & Avolio, 1995).

Rating calibration: Guidelines vs Rules!

More variance in PEDRs when raters are given more discretion (i.e. guidelines not rules)

PEDRs lie at the heart of the problem: What are we rating?

Exercises:
• Uni-dimensionality is paramount
• Avoid conglomeration of constructs when designing exercises
• Be adamant about micro measurement through thoroughly designed scoring reports
• Attach a scoring scale to each elicited behaviour
• Can raters list all observable behaviours without guidance?

Finally: Is DAC a new science? OR can we apply some known psychometric truths to DAC, or is “behaviour too complex to measure”?

Legislative Pitfalls !!

LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS

EEA implications:

The use of psychometric tests in South Africa is monitored and guided by the Employment Equity Act (Republic of South Africa, 1998), which prohibits the use of psychological tests unless it can be shown that the tests are valid and not biased against any employee or group (i.e., without measurement bias)

In paragraph 8 of the Employment Equity Act (Republic of South Africa, 1998, p. 16) this position is reiterated and qualified by stating:

Psychological testing and other similar assessments of an employee are prohibited unless the test or assessment being used:

a) has been scientifically shown to be valid and reliable;
b) can be applied fairly to all employees;
c) is not biased against any employee or group.

Legislative Pitfalls !!

According to the main propositions of the EEA, the users of psychometric tests are obliged to provide evidence that their selection processes adhere to the Act.

THUS, whenever allegations of discrimination are advanced, the burden of proof shifts to the employer to demonstrate the job-relatedness of the selection procedure and that the inferences derived from the predictor scores are fair.

This interpretation is reinforced in Chapter II of the EEA under the heading “burden of proof”, paragraph 11: Whenever unfair discrimination is alleged in terms of the Act, the employer against whom the allegation is made must establish that it is fair

Is it possible to immunize oneself from EEA legislation by claiming to use DAC for developmental vs. Selective purposes?

Ultimately, developmental DAC can still be discriminating unfairly, especially in promotional practices

In an effort to avoid legislation: Make sure to get psychometric “INTEL” on DAC

LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS
