Short form of the Wilde Intelligenztest: Psychometric qualities and … · 2016-07-28 ·...

Academic year 2015 – 2016

Second-semester examination period

Short form of the Wilde Intelligenztest:

Psychometric qualities and utility of a 12-minute intelligence

test for personnel selection

Master’s thesis II submitted to obtain the degree of Master of Science in

Psychology, specialization Industrial Psychology and Personnel Management

by

Louis Lippens

Supervisor: Prof. dr. Filip Lievens

Advisor: Drs. Jan Corstjens

The undersigned, Louis Lippens, grants permission for third parties to consult this

thesis.

Louis Lippens

PREFACE

Many fellow students will nod in agreement when I say that a great deal of work goes

into this final piece. Personally, I can look back on the end result with satisfaction.

It was a process of learning and personal development. On the one hand, I gained more

insight into the concept of ‘intelligence’, its psychometric foundations and its

applicability in practice. On the other hand, I took on the challenge of writing my

master’s thesis in English, which greatly sharpened my language

skills.

This master’s thesis came about with the help of many. First and foremost, my

supervisor, Professor Filip Lievens, and my thesis advisor, Jan Corstjens, made it possible

to take on this fascinating subject. Furthermore, more than

three hundred bachelor students, as part of their curriculum and under the guidance

of several doctoral students, ensured that I had usable data at

hand to conduct the research. For this I am very grateful.

In addition, I would like to thank my parents for their support during my five-year

trajectory. They were always ready to assist me and push me forward, before,

during and after examination periods. My girlfriend, Anna, has also been a true support

throughout the years. We both successfully started and completed the psychology

programme. She helped me in particular with the finishing touches.

Finally, I sincerely hope that everyone who reads this work and is interested in the subject of

this master’s thesis, intelligence or shortened test forms can draw useful information from

its contents.

ABSTRACT

Since the early years of test development, researchers have explored numerous ways to

shorten tests – a trend that has recently been revived (Schipolowski, Schroeders, &

Wilhelm, 2014). Yet, many of these abbreviated forms are insufficiently researched and

validated (Smith, Combs, & Pearson, 2012). The current study is an inquiry into the

psychometric qualities of the WIT-S, a short form of the Wilde Intelligenztest (Jäger &

Althoff, 1983) for personnel selection, where time-efficiency and the ability to predict

an applicant’s future job performance are two crucial elements (Klimoski & Jones,

2008; Schmidt & Hunter, 2004). A total of 318 Dutch-speaking participants between 18 and

59 years of age took part in the research and completed a large-scale test battery. In

accordance with the literature, male, younger and more highly educated test takers achieve

better scores on the WIT-S on average. Moreover, the items of the WIT-S are sufficiently interrelated as

the composite scale reaches satisfactory reliability levels. The convergence of the

WIT-S with a full-length measure of general intelligence and divergence with an

emotion recognition test furthermore suggest that the short form measures a general

factor of intelligence. However, the current research is not able to show that each item

of the WIT-S attains an adequate difficulty and discriminatory level, that the WIT-S

preserves the factor structure of a multifactorial measure and that the subscales of the

WIT-S reach adequate levels of reliability. Future research on the reliability and latent

factor structure of the WIT-S should be able to provide more conclusive findings about

its psychometric qualities and hence usefulness in practice.

Key words: intelligence, short form, reliability, validity, personnel selection

SUMMARY

Since the very beginning of test development, researchers have explored various ways

to shorten tests – a trend that has recently been revived

(Schipolowski, Schroeders, & Wilhelm, 2014). However, many of these shortened

test forms are insufficiently researched and validated (Smith, Combs, & Pearson, 2012).

The current research explores the psychometric qualities of the WIT-S, a

short form of the Wilde Intelligenztest (Jäger & Althoff, 1983) for

personnel selection, where time efficiency and the ability to predict the future performance

of an applicant are two crucial elements (Klimoski &

Jones, 2008; Schmidt & Hunter, 2004). A total of 318 Dutch-speaking participants between 18 and

59 years of age took part in the research and completed a large-scale test battery.

In line with the literature, male, younger and more highly educated

participants achieve better scores on the WIT-S on average. In addition, the items of

the WIT-S are sufficiently interrelated, as the composite test scale reaches adequate

reliability levels. Furthermore, the WIT-S converges with a comprehensive

intelligence test and diverges from an emotion recognition test, which suggests

that the short test measures a general intelligence factor. By contrast, the

current research cannot show that every test item reaches an appropriate level of difficulty and

discrimination, that the WIT-S preserves the factor structure of a multifactorial

instrument and that the subscales of the WIT-S attain adequate

reliability levels. Future research on the reliability and

latent factor structure of the WIT-S should be able to provide more conclusive findings about the

psychometric qualities of the test and, consequently, about

the usability of the WIT-S in practice.

TABLE OF CONTENTS

PREFACE
ABSTRACT
SUMMARY
INTRODUCTION
    The concept of intelligence
    Leading psychometric theories of intelligence
        General intelligence (g)
        Primary Mental Abilities
        The three-stratum theory: Cattell-Horn-Carroll
        The g-VPR theory
    Psychometric fundamentals of the current research
        Modified Model of Primary Mental Abilities
        MMPMA as an alternative
    Measurements of intelligence
        Major long-form intelligence tests
        Raven Progressive Matrices
        Short-form measurements of intelligence
            Clinical short-form measurements
            Baddeley’s and Wonderlic’s short form
            Short form of the Wilde Intelligenztest
    Intelligence testing in personnel selection
        The importance of intelligence
        Intuition in recruitment and selection
    Hypotheses
        Test score differentiation
        Items and their latent factors
        Reliability and construct validity
METHOD
    Sample
    Measures
        Short form of the Wilde Intelligenztest
        Geneva Emotion Recognition Test
        Kaufman Adolescent and Adult Intelligence Test
    Data analysis
        Item analysis
        Factor analysis
        Reliability analysis
        Construct validity
RESULTS
    Descriptive statistics
    Analysis of the WIT-S items
    Factor structure of the WIT-S
        Exploratory factor analysis
        Confirmatory factor analysis
    Reliability of the WIT-S
    Construct validity of the WIT-S
DISCUSSION
    General findings and implications
        Test score differentiation
        Items and their latent factors
        Reliability and construct validity
    Limitations
    Suggestions for future research
    Conclusion
REFERENCES

Running head: PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 1

In recent years, a trend has emerged of developing short form measures,

including abbreviated intelligence tests, for both research purposes and practical

applications (Schipolowski, Schroeders, & Wilhelm, 2014; Smith, McCarthy, &

Anderson, 2000). The rationale for this trend is threefold. Firstly, the administration of

full-length test batteries in large-scale assessments is undoubtedly costly. Secondly,

measuring intelligence in a selection context is highly valuable as it is one of the best

predictors of job performance and job training outcomes (Schmidt & Hunter, 1998).

Thirdly, time is a critical element, especially in personnel selection (Klimoski & Jones,

2008). To illustrate the latter, imagine a company that wants to hire the best candidate

out of 30 applicants. Estimating every applicant’s IQ by means of a well-constructed

and validated intelligence test, such as the Kaufman Adolescent and Adult Intelligence

Test (Kaufman, 1990), will require more than 30 hours to prepare, administer and score.

Consequently, there is insufficient time to justify the administration of long-form

measures and the recruiters of the company fall back on the convenient unstructured job

interview, or even worse, intuition (Highhouse, 2008).
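As a rough illustration of the time argument above, the following Python sketch compares the total administration time for 30 applicants under a long-form and a short-form test. The figure of 60 minutes per applicant for the long form is an assumption made only for this example (the text states merely that the total exceeds 30 hours), tests are assumed to be administered one applicant at a time, and preparation and scoring are ignored. The function name is hypothetical; the 12 minutes for the short form are taken from the title of this thesis.

```python
def total_testing_hours(n_applicants: int, minutes_per_applicant: float) -> float:
    """Total administration time in hours, ignoring preparation and scoring."""
    return n_applicants * minutes_per_applicant / 60


N_APPLICANTS = 30        # from the example in the text
LONG_FORM_MINUTES = 60   # assumption: roughly one hour per applicant
SHORT_FORM_MINUTES = 12  # the 12-minute short form studied here

long_hours = total_testing_hours(N_APPLICANTS, LONG_FORM_MINUTES)
short_hours = total_testing_hours(N_APPLICANTS, SHORT_FORM_MINUTES)
print(f"long form: {long_hours:.1f} h, short form: {short_hours:.1f} h")
```

Under these assumptions the short form reduces the administration time to one fifth of the long-form time, which is the practical argument the paragraph makes.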

However, research on short-form intelligence tests is easily

outnumbered by research on shortened versions of, for instance, personality measures

(Kruyen, Emons, & Sijtsma, 2013). Moreover, a greater number of short-form

intelligence tests have been validated and used in a clinical context than in organizational

settings (see i.a. Donnell, Pliskin, Holdnack, Axelrod, & Randolph, 2007; Jeyakumar,

Warriner, Raval, & Ahmad, 2004; Wymer, 2003). Therefore, the need arises to explore

novel, shortened versions of intelligence measures. Of course, this cannot be done

overnight: a substantial amount of in-depth psychometric research should be conducted

on the validity and reliability of any short-form measure, especially when the measure

will be used to make selection decisions on the individual level (Nunnally & Bernstein,

1994; Ziegler, Kemper, & Kruyen, 2014). Yet past research has shown that, in

many cases, insufficient validation has been carried out when short forms were

developed (Smith et al., 2000; Smith, Combs, & Pearson, 2012).

The current research is an inquiry into the psychometric qualities of a short form

of the Wilde Intelligenztest (WIT-S) for personnel selection. It should answer

the question of whether the WIT-S is able to function as a valid and reliable

alternative to the time-consuming, full-scale intelligence tests.

INTRODUCTION

The concept of intelligence

“For though all men are, as we trust and believe, capable of the divine faculty of

reason […] yet it is not to all that the heavenly beam is disclosed in its splendour”

(Jenner, 1807, p. 1). People have tried to comprehend and define the concept of

intelligence numerous times and in many different ways. More than a century ago,

academics were already struggling to understand what intelligence exactly comprised.

Galton (1869) was convinced that sensory discrimination, e.g. being able to

differentiate between heat and cold, underlay the construct of intelligence. Spearman

(1904), on the other hand, placed much more emphasis on educational achievement.

Individuals who made the most of their education were, in his view, seen as more

intelligent. In 1921, seventeen leading investigators on the topic of intelligence

exchanged their views on the construct and what they conceived it to be, but did not

agree on a general and comprehensive conceptualization of intelligence (Thorndike et

al., 1921). One of the first definitions was proposed by Boring (1923), two years later,

during a debate on the true meaning of intelligence. He defined it as “the capacity to do

well in an intelligence test” because, after all, “intelligence is what the tests test”

(Boring, 1923, p. 35). This definition, however, is circular and does not at all explain

what intelligence is assumed to be exactly.

Decades later, Wechsler (1975) contributed to a renewed debate and defined

intelligence as “the capacity of an individual to understand the world about him and his

resourcefulness to cope with its challenges” (p. 139). In his article he clearly stressed

the complexity of the concept of intelligence and its stratification. Resnick (1976) also

revived the interest in the nature of intelligence during a symposium by asking

colleagues what intelligence actually is, which she also cited in her book ‘The Nature of

Intelligence’. Neisser (1979) later addressed Resnick’s (1976) question by positing that

intelligence could not be explicitly defined. According to him, no definite criteria of

intelligence exist, and thus a process-based definition of intelligence does not either. In

1996, he and a few of his colleagues did however try to get a grasp of the

conceptualization of intelligence. In a response to the controversial publication ‘The

bell curve’ by Herrnstein and Murray (1994), Neisser et al. (1996) stated that by

conceptualizing intelligence, one “attempts to clarify the ability of individuals to


understand complex ideas, to adapt effectively to the environment, to learn from

experience, to engage in various forms of reasoning, and to overcome obstacles by

taking thought” (p. 77). Also in the aftermath of ‘The bell curve’, Gottfredson (1997a)

defined intelligence analogously as “a very general mental capability that, among other

things, involves the ability to reason, plan, solve problems, think abstractly,

comprehend complex ideas, learn quickly and learn from experience” (p. 13). To date,

this working definition is still used by renowned psychologists, for example in the

review by Nisbett et al. (2012) on the recent findings and developments in the field of

intelligence.

If we take a look at the definitions of Wechsler (1975), Neisser et al. (1996) and

Gottfredson (1997a) together, we see three elements consistently recur. First of all,

the ability to understand the complexity of ideas. Secondly, the ability to adapt to our

environment and learn from it. And thirdly, the ability to overcome the obstacles,

problems and challenges one faces by reasoning and thought processing. This gives us a

general idea of the various elements and the operationalization of the construct.

Although looking for commonalities in existing definitions can be valuable, the fact that

the definitions mentioned are working definitions and not universally accepted

definitions of intelligence by (industrial and organizational) psychologists is

problematic (Scherbaum, Goldstein, Yusko, Ryan, & Hanges, 2012). A coherent,

agreed-upon definition is needed to clearly understand the complex concept itself and

subsequently to be able to construct valid measurements of intelligence, yet it cannot be

offered at present.

However, instead of losing ourselves in the discussion on how to operationally

conceptualize and define intelligence, we should rather look further into the

psychometric theories – the fundamentals that underlie modern concepts of intelligence.

Many researchers debate whether the psychometric approach to intelligence is

ultimately the best approach, though it is well regarded and

widely accepted (Scherbaum et al., 2012). These fundamentals will furthermore enable

us to understand how intelligence is structured, measured and used in practice.

Leading psychometric theories of intelligence

General intelligence (g). The theory of general intelligence is one of the first

theories of intelligence (Spearman, 1904; Spearman, 1927). The g-model presumes that


the individual differences in cognitive performance are due to just one factor of

cognitive ability, namely ‘general intelligence’ or ‘g’. For the most part, the theory and

existence of g are supported by the well-documented fact that there are positive

correlations between almost all cognitive ability tests, which academics also call the

positive manifold (Gottfredson, 1997b; Jensen, 1998). The above led to the assumption

that there should be a general factor of intelligence underlying all of these tests.
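The link between a single general factor and the positive manifold can be illustrated with a small simulation. The sketch below is not an analysis from this study: the six factor loadings, the sample size and the random seed are hypothetical values chosen only for the example. Each simulated test score is a weighted sum of one common factor and a test-specific component, after which all pairwise correlations are inspected.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 5000, 6
# Hypothetical loadings of six tests on one common factor (assumption).
loadings = np.array([0.8, 0.7, 0.6, 0.6, 0.5, 0.4])

g = rng.standard_normal(n_people)                  # common factor
unique = rng.standard_normal((n_people, n_tests))  # test-specific parts
# Each test has unit variance: loading^2 from g, the rest unique.
scores = g[:, None] * loadings + unique * np.sqrt(1 - loadings**2)

corr = np.corrcoef(scores, rowvar=False)
off_diag = corr[~np.eye(n_tests, dtype=bool)]
print("smallest correlation between two tests:", round(off_diag.min(), 2))

# Share of total variance captured by the first principal component.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]
share_first = eigenvalues[0] / eigenvalues.sum()
print("variance share of first component:", round(share_first, 2))
```

In this setup every pair of tests correlates positively because all tests share the same factor. The reverse inference, from a positive manifold to a single underlying factor, does not follow automatically from such a simulation.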

Primary Mental Abilities. Not long after its nascence, the theory of general

intelligence was contested by Thurstone (1938). Rather than assuming that there was only

one single dimension of intelligence, he demonstrated that there were several,

statistically independent primary abilities (Thurstone, 1938; Thurstone & Thurstone,

1941). These Primary Mental Abilities (PMA) are ‘Spatial Reasoning’, ‘Perceptual

Speed’, ‘Number Facility’, ‘Verbal Relations’, ‘Word Fluency’, ‘Memory’, and

‘Inductive Reasoning’. Individuals presumably differ in how well

they perform on these abilities. However, more recent research (Abad, Colom,

Juan-Espinosa, & Garcia, 2003; Detterman & Daniel, 1989; Legree, Pifer, & Grafton, 1996)

indicated that these effects were more likely due to restriction of range, as the

participants in the original study by Thurstone (1938) were part of a high-ability group,

making it harder to find strong scientific evidence for a common g-factor. The cognitive

differentiation found in the high-ability group was probably partially the outcome of the

urge to specialize in current society; acquiring a high level of expert knowledge

consumes time and effort and catalyses differentiation of cognitive abilities (Ericsson,

Charness, Feltovich, & Hoffman, 2006).
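The restriction-of-range argument can be made concrete with a simulation. The sketch below is a simplified illustration under stated assumptions, not a reanalysis of Thurstone’s data: two tests load .7 on a common factor, and a hypothetical third test (loading .8) is used to select the top 20% of the sample. All loadings, the cut-off and the seed are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
g = rng.standard_normal(n)  # common factor


def make_test(loading: float) -> np.ndarray:
    """Unit-variance test score with the given loading on g (hypothetical helper)."""
    return loading * g + np.sqrt(1 - loading**2) * rng.standard_normal(n)


test_a = make_test(0.7)
test_b = make_test(0.7)
selection_test = make_test(0.8)  # assumption: used to form the high-ability group

r_full = np.corrcoef(test_a, test_b)[0, 1]

# Keep only the top 20% on the selection test.
high_ability = selection_test > np.quantile(selection_test, 0.8)
r_high = np.corrcoef(test_a[high_ability], test_b[high_ability])[0, 1]

print(f"full sample r = {r_full:.2f}, high-ability group r = {r_high:.2f}")
```

Because the selected group varies less on the common factor, the correlation between the two tests is attenuated there, which makes a general factor harder to detect even though it generated the data.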

The three-stratum theory: Cattell-Horn-Carroll. Another model and

psychometrically substantiated theory, originally developed by Cattell (1963) and Horn

(1965a) as the Gf-Gc Model (Horn & Cattell, 1966), and later modified by Carroll

(1993), is the Cattell-Horn-Carroll (CHC) hierarchical model with three strata or levels.

The first stratum consists of narrowly defined abilities, e.g. ‘Lexical Retrieval Ability’

and ‘Numerical Facility’. The second stratum comprises broader abilities, including

fluid intelligence (Gf), the ability to handle novel and uncommon problems and issues,

and crystallized intelligence (Gc), the ability to handle common problems and issues by

bringing one’s acquired knowledge into practice. The third stratum consists of the

general intelligence factor g.


The main scientific evidence in favour of the differentiation between Gc and Gf,

as opposed to the unitary g-model, is the distinction between consciously

controlled and automatic, unconscious memory systems (Jacoby, Toth, & Yonelinas,

1993), and the difference in test performance depending on age. The performance on

tests focussing on Gf rises in childhood and adolescence, but then starts declining with

age in adulthood and onward (Horn & Cattell, 1967; Salthouse, 1996). This decline sets in

much earlier than the decline on tests focussing on Gc, where performance keeps on increasing

with age until late adulthood before it stagnates and eventually declines (Ackerman,

1996; Blair, 2006; Horn & Cattell, 1967). Findings on the neurobiological level also

support this differentiation as (a) severe damage to the prefrontal cortex (PFC) impairs

performance on Gf-based but not on Gc-based tests and (b) the PFC degenerates more

rapidly with age than does the rest of the cerebral cortex (Blair, 2006; Nisbett et al.,

2012). Finally, general gains in scores on Gf-based tests (e.g. Raven’s Progressive

Matrices; Raven, 1938) have substantially exceeded gains in Gc-based tests in the past

decades (Flynn, 2007).

The g-VPR Theory. Johnson and Bouchard (2005a) recently suggested a four-

stratum model. The first stratum consists of primary cognitive ability traits. The second

stratum consists of narrowly defined abilities – similar to the first stratum of the CHC-

model. The third stratum consists of a verbal, perceptual and rotational ability. The

fourth and also last stratum consists of the general intelligence factor g – similar to the

third stratum of the CHC-model. The principal psychometric evidence in favour of the

g-VPR model is the superior fit of the model when it was compared with other

psychometric models of intelligence on the basis of several large-scale data sets

(Johnson et al., 2007; Johnson, Nijenhuis, & Bouchard, 2007), including the original

data used by Thurstone and Thurstone (1941) to support their model of primary mental

abilities (Johnson & Bouchard, 2005b).

Psychometric fundamentals of the current research

Modified Model of Primary Mental Abilities. The psychometric model

operationalized in the current study is the ‘Modifiziertes Model der Primary Mental

Abilities’ or ‘Modified Model of Primary Mental Abilities’ (MMPMA) (Kersting,

Althoff, & Jäger, 2008). The model is an updated version of the model of Primary

Mental Abilities (Thurstone, 1938), adapted to the current Zeitgeist and theoretical and


methodological advances. Figure 1 represents a visual overview of the model. The first

stratum consists of specific ability tests such as ‘Working Efficiency’ and ‘Economic

Knowledge’. The second stratum, the heart of the model, comprises the primary mental

abilities. The third stratum consists of the higher-order factors ‘fluid intelligence’ (Gf)

and ‘crystallized intelligence’ (Gc), analogous to the eponymous second-stratum factors

in the CHC-model (Carroll, 1993) and the higher-order factors in the original Gf-Gc

model (Horn & Cattell, 1966).

Figure 1. Modified Model of Primary Mental Abilities based on Thurstone’s Model of

Primary Mental Abilities (1938) – constructed by Kersting et al. (2008).

Kersting et al. (2008) have made three important assumptions regarding their

model. First of all, ‘Reasoning’ (in their model also referred to as deductive reasoning)

is regarded as an operationalization of the mental abilities ‘Verbal Comprehension’,

‘Numeric Reasoning’ and ‘Spatial Reasoning’, which presumably pertain to the same content

domain. Secondly, the primary mental ability factors are considered to be part of a

hierarchical nested-factors model with higher-order (Gf and Gc) and lower-order

factors. Thirdly, fluid intelligence and working memory are hypothesized to be highly

overlapping. The third assumption is elucidated by the substantial amount of recent


research on working memory and its relation to intelligence. Seemingly, Gf can be

substantially improved by increasing one’s working memory capacity through training

(see i.a. Basak, Boot, Voss, & Kramer, 2008; Borella, Carretti, Riboldi, & De Beni,

2010; Jaeggi et al., 2010; Mackey, Hill, Stone, & Bunge, 2011). Yet, the question of

how exactly Gf and working memory relate to each other is complex and still partly

unsolved (Nisbett et al., 2012).

MMPMA as an alternative. The Modified Model of Primary Mental Abilities

builds on the fundamentals of two of the four leading psychometric models of

intelligence. Kersting et al. (2008) integrated the Gf-Gc distinction of the CHC-theory

and the differentiation in lower-order abilities of the PMA theory into one

comprehensive model. This raises the question of why this particular integrated

model would be a better fit – in a personnel selection context – than one of the four

leading psychometric models of intelligence.

As previously mentioned, Gottfredson (1997b) and Jensen (1998) demonstrated

the existence of a positive manifold of correlations between intelligence and cognitive

ability tests, which is evidence in favour of the model of general intelligence

(Spearman, 1904; Spearman, 1927). Yet, Van Der Maas and colleagues (2006) argue in

their study that this does not automatically imply the existence of just one single factor

of intelligence. Even Jensen (1998) himself notes that a distinction can be made

between intelligence as a learning ability (read: Gf) and intelligence as acquired

knowledge from learning (read: Gc), which is supported by research from Reeve and

Bonaccio (2011). This is in line with the findings that the psychometric fundamentals of

intelligence are hierarchically organized and that intelligence consists of multiple

dimensions (Carroll, 1993; Lang, Kersting, & Beauducel, 2016; McGrew, 2009; Reeve,

2004).

These findings do not, however, detract from the existence of a general factor at

the top of the hierarchy (see i.a. Lang et al., 2016; Lubinski, 2000). Dickens (2007)

explained the appearance and role of g in a multidimensional factor structure (e.g. as

premised by Carroll, 1993). He argues that “people who are better at any given

cognitive skill are more likely to end up in environments that cause them to develop a

wide range of skills” (Nisbett et al., 2012, p. 21). Consequently, a person who possesses


a high level of general intelligence is more likely to develop and possess a substantial

number of specific mental abilities and skills.

The current study focuses on intelligence in a personnel selection context (see

below). In this respect, Schneider, Joel, and Newman (2015) argue for a more

differentiated approach towards the concept of intelligence in the field of human

resources. They give the following reasons: (a) specific abilities

predict a statistically significantly larger amount of variance than does the unitary g-model,

(b) according to the ‘compatibility principle’ specific abilities predict specific

types of job performance, enabling organizations to take into account these

differentiations in hiring decisions, and (c) the disadvantage of minority groups in

intelligence testing (see i.a. Nisbett et al., 2012) could be compensated by the

implementation of specific cognitive abilities in the tests. Also, Goldstein, Zedeck, and

Goldstein (2002) have shown that in order to predict job performance on the basis of

intelligence it is important to include more narrowly defined, lower-order cognitive

abilities, because a single predictor like g falls short.

One could furthermore argue that the presumably statistically superior g-VPR

model should be used (Johnson & Bouchard, 2005a). However, Hunt (2011) rightly

notes in his overview of human intelligence that statistical superiority should not be the

only criterion taken into account when selecting a suitable psychometric model; the

practical implications of the model used should also be considered. In his words, “It

makes sense [in personnel classification] to distinguish between candidates who do not

know but can learn (low Gc, high Gf) and those who already have the requisite

knowledge (high Gc)” (Hunt, 2011, p. 110). This argues in favour of a more CHC-oriented

model and hence the Modified Model of Primary Mental Abilities as an

alternative.

Measurements of intelligence

Major long-form intelligence tests. Two of the most widely known – if not the

most popular – maximum performance, individually administered intelligence tests are

the Wechsler Intelligence Scale for Children (WISC) and the Wechsler Adult

Intelligence Scale (WAIS) (Wechsler, 1949; Wechsler, 1955). They are both in their

fourth edition and are mainly used in educational and clinical settings. According to

Hunt (2011), Wechsler initially developed his intelligence tests from a more pragmatic


point of view rather than basing them on specific psychometric theories. As reported by

Williams, Weiss and Rolfhus (2003), the WISC-IV (English version) has strong internal

psychometric qualities with scale-reliability estimates ranging from .88 to .94 and

convergent scale-validity coefficients with the WISC-III ranging from .72 to .89. The

WAIS-IV-NL (Dutch translation) has strong psychometric qualities as well: split-half

reliability estimates of the subtests ranged from .75 to .97, test-retest correlations of the

full scale (with a mean retest interval of 4 months and 12 days) ranged from .94 to .95 and

alpha coefficient estimates for the full scale across all age groups equalled .97

(Pearson, 2012; Wechsler, 2008). Furthermore, regarding the construct validity of the

test, the total IQ scores of the WAIS-IV-NL strongly correlated (r = .90) with its

predecessor’s scores (WAIS-III-NL; Pearson, 2012). On average, the WISC takes 75

minutes to administer, whereas the WAIS takes about 90 minutes (Wechsler, 2008).

On the other hand, Kaufman (1990) based his Kaufman Adolescent and Adult

Intelligence Test (KAIT), another maximum performance individual intelligence test,

on the Gf-Gc distinction (Horn & Cattell, 1966; Kaufman & Kaufman, 1993). His test

of intellectual functioning is also widely known and used (Hunt, 2011). The fluid,

crystallized and total IQ scales of the Dutch version (KAIT-IV-NL) have average alpha

reliability estimates ranging from .92 to .95, while the test-retest reliabilities (with a

retest interval of 3 months and 10 days) of these scales ranged from .80 to .89 (Mulder,

Dekker, & Dekker, 2004). As for the construct validity, total IQ scores of the KAIT-IV-

NL strongly correlated (r = .80) with the WAIS-IV-NL scores (Pearson, 2012).

According to the original manual, the core battery of the KAIT takes 58 to 73 minutes

to administer; the extended battery takes 83 to 102 minutes (Kaufman & Kaufman,

1993).

Raven Progressive Matrices. The Raven Progressive Matrix test (RPM; Raven,

1938; Raven & Raven, 2008) is a single-format, time-limited, inductive reasoning test,

frequently and widely used in i.a. European personnel selection contexts (Raven &

Raven, 2008). The RPM takes up to 45 minutes to complete and is an optimal measure

of general intelligence (Jensen, 1998). Because the items of the RPM only consist of

figural patterns and thus no verbal items are included, the matrices heavily reduce

cultural and ethnic differences in IQ scores (Nisbett et al., 2012). This is a possible

explanation why these tests are popular in (European) selection procedures: minimizing


verbality in intelligence tests effectively lowers subgroup differences in test scores and

the racioethnic disadvantages that individuals experience in personnel selection (Ployhart &

Holtz, 2008). Practitioners in the field of I-O psychology generally agree that diversity

in organizations is very valuable and, as a result, intelligence tests that minimize the

subgroup differences are desirable (Murphy, Cronin, & Tam, 2003). In sum, the

relatively short administration time of 45 minutes – especially in comparison with its

widely used counterparts such as the KAIT – the minimized verbal component and its

strong relatedness to g are great advantages (see Raven & Raven, 2008).

Short-form measurements of intelligence. Since the existence of long-form

intelligence tests, people have questioned whether it is necessary to include all subtests

and test items in order to accurately assess intelligence (Doll, 1917). Early on, test

developers already introduced elements of time in their intelligence measures, linking

the speed someone correctly answers items to their intellectual capacities (e.g.

Thorndike, Bregman, Cobb, & Woodyard, 1926). To date, researchers still ask

themselves this question and pursue the development and evaluation of shorter

measurements of intelligence (Ziegler et al., 2014; also see i.a. Bilker et al., 2012;

Donnell et al., 2007). The main purpose of these short forms is to save time and

cut costs without losing psychometric validity (Smith et al., 2000).

However, several researchers have previously contested and discouraged the

development and use of short forms. Wechsler (1967) argued that “reduction in the

number of subtests as a time-saving device is unjustifiable” (p. 37) and that one has to

find the time when desiring to pursue time-saving methods. Levy (1968) stated that

“many of the essential components of the real problem [when producing short forms]

have been forgotten” (p. 415). According to Levy (1968), the real problem was

the imbalance between time efficiency and loss in validity; he claimed that the loss in

validity is consistently too high to justify the time gain.

From a nuanced perspective, Schipolowski et al. (2014) argue in their article on

the construction of short forms of intelligence measures that reliable and efficient short-

form assessment is certainly possible. Moreover, according to Ziegler and colleagues

(2014), it is a misunderstanding that short forms are obsolete or that they could not be

used for decisions at both the group and the individual level. Some loss in validity can be

justified, as there are several degrees of shortening, and this loss is not by definition too

high. One can a priori assume that extreme shortening (e.g. reducing a 150-item test to

only six items) will account for a large loss in validity and reliability,

whereas reasonable shortening (e.g. reducing it to 45 items) will account for a smaller

loss, given that the content domain of the items and the factor structure are preserved

(see i.a. Nunnally & Bernstein, 1994). Smith and colleagues (2000) furthermore assure

that well-constructed short-form intelligence tests – either just for screening purposes or

as substitutions for full-scale test batteries – have sufficient potential. According to

them, there is still a lot of room for improvement in terms of the methodology of short-

form development and validation, which is promising.
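
The relation between the degree of shortening and the expected loss in reliability can be illustrated with the Spearman-Brown prophecy formula, a standard result of classical test theory. The sketch below is only an illustration and not part of the analyses in this study: the function name and the full-length reliability of .95 are hypothetical, and the formula assumes that the retained items are parallel to the removed ones.

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predict reliability after changing test length by `length_ratio`.

    `length_ratio` is new length / old length (e.g. 45 / 150 = 0.3).
    Assumes the items are parallel, which real short forms only approximate.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_ratio <= 0.0:
        raise ValueError("length_ratio must be positive")
    return (length_ratio * reliability) / (1.0 + (length_ratio - 1.0) * reliability)


FULL_LENGTH = 150          # number of items in the hypothetical long form
FULL_RELIABILITY = 0.95    # hypothetical reliability of the long form

for n_items in (45, 6):
    predicted = spearman_brown(FULL_RELIABILITY, n_items / FULL_LENGTH)
    print(f"{n_items:>3} items: predicted reliability = {predicted:.2f}")
```

Under these assumptions, the predicted reliability remains around .85 for the 45-item version but drops to roughly .43 for the six-item version, in line with the a priori expectation described above.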

Clinical short-form measurements. Research on the psychometric qualities and

utility of existing short-form intelligence tests in a clinical context is already encouraging.

Jeyakumar, Warriner, Raval, and Ahmad (2004), for example, provided converted

reliability and validity data based on the original publications by Wechsler regarding six

short forms of the Wechsler scales including the Wechsler Abbreviated Scale of

Intelligence (WASI) (Psychological Corporation, 1999). They found estimates of

internal consistency ranging from .88 to .98 and part-whole correlations ranging from

.72 to .96. Furthermore, Donnell et al. (2007) evaluated and found two reliable short

forms of the WAIS-III specifically constructed for the evaluation of elderly patients

suffering from neurocognitive impairment associated with dementia and patients with

motor disabilities. Other research on abbreviated forms of the Wechsler scales also

provides evidence that these shortened forms can be implemented and used as reliable

intelligence tests in clinical settings where time is either very valuable or scarce (see i.a.

Adams, Smigielski, & Jenkins, 1984; Axelrod et al., 2001; Donders, 1997; Ward &

Ryan, 1996; Wymer, 2003).

Furthermore, comparisons of several short-form intelligence tests provide us

with a promising construct validity perspective as well. The study by Prewett (1995)

revealed correlations between the Matrix Analogies Test Short Form (MAT) (Naglieri,

1985) and the WISC-III, and between the Kaufman Brief Intelligence Test (K-BIT)

(Kaufman & Kaufman, 1990) and the WISC-III of .67 and .78, respectively. In addition,

Canivez (1995; 2005) found evidence for high concurrent validity of the K-BIT when

compared to the WISC-III in a student sample. In the study by Hays, Reas, and Shaw

(2002) in a sample of 85 psychiatric patients, the K-BIT also significantly and


substantially correlated with the WASI (r = .89). Finally, the Wide Range Intelligence

Test – another multi-dimensional, short-form intelligence test (Glutting, Adams, & Sheslow,

2000) – was also significantly correlated with the WASI (r = .71) in the study by Bialik

(2008) within a sample of college students.

Baddeley’s and Wonderlic’s short form. Two of the better-known short-form

measures of intelligence are Baddeley’s (1968) Three-Minute Reasoning Test and the

Wonderlic Personnel Test (WPT; Wonderlic, 1973; Wonderlic & Wonderlic, 1992).

The extremely short Three-Minute Reasoning Test (T-MRT) consists of a series of

simple questions that vary in verbal complexity. The allocated IQ score depends on the

number of correctly answered items within the three-minute time limit (Baddeley,

1968). Scores on the test correlate significantly with those on the British Army

verbal intelligence test (r = .59); the test is therefore one of the first and fastest short-form

intelligence tests able to assess one’s verbal comprehension (Baddeley, 1968). The T-MRT

is a good example of a ‘speeded test’, meaning that it assesses one’s capacity to

answer relatively simple questions in a limited amount of time (Hunt, 2011).

In addition, the Wonderlic Personnel Test is a popular, highly compressed

intelligence test that is usually administered in groups and consists of 50 multiple-choice

questions with a time limit of twelve minutes. Participants are shown a balanced mix of

problem-solving, verbal and arithmetic items. The test is mainly used in an occupational

context, specifically personnel selection (Hunt, 2011), and highly correlates with the

WAIS, as evaluated in research conducted by Dodrill (1981) and Dodrill and Warner

(1988) in a variety of psychiatric and non-psychiatric subject samples. In other research,

Dodrill (1983) also found a five-year test-retest reliability estimate of .94. Split-half

reliabilities ranged between .88 and .94 (Wonderlic, 1973).

Short form of the Wilde Intelligenztest. Overall, the psychometric qualities –

reliability and validity in particular – of the abbreviated intelligence tests in the previous

section on short-form measurements of intelligence are promising. However, most of

these tests were either specifically constructed for use in clinical settings or examined in

clinical samples. Other tests, such as Baddeley’s T-MRT and the Wonderlic Personnel

Test are well suited for (employee) screening and selection (Hunt, 2011). Yet, a

moderate to high proportion of the WPT and T-MRT items require adequate verbal

comprehension and linguistic reasoning. This verbal component potentially leads to test


score differences in racioethnic subgroups (Nisbett et al., 2012; Sackett, Schmitt,

Ellingson, & Kabin, 2001), lowering the chances that individuals from these subgroups

are selected and thus decreasing diversity and the selection rates for racial minorities

(Hough, Oswald, & Ployhart, 2001; Murphy, 2010; Ployhart & Holtz, 2008).

The intelligence test used in the current research, a short form of the Wilde

Intelligenztest (WIT-S), is a non-verbal intelligence test based on the ‘Modified Model

of Primary Mental Abilities’ (Kersting, Althoff, & Jäger, 2008). The goal of the test is

to measure applicants’ intelligence in a highly time-efficient manner and subsequently

to be able to differentiate between applicants on the basis of their capability to perform

well in their future job (cfr. infra; Schmidt & Hunter, 1996). The psychometric qualities of

its parent test, the WIT-1 (Jäger & Althoff, 1983), are adequate. Split-half

reliability estimates of the composite scale ranged from .97 to .98. For the individual scales,

split-half coefficients ranged from .88 to .99, whereas alpha

coefficients ranged from .82 to .97. One-year retest reliability coefficients ranged from

.59 to .68, except for one of the retention subscales (r = .15). Construct and criterion

validity were investigated in numerous ways and in various settings (military/police,

civil services, management, health care, etc.), including selection contexts, and rendered

adequate results. See Jäger and Althoff (1983) for a detailed overview.

The WIT-S comprises 45 items from four second-stratum abilities, namely

‘Reasoning’ (general deductive reasoning), ‘Spatial Reasoning’, ‘Numeric Reasoning’

and ‘Perceptual Speed’. All four mental abilities are theoretically related to fluid

intelligence (Gf), which makes it possible to identify applicants who perhaps do not

possess the required knowledge (Gc) but have enough learning capabilities, and to

predict their ability to solve complex (work-related) problems (Agnello, Ryan, &

Yusko, 2015). This is an advantage over the Wonderlic Personnel Test, as recent

research by Hicks, Harrison and Engle (2015) has shown that the WPT does not directly

relate to fluid intelligence, while the overall relation to general intelligence remains

somewhat unclear as well.

Furthermore, the linguistic component of the WIT-S has been minimized,

similar to the Raven Progressive Matrices, and no items requiring verbal comprehension

or word fluency were integrated in the WIT-S. This is likely to be a diversity asset

compared to the Wonderlic Personnel Test, Baddeley’s Three-Minute Reasoning Test,


the Wide Range Intelligence Test, the Kaufman Brief Intelligence Test, and the

Wechsler Abbreviated Scale of Intelligence, as noted by Hough and colleagues (2001).

The WIT-S consequently has the potential to be embedded in an increasingly

multicultural and international business environment (Reeve, Scherbaum, & Goldstein,

2015; Scherbaum & Goldstein, 2015) because the test items do not heavily rely on any

particular language. Moreover, the test can be completed with minimal knowledge of

English (besides Dutch, currently the only test language), which is increasingly

known and used as a common language on a global scale, i.e. as a ‘lingua

franca’ (Jenkins & Leung, 2013).

Moreover, the WIT-S is a twelve-minute, hence time-restricted intelligence test.

Consequently, the test is considered to be a ‘speeded test’. In order to be formally

classified as a speeded test, at least 20% of the test takers should leave one or more

questions unanswered and at least one test taker should leave 25% of the questions (or

more) blank (Van der Linden, 2011a). Speeded tests measure the speed with which a

participant is able to correctly answer relatively simple questions, as opposed to ‘power

tests’, which measure the precision with which a participant is able to correctly answer

relatively complex questions (Hunt, 2011; Van der Linden, 2011b). Recent research on

the latent constructs ‘speed’ and ‘power’ suggests that a clear distinction between the

two constructs can be made in psychometric tests on the basis of time and accuracy data

(Partchev, De Boeck, & Steyer, 2011). In addition, the correlation between mental speed and

general intelligence is moderate but stable, and mental speed is also substantially related

to Gf (Sheppard & Vernon, 2008).
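
The formal criterion for a speeded test cited above (Van der Linden, 2011a) can be expressed as a simple check on a response matrix. The following is a minimal sketch under stated assumptions: the function name and the data layout (one list per test taker, with `None` marking an unanswered item) are hypothetical and do not describe how the WIT-S data were actually stored or processed.

```python
from typing import Optional, Sequence


def is_speeded(responses: Sequence[Sequence[Optional[int]]]) -> bool:
    """Check the two-part speededness criterion described in the text.

    (1) At least 20% of test takers leave one or more items unanswered.
    (2) At least one test taker leaves 25% or more of the items unanswered.
    `None` marks an unanswered item; any other value counts as answered.
    """
    if not responses or any(len(row) == 0 for row in responses):
        raise ValueError("need at least one test taker and one item per test taker")

    blanks = [sum(answer is None for answer in row) for row in responses]
    share_with_blank = sum(count >= 1 for count in blanks) / len(responses)
    someone_left_quarter = any(
        count / len(row) >= 0.25 for count, row in zip(blanks, responses)
    )
    return share_with_blank >= 0.20 and someone_left_quarter


# Hypothetical toy data: five test takers, four items each.
toy = [
    [1, 0, 1, None],   # ran out of time on the last item
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
print(is_speeded(toy))
```

In the toy data, one of five test takers (20%) leaves one of four items (25%) blank, so both conditions are just met.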

Intelligence testing in personnel selection

The importance of intelligence. Hunter and Hunter (1984) have shown that

general intelligence, or general mental ability (GMA) as they call it, is one of the best

single predictors of future job performance, but also future learning and training

abilities. Schmidt and Hunter (1998) later elaborated on the validity of various

assessment methods in personnel selection and demonstrated that the mean predictive

validity of GMA for job performance is .51 and the mean predictive validity of GMA

for the performance in job training programs is .56. A valid theoretical explanation for

the fact that GMA predicts a large amount of variance in future job performance is

underpinned by the well-documented research on the mediating role of job knowledge


in the relationship between GMA and job performance (Borman et al., 1993; Hunter &

Schmidt, 1996; Schmidt, 2002; Schmidt & Hunter, 1992; Schmidt & Hunter, 2004;

Schmidt, Hunter, & Outerbridge, 1986). Higher levels of GMA lead to higher levels of

job knowledge, which, in turn, lead to higher levels of job performance.

Employees who are more intelligent are able to gain more job knowledge and they tend

to gain it more rapidly.

Furthermore, Schmidt and Hunter (1998) concluded that “the validity of the

personnel measure (or combination of measures) used in hiring is directly proportional

to the practical value of the method – whether measured in dollar value of increased

output or percentage of increase in output” (p. 273). Using the most appropriate and

valid assessment methods – intelligence tests in particular – thus has a lot of economic

value in today’s world of increased business competition and knowledge work

(Barkema, Baum, & Mannix, 2002; Murphy, 1999). In addition, the complexity and

ambiguity of jobs have been increasing rapidly, making intelligence an excellent choice

when selecting performance predictors (DeNisi, Hitt, & Jackson, 2003; Gottfredson, 2002;

Lang, Kersting, Hülsheger, & Lang, 2010; Scherbaum et al., 2012).

Intuition in recruitment and selection. During the five years of our academic

education, before officially becoming industrial and organizational psychologists, our

professors frequently stress the fact that our curriculum is structured in such a way that

we will be able to operate as true scientist-practitioners in the field of human resources.

As Jex and Britt (2008) so inclusively put it: “Organizational psychologists use

scientific methods to study behavior in organizations. They also use this knowledge to

solve practical problems in organizations; this is the essence of the scientist-practitioner

model […]” (p. 19). Thus, in order to attain the full potential of organizations and their

employees, rigorous scientific research and practice should intertwine harmoniously.

Unfortunately, this is not always the case.

As previously mentioned, Schmidt and Hunter (1998) have shown that

intelligence tests substantially predict job performance, even better than do unstructured

interviews. Yet, human resource managers generally perceive unstructured interviews as

more effective than intelligence tests (Terpstra, 1996), although they know that there are

many downsides to this type of interview (Rynes, Colbert, & Brown, 2002).

Moreover, unstructured interviews are far more popular and intensively used in


selection contexts than any other assessment method, including intelligence tests

(Buckley, Norris, & Wiese, 2000). Agnello et al. (2015) claim that intelligence is “one

of the most critical individual difference variables in human resources” (p. 47), but

practitioners still heavily rely on intuition. They are, in many cases, more reluctant to

use reliable and validated selection decision aids, including intelligence tests, than

they are to rely on their own gut feeling and so-called ‘experience in predicting human

behavior’ (Highhouse, 2008).

Interviews conducted by Miles and Sadler-Smith (2014), held to understand the

rationale of managers using their intuition as an indicator for performance, personality

and the person-environment fit in personnel selection, revealed a few interesting

insights. First of all, there were some complications with face validity. More objective

assessment methods are generally well received by applicants, yet not all of

them are perceived as appropriate for assessing the specific selection criteria, which eventually

increases the use of intuition. In addition, intelligence tests are not generally seen as

comprehensive predictors of employee behavior (Murphy et al., 2003) and are often

even criticized for being decontextualized and restricted (Lievens & Reeve, 2012).

Secondly, the high costs of setting up assessment centers, including a great amount of

formal testing, were considered less attractive by many practitioners than using a more

intuitive approach that did not require financial capital. Thirdly, intuition was used to

complement the more rational and objective approach to recruitment and was integrated

as an additional selection decision aid, voiding valid test outcomes in some cases.

More importantly, and highly relevant to the current study, is the finding that

human resources practitioners wish to hire the most suitable person for the job in a highly

time-efficient manner (Klimoski & Jones, 2008). In this respect, intuitive approaches

including a casual screening of an applicant’s résumé or a quick, unstructured interview

will take up less time than a full-scale intelligence test like the KAIT, but eventually fall

short. In order to perform personnel selection in an evidence-based manner, but also

taking into account the need for a less time-consuming hiring process, (re)new(ed),

time-saving intelligence assessment methods should be explored. This is in line with the

recent plea of Scherbaum and colleagues (2012), arguing that we should invest more

resources in the research on the measurement of intelligence, because in recent years,

the contribution of I-O psychologists to this research domain has steadily decreased. In


addition, Agnello and his colleagues (2015) stated that a considerable number of advances

in psychometrics and improved measurement techniques have been discussed in recent

years, yet “much more research on many of these developments is needed” (p. 53).

Hypotheses

As for the potential utility of the WIT-S, quite a lot has already been mentioned

and discussed in the introduction. Firstly, the minimized verbal component potentially

leads to decreased adverse impact in selection (Hough et al., 2001; Murphy,

2010; Ployhart & Holtz, 2008), being a strong asset in an internationalizing, complex,

technology-driven and multicultural business environment (Reeve et al., 2015;

Scherbaum & Goldstein, 2015). Secondly, the four factors of the WIT-S focus on fluid

intelligence, which should make it possible to differentiate between applicants who can

learn but do not yet possess the necessary job knowledge and applicants who have the

requisite knowledge but possess only little ability to learn from training (Agnello et al.,

2015; Hunt, 2011). Thirdly, ‘speeded’ intelligence tests have an economic and time-

saving advantage (Hunt, 2011), which is especially valuable in the current context of

increased knowledge work and augmented competition between organizations

(Barkema et al., 2002; Murphy, 1999; Scherbaum et al., 2012).

The empirical section of this study focusses on testing the general psychometric

qualities of the WIT-S. It ought to provide us with an answer to the question whether

the short form of the Wilde Intelligenztest is a well-constructed, valid and reliable

alternative to the time-consuming full-scale intelligence tests. Thoroughly researching

these psychometric qualities is essential if we eventually want to use the WIT-S in a

personnel selection context (Schipolowski et al., 2014; Smith et al., 2000).

Test score differentiation. First of all, I want to explore whether there are any

subgroup differences in test scores for sex, age and educational level, which is

elaborated in the next three paragraphs. I also want to provide sufficient data and

information on the test scores of the WIT-S and the research sample in general.

Eventually, if I am able to conclude that the WIT-S is a reliable and valid short-form

alternative, norm groups will be provided, enabling both academics and

practitioners to interpret the unstandardized test scores (Evers, Lucassen, Meijer, &

Sijtsma, 2009) and counter the potential adverse impact inherent to these group

differences (Sackett & Wilk, 1994).


According to current research on sex differences in intelligence, no evidence

exists that supports mean differences in general intelligence (Jensen, 1998; Johnson &

Bouchard, 2007). By contrast, in the extensive research conducted by Johnson and

Bouchard (2007), differences between males and females on various tests of specific

mental abilities did show up, even though g masked many of these dissimilarities in

previous research. Whilst males perform better on tests focussing on visuospatial

abilities including mental rotation (more Gf-oriented), females perform better on tests

measuring verbal and memory abilities (more Gc-oriented; Johnson & Bouchard, 2007;

Nisbett et al., 2012). We can thus expect males to attain higher scores on the WIT-S,

which presumably focusses entirely on Gf and consists of nearly 18% spatial items.

We can furthermore expect age differences in total test scores on the WIT-S. As

previously discussed, evidence in favour of a Gc-Gf distinction exists. A great amount

of this evidence is based on the difference in performance on fluid-based versus

crystallized-based intelligence tests depending on age. Younger participants should be

able to perform better on the WIT-S, as Gf-related intelligence peaks in

early adulthood and steadily declines thereafter (cfr. supra; Horn &

Cattell, 1967; Salthouse, 1996; Nisbett et al., 2012).

Besides differences in sex and age, dissimilarities in educational level are also

taken into consideration. The correlation between performance on tests measuring any

form of cognitive ability and total years of education received is generally high whereby

differences in intelligence account for approximately 30% of the outcome variance

(Neisser et al., 1996). Reasons for this are i.a. the education rewarding process –

encouraging students to continuously perform better – social status and income (Neisser

et al., 1996). We can therefore expect participants who have received more (higher)

education to perform better on the WIT-S than participants who have received

less education or none at all.

Hypothesis 1: Mean WIT-S test scores between demographic subgroups will

differ significantly, whereby (a) younger test takers will, on average, perform

better than older test takers, (b) male test takers will, on average, perform better

than their female counterparts and (c) higher educated test takers will, on

average, perform better than lower or not educated test takers.


Items and their latent factors. Secondly, an assessment of the item facility and

the power of each item to discriminate between respondents will be carried out. In

knowledge-based tests, item analysis classically focuses on the two above-mentioned

statistics (Rust & Golombok, 2014). On the one hand, it is important to show that each

item of the WIT-S possesses an adequate difficulty level and contributes to the test as a

whole (Smith et al., 2000). The more extreme the value of an item facility index

becomes, the more redundant the item is, because the participants’ total scores would

have been practically the same had the item never been included in the test. On the

other hand, items should be able to discriminate between participants (Emons, Sijtsma, &

Meijer, 2007; Rust & Golombok, 2014). Items with relatively high, positive item-total

correlations discriminate well between high and low performing participants. Items with

low item-total correlations do not, meaning that the items make little to no contribution

as they are practically uncorrelated with the total test score as well as with most other

test items.

Hypothesis 2: Each individual item of the WIT-S (a) possesses an adequate

difficulty level and (b) discriminates between high and low performers,

conformingly reflecting their total test score.
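
The two item statistics referred to in Hypothesis 2 can be sketched as follows. This is an illustration on hypothetical toy data, not the analysis script of the current study. Item facility is computed as the proportion of correct answers. For discrimination, the sketch uses the corrected item-total (item-rest) correlation, in which the item itself is removed from the total score; this correction is an assumption made here and is not prescribed by the sources cited above.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return float("nan")
    return cov / sqrt(ss_x * ss_y)


def item_analysis(scores: Sequence[Sequence[int]]) -> List[Tuple[float, float]]:
    """Return (facility, corrected item-total correlation) for every item.

    `scores` holds one row per test taker with 1 = correct and 0 = incorrect.
    """
    results = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]  # total without item j
        facility = sum(item) / len(item)
        results.append((facility, pearson(item, rest)))
    return results


# Hypothetical toy data: four test takers, three items of increasing difficulty.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
for number, (facility, r_rest) in enumerate(item_analysis(toy_scores), start=1):
    print(f"item {number}: facility = {facility:.2f}, item-rest r = {r_rest:.2f}")
```

Items with a facility close to 0 or 1, or with an item-rest correlation near zero, would be the candidates for closer inspection.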

Thirdly, the factor structure of the WIT-S will be examined. Smith et al. (2000)

urge researchers to assess item-factor scale correlations and the factor-level validity of

the content because they are necessary to demonstrate that the short form reproduces the

hypothesized factor structure. I should be able to show empirically that a higher-order

general factor (e.g. Gf) and four subfactors, namely Reasoning, Spatial Reasoning,

Numeric Reasoning and Perceptual Speed, can be found in the short form by

means of factor analysis. The analysis will comprise both exploratory and

confirmatory factor analysis. The latter should provide us with the most useful

information on the fit of the data with the theoretical model (Smith et al., 2000).

Hypothesis 3: The WIT-S will reproduce the premised factor structure

comprising Gf and four lower-order mental abilities: Reasoning, Spatial

Reasoning, Numeric Reasoning and Perceptual Speed.


Reliability and construct validity. Fourthly, I want to explore “the degree to

which the items in the test are associated due to what they share in common” (Sijtsma,

2009, p. 179) and consequently whether the WIT-S is a reliable measure. In order to

make decisions in personnel selection, the subscales should preferably contain only little

measurement error (Smith et al., 2000). Analogous to factor analysis, I should be able to

show that each item is strongly related to its higher-order (sub)factor(s). Both the

composite-scale reliability and individual-scale reliability coefficients should at least

reach moderately high values. In the current study, moderately high values are regarded

as reliability estimates equalling or exceeding .70, a value generally considered to be

acceptable (Nunnally & Bernstein, 1994). Given the original test characteristics (Jäger

& Althoff, 1983), the WIT-S should be able to attain such values. It is important,

however, to keep in mind that for reliability coefficients to be sufficient, the context of

the test and its future use should be taken into consideration (Cortina, 1993).

Hypothesis 4: The composite-scale and individual-scale reliability estimates of

the WIT-S will equal or exceed .70.
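
As an illustration of how a reliability estimate can be compared with the .70 benchmark of Hypothesis 4, the sketch below computes coefficient alpha from a matrix of item scores. The data are hypothetical and the function name is an assumption; the sketch does not reproduce the reliability analyses reported in this study, which may rely on other estimators as well.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Coefficient alpha for a persons-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two test takers")
    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


THRESHOLD = 0.70  # benchmark used in Hypothesis 4 (Nunnally & Bernstein, 1994)

# Hypothetical toy data: four test takers, three dichotomous items.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(f"alpha = {alpha:.2f}, meets threshold: {alpha >= THRESHOLD}")
```

With only a handful of test takers and items, such an estimate is of course unstable; the toy example merely shows the computation.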

Fifthly, I want to examine the convergent validity and consequently to what

extent the short-form measure of intelligence relates to an already sufficiently validated,

extensive measure of intelligence. The rationale for this is twofold (Smith et al.,

2000): (a) we need to be able to justify the savings in time in relation to potential loss in

validity and (b) careful, thorough examination of the short form’s validity is needed as

short forms have frequently been developed without prior, in-depth research. The latter is

illustrated by the finding that, previously, a great number of investigators assumed that

all of the reliability and validity evidence of the original, full-length measure

automatically applied to the abbreviated form and that because the new measure is

shorter, less validity evidence was required (Smith et al., 2000). The current research

should show that the short form of the Wilde Intelligenztest substantially correlates with

a full-scale intelligence test, in this case the KAIT-IV-NL (Mulder et al., 2004).

In addition, I want to investigate the discriminant validity and consequently to

what extent the short form differs from a test measuring another construct, such as


emotional intelligence. The latter is defined as the ability to perceive, assimilate,

understand and manage emotions (Mayer, Caruso, & Salovey, 1999). Meta-analysis

(Van Rooy, Viswesvaran, & Pluta, 2005) has shown that, typically, moderate

correlations (𝜌 = .34) exist between ability measures of emotional intelligence and

general cognitive ability tests. Moreover, these are typically higher for Gc-oriented

(verbal, perceptual) than Gf-oriented tests (Mayer, Roberts, & Barsade, 2008).

In the current study, I will investigate the correlation between the WIT-S and a

translated version of the Geneva Emotion Recognition Test (GERT), an emotion

recognition ability test (Schlegel, Grandjean, & Scherer, 2014). The psychometric

qualities of this test are promising, with alpha reliability estimates of the full scale

reaching up to .76. Furthermore, the GERT significantly correlates with the Mayer-

Salovey-Caruso Emotional Intelligence Test (MSCEIT; r = .47; Mayer, Salovey, &

Caruso, 2002), another performance-based test, and the Situational Test of Emotion

Understanding (STEU; r = .50; MacCann & Roberts, 2008), a situational judgement

trait measure of emotional intelligence (Schlegel et al., 2014; Université De Genève,

2014).

Hypothesis 5: The WIT-S will (a) significantly and substantially correlate with a

full-scale intelligence test, the KAIT and (b) weakly, yet significantly correlate

with an emotion recognition ability test, the GERT. Accordingly, the correlation

between the WIT-S and KAIT will surpass the correlation between the WIT-S

and GERT.

Exploring these two forms of validity will help to map the associations between

the latent construct measured by the WIT-S and two related constructs within its

nomological network. Hence, this part of the research will provide us with information

on the test’s construct validity (Cronbach & Meehl, 1955). Beside these correlations, the

item analysis and factor analytical research should produce additional evidence and

information about the interpretability of the construct validity estimates (Cronbach &

Meehl, 1955; Shepard, 1993).


METHOD

Sample

Participants were conveniently recruited in the spring of 2015 via undergraduate

psychology students. These students had to administer a test battery that consisted of

cognitive (e.g. WISC-III-NL and KAIT-IV-NL) and non-cognitive tests (e.g. Geneva

Emotion Recognition Test) as part of their curriculum at Ghent University (Universiteit

Gent, 2014). Each student had to look for two acquaintances from their own region and

ask them to participate in a psychological assessment. They were allowed to search for

participants within relatively broadly specified age categories (e.g. 18-25 years). Before

being allowed to start the autonomous work, the students had to demonstrate that they

mastered the principles of applying the test battery by means of a competence test.

In total, 35 of the original 353 participants were excluded from the research

because of incomplete data, leaving us with 318 valid cases. Either data from one or

multiple tests was completely missing (N = 26) or the participants’ demographic

information was incomplete and could not be retrieved from or traced back to the

original data file (N = 9). Participants ranged in age from 18 to 59 years (M = 36.28, SD =

12.53) and comprised 147 males and 171 females. For research purposes, the

sample was divided into two quasi-equal age categories in terms of sample size: the first

subgroup comprised the participants aged 18 to 34 (N = 157), the second subgroup

comprised the participants aged 35 to 59 (N = 161).

200 participants received lower education (N = 199) or none (N = 1); thereof 95

were male (47.50%) and 105 were female (52.50%). 118 received higher education;

thereof 52 were male (44.07%) and 66 were female (55.93%). Lower education

comprised primary education, secondary education and specialization courses after

finishing secondary education; higher education comprised post-secondary education.

All participants were proficient in Dutch and used it as a spoken language in their

everyday life (N = 312) or were native speakers (N = 308); 314 subjects had the Belgian

nationality and 4 had the Dutch nationality.

Measures

Short form of the Wilde Intelligenztest. The Short Form of the Wilde

Intelligenztest (WIT-S) was administered during one session through the web-based

research platform Qualtrics (www.Qualtrics.com). The WIT-S was administered among


other tests including the Situational Test of Emotional Understanding and the

Situational Test of Emotional Management (MacCann & Roberts, 2008), and the

Geneva Emotion Recognition Test (Schlegel et al., 2014). The WIT-S comprised 45

items originating from the original test (Der Wilde Intelligenztest; Jäger & Althoff,

1983). Each item was an operationalization of one or two of the following latent

constructs: Reasoning, Spatial Reasoning, Numeric Reasoning and Perceptual Speed.

The test was composed in such a way that the questions were of increasing difficulty;

questions towards the end of the test were supposedly more difficult than at the

beginning. Participants had twelve minutes to solve the speeded test; four countdown

timers evenly distributed throughout the test displayed how much time they had left.

Only one of the participants was able to answer all 45 questions in the twelve-minute

timeframe. The lowest number of questions answered was 9; the average number of

questions answered was 26.29 (SD = 6.14).

The way the WIT-S items were presented varied: 30 items were shown as

multiple choice questions with five choices each, the other fifteen items – numeric

sequences and algebraic exercises – were displayed as single-line text entry questions.

Participants could choose to leave an item blank and scroll down to the next item. An

example question of the Spatial Reasoning subscale was ‘Which figure needs to be

flipped (or mirrored) first before it can overlap with the other four figures?’ and of the

Numeric Reasoning subscale ‘Try to discover the underlying simple calculation rule in

this calculation task and estimate the right answer’. An example item of the Reasoning

subscale was ‘Which letters are next in this letter sequence?’ and of the Perceptual

Speed subscale ‘Which face is different compared to the other two faces?’. For further

data analysis, the answers were recoded where 0 = wrong and 1 = correct.

Geneva Emotion Recognition Test. The Geneva Emotion Recognition Test

(GERT), a recently constructed performance-based test featuring fourteen different

emotions, measured individual differences in emotion recognition. The ability measured

is considered to be a central component of emotional intelligence (Schlegel et al., 2014).

The Dutch translation of the test was administered during the same online session as the

WIT-S. The GERT consisted of dynamic, multimodal video fragments displaying a

varying number of affective states and expressions and comprised 70 short video

snippets in total. An example item was ‘How likely is it that the actor tried to express


sadness?’. After each video fragment, participants had to rate the related item on a 7-

point Likert scale ranging from ‘Not likely at all’ to ‘Very likely’. The GERT took about

15 to 20 minutes to administer; no time-limitations were imposed. All of the 318

participants completed the test. Reversed-scored items were recoded. The lower bound

alpha of the composite scale equalled .79, which is considered to be sufficient for

research purposes (Nunnally & Bernstein, 1994).

Kaufman Adolescent and Adult Intelligence Test. The full-scale Dutch

version of the Kaufman Adolescent and Adult Intelligence Test (KAIT-IV-NL) is a

widely used and thoroughly validated paper-and-pencil measure (Hunt, 2011; Mulder et

al., 2004; Pearson, 2012) and was administered by bachelor students to measure the

general intelligence of the participants. The test (extended battery) consisted of four

subtests operationalizing crystallized intelligence (Auditory Comprehension, Double

Meanings, Definitions and Famous Faces; α = .89), four subtests operationalizing fluid

intelligence (Rebus Learning, Mystery Codes, Logical Steps and Memory for Block

Designs; α = .55) and two measures for memory recall (Rebus Recall and Auditory

Recall). The alpha reliability estimate of the composite scale, comprising 10 subscales,

equalled .84 and was thus acceptable (Nunnally & Bernstein, 1994).

The tasks and exercises of the KAIT varied greatly. They ranged from listening

to and answering questions about recordings of news stories (Auditory Comprehension)

to integrating multiple types of word clues (Double Meanings), as well as using visually

and orally presented logical premises to solve problem statements (Logical Steps). It

took the participants between 70 and 240 minutes to complete the test (N = 313, M =

125.90, SD = 27.20). Answers, observations and raw scores of the KAIT were written

down by the undergraduate students during administration. The data was eventually

entered into multiple Excel worksheets, sent to academic assistants at the University of

Ghent and recoded for further statistical analysis.

Data analysis

IBM SPSS Statistics version 23 was used for descriptive and correlational

statistics, comparisons of means, item analysis, reliability analysis and exploratory

factor analysis. Unless specified otherwise, all correlations reported were based on

Pearson’s product-moment formula; tests of significance were two-tailed. Qualitative

labels for correlations and effect sizes (e.g. small, moderate, large) conformed with


Cohen’s (1988) guidelines. When group means were compared using the t test statistic,

Levene’s Test for Equality of Variances was consulted to assess the assumption of

homoscedasticity. Depending on the significance level of Levene’s Test, the appropriate

t test statistic was reported (Brown & Forsythe, 1974). The ‘lavaan’ statistical package

for latent variable analysis (Rosseel, 2012) was used to perform confirmatory factor

analysis in R version 3.2.3 (R Core Team, 2015).

Item analysis. First of all, item facility indices (IFI) were calculated by dividing

the number of participants who had correctly answered the question by the total number

of participants. Values ranged between 0 and 1. The easier the question, the higher the

value of the item facility index, and vice versa. Rust and Golombok (2014) proposed the

following rules of thumb: (a) an item is too difficult and/or too few participants give a

correct answer if the IFI is less than .25, (b) an item is too easy and/or too many

participants give a correct answer if the IFI is greater than .75.

Secondly, item discrimination indices (IDI) were obtained by calculating the

correlation between an item and the total score for the test. Rust and Golombok (2014)

argue that a minimum correlation of .20 is required as items with low item-total

correlations (between –.20 and .20) do not sufficiently discriminate. Relatively high,

negative item discrimination indices indicate that the item presumably measures a

construct different from the construct that is intended to be measured. The item

discrimination index was computed by using the point-biserial correlation formula

(Tate, 1954). Additionally, item-rest correlations were computed by correlating each

item and the participant’s total score for the test, whereby the item itself was subtracted

from the total for the correlation.

Thirdly, item distractor analysis – investigation of the frequencies of the selected

responses per item – was executed for multiple choice items with an item facility index

less than .25 and was reported when conclusive. Additional regard was given to items

that did not meet any of the following criteria: (a) item-total correlation greater than or

equal to .20 and (b) item-rest correlation greater than or equal to .20.
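The three item statistics described above can be sketched in a few lines of code. This is a minimal illustration, not the SPSS procedure used in the study: the function names and the toy response matrix are hypothetical, unanswered items are assumed to be scored 0, and the point-biserial coefficient is computed as an ordinary Pearson correlation between a 0/1 item and a continuous score, to which it is algebraically equivalent.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a variable has no variance
    return sxy / sqrt(sxx * syy)

def item_analysis(responses):
    """responses: one list of 0/1 item scores per participant (assumed: blank = 0)."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    results = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        ifi = sum(item) / len(item)                    # item facility index
        idi = pearson(item, totals)                    # item-total (point-biserial)
        rest = [t - i for t, i in zip(totals, item)]   # total without the item itself
        rc = pearson(item, rest)                       # item-rest correlation
        results.append({"item": j + 1, "IFI": ifi, "IDI": idi, "rc": rc})
    return results

# Hypothetical toy data: 6 participants x 4 items
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
for row in item_analysis(toy):
    meets_all = 0.25 <= row["IFI"] <= 0.75 and row["IDI"] >= 0.20 and row["rc"] >= 0.20
    print(row["item"], round(row["IFI"], 2), round(row["IDI"], 2), round(row["rc"], 2), meets_all)
```

The item-rest correlation is usually somewhat lower than the item-total correlation because the item no longer contributes to the score it is correlated with.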

Factor analysis. In order to identify the appropriate rotation strategy as well as

the optimum number of factors to retain in the final exploratory factor analyses (EFA),

an initial EFA was executed using the commonly applied ‘principal axis factoring’

method with varimax rotation (Bartholomew, Steele, Moustaki, & Galbraith, 2008). The


extracted factors were evaluated by inspecting the correlation matrix. Multiple

correlations between the factors were substantial (absolute value ≥ .32). Consequently,

there was sufficient overlap in variance between several factors (more than 10%) to

justify the use of an ‘oblique rotation’ method (Tabachnick & Fidell, 2007). Also, from

a theoretical viewpoint, correlations between the factors could be assumed (Jäger &

Althoff, 1983; Kersting et al., 2008).
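The 10% figure follows directly from the cutoff value: the squared correlation between two factors gives the proportion of variance they share, so a correlation of .32 corresponds to roughly 10% overlap.

```latex
r^{2} = (.32)^{2} = .1024 \approx 10\%
```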

Various methods were used to identify the number of factors to retain in the final

EFA (see Gorsuch, 1983): (a) investigation of the eigenvalues, looking for values

greater than 1 (Kaiser criterion; Kaiser, 1960), (b) visual inspection of the scree plot

(Cattell, 1966) and (c) parallel analysis (Horn, 1965b). The latter supposedly presents

the least variability and sensitivity to factor differentiation (Zwick & Velicer, 1986).

The analysis indicated meaningful factors when eigenvalues of the WIT-S data were

greater than those provided by random data comprising the same sample size and

number of variables (Lautenschlager, 1989). Random data and derived eigenvalues for

parallel analysis were generated through principal component analysis with 100 random

correlation matrix replications using Patil, Singh, Mishra, and Todd Donavan’s (2007)

‘Parallel Analysis Engine to Aid Determining Number of Factors to Retain’ (version

9.1.3, seed 50).

Additional factor analyses were carried out to identify latent traits using the

principal axis factoring method. Extracted results reported were rotated by means of the

varimax procedure (uncorrelated factors, orthogonal solution) and direct oblimin

procedure (delta = 0; correlated factors, oblique solution). The Kaiser-Meyer-Olkin

Measure of Sampling Adequacy (KMO) and Bartlett's Test of Sphericity were consulted

to evaluate if the data were likely to pool on the factors and if the correlation matrix

equalled an identity matrix. Values greater than or equal to .70 for the KMO coefficient

and significant test statistics (p ≤ .05) for Bartlett’s Test of Sphericity are acceptable and

preferred (Field, 2009). The rotated results were eventually evaluated for the presence

of ‘simple structure’ (Thurstone, 1947) in order to advocate further interpretations

(Kline, 2002).

Next, multiple confirmatory factor analyses were executed to assess if the

hypothesized factor structure of the WIT-S adequately fitted with the data. One-factor,

two-factor and four-factor models were tested and compared to each other: (a) the one-


factor model represented Gf, (b) the two-factor model represented Reasoning,

comprising Numeric Reasoning and Spatial Reasoning, and Perceptual Speed and (c)

the four-factor model represented Reasoning, Numeric Reasoning, Spatial Reasoning

and Perceptual Speed as individual latent factors. As these models specify binary endogenous

variables, a diagonally weighted least squares approach (DWLS method) was used to

estimate the model parameters (Flora & Curran, 2004). The estimates reported were

robust variants of the computed test statistics. Parametrization of the model was

performed through factor scaling using Unit Loading Identification (ULI): one non-

standardized factor loading was fixed to 1 for each factor.

Multiple fit measures were consulted to evaluate model fit. Firstly, the chi

square test provided information on the consistency of the observed covariance with the

model-implied covariance. Non-significant χ2 test statistics (p ≥ .05) indicate good fit with

the covariance data (Kline, 2005). Secondly, the Comparative Fit Index (CFI) provided

information on the relative improvement in fit of the specified model versus the baseline

model. CFI values greater than .95 suggest good fit (Hu & Bentler, 1999). Thirdly, the

Root Mean Square Error of Approximation (RMSEA) was evaluated. RMSEA values

less than or equal to .010, .050, and .080 indicate excellent, good, and mediocre fit,

respectively (MacCallum, Browne, & Sugawara, 1996). Finally, the chi square

difference test – using Satorra and Bentler’s (2001) method – provided information on

the relative improvement in fit of the models when compared to each other. Significant

χ2 difference test statistics suggest differences in model fit (Kline, 2005).
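To make the fit criteria concrete, the following sketch computes point estimates of the CFI and RMSEA from chi-square values using their common textbook definitions. It is an illustration only: the function names and the input values are hypothetical, the RMSEA denominator is assumed to use N − 1 (some software uses N), and the robust (scaled) variants reported in this study apply additional corrections that are not reproduced here.

```python
from math import sqrt

def rmsea(chi2, df, n):
    """Textbook RMSEA point estimate (assumption: N - 1 in the denominator)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index of a model relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    if d_base == 0.0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - d_model / d_base

def rmsea_label(value):
    """Verbal labels for the cutoffs cited in the text (MacCallum et al., 1996)."""
    if value <= 0.010:
        return "excellent"
    if value <= 0.050:
        return "good"
    if value <= 0.080:
        return "mediocre"
    return "above the cited cutoffs"

# Hypothetical chi-square values, not results from the study
example_rmsea = rmsea(chi2=1100.0, df=945, n=318)
example_cfi = cfi(chi2_model=1100.0, df_model=945, chi2_base=4000.0, df_base=990)
print(round(example_rmsea, 3), rmsea_label(example_rmsea), round(example_cfi, 3))
```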

Reliability analysis. In order to estimate the composite-scale and individual-

scale reliability levels of the WIT-S, Cronbach’s (1951) alpha and split-half reliability

coefficients were estimated. The latter was calculated using Spearman (1910) and

Brown’s (1910) formula, splitting up the test and scales in odd- and even-numbered

halves. Beside parallel-forms reliability, split-half reliability is considered to be one of

the only measures not weakened by the artefacts of speeded tests (Kline, 2000). In

addition, a special, corrected case of the split-half reliability coefficient was computed

using Gulliksen’s (1950) least restrictive correction for stepped-up odd-even reliability

estimates of partially speeded power tests:

$$R_c = R_{xx'} - \frac{K M_U - M_U^2 - S_U^2}{(K - 1)\, S_x^2}$$

where $R_c$ is the corrected reliability coefficient, $R_{xx'}$ is the stepped-up odd-even reliability, $K$ is a reasonable approximation for the number of items in the speeded part of the test, $M_U$ is the mean of the U-scores (the number of unanswered items per participant), $S_U^2$ is the variance of the U-scores and $S_x^2$ is the variance of the total error scores (the number of incorrectly answered items plus the number of unanswered items per participant; Gulliksen, 1950). Arbitrarily, the value of K equalled the number of items completed by at least 75% of the participants (K = 23). Furthermore, the ratio between $M_U$, the mean of the U-scores, and $S_t^2$, the variance of the total test scores, was also reported to assess whether the corrected reliability estimate reported was satisfactory. High $M_U / S_t^2$ values (i.e. > .30) indicate less satisfactory estimates (Gulliksen, 1950).
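As a rough illustration of the two steps described above, the sketch below first steps up an odd-even half-test correlation with the Spearman-Brown formula and then subtracts the speededness correction term. The function names and all input values are hypothetical and chosen for readability only; the arguments correspond to the number of items in the speeded part (K), the mean and variance of the U-scores, and the variance of the error scores.

```python
def spearman_brown(r_half: float) -> float:
    """Step up the correlation between two test halves to full test length."""
    return 2.0 * r_half / (1.0 + r_half)

def gulliksen_corrected(r_xx: float, k: int, m_u: float, s2_u: float, s2_x: float) -> float:
    """Stepped-up reliability minus (K*M_U - M_U^2 - S_U^2) / ((K - 1) * S_x^2)."""
    correction = (k * m_u - m_u ** 2 - s2_u) / ((k - 1) * s2_x)
    return r_xx - correction

# Hypothetical illustration values, not results from the study
r_half = 0.80                      # correlation between odd and even halves
r_xx = spearman_brown(r_half)      # stepped-up odd-even reliability
r_c = gulliksen_corrected(r_xx, k=23, m_u=3.0, s2_u=4.0, s2_x=36.0)
print(round(r_xx, 3), round(r_c, 3))
```

With these illustrative inputs the correction lowers the stepped-up estimate, which is the intended direction: unanswered items at the end of a speeded test inflate odd-even agreement.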

Construct validity. Lastly, the premised interrelations of the WIT-S with

cognitive ability and emotion recognition measurements were explored by evaluating

the convergent and discriminant validity coefficients. Convergent validity was

measured by correlating the unstandardized WIT-S test scores with the unstandardized

KAIT-IV-NL Total, Fluid and Crystallized IQ test scores. In addition, the correlations

between the subscales of the WIT-S and the KAIT’s Fluid IQ subscales

were investigated. Discriminant validity was measured by correlating the

unstandardized WIT-S test scores with the unstandardized GERT composite-scale

scores.


RESULTS

Descriptive statistics

Table 1 provides an overview of the sample and subgroup sizes for age, gender

and educational level, and the mean age per demographic subgroup. The mean age of

the male test takers was 34.00 years (SD = 11.90); that of the female test takers was

38.24 years (SD = 12.76). The mean age of the participants was 24.89 (SD = 3.99) in the

18-34 years category and 47.39 (SD = 6.65) in the 35-59 years category. As illustrated by

Figure 2, 109 (34.28%) participants were aged 20 to 26 years, another 75 (23.58%)

participants were aged 46 to 54 years; these age groups are thus highly represented as

they equal 57.86% of the total sample, yet cover only 34.14% of the age distribution.

The mean age of the participants who received lower or no education was 36.04 (SD

= 13.01). On average, participants who received higher education were 36.96 years old

(SD = 11.72). Here, lower education comprised primary education, secondary education

and specialization courses; higher education comprised post-secondary education.

Figure 2. Age distribution of the sample compared to a uniform distribution curve.


Table 1
Frequencies for Age, Gender and Education and Mean Age per Demographic Category

                                         Age
                     N (%)            Mean      SD
Sample               318 (100.00)     36.28    12.53
Age
  18-34 years        157 (49.37)      24.89     3.99
  35-59 years        161 (50.63)      47.39     6.65
Gender
  Male               147 (46.23)      34.00    11.90
  Female             171 (53.77)      38.24    12.76
Education a
  Lower or none      200 (62.89)      36.04    13.01
  Higher             118 (37.11)      36.96    11.72

Note. N = sample or subgroup size; SD = standard deviation of mean value. Numbers between brackets in italic typeface are percentages of total sample size. a Lower education comprised primary and secondary education; higher education comprised post-secondary education.

Table 2
Means and T Statistics per Demographic Category of the WIT-S Test Scores

                     Mean      SD       t
Sample               18.26    6.00
Age                                     4.05**
  18-34 years        19.61    5.38
  35-59 years        16.94    6.30
Gender                                  2.34*
  Male               19.10    6.05
  Female             17.53    5.88
Education a                            –6.81**
  Lower or none      16.61    5.81
  Higher             21.05    5.27

Note. N = 318. SD = standard deviation of mean value. The t tests measured the differences in mean test scores on the WIT-S per demographic category. a Lower education comprised primary and secondary education; higher education comprised post-secondary education. * p < .05 ** p < .001


Table 3
Means and T Statistics per Demographic Category of the KAIT-IV-NL Test Scores

                     Total IQ                     Fluid IQ
                     Mean      SD       t         Mean      SD       t
Sample               64.48   11.61                32.26    8.06
Age                                     2.65*                        1.37
  18-34 years        66.21   10.50                32.89    6.65
  35-59 years        62.80   12.40                31.65    9.21
Gender                                  2.12*                        1.22
  Male               65.97   10.69                32.85    6.80
  Female             63.21   12.23                31.75    8.99
Education a                            –8.62**                      –7.56**
  Lower or none      60.60   10.80                29.84    6.72
  Higher             71.07    9.85                36.36    8.51

Note. N = 318. SD = standard deviation of mean value. The given means are mean values of the unstandardized test scores for total as well as fluid IQ. The t tests measured the differences in mean scores per demographic category. a Lower education comprised primary and secondary education; higher education comprised post-secondary education. * p < .05 ** p < .001

Tables 2 and 3 provide an overview of the mean WIT-S and KAIT-IV-NL test

scores, respectively. The t statistics per demographic category are also included in these

tables. The t tests measured the difference between the participants’ mean test scores on

the WIT-S and KAIT-IV-NL per demographic category. The mean test score on the

WIT-S was 18.26 (SD = 6.00). Furthermore, the mean test score on the KAIT was 64.48

(SD = 11.61) for total IQ and 32.26 (SD = 8.06) for fluid IQ. The unstandardized mean

total score on the Geneva Emotion Recognition Test (GERT) was 396.02 (N = 318; SD

= 20.81). As subgroup differences for the GERT are not of interest in the context of the

current research, they are not provided in this section.

There were significant differences in mean WIT-S scores between the age,

gender and education subgroups (see Table 2). The effect sizes of these differences

were small to moderate. Firstly, the mean of the subgroup 18-34 years (M = 19.61, SD =

5.38) significantly differed from the subgroup 35-59 years (M = 16.94, SD = 6.30),

t(316) = 4.05, p < .001, d = .46. Secondly, the mean of the male test takers (M = 19.10,

SD = 6.05) significantly differed from the mean of the female test takers (M = 17.53, SD


= 5.88), t(316) = 2.34, p = .020, d = .26. Thirdly, the mean of the lower/not educated

test takers (M = 16.61, SD = 5.81) significantly differed from the mean of the higher

educated test takers (M = 21.05, SD = 5.27), t(316) = –6.81, p < .001, d = .76. These

findings provide support for Hypothesis 1a, 1b and 1c as subgroup differences in mean

test scores for sex, age and educational level were present. On average, younger, male

and higher educated participants performed significantly better on the WIT-S than

older, female and lower or not educated participants, respectively.
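The reported test statistics can be approximately recovered from the subgroup summary statistics alone. The sketch below does so for the age comparison, using the means, standard deviations and subgroup sizes listed in Tables 1 and 2. It assumes a pooled-variance (equal variances) t test and a Cohen's d based on the pooled standard deviation; the function name is hypothetical, and small deviations from the reported values are to be expected because the inputs are rounded.

```python
from math import sqrt

def cohens_d_and_t(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d (pooled SD) and equal-variance t statistic from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = sqrt(pooled_var)
    d = (m1 - m2) / pooled_sd
    t = d / sqrt(1.0 / n1 + 1.0 / n2)
    return d, t, n1 + n2 - 2

# WIT-S scores: 18-34 years (N = 157) versus 35-59 years (N = 161), values from Tables 1 and 2
d, t, df = cohens_d_and_t(19.61, 5.38, 157, 16.94, 6.30, 161)
print(f"d = {d:.2f}, t({df}) = {t:.2f}")
```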

There were also significant differences in mean KAIT-IV-NL test scores on the

composite scale between the age, gender and education subgroups (see Table 3). Effect

sizes for age, gender and educational level were .30, .24, .97, respectively. Analogously

to the differences in mean scores on the WIT-S, younger, male and higher educated

participants, on average, had higher KAIT total IQ scores than older, female and lower

or not educated participants, respectively.

Contrarily, as for the differences in test scores on the Fluid IQ subscales, the

mean of the subgroup 18-34 years (M = 32.89, SD = 6.65) did not significantly differ

from the subgroup 35-59 years (M = 31.65, SD = 9.21), t(316) = 1.37, p = .171, d = .15.

In addition, the mean of the male participants (M = 32.85, SD = 6.80) also did not

significantly differ from the mean of the female participants (M = 31.75, SD = 8.99),

t(316) = 1.22, p = .225, d = .14. Analogously to the differences in mean scores on the

WIT-S, higher educated participants, on average, had higher KAIT fluid IQ scores than

lower or not educated participants. However, no significant subgroup differences in

mean scores were found for age and gender.

Analysis of the WIT-S items

Table 4 provides a comprehensive overview of the item facility indices, item

discrimination indices, item-rest correlations and percentages of participants who

completed a particular item of the WIT-S. Item facility indices ranged from .00 to .91;

the mean item facility index equalled .41 (SD = .31). Furthermore, item discrimination

indices ranged from –.08 to .61; the mean item discrimination index equalled .32 (SD =

.19). In addition, item-rest correlations ranged from –.14 to .55; the mean item-rest

correlation equalled .27 (SD = .17).

In total, 317 of the participants (99.69%) left one or more questions unanswered.

282 of the participants (88.68%) left 25% of the questions or more blank. Furthermore,


there were large, significant, negative correlations between the item number and the

item facility index (r = –.91, p < .001) and between the item number and the percentage

of participants who completed the particular item (r = –.95, p < .001). These results

support both the premise that the test was composed in such a way that the questions were of

increasing difficulty and the presence of test speededness.
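As a plausibility check, the correlation between item position and item facility can be recomputed from the rounded item facility indices listed in Table 4. This is only a sketch: it works with the two-decimal values from the table rather than the raw data, so the result should approximate, not exactly reproduce, the reported coefficient. The function name is hypothetical.

```python
from math import sqrt

# Item facility indices for items 1 to 45, as listed in Table 4 (rounded to two decimals)
ifi = [
    .78, .64, .76, .91, .79, .70, .69, .89, .79, .82,
    .71, .68, .17, .78, .71, .46, .53, .68, .66, .63,
    .81, .47, .54, .18, .36, .27, .34, .19, .32, .17,
    .09, .26, .08, .11, .09, .06, .03, .06, .02, .01,
    .01, .01, .00, .00, .01,
]
item_number = list(range(1, len(ifi) + 1))

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson(item_number, ifi)
mean_ifi = sum(ifi) / len(ifi)
print(f"r = {r:.2f}, mean IFI = {mean_ifi:.2f}")
```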

The following items did not meet any of the item facility, item discrimination

and item-rest correlation criteria as suggested by Rust and Golombok (2014): 4, 8, 13,

31, 37 and 39 through 45. Items 4 and 8 had an IFI greater than .75, suggesting that

these items were too easy and/or too many participants gave a correct answer. Items 13,

31, 37 and 39 through 45 had an IFI less than .25, suggesting that these items were too

difficult and/or too few participants gave a correct answer. Moreover, all of the above-

mentioned items’ item-total and item-rest correlations were below .20.

Further item distractor analysis for the multiple choice item 13 ‘Which letters

are next in this letter sequence? a m b b m c c m ? ?’ showed that the correct answer

was chosen 54 times (17.31% of total valid responses). However, distractor ‘1) dd’ was

chosen 232 times (74.36% of total valid responses). Besides a low item facility index

and highly frequently chosen distractor, its rc was negative and significant (rc = –.14, p

< .012). The other three distractors were chosen between 5 and 15 times. As for the

multiple choice item 31 ‘Which form can be made from the unfolded form on the left?’

item distractor analysis showed that the correct answer was chosen 29 times (27.10% of

total valid responses). In contrast, distractor ‘A’ was chosen 45 times (42.06% of total

valid responses). The other three distractors were chosen between 8 and 15 times.

Additional item distractor analysis for the multiple choice item 30 ‘Try to

discover the underlying simple calculation rule in this calculation task and estimate the

right answer without working out the calculation task completely. 536 888 372 : 2 = ?’

showed that the correct answer was chosen 28 times (22.22% of the total valid

responses). However, distractor ‘C) 268 444 186’ was chosen 54 times (42.86% of the

total valid responses). The other three distractors were chosen between 10 and 23 times.

Distractor analyses for items 37 and 39 through 45 were inconclusive as only 3 to 26

participants completed these items.


Table 4
Item Facility Indices (IFI), Item Discrimination Indices (IDI), Item-Rest Correlations and Percentage of Items Completed of the WIT-S

Item                         IFI a    IDI b    rc c    Completed d
 1. Reasoning                 .78      .36      .29      99.69
 2. Spatial                   .64      .45      .39      99.37
 3. Numeric                   .76      .35      .29      98.74
 4. Perceptual Speed          .91      .18      .13      99.37
 5. Numeric + Reasoning       .79      .39      .33      96.54
 6. Numeric                   .70      .14      .06      97.48
 7. Perceptual Speed          .69      .47      .41      97.48
 8. Numeric                   .89      .19      .14      99.69
 9. Perceptual Speed          .79      .24      .17      99.69
10. Reasoning                 .82      .47      .42      98.74
11. Spatial                   .71      .50      .44      98.74
12. Numeric                   .68      .33      .26      95.91
13. Reasoning                 .17     –.08     –.14      98.11
14. Perceptual Speed          .78      .34      .28      95.28
15. Perceptual Speed          .71      .28      .21      95.91
16. Numeric + Reasoning       .46      .34      .26      84.59
17. Numeric                   .53      .41      .33      81.13
18. Reasoning                 .68      .48      .42      91.19
19. Spatial                   .66      .58      .52      91.19
20. Numeric                   .63      .52      .46      83.65
21. Perceptual Speed          .81      .38      .32      88.68
22. Numeric                   .47      .51      .44      79.25
23. Perceptual Speed          .54      .61      .55      76.42
24. Numeric + Reasoning       .18      .42      .36      35.53
25. Numeric                   .36      .46      .39      55.35
26. Reasoning                 .27      .56      .50      55.35
27. Spatial                   .34      .50      .44      52.83
28. Numeric                   .19      .49      .44      39.62
29. Spatial                   .32      .60      .55      45.28
30. Numeric                   .17      .39      .34      39.62
31. Spatial                   .09      .13      .09      33.65
32. Reasoning                 .26      .47      .41      29.56
33. Numeric                   .08      .27      .22      16.04
34. Numeric + Reasoning       .11      .36      .31      15.72
35. Perceptual Speed          .09      .34      .30      14.15
36. Numeric                   .06      .34      .30      10.06
37. Reasoning                 .03      .19      .16       8.18
38. Spatial                   .06      .21      .17       8.81
39. Numeric                   .02      .07      .05       4.72
40. Numeric + Reasoning       .01     –.07     –.09       2.52
41. Spatial                   .01      .05      .03       4.72
42. Reasoning                 .01      .04      .02       3.77
43. Numeric                   .00      .00     –.01       2.52
44. Numeric                   .00      .09      .08        .94
45. Perceptual Speed          .01      .07      .05       3.14

Note. N = 318. Items were given a label indicating the factor they presumably operationalized. Underlined items did not meet the conditions specified in a, b and c. a IFI = Item Facility Index. IFI was calculated by dividing the number of participants correctly answering the question by the total number of participants. Indices in bold meet the following conditions: .25 ≤ IFI ≤ .75. b IDI = Item Discrimination Index. IDI equalled the correlation between the item and the participant’s total score. Indices in bold meet the following condition: IDI ≥ .20. c rc = item-rest correlation. rc equalled the corrected correlation between each item and the participant’s total score for the test; the item itself was hereby subtracted from the total for the correlation. Indices in bold meet the following condition: rc ≥ .20. d Percentage of participants who have completed the item, regardless of whether the question was answered correctly.

Hypothesis 2a was not supported by the results of the item analyses as the item

facility indices of 28 items were not included in the interval [.25; .75]. Besides, the

indices differed greatly between items and showed a few inconsistencies with the

premise that the questions increased in difficulty (e.g. items 4 and 8). Furthermore,

frequently endorsed distractors deflated the item facility indices in some cases (e.g.

items 13, 30 and 31). Hypothesis 2b was not supported by the findings either as the item

discrimination indices of 14 items were below the required item-total correlation

coefficient of .20; two indices were even negative.

Factor structure of the WIT-S

Exploratory factor analysis. The results of the exploratory factor analyses

generated a Kaiser-Meyer-Olkin Measure of Sampling Adequacy coefficient of .76.

Bartlett’s Test of Sphericity was significant (χ2(990, N = 318) = 3817.21, p < .001). The

data were thus likely to pool on the factors and the correlation matrix did not equal an


identity matrix. Communality estimates ranged from .12 to .65 (Mdn = .38); seventeen

communality coefficients were less than .30, which is considered to be low (Tabachnick

& Fidell, 2007) and most likely indicated low reliability levels of the individual factors

(Fabrigar, Wegener, MacCallum, & Strahan, 1999). Furthermore, 15 factors had

eigenvalues greater than 1. Investigation of the scree plot showed a cutoff between

factor 5 and 6 and Horn’s (1965b) parallel analysis produced scree plots that intersected

at an eigenvalue of 1.52, just above the eigenvalue of factor 6, which equalled 1.50 (see

Figure 3). On the basis of the theoretical fundamentals of the WIT-S discussed in the

introduction and the factor number identification methods, five factors were eventually

extracted and retained.

Table 5 provides an overview of the five-factor EFA of the WIT-S using the

principal axis factoring method with varimax rotation and Kaiser normalization. Direct

oblimin rotation rendered highly inconsistent pattern loadings; these results were not

reported here. Eigenvalues of factors 1 to 5 equalled 6.29, 3.63, 2.04, 1.93 and 1.87,

respectively. The rotated five-factor solution merely explained 28.02% of the variance

of the model. Factor loadings (FL) were considered to be moderate if FL ≥ |.30|, and

high if FL ≥ |.50|. Items 1 through 29 practically all loaded moderately to highly on the

first extracted factor − except for items 6, 9, 15 and 27, and items 4, 8 and 13, which

also did not meet the item analysis criteria (see above). Items 23 and 25 through 35

loaded moderately to highly on Factor 2. Items 35 through 39 loaded moderately to

highly on Factor 3. Two of the last items in the test (43 and 45) loaded highly on Factor

4. Factor 5 produced both moderately positive and negative inconsistent pattern

loadings.

Only three of Thurstone’s (1947) simple structure criteria were met. First of all,

each factor had at least five zero loadings (a loading between −.10 and .10; Gorsuch,

1983) and each pair of factors had items with salient loadings (a loading > |.30|; Kline,

2002) on one and zero loadings on the other. Furthermore, all pairs of factors had a

relatively small number of saliently loaded variables on both factors. However, one of

the items, item 39, did not produce at least one zero loading on one of the factors.

Moreover, each pair of factors had a proportion of zero loadings on both factors, yet the

proportion was not large for all pairs (e.g. four paired zero loadings on Factor 1 and 2).

Simple structure was thus not achieved.
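The zero-loading and salience thresholds used above can be illustrated with a minimal sketch. It applies the cutoffs named in the text (zero loading within ±.10, moderate from |.30|, high from |.50|) to four rows copied from Table 5. The label 'minor' for loadings between the zero band and the salience cutoff, and the function names, are assumptions introduced for illustration only; they do not appear in the original analysis.

```python
# Loadings for four items, copied from Table 5 (Factor 1 to Factor 5).
LOADINGS = {
    "R-IQ1": [0.38, 0.00, 0.03, -0.07, 0.05],
    "N-IQ39": [-0.16, 0.24, 0.56, -0.28, -0.48],
    "S-IQ38": [0.05, 0.04, 0.79, 0.03, 0.06],
    "PS-IQ35": [0.07, 0.33, 0.43, 0.29, 0.32],
}

ZERO_BAND = 0.10  # |loading| <= .10 is treated as a zero loading (Gorsuch, 1983)
MODERATE = 0.30   # |loading| >= .30 is salient / moderate (Table 5 note)
HIGH = 0.50       # |loading| >= .50 is high


def classify(loading):
    """Label one loading; 'minor' is a hypothetical name for the in-between band."""
    size = abs(loading)
    if size <= ZERO_BAND:
        return "zero"
    if size >= HIGH:
        return "high"
    if size >= MODERATE:
        return "moderate"
    return "minor"


def items_without_zero_loading(loadings):
    """Return the items that have no zero loading on any factor."""
    return [item for item, row in loadings.items()
            if all(classify(value) != "zero" for value in row)]


for item, row in LOADINGS.items():
    print(item, [classify(value) for value in row])
print("No zero loading:", items_without_zero_loading(LOADINGS))
```

Among these four rows, only item 39 lacks a zero loading, which matches the violation of simple structure described above.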


Figure 3. Parallel analysis scree plots based on Horn (1965b).

Overall, the hypothesized factor structure in Hypothesis 3 was not supported by

the results of the EFA. Firstly, the rendered factor structure did not meet the criteria for

simple structure, impeding interpretations. Secondly, a general latent factor (e.g. Gf)

could not be found as various factor loadings on the first factor were zero, negatively

salient or positively salient. Thirdly, the four premised mental abilities could not be

clearly distinguished on the basis of the above-mentioned findings. However, a clear

pattern was discovered as the first four factors represented specific parts of the test.

Factor 1, for instance, comprised saliently loaded items situated in the first half of the

test, whereas items situated in the middle of the test loaded moderately to highly on Factor 2, etc.

This suggests that the extracted, rotated factor structure was presumably an artefact of

the test’s speededness and the increasing difficulty of the questions.


Table 5 Rotated Factor Pattern Loadings of the Five-Factor Exploratory Factor Analysis of the WIT-S Data using Principal Axis Factoring with Varimax Rotation and Kaiser Normalization

Mental Ability          Item      Factor 1  Factor 2  Factor 3  Factor 4  Factor 5
Reasoning               R-IQ1      .38       .00       .03      −.07       .05
                        R-IQ10     .50       .03       .11      −.03      −.05
                        R-IQ13    −.19       .01      −.08       .15       .02
                        R-IQ18     .44       .17       .08      −.09       .02
                        R-IQ26     .35       .48      −.02      −.04       .04
                        R-IQ32     .06       .65       .26       .09       .20
                        R-IQ37     .10       .00       .61       .07       .18
                        R-IQ42    −.04       .04       .08       .01       .28
Numeric                 N-IQ3      .37       .06      −.05      −.04      −.15
                        N-IQ6      .19      −.13       .01      −.03      −.22
                        N-IQ8      .27      −.13       .05       .08      −.19
                        N-IQ12     .33       .05      −.11       .15      −.14
                        N-IQ17     .35       .20      −.12       .09      −.22
                        N-IQ20     .52       .15      −.07      −.07       .04
                        N-IQ22     .40       .27      −.08       .10      −.09
                        N-IQ25     .30       .39      −.12       .08      −.12
                        N-IQ28     .30       .47      −.07       .01      −.15
                        N-IQ30     .07       .49       .26       .23       .03
                        N-IQ33    −.04       .49       .10       .20      −.16
                        N-IQ36     .10       .27       .59       .29       .05
                        N-IQ39    −.16       .24       .56      −.28      −.48
                        N-IQ43    −.12       .09       .03       .68      −.07
                        N-IQ44    −.02       .18       .16      −.03      −.30
Numeric + Reasoning a   NR-IQ5     .44       .02      −.08       .14      −.01
                        NR-IQ16    .32       .07      −.06      −.03      −.02
                        NR-IQ24    .34       .26      −.08      −.06       .02
                        NR-IQ34   −.07       .44       .29       .03       .21
                        NR-IQ40   −.21       .10       .28      −.21      −.18
Spatial                 S-IQ2      .47       .05       .11      −.07       .03
                        S-IQ11     .58       .03       .07      −.10       .10
                        S-IQ19     .61       .12       .08      −.11       .21
                        S-IQ27     .25       .51      −.01      −.10       .34
                        S-IQ29     .36       .51       .17      −.04       .34
                        S-IQ31    −.14       .34       .08      −.15       .00
                        S-IQ38     .05       .04       .79       .03       .06
                        S-IQ41     .00       .03       .05      −.01       .21
Perceptual Speed        PS-IQ4     .18      −.04       .05       .06      −.02
                        PS-IQ7     .41       .16       .07      −.05      −.01
                        PS-IQ9     .27      −.06      −.07       .11      −.17
                        PS-IQ14    .32       .06       .01       .07       .02
                        PS-IQ15    .26      −.01       .02       .08       .02
                        PS-IQ21    .35       .16      −.07      −.13      −.04
                        PS-IQ23    .45       .41      −.01      −.10       .20
                        PS-IQ35    .07       .33       .43       .29       .32
                        PS-IQ45   −.03       .04       .23       .57       .07

Note. N = 318. Item labels refer to the operationalized latent construct (e.g. R for Reasoning) and item position in the test (e.g. IQ1 for item 1). Factor loadings ≥ |.30| were considered salient and are displayed in bold typeface. Underlined items did not meet the item analysis criteria (see Table 4). a These items presumably operationalized two mental abilities: Reasoning and Numeric Reasoning.

Confirmatory factor analysis. Results of the confirmatory factor analysis

showed that none of the premised models containing all 45 item variables provided

sufficient fit with the data (see Table 6). All χ² test statistics were furthermore

significant (p < .001), indicating bad fit of the observed covariance with the model-

implied covariance. CFI values were less than .95, suggesting that the specified models

did not fit better with the data than the baseline models. Moreover, the RMSEAs were

greater than .080, indicating bad model fit in general. Possible explanations for the

misfit are that (a) large proportions of the (un)standardized factor loadings in all of the

three models were both positive and negative, and (b) numerous factor loadings were

insignificant (p > .05). Finally, the chi square difference test (see Table 7) showed that

the four-factor model fit better with the observed data than the one- and two-factor

models (p < .001). No difference was found between the one- and two-factor models (p =

.106).


Table 6 Goodness-of-Fit Statistics of Full and Partial Models of the WIT-S

                      Chi Square
Model                 χ²        df     CFI    RMSEA
One-Factor Model
  Full                3018.17   945    .49    .083
  Partial             1042.25   495    .82    .059
Two-Factor Model
  Full                3014.68   944    .49    .083
  Partial             1042.03   494    .82    .059
Four-Factor Model
  Full                2937.28   939    .50    .082
  Partial              989.17   489    .84    .057

Note. N = 318. Full models comprised all 45 items; partial models comprised the 33 items that met at least one of the item analysis criteria (see Table 4). CFI = Comparative Fit Index. RMSEA = Root Mean Square Error of Approximation. All chi square statistics were significant (p < .001).
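As a plausibility check on Table 6, the degrees of freedom and RMSEA values can be recomputed from the reported chi-square statistics. The sketch below rests on two assumptions that the text does not state explicitly: each item loads on exactly one factor with one residual variance (as described for Figure 4), and RMSEA uses the common N − 1 denominator. The function names are hypothetical.

```python
import math

N = 318  # sample size reported in the Table 6 note


def cfa_df(n_items, n_factor_correlations):
    """Degrees of freedom for a simple-structure CFA.

    Assumes one free loading and one residual variance per item, with the
    factor scale fixed (an assumption, not stated in the source).
    """
    observed_moments = n_items * (n_items + 1) // 2
    free_parameters = 2 * n_items + n_factor_correlations
    return observed_moments - free_parameters


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA using the N - 1 convention."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# One factor has no factor correlations, two factors have 1, four have 6.
print(cfa_df(45, 0), cfa_df(45, 1), cfa_df(45, 6))  # full models
print(cfa_df(33, 0), cfa_df(33, 1), cfa_df(33, 6))  # partial models

print(round(rmsea(3018.17, 945, N), 3))  # one-factor full model
print(round(rmsea(989.17, 489, N), 3))   # four-factor partial model
```

Under these assumptions the recomputed values agree with the degrees of freedom and RMSEA estimates listed in Table 6.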

Table 7 Comparisons of Model Fit using Scaled Chi Square Difference Tests

Model                 χ²        df     Δχ²       Δdf    p
Full Models
  Four-Factor         7735.42   939    −         −      −
  Two-Factor          8090.88   944    34.75     5      < .001
  One-Factor          8099.37   945    2.61      1      .106
Partial Models
  Four-Factor         1197.05   489    −         −      −
  Two-Factor          1284.44   494    36.36     5      < .001
  One-Factor          1284.51   495    .02       1      .885
Four-Factor Models
  Partial             1197.05   489    −         −      −
  Full                7735.42   939    1210.34   450    < .001

Note. N = 318. Full models comprised all 45 items; partial models comprised the 33 items that met at least one of the item analysis criteria (see Table 4). A simple approximation of the scaled χ² was calculated using the method of Satorra and Bentler (2001).
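The p-values in Table 7 follow from referring each reported Δχ² to a chi-square distribution with Δdf degrees of freedom. The sketch below shows only that last step, using the standard library. It takes the Δχ² values as reported and does not reproduce the Satorra and Bentler (2001) scaling itself, which requires scaling correction factors that are not given here. The function name `chi2_sf` is hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    Terms are computed in log space to avoid overflow for large df.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a = 1.0
        q = math.exp(-y)              # Q(1, y)
    else:
        a = 0.5
        q = math.erfc(math.sqrt(y))   # Q(1/2, y)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# (delta chi-square, delta df) pairs as reported in Table 7
for delta_chi2, delta_df in [(34.75, 5), (2.61, 1), (36.36, 5), (0.02, 1)]:
    print(delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df))
```

Small deviations from the tabled p-values (for instance for Δχ² = .02) are to be expected, because the tabled Δχ² values are rounded to two decimals.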


Because the ‘full’ models comprising all items did not fit well with the observed

data, additional CFAs were executed on partial models. The items that did not meet at

least one of the item analysis criteria (N = 12; cfr. supra) were hereby omitted from the

model, leaving us with 33 items in total per partial model. χ² statistics were significant

for all partial models (p < .001) and CFI values did not exceed .95. The RMSEA ranged

between .057 and .059, indicating mediocre model fit. The chi square difference test

indicated improved fit of the four-factor partial model compared to the one- and

two-factor partial models (p < .001; see Table 7). No improved fit of the two-factor model

against the one-factor model was found (p = .885). In fact, the four-factor partial model

showed better fit than all of the above-mentioned models. In the next two paragraphs,

parameter estimates (i.a. (un)standardized factor loadings and factor correlations) of the

best fitting model, the four-factor partial model, are provided.

All unstandardized factor loadings of the four-factor partial model were positive

and standard errors ranged between .13 and .36, which is considered to be low to

moderately low (Kline, 2005). In addition, not all standardized factor loadings of the

four-factor partial model were large (> .50) or significant (e.g. p = .467 for item 6).

They ranged between .35 and .77 for Reasoning, between .38 and .78 for Numeric

Reasoning (except for item 6, which was .06), between .50 and .92 for Spatial

Reasoning and between .19 and .76 for Perceptual Speed. Omitting the insignificantly

loaded item 6 from the model did not provide improved fit (p = .630), nor did

introducing a fifth factor comprising the five Numeric + Reasoning items (p = .713).

Figure 4 provides a simplified graphic overview of the standardized model

parameter estimates of the four-factor partial model; multiple items were left out from

the figure in order to provide a clear overview. Endogenous item variables are presented

within rectangles; exogenous latent factors are displayed within ellipses. Single-headed

arrows reflect standardized factor loadings; double-headed arrows reflect factor

correlations. Each observed indicator was modelled to be directly influenced by exactly

one unobserved factor. All factor correlations exceeded .68, meaning that all factors

were related to each other. However, the latent-variable covariance matrix of the four-

factor partial model was not positive definite – the correlation between Reasoning and

Perceptual Speed equalled 1.02 – indicating that the sample size was either too small or

that there were problems with the model specification.


Figure 4. Partial four-factor model with standardized parameter estimates. Multiple

items were omitted from the figure to provide a clear overview.

In sum, the hypothesized factor structure in Hypothesis 3 was not supported by

the results of the CFAs. Goodness-of-fit statistics for both full and partial models

indicated that (a) the observed covariance did not fit well with the model-implied

covariance and (b) the premised models did not fit better with the data than the baseline

models. However, RMSEA values indicated mediocre fit of the partial models, with the

four-factor partial model showing the best overall fit. Hence, the hypothesized model

was not identified but the four-factor partial model at least showed improved fit.


Reliability of the WIT-S

An overview of the WIT-S reliability estimates as well as the ratio between the

mean of the unanswered items per participant and the variance of the total test scores is

provided in Table 8. The composite scale consisted of 45 items (α = .84). The

Reasoning, Numeric Reasoning, Spatial Reasoning and Perceptual Speed subscales

consisted of 13, 15, 8 and 9 items, respectively. Alpha coefficients ranged between .51

(Perceptual Speed) and .66 (Numeric Reasoning and Spatial Reasoning). The alpha

coefficient equalled .57 for the Reasoning subscale. Only the composite-scale alpha

coefficient exceeded .70, which was premised as acceptable. To put this relatively high

value in perspective – as alpha is known to produce higher levels of reliability if the

number of items in the equation is substantial (Cortina, 1993; Schmitt, 1996) – it is

worthwhile to note that the average inter-item correlation was small (r = .09).
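The dependence of alpha on test length can be made concrete with the standardized alpha formula, which expresses alpha as a function of the number of items and the average inter-item correlation only. This is a sketch under an assumption: standardized alpha is not the raw alpha reported above, and the average correlation of .09 is rounded, so the result only approximates the reported value of .84.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized alpha from the number of items and the mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Same average inter-item correlation, different scale lengths.
print(round(standardized_alpha(45, 0.09), 2))  # composite scale length
print(round(standardized_alpha(9, 0.09), 2))   # length of the shortest subscales
```

With the same small average inter-item correlation, a 45-item scale already reaches a value above .80, whereas a 9-item scale stays below .50. This is in line with the argument of Cortina (1993) and Schmitt (1996) cited above.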

Uncorrected split-half reliability estimates indicated higher levels of reliability

overall. The split-half coefficient of the composite scale (.86), as well as those of the Numeric

Reasoning (.72) and Spatial Reasoning (.71) subscales, exceeded .70 and

were thus viewed as acceptable. The split-half coefficients of the Reasoning (.67)

and Perceptual Speed (.55) subscales did not. By contrast, only the corrected

split-half reliability of the composite scale (.81) was acceptable. The corrected

split-half estimates of the individual subscales were all negative. The values equalled

−.49, −.33, −.31 and −.16 for the Reasoning, Spatial Reasoning, Perceptual Speed and

Numeric Reasoning subscales, respectively.

Hypothesis 4 was only partially supported by the results of the reliability

analysis. The composite-scale lower bound alpha and both split-half estimates produced

sufficient levels of reliability, exceeding the premised value. However, except for the

uncorrected split-half estimates of the Numeric Reasoning and Spatial Reasoning

subscales, none of the reliability estimates of the subscales equalled or exceeded the

hypothesized value. In fact, the corrected individual-scale split-half estimates even

produced negative values. The high ratios between the mean number of unanswered items and the total score variance (M/S²; see Table 8) of the subscales indicated that the

negative corrected split-half reliability estimates were presumably an artefact of the high

total number of unanswered items per participant.


Table 8 Reliability Estimates and the Ratio between Mean Unanswered Items and Total Test Score Variance of the Composite Scale and Individual Scales of the WIT-S

                        Reliability Estimates
Scale                   α      Split-half R   Corrected split-half R   M/S²
Composite-Scale         .84    .86             .81                     .52
Individual-Scale
  Reasoning a           .57    .67            −.49                     1.53
  Numeric Reasoning     .66    .72            −.16                     1.27
  Spatial Reasoning     .66    .71            −.33                     1.23
  Perceptual Speed      .51    .55            −.31                     .94

Note. N = 318. α = Cronbach’s (1951) lower bound alpha. Split-half R = split-half reliability of odd- and even-numbered halves. Corrected split-half R = corrected split-half reliability coefficient, estimated using Gulliksen’s (1950) formula, where K equalled 23. M/S² = ratio between the mean of the number of unanswered items per participant and the variance of the total test scores. Reliability estimates in bold were greater than .70. a Comprising items labelled ‘Reasoning’ and ‘Numeric + Reasoning’.

Construct validity of the WIT-S

Table 9 provides an overview of the correlations between the WIT-S, KAIT and

GERT. First of all, the correlations between the WIT-S and KAIT were significant and

large for both total IQ (r = .54, p < .001) and fluid IQ (r = .51, p < .001). These findings

provide support for Hypothesis 5a and hence convergent validity. The correlation

between the WIT-S and the crystallized subscale of the KAIT was furthermore

significant and moderate (r = .29, p < .001). Secondly, the correlation between the

GERT and WIT-S was generally smaller, yet significant (r = .19, p = .001), similar to

its correlation with the composite scale of the KAIT. The correlation between the

WIT-S and KAIT exceeded the correlation between the WIT-S and GERT (z = 5.17, p <

.001). These findings provide support for Hypothesis 5b and hence divergent validity.
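The text does not name the test behind the reported z value. As a sketch, the standard Fisher r-to-z comparison of two correlations, treating them as if they came from two independent samples of N = 318 each, closely reproduces it. Both the choice of this formula and the function name are assumptions made here for illustration.

```python
import math


def fisher_z_difference(r1, r2, n1, n2):
    """z statistic for the difference between two correlations (Fisher r-to-z).

    Treats the correlations as coming from independent samples, which is an
    assumption of this sketch rather than a documented analysis choice.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error


# WIT-S with KAIT total IQ (r = .54) versus WIT-S with GERT (r = .19)
z = fisher_z_difference(0.54, 0.19, 318, 318)
print(round(z, 2))
```

Because both correlations were obtained in the same sample and share the WIT-S, a test for dependent correlations would be an alternative; it would use the KAIT–GERT correlation from Table 9 as well.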

Correlations between the WIT-S and KAIT fluid IQ subscales were small to

moderate, yet significant, ranging from .14 to .32. Reasoning most substantially

correlated with Rebus Learning (r = .21, p < .001), Logical Steps (r = .28, p < .001) and

Memory for Block Designs (r = .18, p = .002). Spatial Reasoning most substantially

correlated with Mystery Codes (r = .32, p < .001). The correlation between Spatial


Reasoning and Memory for Block Designs equalled .17 (p = .003). Although relatively

small, these significant correlations provide additional support for Hypothesis 5a.

Table 9 Correlations between the Short Form of the Wilde Intelligenztest, Kaufman Adolescent and Adult Intelligence Test and Geneva Emotion Recognition Test

                                KAIT
                      WIT-S     Total IQ   Fluid IQ   Crystallized IQ
KAIT Total            .54**
KAIT Fluid            .51**     .89**
KAIT Crystallized     .29**     .78**      .70**
GERT                  .19**     .20**      .14*       .15**

Note. N = 318. WIT-S = Short form of the Wilde Intelligenztest. KAIT = Kaufman Adolescent and Adult Intelligence Test (Dutch translation). GERT = Geneva Emotion Recognition Test (Dutch translation). * p < .05 ** p ≤ .001


DISCUSSION

The objective of the current study was to provide an answer to the question

whether the short form of the Wilde Intelligenztest (WIT-S) possesses adequate

psychometric qualities and could hence provide practitioners with a reliable and valid

alternative to time-consuming, full-scale intelligence tests in a personnel selection

context. Firstly, I explored differences in mean scores for age, sex and educational level.

Secondly, I calculated item facility and item discrimination indices and performed

distractor analysis to assess the psychometric qualities of the items. Thirdly, I carried

out exploratory and confirmatory factor analyses to investigate the presence of the

premised latent factor structure and overall model fit. Finally, I evaluated reliability

estimates of the WIT-S and the construct interrelations within its nomological network,

namely general intelligence and emotional intelligence.

General findings and implications

Test score differentiation. In accordance with the literature (e.g. Neisser et al.,

1996; Nisbett et al., 2012), men, younger participants aged 18 to 34 years and higher

educated test takers attain, on average, significantly higher WIT-S scores than women,

older people aged 35 to 59 years and lower/not educated test takers. This can be

explained from a theoretical perspective as the WIT-S presumably measures fluid

intelligence (Kersting et al., 2008), defined as the ability to handle novel, uncommon

problems (Carroll, 1993). Males, adolescents, young adults as well as people who have

received (higher) formal education generally achieve better scores on Gf-oriented tests

than their female, older and less educated counterparts (Johnson & Bouchard, 2007;

Neisser et al., 1996; Nisbett et al., 2012). This theoretical explanation is furthermore

supported by the construct validity estimates (cfr. infra).

The results thus indicate that the WIT-S is, in se, biased towards male, younger

and higher educated people. This bias is detrimental, especially in a personnel selection

context. Even more so, it is desirable to minimize the problems practitioners face with

unequal subgroup selection rates (Ployhart & Holtz, 2008). Ideally, and in order to be

able to use the WIT-S in a selection context without inflating adverse impact in the

process, standardization and the development of norm groups should be performed

(Sackett & Wilk, 1994). However, standardization based on the current research would

be flawed because of the (a) skewed age distribution of the sample, (b) inconsistent


factor structure and (c) generally low reliability estimates. Hence, norm groups are not

provided here.

Items and their latent factors. In contrast to what was hypothesized, analysis

of the WIT-S items shows that a large number of items do not possess an adequate

difficulty level and some items insufficiently discriminate between test takers. Extremely

easy or difficult items are redundant as a participant’s total score is practically the same

when the item is not included in the test (Rust & Golombok, 2014). In addition, items

that do not adequately discriminate between high and low performers make little to no

contribution to the test either (Emons et al., 2007; Rust & Golombok, 2014). Looking

for an answer as to why the item characteristics of the WIT-S show inadequate

psychometric qualities, I made two interesting, yet logical observations: (a) a very high

number of participants left a considerable number of items blank and (b) the position of

the items in the test is, generally speaking, inversely related to the item facility.

These findings not only suggest the classification of the WIT-S as a speed test (Van der

Linden, 2011a) but also support the premise that the questions are of increasing

difficulty.

Item analysis consequently indicates that a substantial number of items would

ideally be omitted from the WIT-S as they are too easy, too difficult, do not adequately

discriminate or because one or more of their distractor options are chosen too

frequently. Doing so for all items that do not meet the criteria would be inappropriate

because the test itself is time-restricted, on the one hand, and designed in such a way

that the items are of increasing difficulty, on the other hand. Some items, however,

should at least be revised, especially the items subjected to distractor analysis, the items

with extreme item facility index values and the items that were completed by a

substantial number of participants (i.e. 75%) but produced very low or even negative

item-total correlations.

In the current study, the WIT-S furthermore does not reproduce the theorised

factor structure including Gf and the four mental abilities Reasoning, comprising

Numeric Reasoning and Spatial Reasoning, and Perceptual Speed. On the contrary, both

full (including all items) and partial (including items with more robust psychometric

attributes) models do not sufficiently fit with the data. The four-factor partial model

does, however, provide improved fit, indicating that the four mental abilities are


interrelated to some extent. Additionally, a clear pattern was discovered through

exploratory factor analysis: the found structure is most likely an artefact of the test’s

speededness. This effect is also noted by Kline (2000) as item correlations in speeded

tests frequently are anomalies caused by timing and order. The analysis itself hereby

produces grouped item loadings on each factor. Although sufficiently conclusive, these

findings do not necessarily rule out the reproduction of the latent structure in future

research (cfr. infra).

Reliability and construct validity. The reliability estimates of the composite

scale of the WIT-S, even Gulliksen’s (1950) split-half correction for speed restrictions,

exceed the hypothesized value of .70. Contrarily, the estimates of the individual scales

do not, except for the uncorrected split-half estimates of the Numeric Reasoning and

Spatial Reasoning subscales. Acceptable interrelatedness of the items is thus present,

yet insufficient evidence for the reliability of the subscales is found in the current study.

In this respect, the literature discussed was more promising (see i.a. Bilker et al., 2012;

Donnell et al., 2007; Jeyakumar et al., 2004). Yet again, the results suggest that the

unacceptable subscale reliability estimates are likely due to the test’s speededness.

These low subscale reliability coefficients do not by definition entail that the

WIT-S is an unreliable measure. The results do imply that a small proportion of the test

variance is attributable to a general factor and that the items are interrelated to some

extent (Cortina, 1993; Sijtsma, 2009), which is contrarily not the case for the subscales

of the WIT-S. We have to be careful, though, to not exaggerate this general

interrelatedness. The sample size in the current study is large and it has been shown

that, along with low item-factor loadings (which apply in the current study), the

presence of systematic measurement errors possibly inflates reliability estimates (Cortina,

1993; Shevlin, Miles, & Walker, 2000).

In addition, the WIT-S has significant and substantial overlap with a full-form

measure of intelligence, the Kaufman Adolescent and Adult Intelligence Test (Kaufman

& Kaufman, 1993), and weak, yet significant overlap with an emotion recognition test,

the Geneva Emotion Recognition Test (Schlegel et al., 2014). The overlap between the

WIT-S and KAIT furthermore surpasses the overlap between the WIT-S and GERT.

The results consequently indicate convergence of the WIT-S with a general factor of

intelligence (g) or a more fluid-oriented form of intelligence (Gf), as measured by the


KAIT. The results furthermore indicate discrimination between the WIT-S and the

ability to adequately recognize emotion expressions, as measured by the GERT. This is

in line with previous research on the interrelatedness of short forms of intelligence

measures and similarly structured, long-form tests (see i.a. Canivez, 2005; Dodrill &

Warner, 1988; Prewett, 1995) and on the relationship between measures of general

intelligence and ability measures of emotional intelligence (Van Rooy et al., 2005).

Limitations

First of all, the skewness of the age distribution of the participants is one of

the reasons why standardization and subsequently the construction of norm groups is

not desirable. Participants, who were mainly acquaintances of undergraduate students,

were conveniently selected as opposed to randomly sampled. In combination with the

broadly specified age categories in which students were allowed to search for

participants, the non-probability sampling resulted in the overrepresentation of certain age

categories. The lowered representativeness on the basis of age is consequently

detrimental for the external validity.

Secondly, only reliability measures estimating the interrelatedness of the items

are assessed (i.e. lower bound alphas and split-halves). Although providing these

estimates is not considered to be poor practice, solely relying on these measures is (see

i.a. Cortina, 1993; Schmitt, 1996; Sijtsma, 2009). For one thing, these estimates do not

imply the presence of unidimensionality, nor do they provide information on the

multidimensionality of the measure that is subject of the research (Cortina, 1993;

Schmitt, 1996). Moreover, no definite cutoff can be regarded as the ‘holy grail’ or ideal

level of reliability a test or one of its subscales should eventually reach (Schmitt, 1996).

Alternative forms of reliability should hence be explored. However, on a more positive

note, the current study does provide information on the item-total correlations, latent

factor structure and interrelatedness of the construct measured within its nomological

network, which is advocated for by Sijtsma (2009).

Thirdly, it is important to note that the estimations of convergent and

discriminant validity are not measured under perfect conditions. The tests administered

in the current study not only measure different constructs but also use different methods

to measure them. Here, the WIT-S represents a time-constrained intelligence test, the

KAIT is a full-length intelligence test and the GERT is a full-length emotion


recognition ability test. This differentiation in methodology can serve as a potential

source of construct-irrelevant variance (Arthur & Villado, 2008). Had either the

construct or the method been controlled for in the research design, this variance would have

been isolated, permitting more meaningful construct comparisons (Arthur &

Villado, 2008).

Finally, the inconsistent factor structure, low reliability estimates as well as the

large proportion of inadequate item properties are largely due to test speededness and

the way the test was designed and administered. As a result, it is not possible to draw

solid conclusions on the basis of the current study. However, this does not mean that the

test and its items are unreliable or produce an inconsistent factor structure per se.

Differently operationalized future research can undoubtedly provide academics as well

as practitioners with more information on the true reliability and latent structure of the

WIT-S.

Suggestions for future research

In the first place, I suggest that future research looks beyond Cronbach’s (1951)

time-honoured alpha and other measures of interrelatedness, especially when dealing

with test speededness and its resulting anomalies, as is the case with the WIT-S. In

similar scenarios it is more appropriate to look into parallel forms reliability, measuring

the correlation of the individual and composite scales of one form of the test with a

second form of the test (Gulliksen, 1950; Kline, 2000). When doing so, researchers

should make sure that the content coverage of the test, scales and even items is

preserved across both forms to the greatest extent possible (see Smith et al., 2000).

Secondly, the WIT-S could be administered in a slightly speeded or even

non-speeded manner, similar to the administration of power tests. In this respect, reliability

coefficients of internal consistency like alpha and split halves or McDonald’s (1999) ω

and Raykov, Dimitrov, and Asparouhov’s (2010) ρ for categorical data would render

more meaningful estimates on both the composite-scale and individual-scale level. The

administration of a non- or slightly speeded version can also benefit the outcome of

factor analytical research. The artefacts caused by timing and order are hereby

neutralised, making it possible to unveil the true latent factor structure by means of

structural equation modelling instead of a flawed structure based on grouped item


loadings. It will furthermore help us to investigate whether the short form preserves the

content coverage of each factor in the multifactorial instrument (Smith et al., 2000).

Thirdly, a more complex approach comprising a full-fledged multitrait–

multimethod design containing several different constructs measured by various

methods is ideally elaborated in future research (Arthur & Villado, 2008; Campbell &

Fiske, 1959). This multitrait-multimethod design could comprise general intelligence

(measures of g as well as Gf), emotional intelligence and personality traits. These are

subsequently measured in their full-length and short form or even in a self-report or

peer-evaluation format. This way, one could control for anomalies of method as well as

construct differentiation. Moreover, the relatedness and overlap with its parent test, the

WIT (Jäger & Althoff, 1983), could be explored by means of independent test

administrations of both forms (using an appropriate test-retest interval) to elaborate

even further on its construct validity (see Smith et al., 2000).

Finally, I suggest that future research also looks beyond construct validity and,

once its factor structure and internal consistency are clear and sound, focusses on

classification rates and predictive validity. The goal of the WIT-S is namely to measure

an applicant’s intelligence in a personnel selection context. In this respect, it is essential

to show that the same accuracy of classification, present in the parent form, is

transferred to the abbreviated form as well (Smith et al., 2000). Relating the WIT-S to

one or more criteria, for example job performance (see Schmidt & Hunter, 1998),

will furthermore contribute to the general validation process of the short-form measure.

Conclusion

Since the early years of test development, researchers have explored several

ways to shorten tests, because full-length measures are often time-consuming and costly

(Smith et al., 2000; also see Doll, 1917; Thorndike et al., 1926). In the current research,

I have investigated the potential of a short form of the Wilde Intelligenztest (WIT-S) for

personnel selection, where (a) time is often a crucial element (Klimoski & Jones, 2008)

and (b) being able to precisely predict an applicant’s future job performance is an

important objective (Highhouse, 2008; Schmidt & Hunter, 2004). In-depth

psychometric research on the reliability and validity of the WIT-S gives us a first

impression of the potential employability of the abbreviated measure.


First of all, the WIT-S differentiates between age, sex and educational level

subgroups whereby male, younger and higher educated test takers, on average, attain

better scores than their counterparts. This corresponds with earlier research in the field

of intelligence, where measures of fluid intelligence produce similar test characteristics

(cfr. supra). In addition, the items of the WIT-S are sufficiently interrelated as the

composite scale reaches satisfactory reliability levels. Finally, the convergence of the

WIT-S with a full-length measure of general intelligence (Kaufman Adolescent and

Adult Intelligence Test) and divergence with an emotional ability test (Geneva Emotion

Recognition Test) suggest that the short form measures a general factor of intelligence

to at least some extent.

By contrast, the current research is not able to show (a) that each item of the

WIT-S attains an adequate difficulty and discriminatory level, (b) that the WIT-S

preserves the factor structure of a multifactorial measure and (c) that the subscales of

the WIT-S reach satisfactory levels of reliability. Because reliability precedes validity,

it cannot be concluded here that the WIT-S is a reliable and valid alternative to

full-form intelligence tests. The way the test was designed and eventually administered is

likely the root cause of these limitations. Therefore, future research should (a) focus on

parallel forms reliability, (b) investigate the WIT-S in its non- or less speeded form and

(c) control for construct and method variations or use a more complex multitrait–

multimethod design. In sum, future research on the reliability and latent factor structure

of the WIT-S should provide more conclusive findings about its psychometric qualities

before the short-form intelligence test can be employed in a personnel selection context.


REFERENCES

Abad, F. J., Colom, R., Juan-Espinosa, M., & Garcia, L. F. (2003). Intelligence

differentiation in adult samples. Intelligence, 31(2), 157–166.

doi:10.1016/s0160-2896(02)00141-1

Ackerman, P. L. (1996). A theory of adult intellectual development: Process,

personality, interests, and knowledge. Intelligence, 22(2), 227–257.

doi:10.1016/s0160-2896(96)90016-1

Ackerman, P. L., Beier, M. E., & Boyle, M. O. (2005). Working memory and

intelligence: The same or different constructs? Psychological Bulletin, 131(1),

30–60. doi:10.1037/0033-2909.131.1.30

Adams, R. L., Smigielski, J., & Jenkins, R. L. (1984). Development of a Satz-Mogel

Short Form of the WAIS-R. Journal of Consulting and Clinical Psychology,

52(5), 908–908. doi:10.1037/0022-006x.52.5.908

Agnello, P., Ryan, R., & Yusko, K. P. (2015). Implications of modern intelligence

research for assessing intelligence in the workplace. Human Resource

Management Review, 25(1), 47–55. doi:10.1016/j.hrmr.2014.09.007

Arthur, W., & Villado, A. J. (2008). The importance of distinguishing between

constructs and methods when comparing predictors in personnel selection

research and practice. Journal of Applied Psychology, 93(2), 435–442.

doi:10.1037/0021-9010.93.2.435

Axelrod, B. N., Ryan, J. J., & Ward, L. C. (2001). Evaluation of seven-subtest short

forms of the Wechsler Adult Intelligence Scale-III in a referred sample. Archives

of Clinical Neuropsychology, 16(1), 1–8. doi:10.1093/arclin/16.1.1

Baddeley, A. D. (1968). A three-minute reasoning test based on grammatical

transformation. Psychonomic Science, 10(10), 341–342.

doi:10.3758/bf03331551

Barkema, H. G., Baum, J. A. C., & Mannix, E. A. (2002). Management Challenges in a

New Time. Academy of Management Journal, 45(5), 916–930.

doi:10.2307/3069322

Bartholomew, D. J., Steele, F. S., Moustaki, I., & Galbraith, J. I. (2008). Analysis of

multivariate social science data. Boca Raton: Chapman & Hall/CRC.


Basak, C., Boot, W. R., Voss, M. W., & Kramer, A. F. (2008). Can training in a real-

time strategy video game attenuate cognitive decline in older adults? Psychology

and Aging, 23(4), 765–777. doi:10.1037/a0013494

Bialik, D. (2008). Brief measures of cognitive abilities: A concurrent validity

investigation of the WRIT and WASI. Dissertation Abstracts International:

Section B. Sciences and Engineering, 68(12B), 8443.

Bilker, W. B., Hansen, J. A., Brensinger, C. M., Richard, J., Gur, R. E., & Gur, R. C.

(2012). Development of Abbreviated Nine-Item Forms of the Raven’s Standard

Progressive Matrices Test. Assessment, 19(3), 354–369.

doi:10.1177/1073191112446655

Blair, C. (2006). How similar are fluid cognition and general intelligence? A

developmental neuroscience perspective on fluid cognition as an aspect of

human cognitive ability. Behavioral and Brain Sciences, 29, 109–125.

doi:10.1017/S0140525X06009034

Borella, E., Carretti, B., Riboldi, F., & De Beni, R. (2010). Working memory training in

older adults: Evidence of transfer and maintenance effects. Psychology and

Aging, 25(4), 767–778. doi:10.1037/a0020683

Boring, E. G. (1923). Intelligence as the tests test it. The New Republic, 35, 35–37.

Borman, W. C., Hanson, M. A., Oppler, S. H., Pulakos, E. D., & White, L. A. (1993).

Role of early supervisory experience in supervisor

performance. Journal of Applied Psychology, 78(3), 443–449.

doi:10.1037/0021-9010.78.3.443

Brown, D. T. (1994). Review of the Kaufman Adolescent and Adult Intelligence Test

(KAIT). Journal of School Psychology, 32(1), 85–99. doi:10.1016/0022-

4405(94)90030-2

Brown, M. B., & Forsythe, A. B. (1974). Robust Tests for the Equality of Variances.

Journal of the American Statistical Association, 69(346), 364–367.

doi:10.1080/01621459.1974.10482955

Brown, W. (1910). Some experimental results in the correlation of mental abilities.

British Journal of Psychology, 1904–1920, 3(3), 296–322. doi:10.1111/j.2044-

8295.1910.tb00207.x


Buckley, M. R., Norris, A. C., & Wiese, D. S. (2000). A brief history of the selection

interview: may the next 100 years be more fruitful. Journal of Management

History, 6(3), 113–126. doi:10.1108/eum0000000005329

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the

multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

doi:10.1037/h0046016

Canivez, G. L. (1995). Validity of the Kaufman Brief Intelligence Test: Comparisons

With the Wechsler Intelligence Scale for Children – Third Edition. Assessment,

2(2), 101–111.

Canivez, G. L. (2005). Construct Validity of the Kaufman Brief Intelligence Test,

Wechsler Intelligence Scale for Children-Third Edition, and Adjustment Scales

for Children and Adolescents. Journal of Psychoeducational Assessment, 23(1),

15–34. doi:10.1177/073428290502300102

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies.

Cambridge: Cambridge University Press. doi:10.1017/cbo9780511571312.002

Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical

experiment. Journal of Educational Psychology, 54(1), 1–22.

doi:10.1037/h0046743

Cattell, R. B. (1966). The Scree Test For The Number Of Factors. Multivariate

Behavioral Research, 1(2), 245–276. doi:10.1207/s15327906mbr0102_10

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences: Second

Edition. New York, NY: Lawrence Erlbaum Associates, Publishers.

doi:10.4324/9780203771587

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and

applications. Journal of Applied Psychology, 78(1), 98–104. doi:10.1037/0021-

9010.78.1.98

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.

Psychometrika, 16(3), 297–334. doi:10.1007/bf02310555

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.

Psychological Bulletin, 52(4), 281–302. doi:10.1037/h0040957


Deary, I. J. (2000). Looking down on human intelligence: from psychometrics to the

brain. New York, NY: Oxford University Press.

doi:10.1093/acprof:oso/9780198524175.001.0001

DeNisi, A., Hitt, M. A., & Jackson, S. E. (2003). The knowledge-based approach to

sustaining competitive advantage. In S.E. Jackson, M.A. Hitt, & A.S. DeNisi

(Eds.), Managing knowledge for sustained competitive advantage (pp. 3–33).

San Francisco: John Wiley & Sons, Inc.

Detterman, D. K., & Daniel, M. H. (1989). Correlations of mental tests with each other

and with cognitive variables are highest in low IQ groups. Intelligence, 13(4),

349–360. doi:10.1016/s0160-2896(89)80007-8

Dickens, W. T. (2007). What is g? Washington, DC: Brookings Institution.

Dodrill, C. B. (1981). An economical method for the evaluation of general intelligence

in adults. Journal of Consulting and Clinical Psychology, 49(5), 668–673.

doi:10.1037/0022-006x.49.5.668

Dodrill, C. B. (1983). Long-term reliability of the Wonderlic Personnel Test. Journal of

Consulting and Clinical Psychology, 51(2), 316–317. doi:10.1037/0022-

006x.51.2.316

Dodrill, C. B., & Warner, M. H. (1988). Further studies of the Wonderlic Personnel

Test as a brief measure of intelligence. Journal of Consulting and Clinical

Psychology, 56(1), 145–147. doi:10.1037/0022-006x.56.1.145

Doll, E. A. (1917). A brief Binet-Simon scale. Psychological Clinic, 11, 197–211.

Donders, J. (1997). A short form of the WISC–III for clinical use. Psychological

Assessment, 9(1), 15–20. doi:10.1037/1040-3590.9.1.15

Donnell, A., Pliskin, N., Holdnack, J., Axelrod, B., & Randolph, C. (2007). Rapidly-

administered short forms of the Wechsler Adult Intelligence Scale – 3rd edition.

Archives of Clinical Neuropsychology, 22(8), 917–924.

doi:10.1016/j.acn.2007.06.007

Dumont, R., & Hagberg, C. (1994). Kaufman Adolescent and Adult Intelligence Test

(KAIT): Test Review. Journal of Psychoeducational Assessment, 12(2), 190–

196.


Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2007). On the consistency of individual

classification using short scales. Psychological Methods, 12(1), 105–120.

doi:10.1037/1082-989x.12.1.105

Ericsson, K. A., Charness, N., Feltovich, P. J., & Hoffman, R. R. (2006). The

Cambridge handbook of expertise and expert performance. Cambridge:

Cambridge University Press.

Evers, A., Lucassen, W., Meijer, R. R., & Sijtsma, K. (2009). COTAN

Beoordelingssysteem voor de kwaliteit van tests. Amsterdam: Nederlands

Instituut van Psychologen.

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating

the use of exploratory factor analysis in psychological research. Psychological

Methods, 4(3), 272–299. doi:10.1037/1082-989x.4.3.272

Field, A. (2009). Discovering statistics using SPSS. London, England: Sage

publications.

Flora, D. B., & Curran, P. J. (2004). An Empirical Evaluation of Alternative Methods of

Estimation for Confirmatory Factor Analysis With Ordinal Data. Psychological

Methods, 9(4), 466–491. doi:10.1037/1082-989x.9.4.466

Flynn, J. R. (2007). What is intelligence? Beyond the Flynn effect. New York, NY:

Cambridge University Press.

Galton, F. (1869). Hereditary genius: An inquiry into its laws and consequences.

London, Great Britain: Macmillan and Co. doi:10.1037/13474-000

Glutting, J., Adams, W., & Sheslow, D. (2000). Wide Range Intelligence Test.

Wilmington, DE: Wide Range.

Goldstein, H. W., Zedeck, S., & Goldstein, I. L. (2002). g: Is This Your Final Answer?

Human Performance, 15(1–2), 123–142. doi:10.1080/08959285.2002.9668087

Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.

Gottfredson, L. S. (1997a). Mainstream Science on Intelligence: An Editorial With 52

Signatories, History and Bibliography. Intelligence, 24(1), 13–23.

doi:10.1016/s0160-2896(97)90011-8

Gottfredson, L. S. (1997b). Why g matters: The complexity of everyday life.

Intelligence, 24(1), 79–132. doi:10.1016/s0160-2896(97)90014-3


Gottfredson, L. S. (2002). Where and Why g Matters: Not a Mystery. Human

Performance, 15(1), 25–46. doi:10.1207/s15327043hup1501&02_03

Gulliksen, H. (1950). The reliability of speeded tests. Psychometrika, 15(3), 259–269.

doi:10.1007/bf02289042

Hays, J., Reas, D., & Shaw, J. (2002). Concurrent validity of the Wechsler Abbreviated

Scale of Intelligence and the Kaufman Brief Intelligence Test among psychiatric

inpatients. Psychological Reports, 90, 355–359. doi:10.2466/pr0.90.2.355-359

Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure

in American life. New York: Free Press.

Hicks, K. L., Harrison, T. L., & Engle, R. W. (2015). Wonderlic, working memory

capacity, and fluid intelligence. Intelligence, 50, 186–195.

doi:10.1016/j.intell.2015.03.005

Highhouse, S. (2008). Stubborn Reliance on Intuition and Subjectivity in Employee

Selection. Industrial and Organizational Psychology, 1(3), 333–342.

doi:10.1111/j.1754-9434.2008.00058.x

Horn, J. L. (1965a). A rationale and test for the number of factors in factor analysis.

Psychometrika, 30(2), 179–185. doi:10.1007/bf02289447

Horn, J. L. (1965b). Fluid and crystallized intelligence: A factor analytic study of the

structure among primary mental abilities (Unpublished doctoral dissertation).

University of Illinois, IL.

Horn, J. L., & Cattell, R. B. (1966). Refinement and test of the theory of fluid and

crystallized intelligence. Journal of Educational Psychology, 57(5), 253–270.

doi:10.1037/h0023816

Horn, J. L., & Cattell, R. B. (1967). Age differences in fluid and crystallized

intelligence. Acta Psychologica, 26, 107–129. doi:10.1016/0001-

6918(67)90011-X

Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, Detection and

Amelioration of Adverse Impact in Personnel Selection Procedures: Issues,

Evidence and Lessons Learned. International Journal of Selection and

Assessment, 9(1–2), 152–194. doi:10.1111/1468-2389.00171


Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure

analysis: Conventional criteria versus new alternatives. Structural Equation

Modeling: A Multidisciplinary Journal, 6(1), 1–55.

doi:10.1080/10705519909540118

Hunt, E. (2011). Human intelligence. Cambridge, England: Cambridge University

Press. doi:10.1017/cbo9780511781308

Hunter, J. E., & Schmidt, F. L. (1996). Intelligence and job performance: Economic and

social implications. Psychology, Public Policy, and Law, 2(3–4), 447–472.

doi:10.1037/1076-8971.2.3-4.447

Jacoby, L. L., Toth, J. P., & Yonelinas, A. P. (1993). Separating conscious and

unconscious influences on memory: Measuring recollections. Journal of

Experimental Psychology: General, 122(2), 139–154.

Jaeggi, S. M., Studer-Luethi, B., Buschkuehl, M., Su, Y.-F., Jonides, J., & Perrig, W. J.

(2010). The relationship between n-back performance and matrix reasoning —

implications for training and transfer. Intelligence, 38(6), 625–635.

doi:10.1016/j.intell.2010.09.001

Jäger, A. O., & Althoff, K. (1983). Der WILDE-Intelligenz-Test (WIT): Ein

Strukturdiagnostikum: Handanweisung. Göttingen: Hogrefe.

Jenkins, J., & Leung, C. (2013). English as a Lingua Franca. The Companion to

Language Assessment, 1605–1616. doi:10.1002/9781118411360.wbcla047

Jenner, E. (1807). Classes of the Human Powers of Intellect. The Artist, 19, 1–7.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger

Publishers/Greenwood Publishing Group.

Jeyakumar, S. L. E., Warriner, E. M., Raval, V. V., & Ahmad, S. A. (2004). Balancing

the Need for Reliability and Time Efficiency: Short Forms of the Wechsler

Adult Intelligence Scale-III. Educational and Psychological Measurement,

64(1), 71–87. doi:10.1177/0013164403258407

Jex, S. M., & Britt, T. W. (2008). Introduction to Organizational Psychology. In S. M.

Jex & T. W. Britt (Eds.), Organizational psychology: A scientist–practitioner

approach (pp. 1–20). Hoboken, NJ: Wiley.


Johnson, W., & Bouchard, T. J., Jr. (2005a). The structure of human intelligence: It is

verbal, perceptual, and image rotation (VPR), not fluid and crystallized.

Intelligence, 33(4), 393–416. doi:10.1016/j.intell.2004.12.002

Johnson, W., & Bouchard, T. J., Jr. (2005b). Constructive replication of the visual

perceptual image-rotation model in Thurstone’s (1941) battery of 60 tests of

mental ability. Intelligence, 33(4), 417–430. doi:10.1016/j.intell.2004.12.001

Johnson, W., & Bouchard, T. J. (2007). Sex differences in mental abilities: g masks the

dimensions on which they lie. Intelligence, 35(1), 23–39.

doi:10.1016/j.intell.2006.03.012

Johnson, W., Bouchard, T. J., Jr., McGue, M., Segal, N. L., Tellegen, A., Keyes, M., &

Gottesman, I. I. (2007). Genetic and environmental influences on the Verbal-

Perceptual-Image Rotation (VPR) model of the structure of mental abilities in

the Minnesota study of twins reared apart. Intelligence, 35(6), 542–562.

doi:10.1016/j.intell.2006.10.003

Johnson, W., te Nijenhuis, J., & Bouchard, T. J., Jr. (2007). Replication of the

hierarchical visual-perception-rotation model in De Wolff and Buiten’s (1963)

battery of 46 tests of mental ability. Intelligence, 35(3), 69–81.

doi:10.1016/j.intell.2006.05.002

Kaiser, H. F. (1960). The Application of Electronic Computers to Factor Analysis.

Educational and Psychological Measurement, 20(1), 141–151.

doi:10.1177/001316446002000116

Kaufman, A. S. (1990). Assessing adolescent and adult intelligence. Needham Heights,

MA: Allyn & Bacon.

Kaufman, A. S., & Kaufman, N. L. (1990). Manual for the Kaufman Brief Intelligence

Test. Circle Pines, MN: American Guidance Service.

Kaufman, A. S., & Kaufman, N. L. (1993). Manual: Kaufman Adolescent and Adult

Intelligence Test. Circle Pines, MN: American Guidance Service.

Kersting, M., Althoff, K., & Jäger, A. O. (2008). WIT-2. Der Wilde-Intelligenztest.

Verfahrenshinweise. Göttingen: Hogrefe.

Klimoski, R., & Jones, R. G. (2008). Intuiting in the selection context. Industrial and

Organizational Psychology, 1(3), 352–354. doi:10.1111/j.1754-

9434.2008.00061.x


Kline, P. (2000). Handbook of psychological testing. London: Routledge.

Kline, P. (2002). An Easy Guide to Factor Analysis. London: Routledge.

Kline, R. B. (2005). Principles and practice of structural equation modeling. New

York: Guilford Press.

Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2013). On the Shortcomings of

Shortened Tests: A Literature Review. International Journal of Testing, 13(3),

223–248. doi:10.1080/15305058.2012.703734

Lang, J. W. B., Kersting, M., & Beauducel, A. (2016). Hierarchies of factor solutions in

the intelligence domain: Applying methodology from personality psychology to

gain insights into the nature of intelligence. Learning and Individual

Differences, 47, 37–50. doi:10.1016/j.lindif.2015.12.003

Lang, J. W. B., Kersting, M., Hülsheger, U. R., & Lang, J. (2010). General mental

ability, narrower cognitive abilities, and job performance: The perspective of the

nested-factors model of cognitive abilities. Personnel Psychology, 63, 595–640.

doi:10.1111/j.1744-6570.2010.01182.x

Legree, P. J., Pifer, M. E., & Grafton, F. C. (1996). Correlations among cognitive

abilities are lower for higher ability groups. Intelligence, 23(1), 45–57.

doi:10.1016/s0160-2896(96)80005-5

Levy, P. (1968). Short-form tests: A methodological review. Psychological Bulletin, 69,

410–416. doi:10.1037/h0025736

Lautenschlager, G. J. (1989). A Comparison of Alternatives to Conducting Monte Carlo

Analyses for Determining Parallel Analysis Criteria. Multivariate Behavioral

Research, 24(3), 365–395. doi:10.1207/s15327906mbr2403_6

Lievens, F., & Reeve, C. L. (2012). Where I-O Psychology Should Really (Re)start Its

Investigation of Intelligence Constructs and Their Measurement. Industrial and

Organizational Psychology, 5(2), 153–158. doi:10.1111/j.1754-

9434.2012.01421.x

Lubinski, D. (2000). Scientific and social significance of assessing individual

differences: “Sinking shafts at a few critical points.” Annual Review of

Psychology, 51(1), 405–444. doi:10.1146/annurev.psych.51.1.405


MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and

determination of sample size for covariance structure modeling. Psychological

Methods, 1(2), 130–149. doi:10.1037/1082-989x.1.2.130

MacCann, C., & Roberts, R. D. (2008). New paradigms for assessing emotional

intelligence: Theory and data. Emotion, 8(4), 540–551. doi:10.1037/a0012746

Mackey, A. P., Hill, S. S., Stone, S. I., & Bunge, S. A. (2010). Differential effects of

reasoning and speed training in children. Developmental Science, 14(3), 582–

590. doi:10.1111/j.1467-7687.2010.01005.x

Mayer, J. D., Caruso, D. R., & Salovey, P. (1999). Emotional Intelligence Meets

Traditional Standards for an Intelligence. Intelligence, 27(4), 267–298.

doi:10.1016/S0160-2896(99)00016-1

Mayer, J. D., Salovey, P., & Caruso, D. R. (2002). Mayer-Salovey-Caruso Emotional

Intelligence Test. Toronto, Canada: Multi-Health Systems.

Mayer, J. D., Roberts, R. D., & Barsade, S. G. (2008). Human Abilities: Emotional

Intelligence. Annual Review of Psychology, 59(1), 507–536.

doi:10.1146/annurev.psych.59.103006.093646

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.

McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing

on the shoulders of the giants of psychometric intelligence research. Intelligence,

37(1), 1–10. doi:10.1016/j.intell.2008.08.004

Miles, A., & Sadler-Smith, E. (2014). “With recruitment I always feel I need to listen to

my gut”: the role of intuition in employee selection. Personnel Review, 43(4),

606–627. doi:10.1108/pr-04-2013-0065

Mulder, J. L., Dekker, R., & Dekker, P. H. (2004). Kaufman – Intelligentietest voor

Adolescenten en Volwassenen: Nederlandstalige bewerking van de Kaufman

Adolescent and Adult Intelligence Test: Handleiding. Leiden: PITS

Testuitgeverij.

Murphy, K. (1999). The challenge of staffing a post-industrial workplace. In D. Ilgen, &

E. Pulakos (Eds.), The changing nature of work performance: Implications for

staffing, personnel actions, and development (pp. 295–324). San Francisco:

Jossey-Bass.


Murphy, K. (2010). How a broader definition of the criterion domain changes our

thinking about adverse impact. In J. Outtz (Ed.), Adverse impact (pp. 137–160).

San Francisco: Jossey-Bass.

Murphy, K. R., Cronin, B. E., & Tam, A. P. (2003). Controversy and consensus

regarding the use of cognitive ability testing in organizations. Journal of Applied

Psychology, 88(4), 660–671. doi:10.1037/0021-9010.88.4.660

Naglieri, J. A. (1985). Matrix Analogies Test – Short Form: Examiner’s Manual. San

Antonio, TX: Psychological Corporation.

Neisser, U. (1979). The Concept of Intelligence. Intelligence, 3(3), 217–227.

doi:10.1016/0160-2896(79)90018-7

Neisser, U., Boodoo, G., Bouchard, T. J., Jr., Boykin, A. W., Brody, N., Ceci, S. J.,

Halpern, D. F., Loehlin, J. C., Perloff, R., Sternberg, R. J., & Urbina, S. (1996).

Intelligence: Knowns and Unknowns. American Psychologist, 51(2), 77–101.

doi:10.1037/0003-066x.51.2.77

Nisbett, R. E., Aronson, J., Blair, C., Dickens, W., Flynn, J., Halpern, D. F., &

Turkheimer, E. (2012). Intelligence: New Findings and Theoretical

Developments. American Psychologist, 67(2), 130–159. doi:10.1037/a0026699

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.).

New York: McGraw-Hill.

Partchev, I., De Boeck, P., & Steyer, R. (2011). How Much Power and Speed Is

Measured in This Test? Assessment, 20(2), 242–252.

doi:10.1177/1073191111411658

Patil, V. H., Singh, S. N., Mishra, S., & Todd Donavan, D. (2007). Parallel Analysis

Engine to Aid Determining Number of Factors to Retain [Computer software].

Retrieved from http://smishra.faculty.ku.edu/parallelengine.htm

Pearson. (2012). White paper WAIS-IV-NL: Psychometrische eigenschappen: Deel 2

van 3. Amsterdam, The Netherlands: Pearson.

Ployhart, R. E., & Holtz, B. C. (2008). The diversity–validity dilemma: Strategies for

reducing racioethnic and sex subgroup differences and adverse impact in

selection. Personnel Psychology, 61(1), 153–172. doi:10.1111/j.1744-

6570.2008.00109.x


Psychological Corporation. (1999). Wechsler Abbreviated Scale of Intelligence. San

Antonio, TX: Pearson.

Prewett, P. N. (1995). A comparison of two screening tests (the Matrix Analogies Test –

Short Form and the Kaufman Brief Intelligence Test) with the WISC-III.

Psychological Assessment, 7(1), 69–72. doi:10.1037/1040-3590.7.1.69

R Core Team. (2015). R: A language and environment for statistical computing. Vienna,

Austria: R Foundation for Statistical Computing.

Raven, J. C. (1938). Progressive Matrices: A perceptual test of intelligence. London:

H.K. Lewis.

Raven, J., & Raven, J. (Eds.). (2008). Uses and abuses of intelligence: Studies advancing

Spearman and Raven’s quest for non-arbitrary metrics. New York: Royal

Fireworks Press.

Raykov, T., Dimitrov, D. M., & Asparouhov, T. (2010). Evaluation of Scale Reliability

With Binary Measures Using Latent Variable Modeling. Structural Equation

Modeling: A Multidisciplinary Journal, 17(2), 265–279.

doi:10.1080/10705511003659417

Reeve, C. L. (2004). Differential ability antecedents of general and specific dimensions

of declarative knowledge: More than g. Intelligence, 32(6), 621–652.

doi:10.1016/j.intell.2004.07.006

Reeve, C. L., & Bonaccio, S. (2011). The nature and structure of “intelligence.” In T.

Chamorro-Premuzic, A. Furnham, & S. von Stumm (Eds.), Handbook of

Individual Differences (pp. 187–216). Oxford, England: Wiley-Blackwell.

Reeve, C. L., Scherbaum, C., & Goldstein, H. (2015). Manifestations of intelligence:

Expanding the measurement space to reconsider specific cognitive abilities.

Human Resource Management Review, 25(1), 28–37.

doi:10.1016/j.hrmr.2014.09.005

Resnick, L. B. (1976). The Nature of Intelligence. Hillsdale, N.J.: Lawrence Erlbaum

Associates.

Rindler, S. E. (1979). Pitfalls in assessing test speededness. Journal of Educational

Measurement, 16(4), 261–270. doi:10.1111/j.1745-3984.1979.tb00107.x

Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of

Statistical Software, 48(2). doi:10.18637/jss.v048.i02


Rust, J., & Golombok, S. (2014). Modern psychometrics: The science of psychological

assessment. New York: Routledge.

Rynes, S. L., Colbert, A. E., & Brown, K. G. (2002). HR professionals’ beliefs about

effective human resource practices: Correspondence between research and

practice. Human Resources Management, 41, 149–174.

Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing

in employment, credentialing, and higher education: Prospects in a post-

affirmative-action world. American Psychologist, 56(4), 302–318.

doi:10.1037/0003-066x.56.4.302

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score

adjustment in preemployment testing. American Psychologist, 49(11), 929–954.

doi:10.1037/0003-066x.49.11.929

Salthouse, T. A. (1996). The processing-speed theory of adult age differences in

cognition. Psychological Review, 103(3), 403–428.

doi:10.1037/0033-295X.103.3.403

Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for

moment structure analysis. Psychometrika, 66(4), 507–514.

doi:10.1007/bf02296192

Scherbaum, C. A., & Goldstein, H. W. (2015). Intelligence and the modern world of

work. Human Resource Management Review, 25(1), 1–3.

doi:10.1016/j.hrmr.2014.09.002

Scherbaum, C. A., Goldstein, H. W., Yusko, K. P., Ryan, R., & Hanges, P. J. (2012).

Intelligence 2.0: Reestablishing a Research Program on g in I-O Psychology.

Industrial and Organizational Psychology, 5(2), 128–148. doi:10.1111/j.1754-

9434.2012.01419.x

Schipolowski, S., Schroeders, U., & Wilhelm, O. (2014). Pitfalls and Challenges in

Constructing Short Forms of Cognitive Ability Measures. Journal of Individual

Differences, 35(4), 190–200. doi:10.1027/1614-0001/a000134

Schlegel, K., Grandjean, D., & Scherer, K. R. (2014). Introducing the Geneva Emotion

Recognition Test: An example of Rasch-based test development. Psychological

Assessment, 26(2), 666–672. doi:10.1037/a0035246


Schmidt, F. L. (2002). The Role of General Cognitive Ability and Job Performance:

Why There Cannot Be a Debate. Human Performance, 15(1), 187–210.

doi:10.1207/s15327043hup1501&02_12

Schmidt, F. L., & Hunter, J. E. (1992). Development of a Causal Model of Processes

Determining Job Performance. Current Directions in Psychological Science,

1(3), 89–92. doi:10.1111/1467-8721.ep10768758

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in

personnel psychology: Practical and theoretical implications of 85 years of

research findings. Psychological Bulletin, 124(2), 262–274. doi:10.1037/0033-

2909.124.2.262

Schmidt, F. L., & Hunter, J. (2004). General Mental Ability in the World of Work:

Occupational Attainment and Job Performance. Journal of Personality and

Social Psychology, 86(1), 162–173. doi:10.1037/0022-3514.86.1.162

Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and

ability on job knowledge, work sample performance, and supervisory ratings of

job performance. Journal of Applied Psychology, 71(3), 432–439.

doi:10.1037/0021-9010.71.3.432

Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment,

8(4), 350–353. doi:10.1037/1040-3590.8.4.350

Schneider, W. J., & Newman, D. A. (2015). Intelligence is multidimensional:

Theoretical review and implications of specific cognitive abilities. Human

Resource Management Review, 25(1), 12–27. doi:10.1016/j.hrmr.2014.09.004

Shepard, L. A. (1993). Evaluating Test Validity. Review of Research in Education, 19,

405. doi:10.2307/1167347

Sheppard, L. D., & Vernon, P. A. (2008). Intelligence and speed of information-

processing: A review of 50 years of research. Personality and Individual

Differences, 44(3), 535–551. doi:10.1016/j.paid.2007.09.015

Shevlin, M., Miles, J. N. V., Davies, M. N. O., & Walker, S. (2000). Coefficient alpha:

a useful indicator of reliability? Personality and Individual Differences, 28(2),

229–237. doi:10.1016/s0191-8869(99)00093-8


Sijtsma, K. (2009). Correcting Fallacies in Validity, Reliability, and Classification.

International Journal of Testing, 9(3), 167–194.

doi:10.1080/15305050903106883

Smith, G. T., Combs, J. L., & Pearson, C. M. (2012). Brief instruments and short forms.

In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J.

Sher (Eds.), APA handbook of research methods in psychology, Vol. 1 (pp. 395–

409). Washington, DC: American Psychological Association. doi:

10.1037/13619-021

Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the Sins of Short-Form

Development. Psychological Assessment, 12(1), 102–111. doi:10.1037//1040-

3590.12.1.102

Spearman, C. (1904). “General Intelligence,” Objectively Determined and Measured.

The American Journal of Psychology, 15(2), 201. doi:10.2307/1412107

Spearman, C. (1910). Correlation calculated with faulty data. British Journal of

Psychology, 1904-1920, 3(3), 271–295. doi:10.1111/j.2044-

8295.1910.tb00206.x

Spearman, C. (1927). The abilities of man. London: Macmillan.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Upper

Saddle River, NJ: Pearson Allyn & Bacon.

Tate, R. F. (1954). Correlation Between a Discrete and a Continuous Variable. Point-

Biserial Correlation. The Annals of Mathematical Statistics, 25(3), 603–607.

doi:10.1214/aoms/1177728730

Terpstra, D. E. (1996). The search for effective methods. HR Focus, 73, 16–17.

Thorndike, E. L., Bregman, E. O., Cobb, M. V., & Woodyard, E. (1926). The Relations

of Altitude to Width, Area, and Speed. In E. L. Thorndike, E. O. Bregman, M.

V. Cobb, & E. Woodyard (Eds.), The Measurement of Intelligence (pp. 388–

401). Columbia: Bureau of Publications Teachers College Columbia University.

doi:10.1037/11240-013

Thorndike, E. L., Terman, L. M., Colvin, S. S., Freeman, F. N., Pintner, R., Ruml, B., &

Pressey, S. L. (1921). Intelligence and its measurement: A symposium. Journal

of Educational Psychology, 12(3), 123–147. doi:10.1037/h0076078


Thurstone, L. L. (1938). Primary mental abilities. Chicago: University of Chicago

Press.

Thurstone, L. L. (1947). Multiple factor analysis: A development and expansion of

vectors of the mind. Chicago: University of Chicago.

Thurstone, L. L., & Thurstone, T. G. (1941). Factorial studies of intelligence.

Psychometric monographs, 94(2).

Université De Genève. (2014). Measurement of emotion recognition ability and

emotional understanding [PDF]. Retrieved from http://www.autumnschool-

emo-int.ugent.be

Universiteit Gent. (2014). Course Specifications: Psychodiagnostics II (H001330)

[PDF]. Retrieved from

http://studiegids.ugent.be/2014/EN/studiefiches/H001330.pdf

Van der Linden, W. J. (2011a). Setting Time Limits on Tests. Applied Psychological

Measurement, 35(3), 183–199. doi:10.1177/0146621610391648

Van der Linden, W. J. (2011b). Test Design and Speededness. Journal of Educational

Measurement, 48(1), 44–60. doi:10.1111/j.1745-3984.2010.00130.x

Van Der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M.,

& Raijmakers, M. E. (2006). A dynamical model of general intelligence: the

positive manifold of intelligence by mutualism. Psychological Review, 113(4),

842–861. doi:10.1037/0033-295x.113.4.842

Van Rooy, D. L., Viswesvaran, C., & Pluta, P. (2005). An Evaluation of Construct

Validity: What Is This Thing Called Emotional Intelligence? Human

Performance, 18(4), 445–462. doi:10.1207/s15327043hup1804_9

Ward, L. C., & Ryan, J. J. (1996). Validity and time savings in the selection of short

forms of the Wechsler Adult Intelligence Scale – Revised. Psychological

Assessment, 8(1), 69–72. doi:10.1037/1040-3590.8.1.69

Wechsler, D. (1949). Wechsler Intelligence Scale for Children. San Antonio, TX: The

Psychological Corporation.

Wechsler, D. (1955). Manual for the Wechsler Adult Intelligence Scale. Oxford,

England: Psychological Corporation.

Wechsler, D. (1967). Manual for the Wechsler Preschool and Primary Scale of

Intelligence. New York: Psychological Corporation.


Wechsler, D. (1975). Intelligence Defined and Undefined: A Relativistic Appraisal.

American Psychologist, 30(2), 135–139. doi:10.1037/h0076868

Wechsler, D. (2008). Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV). San

Antonio, TX: NCS Pearson.

Williams, P. E., Weiss, L. G., & Rolfhus, E. L. (2003). WISC-IV Technical Report #2:

Psychometric Properties. San Antonio, TX: The Psychological Corporation.

Wonderlic, E. F. (1973). Wonderlic Personnel Test: Manual. Los Angeles: Western

Psychological Services.

Wonderlic, E. F., & Wonderlic, C. F. (1992). Wonderlic Personnel Test user’s manual.

Libertyville, IL: Wonderlic Personnel Test, Inc.

Wymer, J. (2003). Utility of a clinically derived abbreviated form of the WAIS-III.

Archives of Clinical Neuropsychology, 18(8), 917–927.

doi:10.1016/s0887-6177(02)00221-4

Ziegler, M., Kemper, C. J., & Kruyen, P. (2014). Short Scales – Five

Misunderstandings and Ways to Overcome Them. Journal of Individual

Differences, 35(4), 185–189. doi:10.1027/1614-0001/a000148

Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the

number of components to retain. Psychological Bulletin, 99(3), 432–442.

doi:10.1037/0033-2909.99.3.432
