Short form of the Wilde Intelligenztest: Psychometric qualities and … · 2016-07-28 ·...
Transcript of Short form of the Wilde Intelligenztest: Psychometric qualities and … · 2016-07-28 ·...
Academiejaar 2015 – 2016
Examenperiode tweede semester
Short form of the Wilde Intelligenztest:
Psychometric qualities and utility of a 12-minute intelligence
test for personnel selection
Masterproef II neergelegd tot het behalen van de graad van Master of Science in de
Psychologie, afstudeerrichting Bedrijfspsychologie en Personeelsbeleid
door
Louis Lippens
Promotor: Prof. dr. Filip Lievens
Begeleider: Drs. Jan Corstjens
Ondergetekende, Louis Lippens, geeft toelating tot het raadplegen van de scriptie door
derden.
Louis Lippens
I
VOORWOORD
Vele medestudenten zullen instemmend knikken wanneer ik stel dat een hoop werk in
dit sluitstuk kruipt. Persoonlijk kan ik met genoegen terugkijken op het eindresultaat.
Het was een proces van leren en persoonlijke ontwikkeling. Enerzijds heb ik meer
inzicht gekregen in het concept ‘intelligentie’, haar psychometrische fundamenten en de
inzetbaarheid ervan in de praktijk. Anderzijds ben ik de uitdaging aangegaan om mijn
masterproef in het Engels te schrijven, wat mijn taalvaardigheid sterk heeft
aangescherpt.
Deze masterproef is tot stand gekomen met de hulp van velen. Eerst en vooral hebben
mijn promotor, professor Filip Lievens, en thesisbegeleider, Jan Corstjens, het mogelijk
gemaakt om dit boeiende onderwerp aan te snijden. Verder hebben meer dan
driehonderd bachelorstudenten, in het kader van hun curriculum en onder begeleiding
van verschillende doctoraatsstudenten, ervoor gezorgd dat ik bruikbare data voor
handen had om het onderzoek te voeren. Waarvoor grote dank.
Daarnaast wil ik graag mijn ouders bedanken voor hun steun tijdens mijn vijfjarig
traject. Zij stonden steeds klaar om mij bij te staan en vooruit te stuwen, zowel voor,
tijdens als na examenperiodes. Ook mijn vriendin, Anna, is een heuse steun geweest
doorheen de jaren. Beiden hebben we de opleiding psychologie succesvol aangevat en
voltooid. Zij heeft me in het bijzonder geholpen met de laatste pennentrekken.
Tot slot hoop ik ten volle dat al wie dit werk leest en interesse heeft in het subject van
deze masterproef, intelligentie of verkorte testvormen nuttige informatie kan putten uit
de inhoud.
II
ABSTRACT
Since the early years of test development, researchers have explored numerous ways to
shorten tests – a trend that has recently been revived (Schipolowski, Schroeders, &
Wilhelm, 2014). Yet, many of these abbreviated forms are insufficiently researched and
validated (Smith, Combs, & Pearson, 2012). The current study is an inquiry into the
psychometric qualities of the WIT-S, a short form of the Wilde Intelligenztest (Jäger &
Althoff, 1983) for personnel selection, where time-efficiency and the ability to predict
an applicant’s future job performance are two crucial elements (Klimoski & Jones,
2008; Schmidt & Hunter, 2004). 318 Dutch-speaking participants between 18 and 59
years took part in the research and completed a large-scale test battery. In accordance to
the literature, male, younger and higher educated test takers averagely achieve better
scores on the WIT-S. Moreover, the items of the WIT-S are sufficiently interrelated as
the composite scale reaches satisfactory reliability levels. The convergence of the
WIT-S with a full-length measure of general intelligence and divergence with an
emotion recognition test furthermore suggest that the short form measures a general
factor of intelligence. However, the current research is not able to show that each item
of the WIT-S attains an adequate difficulty and discriminatory level, that the WIT-S
preserves the factor structure of a multifactorial measure and that the subscales of the
WIT-S reach adequate levels of reliability. Future research on the reliability and latent
factor structure of the WIT-S should be able to provide more conclusive findings about
its psychometric qualities and hence usefulness in practice.
Key words: intelligence, short form, reliability, validity, personnel selection
III
SAMENVATTING
Sinds het prille begin van testontwikkeling hebben onderzoekers verschillende manieren
onderzocht om testen in te korten – een trend die recentelijk weer is opgeleefd
(Schipolowski, Schroeders, & Wilhelm, 2014). Echter zijn veel van deze verkorte
testvormen onvoldoende onderzocht en gevalideerd (Smith, Combs, & Pearson, 2012).
Het huidig onderzoek exploreert de psychometrische kwaliteiten van de WIT-S, een
verkorte vorm van de Wilde Intelligenztest (Jäger & Althoff, 1983) voor
personeelsselectie, waar tijdsefficiëntie en het vermogen om de toekomstige prestatie
van een sollicitant te kunnen voorspellen twee cruciale elementen zijn (Klimoski &
Jones, 2008; Schmidt & Hunter, 2004). 318 Nederlandstalige participanten tussen 18 en
59 jaar namen deel aan het onderzoek en legden hierbij een grootschalige testbatterij af.
In overeenstemming met de literatuur behalen mannelijke, jongere en hoger opgeleide
participanten gemiddeld gezien betere scores op de WIT-S. Tevens staan de items van
de WIT-S voldoende met elkaar in verband daar de samengestelde testschaal adequate
betrouwbaarheidsniveaus bereikt. Verder convergeert de WIT-S met een uitgebreide
intelligentietest en divergeert de WIT-S met een emotie-herkenningstest, wat suggereert
dat de verkorte test een algemene intelligentiefactor meet. Daarentegen kan in het
huidig onderzoek niet worden aangetoond dat elk testitem een gepast moeilijkheids- en
onderscheidingsniveau bereikt, dat de WIT-S de factorstructuur van een multifactorieel
instrument behoudt en dat de subschalen van de WIT-S adequate
betrouwbaarheidsniveaus behalen. Toekomstig onderzoek naar de betrouwbaarheid en
latente factorstructuur van de WIT-S zou meer determinerende bevindingen over de
psychometrische kwaliteiten van de test moeten kunnen verschaffen en bijgevolg over
de inzetbaarheid van de WIT-S in de praktijk.
IV
TABLE OF CONTENTS
VOORWOORD..................................................................................................... I
ABSTRACT........................................................................................................... II
SAMENVATTING................................................................................................ III
INTRODUCTION................................................................................................. 2
The concept of intelligence.............................................................................. 2
Leading psychometric theories of intelligence................................................. 3
General intelligence (g)............................................................................. 3
Primary Mental Abilities........................................................................... 4
The three-stratum theory: Cattell-Horn-Carroll........................................ 4
The g-VPR theory..................................................................................... 5
Psychometric fundamentals of the current research......................................... 5
Modified Model of Primary Mental Abilities........................................... 5
MMPMA as an alternative........................................................................ 7
Measurements of intelligence........................................................................... 8
Major long-form intelligence tests............................................................ 8
Raven Progressive Matrices...................................................................... 9
Short-form measurements of intelligence................................................. 10
Clinical short-form measurements..................................................... 11
Baddeley’s and Wonderlic’s short form............................................. 12
Short form of the Wilde Intelligenztest.............................................. 12
Intelligence testing in personnel selection........................................................ 14
The importance of intelligence.................................................................. 14
Intuition in recruitment and selection........................................................ 15
Hypotheses....................................................................................................... 17
Test score differentiation........................................................................... 17
Items and their latent factors..................................................................... 19
Reliability and construct validity.............................................................. 20
METHOD............................................................................................................... 22
Sample.............................................................................................................. 22
Measures........................................................................................................... 22
Short form of the Wilde Intelligenztest..................................................... 22
V
Geneva Emotion Recognition Test............................................................ 23
Kaufman Adolescent and Adult Intelligence Test..................................... 24
Data analysis..................................................................................................... 24
Item analysis.............................................................................................. 25
Factor analysis........................................................................................... 25
Reliability analysis.................................................................................... 27
Construct validity...................................................................................... 28
RESULTS............................................................................................................... 29
Descriptive statistics......................................................................................... 29
Analysis of the WIT-S items............................................................................ 32
Factor structure of the WIT-S........................................................................... 35
Exploratory factor analysis........................................................................ 35
Confirmatory factor analysis..................................................................... 39
Reliability of the WIT-S................................................................................... 43
Construct validity of the WIT-S....................................................................... 44
DISCUSSION......................................................................................................... 46
General findings and implications.................................................................... 46
Test score differentiation........................................................................... 46
Items and their latent factors..................................................................... 47
Reliability and construct validity.............................................................. 48
Limitations........................................................................................................ 49
Suggestions for future research........................................................................ 50
Conclusion........................................................................................................ 51
REFERENCES...................................................................................................... 53
Running head: PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 1
In recent years, a trend has emerged of developing short form measures,
including abbreviated intelligence tests, for both research purposes and practical
applications (Schipolowski, Schroeders, & Wilhelm, 2014; Smith, McCarthy, &
Anderson, 2000). The rationale for this trend is threefold. Firstly, the administration of
full-length test batteries in large-scale assessments is undoubtedly costly. Secondly,
measuring intelligence in a selection context is highly valuable as it is one of the best
predictors of job performance and job training outcomes (Schmidt & Hunter, 1998).
Thirdly, time is a critical element, especially in personnel selection (Klimoski & Jones,
2008). To illustrate the latter, imagine a company that wants to hire the best candidate
out of 30 applicants. Estimating every applicant’s IQ by means of a well-constructed
and validated intelligence test, such as the Kaufman Adolescent and Adult Intelligent
Test (Kaufman, 1990), will require more than 30 hours to prepare, administer and score.
Consequently, there is insufficient time to justify the administration of long-form
measures and the recruiters of the company fall back on the convenient unstructured job
interview, or even worse, intuition (Highhouse, 2008).
However, the number of research on short-form intelligence tests is easily
outnumbered by research on shortened versions of, for instance, personality measures
(Kruyen, Emons, & Sijtsma, 2013). Moreover, a greater amount of short-form
intelligence tests is validated and used in a clinical context than in organizational
settings (see i.a. Donnell, Pliskin, Holdnack, Axelrod, & Randolph, 2007; Jeyakumar,
Warriner, Raval, & Ahmad, 2004; Wymer, 2003). Therefore, the need arises to explore
novel, shortened versions of intelligence measures. Of course, this can not be done
overnight: a substantial amount of in-depth psychometric research should be conducted
on the validity and reliability of any short-form measure, especially when the measure
will be used to make selection decisions on the individual level (Nunnally & Bernstein,
1994; Ziegler, Kemper, & Kruyen, 2014). Contrarily, past research has shown that, in
many cases, insufficient validation has been carried out when short forms were
developed (Smith et al., 2000; Smith, Combs, & Pearson, 2012).
The current research is an inquiry into the psychometric qualities of a short form
of the Wilde Intelligenztest (WIT-S) for personnel selection. It should provide us with
an answer to the question whether the WIT-S is able to function as a valid and reliable
alternative to the time-consuming, full-scale intelligence tests.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 2
INTRODUCTION
The concept of intelligence
“For though all men are, as we trust and believe, capable of the divine faculty of
reason […] yet it is not to all that the heavenly beam is disclosed in its splendour”
(Jenner, 1807, p. 1). People have tried to comprehend and define the concept of
intelligence numerous times and in many different ways. More than a century ago,
academics were already struggling to understand what intelligence exactly comprised.
Galton (1869) was convinced that sensory discrimination, e.g. not being able to
differentiate between heat and cold, underlay the construct of intelligence. Spearman
(1904), on the other hand, much more emphasized educational achievements.
Individuals who made the most of their education were, in his vision, seen as more
intelligent. In 1921, seventeen leading investigators on the topic of intelligence
exchanged their views on the construct and what they conceived it to be, but did not
agree on a general and comprehensive conceptualization of intelligence (Thorndike et
al., 1921). One of the first definitions was proposed by Boring (1923), two years later,
during a debate on the true meaning of intelligence. He defined it as “the capacity to do
well in an intelligence test” because, after all, “intelligence is what the tests test”
(Boring, 1923, p. 35). This definition, however, is circular and does not at all explain
what intelligence is assumed to be exactly.
Decades later, Wechsler (1975) contributed to a renewed debate and defined
intelligence as “the capacity of an individual to understand the world about him and his
resourcefulness to cope with its challenges” (p. 139). In his article he clearly stressed
the complexity of the concept of intelligence and its stratification. Resnick (1976) also
revived the interest in the nature of intelligence during a symposium by asking
colleagues what intelligence actually is, which she also cited in her book ‘The Nature of
Intelligence’. Neisser (1979) later addressed Resnick’s (1976) question by posing that
intelligence could not be explicitly defined. According to him no definite criteria of
intelligence exist, and thus a process-based definition of intelligence does not either. In
1996, he and a few of his colleagues did however try to get a grasp of the
conceptualization of intelligence. In a response to the controversial publication ‘The
bell curve’ by Herrnstein and Murray (1994), Neisser et al. (1996) stated that by
conceptualizing intelligence, one “attempts to clarify the ability of individuals to
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 3
understand complex ideas, to adapt effectively to the environment, to learn from
experience, to engage in various forms of reasoning, and to overcome obstacles by
taking thought” (p. 77). Also in the aftermath of ‘The bell curve’, Gottfredson (1997a)
defined intelligence analogously as “a very general mental capability that, among other
things, involves the ability to reason, plan, solve problems, think abstractly,
comprehend complex ideas, learn quickly and learn from experience” (p. 13). To date,
this working definition is still used by renowned psychologists, for example in the
review by Nisbett et al. (2012) on the recent findings and developments in the field of
intelligence.
If we take a look at the definitions of Wechsler (1975), Neisser et al. (1996) and
Gottfredson (1997a) altogether, we see three elements consistently return. First of all,
the ability to understand the complexity of ideas. Secondly, the ability to adapt to our
environment and learn from it. And thirdly, the ability to overcome the obstacles,
problems and challenges one faces by reasoning and thought processing. This gives us a
general idea of the various elements and the operationalization of the construct.
Although looking for communalities in existing definitions can be valuable, the fact that
the definitions mentioned are working definitions and not universally accepted
definitions of intelligence by (industrial and organizational) psychologists is
problematic (Scherbaum, Goldstein, Yusko, Ryan, & Hanges, 2012). A coherent,
agreed-upon definition is needed to clearly understand the complex concept itself and
subsequently to be able to construct valid measurements of intelligence, yet it can not be
offered at present.
However, instead of losing ourselves in the discussion on how to operationally
conceptualize and define intelligence, we should rather look further into the
psychometric theories – the fundamentals that underlie modern concepts of intelligence.
The psychometric approach to intelligence is also debated by many researchers,
reflecting on whether it is ultimately the best approach, though it is well regarded and
widely accepted (Scherbaum et al., 2012). These fundamentals will furthermore enable
us to understand how intelligence is structured, measured and used in practice.
Leading psychometric theories of intelligence
General intelligence (g). The theory of general intelligence is one of the first
theories of intelligence (Spearman, 1904; Spearman, 1927). The g-model presumes that
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 4
the individual differences in cognitive performance are due to just one factor of
cognitive ability, namely ‘general intelligence’ or ‘g’. For the most part, the theory and
existence of g is supported by the well-documented fact that there are positive
correlations between almost all cognitive ability tests, which academics also call the
positive manifold (Gottfredson, 1997b; Jensen, 1998). The above led to the assumption
that there should be a general factor of intelligence underlying all of these tests.
Primary Mental Abilities. Not long after its nascence, Thurstone (1938)
contested the theory of general intelligence. Rather than assuming that there was only
one single dimension of intelligence, he demonstrated that there were several,
statistically independent primary abilities (Thurstone, 1938; Thurstone & Thurstone,
1941). These Primary Mental Abilities (PMA) are ‘Spatial Reasoning’, ‘Perceptual
Speed’, ‘Number Facility’, ‘Verbal Relations’, ‘Word Fluency’, ‘Memory’, and
‘Inductive Reasoning’. Individuals presumably differed in terms of the extent to which
they perform on these abilities. However, more recent research (Abad, Colom, Juan-
Espinosa, Garcia, 2003; Detterman & Daniel, 1989; Legree, Pifer, & Grafton, 1996)
indicated that these effects were more likely due to restriction of range, as the
participants in the original study by Thurstone (1938) were part of a high-ability group,
making it harder to find strong scientific evidence for a common g-factor. The cognitive
differentiation found in the high-ability group was probably partially the outcome of the
urge to specialize in current society; acquiring a high level of expert knowledge
consumes time and effort and catalyses differentiation of cognitive abilities (Ericsson,
Charness, Feltovich, & Hoffman, 2006).
The three-stratum theory: Cattell-Horn-Carroll. Another model and
psychometric substantiated theory, originally developed by Cattell (1963) and Horn
(1965a) as the Gf-Gc Model (Horn & Cattell, 1966), and later modified by Carroll
(1993), is the Cattell-Horn-Carroll (CHC) hierarchical model with three strata or levels.
The first stratum consists of narrowly defined abilities, e.g. ‘Lexical Retrieval Ability’
and ‘Numerical Facility’. The second stratum comprises broader abilities, including
fluid intelligence (Gf), the ability to handle novel and uncommon problems and issues,
and crystallized intelligence (Gc), the ability to handle common problems and issues by
bringing one’s acquired knowledge into practice. The third stratum consists of the
general intelligence factor g.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 5
The main scientific evidence in favour of the differentiation between Gc and Gf,
opposed to the unitary g-model, is the distinction between respectively consciously
controlled and automatic, unconsciously memory systems (Jacoby, Toth, & Yonelinas,
1993), and the difference in test performance dependent of age. The performance on
tests focussing on Gf rises in childhood and adolescence, but then starts declining with
age in adulthood and onward (Horn & Cattell, 1967; Salthouse 1996). This decline is
much faster than the performance on tests focussing on Gc, which keeps on increasing
with age until late adulthood before it stagnates and eventually declines (Ackerman,
1996; Blair, 2006; Horn & Cattell, 1967). Findings on the neurobiological level also
supports this differentiation as (a) severe damage to the prefrontal cortex (PFC) impairs
performance on Gf-based but not on Gc-based tests and (b) the PFC degenerates more
rapidly with age than does the rest of the cerebral cortex (Blair, 2006; Nisbett et al.,
2012). Finally, general gains in scores on Gf-based tests (e.g. Raven’s Progressive
Matrices; Raven, 1938) have substantially exceeded gains in Gc-based test in the past
decades (Flynn, 2007).
The g-VPR Theory. Johnson and Bouchard (2005a) recently suggested a four-
stratum model. The first stratum consists of primary cognitive ability traits. The second
stratum consists of narrowly defined abilities – similar to the first stratum of the CHC-
model. The third stratum consists of a verbal, perceptual and rotational ability. The
fourth and also last stratum consists of the general intelligence factor g – similar to the
third stratum of the CHC-model. The principal psychometric evidence in favour of the
g-VPR model is the superior fit of the model when it was compared with other
psychometric models of intelligence on the basis of several large-scale data sets
(Johnson et al., 2007; Johnson, Nijenhuis, & Bouchard, 2007), including the original
data used by Thurstone and Thurstone (1941) to support their model of primary mental
abilities (Johnson & Bouchard, 2005b).
Psychometric fundamentals of the current research
Modified Model of Primary Mental Abilities. The psychometric model
operationalized in the current study is the ‘Modifiziertes Model der Primary Mental
Abilities’ or ‘Modified Model of Primary Mental Abilities’ (MMPMA) (Kersting,
Althoff, & Jäger, 2008). The model is an updated version of the model of Primary
Mental Abilities (Thurstone, 1938), adapted to the current Zeitgeist and theoretical and
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 6
methodological advances. Figure 1 represents a visual overview of the model. The first
stratum consists of specific ability tests such as ‘Working Efficiency’ and ‘Economic
Knowledge’. The second stratum, the heart of the model, comprises the primary mental
abilities. The third stratum consists of the higher-order factors ‘fluid intelligence’ (Gf)
and ‘crystallized intelligence’ (Gc), analogue to the eponymous second-stratum factors
in the CHC-model (Carroll, 1993) and the higher-order factors in the original Gf-Gc
model (Horn & Cattell, 1966).
Figure 1. Modified Model of Primary Mental Abilities based on Thurstone’s Model of
Primary Mental Abilities (1938) – constructed by Kersting et al. (2008).
Kersting et al. (2008) have made three important assumptions regarding their
model. First of all, ‘Reasoning’ (in their model also referred to as deductive reasoning)
is regarded as an operationalization of the mental abilities ‘Verbal Comprehension’,
‘Numeric Reasoning’ and ‘Spatial Reasoning’, that presumably pertain the same content
domain. Secondly, the primary mental ability factors are considered to be part of a
hierarchical nested-factors model with higher-order (Gf and Gc) and lower-order
factors. Thirdly, fluid intelligence and working memory are hypothesized to be highly
overlapping. The third assumption is elucidated by the substantial amount of recent
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 7
research on working memory and its relation to intelligence. Seemingly, Gf can be
substantially improved by increasing one’s working memory capacity through training
(see i.a. Basak, Boot, Voss, & Kramer, 2008; Borella, Carretti, Riboldi, & De Beni,
2010; Jaeggi et al., 2010; Mackey, Hill, Stone, & Bunge, 2011). Yet, the question on
how Gf and working memory exactly relate to each other is complex and still partly
unsolved (Nisbett et al., 2012).
MMPMA as an alternative. The Modified Model of Primary Mental Abilities
builds on the fundamentals of two of the four leading psychometric models of
intelligence. Kersting et al. (2008) integrated the Gf-Gc distinction of the CHC-theory
and the differentiation in lower-order abilities of the PMA theory into one
comprehensive model. Herewith, the question arises why this particular integrated
model would be a better fit – in a personnel selection context – than one of the four
leading psychometric models of intelligence.
As previously mentioned, Gottfredson (1997b) and Jensen (1998) demonstrated
the existence of a positive manifold of correlations between intelligence and cognitive
ability tests, which is evidence in favour of the model of general intelligence
(Spearman, 1904; Spearman, 1927). Yet, Van Der Maas and colleagues (2006) argue in
their study that this does not automatically imply the existence of just one single factor
of intelligence. Even Jensen (1998) himself notes that a distinction can be made
between intelligence as a learning ability (read: Gf) and intelligence as acquired
knowledge from learning (read: Gc), which is supported by research from Reeve &
Bonaccio (2011). This is in line with the findings that the psychometric fundamentals of
intelligence are hierarchically organized and that intelligence consists of multiple
dimensions (Carroll, 1993; Lang, Kersting, & Beauducel, 2016, McGrew, 2009; Reeve,
2004).
These findings do not, however, detract from the existence of a general factor at
the top of the hierarchy (see i.a. Lang et al., 2016; Lubinski, 2000). Dickens (2007)
explained the appearance and role of g in a multidimensional factor structure (e.g. as
premised by Carroll, 1993). He argues that “people who are better at any given
cognitive skill are more likely to end up in environments that cause them to develop a
wide range of skills” (Nisbett et al., 2012, p. 21). Consequently, a person who possesses
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 8
a great amount of general intelligence will more likely develop and possess a substantial
amount of specific mental abilities and skills.
The current study focuses on intelligence in a personnel selection context (cfr.
infra). In this respect, Schneider, Joel, and Newman (2015) plead for a more
differentiated approach towards the concept of intelligence in the field of human
resources. They mention the following reasons for their plea: (a) specific abilities
predict a statistically significant larger amount of variance than does the unitary g-
model, (b) according to the ‘compatibility principle’ specific abilities predict specific
types of job performance, enabling organizations to take into account these
differentiations in hiring decisions, and (c) the disadvantage of minority groups in
intelligence testing (see i.a. Nisbett et al., 2012) could be compensated by the
implementation of specific cognitive abilities in the tests. Also, Goldstein, Zedeck, and
Goldstein (2002) have shown that in order to predict job performance on the basis of
intelligence it is important to include more narrowly defined, lower-order cognitive
abilities, because a single predictor like g falls short.
One could furthermore plead that the presumably statistical superior g-VPR
model should be used (Johnson & Bouchard, 2005a). However, Hunt (2011) rightly
notes in his overview of human intelligence that statistical superiority should not be the
only criterion taken into account when selecting a suiting psychometric model; also the
practical implications of the model used should be considered. For him, and I quote, “It
makes sense [in personnel classification] to distinguish between candidates who do not
know but can learn (low Gc, high Gf) and those who already have the requisite
knowledge (high Gc).” (Hunt, 2011, p. 110). This is a plea in favour of a more CHC-
oriented model and hence the Modified Model of Primary Mental Abilities as an
alternative.
Measurements of intelligence
Major long-form intelligence tests. Two of the most widely known – if not the
most popular – maximum performance, individually administered intelligence tests are
the Wechsler Intelligence Scale for Children (WISC) and the Wechsler Adult
Intelligence Scale (WAIS) (Wechsler, 1949; Wechsler, 1955). They are both in their
fourth edition and are mainly used in educational and clinical settings. According to
Hunt (2011), Wechsler initially developed his intelligence tests from a more pragmatic
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 9
point of view rather than basing them on specific psychometric theories. As reported by
Williams, Weiss and Rolfhus (2003) the WISC-IV (English version) has strong internal
psychometric qualities with scale-reliability estimates ranging from .88 to .94 and
convergent scale-validity coefficients with the WISC-III ranging from .72 to .89. The
WAIS-IV-NL (Dutch translation) has strong psychometric qualities as well: split-half
reliability estimates of the subtests ranged from .75 to .97, test-retest correlation with a
mean retest interval of 4 months and 12 days of the full scale ranged from .94 to .95 and
alpha coefficients estimates for the full scale across all age groups equalled .97
(Pearson, 2012; Wechsler, 2008). Furthermore, regarding the construct validity of the
test, the total IQ scores of the WAIS-IV-NL strongly correlated (r = .90) with its
predecessor’s scores (WAIS-III-NL; Pearson, 2012). On average, the WISC takes 75
minutes to administer, the WAIS takes about 90 minutes (Wechsler, 2008).
On the other hand, Kaufman (1990) based his Kaufman Adolescent and Adult
Intelligence Test (KAIT), another maximum performance individual intelligence test,
on the Gf-Gc distinction (Horn & Cattell, 1966; Kaufman & Kaufman, 1993). His test
of intellectual functioning is also widely known and used (Hunt, 2011). The fluid,
crystallized and total IQ scales of the Dutch version (KAIT-IV-NL) have average alpha
reliability estimates ranging from .92 to .95, while the test-retest reliabilities (with a
retest interval of 3 months and 10 days) of these scales ranged from .80 to .89 (Mulder,
Dekker, & Dekker, 2004). As for the construct validity, total IQ scores of the KAIT-IV-
NL strongly correlated (r = .80) with the WAIS-IV-NL scores (Pearson, 2012).
According to the original manual, the core battery of the KAIT takes 58 to 73 minutes
to administer; the extended battery takes 83 to 102 minutes (Kaufman & Kaufman,
1993).
Raven Progressive Matrices. The Raven Progressive Matrix test (RPM; Raven,
1938; Raven & Raven, 2008) is a single-format, time-limited, inductive reasoning test,
frequently and widely used in i.a. European personnel selection contexts (Raven &
Raven, 2008). The RPM takes up to 45 minutes to complete and is an optimal measurer
of general intelligence (Jensen, 1998). Because the items of the RPM only consist of
figural patterns and thus no verbal items are included, the matrices heavily reduce
cultural and ethnic differences in IQ scores (Nisbett et al., 2012). This is a possible
explanation why these tests are popular in (European) selection procedures: minimizing
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 10
verbality in intelligence tests effectively lowers subgroup differences in test scores and
racioethnic disadvantages individuals experience in personnel selection (Ployhart &
Holtz, 2008). Practitioners in the field of I-O psychology generally agree that diversity
in organizations is very valuable and, as a result, intelligence tests that minimize the
subgroup differences are desirable (Murphy, Cronin, & Tam, 2003). In sum, the
relatively short administration time of 45 minutes – especially in comparison with its
widely used counterparts such as the KAIT – the minimized verbal component and its
strong relatedness to g are great advantages (see Raven & Raven, 2008).
Short-form measurements of intelligence. Since the existence of long-form
intelligence tests, people have questioned whether it is necessary to include all subtests
and test items in order to accurately assess intelligence (Doll, 1917). Early on, test
developers already introduced elements of time in their intelligence measures, linking
the speed someone correctly answers items to their intellectual capacities (e.g.
Thorndike, Bregman, Cobb, & Woodyard, 1926). To date, researchers still asks
themselves this question and pursue the development and evaluation of shorter
measurements of intelligence (Ziegler et al., 2014; also see i.a. Bilker et al., 2012;
Donnell et al., 2007). The main purpose of these short forms is mainly saving time and
cutting costs, without losing psychometric validity (Smith et al., 2000).
However, several researchers have previously contested and discouraged the
development and use of short forms. Wechsler (1967) posed that “reduction in the
number of subtests as a time-saving device is unjustifiable” (p. 37) and that one has to
find the time when desiring to pursue time-saving methods. Levy (1968) stated that
“many of the essential components of the real problem [when producing short forms]
have been forgotten” (p. 415). According to Levy (1968) the real problem comprised
the misbalance between time-efficiency and loss in validity, claiming that the loss in
validity is consistently too high to advocate the time gain.
From a nuanced perspective, Schipolowski et al. (2014) argue in their article on
the construction of short forms of intelligence measures that reliable and efficient short-
form assessment is certainly possible. Moreover, according to Ziegler and colleagues
(2014), it is a misunderstanding that short forms are obsolete or that they could not be
used for decision on both the group and individual level. Loss in validity can be
justified, as there are several degrees of shortening, and this loss shouldn’t be too high
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 11
per definition. One can a priori assume that extreme shortening (e.g. reducing a 150-
item test to only six items) will account for a large loss in validity and reliability,
whereas reasonable shortening (e.g. reducing it to 45 items) will account for a smaller
loss, given that the content domain of the items and the factor structure are preserved
(see i.a. Nunnally & Bernstein, 1994). Smith and colleagues (2000) furthermore assure
that well-constructed short-form intelligence tests – either just for screening purposes or
as substitutions for full-scale test batteries – have sufficient potential. According to
them, there is still a lot of room for improvement in terms of the methodology of short-
form development and validation, which is promising.
Clinical short-form measurements. Research on the psychometric qualities and
utility of existing short-form intelligence tests in a clinical context is already hopeful.
Jeyakumar, Warriner, Raval, and Ahmad (2004), for example, provided converted
reliability and validity data based on the original publications by Wechsler regarding six
short forms of the Wechsler scales including the Wechsler Abbreviated Scale of
Intelligence (WASI) (Psychological Corporation, 1999). They’ve found estimates of
internal consistency ranging from .88 to .98 and part-whole correlations ranging from
.72 to .96. Furthermore, Donnell et al. (2007) evaluated and found two reliable short
forms of the WAIS-III specifically constructed for the evaluation of elderly patients
suffering from neurocognitive impairment associated with dementia and patients with
motoric disabilities. Other research on abbreviated forms of the Wechsler scales also
provides evidence that these shortened forms can be implemented and used as reliable
intelligence tests in clinical settings where time is either very valuable or scarce (see i.a.
Adams, Smigielski, & Jenkins, 1984; Axelrod et al., 2001; Donders, 1997; Ward &
Ryan, 1996; Wymer, 2003).
Furthermore, comparisons of several short-form intelligence tests provide us
with a promising construct validity perspective as well. The study by Prewett (1995)
revealed correlations between the Matrix Analogies Test Short Form (MAT) (Naglieri,
1985) and the WISC-III, and between the Kaufman Brief Intelligence Test (K-BIT)
(Kaufman & Kaufman, 1990) and the WISC-III of .67 and .78, respectively. In addition,
Canivez (1995; 2005) found evidence for high concurrent validity of the K-BIT when
compared to the WISC-III in a student sample. In the study by Hays, Reas, and Shaw
(2002) in a sample of 85 psychiatric patients, the K-BIT also significantly and
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 12
substantially correlated with the WASI (r = .89). Finally, The Wide Range Intelligence
– another multi-dimensional, short-form intelligence test (Glutting, Adams, Sheslow,
2000) – was also significantly correlated with the WASI (r = .71) in the study by Bialik
(2008) within a sample of college students.
Baddeley’s and Wonderlic’s short form. Two of the more known short-form
measures of intelligence are Baddeley’s (1968) Three-Minute Reasoning Test and the
Wonderlic Personnel Test (WPT; Wonderlic, 1973; Wonderlic & Wonderlic, 1992).
The extremely short Three-Minute Reasoning Test (T-MRT) consists of a series of
simple questions that vary in verbal complexity. The allocated IQ score depends on the
number of correctly answered items within the three-minute time limit (Baddeley,
1968). The scores on the test significantly correlate with those on the British Army
verbal intelligence test (r = .59) and is therefore one of the first and fastest short-form
intelligence tests able to assess one’s verbal comprehension (Baddeley, 1968). The T-
MRT is a great example of a ‘speeded test’, meaning that it assesses one’s capacities to
answer relatively simple questions in a limited amount of time (Hunt, 2011).
In addition, the Wonderlic Personnel Test is a popular, highly compressed
intelligence test, usually administered in group, and consists of 50 multiple choice
questions with a time limit of twelve minutes. Participants are shown a balanced mix of
problem-solving, verbal and arithmetic items. The test is mainly used in an occupational
context, specifically personnel selection (Hunt, 2011), and highly correlates with the
WAIS, as evaluated in research conducted by Dodrill (1981) and Dodrill and Warner
(1988) in a variety of psychiatric and non-psychiatric subject samples. In other research,
Dodrill (1983) also found a five-year test-retest reliability estimate of .94. Split-half
reliabilities ranged between .88 and .94 (Wonderlic, 1973).
Short form of the Wilde Intelligenztest. Overall, the psychometric qualities –
reliability and validity in particular – of the abbreviated intelligence tests in the previous
section on short-form measurements of intelligence are promising. However, most of
these tests were either specifically constructed for use in clinical settings or examined in
clinical samples. Other tests, such as Baddeley’s T-MRT and the Wonderlic Personnel
Test are well suited for (employee) screening and selection (Hunt, 2011). Yet, a
moderate to high proportion of the WPT and T-MRT items require adequate verbal
comprehension and linguistic reasoning. This verbal component potentially leads to test
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 13
score differences in racioethnic subgroups (Nisbett et al., 2012; Sackett, Schmitt,
Ellingson, & Kabin, 2001), lowering the chances of individuals from these subgroups to
be selected and thus decreasing the diversity and selection rates for racial minorities
(Hough, Oswald, & Ployhart, 2001; Murphy, 2010; Ployhart & Holtz, 2008).
The intelligence test used in the current research, a short form of the Wilde
Intelligenztest (WIT-S), is a non-verbal intelligence test based on the ‘Modified Model
of Primary Mental Abilities’ (Kersting, Althoff, & Jäger, 2008). The goal of the test is
to measure applicants’ intelligence in a highly time-efficient manner and subsequently
to be able to differentiate between applicants on the basis of their capability to perform
well in their future job (cfr. infra; Schmidt & Hunter, 1996). Psychometric qualities of
its parent test, the WIT-1 (Jäger & Althoff, 1983) suffice. On the one hand, split-half
reliability estimates of the composite scale ranged from .97 to .98. On the other hand,
split-half coefficients of the individual scales ranged from .88 to .99, whereas alpha
coefficients ranged from .82 to .97. One-year retest reliability coefficients ranged from
.59 to .68, except for one of the retention subscales (r = .15). Construct and criterion
validity were investigated in numerous ways and in various settings (military/police,
civil services, management, health care, etc.), including selection contexts, and rendered
adequate results. See Jäger and Althoff (1983) for a detailed overview.
The WIT-S comprises 45 items from four second-stratum abilities, namely
‘Reasoning’ (general deductive reasoning), ‘Spatial Reasoning’, ‘Numeric Reasoning’
and ‘Perceptual Speed’. All four mental abilities are theoretically related to fluid
intelligence (Gf), which makes it possible to identify applicants who perhaps do not
possess the required knowledge (Gc) but have enough learning capabilities, and to
predict their ability to solve complex (work-related) problems (Agnello, Ryan, &
Yusko, 2015). This is an advantage over the Wonderlic Personnel Test, as recent
research by Hicks, Harrison and Engle (2015) has shown that the WPT does not directly
relate to fluid intelligence, while the overall relation to general intelligence remains
somewhat unclear as well.
Furthermore, the linguistic component of the WIT-S has been minimized,
similar to the Raven Progressive Matrices, and no items requiring verbal comprehension
or word fluency were integrated in the WIT-S. This is likely to be a diversity asset
compared to the Wonderlic Personnel Test, Baddeley’s Three-Minute Reasoning Test,
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 14
the Wide Range Intelligence Test, the Kaufman Brief Intelligence Test, and the
Wechsler Abreviated Scales of Intelligence, as noted by Hough and colleagues (2001).
The WIT-S consequently has the capabilities to be embedded in an increasingly
multicultural and international business environment (Reeve, Scherbaum, & Goldstein,
2015; Scherbaum & Goldstein, 2015) because the test items do not heavily rely on any
particular language. Moreover, the test can be completed with a minimum knowledge of
English – beside Dutch currently the only test language – which is more and more
becoming widely known and used as common language on a global scale, a.k.a. ‘lingua
franca’ (Jenkins & Leung, 2013).
Moreover, the WIT-S is a twelve-minute, hence time-restricted intelligence test.
Consequently, the test is considered to be a ‘speeded test’. In order to be formally
classified as a speeded test, at least 20% of the test takers should leave one or more
questions unanswered and at least one test takers should leave 25% of the questions (or
more) blank (Van der Linden, 2011a). Speeded tests measure the speed whereby a
participant is able to correctly answer relatively simple questions, as opposed to ‘power
tests’, which measure the precision with which a participant is able to correctly answer
relatively complex questions (Hunt, 2011; Van der Linden, 2011b). Recent research on
the latent constructs ‘speed’ and ‘power’ suggests that a clear distinction between the
two constructs can be made in psychometric tests on the basis of time and accuracy data
(Partchev, De Boeck, Steyer, 2011). Besides, the correlation between mental speed and
general intelligence is moderate but stable, and mental speed is also substantially related
to Gf (Sheppard & Vernon, 2008).
Intelligence testing in personnel selection
The importance of intelligence. Hunter and Hunter (1984) have shown that
general intelligence, or general mental ability (GMA) as they call it, is one of the best
single predictors of future job performance, but also future learning and training
abilities. Schmidt and Hunter (1998) later elaborated on the validity of various
assessment methods in personnel selection and demonstrated that the mean predictive
validity of GMA for job performance is .51 and the mean predictive validity of GMA
for the performance in job training programs is .56. A valid theoretical explanation for
the fact that GMA predicts a large amount of variance in future job performance is
underpinned by the well-documented research on the mediating role of job knowledge
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 15
in the relationship between GMA and job performance (Borman et al. 1993; Hunter &
Schmidt, 1996; Schmidt, 2002; Schmidt & Hunter, 1992; Schmidt & Hunter, 2004;
Schmidt, Hunter, & Outerbridge, 1986). Higher levels of GMA lead to higher levels of
job knowledge, which, in their turn, lead to higher levels of job performance.
Employees who are more intelligent are able to gain more job knowledge and they tend
to gain it more rapidly.
Furthermore, Schmidt and Hunter (1998) concluded that “the validity of the
personnel measure (or combination of measures) used in hiring is directly proportional
to the practical value of the method – whether measured in dollar value of increased
output or percentage of increase in output” (p. 273). Using the most appropriate and
valid assessment methods – intelligence tests in particular – thus has a lot of economic
value in today’s world of increased business competition and knowledge work
(Barkema, Baum, & Mannix, 2002; Murphy, 1999). Besides, the complexity and
ambiguity of jobs has been increasing rapidly, making intelligence an excellent choice
when selecting performance predictors (DeNisi, Hitt, Jackson, 2003; Gottfredson, 2002;
Lang, Kersting, Hülsheger, & Lang, 2010; Scherbaum et al., 2012).
Intuition in recruitment and selection. During the five years of our academic
education, before officially becoming industrial and organizational psychologists, our
professors frequently stress the fact that our curriculum is structured in such a way that
we will be able to operate as true scientists-practitioners in the field of human resources.
As Jex and Britt (2008) so inclusively put it: “Organizational psychologists use
scientific methods to study behavior in organizations. They also use this knowledge to
solve practical problems in organizations; this is the essence of the scientist-practitioner
model […]” (p. 19). Thus, in order to attain the full potential of organizations and their
employees, rigorous scientific research and practice should intertwine harmoniously.
Unfortunately, this is not always the case.
As previously mentioned, Schmidt and Hunter (1998) have shown that
intelligence tests substantially predict job performance, even better than do unstructured
interviews. Yet, human resource managers generally perceive unstructured interviews as
more effective than intelligence tests (Terpstra, 1996), although they know that there are
many downsides to these type of interviews (Rynes, Colbert, & Brown, 2002).
Moreover, unstructured interviews are far more popular and intensively used in
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 16
selection contexts than any other assessment method, including intelligence tests
(Buckley, Norris, & Wiese, 2000). Agnello et al. (2015) claim that intelligence is “one
of the most critical individual difference variables in human resources” (p. 47), but
practitioners still heavily rely on intuition. They are, in many cases, more reluctant to
using reliable and validated selection decision aids, including intelligence tests, than
they are to using their own gut feeling and so-called ‘experience in predicting human
behavior’ (Highhouse, 2008).
Interviews conducted by Miles and Sadler-Smith (2014), held to understand the
rationale of managers using their intuition as an indicator for performance, personality
and the person-environment fit in personnel selection, revealed a few interesting
insights. First of all, there were some complications with face validity. More objectively
measured assessment methods are generally well received by applicants, yet, not all of
them are perceived as appropriate to assess the specific selection criteria, eventually
increasing the use of intuition. In addition, intelligence tests are not generally seen as
comprehensive predictors of employee behavior (Murphy et al., 2003) and are often
even hackled for being decontextualized and restricted (Lievens & Reeve, 2012).
Secondly, the high costs of setting up assessment centers, including a great amount of
formal testing, was considered less attractive by many practitioners than using a more
intuitive approach that didn’t require financial capital. Thirdly, intuition was used to
complement the more rational and objective approach to recruitment and was integrated
as an additional selection decision aid, voiding valid test outcomes in some cases.
More importantly, and highly relevant to the current study, is the finding that
human resources practitioners wish to hire the most suited person for the job in a highly
time-efficient manner (Klimoski & Jones, 2008). In this respect, intuitive approaches
including a casual screening of an applicant’s résumé or a quick, unstructured interview
will take up less time than a full-scale intelligence test like the KAIT, but eventually fall
short. In order to perform personnel selection in an evidence-based manner, but also
taking into account the need for a less time-consuming hiring process, (re)new(ed),
time-saving intelligence assessment methods should be explored. This is in line with the
recent plea of Scherbaum and colleagues (2012), arguing that we should invest more
resources in the research on the measurement of intelligence, because in recent years,
the contribution of I-O psychologists to this research domain has steadily decreased. In
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 17
addition, Agnello and his colleagues (2015) stated that a decent amount of advancement
in psychometrics and improved measurement techniques have been discussed in modern
years, yet “much more research on many of these developments is needed” (p. 53).
Hypotheses
As for the potential utility of the WIT-S, quite a lot has already been mentioned
and discussed in the introduction. Firstly, the minimized verbal component potentially
potentially leads to decreased adverse impact in selection (Hough et al., 2001; Murphy,
2010; Ployhart & Holtz, 2008), being a strong asset in an internationalizing, complex,
technology driven and multicultural business environment (Reeve et al., 2015;
Scherbaum & Goldstein, 2015). Secondly, the four factors of the WIT-S focus on fluid
intelligence, which should make it possible to differentiate between applicants who can
learn but do not yet possess the necessary job knowledge and applicants who have the
requisite knowledge but only possess little ability to learn from training (Agnello et al.
2015; Hunt, 2011). Thirdly, ‘speeded’ intelligence tests have an economic and time-
saving advantage (Hunt, 2011), which is especially valuable in the current context of
increased knowledge work and augmented competition between organizations
(Barkema et al., 2002; Murphy, 1999; Scherbaum et al., 2012).
The empirical section of this study focusses on testing the general psychometric
qualities of the WIT-S. It is ought to provide us with an answer to the question whether
the short form of the Wilde Intelligenztest is a well-constructed, valid and reliable
alternative to the time-consuming full-scale intelligence tests. Thoroughly researching
these psychometric qualities is essential if we eventually want to use the WIT-S in a
personnel selection context (Schipolowski et al., 2014; Smith et al., 2000).
Test score differentiation. First of all, I want to explore whether there are any
subgroup differences in test scores for sex, age and educational level, which is
elaborated in the next three paragraphs. I also want to provide sufficient data and
information on the test scores of the WIT-S and the research sample in general.
Eventually, when I am able to conclude that the WIT-S is as a reliable and valid short-
form alternative, norm groups will be provided, enabling both academics and
practitioners to interpret the unstandardized test scores (Evers, Lucassen, Meijer, &
Sijtsma, 2009) and counter the potential adverse impact inherent to these group
differences (Sackett & Wilk, 1994).
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 18
According to current research on sex differences in intelligence, no evidence
exists that supports mean differences in general intelligence (Jensen, 1998; Johnson &
Bouchard, 2007). By contrast, in the extensive research conducted by Johnson and
Bouchard (2007), differences between males and females on various test of specific
mental abilities did show up, even though g masked up many of these dissimilarities in
previous research. Whilst males perform better on tests focussing on visuospatial
abilities including mental rotation (more Gf-oriented), females perform better on tests
measuring verbal and memory abilities (more Gc-oriented; Johnson & Bouchard, 2007;
Nisbett et al., 2012). We can thus expect males to attain higher scores on the WIT-S,
which presumably focusses entirely on Gf and consists of nearly 18% spatial items.
We can furthermore expect age differences in total test scores on the WIT-S. As
previously discussed, evidence in favour of a Gc-Gf distinction exists. A great amount
of this evidence is based on the difference in performance on fluid-based versus
crystallized-based intelligence tests dependent of age. Younger participants should be
able to perform better on the WIT-S as Gf-related intelligence reaches its summit in
early adulthood but then steadily declines in adulthood and onward (cfr. supra; Horn &
Cattell, 1967; Salthouse, 1996; Nisbett et al., 2012).
Beside differences in sex and age, dissimilarities in educational level are also
taken into consideration. The correlation between performance on tests measuring any
form of cognitive ability and total years of education received is generally high whereby
differences in intelligence account for approximately 30% of the outcome variance
(Neisser et al., 1996). Reasons for this are i.a. the education rewarding process –
encouraging students to continuously perform better – social status and income (Neisser
et al., 1996). We can eventually expect participants who have received (higher)
education to a greater extent to perform better on the WIT-S than participants who did
to a lesser extent or not at all.
Hypothesis 1: Mean WIT-S test scores between demographic subgroups will
differ significantly, whereby (a) younger test takers will, on average, perform
better than older test takers, (b) male test takers will, on average, perform better
than their female counterparts and (c) higher educated test takers will, on
average, perform better than lower or not educated test takers.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 19
Items and their latent factors. Secondly, an assessment of the item facility and
the potency of each item to discriminate between respondents will be executed. In
knowledge-based tests, item analysis classically focuses on the two above-mentioned
statistics (Rust & Golombok, 2014). On the one hand, it is important to show that each
item of the WIT-S possesses an adequate difficulty level and contributes to the test as a
whole (Smith et al., 2000). The more extreme the value of an item facility index
becomes, the more redundant the item is because the participants’ total scores would
have been practically the same as when the item was never included in the test. On the
other hand, items should be able to discriminate between participants (Emons, Sijtsma,
Meijer, 2007; Rust & Golombok, 2014). Items with relatively high, positive item-total
correlations discriminate well between high and low performing participants. Items with
low item-total correlations do not, meaning that the items make little to no contribution
as they are practically uncorrelated with the total test score as well as with most other
test items.
Hypothesis 2: Each individual item of the WIT-S (a) possesses an adequate
difficulty level and (b) discriminates between high and low performers,
conformingly reflecting their total test score.
Thirdly, the factor structure of the WIT-S will be examined. Smith et al. (2000)
urge researchers to assess item-factor scale correlations and the factor-level validity of
the content because they are necessary to demonstrate that the short form reproduces the
hypothesized factor structure. I should be able to empirically show that a higher-order
general factor (e.g. Gf) and four subfactors, namely Reasoning, Spatial Reasoning,
Numeric Reasoning and Perceptual Speed are to be found in the short form by the
means of factor analytical research. The analysis will focus on both exploratory as well
as confirmatory factor analysis. The latter should provide us with the most useful
information on the fit of the data with the theoretical model (Smith et al., 2000).
Hypothesis 3: The WIT-S will reproduce the premised factor structure
comprising Gf and four lower-order mental abilities: Reasoning, Spatial
Reasoning, Numeric Reasoning and Perceptual Speed.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 20
Reliability and construct validity. Fourthly, I want to explore “the degree to
which the items in the test are associated due to what they share in common” (Sijtsma,
2009, p. 179) and consequently whether the WIT-S is a reliable measure. In order to
make decisions in personnel selections, the subscales preferably possess only little
measurement error (Smith et al., 2000). Analogous to factor analysis, I should be able to
show that each item is strongly related to its higher-order (sub)factor(s). Both the
composite-scale reliability and individual-scale reliability coefficients should at least
reach moderately high values. In the current study, moderately high values are regarded
as reliability estimates equalling or exceeding .70, a value generally considered to be
acceptable (Nunnally & Bernstein, 1994). Given the original test characteristics (Jäger
& Althoff, 1983), the WIT-S should be able to attain such values. It is important,
however, to keep in mind that for reliability coefficients to be sufficient, the context of
the test and its future use should be taken into consideration (Cortina, 1993).
Hypothesis 4: The composite-scale and individual-scale reliability estimates of
the WIT-S will equal or exceed .70.
Fifthly, I want to examine the convergent validity and consequently to what
extent the short-form measure of intelligence relates to an already sufficiently validated,
extensive measure of intelligence. Theretofore the rationale is twofold (Smith et al.,
2000): (a) we need to be able to justify the savings in time in relation to potential loss in
validity and (b) careful, thorough examination of the short form’s validity is needed as
these have frequently been developed without prior, profound research. The latter is
illustrated by the finding that, previously, a great amount of investigators assumed that
all of the reliability and validity evidence of the original, full-length measure
automatically applied to the abbreviated form and that because the new measure is
shorter, less validity evidence was required (Smith et al., 2000). The current research
should show that the short form of the Wilde Intelligenztest substantially correlates with
a full-scale intelligence test, in this case the KAIT-IV-NL (Mulder et al., 2004).
In addition, I want to investigate the discriminant validity and consequently to
what extent the short-form differs from a test measuring another construct, such as
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 21
emotional intelligence. The latter is defined as the ability to perceive, assimilate,
understand and manage emotions (Mayer, Caruso, & Salovey, 1999). Meta-analysis
(Van Rooy, Viswesvaran, & Pluta, 2005) has shown that, typically, moderate
correlations (𝜌 = .34) exist between ability measures of emotional intelligence and
general cognitive ability tests. Moreover, these are typically higher for Gc-oriented
(verbal, perceptual) than Gf-oriented tests (Mayer, Roberts, & Barsade, 2008).
In the current study, I will investigate the correlation between the WIT-S and a
translated version of the Geneva Emotion Recognition Test (GERT), an emotion
recognition ability test (Schlegel, Grandjean, & Scherer, 2014). The psychometric
qualities of this test are promising, with alpha reliability estimates of the full scale
reaching up to .76. Furthermore, the GERT significantly correlates with the Mayer-
Salovey-Caruso Emotional Intelligence Test (MSCEIT; r = .47; Mayer, Salovey, &
Caruso, 2002), another performance-based test, and the Situational Test of Emotion
Understanding (STEU; r = .50; MacCann & Roberts, 2008), a situational judgement
trait measure of emotional intelligence (Schlegel et al., 2014; Université De Genève,
2014).
Hypothesis 5: The WIT-S will (a) significantly and substantially correlate with a
full-scale intelligence test, the KAIT and (b) weakly, yet significantly correlate
with an emotion recognition ability test, the GERT. Accordingly, the correlation
between the WIT-S and KAIT will surpass the correlation between the WIT-S
and GERT.
Exploring these two forms of validity will help to map the associations between
the latent construct measured by the WIT-S and two related constructs within its
nomological network. Hence, this part of the research will provide us with information
on the test’s construct validity (Cronbach & Meehl, 1955). Beside these correlations, the
item analysis and factor analytical research should produce additional evidence and
information about the interpretability of the construct validity estimates (Cronbach &
Meehl, 1955; Shepard, 1993).
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 22
METHOD
Sample
Participants were conveniently recruited in the spring of 2015 via undergraduate
psychology students. These students had to administer a test battery that consisted of
cognitive (e.g. WISC-III-NL and KAIT-IV-NL) and non-cognitive tests (e.g. Geneva
Emotion Recognition Test) as part of their curriculum at Ghent University (Universiteit
Gent, 2014). Each student had to look for two acquaintances from their own region and
ask them to participate in a psychological assessment. They were allowed to search for
participants within relatively broadly specified age categories (e.g. 18-25 years). Before
being allowed to start the autonomous work, the students had to demonstrate that they
master the principles of applying the test battery by means of a competence test.
In total, 35 of the original 353 participants were excluded from the research
because of incomplete data, leaving us with 318 valid cases. Either data from one or
multiple tests was completely missing (N = 26) or the participants’ demographic
information was incomplete and could not be retrieved from or traced back to the
original data file (N = 9). Participants ranged in age from 18-59 years (M = 36.28, SD =
12.53) and were comprised of 147 males and 171 females. For research purposes, the
sample was divided into two quasi-equal age categories in terms of sample size: the first
subgroup comprised the participants aged 18 to 34 (N = 157), the second subgroup
comprised the participants aged 35 to 59 (N = 161).
200 participants received lower education (N = 199) or none (N = 1); thereof 95
were male (47.50%) and 105 were female (52.50%). 118 received higher education;
thereof 52 were male (44.07%) and 66 were female (55.93%). Lower education
comprised primary education, secondary education and specialization courses after
finishing secondary education; higher education comprised post-secondary education.
All participants were proficient in Dutch and used it as a spoken language in their
everyday life (N = 312) or were native speakers (N = 308); 314 subjects had the Belgian
nationality and 4 had the Dutch nationality.
Measures
Short form of the Wilde Intelligenztest. The Short Form of the Wilde
Intelligenztest (WIT-S) was administered during one session through the web-based
research platform Qualtrics (www.Qualtrics.com). The WIT-S was administered among
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 23
other tests including the Situational Test of Emotional Understanding and the
Situational Test of Emotional Management (MacCann & Roberts, 2008), and the
Geneva Emotion Recognition Test (Schlegel et al., 2014). The WIT-S comprised 45
items originating from the original test (Der Wilde Intelligenztest; Jäger & Althoff,
1983). Each item was an operationalization of one or two of the following latent
constructs: Reasoning, Spatial Reasoning, Numeric Reasoning and Perceptual Speed.
The test was composed in such a way that the questions were of increasing difficulty;
questions towards the end of the test were supposedly more difficult than at the
beginning. Participants had twelve minutes to solve the speeded test; four countdown
timers evenly distributed throughout the test displayed how much time they had left.
Only one of the participants was able to answer all 45 questions in the twelve-minute
timeframe. The lowest number of questions answered was 9; the average number of
questions answered was 26.29 (SD = 6.14).
The way the WIT-S items were presented varied: 30 items were shown as
multiple choice questions with five choices each, the other fifteen items – numeric
sequences and algebraic exercises – were displayed as single-line text entry questions.
Participants could choose to leave an item blank and scroll down to the next item. An
example question of the Spatial Reasoning subscale was ‘Which figure needs to be
flipped (or mirrored) first before it can overlap with the other four figures?’ and of the
Numeric Reasoning subscale ‘Try to discover the underlying simple calculation rule in
this calculation task and estimate the right answer’. An example item of the Reasoning
subscale was ‘Which letters are next in this letter sequence?’ and of the Perceptual
Speed subscale ‘Which face is different compared to the other two faces?’. For further
data analysis, the answers were recoded where 0 = wrong and 1 = correct.
Geneva Emotion Recognition Test. The Geneva Emotion Recognition Test
(GERT), a recently constructed performance-based test featuring fourteen different
emotions, measured individual differences in emotion recognition. The ability measured
is considered to be a central component of emotional intelligence (Schlegel et al., 2014).
The Dutch translation of the test was administered during the same online session as the
WIT-S. The GERT consisted of dynamic, multimodal video fragments displaying a
various amount of affective states and expressions and comprised 70 short video
snippets in total. An example item was ‘How likely is it that the actor tried to express
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 24
sadness?’. After each video fragment, participants had to rate the related item on a 7-
point Likert scale ranging from ‘Not likely at all’ to ‘Very likely’. The GERT took about
15 to 20 minutes to administer; no time-limitations were imposed. All of the 318
participants completed the test. Reversed-scored items were recoded. The lower bound
alpha of the composite scale equalled .79, which is considered to be sufficient for
research purposes (Nunnally & Bernstein, 1994).
Kaufman Adolescent and Adult Intelligence Test. The full-scale Dutch
version of the Kaufman Adolescent and Adult Intelligence Test (KAIT-IV-NL) is a
widely used and thoroughly validated paper-and-pencil measure (Hunt, 2011; Mulder et
al., 2004; Pearson, 2012) and was administered by bachelor students to measure the
general intelligence of the participants. The test (extended battery) consisted of four
subtests operationalizing crystallized intelligence (Auditory Comprehension, Double
Meanings, Definitions and Famous Faces; α = .89), four subtests operationalizing fluid
intelligence (Rebus Learning, Mystery Codes, Logical Steps and Memory for Block
Designs; α = .55) and two measures for memory recall (Rebus Recall and Auditory
Recall). The alpha reliability estimate of the composite scale, comprising 10 subscales,
equalled .84 and was thus acceptable (Nunnally & Bernstein, 1994).
The tasks and exercises of the KAIT varied greatly. They ranged from listening
to and answering questions about recordings of news stories (Auditory Comprehension)
to integrating multiple types of word clues (Double Meanings), as well as using visually
and orally presented logical premises to solve problem statements (Logical Steps). It
took the participants between 70 and 240 minutes to complete the test (N = 313, M =
125.90, SD = 27.20). Answers, observations and raw scores of the KAIT were written
down by the undergraduate students during administration. The data was eventually
entered into multiple Excel worksheets, sent to academic assistants at the University of
Ghent and recoded for further statistical analysis.
Data analysis
IBM SPSS Statistics version 23 was used for descriptive and correlational
statistics, comparisons of means, item analysis, reliability analysis and exploratory
factor analysis. Unless specified otherwise, all correlations reported were based on
Pearson’s product-moment formula; tests of significance were two-tailed. Qualitative
labels for correlations and effect sizes (e.g. small, moderate, large) conformed with
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 25
Cohen’s (1988) guidelines. When group means were compared using the t test statistic
Levene’s Test for Equality of Variances was consulted to assess the assumption of
homoscedasticity. Depending on the significance level of Levene’s Test, the appropriate
t test statistic was reported (Brown & Forsythe, 1974). The ‘lavaan’ statistical package
for latent variable analysis (Rosseel, 2012) was used to perform confirmatory factor
analysis in R version 3.2.3 (R Core Team, 2015).
Item analysis. First of all, item facility indices (IFI) were calculated by dividing
the number of participants who had correctly answered the question by the total number
of participants. Values ranged between 0 and 1. The easier the question, the higher the
value of the item facility index, and vice versa. Rust and Golombok (2014) proposed the
following rules of thumb: (a) an item is too difficult and/or too few participants give a
correct answer if the IFI is less than .25, (b) an item is too easy and/or too many
participants give a correct answer if the IFI is greater than .75.
Secondly, item discrimination indices (IDI) were obtained by calculating the
correlation between an item and the total score for the test. Rust and Golombok (2014)
argue that a minimum correlation of .20 is required as items with low item-total
correlations (between –.20 and .20) do not sufficiently discriminate. Relatively high,
negative item discrimination indices indicate that the item presumably measures a
construct different from the construct that is intended to be measured. The item
discrimination index was computed by using the point-biserial correlation formula
(Tate, 1954). Additionally, item-rest correlations were computed by correlating each
item and the participant’s total score for the test, whereby the item itself was subtracted
from the total for the correlation.
Thirdly, item distractor analysis – investigation of the frequencies of the selected
responses per item – was executed for multiple choice items with an item facility index
less than .25 and was reported when conclusive. Additional regard was given to items
that did not meet any of the following criteria: (a) item-total correlation greater than or
equal to .20 and (b) item-rest correlation greater than or equal to .20.
Factor analysis. In order to identify the appropriate rotation strategy as well as
the optimum number of factors to retain in the final exploratory factor analyses (EFA),
an initial EFA was executed using the commonly applied ‘principal axis factoring’
method with varimax rotation (Bartholomew, Steele, Moustaki, & Galbraith, 2008). The
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 26
extracted factors were evaluated by inspecting the correlation matrix. Multiple
correlations between the factors were substantial (absolute value ≥ .32). Consequently,
there was sufficient overlap in variance between several factors (more than 10%) to
justify the use of an ‘oblique rotation’ method (Tabachnick & Fidell, 2007). Also, from
a theoretical viewpoint, correlations between the factors could be assumed (Jäger &
Althoff, 1983; Kersting et al., 2008).
Various methods were used to identify the number of factors to retain in the final
EFA (see Gorsuch, 1983): (a) investigation of the eigenvalues, looking for values
greater than 1 (Kaiser criterion; Kaiser, 1960), (b) visual inspection of the scree plot
(Cattell, 1966) and (c) parallel analysis (Horn, 1965b). The latter supposedly presents
the least variability and sensitivity to factor differentiation (Zwick & Velicer, 1986).
The analysis indicated meaningful factors when eigenvalues of the WIT-S data were
greater than those provided by random data comprising the same sample size and
number of factors (Lautenschlager, 1989). Random data and derived eigenvalues for
parallel analysis were generated through principal component analysis with 100random
correlation matrix replications using Patil, Singh, Mishra, and Todd Donavan’s (2007)
‘Parallel Analysis Engine to Aid Determining Number of Factors to Retain’ (version
9.1.3, seed 50).
Additional factor analyses were carried out to identify latent traits using the
principal axis factoring method. Extracted results reported were rotated by means of the
varimax procedure (uncorrelated factors, orthogonal solution) and direct oblimin
procedure (delta = 0; correlated factors, oblique solution). The Kaiser-Meyer-Olkin
Measure of Sampling Adequacy (KMO) and Bartlett's Test of Sphericity were consulted
to evaluate if the data were likely to pool on the factors and if the correlation matrix
equalled an identity matrix. Values greater than or equal to .70 for the KMO coefficient
and significant test statistics (p ≤ .05) for Bartlett’s Test of Sphericity are acceptable and
preferred (Field, 2009). The rotated results were eventually evaluated for the presence
of ‘simple structure’ (Thurstone, 1947) in order to advocate further interpretations
(Kline, 2002).
Next, multiple confirmatory factor analyses were executed to assess if the
hypothesized factor structure of the WIT-S adequately fitted with the data. One-factor,
two-factor and four-factor models were tested and compared to each other: (a) the one-
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 27
factor model represented Gf, (b) the two-factor model represented Reasoning,
comprising Numeric Reasoning and Spatial Reasoning, and Perceptual Speed and (c)
the four-factor model represented Reasoning, Numeric Reasoning and Spatial
Reasoning as individual latent factors. As these models specify binary endogenous
variables, a three-way weighted least squared approach (DWLS method) was used to
estimate the model parameters (Flora & Curran, 2004). The estimates reported were
robust variants of the computed test statistics. Parametrization of the model was
performed through factor scaling using Unit Loading Identification (ULI): one non-
standardized factor loading was fixed to 1 for each factor.
Multiple fit measures were consulted to evaluate model fit. Firstly, the chi
square test provided information on the consistency of the observed covariance with the
model-implied covariance. Insignificant 𝜒#$ test statistics (p ≥ .05) indicate good fit with
the covariance data (Kline, 2005). Secondly, the Comparative Fit Indices provided
information on the relative improvement in fit of the specified model versus the baseline
model. CFI values greater than .95 suggest good fit (Hu & Bentler, 1999). Thirdly, the
Root Mean Square Error of Approximation (RMSEA) was evaluated. RMSEA values
less than or equal to .010, .050, and .080 indicate excellent, good, and mediocre fit,
respectively (MacCallum, Browne, & Sugawara, 1996). Finally, the chi square
difference test – using Satorra and Bentler’s (2001) method – provided information on
the relative improvement in fit of the models when compared to each other. Significant
𝜒%$test statistics suggest differences in model fit (Kline, 2005).
Reliability analysis. In order to estimate the composite-scale and individual-
scale reliability levels of the WIT-S, Cronbach’s (1951) alpha and split-half reliability
coefficients were estimated. The latter was calculated using Spearman (1910) and
Brown’s (1910) formula, splitting up the test and scales in odd- and even-numbered
halves. Beside parallel-forms reliability, split-half reliability is considered to be one of
the only measures not weakened by the artefacts of speeded tests (Kline, 2000). In
addition, a special, corrected case of the split-half reliability coefficient was computed
using Gulliksen’s (1950) least restrictive correction for stepped-up odd-even reliability
estimates of partially speeded power tests:
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 28
𝑅( = 𝑅**+ −𝐾𝑀/ −𝑀/
$ − 𝑆/$
(𝐾 − 1)𝑆*$
where 𝑅( is the corrected reliability coefficient, 𝑅**+ is the stepped-up odd-even
reliability, K is a reasonable approximation for the number of items in the speeded part
of test, 𝑀/ is the mean of the U-scores – the number of unanswered items per
participant – 𝑆/$ is the variance of the U-scores and 𝑆*$ is the variance of the total error
scores – the number of incorrectly answered items plus the number of unanswered items
per participant (Gulliksen, 1950). Arbitrarily, the value of K equalled the number of
items completed by at least 75% of the participants (K = 23). Furthermore, the ratio
between 𝑀/, the mean of the U-scores, and 𝑆4$, the variance of the total test scores, was
also reported to assess whether the corrected reliability estimate reported was
satisfactory. High 𝑀//𝑆4$ values (i.e. > .30) indicate less satisfactory estimates
(Gulliksen, 1950).
Construct validity. Lastly, the premised interrelations of the WIT-S with
cognitive ability and emotion recognition measurements were explored by evaluating
the convergent and discriminant validity coefficients. Convergent validity was
measured by correlating the unstandardized WIT-S test scores with the unstandardized
KAIT-IV-NL Total, Fluid and Crystallized IQ test scores. In addition, the correlations
between the subscales of the WIT-S and the subscales of the KAIT’s Fluid IQ subscales
were investigated. Discriminant validity was measured by correlating the
unstandardized WIT-S test scores with the unstandardized GERT composite-scale
scores.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 29
RESULTS
Descriptive statistics
Table 1 provides an overview of the sample and subgroup sizes for age, gender
and educational level, and the mean age per demographic subgroup. The mean age of
the male test takers was 34.00 years (SD = 11.90) and 38.24 years (SD = 12.76) as for
the female test takers. The mean age of the participants in the 18-34 years category was
24.89 (SD = 3.99) and 47.39 (SD = 6.65) in the 35-59 years category. As illustrated by
Figure 2, 109 (34.28%) participants were aged 20 to 26 years, another 75 (23.58%)
participants were aged 46 to 54 years; these age groups are thus highly represented as
they equal 57.86% of the total sample, yet cover only 34.14% of the age distribution.
The mean age of the participants who received lower or none education was 36.04 (SD
= 13.01). On average, participants who received higher education were 36.96 years old
(SD = 11.72). Here, lower education comprised primary education, secondary education
and specialization courses; higher education comprised post-secondary education.
Figure 2. Age distribution of the sample compared to a uniform distribution curve.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 30
Table 1 Frequencies for Age, Gender and Education and Mean Age per Demographic Category
Age
N (%) Mean SD
Sample Age 18-34 years 35-59 years Gender Male Female Education a Lower or none Higher
318 (100.00)
157 (49.37) 161 (50.63)
147 (46.23) 171 (53.77)
200 (62.89) 118 (37.11)
36.28
24.89 47.39
34.00 38.24
36.04 36.96
12.53
3.99 6.65
11.90 12.76
13.01 11.72
Note. N = sample or subgroup size; SD = standard deviation of mean value. Numbers between brackets in italic typeface are percentages of total sample size. a Lower education comprised primary and secondary education; higher education comprised post-secondary education. Table 2 Means and T Statistics per Demographic Category of the WIT-S Test Scores
Mean SD t
Sample Age 18-34 years 35-59 years Gender Male Female Education a Lower or none Higher
18.26
19.61 16.94
19.10 17.53
16.61 21.05
6.00
5.38 6.30
6.05 5.88
5.81 5.27
4.05**
2.34*
–6.81**
Note. N = 318. SD = standard deviation of mean value. The t tests measured the differences in mean test scores on the WIT-S per demographic category. a Lower education comprised primary and secondary education; higher education comprised post-secondary education. * p < .05 ** p < .001
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 31
Table 3 Means and T Statistics per Demographic Category of the KAIT-IV-NL Test Scores
Total IQ Fluid IQ
Mean SD t Mean SD t
Sample Age 18-34 years 35-59 years Gender Male Female Education a Lower or none Higher
64.48
66.21 62.80
65.97 63.21
60.60 71.07
11.61
10.50 12.40
10.69 12.23
10.80 9.85
2.65*
2.12*
–8.62**
32.26
32.89 31.65
32.85 31.75
29.84 36.36
8.06
6.65 9.21
6.80 8.99
6.72 8.51
1.37
1.22
–7.56**
Note. N = 318. SD = standard deviation of mean value. The given means are mean values of the unstandardized test scores for total as well as fluid IQ. The t tests measured the differences in mean scores per demographic category. a Lower education comprised primary and secondary education; higher education comprised post-secondary education. * p < .05 ** p < .001
Table 2 and 3 provide an overview of the mean WIT-S and KAIT-IV-NL test
scores, respectively. The t statistics per demographic category are also included in these
tables. The t tests measured the difference between the participants’ mean test scores on
the WIT-S and KAIT-IV-NL per demographic category. The mean test score on the
WIT-S was 18.26 (SD = 6.00). Furthermore, the mean test score on the KAIT was 64.48
(SD = 11.61) for total IQ and 32.26 (SD = 8.06) for fluid IQ. The unstandardized mean
total score on the Geneva Emotion Recognition Test (GERT) was 396.02 (N = 318; SD
= 20.81). As subgroup differences for the GERT are not of interest in the context of the
current research, they are not provided in this section.
There were significant differences in mean WIT-S scores between the age,
gender and education subgroups (see Table 2). The effect sizes of these differences
were small to moderate. Firstly, the mean of the subgroup 18-34 years (M = 19.61, SD =
5.38) significantly differed from the subgroup 35-59 years (M = 16.49, SD = 6.30),
t(316) = 4.05, p < .001, d = .46. Secondly, the mean of the male test takers (M = 19.10,
SD = 6.05) significantly differed from the mean of the female test takers (M = 17.53, SD
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 32
= 5.88), t(316) = 2.34, p = .020, d = .26. Thirdly, the mean of the lower/not educated
test takers (M = 16.61, SD = 5.81) significantly differed from the mean of the higher
educated test takers (M = 21.05, SD = 5.27), t(316) = –6.81, p < .001, d = .76. These
findings provide support for Hypothesis 1a, 1b and 1c as subgroup differences in mean
test scores for sex, age and educational level were present. On average, younger, male
and higher educated participants performed significantly better on the WIT-S than
older, female and lower or not educated participants, respectively.
There were also significant differences in mean KAIT-IV-NL test scores on the
composite scale between the age, gender and education subgroups (see Table 3). Effect
sizes for age, gender and educational level were .30, .24, .97, respectively. Analogously
to the differences in mean scores on the WIT-S, younger, male and higher educated
participants, on average, had higher KAIT total IQ scores than older, female and lower
or not educated participants, respectively.
Contrarily, as for the differences in test scores on the Fluid IQ subscales, the
mean of the subgroup 18-34 years (M = 32.89, SD = 6.65) did not significantly differ
from the subgroup 35-59 years (M = 31.65, SD = 9.21), t(316) = 1.37, p = .171, d = .15.
In addition, the mean of the male participants (M = 32.85, SD = 6.80) did also not
significantly differ from the mean of the female participants (M = 31.75, SD = 8.99),
t(316) = 1.22, p = .225, d = .14. Analogously to the differences in mean scores on the
WIT-S, higher educated participants, on average, had higher KAIT fluid IQ scores than
lower or not educated participants. However, no significant subgroup differences in
mean scores were found for age and gender.
Analysis of the WIT-S items
Table 4 provides a comprehensive overview of the item facility indices, item
discrimination indices, item-rest correlations and percentages of participants who
completed a particular item of the WIT-S. Item facility indices ranged from .00 to .91;
the mean item facility index equalled .41 (SD = .31). Furthermore, item discrimination
indices ranged from –.08 to .61; the mean item discrimination index equalled .32 (SD =
.19). In addition, item-rest correlations ranged from –.14 to .55; the mean item-rest
correlation equalled .27 (SD = .17).
In total, 317 of the participants (99.69%) left one or more questions unanswered.
282 of the participants (88.68%) left 25% of the questions or more blank. Furthermore,
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 33
there were large, significant, negative correlations between the item number and the
item facility index (r = –.91, p < .001) and between the item number and the percentage
of participants who completed the particular item (r = –.95, p < .001). These results
support the premise that the test was composed in such a way that the questions were of
increasing difficulty and the presence of test speededness.
The following items did not meet any of the item facility, item discrimination
and item-rest correlation criteria as suggested by Rust and Golombok (2014): 4, 8, 13,
31, 37 and 39 through 45. Items 4 and 8 had an IFI greater than .75, suggesting that
these items were too easy and/or too many participants gave a correct answer. Items 13,
31, 37 and 39 through 45 had an IFI less than .25, suggesting that these items were too
difficult and/or too few participants gave a correct answer. Moreover, all of the above-
mentioned items’ item-total and item-rest correlations were below .20.
Further item distractor analysis for the multiple choice item 13 ‘Which letters
are next in this letter sequence? a m b b m c c m ? ?’ showed that the correct answer
was chosen 54 times (17.31% of total valid responses). However, distractor ‘1) dd’ was
chosen 232 times (74.36% of total valid responses). Besides a low item facility index
and highly frequently chosen distractor, its rc was negative and significant (rc = –.14, p
< .012). The other three distractors were chosen between 5 and 15 times. As for the
multiple choice item 31 ‘Which form can be made from the unfolded form on the left?’
item distractor analysis showed that the correct answer was chosen 29 times (27.10% of
total valid responses). In contrast, distractor ‘A’ was chosen 45 times (42.06% of total
valid responses). The other three distractors were chosen between 8 and 15 times.
Additional item distractor analysis for the multiple choice item 30 ‘Try to
discover the underlying simple calculation rule in this calculation task and estimate the
right answer without working out the calculation task completely. 536 888 372 : 2 = ?’
showed that the correct answer was chosen 28 times (22.22% of the total valid
responses). However, distractor ‘C) 268 444 186’ was chosen 54 times (42.86% of the
total valid responses). The other three distractors were chosen between 10 and 23 times.
Distractor analyses for items 37 and 39 through 45 were inconclusive as only 3 to 26
participants completed these items.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 34
Table 4 Item Facility Indices (IFI), Item Discrimination Indices (IDI), Item-Rest Correlations and Percentage of Items Completed of the WIT-S
Item IFI a IDI b rc c Completed d
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37.
Reasoning Spatial Numeric Perceptual Speed Numeric + Reasoning Numeric Perceptual Speed Numeric Perceptual Speed Reasoning Spatial Numeric Reasoning Perceptual Speed Perceptual Speed Numeric + Reasoning Numeric Reasoning Spatial Numeric Perceptual Speed Numeric Perceptual Speed Numeric + Reasoning Numeric Reasoning Spatial Numeric Spatial Numeric Spatial Reasoning Numeric Numeric + Reasoning Perceptual Speed Numeric Reasoning
.78
.64
.76
.91
.79
.70
.69
.89
.79
.82
.71
.68
.17
.78
.71
.46
.53
.68
.66
.63
.81
.47
.54
.18
.36
.27
.34
.19
.32
.17
.09
.26
.08
.11
.09
.06
.03
.36 .45 .35 .18 .39 .14 .47 .19 .24 .47 .50 .33 –.08 .34 .28 .34 .41 .48 .58 .52 .38 .51 .61 .42 .46 .56 .50 .49 .60 .39 .13 .47 .27 .36 .34 .34 .19
.29 .39 .29 .13 .33 .06 .41 .14 .17 .42 .44 .26 –.14 .28 .21 .26 .33 .42 .52 .46 .32 .44 .55 .36 .39 .50 .44 .44 .55 .34 .09 .41 .22 .31 .30 .30 .16
99.69 99.37 98.74 99.37 96.54 97.48 97.48 99.69 99.69 98.74 98.74 95.91 98.11 95.28 95.91 84.59 81.13 91.19 91.19 83.65 88.68 79.25 76.42 35.53 55.35 55.35 52.83 39.62 45.28 39.62 33.65 29.56 16.04 15.72 14.15 10.06 8.18
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 35
Table 4 (Continued)
Item IFI a IDI b rc c Completed d
38. 39. 40. 41. 42. 43. 44. 45.
Spatial Numeric Numeric + Reasoning Spatial Reasoning Numeric Numeric Perceptual Speed
.06
.02
.01
.01
.01
.00
.00
.01
.21 .07 –.07 .05 .04 .00 .09 .07
.17 .05 –.09 .03 .02 –.01 .08 .05
8.81 4.72 2.52 4.72 3.77 2.52 .94 3.14
Note. N = 318. Items were given a label indicating the factor they presumably operationalized. Underlined items did not meet the conditions specified in a, b and c. a IFI = Item Facility Index. IFI was calculated by dividing the number of participants correctly answering the question by the total number of participants. Indices in bold meet the following conditions: .25 ≤ IFI ≤ .75. b IDI = Item Discrimination Index. IDI equalled the correlation between the item and the participant’s total score. Indices in bold meet the following condition: IDI ≥ .20. c rc = item-rest correlation. rc equalled the corrected correlation between each item and the participant’s total score for the test; the item itself was hereby subtracted from the total for the correlation. Indices in bold meet the following condition: rc ≥ .20. d Percentage of participants who have completed the item, regardless of whether the question was answered correctly.
Hypothesis 2a was not supported by the results of the item analyses as the item
facility indices of 28 items were not included in the interval [.25; .75]. Besides, the
indices differed greatly between items and showed a few inconsistencies with the
premise that the questions increased in difficulty (e.g. items 4 and 8). Furthermore,
frequently endorsed detractors deflated the item facility indices in some cases (e.g.
items 13, 30 and 31). Hypothesis 2b was not supported by the findings either as the item
discrimination indices of 14 items were below the required item-total correlation
coefficient of .20; two indices were even negative.
Factor structure of the WIT-S
Exploratory factor analysis. The results of the exploratory factor analyses
generated a Kaiser-Meyer-Olkin Measure of Sampling Adequacy coefficient of .76.
Bartlett’s Test of Sphericity was significant (χ2(990, N = 318) = 3817.21, p < .001). The
data were thus likely to pool on the factors and the correlation matrix did not equal an
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 36
identity matrix. Communality estimates ranged from .12 to .65 (Mdn = .38); seventeen
communality coefficients were less than .30, which is considered to be low (Tabachnick
& Fidell, 2007) and most likely indicated low reliability levels of the individual factors
(Fabrigar, Wegener, MacCallum, & Strahan, 1999). Furthermore, 15 factors had
eigenvalues greater than 1. Investigation of the scree plot showed a cutoff between
factor 5 and 6 and Horn’s (1965b) parallel analysis produced scree plots that intersected
at an eigenvalue of 1.52, just above the eigenvalue of factor 6, which equalled 1.50 (see
Figure 3). On the basis of the theoretical fundamentals of the WIT-S discussed in the
introduction and the factor number identification methods, five factors were eventually
extracted and retained.
Table 5 provides an overview of the five-factor EFA of the WIT-S using the
principal axis factoring method with varimax rotation and Kaiser normalization. Direct
oblimin rotation rendered highly inconsistent pattern loadings; these results were not
reported here. Eigenvalues of factors 1 to 5 equalled 6.29, 3.63, 2.04, 1.93 and 1.87,
respectively. The rotated five-factor solution merely explained 28.02% of the variance
of the model. Factor loadings (FL) were considered to be moderate if FL ≥ |.30|, and
high if FL ≥ |.50|. Items 1 through 29 practically all loaded moderately to highly on the
first extracted factor − except for items 6, 9, 15 and 27, and items 4, 8 and 13, which
also did not meet the item analysis criteria (cfr. supra). Items 23 and 25 through 35
loaded moderately to highly on Factor 2. Items 35 through 39 loaded moderately to
highly on Factor 3. Two of the last items in the test (43 and 45) loaded highly on Factor
4. Factor 5 produced both moderately positive and negative inconsistent pattern
loadings.
Only three of Thurstone’s (1947) simple structure criteria were met. First of all,
each factor had at least five zero loadings (a loading between −.10 and .10; Gorsuch,
1983) and each pair of factors had items with salient loadings (a loading > |.30|; Kline,
2002) on one and zero loadings on the other. Furthermore, all pairs of factors had a
relatively small number of saliently loaded variables on both factors. However, one of
the items, item 39, did not produce at least one zero loading on one of the factors.
Moreover, each pair of factors had a proportion of zero loadings on both factors, yet the
proportion was not large for all pairs (e.g. four paired zero loadings on Factor 1 and 2).
Simple structure was thus not achieved.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 37
Figure 3. Parallel analysis scree plots based on Horn (1965b).
Overall, the hypothesized factor structure in Hypothesis 3 was not supported by
the results of the EFA. Firstly, the rendered factor structure did not meet the criteria for
simple structure, impeding interpretations. Secondly, a general latent factor (e.g. Gf)
could not be found as various factor loadings on the first factor were zero, negatively
salient or positively salient. Thirdly, the four premised mental abilities could not be
clearly distinguished on the basis of the above-mentioned findings. However, a clear
pattern was discovered as the first four factors represented specific parts of the test.
Factor 1, for instance, comprised of saliently loaded items situated in the first half of the
test, items situated in the middle of the test loaded moderately to highly on Factor 2, etc.
This suggests that the extracted, rotated factor structure was supposedly an artefact of
the test’s speededness and the increasing difficulty of the questions.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 38
Table 5 Rotated Factor Pattern Loadings of the Five-Factor Exploratory Factor Analysis of the WIT-S Data using Principal Axis Factoring with Varimax Rotation and Kaiser Normalization
Mental Ability Item Factor 1 Factor 2 Factor 3 Factor 4 Factor 5
Reasoning Numeric Numeric + Reasoning a Spatial
R-IQ1 R-IQ10 R-IQ13 R-IQ18 R-IQ26 R-IQ32 R-IQ37 R-IQ42 N-IQ3 N-IQ6 N-IQ8 N-IQ12 N-IQ17 N-IQ20 N-IQ22 N-IQ25 N-IQ28 N-IQ30 N-IQ33 N-IQ36 N-IQ39 N-IQ43 N-IQ44 NR-IQ5 NR-IQ16 NR-IQ24 NR-IQ34 NR-IQ40 S-IQ2 S-IQ11 S-IQ19 S-IQ27 S-IQ29 S-IQ31 S-IQ38 S-IQ41
.38 .50 −.19 .44 .35 .06 .10 −.04 .37 .19 .27 .33 .35 .52 .40 .30 .30 .07 −.04 .10 −.16 −.12 −.02 .44 .32 .34 −.07 −.21 .47 .58 .61 .25 .36 −.14 .05 .00
.00 .03 .01 .17 .48 .65 .00 .04 .06 −.13 −.13 .05 .20 .15 .27 .39 .47 .49 .49 .27 .24 .09 .18 .02 .07 .26 .44 .10 .05 .03 .12 .51 .51 .34 .04 .03
.03 .11 −.08 .08 −.02 .26 .61 .08 −.05 .01 .05 −.11 −.12 −.07 −.08 −.12 −.07 .26 .10 .59 .56 .03 .16 −.08 −.06 −.08 .29 .28 .11 .07 .08 −.01 .17 .08 .79 .05
−.07 −.03 .15 −.09 −.04 .09 .07 .01 −.04 −.03 .08 .15 .09 −.07 .10 .08 .01 .23 .20 .29 −.28 .68 −.03 .14 −.03 −.06 .03 −.21 −.07 −.10 −.11 −.10 −.04 −.15 .03 −.01
.05 −.05 .02 .02 .04 .20 .18 .28 −.15 −.22 −.19 −.14 −.22 .04 −.09 −.12 −.15 .03 −.16 .05 −.48 −.07 −.30 −.01 −.02 .02 .21 −.18 .03 .10 .21 .34 .34 .00 .06 .21
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 39
Table 5 (Continued)
Mental Ability Item Factor 1 Factor 2 Factor 3 Factor 4 Factor 5
Perceptual Speed
PS−IQ4 PS−IQ7 PS−IQ9 PS−IQ14 PS−IQ15 PS−IQ21 PS−IQ23 PS−IQ35 PS−IQ45
.18 .41 .27 .32 .26 .35 .45 .07 −.03
−.04 .16 −.06 .06 −.01 .16 .41 .33 .04
.05 .07 −.07 .01 .02 −.07 −.01 .43 .23
.06 −.05 .11 .07 .08 −.13 −.10 .29 .57
−.02 −.01 −.17 .02 .02 −.04 .20 .32 .07
Note. N = 318. Item labels refer to the operationalized latent construct (e.g. R for Reasoning) and item position in the test (e.g. IQ1 for item 1). Factor loadings ≥ |.30| were considered salient and are displayed in bold typeface. Underlined items did not meet the item analysis criteria (see Table 4). a These items presumably operationalized two mental abilities: Reasoning and Numeric Reasoning.
Confirmatory factor analysis. Results of the confirmatory factor analysis
showed that none of the premised models containing all 45 item variables provided
sufficient fit with the data (see Table 6). All 𝜒#$ test statistics were furthermore
significant (p < .001), indicating bad fit of the observed covariance with the model-
implied covariance. CFI values were less than .95, suggesting that the specified models
did not fit better with the data than the baseline models. Moreover, the RMSEAs were
greater than .080, indicating bad model fit in general. Possible explanations for the
misfit are that (a) large proportions of the (un)standardized factor loadings in all of the
three models were both positive and negative, and (b) numerous factor loadings were
insignificant (p > .05). Finally, the chi square difference test (see Table 7) showed that
the four-factor model fit better with the observed data than the one- and two-factor
models (p <. 001). No difference was found between the one- and two-factor model (p =
.106).
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 40
Table 6 Goodness-of-Fit Statistics of Full and Partial Models of the WIT-S
Chi Square
𝜒#$ df CFI RMSEA
One-Factor Model Full Partial Two-Factor Model Full Partial Four-Factor Model Full Partial
3018.17 1042.25
3014.68 1042.03
2937.28 989.17
945 495
944 494
939 489
.49 .82
.49 .82
.50 .84
.083 .059
.083 .059
.082 .057
Note. N = 318. Full models comprised all 45 items; partial models comprised the 33 items that met at least one of the item analysis criteria (see Table 4). CFI = Comparative Fit Index. RMSEA = Root Mean Square Error of Approximation. All chi square statistics were significant (p < .001).
Table 7 Comparisons of Model Fit using Scaled Chi Square Difference Tests
𝜒%$ df Δ𝜒%$ Δdf p
Full Models Four-Factor Two-Factor One-Factor Partial Models Four-Factor Two-Factor One-Factor Four-Factor Models Full Partial
7735.42 8090.88 8099.37
1197.05 1284.44 1284.51
1197.05 7735.42
939 944 945
489 494 495
489 939
−
34.75 2.61
−
36.36 .02
−
1210.34
− 5 1 − 5 1 −
450
−
< .001 .106
−
< .001 .885
−
< .001
Note. N = 318. Full models comprised all 45 items; partial models comprised the 33 items that met at least one of the item analysis criteria (see Table 4). A simple approximation of 𝜒%$ was calculated using the method of Satorra and Bentler (2001).
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 41
Because the ‘full’ models comprising all items did not fit well with the observed
data, additional CFAs were executed on partial models. The items that did not meet at
least one of the item analysis criteria (N = 12; cfr. supra) were hereby omitted from the
model, leaving us with 33 items in total per partial model. 𝜒#$ statistics were significant
for all partial models (p < .001) and CFI values did not exceed .95. The RMSEA ranged
between .057 and .059, indicating mediocre model fit. The chi square difference test
indicated improved fit of the four-factor partial model compared to the one- and two-
factor partial models (p <. 001; see Table 7). No improved fit of the two-factor model
against the one-factor model was found (p = .885). In fact, the four-factor partial model
showed better fit than all of the above-mentioned models. In the next two paragraphs,
parameter estimates (i.a. (un)standardized factor loadings and factor correlations) of the
best fitting model, the four-factor partial model, are provided.
All unstandardized factor loadings of the four-factor partial model were positive
and standard errors ranged between .13 and .36, which is considered to be low to
moderately low (Kline, 2005). In addition, not all standardized factor loadings of the
four-factor partial model were large (> .50) or significant (e.g. p = .467 for item 6).
They ranged between .35 and .77 for Reasoning, between .38 and .78 for Numeric
Reasoning (except for item 6, which was .06), between .50 and .92 for Spatial
Reasoning and between .19 and .76 for Perceptual Speed. Omitting the insignificantly
loaded item 6 from the model did not provide improved fit (p = .630), nor did
introducing a fifth factor comprising the five Numeric + Reasoning items (p = .713).
Figure 4 provides a simplified graphic overview of the standardized model
parameter estimates of the four-factor partial model; multiple items were left out from
the figure in order to provide a clear overview. Endogenous item variables are presented
within rectangles; exogenous latent factors are displayed within ellipses. Single headed
arrows reflect standardized factor loadings; double-headed arrows reflect factor
correlations. Each observed indicator was modelled to be directly influenced by exactly
one unobserved factor. All factor correlations exceeded .68, meaning that all factors
were related to each other. However, the latent-variable covariance matrix of the four-
factor partial model was not positive definite – the correlation between Reasoning and
Perceptual Speed equalled 1.02 – indicating that the sample size was either too small or
that there were problems with the model specification.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 42
Figure 4. Partial four-factor model with standardized parameter estimates. Multiple
items were omitted from the figure to provide a clear overview.
In sum, the hypothesized factor structure in Hypothesis 3 was not supported by
the results of the CFAs. Goodness-of-fit statistics for both full and partial models
indicated that (a) the observed covariance did not fit well with the model-implied
covariance (b) the premised models did not fit better with the data than the baseline
models. However, RMSEA values indicated mediocre fit of the partial models, with the
four-factor partial model showing the best overall fit. Hence, the hypothesized model
was not identified but the four-factor partial model at least showed improved fit.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 43
Reliability of the WIT-S
An overview of the WIT-S reliability estimates as well as the ratio between the
mean of the unanswered items per participant and the variance of the total test scores is
provided in Table 8. The composite-scale consisted of 45 items (α = .84). The
Reasoning, Numeric Reasoning, Spatial Reasoning and Perceptual Speed subscales
consisted of 13, 15, 8 and 9 items, respectively. Alpha coefficients ranged between .51
(Perceptual Speed) and .66 (Numeric Reasoning and Spatial Reasoning). The alpha
coefficient equalled .57 for the Reasoning subscale. Only the composite-scale alpha
coefficient exceeded .70, which was premised as acceptable. To put this relatively high
value in perspective – as alpha is known to produce higher levels of reliability if the
number of items in the equation is substantial (Cortina, 1993; Schmitt, 1996) – it is
worthwhile to note that the average inter-item correlation was small (r = .09).
Uncorrected split-half reliability estimates indicated higher levels of reliability
overall. The coefficient of the composite scale (𝑅**+ = .86), as well as the Numeric
Reasoning (𝑅**+ = .72) and Spatial Reasoning (𝑅**+ = .71) subscales exceeded .70 and
were thus viewed as acceptable. Split-half coefficients of the Reasoning (𝑅**+ = .67)
and Perceptual Speed (𝑅**+ = .55) subscales did not. By contrast, only the corrected
split-half reliability of the composite scale (𝑅( = .81) was acceptable. The corrected
split-halves of the individual subscales were generally negative. The values equalled
–.49, –.33, –.31 and –.16 for the Reasoning, Spatial Reasoning, Perceptual Speed and
Numeric Reasoning subscales, respectively.
Hypothesis 4 was only partially supported by the results of the reliability
analysis. The composite-scale lower bound alpha and both split-half estimates produced
sufficient levels of reliability, exceeding the premised value. However, except for the
uncorrected split-half estimates of the Numeric Reasoning and Spatial Reasoning
subscales, none of the reliability estimates of the subscales equalled or exceeded the
hypothesized value. In fact, the corrected individual-scale split-half estimates even
produced negative values. The high 𝑀//𝑆4$ ratios of the subscales indicated that the
negative corrected split-half reliability estimates were supposedly an artefact of the high
total number of unanswered items per participant.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 44
Table 8 Reliability Estimates and the Ratio between Mean Unanswered Items and Total Test Score Variance of the Composite Scale and Individual Scales of the WIT-S
Reliability Estimates
α 𝑅**+ 𝑅( 𝑀//𝑆4$
Composite-Scale Individual-Scale Reasoning a Numeric Reasoning Spatial Reasoning Perceptual Speed
.84
.57
.66
.66
.51
.86
.67
.72
.71
.55
.81
−.49 −.16 −.33 −.31
.52
1.53 1.27 1.23 .94
Note. N = 318. α = Cronbach’s (1951) lower bound alpha. 𝑅**+ = Split-half reliability of odd- and even-numbered halves. 𝑅( = Corrected split-half reliability coefficient, estimated by using Gulliksen’s (1950) 𝑅5-formula where K equalled 23. 𝑀//𝑆4$ = Ratio between the mean of the number of unanswered items per participant and the variance of the total test scores. Reliability estimates in bold were greater than .70. a Comprising items labelled ‘Reasoning’ and ‘Numeric + Reasoning’.
Construct validity of the WIT-S
Table 9 provides an overview of the correlations between the WIT-S, KAIT and
GERT. First of all, the correlations between the WIT-S and KAIT were significant and
large for both total IQ (r = .54, p < .001) and fluid IQ (r = .51, p < .001). These findings
provide support for Hypothesis 5a and hence convergent validity. The correlation
between the WIT-S and the crystallized subscale of the KAIT was furthermore
significant and moderate (r = .29, p < .001). Secondly, the correlation between the
GERT and WIT-S was generally smaller, yet significant (r = .19, p = .001), similar to
its correlation with the composite scale of the KAIT. The correlation between the
WIT-S and KAIT exceeded the correlation between the WIT-S and GERT (z = 5.17, p <
.001). These findings provide support for Hypothesis 5b and hence divergent validity.
Correlations between the WIT-S and KAIT fluid IQ subscales were small to
moderate, yet significant, ranging from .14 to .32. Reasoning most substantially
correlated with Rebus Learning (r = .21, p < .001), Logical Steps (r = .28, p < .001) and
Memory for Block Designs (r = .18, p = .002). Spatial Reasoning most substantially
correlated with Mystery Codes (r = .32, p < .001). The correlation between Spatial
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 45
Reasoning and Memory for Block Designs equalled .17 (p = .003) Although relatively
small, these significant correlations provide additional support for Hypothesis 5a.
Table 9 Correlations between the Short Form of the Wilde Intelligenztest, Kaufman Adolescent and Adult Intelligence Test and Geneva Emotion Recognition Test
KAIT
WIT-S Total IQ Fluid IQ Crystallized IQ
KAITTotal KAITFluid
KAITCrystallized
GERT
.54**
.51**
.29**
.19**
.89** .78** .20**
.70** .14*
.15**
Note. N = 318. WIT-S = Short form of the Wilde Intelligenztest. KAIT = Kaufman Adolescent and Adult Intelligence Test (Dutch translation). GERT = Geneva Emotion Recognition Test (Dutch translation). * p < .05 ** p ≤ .001
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 46
DISCUSSION
The objective of the current study was to provide an answer to the question
whether the short form of the Wilde Intelligenztest (WIT-S) possesses adequate
psychometric qualities and could hence provide practitioners with a reliable and valid
alternative to time-consuming, full-scale intelligence tests in a personnel selection
context. Firstly, I explored differences in mean scores for age, sex and educational level.
Secondly, I calculated item facility and item discrimination indices and performed
distractor analysis to assess the psychometric qualities of the items. Thirdly, I carried
out exploratory and confirmatory factor analyses to investigate the presence of the
premised latent factor structure and overall model fit. Finally, I evaluated reliability
estimates of the WIT-S and the construct interrelations within its nomological network,
namely general intelligence and emotional intelligence.
General findings and implications
Test score differentiation. In accordance to the literature (i.e. Neisser et al.,
1996; Nisbett et al., 2012), men, younger participants aged 18 to 34 years and higher
educated test takers attain, on average, significantly higher WIT-S scores than women,
older people aged 35 to 59 years and lower/not educated test takers. This can be
explained from a theoretical perspective as the WIT-S presumably measures fluid
intelligence (Kersting et al., 2008), defined as the ability to handle novel, uncommon
problems (Carroll, 1993). Males, adolescents, young adults as well as people who have
received (higher) formal education generally achieve better scores on Gf-oriented tests
than their female, older and less educated counterparts (Johnson & Bouchard, 2007;
Neisser et al., 1996; Nisbett et al., 2012). This theoretical explanation is furthermore
supported by the construct validity estimates (cfr. infra).
The results thus indicate that the WIT-S is, in se, biased towards male, younger
and higher educated people. This bias is detrimental, especially in a personnel selection
context. Even more so, it is desirable to minimize the problems practitioners face with
unequal subgroup selection rates (Ployhart & Holtz, 2008). Ideally, and in order to be
able to use the WIT-S in a selection context without inflating adverse impact in the
process, standardization and the development of norm groups should be performed
(Sackett & Wilk, 1994). However, standardization based on the current research would
be flawed because of the (a) skewed age distribution of the sample, (b) inconsistent
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 47
factor structure and (c) generally low reliability estimates. Hence, norm groups are not
provided here.
Items and their latent factors. In contrast to what was hypothesized, analysis
of the WIT-S items shows that a large number of items do not possess an adequate
difficulty level and some items insufficiently discriminate between test takers. Highly
easy or difficult items are redundant as a participant’s total score is practically the same
when the item is not included in the test (Rust & Golombok, 2014). In addition, items
that do not adequately discriminate between high and low performers make little to no
contribution to the test either (Emons et al., 2007; Rust & Golombok, 2014). Looking
for an answer as to why the item characteristics of the WIT-S show inadequate
psychometric qualities, I found two interesting, yet logical observations: (a) a very high
number of participants left considerable amounts of items blank and (b) the position of
the items in the test is, generally speaking, inversely proportional to the item difficulty.
These findings not only suggest the classification of the WIT-S as a speed test (Van der
Linden, 2011a) but also support the premise that the questions are of increasing
difficulty.
Item analysis consequently indicates that a substantial amount of items should
better be omitted from the WIT-S as they are too easy, too difficult, do not adequately
discriminate or because one or multiple of its distractor options are chosen too
frequently. Doing so for all items that do not meet the criteria would be inappropriate
because the test itself is time-restricted, on the one hand, and designed in such a way
that the items are of increasing difficulty, on the other hand. Some items, however,
should at least be revised. Especially the items subjected to distractor analysis, the items
with extreme item facility index values and the items that were completed by a
substantial amount of participants (i.e. 75%) but produced very low or even negative
item-total correlations.
In the current study, the WIT-S furthermore does not reproduce the theorised
factor structure including Gf and the four mental abilities Reasoning, comprising
Numeric Reasoning and Spatial Reasoning, and Perceptual Speed. On the contrary, both
full (including all items) and partial (including items with more robust psychometric
attributes) models do not sufficiently fit with the data. The four-factor partial model
does, however, provide improved fit, indicating that the four mental abilities are
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 48
interrelated to some extent. Additionally, a clear pattern was discovered through
exploratory factor analysis: the found structure is most likely an artefact of the test’s
speededness. This effect is also noted by Kline (2000) as item correlations in speeded
tests frequently are anomalies caused by timing and order. The analysis itself hereby
produces grouped item loadings on each factor. Although sufficiently conclusive, these
findings do not necessarily rule out the reproduction of the latent structure in future
research (cfr. infra).
Reliability and construct validity. The reliability estimates of the composite
scale of the WIT-S, even Gulliksen’s (1950) split-half correction for speed restrictions,
exceed the hypothesized value of .70. Contrarily, the estimates of the individual scales
do not, except for the uncorrected split-half estimates of the Numeric Reasoning and
Spatial Reasoning subscales. Acceptable interrelatedness of the items is thus present,
yet insufficient evidence for the reliability of the subscales is found in the current study.
In this respect, the literature discussed was more promising (see i.a. Bilker et al., 2012;
Donnell et al., 2007; Jeyakumar et al., 2004). Yet again, the results suggest that the
inacceptable subscale reliability estimates are likely due to the test’s speededness.
These low subscale reliability coefficients do not per definition entail that the
WIT-S is an unreliable measure. The results do imply that a small proportion of the test
variance is attributable to a general factor and that the items are interrelated to some
extent (Cortina, 1993; Sijtsma, 2009), which is contrarily not the case for the subscales
of the WIT-S. We have to be careful, though, to not exaggerate this general
interrelatedness. The sample size in the current study is large and it has been shown
that, along with low item-factor loadings – which applies in the current study, the
presence of systematic measuring errors possibly inflates reliability estimates (Cortina,
1993; Shevlin, Miles, & Walker, 2000).
In addition, the WIT-S has significant and substantial overlap with a full-form
measure of intelligence, the Kaufman Adolescent and Adult Intelligence Test (Kaufman
& Kaufman, 1993), and weak, yet significant overlap with an emotion recognition test,
the Geneva Emotion Recognition Test (Schlegel et al., 2014). The overlap between the
WIT-S and KAIT furthermore surpasses the overlap between the WIT-S and GERT.
The results consequently indicate convergence of the WIT-S with a general factor of
intelligence (g) or a more fluid-oriented form of intelligence (Gf), as measured by the
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 49
KAIT. The results furthermore indicate discrimination between the WIT-S and the
ability to adequately recognize emotion expressions, as measured by the GERT. This is
in line with previous research on the interrelatedness of short forms of intelligence
measures and similarly structured, long-form tests (see i.a. Canivez, 2005; Dodrill &
Warner, 1988; Prewett, 1995) and on the relationship between measures of general
intelligence and ability measures of emotional intelligence (Van Rooy et al., 2005).
Limitations
First of all, the skewedness of the age distribution of the participants is one of
the reasons why standardization and subsequently the construction of norm groups is
not desirable. Participants, who were mainly acquaintances of undergraduate students,
were conveniently selected as opposed to randomly sampled. In combination with the
broadly specified age categories in which students were allowed to search for
participants, the probability sampling resulted in the overrepresentation of certain age
categories. The lowered representativeness on the basis of age is consequently
detrimental for the external validity.
Secondly, only reliability measures estimating the interrelatedness of the items
are assessed (i.e. lower bound alphas and split-halves). Although providing these
estimates is not considered to be poor practice, solely relying on these measures is (see
i.a. Cortina, 1993; Schmitt, 1996; Sijtsma, 2009). For one thing, these estimates do not
imply the presence of unidimensionality, nor do they provide information on the
multidimensionality of the measure that is subject of the research (Cortina, 1993;
Schmitt, 1996). Moreover, no definite cutoff can be regarded as the ‘holy grail’ or ideal
level of reliability a test or one of its subscales should eventually reach (Schmitt, 1996).
Alternative forms of reliability should hence be explored. However, on a more positive
note, the current study does provide information on the item-total correlations, latent
factor structure and interrelatedness of the construct measured within its nomological
network, which is advocated for by Sijtsma (2009).
Thirdly, it is important to note that the estimations of convergent and
discriminant validity are not measured under perfect conditions. The tests administered
in the current study not only measure different constructs but also use different methods
to measure them. Here, the WIT-S represents a time-constrained intelligence test, the
KAIT is a full-length intelligence test and the GERT is a full-length emotion
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 50
recognition ability test. This differentiation in methodology can serve as a potential
source of construct-irrelevant variance (Arthur & Villado, 2008). When either the
construct or method is controlled for in the research design, the variance would have
been isolated, which permits more meaningful construct comparisons (Arthur &
Villado, 2008).
Finally, the inconsistent factor structure, low reliability estimates as well as the
large proportion of inadequate item properties are largely due to test speededness and
the way the test was designed and administered. As a result, it is not possible to draw
solid conclusions on the basis of the current study. However, this does not mean that the
test and its items are unreliable or produce an inconsistent factor structure per se.
Differently operationalized future research can undoubtedly provide academics as well
as practitioners with more information on the true reliability and latent structure of the
WIT-S.
Suggestions for future research
In the first place, I suggest that future research looks beyond Cronbach’s (1951)
time-honoured alpha and other measures of interrelatedness, especially when dealing
with test speededness and its resulting anomalies, as is the case with the WIT-S. In
similar scenarios it is more appropriate to look into parallel forms reliability, measuring
the correlation of the individual and composite scales of one form of the test with a
second form of the tests (Gulliksen, 1950; Kline, 2000). When doing so, researchers
should make sure that the content coverage of the test, scales and even items are
preserved across both forms to the greatest extent possible (see Smith et al., 2000).
Moreover, the WIT-S could be administered in a slightly speeded or even non-
speeded manner, similar to the administration of power tests. In this respect, reliability
coefficients of internal consistency like alpha and split halves or McDonald’s (1999) 𝜔
and Raykov, Dimitrov, and Asparouhov’s (2010) 𝜌 for categorical data would render
more meaningful estimates on both the composite-scale and individual-scale level. The
administration of a non- or slightly speeded version can also benefit the outcome of
factor analytical research. The artefacts caused by timing and order are hereby
neutralised, making it possible to unveil the true latent factor structure by means of
structural equation modelling instead of a flawed structure based on grouped item
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 51
loadings. It will furthermore help us to investigate whether the short form preserves the
content coverage of each factor in the multifactorial instrument (Smith et al., 2000).
Thirdly, a more complex approach comprising a full-fledged multitrait–
multimethod design containing several different constructs measured by various
methods is ideally elaborated in future research (Arthur & Villado, 2008; Campbell &
Fiske, 1959). This multitrait-multimethod design could comprise general intelligence
(measures of g as well as Gf), emotional intelligence and personality traits. These are
subsequently measured in their full-length and short form or even in a self-report or
peer-evaluation format. This way, one could control for anomalies of method as well as
construct differentiation. Moreover, the relatedness and overlap with its parent test, the
WIT (Jäger & Althoff, 1983), could be explored by means of independent test
administrations of both forms (using an appropriate test-retest interval) to elaborate
even further on its construct validity (see Smith et al., 2000).
Finally, I suggest that future research also looks beyond construct validity and,
once its factor structure and internal consistency are clear and sound, focusses on
classification rates and predictive validity. The goal of the WIT-S is namely to measure
an applicant’s intelligence in a personnel selection context. In this respect, it’s essential
to show that the same accuracy of classification, present in the parent form, is
transferred to the abbreviated form as well (Smith et al., 2000). Relating the WIT-S to
one or multiple criterions, for example job performance (see Schmidt & Hunter, 1998),
will furthermore contribute to the general validation process of the short-form measure.
Conclusion
Since the early years of test development, researchers have explored several
ways to shorten tests, because full-length measures are often time-consuming and costly
(Smith et al., 2000; also see Doll, 1917; Thorndike et al., 1926). In the current research,
I’ve investigated the potential of a short form of the Wilde Intelligenztest (WIT-S) for
personnel selection, where (a) time is often a crucial element (Klimoski & Jones, 2008)
and (b) being able to precisely predict an applicant’s future job performance is an
important objective (Highhouse, 2008; Schmidt & Hunter, 2004). In-depth
psychometric research on the reliability and validity of the WIT-S gives us a first
imprint of the potential employability of the abbreviated measure.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 52
First of all, the WIT-S differentiates between age, sex and educational level
subgroups whereby male, younger and higher educated test takers, on average, attain
better scores than their counterparts. This corresponds with earlier research in the field
of intelligence, where measures of fluid intelligence produce similar test characteristics
(cfr. supra). In addition, the items of the WIT-S are sufficiently interrelated as the
composite scale reaches satisfactory reliability levels. Finally, the convergence of the
WIT-S with a full-length measure of general intelligence (Kaufman Adolescent and
Adult Intelligence Test) and divergence with an emotional ability test (Geneva Emotion
Recognition Test) suggest that the short form measures a general factor of intelligence
to at least some extent.
By contrast, the current research is not able to show (a) that each item of the
WIT-S attains an adequate difficulty and discriminatory level, (b) that the WIT-S
preserves the factor structure of a multifactorial measure and (c) that the subscales of
the WIT-S reach satisfactory levels of reliability. Because reliability proceeds validity,
it can not be concluded here that the WIT-S is a reliable and valid alternative to full-
form intelligence tests. The way the test was designed and eventually administered is
likely the root cause of these limitations. Therefore, future research should (a) focus on
parallel forms reliability, (b) investigate the WIT-S in its non- or less speeded form and
(c) control for construct and method variations or use a more complex multitrait–
multimethod design. In sum, future research on the reliability and latent factor structure
of the WIT-S should provide more conclusive findings about its psychometric qualities
before the short-form intelligence test can be employed in a personnel selection context.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 53
REFERENCES
Abad, F. J., Colom, R., Juan-Espinosa, M., & Garcia, L. F. (2003). Intelligence
differentiation in adult samples. Intelligence 31(2), 157–166.
doi:10.1016/s0160-2896(02)00141-1
Ackerman, P. L. (1996). A theory of adult intellectual development: Process,
personality, interests, and knowledge. Intelligence, 22(2), 227–257.
doi:10.1016/s0160-2896(96)90016-1
Ackerman, P. L., Beier, M. E., & Boyle, M. O. (2005). Working memory and
intelligence: The same or different constructs? Psychological Bulletin, 131(1),
30–60. doi:10.1037/0033-2909.131.1.30
Adams, R. L., Smigielski, J., & Jenkins, R. L. (1984). Development of a Satz-Mogel
Short Form of the WAIS-R. Journal of Consulting and Clinical Psychology,
52(5), 908–908. doi:10.1037/0022-006x.52.5.908
Agnello, P., Ryan, R., & Yusko, K. P. (2015). Implications of modern intelligence
research for assessing intelligence in the workplace. Human Resource
Management Review, 25(1), 47–55. doi:10.1016/j.hrmr.2014.09.007
Arthur, W., & Villado, A. J. (2008). The importance of distinguishing between
constructs and methods when comparing predictors in personnel selection
research and practice. Journal of Applied Psychology, 93(2), 435–442.
doi:10.1037/0021-9010.93.2.435
Axelrod, B. N., Ryan, J. J., & Ward, L. C. (2001). Evaluation of seven-subtest short
forms of the Wechsler Adult Intelligence Scale-III in a referred sample. Archives
of Clinical Neuropsychology, 16(1), 1–8. doi:10.1093/arclin/16.1.1
Baddeley, A. D. (1968). A three-minute reasoning test based on grammatical
transformation. Psychonomic Science, 10(10), 341-342.
doi:10.3758/bf03331551
Barkema, H. G., Baum, J. A. C., & Mannix, E. A. (2002). Management Challenges in a
New Time. Academy of Management Journal, 45(5), 916–930.
doi:10.2307/3069322
Bartholomew, D. J., Steele F. S., Moustaki, I., & Galbraith, J. I. (2008). Analysis of
multivariate social science data. Boca Raton: Chapman & Hall/CRC.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 54
Basak, C., Boot, W. R., Voss, M. W., & Kramer, A. F. (2008). Can training in a real-
time strategy video game attenuate cognitive decline in older adults? Psychology
and Aging, 23(4), 765–777. doi:10.1037/a0013494
Bialik, D. (2008). Brief measures of cognitive abilities: A concurrent validity
investigation of the WRIT and WASI. Dissertation Abstracts International:
Section B. Sciences and Engineering, 68(12B), 8443.
Bilker, W. B., Hansen, J. A., Brensinger, C. M., Richard, J., Gur, R. E., Gur, R. C.
(2012). Development of Abbreviated Nine-Item Forms of the Raven’s Standard
Progressive Matrices Test. Assessment, 19(3), 354–369.
doi:10.1177/1073191112446655
Blair, C. (2006). How similar are fluid cognition and general intelligence? A
developmental neuroscience perspective on fluid cognition as an aspect of
human cognitive ability. Behavioral and Brain Sciences, 29, 109–125.
doi:10.1017/S0140525X06009034
Borella, E., Carretti, B., Riboldi, F., & De Beni, R. (2010). Working memory training in
older adults: Evidence of transfer and maintenance effects. Psychology and
Aging, 25(4), 767–778. doi:10.1037/a0020683
Boring, E. G. (1923). Intelligence as the tests test it. The New Republic, 35, 35–37.
Borman, W. C., Hanson, M. A., Oppler, S. H., Pulakos, E. D., Elaine, D., White L.A, &
Leonard A. (1993). Role of early supervisory experience in supervisor
performance. Journal of Applied Psychology, 78(3), 443–449.
doi:10.1037/0021-9010.78.3.443
Brown, D. T. (1994). Review of the Kaufman Adolescent and Adult Intelligence Test
(KAIT). Journal of School Psychology, 32(1), 85–99. doi:10.1016/0022-
4405(94)90030-2
Brown, M. B., & Forsythe, A. B. (1974). Robust Tests for the Equality of Variances.
Journal of the American Statistical Association, 69(346), 364–367.
doi:10.1080/01621459.1974.10482955
Brown, W. (1910). Some experimental results in the correlation of mental abilities.
British Journal of Psychology, 1904–1920, 3(3), 296–322. doi:10.1111/j.2044-
8295.1910.tb00207.x
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 55
Buckley, M. R., Norris, A. C., & Wiese, D. S. (2000). A brief history of the selection
interview: may the next 100 years be more fruitful. Journal of Management
History, 6(3), 113–126. doi:10.1108/eum0000000005329
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.
doi:10.1037/h0046016
Canivez, G. L. (1995). Validity of the Kaufman Brief Intelligence Test: Comparisons
With the Wechsler Intelligence Scale for Children – Third Edition. Assessment,
2(2), 101–111.
Canivez, G. L. (2005). Construct Validity of the Kaufman Brief Intelligence Test,
Wechsler Intelligence Scale for Children-Third Edition, and Adjustment Scales
for Children and Adolescents. Journal of Psychoeducational Assessment, 23(1),
15–34. doi:10.1177/073428290502300102
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies.
Cambridge: Cambridge University Press. doi:10.1017/cbo9780511571312.002
Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical
experiment. Journal of Educational Psychology, 54(1), 1–22.
doi:10.1037/h0046743
Cattell, R. B. (1966). The Scree Test For The Number Of Factors. Multivariate
Behavioral Research, 1(2), 245–276. doi:10.1207/s15327906mbr0102_10
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences: Second
Edition. New York, NY: Lawrence Erlbaum Associates, Publishers.
doi:10.4324/9780203771587
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and
applications. Journal of Applied Psychology, 78(1), 98–104. doi:10.1037/0021-
9010.78.1.98
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16(3), 297–334. doi:10.1007/bf02310555
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52(4), 281–302. doi:10.1037/h0040957
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 56
Deary, I. J. (2000). Looking down on human intelligence: from psychometrics to the
brain. New York, NY: Oxford University Press.
doi:10.1093/acprof:oso/9780198524175.001.0001
DeNisi, A., Hitt, M. A., & Jackson, S. E. (2003). The knowledge-based approach to
sustaining competitive advantage. In S.E. Jackson, M.A. Hitt, & A.S. DeNisi
(Eds.), Managing knowledge for sustained competitive advantage (pp. 3–33).
San Francisco: John Wiley & Sons, Inc.
Detterman, D. K., & Daniel, M. H. (1989). Correlations of mental tests with each other
and with cognitive variables are highest in low IQ groups. Intelligence, 13(4),
349–360. doi:10.1016/s0160-2896(89)80007-8
Dickens, W. T. (2007). What is g? Washington, DC: Brookings Institution.
Dodrill, C. B. (1981). An economical method for the evaluation of general intelligence
in adults. Journal of Consulting and Clinical Psychology, 49(5), 668–673.
doi:10.1037/0022-006x.49.5.668
Dodrill, C. B. (1983). Long-term reliability of the Wonderlic Personnel Test. Journal of
Consulting and Clinical Psychology, 51(2), 316–317. doi:10.1037/0022-
006x.51.2.316
Dodrill, C. B., & Warner, M. H. (1988). Further studies of the Wonderlic Personnel
Test as a brief measure of intelligence. Journal of Consulting and Clinical
Psychology, 56(1), 145–147. doi:10.1037/0022-006x.56.1.145
Doll, E. A. (1917). A brief Binet-Simon scale. Psychological Clinic, 11, 197–211.
Donders, J. (1997). A short form of the WISC–III for clinical use. Psychological
Assessment, 9(1), 15–20. doi:10.1037/1040-3590.9.1.15
Donnell, A., Pliskin, N., Holdnack, J., Axelrod, B., & Randolph, C. (2007). Rapidly-
administered short forms of the Wechsler Adult Intelligence Scale – 3rd edition.
Archives of Clinical Neuropsychology, 22(8), 917–924.
doi:10.1016/j.acn.2007.06.007
Dumont, R. & Hagberg, C. (1994) Kaufman Adolescent and Adult Intelligence Test
(KAIT): Test Review. Journal of Psychoeducational Assessment, 12(2), 190–
196.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 57
Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2007). On the consistency of individual
classification using short scales. Psychological Methods, 12(1), 105–120.
doi:10.1037/1082-989x.12.1.105
Ericsson, K. A., Charness, N., Feltovich, P. J., & Hoffman, R. R. (2006). The
Cambridge handbook of expertise and expert performance. Cambridge:
Cambridge University Press.
Evers, A., Lucassen, W., Meijer, R. R., & Sijtsma, K. (2009). COTAN
Beoordelingssysteem voor de kwaliteit van tests. Amsterdam: Nederlands
Instituut van Psychologen.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating
the use of exploratory factor analysis in psychological research. Psychological
Methods, 4(3), 272–299. doi:10.1037/1082-989x.4.3.272
Field, A. (2009). Discovering statistics using SPSS. London, England: Sage
publications.
Flora, D. B., & Curran, P. J. (2004). An Empirical Evaluation of Alternative Methods of
Estimation for Confirmatory Factor Analysis With Ordinal Data. Psychological
Methods, 9(4), 466–491. doi:10.1037/1082-989x.9.4.466
Flynn, J. R. (2007). What is intelligence? Beyond the Flynn effect. New York, NY:
Cambridge University Press.
Galton, F. (1869). Hereditary genius: An inquiry into its laws and consequences.
London, Great Britain: Macmillan and Co. doi:10.1037/13474-000
Glutting, J., Adams, W., & Sheslow, D. (2000). Wide Range Intelligence Test.
Wilmington, DE: Wide Range.
Goldstein, H. W., Zedeck, S., & Goldstein, I. L. (2002). g: Is This Your Final Answer?
Human Performance, 15(1–2), 123–142. doi:10.1080/08959285.2002.9668087
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.
Gottfredson, L. S. (1997a). Mainstream Science on Intelligence: An Editorial With 52
Signatories, History and Bibliography. Intelligence, 24(1), 13–23.
doi:10.1016/s0160-2896(97)90011-8
Gottfredson, L. S. (1997b). Why g matters: The complexity of everyday life.
Intelligence, 24(1), 79–132. doi:10.1016/s0160-2896(97)90014-3
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 58
Gottfredson, L. S. (2002). Where and Why g Matters: Not a Mystery. Human
Performance, 15(1), 25–46. doi:10.1207/s15327043hup1501&02_03
Gulliksen, H. (1950). The reliability of speeded tests. Psychometrika, 15(3), 259–269.
doi:10.1007/bf02289042
Hays, J., Reas, D., & Shaw, J. (2002). Concurrent validity of the Wechsler Abbreviated
Scale of Intelligence and the Kaufman Brief Intelligence Test among psychiatric
inpatients. Psychological Reports, 90, 355–359. doi:10.2466/pr0.90.2.355-359
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure
in American life. New York: Free Press.
Hicks, K. L., Harrison, T. L., & Engle, R. W. (2015). Wonderlic, working memory
capacity, and fluid intelligence. Intelligence, 50, 186–195.
doi:10.1016/j.intell.2015.03.005
Highhouse, S. (2008). Stubborn Reliance on Intuition and Subjectivity in Employee
Selection. Industrial and Organizational Psychology, 1(3), 333–342.
doi:10.1111/j.1754-9434.2008.00058.x
Horn, J. L. (1965a). A rationale and test for the number of factors in factor analysis.
Psychometrika, 30(2), 179–185. doi:10.1007/bf02289447
Horn, J. L. (1965b). Fluid and crystallized intelligence: A factor analytic study of the
structure among primary mental abilities (Unpublished doctoral dissertation).
University of Illinois, IL.
Horn, J. L., & Cattell, R. B. (1966). Refinement and test of the theory of fluid and
crystallized intelligence. Journal of Educational Psychology, 57(5), 253–270.
doi:10.1037/h0023816
Horn, J. L., & Cattell, R. B. (1967). Age differences in fluid and crystallized
intelligence. Acta Psychologica, 26, 107–129. doi:10.1016/0001-
6918(67)90011-X
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, Detection and
Amelioration of Adverse Impact in Personnel Selection Procedures: Issues,
Evidence and Lessons Learned. International Journal of Selection and
Assessment, 9(1–2), 152–194. doi:10.1111/1468-2389.00171
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 59
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling: A Multidisciplinary Journal, 6(1), 1–55.
doi:10.1080/10705519909540118
Hunt, E. (2011). Human intelligence. Cambridge, England: Cambridge University
Press. doi:10.1017/cbo9780511781308
Hunter, J. E., & Schmidt, F. L. (1996). Intelligence and job performance: Economic and
social implications. Psychology, Public Policy, and Law, 2(3–4), 447–472.
doi:10.1037/1076-8971.2.3-4.447
Jacoby, L. L., Toth, J. P., & Yonelinas, A. P. (1993). Separating conscious and
unconscious influences on memory: Measuring recollections. Journal of
Experimental Psychology: General, 122(2), 139–154.
Jaeggi, S. M., Studer-Luethi, B., Buschkuehl, M., Su, Y.-F., Jonides, J., & Perrig, W. J.
(2010). The relationship between n-back performance and matrix reasoning —
implications for training and transfer. Intelligence, 38(6), 625–635.
doi:10.1016/j.intell.2010.09.001
Jäger A. O., & Althoff, K. (1983). Der WILDE-Intelligenz-Test (WIT): Ein
Strukturdiagnostikum: Handanweisung. Göttingen: Hogrefe.
Jenkins, J., & Leung, C. (2013). English as a Lingua Franca. The Companion to
Language Assessment, 1605–1616. doi:10.1002/9781118411360.wbcla047
Jenner, E. (1807). Classes of the Human Powers of Intellect. The Artist, 19, 1–7.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger
Publishers/Greenwood Publishing Group.
Jeyakumar, S. L. E., Warriner, E. M., Raval, V. V., & Ahmad, S. A. (2004). Balancing
the Need for Reliability and Time Efficiency: Short Forms of the Wechsler
Adult Intelligence Scale-III. Educational and Psychological Measurement,
64(1), 71–87. doi:10.1177/0013164403258407
Jex, S. M., & Britt, T. W. (2008). Introduction to Organizational Psychology. In S. M.,
Jex & T. W. Britt (Eds.), Organizational psychology: A scientist–practitioner
approach (pp. 1–20). Hoboken, NJ: Wiley.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 60
Johnson, W., & Bouchard, T. J., Jr. (2005a). The structure of human intelligence: It is
verbal, perceptual, and image rotation (VPR), not fluid and crystallized.
Intelligence, 33(4), 393–416. doi:10.1016/j.intell.2004.12.002
Johnson, W., & Bouchard, T. J., Jr. (2005b). Constructive replication of the visual
perceptual image-rotation model in Thurstone’s (1941) battery of 60 tests of
mental ability. Intelligence, 33(4), 417–430. doi:10.1016/j.intell.2004.12.001
Johnson, W., & Bouchard, T. J. (2007). Sex differences in mental abilities: g masks the
dimensions on which they lie. Intelligence, 35(1), 23–39.
doi:10.1016/j.intell.2006.03.012
Johnson, W., Bouchard, T. J., Jr., McGue, M., Segal, N. L., Tellegen, A., Keyes, M., &
Gottesman, I. I. (2007). Genetic and environmental influences on the Verbal-
Perceptual-Image Rotation (VPR) model of the structure of mental abilities in
the Minnesota study of twins reared apart. Intelligence, 35(6), 542–562.
doi:10.1016/j.intell.2006.10.003
Johnson, W., te Nijenhuis, J., & Bouchard, T. J., Jr. (2007). Replication of the
hierarchical visual-perception-rotation model in De Wolff and Buiten’s (1963)
battery of 46 tests of mental ability. Intelligence, 35(3), 69–81.
doi:10.1016/j.intell.2006.05.002
Kaiser, H. F. (1960). The Application of Electronic Computers to Factor Analysis.
Educational and Psychological Measurement, 20(1), 141–151.
doi:10.1177/001316446002000116
Kaufman, A. S. (1990). Assessing adolescent and adult intelligence. Needham Heights,
MA: Allyn & Bacon.
Kaufman, A. S., & Kaufman, N. L. (1990). Manual for the Kaufman Brief Intelligence
Test. Circle Pines, MN: American Guidance Service.
Kaufman A. S., & Kaufman N. L. (1993). Manual: Kaufman Adolescent and Adult
Intelligence Test. Circle Pines, MN: American Guidance Service.
Kersting, M., Althoff, K., & Jäger, A. O. (2008). WIT-2. Der Wilde-Intelligenztest.
Verfahrenshinweise. Göttingen: Hogrefe.
Klimoski, R., Jones, R. G. (2008). Intuiting in the selection context. Industrial and
Organizational Psychology, 1(3), 352–354. doi:10.1111/j.1754-
9434.2008.00061.x
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 61
Kline, P. (2000). Handbook of psychological testing. London: Routledge.
Kline, P. (2002). An Easy Guide to Factor Analysis. London: Routledge.
Kline, R. B. (2005). Principles and practice of structural equation modeling. New
York: Guilford Press.
Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2013). On the Shortcomings of
Shortened Tests: A Literature Review. International Journal of Testing, 13(3),
223–248. doi:10.1080/15305058.2012.703734
Lang, J. W. B., Kersting, M., & Beauducel, A. (2016). Hierarchies of factor solutions in
the intelligence domain: Applying methodology from personality psychology to
gain insights into the nature of intelligence. Learning and Individual
Differences, 47, 37–50. doi:10.1016/j.lindif.2015.12.003
Lang, J. W. B., Kersting, M., Hülsheger, U. R., & Lang, J. (2010). General mental
ability, narrower cognitive abilities, and job performance: The perspective of the
nested-factors model of cognitive abilities. Personnel Psychology, 63, 595–640.
doi:10.1111/j.1744-6570.2010.01182.x
Legree, P. J., Pifer, M. E., & Grafton, F. C. (1996). Correlations among cognitive
abilities are lower for higher ability groups. Intelligence, 23(1), 45–57.
doi:10.1016/s0160-2896(96)80005-5
Levy, P. (1968). Short-form tests: A methodological review. Psychological Bulletin, 69,
410–416. doi:10.1037/h0025736
Lautenschlager, G. J. (1989). A Comparison of Alternatives to Conducting Monte Carlo
Analyses for Determining Parallel Analysis Criteria. Multivariate Behavioral
Research, 24(3), 365–395. doi:10.1207/s15327906mbr2403_6
Lievens F., Reeve C. L. (2012). Where I-O Psychology Should Really (Re)start Its
Investigation of Intelligence Constructs and Their Measurement. Industrial and
Organizational Psychology, 5(2), 153–158. doi:10.1111/j.1754-
9434.2012.01421.x
Lubinski, D. (2000). Scientific and social significance of assessing individual
differences: ‘‘Sinking shafts at a few critical points.’’ Annual Review of
Psychology, 51(1), 405–444. doi:10.1146/annurev.psych.51.1.405
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 62
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and
determination of sample size for covariance structure modeling. Psychological
Methods, 1(2), 130–149. doi:10.1037/1082-989x.1.2.130
MacCann, C., & Roberts, R. D. (2008). New paradigms for assessing emotional
intelligence: Theory and data. Emotion, 8(4), 540–551. doi:10.1037/a0012746
Mackey, A. P., Hill, S. S., Stone, S. I., & Bunge, S. A. (2010). Differential effects of
reasoning and speed training in children. Developmental Science, 14(3), 582–
590. doi:10.1111/j.1467-7687.2010.01005.x
Mayer, J. D., Caruso, D. R., & Salovey, P. (1999). Emotional Intelligence Meets
Traditional Standards for an Intelligence. Intelligence, 27(4), 267–298.
doi:10.1016/S0160-2896(99)00016-1
Mayer, J. D., Salovey, P., & Caruso, D. R. (2002). Mayer-Salovey-Caruso Emotional
Intelligence Test. Toronto, Canada: Multi-Health Systems.
Mayer, J. D., Roberts, R. D., & Barsade, S. G. (2008). Human Abilities: Emotional
Intelligence. Annual Review of Psychology, 59(1), 507–536.
doi:10.1146/annurev.psych.59.103006.093646
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing
on the shoulders of the giants of psychometric intelligence research. Intelligence,
37(1), 1–10. doi:10.1016/j.intell.2008.08.004
Miles, A., & Sadler-Smith, E. (2014). “With recruitment I always feel I need to listen to
my gut”: the role of intuition in employee selection. Personnel Review, 43(4),
606–627. doi:10.1108/pr-04-2013-0065
Mulder, J. L., Dekker, R., & Dekker P. H. (2004). Kaufman – Intelligentietest voor
Adolescenten en Volwassenen: Nederlandstalige bewerking van de Kaufman
Adolescent and Adult Intelligence Test: Handleiding. Leiden: PITS
Testuitgeverij.
Murphy, K. (1999). The challenge of staffing a post-industrial workplace. In D. Ilgen, &
E. Pulakos (Eds.), The changing nature of work performance: Implications for
staffing, personnel actions, and development (pp. 295–324). San Francisco:
Jossey-Bass.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 63
Murphy, K. (2010). How a broader definition of the criterion domain changes our
thinking about adverse impact. In J. Outtz (Ed.), Adverse impact (pp. 137–160).
San Francisco: Jossey-Bass.
Murphy, K. R., Cronin, B. E., & Tam, A. P. (2003). Controversy and consensus
regarding the use of cognitive ability testing in organizations. Journal of Applied
Psychology, 88(4), 660–671. doi:10.1037/0021-9010.88.4.660
Naglieri, J. A. (1985). Matrix Anologies Test – Short Form: Examiner’s Manual. San
Antonio, TX: Psychological Corporation.
Neisser, U. (1979). The Concept of Intelligence. Intelligence, 3(3), 217–227.
doi:10.1016/0160-2896(79)90018-7
Neisser, U., Boodoo, G., Bouchard, T. J., Jr., Boykin, A. W., Brody, N., Ceci, S. J.,
Halpern, D. F., Loehlin, J. C., Perloff, R., Sternberg, R. J., & Urbina, S. (1996).
Ingelligence: Knowns and Unknowns. American Psychologist, 51(2), 77–101.
doi:10.1037/0003-066x.51.2.77
Nisbett, R. E., Aronson, J., Blair, C., Dickens, W., Flynn, J., Halpern, D. F., &
Turkheimer, E. (2012). Intelligence: New Findings and Theoretical
Developments. American Psychologist, 67(2), 130–159. doi:10.1037/a0026699
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.).
New York: McGraw-Hill.
Partchev, I., De Boeck, P., & Steyer, R. (2011). How Much Power and Speed Is
Measured in This Test? Assessment, 20(2), 242–252.
doi:10.1177/1073191111411658
Patil, V. H., Singh, S. N., Mishra, S., & Todd Donavan, D. (2007). Parallel Analysis
Engine to Aid Determining Number of Factors to Retain [Computer software].
Retrieved from http://smishra.faculty.ku.edu/parallelengine.htm
Pearson. (2012). White paper WAIS-IV-NL: Psychometrische eigenschappen: Deel 2
van 3. The Netherlands, Amsterdam: Pearson.
Ployhart, R. E., Holtz, B. C. (2008). The diversity–validity dilemma: Strategies for
reducing racioethnic and sex subgroup differences and adverse impact in
selection. Personnel Psychology, 61(1), 153–172. doi:10.1111/j.1744-
6570.2008.00109.x
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 64
Psychological Corporation. (1999). Wechsler Abbreviated Scale of Intelligence. San
Antonio, TX: Pearson.
Prewett, P. N. (1995). A comparison of two screening tests (the Matrix Analogies Test –
Short Form and the Kaufman Brief Intelligence Test) with the WISC-III.
Psychological Assessment, 7(1), 69–72. doi:10.1037/1040-3590.7.1.69
R Core Team (2015). R: A language and environment for statistical computing. Austria,
Vienna: R Foundation for Statistical Computing.
Raven, J. C. (1938). Progressive Matrices: A perceptual test of intelligence. London:
H.K. Lewis.
Raven, J., Raven, J. (Eds.). (2008). Uses and abuses of intelligence: Studies advancing
Spearman and Raven’s quest for non-arbitrary metrics. New York: Royal
Fireworks Press.
Raykov, T., Dimitrov, D. M., & Asparouhov, T. (2010). Evaluation of Scale Reliability
With Binary Measures Using Latent Variable Modeling. Structural Equation
Modeling: A Multidisciplinary Journal, 17(2), 265–279.
doi:10.1080/10705511003659417
Reeve, C. L. (2004). Differential ability antecedents of general and specific dimensions
of declarative knowledge: More than g. Intelligence, 32(6), 621–652.
doi:10.1016/j.intell.2004.07.006
Reeve, C. L., & Bonaccio, S. (2011). The nature and structure of ‘‘intelligence.’’ In T.
Chamorro- Premuzic, A. Furnham, & S. von Stumm (Eds.), Handbook of
Individual Differences (pp. 187–216). Oxford, England: Wiley-Blackwell.
Reeve, C. L., Scherbaum, C., & Goldstein, H. (2015). Manifestations of intelligence:
Expanding the measurement space to reconsider specific cognitive abilities.
Human Resource Management Review, 25(1), 28–37.
doi:10.1016/j.hrmr.2014.09.005
Resnick, L. B. (1976). The Nature of Intelligence. Hillsdale, N.J.: Lawrence Erlbaum
Associates.
Rindler, S. E. (1979). Pitfalls in assessing test speededness. Journal of Educational
Measurement, 16(4), 261–270. doi:10.1111/j.1745-3984.1979.tb00107.x
Rosseel, Y. (2012). lavaan : An R Package for Structural Equation Modeling. Journal of
Statistical Software, 48(2). doi:10.18637/jss.v048.i02
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 65
Rust, J., & Golombok, S. (2014). Modern psychometrics: The science of psychological
assessment. New York: Routledge.
Rynes, S. L., Colbert, A. E., & Brown, K. G. (2002). HR professionals’ beliefs about
effective human resource practices: Correspondence between research and
practice. Human Resources Management, 41, 149–174.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing
in employment, credentialing, and higher education: Prospects in a post-
affirmative-action world. American Psychologist, 56(4), 302–318.
doi:10.1037/0003-066x.56.4.302
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score
adjustment in preemployment testing. American Psychologist, 49(11), 929–954.
doi:10.1037/0003-066x.49.11.929
Salthouse, T. A. (1996). The processing-speed theory of adult age differences in
cognition. Psychological Review, 103(3), 403-428. doi:10.1037/0033-
295X.103.3.403
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for
moment structure analysis. Psychometrika, 66(4), 507–514.
doi:10.1007/bf02296192
Scherbaum, C. A., & Goldstein, H. W. (2015). Intelligence and the modern world of
work. Human Resource Management Review, 25(1), 1–3.
doi:10.1016/j.hrmr.2014.09.002
Scherbaum, C. A., Goldstein, H. W., Yusko, K. P., Ryan, R., & Hanges, P. J. (2012).
Intelligence 2.0: Reestablishing a Research Program on g in I-O Psychology.
Industrial and Organizational Psychology, 5(2), 128–148. doi:10.1111/j.1754-
9434.2012.01419.x
Schipolowski, S., Schroeders, U., & Wilhelm, O. (2014). Pitfalls and Challenges in
Constructing Short Forms of Cognitive Ability Measures. Journal of Individual
Differences, 35(4), 190–200. doi:10.1027/1614-0001/a000134
Schlegel, K., Grandjean, D., & Scherer, K. R. (2014). Introducing the Geneva Emotion
Recognition Test: An example of Rasch-based test development. Psychological
Assessment, 26(2), 666–672. doi:10.1037/a0035246
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 66
Schmidt, F. L. (2002). The Role of General Cognitive Ability and Job Performance:
Why There Cannot Be a Debate. Human Performance, 15(1), 187–210.
doi:10.1207/s15327043hup1501&02_12
Schmidt, F. L., & Hunter, J. E. (1992). Development of a Causal Model of Processes
Determining Job Performance. Current Directions in Psychological Science,
1(3), 89-92. doi:10.1111/1467-8721.ep10768758
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in
personnel psychology: Practical and theoretical implications of 85 years of
research findings. Psychological Bulletin, 124(2), 262–274. doi:10.1037/0033-
2909.124.2.262
Schmidt, F. L., & Hunter, J. (2004). General Mental Ability in the World of Work:
Occupational Attainment and Job Performance. Journal of Personality and
Social Psychology, 86(1), 162–173. doi:10.1037/0022-3514.86.1.162
Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and
ability on job knowledge, work sample performance, and supervisory ratings of
job performance. Journal of Applied Psychology, 71(3), 432–439.
doi:10.1037/0021-9010.71.3.432
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment,
8(4), 350–353. doi:10.1037/1040-3590.8.4.350
Schneider, W. J., Joel, W., & Newman, D. A. (2015). Intelligence is multidimensional:
Theoretical review and implications of specific cognitive abilities. Human
Resource Management Review, 25(1), 12–27. doi:10.1016/j.hrmr.2014.09.004
Shepard, L. A. (1993). Evaluating Test Validity. Review of Research in Education, 19,
405. doi:10.2307/1167347
Sheppard, L. D., & Vernon, P. A. (2008). Intelligence and speed of information-
processing: A review of 50 years of research. Personality and Individual
Differences, 44(3), 535–551. doi:10.1016/j.paid.2007.09.015
Shevlin, M., Miles, J. N. V., Davies, M. N. O., & Walker, S. (2000). Coefficient alpha:
a useful indicator of reliability? Personality and Individual Differences, 28(2),
229–237. doi:10.1016/s0191-8869(99)00093-8
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 67
Sijtsma, K. (2009). Correcting Fallacies in Validity, Reliability, and Classification.
International Journal of Testing, 9(3), 167–194.
doi:10.1080/15305050903106883
Smith, G. T., Combs, J. L., & Pearson, C. M. (2012). Brief instruments and short forms.
In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J.
Sher (Eds.), APA handbook of research methods in psychology, Vol. 1 (pp. 395–
409). Washington, DC: American Psychological Association. doi:
10.1037/13619-021
Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the Sins of Short-Form
Development. Psychological Assessment, 12(1), 102–111. doi:10.1037//1040-
3590.12.1.102
Spearman, C. (1904). “General Intelligence,” Objectively Determined and Measured.
The American Journal of Psychology, 15(2), 201. doi:10.2307/1412107
Spearman, C. (1910). Correlation calculated with faulty data.British Journal of
Psychology, 1904-1920, 3(3), 271–295. doi:10.1111/j.2044-
8295.1910.tb00206.x
Spearman, C. (1927). The abilities of man. London: Macmillan.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Upper
Saddle River, NJ: Pearson Allyn & Bacon.
Tate, R. F. (1954). Correlation Between a Discrete and a Continuous Variable. Point-
Biserial Correlation. The Annals of Mathematical Statistics, 25(3), 603–607.
doi:10.1214/aoms/1177728730
Terpstra, D. E. (1996). The search for effective methods. HR Focus, 73, 16–17.
Thorndike, E. L., Bregman, E. O., Cobb, M. V., & Woodyard, E. (1926). The Relations
of Altitude to Width, Area, and Speed. In E. L. Thorndike, E. O. Bregman, M.
V. Cobb, & E. Woordyard (Eds.) The Measurement of Intelligence (pp. 388–
401). Columbia: Bureau of Publications Teachers College Columbia University.
doi:10.1037/11240-013
Thorndike E. L., Terman L. M., Colvin S. S., Freeman F. N., Pintner R., Ruml. B., &
Pressey. S. L. (1921). Intelligence and its measurement: A symposium. Journal
of Educational Psychology, 12(3), 123–147. doi:10.1037/h0076078
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 68
Thurstone, L. L. (1938). Primary mental abilities. Chicago: University of Chicago
Press.
Thurstone, L. L. (1947). Multiple factor analysis: A development and expansion of
vectors of the mind. Chicago: University of Chicago.
Thurstone, L. L., & Thurstone, T. G. (1941). Factorial studies of intelligence.
Psychometric monographs, 94(2).
Université De Genève. (2014). Measurement of emotion recognition ability and
emotional understanding [PDF]. Retrieved fromhttp://www.autumnschool-
emo-int.ugent.be
Universiteit Gent. (2014). Course Specifications: Psychodiagnostics II (H001330)
[PDF]. Retrieved from
http://studiegids.ugent.be/2014/EN/studiefiches/H001330.pdf
Van der Linden, W. J. (2011). Setting Time Limits on Tests. Applied Psychological
Measurement, 35(3), 183–199. doi:10.1177/0146621610391648
Van der Linden, W. J. (2011b). Test Design and Speededness. Journal of Educational
Measurement, 48(1), 44–60. doi:10.1111/j.1745-3984.2010.00130.x
Van Der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M.,
& Raijmakers, M. E. (2006). A dynamical model of general intelligence: the
positive manifold of intelligence by mutualism. Psychological review, 113(4),
842-861. doi:10.1037/0033-295x.113.4.842
Van Rooy, D. L., Viswesvaran, C., & Pluta, P. (2005). An Evaluation of Construct
Validity: What Is This Thing Called Emotional Intelligence? Human
Performance, 18(4), 445–462. doi:10.1207/s15327043hup1804_9
Ward, L. C., & Ryan, J. J. (1996). Validity and time savings in the selection of short
forms of the Wechsler Adult Intelligence Scale – Revised. Psychological
Assessment, 8(1), 69–72. doi:10.1037/1040-3590.8.1.69
Wechsler, D. (1949). Wechsler Intelligence Scale for Children. San Antonio, TX: The
Psychological Corporation.
Wechsler, D. (1955). Manual for the Wechsler Adult Intelligence Scale. Oxford,
England: Psychological Corporation.
Wechsler, D. (1967). Manual for the Wechsler Preschool and Primary Scale of
Intelligence. New York: Psychological Corporation.
PSYCHOMETRIC QUALITIES OF A 12-MINUTE IQ TEST 69
Wechsler, D. (1975). Intelligence Defined and Undefined: A Relativistic Appraisal.
American Psychologist, 30(2), 135–139. doi:10.1037/h0076868
Wechsler, D. (2008). Wechsler adult intelligence scale – Fourth Edition (WAIS-IV). San
Antonio, TX: NCS Pearson.
Williams, P. E., Weiss, L .G, & Rolfhus, E. L. (2003). WISC-IV Technical Report #2
Psychometric Properties. San Antonio, TX: The Psychological Corporation.
Wonderlic, E. F. (1973). Wonderlic Personnel Test: Manual. Los Angeles: Western
Psychological Services.
Wonderlic, E. F., Wonderlic, C. F. (1992). Wonderlic Personnel Test user’s manual.
Libertyville, IL: Wonderlic Personnel Test, Inc.
Wymer, J. (2003). Utility of a clinically derived abbreviated form of the WAIS-III.
Archives of Clinical Neuropsychology, 18(8), 917–927. doi:10.1016/s0887-
6177(02)00221-4
Ziegler, M., Kemper, C. J., & Kruyen, P. (2014). Short Scales – Five
Misunderstandings and Ways to Overcome Them. Journal of Individual
Differences, 35(4), 185–189. doi:10.1027/1614-0001/a000148
Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the
number of components to retain. Psychological Bulletin, 99(3), 432–442.
doi:10.1037/0033-2909.99.3.432