8/3/2019 2004073P
1/4
GENERALISABILITY OF THE PSYCHOMETRIC PROPERTIES
OF A PILOT SELECTION BATTERY
Agnès Kokorian
People Technologies, [email protected]
&
Colin Valsler,
Psytech Ltd, [email protected]
The Pilot Aptitude (PILAPT) system
The development of the PILAPT computer-based system grew out of the meta-analysis
reported by Hunter and Burke (1994), and the design principles have been described by
Burke, Kitching and Valsler (1994). In summary, these design principles were as follows:
• that the test designs be based on clearly understood measures of individual differences
  that research has shown are relevant to pilot performance, either in training or in
  operations. As such, PILAPT had to cover both handling skills (as required in ab
  initio training) and CRM competencies (such as situational awareness and capacity).
• that the test designs should assume no prior knowledge of flying, but should have
  links to key pilot performance factors that are intuitive to both candidates and users.
• that the test designs should allow for practice to avoid the influence of prior
  experience of video games and give all candidates a level playing field to demonstrate
  their potential.
• that the overall battery should be efficient and avoid redundancy and nugatory
  assessments.
Design work on PILAPT began in 1994 and has continued with new tests and new
scoring algorithms over the nine years since. Beginning with ab initio selection for the Royal
Air Force (RAF) University Air Squadrons, PILAPT has been evaluated through data
provided by air forces in Chile, Denmark, Portugal, Sweden, Norway and Italy as well as
civilian airlines and training schools in the UK, Europe and Asia.
PILAPT is a fully automated test delivery system built on the TEKS technology
developed by Psytech Ltd. The system caters for all aspects of the testing process, from
candidate log-on (including the capture of biographical data) through instructions, test administration,
test scoring, analysis of candidate performance and reporting, to data transfer to other systems.
The system has crash recovery and networking capabilities.
The PILAPT battery of tests developed to date includes:
• Hands (10 minutes) - the ability to process oral (verbal) rules to execute a visual
  task quickly and accurately; related to absorbing and using oral information (e.g. radio
  information) under pressure
• Patterns (10 minutes) - the ability to ignore distracting information in order to make
  quick and accurate decisions under time pressure; related to maintaining focus on
  critical information when confronted with ambiguous situations and pressure
• Concentration (8 minutes) - the ability to maintain focus on a primary task when the
  conditions for that task are constantly changing; related to maintaining situational
  awareness
• Deviation Indicator (7 minutes) - the ability to compensate for deviations in flight
  parameters, with a look-and-feel based on the flight path deviation indicator (FPDI);
  related to basic handling skills
• Trax (5 minutes) - a pursuit tracking task requiring the candidate to work in a
  3-dimensional environment; related to advanced aircraft control
In addition to the tests above, which were driven primarily by ab initio requirements and
take around 40 minutes in total, the PILAPT battery has been extended to include a
mini-test battery named Capacity, designed to assess performance under increasing workload.
Capacity takes around 15 minutes to complete and comprises a primary handling task and two
secondary tasks involving visual and auditory information. Tasks are administered and
measured under a combination of single, dual and triple task load conditions, and the impact
of increased workload on the candidate's performance is then analysed and reported using a
display similar to that shown in Figure 1 below. The data shown are average performance
scores for Swedish fighter pilot applicants.
[Figure: chart of mean scores (vertical axis: Mean) for the variables DI4SINGL, DI4DUAL and DI4TRIPL, shown under SINGLE, DUAL and TRIPLE task load. Annotations: "Capacity under single task load"; "Capacity under triple task load"; "How much capacity does the candidate retain as workload increases?"]
Figure 1: Overview of what the PILAPT Capacity mini-battery measures
Reliability and construct validity evidence supporting PILAPT
This section of the paper provides a summary of the data collected on the PILAPT
tests to date in military contexts. Given that different tests are at different stages in the
development cycle, the evidence provided varies across PILAPT tests reflecting the iterative
cycle of development since 1994. The evidence is presented in three parts in line with
recommendations from professional bodies such as the American Psychological Association
(APA), British Psychological Society (BPS) and the International Test Commission (ITC).
First, evidence of test reliability (associated with accuracy and stability of scores) is presented
and followed by results from studies involving other marker tests of pilot aptitude (construct
validity). Papers two (Calanna and Serusi) and three (Kokorian, Valsler and Cabrera) present
criterion validity data.
Reliability
The standard recommendation for the level of reliability required for tests used in
selection is a minimum coefficient of 0.7 (this in effect states that 70% of the variation in test
scores is true variation as intended in the test's design). Table 1 summarises the results of
reliability (internal consistency) analyses across various country and organisational sites using
the Schmidt-Hunter meta-analysis model (Hunter and Schmidt, 1990). DI and Trax are not
included in Table 1 as internal consistency estimates of reliability are not suitable for these
tests. Data on their test-retest reliability are given below. Table 1 contains two versions of
Hands, a longer 40-item version and a shorter 25-item version.
Source                   Local Sample   Hands   Patterns   Concentration
Chile                    370            0.89    0.60
Denmark                  1,212          0.90    0.69
Italy                    108            0.87    0.72       0.76
Norway                   232            0.92    0.71
Portugal                 1,218          0.93    0.73
Sweden                   762            0.94    0.71       0.83 (N=430)
UK                       585            0.91
Total Sample Size (N)                   4,487   3,902      538
Sample Weighted Mean                    0.92    0.70       0.82
90% Credibility                         0.89    0.66       0.79
Table 1: Reliability results for PILAPT tests across various national sites
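The "Sample Weighted Mean" row of Table 1 can be checked directly from the site-level figures. The sketch below is an illustration only: it assumes the row is the N-weighted average of the site coefficients (the first step of the Schmidt-Hunter approach), and it does not attempt to reproduce the 90% credibility values, which depend on corrections not described in this paper. The function and variable names are hypothetical.

```python
# Site-level internal-consistency estimates from Table 1, as (N, reliability) pairs.
# Concentration for Sweden uses the N=430 subsample noted in the table.
hands = [(370, 0.89), (1212, 0.90), (108, 0.87), (232, 0.92),
         (1218, 0.93), (762, 0.94), (585, 0.91)]
patterns = [(370, 0.60), (1212, 0.69), (108, 0.72), (232, 0.71),
            (1218, 0.73), (762, 0.71)]
concentration = [(108, 0.76), (430, 0.83)]


def sample_weighted_mean(sites):
    """Return (total N, N-weighted mean reliability) for a list of (N, r) pairs."""
    total_n = sum(n for n, _ in sites)
    mean_r = sum(n * r for n, r in sites) / total_n
    return total_n, mean_r


for name, sites in [("Hands", hands), ("Patterns", patterns),
                    ("Concentration", concentration)]:
    total_n, mean_r = sample_weighted_mean(sites)
    print(f"{name}: N={total_n}, weighted mean={mean_r:.2f}")
```

Under this assumption the totals (4,487, 3,902 and 538) and the weighted means (0.92, 0.70 and 0.82) agree with the values reported in Table 1.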
In addition to these results, a test-retest (stability) study was conducted in 1995 for the
RAF UAS (N=109). This study had a four-month interval between test administrations and
yielded reliabilities of 0.80 for DI, 0.84 for Trax and 0.77 for Hands, and an overall test-retest
reliability of 0.91 for the sum of these three PILAPT test scores. Taken together, these data show
that, on average, PILAPT tests meet or exceed the minimum requirement of 0.7 reliability for use
in pilot selection. As an overall composite score for use in selection decisions, the PILAPT
battery offers a reliability of 0.9 and above.
Construct validity
This section presents the results of a study conducted in Denmark involving four
PILAPT tests (DI, Hands, Patterns and Trax) and a 15-test battery used to assess both
aircrew and ATC aptitudes. Data were available across all 19 tests for a sample of 632
applicants. The content of the 15-test battery was classified according to test content in line
with the classifications used by Hunter and Burke in their meta-analysis. This classification
then provides a direct test of the extent to which PILAPT is measuring pilot relevant predictor
constructs. The results are shown in Table 2.
Test Group                    DI     Hands   Patterns   Trax    Overall
Mathematical Reasoning        .12    .31     .37        .06     0.44
Numerical Speed & Accuracy    .11    .29     .25        .03     0.35
Language                      .18    .14     .20        .08     0.29
General Reasoning             .18    .33     .51        .13     0.57
Spatial                       .24    .38     .38        .17     0.53
Mechanical                    .27    .35     .40        .29     0.55
Memory                        .05    .23     .13        -.09    0.27
Notes:
Overall column gives the regression of the Test Group onto the 4 PILAPT tests
Correlations in bold and italicised are significant at the 0.01 level
Table 2: Results for 632 Danish military applicants
Hunter and Burke identified the following predictor constructs as being the most
consistent and substantial predictors of pilot training success: perceptual speed, mechanical
reasoning, spatial reasoning, psychomotor and simulation based tests. The Danish data set did
not contain psychomotor or simulation based tests, but the results clearly show that PILAPT
is tapping the other predictor constructs identified by Hunter and Burke as critical to
predicting pilot training success. Corrected for average reliabilities in the Danish tests (0.8),
the overall regressions (furthest right hand column in Table 2) would range from 0.34 to 0.71.
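The paper does not state the correction formula used. The sketch below assumes the classical correction for attenuation, r / sqrt(rel_x * rel_y), with the average reliability of 0.8 applied to both sides of the relationship, which amounts to dividing each Overall value by 0.8. That assumption reproduces the reported range of roughly 0.34 to 0.71; correcting one side only (dividing by sqrt(0.8)) would not. The function and variable names are hypothetical.

```python
import math

# "Overall" values from Table 2 (each test group regressed onto the 4 PILAPT tests).
overall = {
    "Mathematical Reasoning": 0.44,
    "Numerical Speed & Accuracy": 0.35,
    "Language": 0.29,
    "General Reasoning": 0.57,
    "Spatial": 0.53,
    "Mechanical": 0.55,
    "Memory": 0.27,
}

# Assumption: the 0.8 average reliability is applied to both variables.
ASSUMED_RELIABILITY = 0.8


def correct_for_attenuation(r, rel_x, rel_y):
    """Classical correction for attenuation: r / sqrt(rel_x * rel_y)."""
    return r / math.sqrt(rel_x * rel_y)


corrected = {
    group: correct_for_attenuation(r, ASSUMED_RELIABILITY, ASSUMED_RELIABILITY)
    for group, r in overall.items()
}

for group, r_c in corrected.items():
    print(f"{group}: {overall[group]:.2f} -> {r_c:.4f}")

print(f"range: {min(corrected.values()):.4f} to {max(corrected.values()):.4f}")
```

The lowest value (Memory, 0.27) and the highest (General Reasoning, 0.57) give the two ends of the corrected range.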
Conclusion
This paper has summarised the R&D objectives as well as reliability and construct
validity evidence supporting the PILAPT battery. The consistency of the reliability estimates
across different national sites with different selection processes and different applicant
populations suggests that scores are generalisable across settings. Evidence of criterion
validity and transportability of validity is presented in the second and third papers of this
symposium.
References
Burke, E., Kitching, A., and Valsler, C. (1994). Computer-based assessment and the construction of
valid aviator selection tests. In N. Johnston, R. Fuller, and N. McDonald (Eds.). Aviation
Psychology: Training and Selection. Cambridge: Avery.
Hunter, D. R., and Burke, E. (1994). Predicting aircraft pilot-training success: A meta-analysis of
published research. International Journal of Aviation Psychology, 4, 297-313.
Hunter, D. R., and Burke, E. (1995). Handbook of Pilot Selection. Cambridge: Ashgate.
Hunter, J. E., and Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in
research findings. Newbury Park, CA: Sage.