Post on 11-Jan-2016
Unproctored Internet Testing Issues and an Applied Example
Rod McCloy, HumRRO
Presentation to Minnesota Professionals for Psychology at Work (MPPAW)
Solera Restaurant -- Minneapolis, MN
November 18, 2008
Overview
• UIT Defined
• Issues Concerning UIT
– Advantages
– Disadvantages
• An Applied Example
• Final Points
UIT Defined
• Not the same as “proctor-free testing”
– Non-traditional/alternative proctoring might be in effect
  • Video cameras
  • Verification testing
  • Analysis of response patterns
“Internet-based testing, completed by a candidate without a traditional human proctor”
- Tippins (in press)
Issues Concerning UIT
Advantages
• $aving$!!
– No travel costs for candidates
– No proctoring costs (hiring, training, travel)
– No hardware costs (purchase, distribution, maintenance)
– Lower test maintenance costs
• Administration advantages
– Consistency, accuracy
– Detailed response information (e.g., item timing, accurate test timing and scoring)
Advantages (cont.)
• Other advantages
– Speed (no waiting for appointments)
– Company image (cutting-edge technology)
– Expanded applicant base
  • Applicant tests at home on own time
– Proctors ain’t all they’re cracked up to be
  • Sometimes unskilled/untrained/unmotivated
Disadvantages
• Technology?
– Not so much these days, but some issues still (especially with clocks/timing)
• CHEATING!
• Verification testing?
– Evidence of cheating suggestive only, not certain
– Most depend on equivalent tests, adaptive testing, large item pools with known item parameters
– Sensitivity paramount with cheating notification
• Company image?
– Might applicants question the image of a company that offers such programs, given cheating potential is so obvious?
Disadvantages (cont.)
• Might not those prone to cheat . . .
– . . . do better on the test and thus get hired . . .
– . . . only to cheat on the job?
• Test environment
– Should provide environment for optimal performance
– But on-demand testing often occurs in environments full of distractions, leading to reduced test performance
Disadvantages (cont.)
• Ethics
– Some believe UIT is unethical (Pearlman, in press; Tippins, in press)
  • Cite Principle 9.09 of the Ethics Code (2002)
– Relation of given score to norms?
“Psychologists who offer assessment or scoring services to other professionals accurately describe the purpose, norms, validity, reliability, and applications of the procedures and any special qualifications applicable to their use.”
• Ethics
– Some cite Principle 9.01 of the Ethics Code to support use of UIT
– In essence, they interpret this to say, “If you’ve documented/explained UIT’s effects, you’re golden!”
Disadvantages (cont.)
“…psychologists provide opinions of the psychological characteristics of individuals only after they have conducted an examination of the individuals adequate to support their statements or conclusions. When, despite reasonable efforts, such an examination is not practical, psychologists document the efforts they made and the result of those efforts, clarify the probable impact of their limited information on the reliability and validity of their opinions, and appropriately limit the nature and extent of their conclusions or recommendations.”
Disadvantages (cont.)
• What do the Standards (1999) and Principles (2003) say?
Surprisingly little
Standard 5.2: “Modifications or disruptions of standardized test administration procedures or scoring should be documented.”
Standard 5.7: “Test users have the responsibility of protecting the security of test materials at all times.”
“For security reasons, the identity of all candidates should be confirmed prior to administration. Administrators should monitor the administration to control possible disruptions, protect the security of test materials, and prevent collaborative efforts by candidates. The security provisions, like other aspects of the Principles, apply equally to computer and Internet-administered sessions.”
Disadvantages (cont.)
• International Test Commission Guidelines on Computer-Based and Internet-Delivered Testing
– More specific
– Guideline 45.3 recommends verification testing for unproctored assessments
“For moderate and high stakes assessment (e.g., job recruitment and selection), where individuals are permitted to take a test in controlled mode (i.e. at their convenience in non-secure locations), those obtaining qualifying scores should be required to take a supervised test to confirm their scores (p. 55).”
Disadvantages (cont.)
“It is paradoxical that the concerns raised over risks of cheating in UIT, despite the fact that cheating is also an issue in proctored assessment, have resulted in the development of technologies, policies and procedures that potentially make online UIT more secure than traditional paper-and-pencil proctored tests”
-- Bartram (in press)
Disadvantages per Pearlman
• Verification testing
– Adds steps (time/cost)
– Basically tells examinee, “You can’t trust UIT results”
• Validity
– If cheating reduces it, reduced applicant pool contains fewer people but not better ones
• Standardization
– UIT violates the daylights out of this
– Is it a test?
“a measure or procedure in which a sample of an examinee’s behavior in a specified domain is obtained, evaluated, and scored using a standardized process” (SIOP, 2003, p. 71, emphasis added)
Test Compromise
“Any Internet test that administers the same set of items to all examinees is asking to be compromised. At the very least, the items should be administered in a randomized order. It would be better yet to sample items from a reasonably large item pool.”
-- Tippins et al. (2006)
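The two safeguards in the quotation (randomized order, sampling from a large pool) can be sketched in a few lines. This is only an illustration: the pool size, the item IDs, and the function name `draw_form` are hypothetical and are not taken from any system described in this talk.

```python
import random

def draw_form(item_pool, n_items, seed=None):
    """Sample n_items distinct items from the pool, returned in random order."""
    rng = random.Random(seed)
    # random.sample draws without replacement; the result is already shuffled
    return rng.sample(item_pool, n_items)

# Hypothetical pool of 240 item IDs; each candidate receives a separate 9-item draw
pool = [f"item_{i:03d}" for i in range(240)]
form_a = draw_form(pool, 9, seed=1)
form_b = draw_form(pool, 9, seed=2)
```

With a large pool, two candidates comparing notes are unlikely to have seen the same items, which is the point of the quotation. An operational system would also track how often each item has been exposed.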
Tippins’s Five Camps
• UIT Never Acceptable
– ID of test taker, likelihood of cheating, validity of inferences from unproctored score, ethics
• UIT OK for Some Tests/Purposes
– Non-cognitive; personal development, practice testing
– Cheating can also occur in proctored setting
Tippins’s Five Camps (cont.)
• Prevent Cheating before It Happens
– Ways to prevent it (warnings, threats of retesting, honor statements), or to make UIT scores equal to proctored scores
– Ways to authenticate examinees’ identity, monitor their behavior, or halt testing if indications of cheating occur
• Detect Cheating via Verification Testing or Statistical Means
– Multinomial logit, compromise IRT, IPD
Tippins’s Five Camps (cont.)
• Accept UIT without Extraordinary Measures to Prevent/Detect Cheating
– Argue for its utility
– “Cheaper to accept costs of hiring a few cheaters than to spend the money to ensure accurate individual assessment”
• Camps not mutually exclusive!
An Applied Example
Client
• Procter & Gamble (P&G)
• Research-driven company
• More than 138,000 employees in 80 countries
Current P&G Selection System
• Step 1: Requirements-Based Prescreening
– Questions
  • Based on job requirements
  • Developed by the hiring manager and recruiter
  • Delivered online as part of initial job application
• Step 2: Online, Unproctored Measures
– Biographical data assessment
  • Measures KSAs for Management, Researcher, and Office Administration job families
– Adaptive cognitive ability screen
Current P&G Selection System (cont.)
• Step 3: Proctored Measures
– Paper-and-pencil cognitive ability measures
– Three structured behavioral interviews conducted by trained, calibrated interviewers
Project Goal
• Develop a new cognitive ability test
– Given via Internet
– Unproctored, on-demand
– Computer-adaptive
– New content
Why an Internet-Based CAT?
• Convenience and security
– Ever-increasing candidate volumes
  • 500,000+ applicants per year
– Item exposure control
• Greater accuracy per unit time
• Updated models of cognitive ability
– (e.g., Carroll, 1993)
• Applicant reactions
– Convenience of on-demand online administration
– Reduced assessment times
How Adaptive Question Administration Works
[Figure: four ability continua, each running from Low to High, tracing a candidate from Start to Stop as each question is answered right or wrong]
We then move them up or down the continuum based on whether they get the question right.
They get a harder question if they get it right and an easier question if they get it wrong.
We then keep presenting questions in this way until we are certain of the candidate’s true level of cognitive ability.
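The up/down logic described on this slide can be sketched as a simple interval-halving routine. This is a deliberately simplified illustration, not the CAT-ASVAB engine: an operational CAT picks items by IRT information and handles probabilistic responses. Here the candidate is assumed to answer correctly whenever the question is no harder than their true ability. The function name, the −3 to 3 scale, and the tolerance are all assumptions.

```python
def run_adaptive_test(answers_correctly, low=-3.0, high=3.0, tolerance=0.1):
    """Narrow in on ability by halving the plausible range after each answer.

    answers_correctly(difficulty) -> bool stands in for the candidate.
    Returns (ability_estimate, number_of_questions_asked).
    """
    n_questions = 0
    while high - low > tolerance:
        difficulty = (low + high) / 2.0  # next question sits mid-range
        n_questions += 1
        if answers_correctly(difficulty):
            low = difficulty   # right answer: move up, harder question next
        else:
            high = difficulty  # wrong answer: move down, easier question next
    return (low + high) / 2.0, n_questions

# Hypothetical candidate who gets every question at or below 1.2 right
true_ability = 1.2
estimate, n_asked = run_adaptive_test(lambda d: d <= true_ability)
```

The stopping rule plays the role of "until we are certain": testing ends once the remaining range is narrower than the tolerance.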
CAT-ASVAB Engine
• A major drawing point
– Developed by Irwin Hom and colleagues
– Undergirds DoD’s accessioning system
– Selects items, updates ability estimates, monitors exposure
• P&G was prepared to license it
• Required modification to support iCAT demands
– Calculating information on demand
– Priority given to precision vs. item exposure
– Resulted in psychometric support contract with DDI
To UIT or Not to UIT?
• Sources for determining decision
– Candidates
– Other companies and consultancies
– Scientific literature
Candidate Reactions
• Studies of Potential/Actual Candidates
– Study 1: 1,000+ university students, 20+ countries
  • Did not know P&G would use the process
– Study 2: Reactions from several thousand candidates via the career site
– Study 3: Part of test calibration research with 150,000+ P&G candidates
Candidates: Results
• UIT >> P&P, provided . . .
– UIT created greater flexibility in taking tests
– Faster responses on their progress/status
Industry
• Extant practice for UIT
• Not whether to use UIT, but how to . . .
– develop
– validate
– use
– . . . it most effectively
Literature: Best Practices
• Use of multiple supervised assessments to verify scores on our UIT biodata and cognitive assessments
• Partnership with consultancy and academic leaders in assessment research, development, and practice
• Extensive/rigorous translation of all assessments (40+ for the cognitive assessments) to ensure construct measurement rather than English language ability
• Concurrent validation of the tests with several thousand employees worldwide under unsupervised conditions for our UIT tools and supervised conditions for our supervised tests
• Cut scores that allow proper candidate flow to reduce false negative/positive results
• Development of advanced features for UIT biographical data assessment, including
– randomized item delivery
– cultural scoring algorithms that enable appropriate measurement of candidates’ fit with KSAs
Literature: Best Practices (cont.)
• Development of advanced features to maximize the precision and security of the UIT cognitive ability assessment
– Adaptive item delivery of a small subset of the total item pool
– Extensive item pools globally developed/screened with more than 150,000 candidates via IRT calibration and DIF assessment
– Item-level, globally calibrated timing to make cheating more difficult
– Ongoing live-item research to understand and detect item parameter drift as an indication of item compromise
– Ongoing item development and calibration
– Item exposure control algorithms
– Parallel item pools that can be interchanged, making it difficult for candidates to acquire test content
Literature: Best Practices (cont.)
• Instructions to candidates that
– all responses should be made without any help from others
– online test results would be verified with additional assessments under supervised conditions
• Development of an assessment portal system to manage access, consistent messaging, and delivery of multiple UITs to all candidates globally
• Development of enhanced web-based delivery systems that minimize the ability of candidates to game the assessments for more time or capture the test questions
Conclusion
• This set of activities/requirements demonstrates the rigor, resources, and cost required to effectively research, develop, and validate noncognitive and cognitive UIT measures
• Reproduction of this effort would require
– Access to tens of thousands of respondents (both candidates and employees)
– A multi-million dollar budget
– Extensive I-O psychology and technology resources for specifying, designing, programming, and maintaining the online systems
• 4 years from design to implementation
Benefits to Candidates
• On-Demand, 24/7 Access to Assessments
– Candidates manage the testing environment with administration timing that works for them
• Faster Status Updates
– Candidates want to know if they are moving forward in the application process
– Primary request of P&G in the candidates’ reactions research
• Standardized Assessment Delivery
– No issues related to human proctoring of assessments
– Ensures all candidates receive the same test instruction, preparation
• Source-Independent Application Consideration
– Source = university campus, career conference, etc.
– Not true for paper-based system (access to candidates primarily based on source)
• Language Choice
– Not managed by recruiter who may default to English
Benefits to P&G
• Consistent Candidate Management Across Geographies, Business Units
– Increasingly important requirement as applications have grown from 25,000 in 1999 to 500,000+ in 2007
• Casting of Wider Talent Net
– Permitting outreach to more diverse candidate pool
• Management of Multi-Hurdle, Multi-Assessment Selection System
– Provides more holistic, complete assessment of candidates’ fit with KSAs
• Controlled Messaging
– Deliver messaging to candidates (and consumers) in a way that builds overall Company equity
• Assessment Efficiency
– Reduced screening times and costs for up-front assessments
– Real-time score results delivered to recruiters
Benefits to P&G
• Role Enrichment
– For recruiters and employees who otherwise would have spent time delivering tests
• Improved Cognitive Test Item Security
• Consideration of Candidates’ Cultural Backgrounds
– Used when scoring noncognitive assessments
Payoff
• No formal utility analysis yet
• Have seen significant cost reductions, improved process efficiencies
• Example: P&G will deliver approx. 10,000 fewer supervised paper-and-pencil cognitive tests in Japan alone this year
– New process isolates top talent faster with less cost and fewer resources (internal, external)
• Favorably reviewed in audits by government agencies
– Have stood up to legal challenges
• Bottom Line: In an era of reduced availability of talent and greater competition for it, we believe this system offers a defensible competitive advantage for P&G.
Participating Parties
Procter & Gamble
• Robert E. Gibby
• Andrew M. Biga
• Angela K. Pratt
• Jennifer L. Irwin
Michigan State University
• Anthony S. Boyce
HumRRO
• Rod McCloy
• Suzanne Tsacoumis
• Dan J. Putka
• Irwin Hom
• Andrea Sinclair
• Matt Trippe
University of Minnesota
• Stephan Dilchert
• Deniz S. Ones
Logos Corporation
• Magda Colberg
DDI
• Kirk Fischer
• Evan Sinar
Bowling Green State University
• Alison Broadfoot
Project Activities
• Analyze data from extant P&G tests
• Identify content domains for tests
• Develop test content
– NR – drawn from current tests, WestEd
– FR – new test; Ones/Dilchert; review
– LBR – new test; Colberg/HumRRO
• Identify scoring/stopping/exposure rules
• Conduct calibration study (numerous 9-item forms)
• Establish item timings
• Construct item pools
• Support parallel DDI platform effort (Irwin)
• Conduct QA of testing platform
• Contribute enhanced functionality (DIF, IPD)
• Develop criterion scales for validation study
Test Content
Numerical Reasoning
• Analyze and solve business-related problems involving numerical information and data
• Strongest historical P&G subtest
• Displayed best IRT properties among extant content
Sample Questions:
A mixture is created by combining chemicals A, B, and C in the proportion 3:1:2. In combining the chemicals, 2 additional gallons of chemical C were accidentally added to 15 gallons of chemical A and 5 gallons of chemical B. How many additional gallons of chemicals A and B must be added to accurately produce the mixture?
O 3 gallons of A and 1 gallon of B
O 2 gallons of A and 1 gallon of B
O 3 gallons of A and 2 gallons of B
O 5 gallons of A and 3 gallons of B
O 4 gallons of A and 2 gallons of B
The annual sales target in millions of units sold is 2.500 for Product B. The sales target for Product A is 90% that for Product B. This year, the two Products together exceeded their combined sales targets by 30%. What were the combined sales of Products A and B this year in millions of units?
O 4.275
O 5.175
O 5.700
O 5.850
O 6.175
Intermediaries play a critical role in the distribution of a product. Intermediaries are classified as either wholesalers or retailers. Wholesalers sell a product to another business, which in turn resells the product to the final consumer. Retailers, on the other hand, sell a product directly to the final consumer. A company’s decision to use wholesalers or retailers depends on a variety of factors, including cost, target markets, and the nature of the product.
From the information given above, it can be validly concluded that:
There are at least some intermediaries that are neither wholesalers nor retailers.
A wholesaler does not sell a product directly to the final consumer.
Products directly sold to the final consumer are not sold via retailers.
If an organization is not an intermediary, then it is not a wholesaler.
• Using information known to be true to solve problems by logically deducing the valid conclusions
• Directly measures a candidate’s ability to quickly synthesize information
• Top-down process
Deductive Thinking Style – Logic-Based Reasoning
A B C D E
• Solving problems by determining what information is known when presented novel problems
• Assesses capacity to:
  • think outside of the box
  • uncover relations
  • apply learning of new relations to solve the novel problem at hand
• Culture-fair test content to drive global diversity
• Bottom-up process
• Serves as Reasoning Screen
Inductive Thinking Style – Figural Reasoning
As of Today . . . .
[Figure: rollout timeline with milestones By July 2007, By July 2008, and By July 2009, captioned “Menu of choices with more efficiency built in.” Each stage pairs the Online Cognitive Screen (unsupervised) with the Paper & Pencil Reasoning Test (supervised); one stage offers a Computer-based Reasoning Test (supervised) as an alternative (- OR -) to the Paper & Pencil Reasoning Test.]
Final Points
• Pandora’s out and dancing
• Proper use could save much money
• Proper development will cost much money
• Savings
– Remember that utility is a function of validity as well as cost
• User buy-in important component
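The point that utility depends on validity as well as cost can be made concrete with the standard Brogden-Cronbach-Gleser utility expression (gain = number hired × validity × SD of job performance in dollars × mean standardized predictor score of those hired, minus testing cost). The talk names no specific utility model, so using this one is an assumption, and every figure below is hypothetical.

```python
def selection_utility(n_hired, validity, sd_y, mean_z_hired, n_tested, cost_per_test):
    """One-year utility gain in dollars (Brogden-Cronbach-Gleser form)."""
    gain = n_hired * validity * sd_y * mean_z_hired
    cost = n_tested * cost_per_test
    return gain - cost

# Hypothetical comparison: the unproctored test is 10x cheaper per candidate,
# but cheating is assumed to lower its validity from .40 to .30
proctored = selection_utility(100, 0.40, 10_000, 1.0, 1_000, 50.0)
unproctored = selection_utility(100, 0.30, 10_000, 1.0, 1_000, 5.0)
```

Under these made-up numbers the cheaper test yields the lower utility, because the validity loss outweighs the savings. With a smaller validity loss the comparison reverses, which is why both terms need to be estimated.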
Final Points (cont.)
• Multiple-hurdle systems likely more hospitable environs for cognitive UITs
• Yet to be a legal challenge
– No precedent
– Updated Guidelines, Standards, and Principles need to address UIT
– Arm yourselves for bear
• Many potential issues (test conditions, verification decisions)
• Documentation critical
• Not for those who want something “quick and dirty”
Thank you!!
P&G’s History of Assessing Cognitive Ability
Test          Years in Use
Form M (1)    ’58 to 12/62
Form M (2)    1/63 to 9/65
Form M (3)    10/65 to 10/67
Form M (4)    11/67 to 1975
Form M (5)    1975 to 9/79
Form M (6)    V1: 6/79 to 1/80; V2: 2/80 to 12/83; V3: 1/84 to 9/89
Form M (7)    9/89 to 9/91
PST-91        10/91 to ’98
A&T PST       ’95 to ’98
PST-98        ’98 to present
[Test Content columns: Synonyms, Knowledge, Analogies, Reading Comprehension, Data Interpretation, Numerical Ability; per-test entries not shown]
Analysis of Current Test Content
[Figure: item characteristic curves for Item 1, Item 2, and Item 3; x-axis: Theta (-3 to 3); y-axis: P(1|θ) (0.0 to 1.0)]
• Reading, Data Interpretation
– Multiple items per stimulus
– Violates local independence assumption
• Mathematics
– Some items strong, but few with high b parameters
Solution
Retain Mathematics items
Write additional “high-b” content
Develop two new reasoning tests
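To illustrate what a “high-b” item means, here is a minimal sketch of an item characteristic curve. The slides do not state which IRT model was used; a logistic model with discrimination a, difficulty b, and an optional guessing parameter c is assumed here, and the parameter values are hypothetical.

```python
import math

def icc(theta, a, b, c=0.0):
    """Probability of a correct response at ability theta (3PL; c=0 gives 2PL)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

easy_item = dict(a=1.0, b=-1.0)
hard_item = dict(a=1.0, b=2.0)  # a "high-b" item

# Print P(correct) for a low-, average-, and high-ability candidate
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(icc(theta, **easy_item), 2), round(icc(theta, **hard_item), 2))
```

The b parameter is the ability level at which (with c = 0) a candidate has a 50% chance of answering correctly, so a pool with few high-b items measures high-ability candidates imprecisely.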
Calibration Study
• Item Calibration
• Item Pool Refinement
– Differential Item Functioning
  • Cross-culturally
  • Race
  • Gender
• Applicant Reactions
Calibration Study (continued)
• Participant Recruitment
1. Open invitation from pg.com career site
2. Invitation after completion of online Biodata assessment
3. Email invitation to “in-process” candidates
• Measures
– Demographics
– Applicant Reactions: Familiarity, Open-Ended Statements
– Cognitive Items
  • English only – Instructions in native language
  • 286 total 9-item forms – 1,300 items
  • Forms contain only a single type of item
• Sample Targets
– ~750 responses per form
– ~1,500 responses per item
– ~215,000 total respondents – Obtained ~140,000
Calibration Study: Sample
The research sample was to comprise approximately 215,500 respondents
This number was based on:
• the number of forms (286) to be delivered, and
• a requirement of 750 responses per form
– Yields ~1,500 responses to any one question

Subtest               # Questions   # Forms   Questions per Form   # Respondents   Responses per Form   Responses per Question
Inductive Reasoning   497           110       9                    83,000          755                  1,510
Deductive Reasoning   299           66        9                    49,500          750                  1,500
Numerical Reasoning   497           110       9                    83,000          755                  1,510
Total                                                              215,500
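The sample-size arithmetic in the table can be checked with a short script. Two assumptions are made here: each respondent completes exactly one form, and each question appears on two forms (inferred from responses per question being twice the responses per form). Neither is stated outright on the slide.

```python
# (number of forms, responses per form) as listed in the table
plan = {
    "Inductive Reasoning": (110, 755),
    "Deductive Reasoning": (66, 750),
    "Numerical Reasoning": (110, 755),
}
FORMS_PER_ITEM = 2  # assumption: every question is placed on two forms

# Respondents needed per subtest, assuming one form per respondent
respondents = {name: forms * per_form for name, (forms, per_form) in plan.items()}
# Responses collected per question under the two-forms-per-item assumption
per_question = {name: per_form * FORMS_PER_ITEM for name, (_, per_form) in plan.items()}
total_respondents = sum(respondents.values())
```

The script gives 83,050 respondents for each 110-form subtest and 215,600 overall, so the table's 83,000 and 215,500 appear to be rounded figures.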
Calibration Study: Sample (continued)
• This proposed sampling plan represents the first time that P&G has created a cognitive ability test using a global sample
– Prior cognitive ability tests dating back to the 1950s (the various PSTs and Form Ms) used exclusively US samples and then were transported globally
– The assumption was that these tests functioned the same way independent of location
• With the new test, a global sample will be more important than ever because:
– The performance of questions is statistically modeled
– Therefore, obtaining responses globally is the only way to ensure that the new iCAT will function properly for local candidates
• The global sample outlined above will be structured into the following four “regions”:
  • Asia – including AAI, NEA, and GC
  • Europe – including CEEMEA and WE
  • Latin America
  • North America
Previous Cognitive Research
Research for iCAT
Form Construction and Item Chaining
[Figure: 9-item forms drawn as rows of boxes. Each box represents 3 items. Overlap of 3 items between 2 forms is depicted by aligned white boxes. Middle items (blue boxes) are chained similarly in additional forms.]
Form Rationale
• Reduced load on any single participant = Increased participation
Chaining Rationale
• Calibrate all items of the same type along a common ability scale
[Flowchart: item field testing and live pilot]
• Field-Test (FT) Items (Applicant Site)
• Calculate item-total corrs., log-odds plots
• Sift items into good/bad (branches labeled GOOD and BAD)
• DIF/Rescale
• Administer Items (Live Pilot)
• Estimate new item parameters, Θs; seed one additional FT item as research proxy
• Use Θs for validation study
• Conduct DIF/IPD (branch labeled BAD)
Conduct DIF
[Flowchart: iterative DIF screening]
• Calibrate items within groups
• Equate groups’ TCCs
• Compare ICCs item-by-item
• Identify items with DIF (d2)
• New items with DIF?
– Yes: Remove those items with high DIF (high-DIF items held out of each subsequent equating iteration)
– No
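The "compare ICCs item-by-item" and "identify items with DIF" steps can be sketched as follows. This is only an illustration. It assumes both groups' parameters are already on a common scale (the equating step is not shown), it takes d2 to be the mean squared gap between the two groups' ICCs over a theta grid (the slide does not define it), and the threshold and item parameters are hypothetical.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve (model choice is an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def d2(params_ref, params_focal, grid):
    """Mean squared gap between two groups' ICCs over a theta grid."""
    gaps = [(icc(t, *params_ref) - icc(t, *params_focal)) ** 2 for t in grid]
    return sum(gaps) / len(grid)

def flag_dif(ref, focal, threshold=0.01):
    """Return the items whose d2 exceeds the (hypothetical) threshold."""
    grid = [x / 10.0 for x in range(-30, 31)]  # theta from -3 to 3
    return [item for item in ref if d2(ref[item], focal[item], grid) > threshold]

# Hypothetical (a, b) estimates per group; item_2 is harder for the focal group
ref = {"item_1": (1.0, 0.0), "item_2": (1.2, 0.5)}
focal = {"item_1": (1.0, 0.0), "item_2": (1.2, 1.5)}
flagged = flag_dif(ref, focal)
```

In the iterative procedure on the slide, flagged items would be held out, the groups re-equated on the remaining items, and the comparison repeated until no new items are flagged.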
APPLICANT REACTIONS
Applicant Reactions
• Trend towards increasingly sophisticated and distal pre-contact selection processes
• Linked to intentions to – accept job offers
– recommend the organization to others
– file legal complaints (Hausknecht, Day, & Thomas, 2004)
• Explored (quantitatively and qualitatively) applicant reactions to iCAT content
Figural Reasoning
Overall trends
• Applicants recognize figural reasoning as an IQ test
• Seen as difficult, but still received many positive reactions
F U N!!
Logic-Based Reasoning
• Overall trends
– Applicants commented on features of the test, showing a level of understanding (e.g., use of double negatives)
– Applicants compared this test content to several standardized tests
Numerical Reasoning
• Overall trends
– Applicants discussed the need for calculators and paper
– Specific categories were reported (algebra, fractions)
– Fractions were least-liked
Applicant Reactions: Conclusions
• Figural reasoning
– Received most positive feedback
– Was seen as both difficult and fun
– Least familiar content as reported by applicants
• Numerical reasoning
– Only content type with strong positive correlation with familiarity
• Overall, applicants had positive reactions to iCAT content
A mixture is created by combining chemicals A, B, and C in the proportion 3:1:2. In combining the chemicals, 2 additional gallons of chemical C were accidentally added to 15 gallons of chemical A and 5 gallons of chemical B. How many additional gallons of chemicals A and B must be added to accurately produce the mixture?
O 3 gallons of A and 1 gallon of B
O 2 gallons of A and 1 gallon of B
O 3 gallons of A and 2 gallons of B
O 5 gallons of A and 3 gallons of B
O 4 gallons of A and 2 gallons of B
Machine X produces 214 items per hour running at 33% of its full capacity. Machine Y produces 378 items per hour running at 42% of its full capacity. If both machines were running at 80% of their full capacity, what would be the minimum number of hours needed to produce 14,250 units?
O 11 hours
O 12 hours
O 13 hours
O 14 hours
O 15 hours
The annual sales target in millions of units sold is 2.500 for Product B. The sales target for Product A is 90% that for Product B. This year, the two Products together exceeded their combined sales targets by 30%. What were the combined sales of Products A and B this year in millions of units?
O 4.275
O 5.175
O 5.700
O 5.850
O 6.175
• Analyze and solve business-related problems involving numerical information and data
• Strongest historical P&G subtest
Analytical Thinking Style - Numerical Reasoning
The Research: Structure
• Based on the statistical need to obtain a minimum of 1,500 responses for each of ~1,300 research questions, the following research structure was constructed
A total of 286 forms (9 items per form) delivered research questions to respondents
• 110 forms each delivered Numerical Reasoning and Figural Reasoning items (497*2 = 994 questions researched)
– Expected 50%+ survival rate of research questions
• 66 forms delivered Logic-Based Reasoning items (299 questions researched)
– Expected 80%+ survival rate of research questions
• Final live structure
  • Numerical Reasoning: 240 items in three item pools
  • Logic-Based Reasoning: 240 items in three item pools
  • Figural Reasoning: 240 items in three item pools