ECOLT 2006 Slide 1, October 13, 2006
Prospectus for the PADI design framework in language testing
ECOLT 2006, October 13, 2006, Washington, D.C.
PADI is supported by the National Science Foundation under grant REC-0129331. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Robert J. Mislevy, Professor of Measurement & Statistics
University of Maryland
Geneva D. Haertel, Assessment Research Area Director
SRI International
ECOLT 2006 Slide 2, October 13, 2006
Some Challenges in Language Testing
Sorting out evidence about interacting aspects of knowledge & proficiency in complex performances
Understanding the impact of "complexity factors" and "difficulty factors" on inference
Scaling up efficiently to high-volume tests: task creation, scoring, delivery
Creating valid & cost-effective low-volume tests
ECOLT 2006 Slide 3, October 13, 2006
Evidence-Centered Design
Evidence-centered assessment design (ECD) provides language, concepts, knowledge representations, data structures, and supporting tools to help design and deliver educational assessments,
all organized around the evidentiary argument an assessment is meant to embody.
ECOLT 2006 Slide 4, October 13, 2006
The Assessment Argument
What kinds of claims do we want to make about students?
What behaviors or performances can provide us with evidence for those claims?
What tasks or situations should elicit those behaviors?
Generalizing from Messick (1994)
ECOLT 2006 Slide 5, October 13, 2006
Evidence-Centered Design
With Linda Steinberg & Russell Almond at ETS
» The Portal project / TOEFL
» NetPASS with Cisco (computer network design & troubleshooting)
Principled Assessment Design for Inquiry (PADI)
» Supported by NSF (co-PI: Geneva Haertel, SRI)
» Focus on science inquiry, e.g., investigations
» Models, tools, examples
ECOLT 2006 Slide 6, October 13, 2006
Some allied work
Cognitive design for generating tasks (Embretson)
Model-based assessment (Baker)
Analyses of task characteristics, test and TLU (Bachman & Palmer)
Test specifications (Davidson & Lynch)
Constructing measures (Wilson)
Understanding by Design (Wiggins)
Integrated Test Design, Development, and Delivery (Luecht)
From Mislevy & Riconscente, in press
Layers in the assessment enterprise:
Domain Analysis: What is important about this domain? What work and situations are central in this domain? What knowledge representations (KRs) are central to this domain?
Domain Modeling: How do we represent key aspects of the domain in terms of an assessment argument?
Conceptual Assessment Framework: Design structures: student, evidence, and task models.
Assessment Implementation: How do we choose and present tasks, and gather and analyze responses?
Assessment Delivery: How do students and tasks actually interact? How do we report examinee performance?
Key ideas: explicit relationships, explicit structures, generativity, re-usability, recombinability, interoperability.
(Layers diagram, repeated.) Domain Analysis draws on expertise research, task analysis, curriculum, target use, critical incident analysis, ethnographic studies, etc. In language assessment, note the importance of psycholinguistics, sociolinguistics, and target language use.
(Layers diagram, repeated.) Tangible stuff: e.g., what gets made and how it operates in the testing situation.
(Layers diagram, repeated.) How do you get from here to here?
(Layers diagram, repeated.) We will focus today on two "hidden" layers:
(Layers diagram, repeated.) First, Domain Modeling, which concerns the Assessment Argument.
(Layers diagram, repeated.) And second, the Conceptual Assessment Framework, which concerns generative & re-combinable design schemas.
(Layers diagram, repeated.) More on the Assessment Argument:
ECOLT 2006 Slide 15, October 13, 2006
PADI Design Patterns
Organized around elements of the assessment argument
Narrative structures for assessing pervasive kinds of knowledge / skill / capabilities
Based on research & experience, e.g.
» PADI: design under constraint, inquiry cycles, representations
» Compliance with Grice's maxims; cause/effect reasoning; giving spoken directions
Suggest design choices that apply to different contexts, levels, purposes, formats
» Capture experience in structured form
» Organized in terms of the assessment argument
ECOLT 2006 Slide 16, October 13, 2006
A Design Pattern Motivated by Grice's Relation Maxims
Attribute: Value(s)
Name: Grice's Relation Maxim: Responding to a Request
Summary: In this design pattern, an examinee demonstrates following Grice's Relation Maxim in a given language by producing or selecting a response in a situation that presents a request for information (e.g., a conversation).
Central claims: In contexts/situations with xxx characteristics, can formulate and respond to representations of implicature from referents (semantic implication, pragmatic implication).
Additional knowledge that may be at issue: Substantive knowledge in the domain; familiarity with cultural models; knowledge of the language.
ECOLT 2006 Slide 17, October 13, 2006
Characteristic features
The stimulus situation needs to present a request for relevant information to the examinee, either explicitly or implicitly.
Variable task features
» Production or choice as response?
» If production, oral or written production required?
» If oral, a single response to a preconfigured situation, or part of an evolving conversation?
» If an evolving conversation, open or structured interview?
» Formality of prepackaged products (multiple choice, videotaped conversations, written questions or conversations, one-to-one or multi-party conversations prepared by interviewers)
» Formality of information and task (concrete or abstract, immediate or remote, information requiring retrieval or transformation, familiar or unfamiliar setting and topic, written or spoken)
» If a prepackaged speech stimulus: length, content, difficulty of language, explicitness of request, degree of cultural dependence
» Content of situation (familiar or unfamiliar, degree of difficulty)
» Time pressure (e.g., time for planning and response)
» Opportunity to control the conversation
Grice’s Relation Maxims
ECOLT 2006 Slide 18, October 13, 2006
Potential performances and work products
Constructed oral response
Constructed written or typed-in response
Answer to a multiple-choice question where alternatives vary
Potential features of performance to evaluate
Whether a student can formulate representations of implicature, as they are required in the given situation.
Whether a student can make a conversational contribution that moves the exchange in the accepted direction.
Whether a student provides the relevant information as required.
Whether the choice among alternatives offered for a production in a given situation satisfies the Relation Maxim.
Potential rubrics (later slide)
Examples (in paper)
Grice’s Relation Maxims
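The attribute-value layout of this design pattern lends itself to a structured record. A minimal Python sketch follows; the attribute names and values are taken from the slides above, but the dict representation itself is illustrative, not PADI's actual data model:

```python
# A design pattern as a structured record. Attribute names follow the slides;
# this dict representation is illustrative, not PADI's actual object model.
design_pattern = {
    "name": "Grice's Relation Maxim: Responding to a Request",
    "summary": ("Examinee demonstrates following Grice's Relation Maxim "
                "by producing or selecting a response to a request for information."),
    "additional_knowledge": [
        "Substantive knowledge in domain",
        "Familiarity with cultural models",
        "Knowledge of language",
    ],
    "characteristic_features": [
        "Stimulus situation presents a request for relevant information, "
        "explicitly or implicitly",
    ],
    "variable_task_features": [
        "Production or choice as response",
        "Oral or written production",
        "Content of situation",
        "Time pressure",
    ],
    "potential_work_products": [
        "Constructed oral response",
        "Constructed written or typed-in response",
        "Answer to a multiple-choice question",
    ],
}
```

A task author instantiates a concrete task from the pattern by fixing a value for each variable task feature and choosing among the potential work products.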
ECOLT 2006 Slide 19, October 13, 2006
Some Relationships between Design Patterns and Other TD Tools
Conceptual models for proficiency & task characteristic frameworks
» Grist for design choices about KSAs & task features
» DPs present an integrated design space
Test specifications
» DPs for generating the argument and design choices
» Test specs for documenting and specifying choices
(Layers diagram, repeated.) More on the Conceptual Assessment Framework:
ECOLT 2006 Slide 21, October 13, 2006
Evidence-centered assessment design
The three basic models: the Student Model; the Evidence Model(s), comprising a statistical model and evidence rules; and the Task Model(s).
Technical specs that embody the elements suggested in the design pattern.
ECOLT 2006 Slide 22, October 13, 2006
Evidence-centered assessment design
The three basic models (Student, Evidence, Task), shown as a conceptual representation.
ECOLT 2006 Slide 23, October 13, 2006
Screen shot of the user interface: the user-interface representation.
ECOLT 2006 Slide 24, October 13, 2006
High-level UML Representation of the PADI Object Model
UML representation (sharable data structures, "behind the screen")
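Behind such a UML diagram, the object model amounts to linked, sharable structures. Here is a hypothetical Python sketch of the three basic models as data classes; the class and field names are illustrative, not the actual PADI classes:

```python
# Hypothetical sketch of CAF objects as sharable data structures, in the
# spirit of the PADI object model (names illustrative, not the real UML).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class StudentModelVariable:
    """An aspect of proficiency the assessment is meant to measure."""
    name: str
    levels: List[str]

@dataclass
class EvidenceModel:
    """Links work products to student-model variables."""
    evidence_rules: str      # how work products are scored into observables
    observables: List[str]   # observable-variable names
    smv_names: List[str]     # which SMVs the statistical model updates

@dataclass
class TaskModel:
    """Schema for the situations that elicit evidence."""
    features: Dict[str, str]  # task-feature slots and chosen values
    work_products: List[str]  # what the examinee produces

@dataclass
class ConceptualAssessmentFramework:
    student_model: List[StudentModelVariable]
    evidence_models: List[EvidenceModel]
    task_models: List[TaskModel]

caf = ConceptualAssessmentFramework(
    student_model=[StudentModelVariable("Design", ["low", "medium", "high"])],
    evidence_models=[EvidenceModel("rubric 0-4", ["relevance"], ["Design"])],
    task_models=[TaskModel({"setting": "University"}, ["network diagram"])],
)
```

The point of the shared structure is re-usability: a new project swaps in its own variables, rules, and task features without changing the schema.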
ECOLT 2006 Slide 25, October 13, 2006
What complex of knowledge, skills, or other attributes should be assessed?
Evidence-centered assessment design
ECOLT 2006 Slide 26, October 13, 2006
The NetPass Student Model
Can use the same student model with different tasks.
A multidimensional measurement model with selected aspects of proficiency.
ECOLT 2006 Slide 27, October 13, 2006
What behaviors or performances should reveal those constructs?
Evidence-centered assessment design
ECOLT 2006 Slide 28, October 13, 2006
What behaviors or performances should reveal those constructs?
Evidence-centered assessment design
From a unique student work product to evaluations of observable variables, i.e., task-level "scoring".
ECOLT 2006 Slide 29, October 13, 2006
Skeletal Rubric for Satisfaction of Quality Maxims
4: Responses and explanations are relevant as required for the current purposes of the exchange, and are neither more elaborated than appropriate nor insufficient for the context. They fulfill the demands of the task with at most minor lapses in completeness. They are appropriate for the task and exhibit coherent discourse.
3: Responses and explanations address the task appropriately and are relevant as required for the current purposes of the exchange, but they may either be more elaborated than required or fall short of being fully developed.
2: The responses and explanations are connected to the task, but are either markedly excessive in the information supplied or not very relevant to the current purpose of the exchange. Some relevant information may be missing or inaccurately cast.
1: The responses and explanations are either grossly irrelevant or very limited in content or coherence. In either case they may be only minimally connected to the task.
0: The speaker makes no attempt to respond, or the response is unrelated to the topic. A written response at this level merely copies sentences from the topic, rejects the topic, or is otherwise not connected to the topic. A spoken response is not connected to the direct or implied request for information.
ECOLT 2006 Slide 30, October 13, 2006
Notes re Observable Variables
Re-usable (tailorable) to different tasks & projects.
There can be multiple aspects of performance being rated.
May be a 1-1 relationship with student-model variables, but need not be; that is, multiple aspects of proficiency can be involved in the probability of a high / satisfactory / certain style of response.
ECOLT 2006 Slide 31, October 13, 2006
What behaviors or performances should reveal those constructs?
Evidence-centered assessment design
Values of observable variables are used to update probability distributions for student-model variables via the psychometric model, i.e., test-level scoring.
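A minimal numerical sketch of this test-level updating, for a single discrete student-model variable updated by each scored observable via Bayes' rule (posterior ~ prior x likelihood). The conditional probabilities are made-up values, not taken from the PADI or NetPass models:

```python
# One SMV ("proficiency", three levels) updated by scored observables.
prior = {"low": 1/3, "medium": 1/3, "high": 1/3}

# P(observable correct | proficiency level) for two tasks -- assumed numbers.
likelihood = [
    {"low": 0.2, "medium": 0.5, "high": 0.8},   # task 1
    {"low": 0.1, "medium": 0.4, "high": 0.9},   # task 2
]

def update(belief, p_correct, observed_correct):
    """One Bayes update of the SMV distribution from a scored observable."""
    post = {}
    for level, p in belief.items():
        like = p_correct[level] if observed_correct else 1 - p_correct[level]
        post[level] = p * like
    z = sum(post.values())                       # normalize
    return {level: v / z for level, v in post.items()}

belief = prior
for p_correct, obs in zip(likelihood, [True, True]):  # examinee got both right
    belief = update(belief, p_correct, obs)

print(belief)  # posterior mass shifts toward "high"
```

The same machinery scales to Bayes nets over several student-model variables, which is how the evidence-model fragments on the next slide are used.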
ECOLT 2006 Slide 32, October 13, 2006
A NetPass Evidence-Model Fragment for Design
Re-usable conditional-probability fragments and variable names for different tasks with the same evidentiary structure.
Measurement models indicate which SMVs, in which combinations, affect which observables. Task features influence which ones and how much, in structured measurement models.
ECOLT 2006 Slide 33, October 13, 2006
What tasks or situations should elicit those behaviors?
Evidence-centered assessment design
ECOLT 2006 Slide 34, October 13, 2006
Representations to the student, and sources of variation
ECOLT 2006 Slide 35, October 13, 2006
Task Specification Template: Determining Key Features (Wizards)
Setting: Corporation / Conference Center / University
Building Length: Less than 100m / More than 100m
Ethernet Standard: 10BaseT / 100BaseT
Subgroup Name: Teacher / Student / Customer
Bandwidth for a Subgroup Drop: 10Mbps / 100Mbps
Growth Requirements: Given / NA
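A wizard over this template is essentially a set of enumerated feature slots plus validation. The option lists below are copied from the slide; the dict-and-function code itself is an illustrative sketch, not PADI's implementation:

```python
# NetPass-style task-template "wizard": enumerated feature slots.
TASK_FEATURES = {
    "setting": ["Corporation", "Conference Center", "University"],
    "building_length": ["Less than 100m", "More than 100m"],
    "ethernet_standard": ["10BaseT", "100BaseT"],
    "subgroup_name": ["Teacher", "Student", "Customer"],
    "bandwidth_per_drop": ["10Mbps", "100Mbps"],
    "growth_requirements": ["Given", "NA"],
}

def make_task_spec(**choices):
    """Validate one choice per feature slot against the template."""
    spec = {}
    for feature, options in TASK_FEATURES.items():
        value = choices.get(feature)
        if value not in options:
            raise ValueError(f"{feature!r} must be one of {options}, got {value!r}")
        spec[feature] = value
    return spec

task = make_task_spec(
    setting="University",
    building_length="Less than 100m",
    ethernet_standard="100BaseT",
    subgroup_name="Student",
    bandwidth_per_drop="100Mbps",
    growth_requirements="Given",
)
```

Enumerating the slots makes the design space explicit: every combination of choices is a candidate task in the same family, which is what lets the structured measurement models on the next slide relate task features to difficulty.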
ECOLT 2006 Slide 36, October 13, 2006
Structured Measurement Models
Examples of models:
» Multivariate Random Coefficients Multinomial Logit Model (MRCMLM; Adams, Wilson, & Wang, 1997)
» Bayes nets (Mislevy, 1996)
» General Diagnostic Model (von Davier & Yamamoto)
By relating task characteristics to difficulty with respect to different aspects of proficiency, create tasks with known properties.
Can create families of tasks around the same evidentiary frameworks; e.g., for "read & write" tasks, can vary characteristics of texts, directives, audience, purpose.
ECOLT 2006 Slide 37, October 13, 2006
Structured Measurement Models
Articulated connection between task characteristics and models of proficiency
Moves beyond "modeling difficulty"
» Traditional test theory is a bottleneck in a multivariate environment
Dealing with "complexity factors" and "difficulty factors" (Robinson)
» Model complexity factors as covariates for the difficulty parameters with respect to those aspects of proficiency they impact
» Model difficulty factors either as SMVs, if a target of inference, or as noise, if a nuisance
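The "complexity factors as covariates for difficulty" idea can be written in LLTM style. This is an illustrative unidimensional form, not the exact MRCMLM parameterization:

```latex
% Task i's difficulty is decomposed into complexity-factor effects:
P(X_{pi} = 1 \mid \theta_p) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)},
\qquad
b_i = \sum_{k} q_{ik}\,\eta_k
```

Here \(\theta_p\) is person \(p\)'s proficiency, \(q_{ik}\) codes whether complexity factor \(k\) is present in task \(i\), and \(\eta_k\) is that factor's contribution to difficulty. The multivariate case attaches such decompositions to difficulty with respect to each aspect of proficiency.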
ECOLT 2006 Slide 38, October 13, 2006
Advantages: A framework that…
Guides task and test construction (Wizards)
Provides high efficiency and scalability
By relating task characteristics to difficulty, allows creating tasks with targeted properties
Promotes re-use of conceptual structures (DPs, arguments) in different projects
Promotes re-use of machinery in different projects
ECOLT 2006 Slide 39, October 13, 2006
Evidence of effectiveness
Cisco
» Certification & training assessment
» Simulation-based assessment tasks
IMS/QTI
» Conceptual model for standards for data structures in computer-based testing
ETS
» TOEFL
» NBPTS
ECOLT 2006 Slide 40, October 13, 2006
Conclusion
Isn’t this just a bunch of new words for describing what we already do?
ECOLT 2006 Slide 41, October 13, 2006
An answer (Part 1)
No.
ECOLT 2006 Slide 42, October 13, 2006
An answer (Part 2)
An explicit, general framework makes similarities and implicit principles explicit:
» To better understand current assessments…
» To design for new kinds of assessment…
– Tasks that tap multiple aspects of proficiency
– Technology-based tasks (e.g., simulations)
– Complex observations, student models, evaluation
» To foster re-use, sharing, & modularity
– Concepts & arguments
– Pieces of machinery & processes (QTI)
ECOLT 2006 Slide 43, October 13, 2006
For more information…
www.education.umd.edu/EDMS/mislevy/
Has links to PADI, Cisco, articles, etc.
(e.g., CRESST report on Task-Based Language Assessment.)