Questions and Answers

Lessons from generating, scoring, and analyzing questions in a Reading Tutor for children
Jack Mostow, Project LISTEN (www.cs.cmu.edu/~listen)
AAAI Symposium on Question Generation keynote, Nov. 5, 2011, Arlington, VA

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080628. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education.



This talk reviews attempts over the years to generate and administer questions automatically in Project LISTEN's Reading Tutor; to log, score, and analyze children's responses; and to evaluate such questions. How can a tutor use questions? How can it generate them? How can it score children's responses? What can they tell us? How can we evaluate questions? Which questions succeeded or failed?

Questions about Questions
- Target: What does it take to answer the questions?
- Purpose: Why ask the questions?
- Source: What is the question about? From text, a database, ...?
- Question type: In what form will the questions be output?
- Answer type: In what form will responses be input?
- Generation: How to construct questions, answers, distracters?
- Modality: What channels will convey questions and answers?
- Assessment: How to score answers? How to generate feedback?
- Evaluation: How to tell how well the questions serve their purpose?

What can questions target?

Reading:
- Decoding: in Word Swap, click on the misread word [Zhang ITS 08]
- Comprehension: click on the missing word [Mostow TICL 04]
- Inter-sentential prediction: Which will come next? [Beck ITS 04]
- Monitor: Did that make sense?
- Self-question: Why was the country mouse surprised? [Chen AIED 09]
- Disengagement: hasty guessing [Beck AIED 05]
- Speech-print correspondence: Which word was misread? [Word Swap] (slide example: "The lid sat in the loft."; "was")

Vocabulary:
- Recall: Which means most like ___? [Aist 01]
- Recognize: Which word means ___? [TICL 04]
- Remind: definition cloze [Gates QG 11]
- Disambiguate: What does ___ mean here?

Knowledge:
- Fact: "Noiz arte du nahi duenak iritzia emateko aukera?" (Basque: "UNTIL WHEN do those who want to do so have the opportunity to express their views?") [Aldabe QG 11]
- Skill: How many grams can a worker ant carry? [Williams QG 11]
- Concept:
How has Project LISTEN used questions?
1. Assess comprehension [Mostow et al., TICL 04]
2. Help comprehension [Beck et al., ITS 04]
3. Assess engagement [Beck, AIED 05]
4. Teach self-questioning [Mostow & Chen, AIED 09]
5. Model self-questioning [Chen et al., ITS 10]
6. Assess self-questioning [Chen et al., QG 11]
7. Help vocabulary learning [Gates et al., QG 11]

1. Assess comprehension [TICL 04]
- Target: comprehend a sentence
- Purpose: assess comprehension while reading
- Source: sentence in text
- Question type: cloze
- Answer type: multiple choice among 4 words from the text
- Generation: randomly pick sentence, word, and distracters
- Modality: play recorded sentence and words; click on one
- Assessment: original word? immediate correctness feedback
- Evaluation: correlate against a standard comprehension test

Examples:

Assess comprehension: M/C cloze (TICL)\listen\Documentation\Papers\ITS2002\ITS talks\ITS2002-cloze-assessment-with-animation.ppt

Assess vocab: Aist; matching (TICL)

Assist comprehension: 2004 vocab WSD, generic Wh- (3W), sentence ordering

Remind taught item (Gates QG2011)

Assess engagement (Beck)

Model [in both senses] self-questioning (Chen)


Student starts reading a story. Now and then during the story, the Reading Tutor inserts a cloze question just before a sentence:
- Reading Tutor reads the question and choices aloud
- Target and distracters are words from the same story
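This generation recipe (randomly pick a sentence, blank out a word, and draw distracters from the same story) is simple enough to sketch. The function and story below are illustrative stand-ins, not the Reading Tutor's actual code:

```python
import random

def make_cloze(story_sentences, n_distracters=3, rng=random):
    """Build one multiple-choice cloze item: pick a sentence and a target
    word at random, blank the word, and draw distracters from elsewhere
    in the same story (a sketch; names are illustrative)."""
    sentence = rng.choice(story_sentences)
    words = sentence.split()
    answer = rng.choice(words)
    stem = " ".join("____" if w == answer else w for w in words)
    # Distracters: other words that appear somewhere in the same story.
    pool = sorted({w for s in story_sentences for w in s.split()} - {answer})
    choices = rng.sample(pool, n_distracters) + [answer]
    rng.shuffle(choices)
    return stem, answer, choices

story = ["The lid sat in the loft.",
         "A cat ran past the red door.",
         "It was a windy day."]
stem, answer, choices = make_cloze(story)
assert "____" in stem and answer in choices and len(choices) == 4
```

A real implementation would filter the distracter pool (e.g., by part of speech or word level, which the difficulty analysis below the fold examines), but random draws from the story already match the scheme described here.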

What do cloze questions test?

Oct. 01 - Mar. 02 evaluation data
- Reading Tutor asked 69,326 automated cloze questions
- 364 students in grades 1-9 at 7 schools
- 20-758 questions per student (median 136)
- 98% of questions answered; the rest ended in Goodbye, Back, or timeout
- 24-88% correct (median 61%); 25% = chance

How much guessing? Hasty responses [J. Valeri]:
- 3,078 (4.5%) faster than 3 seconds, only 29% correct
- 3.9% per-student mean, but below 1% for most students
- Guessing rose from 1% in October to 11% by March
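The 3-second hasty-response cutoff is straightforward to operationalize. A minimal sketch, assuming a simple (student, seconds, correct) log format rather than the Reading Tutor's actual schema:

```python
HASTY_THRESHOLD_SEC = 3.0  # responses faster than this count as guesses

def hasty_rates(responses):
    """Per-student hasty-response rates from (student, seconds, correct)
    tuples (an assumed log format, not the Reading Tutor's real schema)."""
    totals, hasty = {}, {}
    for student, secs, _correct in responses:
        totals[student] = totals.get(student, 0) + 1
        if secs < HASTY_THRESHOLD_SEC:
            hasty[student] = hasty.get(student, 0) + 1
    return {s: hasty.get(s, 0) / n for s, n in totals.items()}

log = [("s1", 1.2, False), ("s1", 8.0, True),
       ("s2", 5.5, True), ("s2", 6.1, True)]
assert hasty_rates(log) == {"s1": 0.5, "s2": 0.0}
```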

Reliability: Guttman split-half test
- Split N responses into two halves
- Match by word and story level
- Reliability increases with N: .83 for N >= 10 (338 students); .95 for N >= 80 (199 students)
- Grade = 1..9

What affects cloze difficulty?
- Similarity of distracters to answer: part of speech [Hensler & Beck, ITS 06]; semantic class
- Consistency with local context
- Consistency with inter-sentential context
- Vocabulary level of answer and distracters:
  - Sight words = 225 most frequent words of English [Dolch list]
  - Easy words = 226-3,000
  - Hard words = 3,001-25,000
  - Defined words = marked as warranting explanation
- Text level of story: grade K, 1, 2, 3, 4, 5, 6, 7
- Cloze performance at 4 word levels x 8 text levels predicts Woodcock Reading Mastery Test comprehension (R = .84)

2a. Help comprehension [ITS 04]
- Target: comprehend text by questioning
- Purpose: scaffold comprehension while reading
- Source: none
- Question type: Wh-
- Answer type: multiple choice
- Generation: scripted generic questions and choices
- Modality: play recorded prompt and words; click on one
- Assessment: none
- Evaluation: test efficacy on ensuing cloze performance

Examples:
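The Guttman split-half estimate reported above can be sketched as follows. This version uses a simple odd/even split with the Spearman-Brown step-up; the actual analysis matched halves by word and story level, which this sketch does not model:

```python
import math

def pearson(xs, ys):
    """Pearson correlation, hand-rolled so the sketch is stdlib-only."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return cov / den

def split_half_reliability(score_matrix):
    """Split each student's 0/1 responses into odd/even halves, correlate
    the half scores across students, and apply the Spearman-Brown
    step-up to estimate full-test reliability."""
    odd = [sum(r[0::2]) / len(r[0::2]) for r in score_matrix]
    even = [sum(r[1::2]) / len(r[1::2]) for r in score_matrix]
    r = pearson(odd, even)
    return 2 * r / (1 + r)

# Four students of decreasing ability, six responses each (synthetic).
scores = [[1, 1, 1, 1, 1, 1],
          [1, 0, 1, 1, 0, 1],
          [0, 0, 1, 0, 0, 1],
          [1, 0, 0, 0, 0, 0]]
rel = split_half_reliability(scores)
assert 0 < rel <= 1
```

The step-up corrects for the fact that each half has only N/2 items; this is why the slide's reliability figures rise with N.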


Generic Wh- questions: initial
- Generic prompt (meta-question): "Click on a question you can answer, or click Back to reread the sentence"
- Generic 1-word questions as choices: Who? What? When? Where? Why? How? So?
- Evaluation: failed 2002 user test
  - Meta-question confusing
  - 1-word questions too vague to map to text

Generic Wh- questions: revised
- Randomly pick a generic prompt: "What has happened so far?" "When does this take place?"
- List choices scripted for the prompt: "facts were given"; "a problem is being solved"; "in the present"; "in the future"; "in the past"; "It could happen in the past"; "I can't tell"

2b. Help comprehension [ITS 04]
- Target: comprehend text by predicting
- Purpose: scaffold comprehension while reading
- Source: none
- Question type: Which will come next?
- Answer type: multiple choice
- Generation: answer = next sentence; distracters = 2 following sentences
- Modality: play recorded prompt and sentences; click on one
- Assessment: original sentence? immediate correctness feedback
- Evaluation: 41% right; test efficacy on ensuing cloze performance

Examples:

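Slide 2b's generation rule (the answer is the next sentence; the distracters are the two sentences after it) can be sketched as follows; the function name and story are illustrative:

```python
import random

def make_prediction_item(sentences, i, rng=random):
    """Build a "Which will come next?" item after sentence i, per the
    scheme on slide 2b: answer = sentences[i+1], distracters = the two
    sentences that follow it (a sketch, not the Reading Tutor's code)."""
    if i + 3 >= len(sentences):
        raise ValueError("need at least three sentences after position i")
    answer = sentences[i + 1]
    choices = [answer, sentences[i + 2], sentences[i + 3]]
    rng.shuffle(choices)
    return answer, choices

story = ["Once there was a mouse.",
         "He lived in the country.",
         "One day his cousin visited.",
         "The cousin came from the city.",
         "They ate supper together."]
answer, choices = make_prediction_item(story, 0)
assert answer == "He lived in the country." and len(choices) == 3
```

Because the distracters come from just downstream in the same story, they are topically plausible, which helps explain why students got only 41% of these items right.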

Sentence prediction

2003 evaluation data
- Reading Tutor asked 23,372 randomly inserted questions
- 252 students in grades 1-4+? who used it for an hour or more
- 6,720 generic what/when/where (3W)
- 1,865 "Which will come next?" sentence predictions
- 15,187 cloze
- On average, one question every 10 sentences

Evaluation results

Variable                           Helps/Hurts   p
# 3W questions                     helps         0.023
# sentence predictions             helps (n.s.)  0.072
# cloze questions                  =             -
# recent 3W questions              hurts (n.s.)  0.074
# recent cloze questions           =             -
Time since prior question (sec)    hurts         0.036
Time since start of story (sec)    =             0.14

Effect of recent questions

- Response time < 3 sec indicates disengagement
- Used to track (dis-)engagement [Beck AIED 05]

[Chart: proportion of hasty responses vs. time since previous question (sec)]

Effect of question type

[Chart: cloze performance vs. number of prior questions]

Questions about Questions
- Target: What does it take to answer the questions?
- Purpose: Why ask the questions?
- Question type: cloze? wh-/how/so/...? find/compare/...?
- Answer type: multiple choice? fill-in? open-ended?
- Generation: How to construct questions, answers, distracters?
- Modality: menu? click? keyboard? speech? graphics?
- Assessment: How to score answers? How to generate feedback?
- Evaluation: How to tell how well the questions serve their purpose?

QG costs (lowest to highest: None < Trivial < Simple < Complex < Manual)

                    None    Trivial   Simple      Complex        Manual
Generation
  Question          -       Generic   Cloze       NLP            Scripted
  Answer            None    Given     = text      NLP            Scripted
  Distracter        None    Generic   From text   NLP            Scripted
Modality
  Output            -       Menu      Text        TTS, visual    Prerecorded
  Input             None    Click     Keyboard    ASR, gesture   Transcribed
Assessment
  Scoring           None    M/C       = answer    NLP            Manual
  Feedback          None    Generic   answer      NLP            Scripted

Where the question types above fall on this grid: 1. M/C cloze; 2a. Generic Wh-; 2b. Sentence prediction; 7. Vocabulary reminders.


Model
- Logistic regression of whether the student answered the cloze question correctly
- Factors: user, cloze type
- Covariates:
  - occurring previously in the story: # 3W questions, # sentence predictions, # clozes
  - occurring in the prior 2 minutes: # 3W questions, # clozes
  - delay since prior intervention (seconds)
  - time since beginning of story (minutes)
  - month of the year (Jan = 1, Feb = 2, ...)
- Controls for student differences by including student as a factor
- Controls for progress through the story (perhaps later questions are easier) by including time since beginning of story as a covariate
- Nagelkerke R2 = 0.185

Likelihood ratio tests (positive beta = more likely to answer correctly):

Effect                               Beta     df    Chi-square   p
Student identity                     -        402   1572.28      < .001
Cloze question type                  -        3     388.43       < .001
  sight                              0
  easy                               -0.03
  hard                               -0.19
  defined                            -1.04
# 3W questions                       0.05     1     5.15         0.023
# sentence prediction questions      0.08     1     3.24         0.072
# cloze questions                    -0.005   1     0.09         0.765
# recent 3W questions                -0.07    1     3.18         0.074
# recent cloze questions             0.02     1     0.36         0.548
Month of year                        -0.01    1     0.36         0.551
Time since start of story            -0.013   1     2.21         0.137
Time since last intervention (sec)   -0.001   1     4.39         0.036
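The shape of this analysis can be illustrated with a stdlib-only gradient-descent logistic regression. The data below are synthetic stand-ins built to echo the sign pattern of the reported betas (more prior 3W questions help; hard words hurt); the real model had many more factors and covariates:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=3000):
    """Plain gradient-descent logistic regression (stdlib-only sketch of
    the model structure: covariates predicting cloze correctness)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1 / (1 + math.exp(-z)) - yi  # predicted p - actual
            gw = [gj + err * xj for gj, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

# Synthetic rows: features = (# prior 3W questions, hard-word dummy).
X = [[n, h] for h in (0, 1) for n in range(6)]
y = [0, 0, 1, 1, 1, 1,   # easy words: correct after a couple of 3W questions
     0, 0, 0, 1, 1, 1]   # hard words: need more 3W questions
w, b = fit_logistic(X, y)
assert w[0] > 0 > w[1]   # 3W count helps; hard words hurt
```

The actual analysis additionally coded student identity and cloze type as categorical factors (one dummy column per level), which is the standard way to fold the table's factor rows into the same regression.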

Q1: Does 3W scaffold comprehension? Yes; the beta tells us so.
- Not a confound of time: time elapsed has no effect (if anything, students do worse over time)
- Not a confound of simply asking questions: cloze questions have no effect, and sentence prediction is not significant (may be an artifact)
- Verdict: confirmed

Q2: Do other question types have no effect on comprehension?
- Sentence prediction may help comprehension; needs further investigation
- Cloze certainly has no effect
- Verdict: mostly confirmed

Q3: Is the 3W effect transient? No.
- The 3W result seems to come from quantity, not timing (the AIED model did not consider quantity)
- If anything, recent 3W questions may hurt; this could be an impact on motivation (see question 4)
- Verdict: rejected

Q4

[Chart: cloze performance vs. number of prior questions (0-10), by question type: 3W, sentence prediction, cloze]
