Carnegie Mellon Mostow 12/7/2015, p. 1 The Sounds of Silence: Towards Automated Evaluation of...
CarnegieMellon
Mostow 04/21/23, p. 1
The Sounds of Silence:
Towards Automated Evaluation of Student Learning in a Reading Tutor that Listens
Jack Mostow and Gregory Aist
Project LISTEN, Carnegie Mellon University
http://www.cs.cmu.edu/~listen
Pilot study in urban elementary school
Goals:
• Analyze extended use of Reading Tutor
• Identify opportunities for improvement
Protocol:
• Principal chose 8 lowest third-grade readers
• Aide took each kid daily to use Reading Tutor in small room
• Kid chose text to read (Weekly Reader, poems, …)
Milestones:
• Oct. 96: deployed Pentium, trained users, refined design
• Nov. 96: school pre-tested individually
• June 97: school post-tested individually
User-Tutor interaction (11/7/96 version used in pilot study)
User may:
• click Back
• click Help
• click Go
• click word
• read
Tutor may:
• go on
• read word
• recue word
• read phrase
Data recorded by Reading Tutor
Sessions from Nov. 96 to May 97 (excluding outliers)
• 29 to 57 sessions per kid, averaging 14 minutes
• Not used during vacations, downtime, absences
6 gigabytes of data
• .WAV files of kids’ spoken utterances
• .SEG files of time-aligned speech recognizer output
• .LOG files of Reading Tutor events
What to evaluate?
Usability (can kids use it?)
• 1993 Wizard of Oz experiments
• Lab and in-school user tests of successive versions
Assistiveness (do kids perform better with than without?)
• 1994 Reading Coach boosted comprehension by ~20%
• But: evaluation obtrusive, costly, sparse, subjective, noisy
Learning (do kids improve over time?)
• Within tutor: this talk
• On unassisted reading: pre-/post-test by school
• More than with alternatives: future studies
How should the Reading Tutor evaluate learning?
Evaluation should be
• Ecologically valid -- based on normal system use
• Authentic -- student chooses material
• Unobtrusive -- invisible to student
• Automatic -- objective, cheap
• Fast -- computable in real-time on PC
• Robust -- to student, recognizer, and tutor behavior
• Data-rich -- based on many observations
• Sensitive -- detect subtle effects
So estimate improvement in assisted performance
How to estimate performance?
Accuracy = % of text words matched by recognizer output
• Coarse-grained
• Sensitive to missed words
• Doesn’t penalize requests for help
Inter-word latency = time interval between aligned text words
• Finer-grained
• Sensitive to hesitations, insertions
• Robust to many speech recognizer errors
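
The latency definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the `AlignedWord` record and its field names are hypothetical (not the Reading Tutor's .SEG format), and "time interval between aligned text words" is read here as the gap from the end of the previous matched word to the start of the current one, in centiseconds. How the Tutor itself treats unmatched words is not specified on the slide; this sketch simply leaves their latency undefined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlignedWord:
    """One text word and, if the recognizer matched it, when it was heard.

    Times are in centiseconds (cs) from the start of the utterance.
    Hypothetical record layout, assumed for illustration only.
    """
    text: str
    start_cs: Optional[int] = None  # None when the word was not matched
    end_cs: Optional[int] = None


def inter_word_latencies(words: List[AlignedWord]) -> List[Optional[int]]:
    """Latency of a word = its start time minus the end of the previous matched word.

    Returns None where no latency can be computed: the first matched word
    (nothing precedes it) and any unmatched word.
    """
    latencies: List[Optional[int]] = []
    prev_end: Optional[int] = None
    for w in words:
        if w.start_cs is None or w.end_cs is None:
            latencies.append(None)  # unmatched word: latency left undefined
            continue
        latencies.append(None if prev_end is None else w.start_cs - prev_end)
        prev_end = w.end_cs
    return latencies


# Toy utterance (made-up timings): "sat" was not matched by the recognizer.
words = [
    AlignedWord("the", 0, 20),
    AlignedWord("cat", 35, 60),
    AlignedWord("sat"),
    AlignedWord("down", 150, 190),
]
print(inter_word_latencies(words))  # [None, 15, None, 90]
```

Anything the student does between two matched words (silence, false starts, inserted words) falls inside the gap, which is one way to see why latency is sensitive to hesitations and insertions.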
Estimation of accuracy and latency
(Nov. 96 example from video)
Text: If the computer thinks you need help, it talks to you.
Student said: if the computer...takes your name...help it...take...s to you
Recognizer heard: IF THE COMPUTER THINKS YOU IF THE HELP IT TO TO YOU
Tutor estimated 81% accuracy; inter-word latencies:
If the computer thinks you need…help, it talks...to you.
? 43 39 1 60 41 226 7 1 242 1 cs
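
The 81% figure can be reproduced with a simple alignment. The sketch below assumes that "matched" means the text word takes part in a longest-common-subsequence alignment between the text and the recognizer output; the slides do not say which alignment method the Tutor uses, so this is a stand-in. The function names are illustrative.

```python
import re
from typing import List


def tokenize(s: str) -> List[str]:
    """Lowercase and drop punctuation, so 'help,' compares equal to 'HELP'."""
    return re.findall(r"[a-z']+", s.lower())


def matched_words(text: List[str], heard: List[str]) -> int:
    """Length of the longest common subsequence of the two word lists."""
    prev = [0] * (len(heard) + 1)
    for t in text:
        cur = [0]
        for j, h in enumerate(heard, start=1):
            cur.append(prev[j - 1] + 1 if t == h else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def accuracy(text: str, heard: str) -> float:
    """Fraction of text words matched by the recognizer output."""
    words = tokenize(text)
    if not words:
        return 0.0
    return matched_words(words, tokenize(heard)) / len(words)


text = "If the computer thinks you need help, it talks to you."
heard = "IF THE COMPUTER THINKS YOU IF THE HELP IT TO TO YOU"
print(f"{accuracy(text, heard):.1%}")  # 81.8%: 9 of 11 text words matched
```

Under this reading, "need" and "talks" are the two unmatched words, and 9/11 is consistent with the 81% on the slide if that figure was truncated rather than rounded.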
Improvement in accuracy and latency
(same kid reads “help” in May 97)
Text: When some kids jump rope, they help other people too.
Student said: when some kids jump rope they help other people too
Recognizer heard: WHEN SOME KIDS JUMP ROPE THEY HELP OTHER PEOPLE TOO
Tutor estimated 100% accuracy; inter-word latencies:
When some kids jump rope, they help other people too.
? 1 10 34 19 77 9 1 34 1 cs
Which performance improvements count?
Echoing the sentence doesn’t count.
• So look only at the first try.
Picking stories with easier words doesn’t count.
• So look at changes on the same word.
Memorizing the story doesn’t count.
• So look only at encounters of words in new contexts.
Remembering recent words doesn’t count.
• So look only at the first time a word is seen that day.
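
The four filters above can be read as one pass over a chronological log of word encounters. The sketch below is a guess at how that might look, not the Tutor's actual code: the `Encounter` record, its field names, and the choice to treat "context" as the enclosing sentence are all assumptions, and the log values are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Encounter:
    """One attempt at one word. All field names are hypothetical."""
    student: str
    word: str
    day: date
    sentence: str      # the sentence the word appeared in (its context)
    first_try: bool    # False when the student is re-reading or echoing
    latency_cs: float


def countable(encounters: Iterable[Encounter]) -> List[Encounter]:
    """Keep first tries of a word in a new context, once per day.

    Assumes the input is already sorted by time.
    """
    seen_contexts: Set[Tuple[str, str, str]] = set()  # (student, word, sentence)
    seen_today: Set[Tuple[str, str, date]] = set()    # (student, word, day)
    kept: List[Encounter] = []
    for e in encounters:
        context = (e.student, e.word, e.sentence)
        today = (e.student, e.word, e.day)
        ok = e.first_try and context not in seen_contexts and today not in seen_today
        # Every encounter counts as exposure, whether or not it is kept.
        seen_contexts.add(context)
        seen_today.add(today)
        if ok:
            kept.append(e)
    return kept


def first_vs_last(kept: Iterable[Encounter]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Pair first-day and last-day latency for each (student, word) kept at least twice."""
    by_word: Dict[Tuple[str, str], List[Encounter]] = {}
    for e in kept:
        by_word.setdefault((e.student, e.word), []).append(e)
    return {k: (v[0].latency_cs, v[-1].latency_cs) for k, v in by_word.items() if len(v) >= 2}


d1, d2, d3 = date(1996, 11, 7), date(1997, 2, 3), date(1997, 5, 1)
log = [
    Encounter("s1", "help", d1, "A", True, 120.0),   # kept
    Encounter("s1", "help", d1, "A", False, 20.0),   # echo: not the first try
    Encounter("s1", "help", d1, "B", True, 50.0),    # word already seen that day
    Encounter("s1", "help", d2, "A", True, 15.0),    # old context: may be memorized
    Encounter("s1", "help", d3, "C", True, 30.0),    # kept
]
print(first_vs_last(countable(log)))  # {('s1', 'help'): (120.0, 30.0)}
```

Comparing the same word across days is what addresses the "easier stories" objection: the pairing in `first_vs_last` never compares one word against a different one.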
Accuracy increased 16% on same word
from first to last day seen in new context
[Chart: accuracy on first vs. last day, y-axis 50% to 90%, one group per student: mjt, mtw, mmd, mrt, mdc, mgt, mcr, fbw]
Latency decreased 35% on same word
from first to last day read in new context
[Chart: latency on first vs. last day, y-axis 0 to 100 cs, one group per student: mjt, mtw, mmd, mrt, mdc, mgt, mcr, fbw]
Is accuracy and latency estimation...
Ecologically valid? Reading Tutor used in school
Authentic? kids choose stories
Unobtrusive? evaluate assisted reading invisibly
Automatic? align recognizer output against text
Fast? real-time on Pentium
Robust? to much student, recognizer, and tutor behavior
Data-rich? 10498 utterances, 139133 aligned words
Sensitive? detects significant but subtle effects (< 0.1 sec)
Conclusion
Does the Reading Tutor help?
• Yes, with assisted reading
• Transfers to unassisted reading!
Research questions:
• Who benefits how much, when, and why?
• How should we improve the Tutor?
For more information:
• http://www.cs.cmu.edu/~listen