Dr. Geoffrey J. Gordon: What can machine learning do for open education?


Description

One of the big promises of open and massively online education is easy data collection: we can record everything from students' habits in reading and viewing lectures, to their participation in discussion groups, to their timing and performance on exercises. So, open education is a natural fit for machine learning: for example, we can use ML to predict future student performance, to select and sequence learning activities, and even to help grade some types of assignments. But there's a lot more left to do: I'll argue that even bigger gains can come from ML that's focused on understanding educational content and how students learn it, and on communicating this understanding to human educators. To achieve such understanding and communication, we need to take advantage of ML techniques including representation learning, structured learning, and exploration/experimentation.

Dr. Geoffrey J. Gordon is an Associate Research Professor in the Department of Machine Learning at Carnegie Mellon University, and co-director of the Department's Ph.D. program. He works on statistical machine learning, educational data, game theory, multi-robot systems, and planning in probabilistic, adversarial, and general-sum domains. His previous appointments include Visiting Professor at the Stanford Computer Science Department and Principal Scientist at Burning Glass Technologies in San Diego. Dr. Gordon received his B.A. in Computer Science from Cornell University in 1991, and his Ph.D. in Computer Science from Carnegie Mellon University in 1999.

Transcript of Dr. Geoffrey J. Gordon: What can machine learning do for open education?

  • 1. WHAT CAN MACHINE LEARNING DO FOR OPEN EDUCATION? Geoff Gordon CMU Machine Learning [email protected]

  • 2. "Civilization advances by extending the number of important operations which we can perform without thinking about them." Alfred North Whitehead, 1911

  • 3. CONTRIBUTION OF ML: Machine learning can help us understand how students learn

  • 4. CONTRIBUTION OF ML: Machine learning can help us understand how students learn. Not just any ML, but latent variable (hidden feature) discovery

  • 5. CONTRIBUTION OF ML: Machine learning can help us understand how students learn. Not just any ML, but latent variable (hidden feature) discovery. Not just any latent variables, but highly structured ones

  • 6. WHY BOTHER? Student feedback: what does the student know? what are common causes for the mistake the student just made? Instructor feedback: what do the students know? what skills does this course content address? what skills doesn't this course content address? Evaluation: help design rubrics for (peer, instructor) grading; cluster submissions by similar approach, skill level, etc.

  • 7. GOAL: UNDERSTAND HOW STUDENTS LEARN SOMETHING. [Examples: a composite-area geometry problem with sides 10m, 4m, 5m, 10m; the equation 3x + 4 = x + 10; the fill-in-the-blank item "John took Joe for a ride on his boat. ___ boat was blue with a red stripe." (A/The/[])]

  • 8. GOAL: UNDERSTAND HOW STUDENTS LEARN SOMETHING. [Same examples, with answers: 70 m²; x = 3; "The"]

  • 9. EX: GEOMETRY TUTOR. http://www.carnegielearning.com/

  • 10. STEP-LEVEL DATA: Record right/wrong, timing, use of hints, ... (one step = fill in a box)

  • 11. SIMPLEST MODEL: RASCH / 1-PARAMETER ITEM RESPONSE THEORY: ln(p_t / (1 - p_t)) = α_{i_t} + β_{j_t}, where α = student mean (knowledge level), β = item mean (easy/difficult), i_t = student ID, j_t = step ID, p_t = P(correct answer)

  • 12. SIMPLEST MODEL: RASCH / 1-PARAMETER ITEM RESPONSE THEORY. [Matrix of students × steps in tutor; each entry: does student i get step j right?] Predict 1 if α_i + β_j > 0 (see the code sketch following slide 19)

  • 13. STRUCTURE: SIMILARITY AMONG STEPS: Learn a "step map": each point = 1 step; steps over here are more similar to each other than to steps over there

  • 14. STRUCTURE: SIMILARITY AMONG STEPS: Learn a "step map": each point = 1 step; steps over here are more similar to each other than to steps over there. "Hic sunt dracones" (here be dragons)

  • 15. HOW PRINCIPAL COMPONENTS ANALYSIS GOT FAMOUS. [Matrix of users × movies; each entry: how many stars does user i give to movie j?]

  • 16. RESULT OF FACTORING: [users × movies matrix ≈ basis weights × basis vectors] Low-dimensional basis = latent variables! Basis vectors represent latent properties of movies, e.g., "is a comedy"

  • 17. IN OUR CASE (STUDENT-STEP DATA): [students × steps matrix ≈ basis weights × basis vectors] Basis vectors are candidate "eigenskills"; weights are students' knowledge levels

  • 18. DOES IT WORK? [Step map: steps about pentagons; steps about circles; other steps] Learned features let us predict held-out data better than chance (correlation = .3, p < 0.0001)

  • 19. DOES IT WORK? [Step map as above] Learned features let us predict held-out data better than chance (correlation = .3, p < 0.0001). Yes, sort of.
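The Rasch model on slide 11 and the factoring idea on slides 15-17 are easy to make concrete. Below is a minimal sketch in Python that fits Rasch as a logistic regression over one-hot student and step indicators, then factors the student-by-step response matrix into candidate eigenskills. The simulated data, variable names, and sklearn defaults (e.g., its L2 regularization) are my own illustrative assumptions, not the talk's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_students, n_steps = 59, 370            # sizes quoted on slide 22
alpha = rng.normal(size=n_students)      # true student knowledge levels
beta = rng.normal(size=n_steps)          # true step easiness

# one simulated attempt per (student, step) pair
students = np.repeat(np.arange(n_students), n_steps)
steps = np.tile(np.arange(n_steps), n_students)
p = 1 / (1 + np.exp(-(alpha[students] + beta[steps])))
correct = rng.random(p.size) < p

# design matrix: one-hot student ID followed by one-hot step ID
X = np.zeros((p.size, n_students + n_steps))
X[np.arange(p.size), students] = 1
X[np.arange(p.size), n_students + steps] = 1

model = LogisticRegression(max_iter=1000).fit(X, correct)
alpha_hat = model.coef_[0, :n_students]  # recovered knowledge levels
print("corr(true, estimated student params):",
      round(float(np.corrcoef(alpha, alpha_hat)[0, 1]), 2))
# slide 12's decision rule: predict "correct" when alpha_i + beta_j > 0

# slides 15-17: factoring the student-by-step matrix gives candidate
# "eigenskills" (right singular vectors) and per-student weights
R = correct.reshape(n_students, n_steps).astype(float)
U, S, Vt = np.linalg.svd(R - R.mean(0), full_matrices=False)
eigenskills = Vt[:5]                     # top-5 basis vectors over steps
```

Treating Rasch as one big logistic regression is a standard trick: each α_i and β_j becomes a coefficient, so an off-the-shelf solver recovers the student and step parameters (up to a shared constant shift).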
  • 20. STRUCTURE: PRACTICE MAKES PERFECT: PCA ignores step order, which is clearly wrong. Add a model of student learning to PCA, based on the additive factor model [Draney et al., 1995] (see the code sketch following slide 34)

  • 21. STRUCTURE: PRACTICE MAKES PERFECT: PCA ignores step order, which is clearly wrong. Add a model of student learning to PCA, based on the additive factor model [Draney et al., 1995]. Result: predictions of held-out data get slightly better, correlation = .45 (p < 0.01 vs. plain PCA). Step map still looks the same. Meh.

  • 22. WHAT WE REALLY WANT: To be understandable to us humans, latents need to be sparse and binary ("is about circles", "requires subtracting areas"). Can't do this fully automatically from this small data set (only 59 students, 370 steps). Challenge: can we discover sparse, binary, understandable latents automatically from MOOC-scale data?

  • 23. KC HYPOTHESIS: Knowledge comes in atomic units (KCs). Each KC is learned independently (no transfer); transfer among steps is mediated by common KCs or prerequisite structure (can't learn algebra without knowing arithmetic). Each student has a (latent, scalar) proficiency level for each KC; learn/forget = transition to a higher/lower proficiency level. Learning a KC happens only through exposure to that KC: problem, worked example, lecture, real life, ... [Example: step 17: {A, B}; step 23: {A, C}] [Koedinger, Corbett, Perfetti. Cognitive Science, 2012]

  • 24. CONSEQUENCES OF KC HYPOTHESIS: Mistakes are at the KC level (select wrong KC; apply right KC to wrong data; mistake in application of a KC), so identifying the KC at fault makes it easier to give student feedback. If we can accurately determine the list of KCs and label instructional activities by KCs, then we immediately know the quality/coverage of our content.

  • 25. COMPOSE-BY-ADDITION [Stamper & Koedinger, AIED 2011]

  • 26. COMPOSE-BY-ADDITION [Stamper & Koedinger, AIED 2011]

  • 27. WHY ARE SOME COMPOSE-BY-ADDITION STEPS HARDER?

  • 28. WHY ARE SOME COMPOSE-BY-ADDITION STEPS HARDER? [Example problems labeled hard / easy / medium]

  • 29. WHY ARE SOME COMPOSE-BY-ADDITION STEPS HARDER? [Example problems labeled hard / easy / medium]

  • 30. HYPOTHESIS: DIFFERENCE IS IN HOW MUCH PLANNING IS NEEDED. [New KCs: plan-to-compose and subtract, alongside compose-by-addition]

  • 31. KC DISCOVERY. [Excerpt, Stamper & Koedinger, AIED 2011: "Other problems were unscaffolded and did not start with such subgoals; thus students had to pose these subgoals themselves. Indeed the blips for compose-by-addition (seen in the learning curve in Figure 2) do correspond with a frequency of these unscaffolded problems."]

  • 32. USE DATA-DRIVEN MODEL TO REDESIGN TUTOR: New skill bars for the planning skills (skill bars are a tutor interface to show students where they are in acquiring skills). Sequence for a gentle slope: adaptive fading of scaffolding. New problems that focus on planning (next slide). [Old skill list: combine areas; enter given values; find regular area. New skill list: plan to combine areas; combine areas; subtract; enter given values; find regular area.]

  • 33. NEW PROBLEMS: ISOLATE PRACTICE ON PLANNING STEP: Decompose a complex problem into simpler ones

  • 34. NEW PROBLEMS: ISOLATE PRACTICE ON PLANNING STEP: Decompose a complex problem into simpler ones
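To make the additive factor model [Draney et al., 1995] from slides 20-21 concrete: the log-odds of a correct attempt is a student intercept, plus, for each KC the step exercises, a KC easiness term and a learning-rate term scaled by practice opportunities so far. The sketch below uses an invented Q-matrix (step-to-KC labels) and simulated practice data; every step and KC name in it is a placeholder, not from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_students, n_kcs = 30, 4
Q = {  # hypothetical Q-matrix: step name -> KCs it exercises
    "find-circle-area": [0], "find-pentagon-area": [1],
    "compose-by-addition": [2], "plan-to-compose": [2, 3],
}
steps = list(Q)

def afm_features(student, step, opportunities):
    """One AFM design-matrix row; opportunities[k] = prior practice of KC k."""
    x = np.zeros(n_students + 2 * n_kcs)
    x[student] = 1.0                                  # student intercept
    for k in Q[step]:
        x[n_students + k] = 1.0                       # KC easiness
        x[n_students + n_kcs + k] = opportunities[k]  # KC learning slope
    return x

# simulate practice sequences where success rises with practice, then fit
rng = np.random.default_rng(1)
X, y = [], []
for s in range(n_students):
    opp = np.zeros(n_kcs)
    for step in rng.choice(steps, size=40):
        X.append(afm_features(s, step, opp))
        p = 1 / (1 + np.exp(-(-1.0 + 0.3 * opp[Q[step]].sum())))
        y.append(rng.random() < p)
        for k in Q[step]:
            opp[k] += 1

model = LogisticRegression(max_iter=2000).fit(np.array(X), np.array(y))
slopes = model.coef_[0, n_students + n_kcs:]
print("estimated per-KC learning rates:", slopes.round(2))
```

The key difference from plain Rasch/PCA is the opportunity-count feature: it lets the model capture that performance on a step improves with practice on its KCs, which is exactly the step-order information PCA throws away.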
  • 35. RESULTS: More efficient: 25% less student time (instructional time by step type). Better learning of planning skills (post-test % correct by item type). [Figure from Stamper & Koedinger, AIED 2011. Caption excerpt: students using the redesigned tutor reached mastery while actually spending more time on the critical planning steps, and learned these decomposition skills, as demonstrated by better performance on decomposition problems.]

  • 36. MORE STRUCTURE: WHAT'S IN A KC? So far, each KC is just present or absent in a student or problem. Nothing to distinguish algebra KCs from ESL KCs. What's going on under the hood as a student solves a problem?

  • 37. RULE-BASED COGNITIVE MODEL: What does it look like inside the student's brain? Maybe a rule-based system, in which case KCs might correspond to rules. (Declarative knowledge: "president of US is Obama", "constant C on LHS of equation E"; production rule: "move C to RHS of E".) Example: solve 3(2x - 5) = 9. Candidate KCs: IF GOAL IS SOLVE a(bx + c) = d THEN REWRITE AS abx + ac = d, giving 6x - 15 = 9; IF GOAL IS SOLVE a(bx + c) = d THEN REWRITE AS bx + c = d/a, giving 2x - 5 = 3; and the bug IF GOAL IS SOLVE a(bx + c) = d THEN REWRITE AS abx + c = d, giving 6x - 5 = 9.

  • 38. RULE-BASED SYSTEM: Aka production system: declarative knowledge is held in working memory; production rules match declarative knowledge and act on WM or the external world. (E.g., WM holds "I see 3x + 5 = 8"; rule: "if LHS has constant C then subtract C from both sides".) Much cognitive modeling work endorses this claim explicitly or implicitly: ACT-R, SimStudent, Russell & Norvig, ... But two problems: uncertainty handling and representation learning. Here's where more ML research can help! (See the toy production system following slide 45.)

  • 39. PROBLEM 1: UNCERTAINTY. A DAY IN THE LIFE OF A RAT. [Table: Bell? / Light? / Food? for trials 1-4]

  • 40. RAT AS BAYESIAN: Priors over: how many trial types, sparsity of connections, reliability of connections, ... (This is a common architecture for medical diagnosis systems.) [Diagram: latent trial types 1 and 2 over observables bell, light, food]

  • 41. QUIZ: ARE YOU SMARTER THAN A RAT? [Table: Bell? / Light? / Food? for trials 1-4, extended to trial 100: predict the missing entries]

  • 42. QUIZ: ARE YOU SMARTER THAN A RAT? [Table extended to trials 100 and 101]

  • 43. AND THE RAT SAYS: Both right! With more light-bell trials, evidence increases for a separate trial type. [Table comparing the two effects. Second-order conditioning: many light-food trials, few bell-light trials; test: does bell predict food? Conditioned inhibition: many light-food trials, many bell-light trials; test: does bell predict food?]

  • 44. BAYESIAN RULE LEARNING IN CLASSICAL CONDITIONING: Only fully Bayesian inference/learning captured both effects [Courville, Daw, Gordon, Touretzky, NIPS 2003]. Few bell-light trials, 1 trial type: (bell, light, food) all associated. More trials: (bell, light, no food) vs. (light, food, no bell). [Plot: P(food | bell) and P(food | light) as a function of the number of bell-light trials] (A Bayesian model-comparison sketch also follows slide 45.)

  • 45. PROBLEM 2: REPRESENTATION LEARNING. Flaw with KC = rule: many bugs come from weak features
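Slides 37-38 describe KCs as production rules. Here is a toy production system in that spirit: working memory holds an equation string, and each production (candidate KC) pattern-matches and rewrites it. The equation encoding and rule set are my own simplification of the slides, not the tutor's actual model; swapping in the buggy distribute rule from slide 37 reproduces a plausible student error.

```python
import re

EQ = r"(-?\d+)\((-?\d+)x([+-]\d+)\)=(-?\d+)"   # matches a(bx+c)=d

def distribute(eq):
    """KC: IF goal is solve a(bx+c)=d THEN rewrite as abx + ac = d."""
    m = re.fullmatch(EQ, eq)
    if m:
        a, b, c, d = map(int, m.groups())
        return f"{a*b}x{a*c:+d}={d}"

def buggy_distribute(eq):
    """Buggy KC from slide 37: forgets to multiply c (abx + c = d)."""
    m = re.fullmatch(EQ, eq)
    if m:
        a, b, c, d = map(int, m.groups())
        return f"{a*b}x{c:+d}={d}"

def move_constant(eq):
    """KC: if the LHS has a constant c, subtract c from both sides."""
    m = re.fullmatch(r"(-?\d+)x([+-]\d+)=(-?\d+)", eq)
    if m:
        b, c, d = map(int, m.groups())
        return f"{b}x={d - c}"

def divide_out(eq):
    """KC: if the equation is bx = d, divide both sides by b."""
    m = re.fullmatch(r"(-?\d+)x=(-?\d+)", eq)
    if m:
        b, d = map(int, m.groups())
        return f"x={d / b:g}"

def solve(state, rules):
    """Fire the first matching production until working memory is stable."""
    while True:
        nxt = next((r(state) for r in rules if r(state) is not None), None)
        if nxt is None:
            return state
        print(f"  {state} -> {nxt}")
        state = nxt

print("correct model:")
solve("3(2x-5)=9", [distribute, move_constant, divide_out])
print("buggy model (wrong distribute KC):")
solve("3(2x-5)=9", [buggy_distribute, move_constant, divide_out])
```

Running it prints the correct derivation 3(2x-5)=9 -> 6x-15=9 -> 6x=24 -> x=4, while the buggy model derives 6x-5=9 and ends at x=2.33333, exactly the kind of error a KC-level diagnosis could catch and explain to the student.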
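And a minimal sketch of the Bayesian treatment behind slides 39-44: compare the marginal likelihood of the conditioning trials under one latent trial type versus two, using Beta-Bernoulli marginals and exact enumeration over type assignments. The trial counts and uniform priors are invented for illustration; the qualitative point from slide 43 is that the log Bayes factor favoring a second trial type grows as bell-light trials accumulate.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

ALPHA = BETA = 1.0  # Beta(1,1) prior on each P(observable | trial type)

def log_beta_bernoulli(n1, n0):
    """Marginal likelihood of n1 ones and n0 zeros under Beta-Bernoulli."""
    return (gammaln(ALPHA + n1) + gammaln(BETA + n0)
            - gammaln(ALPHA + BETA + n1 + n0)
            - gammaln(ALPHA) - gammaln(BETA) + gammaln(ALPHA + BETA))

def log_evidence(trials, n_types):
    """log P(trials | n_types), summed over all latent type assignments."""
    T = len(trials)
    terms = []
    for assign in product(range(n_types), repeat=T):
        a = np.array(assign)
        lp = -T * np.log(n_types)  # uniform prior over assignments
        for k in range(n_types):
            rows = trials[a == k]
            for obs in range(trials.shape[1]):
                n1 = int(rows[:, obs].sum())
                lp += log_beta_bernoulli(n1, len(rows) - n1)
        terms.append(lp)
    terms = np.array(terms)
    m = terms.max()
    return m + np.log(np.exp(terms - m).sum())  # log-sum-exp

# columns: bell, light, food
light_food = [(0, 1, 1)] * 4                 # light predicts food
for n_bell_light in (2, 8):
    trials = np.array(light_food + [(1, 1, 0)] * n_bell_light)
    bf = log_evidence(trials, 2) - log_evidence(trials, 1)
    print(f"{n_bell_light} bell-light trials: "
          f"log Bayes factor (2 types vs. 1) = {bf:+.2f}")
```

The enumeration is exponential in the number of trials, so this only works for toy data; the NIPS 2003 paper's actual inference is more sophisticated, and this sketch is only meant to show the shape of the argument.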
  • 46. REPRESENTATION LEARNING: Some student errors come from failure to correctly interpret (internally represent) a problem. As a student sees more and more examples like 3x + 5 = 8, they get a better and better language model to explain them (build an internal representation), so some KCs must correspond to features of the improved language model.

  • 47. EXPERIMENT: Present algebra examples to a machine learning system. As part of learning, induce a language model (an unsupervised probabilistic context-free grammar) for algebra equations. Make the output of the language model (grammar nonterminals, e.g., SignedNumber) available as features of each example. Use these features in simulated problem solving to discover KCs. [Li, Cohen, Koedinger, Matsuda, 2010] (A toy version follows at the end of this transcript.)

  • 48. NEW COGNITIVE MODELS ARE MORE ACCURATE [Li, Cohen, Koedinger, Matsuda, 2010]

  • 49. OPEN RESEARCH QUESTION: Can we build a new generation of rule-based system that has rich uncertainty handling and integrated representation learning, and use it to help us model student learning?

  • 50. SUMMARY: A key contribution of machine learning to education will be to help understand the educational content we're creating and delivering. Essential idea: ML models of structured latent variables. Specifically, build and test hypotheses about the knowledge, procedures, and representations students use to solve problems (latents = KCs, rules, representations, strategies, ...). We need to link uncertainty handling (a traditional domain of ML) to the new, harder situations encountered in understanding student knowledge. Exciting time for research in ML and education!
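Finally, a toy version of slide 47's experiment: parse an algebra expression with a probabilistic context-free grammar and read off which nonterminals (e.g., SignedNumber) fire, as candidate features for KC discovery. In the cited work [Li, Cohen, Koedinger, Matsuda, 2010] the grammar is induced unsupervised from examples; the tiny hand-written NLTK grammar below is purely illustrative.

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

# a hand-written stand-in for the induced grammar of algebra expressions
grammar = PCFG.fromstring("""
    Expr -> Term '+' Term [0.5]
    Expr -> Term [0.5]
    Term -> SignedNumber Var [0.4]
    Term -> SignedNumber [0.4]
    Term -> Var [0.2]
    SignedNumber -> '-' Number [0.3]
    SignedNumber -> Number [0.7]
    Number -> '3' [0.4]
    Number -> '5' [0.3]
    Number -> '8' [0.3]
    Var -> 'x' [1.0]
""")
parser = ViterbiParser(grammar)

tokens = ['-', '3', 'x', '+', '5']        # the expression -3x + 5
tree = next(parser.parse(tokens))         # most probable parse
features = sorted({str(t.label()) for t in tree.subtrees()})
print(features)  # ['Expr', 'Number', 'SignedNumber', 'Term', 'Var']
```

Each nonterminal that appears in the parse becomes a binary feature of the example, which is what lets the downstream KC-discovery step distinguish, say, problems involving signed numbers from those that do not.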