Serendipitous Language Learning in Mixed Reality

Christian David Vazquez, MIT Media Lab, 75 Amherst Ave., Cambridge, MA 02139, USA, [email protected]

Megan Fu, MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139, USA, [email protected]

Afika Ayanda Nyati, MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139, USA, [email protected]

Takako Aikawa, MIT GSL, 77 Massachusetts Ave., Cambridge, MA 02139, USA, [email protected]

Alexander Luh, MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139, USA, [email protected]

Pattie Maes, MIT Media Lab, 75 Amherst Ave., Cambridge, MA 02139, USA, [email protected]

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). CHI’17 Extended Abstracts, May 6–11, 2017, Denver, CO, USA. Copyright © ACM 978-1-4503-4656-6/17/05. http://dx.doi.org/10.1145/3027063.3053098

Abstract
Mixed Reality promises a new way to learn in the world: blending holograms with our surroundings to create contextually rich experiences that find their place in our daily routine. Existing situated learning platforms limit the learner and implicitly enforce the designer’s intent by hard-coding content and predetermining what elements in the environment are actionable. In this paper, we define the framework of Serendipitous Learning in Mixed Reality as situated, incidental learning that occurs naturally in the user’s environment and stems from the learner’s curiosity. This framework is explored within the context of second language acquisition. We present WordSense, a Mixed Reality platform that recognizes objects in the learner’s vicinity and embeds holographic content that identifies the corresponding word, provides sentence and definition cues for practice, and displays dynamic audiovisual content that shows example usage. By employing markerless tracking and dynamically linked content, WordSense enables serendipitous language learning in the wild.

Author Keywords
Augmented Reality; Mixed Reality; Language Learning; Situated Learning; Computer Vision; Serendipitous Learning


ACM Classification Keywords
H.5.1 [Multimedia Information Systems]: Artificial, augmented, and virtual realities

Introduction
Learning a foreign language is a dynamic process that typically begins in the classroom and ends in the real world, where we eventually validate our proficiency by interacting with the environment and others. Oftentimes, learning opportunities occur in the wild—incidentally—as we explore our surroundings or come across new media content. More often than not, we lack the tools to harness these serendipitous learning opportunities that present themselves in the most innocuous of moments, such as during a walk through the park or the commute back home from work.

A promising way to mediate learning opportunities on the go is Augmented Reality (AR), a burgeoning area of research that aims to augment the user’s physical world with information. Applications range from enhancing workplace performance [21] and augmenting human capability [27] to immersive recreational [20] and educational activities [26]. AR paradigms align with constructivist learning theory [18] as a mechanism for enabling situated learning outside the classroom, where the student can control the experience through exploration.

AR empowers learning experiences by exploiting meaningful context, which has been found to increase retention of new material [2, 5]. Moreover, situated information in AR has been shown to improve memorization and recall [6]. Despite the role of context in AR, applications that truly exploit the learner’s context remain challenging to build due to limitations in sensing capabilities, in the ability to create meaning from sensor data, and in the modalities through which context is made actionable.

Technical limitations in AR also hinder its potential for learning. AR on mobile phones requires context switches that increase the learner’s cognitive load and hamper immersion. Instead of looking at an augmented world, users are prompted to look at the world through the screen of a smartphone. AR headsets often superimpose imagery on a heads-up display, attempting to blend content with reality. The lack of capable depth sensing, computer vision, and optics technologies on AR devices has resulted in a plethora of artificial experiences that pragmatically offer little more than a hands-free portable display. Emerging platforms for Mixed Reality (MR), such as Microsoft HoloLens [19], Meta 2 [4], and Magic Leap [14], promise what AR has so far lacked: a seamless blend between the real world and virtual information.

In this work, we propose the framework of Serendipitous Learning within MR: situated learning experiences that occur incidentally and empower the learner’s intent in the world. We present WordSense, a Serendipitous Learning MR platform for second language (L2) acquisition that automatically recognizes objects in the wearer’s vicinity and annotates them with their corresponding vocabulary words. The capabilities of the system are discussed in terms of the different embedding modalities: text, 3D models, audio, and video. The work closes with a discussion of the challenges and opportunities within the scope of the proposed framework, noting potential areas for future work and study.

Related Work
A number of works have focused attention on AR as a platform for situated language learning. Godwin-Jones [8] offers an overview of the AR technologies that have emerged in the language learning community over the last decade. We focus our discussion on two modalities for AR content: place-based and object-based.


Place-based AR leverages the importance of the user’s current locality to deliver relevant and actionable materials in places that hold educational potential. The HELLO platform presented by Liu [16] uses QR codes to enable tours that help students learn English as a foreign language (EFL). Similarly, Mentira [10] takes students to the Los Griegos neighborhood in Albuquerque, New Mexico to solve a murder mystery while they learn Spanish. This AR experience blends location cues, cultural exposure, and collaborative activities to engage students in situated learning through gamification—a recurrent element in the field of AR learning [15]. Although the aforementioned works can deliver powerful learning experiences, they are limited to the locations for which they have been engineered; a shortcoming that we address in our serendipitous platform.

Object-based AR aims to use more granular information in the user’s vicinity, often relying on cameras and sensors on the device to use objects around the learner as meaningful context. Despite existing work on object recognition and tracking in AR [7, 25], most language learning projects still focus on marker-based solutions. Wagner and Barakonyi [29] present a system that embeds flashcards with 3D models to create associations between Kanji characters and concepts. Hsieh and Lin [12] developed an augmented book with pop-up content to enhance the retention of new words. Santos et al. [24] recently introduced a platform for situated vocabulary learning that embeds text, 3D models, and audio content on objects in the learner’s vicinity.

Unlike the approach we present in this paper, these platforms rely on predefined content and marker-based tracking to deliver limited learning experiences that implicitly enforce the system architect’s intent by directing or influencing the learner’s actions. Hornecker and Dünser [11] found that limiting interactions in AR to enforce the designer’s expected usage model can lead to frustration and loss of engagement. Limiting the actionable elements in the learner’s experience precludes curiosity and exploration as motivation—powerful elements that can lead to more effective learning. With TagAlong [9], Greenwald et al. propose a crowdsourced solution for dynamic, markerless language learning that leverages a remote companion who can annotate the user’s environment to enable learning and remote collaboration. However, their platform limits learning to instances where a suitable companion is available, whereas we provide an autonomous solution.

Serendipitous Learning in Mixed Reality
Prior work on AR tools for language learning often relies on hard-coded content. That is, the learner is subjected to pre-established scenarios that lead to a pseudo-incidental learning experience. Outside of these carefully crafted instances, the developed technologies cannot accommodate the context that occurs naturally throughout the learner’s daily life. We call this learning pseudo-incidental because it does not really occur as a fortuitous accident or unknowingly, but as a staged episode of events of which the learner is aware. Many AR applications also require markers to identify places on which to embed learning content, further breaking the illusion that the learning experience happens "incidentally."

WordSense focuses on what we call Serendipitous Learning (SL) experiences in Mixed Reality. Serendipitous Learning refers to learning that occurs naturally in the learner’s environment and stems from the user’s curiosity; content is not fabricated or hardcoded, but instead is dynamically linked as an extension of the unique particularities of the learner’s surroundings. SL is therefore situated, incidental, and aligned with the pragmatics of constructivist learning theory. We highlight three interrelated features that define SL experiences in Mixed Reality: contextual affinity, uninhibited curiosity, and dynamically linked content.

Contextual Affinity
Serendipitous Learning in MR requires the system to understand the user’s environment and make it actionable as a means to enhance learning. Furthermore, the embedding is explicitly anchored in reality, placed in 3D space in such a way that it establishes a clear association between reality and content. This allows the experience to be engaging and immersive. Most importantly, it allows users to learn in real context; that is, it can establish powerful associations between objects or situations and the learning materials.

Figure 1: Object acts as a hub for multimedia embeddings.

Figure 2: WordSense system block diagram.

Figure 3: Objects are identified automatically and embedded with vocabulary words.

Uninhibited Curiosity
Serendipitous experiences endorse uninhibited curiosity. Instead of using markers or cues that limit and guide learning instances, the process should allow learners to decide and explore the things that truly interest them without feeling inhibited by the architect’s intent. This allows the experience to be truly incidental as opposed to choreographed. The learner should experience the notion, or at least the illusion, that any object or event in the environment is a potential source of learning material, diminishing the learning tool’s influence on the learner’s intent or actions. Experiences are therefore learner-centered, with interaction initiated by the learner’s autonomous exploration of the world.

Dynamically Linked Content
We define Dynamically Linked Content (DLC) as referenced multimedia information that is linked through elements in the learner’s environment. This means that learning material is not generated or manipulated by the application itself, but queried using context. DLC has an interesting characteristic: the retrieved content is unknown—even to the architects of the learning experience—offering an element of surprise that can act as a powerful retrieval cue and fuel the learner’s motivation to explore. Furthermore, the content is abundant—an extension of the web’s pool of knowledge, re-purposed for learning.

WordSense: Vocabulary Learning in the Wild
In this section we describe WordSense, a Mixed Reality platform developed to facilitate dynamic, markerless embedding of content on physical objects for vocabulary learning. WordSense aims to enable Serendipitous Learning experiences by harnessing the effects of situated content on the memorization and association of new words. The system was prototyped on the Microsoft HoloLens for its depth sensing capabilities, which allow us to seamlessly blend reality with content to achieve contextual affinity. The client connects to a remote server hosted on the Amazon Web Services platform to obtain dynamically linked content from multiple sources and display it on the object, which acts as a hub for multimedia embeddings (Fig. 1). Figure 2 shows how the different components of the system interact to make this happen.
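The paper does not include implementation code; as a rough, self-contained illustration of this client-server split, the Python sketch below uses Flask (our assumption, not necessarily the authors’ server framework) and stubs out the recognition and content services that the following subsections describe:

from flask import Flask, request, jsonify

app = Flask(__name__)

def recognize(frame_bytes):
    # Stub: the deployed system forwards the frame to Google Cloud Vision
    # (see "Object to Word" below).
    return "apple", 0.92

def gather_content(word):
    # Stub: the deployed system queries translation, sentence, 3D model,
    # and video sources (see the remaining subsections).
    return {"translation": "りんご", "sentences": [], "clips": []}

@app.route("/annotate", methods=["POST"])
def annotate():
    # The HoloLens client POSTs a JPEG frame from its camera.
    word, score = recognize(request.data)
    payload = {"word": word, "confidence": score}
    payload.update(gather_content(word))
    # The client renders this payload as holographic embeddings on the object.
    return jsonify(payload)

if __name__ == "__main__":
    app.run()

Keeping recognition and content linking on the server side lets the headset stay a thin client, at the cost of the network latency discussed later.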

Object to Word
Within the scope of WordSense, the primary context consists of the objects in the learner’s vicinity. An image of the target object is captured using the HoloLens’ front-facing camera. The image is then forwarded to the Google Cloud Vision API (GCV), which offers a series of image analysis services. Using environment meshes generated from the HoloLens’ aggregated depth sensing data, we can display the corresponding vocabulary word embedded directly on top of the identified object (see Fig. 3). Although GCV returns text in English, nouns can be translated effectively by querying services such as Google Translate to target multiple languages.
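For concreteness, a minimal Python sketch of this step against the public GCV and Google Translate REST endpoints might look as follows (our reconstruction, not the deployed code; the API key and target language are placeholders):

import base64
import requests

GCV_URL = "https://vision.googleapis.com/v1/images:annotate"
TRANSLATE_URL = "https://translation.googleapis.com/language/translate/v2"
API_KEY = "YOUR_API_KEY"  # placeholder credential

def label_object(jpeg_bytes):
    """Return (label, confidence) for the most salient object in the frame."""
    body = {"requests": [{
        "image": {"content": base64.b64encode(jpeg_bytes).decode()},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
    }]}
    resp = requests.post(GCV_URL, params={"key": API_KEY}, json=body).json()
    top = resp["responses"][0]["labelAnnotations"][0]  # highest-scoring label
    return top["description"], top["score"]

def translate_word(word, target="ja"):
    """Translate the English label into the learner's L2 (Japanese here)."""
    resp = requests.get(TRANSLATE_URL, params={
        "key": API_KEY, "q": word, "source": "en", "target": target,
    }).json()
    return resp["data"]["translations"][0]["translatedText"]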


Word to Object
Optical Character Recognition (OCR) is also provided by GCV. A learner can observe written content in their environment and scan it. The recognized text is then used to query an open-source database of 3D models [28]. These models are processed on the remote server and delivered to the Microsoft HoloLens as 3D meshes, generating 3D content on the fly directly in front of the text, as shown in Figure 4.
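A sketch of this reverse path, assuming the same GCV key as above (fetch_model is a hypothetical stand-in for the model database lookup; we do not reproduce the real Clara.io [28] API here):

import base64
import requests

def ocr_text(jpeg_bytes, api_key):
    """Extract written text from the scanned region using GCV's OCR feature."""
    body = {"requests": [{
        "image": {"content": base64.b64encode(jpeg_bytes).decode()},
        "features": [{"type": "TEXT_DETECTION"}],
    }]}
    resp = requests.post("https://vision.googleapis.com/v1/images:annotate",
                         params={"key": api_key}, json=body).json()
    annotations = resp["responses"][0].get("textAnnotations", [])
    # The first annotation aggregates all text detected in the image.
    return annotations[0]["description"].strip() if annotations else ""

def fetch_model(word):
    """Hypothetical lookup of a 3D mesh for `word` in an open model
    database such as Clara.io [28]; placeholder only."""
    return None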

Sentences and Definitions
Word embeddings can be further enriched by dynamically linking to an example sentence and definition database. A number of sentence and definition databases are publicly available in multiple languages. In this project, we select entries arbitrarily from the pool of sentences available in the Tatoeba database [23]. Figure 5 shows how the queried content is displayed to the user above the embedded vocabulary word.
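A minimal sketch of this selection, assuming a locally downloaded Tatoeba export (the corpus ships as tab-separated "id, language, text" lines; the arbitrary selection in the paper is mirrored here with random.choice):

import random

def example_sentences(word, lang="eng", path="sentences.csv"):
    """Collect Tatoeba sentences in `lang` that contain `word`."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t", 2)
            if len(parts) != 3:
                continue  # skip malformed lines
            sent_id, sent_lang, text = parts
            if sent_lang == lang and word.lower() in text.lower():
                hits.append(text)
    return hits

candidates = example_sentences("apple")
print(random.choice(candidates) if candidates else "no example found")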

Figure 4: 3D models can be embedded on text to reinforce associations.

Figure 5: Sentences embedded on the target object.

Figure 6: Video from famous movies is dynamically fetched to show usage of the encountered word.

Video Clips
WordSense can fetch a clip that portrays the usage of a word within cinematic content (Fig. 6). A database of short video clips of famous movie quotes [17] is queried using the word to identify the moment at which the new vocabulary is spoken. The clip is then fetched and streamed above the object. When multiple clips exist for a particular vocabulary word, the selection is randomized.
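The quote database’s interface is not documented in the paper, so the sketch below stubs it out; only the randomization over multiple matches follows the text:

import random

def search_quotes(word):
    """Hypothetical stand-in for the movie-quote database query [17]; a real
    client would return clip URLs plus the timestamp where `word` is spoken."""
    return []  # placeholder result set

def pick_clip(word):
    clips = search_quotes(word)
    # When several clips match, the embedded clip is chosen at random.
    return random.choice(clips) if clips else None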

Audio and Speech
Speech synthesis APIs are available for multiple languages, so connecting a recognized word to its pronunciation is a straightforward process. Furthermore, we explored several databases that contain recorded pronunciations of words in different languages. As a result, WordSense allows learners to hear and practice the pronunciation of newly encountered vocabulary.
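As an offline stand-in for the cloud speech-synthesis services (the paper does not name a specific provider), a pronunciation sketch with the pyttsx3 library:

import pyttsx3

def pronounce(word, rate=120):
    """Speak `word` aloud; a slower speaking rate aids learners."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)
    engine.say(word)
    engine.runAndWait()

pronounce("serendipity")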

Review Interface
Every object targeted by the WordSense application is stored for future on-the-move review. An image thumbnail is stored alongside the corresponding vocabulary word in a schema-less database offered by Google’s Firebase platform. A learner can access this content flashcard-style to reinforce retention of new vocabulary (see Fig. 7).
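A sketch of this storage step via the Firebase Realtime Database REST interface (the project URL and record schema are our placeholders; the paper only states that a thumbnail and word are stored):

import base64
import time
import requests

FIREBASE_URL = "https://wordsense-demo.firebaseio.com/encounters.json"  # placeholder

def store_encounter(word, translation, thumbnail_jpeg):
    """Append one reviewed item; POST stores it under an auto-generated
    key, matching the schema-less design."""
    record = {
        "word": word,
        "translation": translation,
        "thumbnail": base64.b64encode(thumbnail_jpeg).decode(),
        "timestamp": time.time(),
    }
    requests.post(FIREBASE_URL, json=record)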

Discussion
The effect of glosses on second language (L2) vocabulary acquisition has been explored in many works [30, 13, 1], which generally agree that providing meaningful multimedia annotations within the context of reading material has a positive effect on the retention of new words. Dual Coding Theory [3] supports this notion by arguing that information is encoded in verbal and non-verbal cognitive processes. Encoding information in multiple media (e.g., images and text) and strengthening the referential connections between these formats increases the probability of the learner recalling new vocabulary [22].

Mixed Reality embeddings can be thought of as an analogue of glosses within reading materials. Normally, a gloss consists of a definition associated with a new concept (a vocabulary word) and links to a relevant context (a sentence in the reading material). In Mixed Reality, the embedded vocabulary is connected to its conceptual definition (the real-world object), establishing a strong association of the word within the user’s reality (the relevant context).

WordSense’s video clip embeddings are a good example of serendipitous learning in Mixed Reality. Since object recognition is employed, the system requires no markers to identify actionable elements in the learner’s environment. Given the plurality of captioned video content on the web, obtaining video that contains usage of a word is straightforward and creates the illusion that any object in the environment is a source of learning material.

Many challenges exist within the framework of Serendipitous Learning in Mixed Reality. Context awareness is still limited by machine learning and sensor technologies, hindering the capacity for contextual affinity under certain circumstances. Within WordSense, the limitations of object recognition manifest in two ways: an object is labeled incorrectly, or an object is classified loosely (e.g., a cat might be identified as a mammal). A temporary solution to these problems is to display the associated words in both L1 and L2, allowing the learner to decide whether or not to trust the application. Because content is linked dynamically, the system might present embeddings that are not relevant—or worse—confuse learners by presenting erroneous content. Furthermore, since content is hosted remotely, latency can be a problem under unfavorable network conditions.
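A design sketch of the L1/L2 fallback just described (the confidence threshold and the set of "loose" labels are illustrative assumptions, not values from the paper):

# Labels that are technically correct but too generic to teach from.
GENERIC_LABELS = {"mammal", "object", "product", "food"}  # illustrative only

def format_embedding(label, translation, score, threshold=0.8):
    """Show both L1 and L2 when recognition may be wrong or overly loose,
    letting the learner judge whether to trust the annotation."""
    if score < threshold or label.lower() in GENERIC_LABELS:
        return f"{translation} ({label}?)"  # surface the L1 label alongside L2
    return translation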

Serendipitous Learning’s shortcomings are part of a tradeoff. The techniques presented in this work offer a more flexible approach that is scalable in terms of content and offers a truer modality for incidental, situated learning. Nevertheless, this flexibility comes at the cost of robustness. SL experiences are not meant as a substitute for traditional learning methods, but should instead complement traditional curricula.

Figure 7: Review Interface to practice new vocabulary.

Conclusion
In this work we presented the framework of Serendipitous Learning: learning that occurs in the world, without predefined content, and uninhibited by the architect’s intent. WordSense was introduced as a Mixed Reality application that enables serendipitous learning experiences for second language vocabulary learning. By combining object recognition with the depth sensing capabilities of the Microsoft HoloLens, we were able to embed real-world objects with their respective words without the need for markers or predefined content. We explored how object recognition allows us to enhance embedded words with audio, 3D models, video, and text, enabling the learner to create meaningful associations between new vocabulary and concepts. Finally, we discussed the affordances and challenges of SL through the scope of WordSense, highlighting its potential to enable learner-centered activities in the wild.

Future Work
Moving forward, a series of user studies measuring the acquisition of new vocabulary through traditional and SL modalities would allow us to understand the advantages and shortcomings of the proposed framework. The capabilities of WordSense should be measured in depth to identify the rate of failure in terms of three factors: inability to identify objects, retrieval of incorrect content, and retrieval of content that matches a word but not its intended meaning. A user model should be implemented to track the history of encountered vocabulary, allowing the system to provide sentences or cues that build on learned words to teach more elaborate concepts. Finally, a body of work remains in understanding how Serendipitous Learning in MR can enable more modalities for interactivity by introducing social and collaborative elements.

Acknowledgments
This work was a collaboration between MIT Global Studies and Languages, the MIT Media Laboratory, and Kanda University of International Studies (KUIS), Japan. We would like to thank KUIS for its continuing support of this project. We also thank Louisa Rosenheck, Mina Khan, and Sangwon Leigh for insightful discussions regarding the topics presented in this work.


References
[1] Samir Al Jabri. 2009. The effects of L1 and L2 glosses on reading comprehension and recalling ideas by Saudi students. (2009).

[2] Benedict Carey. 2014. How We Learn: The Surprising Truth About When, Where, and Why It Happens. Pan Macmillan.

[3] James M Clark and Allan Paivio. 1991. Dual coding theory and education. Educational Psychology Review 3, 3 (1991), 149–210.

[4] Meta Company. 2017. Meta. (2017). https://www.metavision.com/.

[5] Matthew H Erdelyi and Jeff Kleinbard. 1978. Has Ebbinghaus decayed with time? The growth of recall (hypermnesia) over days. Journal of Experimental Psychology: Human Learning and Memory 4, 4 (1978), 275.

[6] Yuichiro Fujimoto, Goshiro Yamamoto, Hirokazu Kato, and Jun Miyazaki. 2012. Relation between location of information displayed by augmented reality and user’s memorization. In Proceedings of the 3rd Augmented Human International Conference. ACM, 7.

[7] Stephan Gammeter, Alexander Gassmann, Lukas Bossard, Till Quack, and Luc Van Gool. 2010. Server-side object recognition and client-side object tracking for mobile augmented reality. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 1–8.

[8] Robert Godwin-Jones. 2016. Emerging Technologies: Augmented Reality and Language Learning: From Annotated Vocabulary to Place-Based Mobile Games. Language Learning & Technology 20, 3 (2016), 9–19.

[9] Scott W Greenwald, Mina Khan, Christian D Vazquez, and Pattie Maes. 2015. TagAlong: Informal Learning from a Remote Companion with Mobile Perspective Sharing. In 12th International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2015).

[10] Chris Holden and Julie Sykes. 2012. Mentira: Prototyping language-based locative gameplay. In Mobile Media Learning. Springer-Verlag, 111–130.

[11] Eva Hornecker and Andreas Dünser. 2009. Of pages and paddles: Children’s expectations and mistaken interactions with physical–digital tools. Interacting with Computers 21, 1–2 (2009), 95–107.

[12] Min-Chai Hsieh and Hao-Chiang Koong Lin. 2006. Interaction design based on augmented reality technologies for English vocabulary learning. In Proceedings of the 18th International Conference on Computers in Education, Vol. 1. 663–666.

[13] Jan H Hulstijn, Merel Hollander, and Tine Greidanus. 1996. Incidental vocabulary learning by advanced foreign language students: The influence of marginal glosses, dictionary use, and reoccurrence of unknown words. The Modern Language Journal 80, 3 (1996), 327–339.

[14] Magic Leap Inc. 2017. Magic Leap. (2017). https://www.magicleap.com/.

[15] Eric Klopfer and Kurt Squire. 2008. Environmental Detectives—the development of an augmented reality platform for environmental simulations. Educational Technology Research and Development 56, 2 (2008), 203–228.

[16] T-Y Liu. 2009. A context-aware ubiquitous learning environment for language listening and speaking. Journal of Computer Assisted Learning 25, 6 (2009), 515–527.

[17] Bill MacDonald. 2017. Quotacle. (2017). http://quotacle.com/.


[18] Jorge Martín-Gutiérrez, José Luís Saorín, Manuel Contero, Mariano Alcañiz, David C Pérez-López, and Mario Ortega. 2010. Design and validation of an augmented book for spatial abilities development in engineering students. Computers & Graphics 34, 1 (2010), 77–91.

[19] Microsoft. 2017. Microsoft HoloLens. (2017). https://www.microsoft.com/microsoft-hololens/en-us.

[20] Niantic. 2017. Catch Pokémon in the real world with Pokémon GO! (2017). http://www.pokemongo.com/.

[21] SK Ong, Y Pang, and AYC Nee. 2007. Augmented reality aided assembly design and planning. CIRP Annals-Manufacturing Technology 56, 1 (2007), 49–52.

[22] Allan Paivio. 2006. Dual coding theory and education. In The Conference on Pathways to Literacy Achievement for High Poverty Children. 1–20.

[23] Tatoeba project community. 2017. Tatoeba. (2017). https://tatoeba.org/eng.

[24] Marc Ericson C Santos, Takafumi Taketomi, Goshiro Yamamoto, Ma Mercedes T Rodrigo, Christian Sandor, Hirokazu Kato, and others. 2016. Augmented reality as multimedia: the case for situated vocabulary learning. Research and Practice in Technology Enhanced Learning 11, 1 (2016), 1.

[25] Rodrigo LS Silva, Paulo S Rodrigues, Diego Mazala, and Gilson Giraldi. 2004. Applying Object Recognition and Tracking to Augmented Reality for Information Visualization. Technical Report. LNCC, Brazil.

[26] Kurt D Squire and Mingfong Jan. 2007. Mad City Mystery: Developing scientific argumentation skills with a place-based augmented reality game on handheld computers. Journal of Science Education and Technology 16, 1 (2007), 5–29.

[27] Thad Starner, Steve Mann, Bradley Rhodes, Jeffrey Levine, Jennifer Healey, Dana Kirsch, Rosalind W Picard, and Alex Pentland. 1997. Augmented reality through wearable computing. Presence: Teleoperators and Virtual Environments 6, 4 (1997), 386–398.

[28] Exocortex Technologies. 2017. Clara.io: Online 3D Modeling, 3D Rendering, Free 3D Models. (2017). https://clara.io/.

[29] Daniel Wagner and Istvan Barakonyi. 2003. Augmented Reality Kanji Learning. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR ’03). IEEE Computer Society, Washington, DC, USA, 335. http://dl.acm.org/citation.cfm?id=946248.946816

[30] Makoto Yoshii. 2006. L1 and L2 glosses: Their effects on incidental vocabulary learning. Language Learning & Technology 10, 3 (2006), 85–101.
