Overview: Requirements for implementing the AARDVARC vision

Gary Simons
SIL International

AARDVARC Workshop, 9-11 May 2013, Ypsilanti, MI

The context

A cross-cutting, NSF-wide initiative called Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21)
Vision statement: "CIF21 will provide a comprehensive, integrated, sustainable, and secure cyberinfrastructure to accelerate research and education and new functional capabilities in computational and data-intensive science and engineering, thereby transforming our ability to effectively address and solve the many complex problems facing science and society."

The funding program

The AARDVARC grant was awarded by NSF's program on Building Community and Capacity for Data-Intensive Research in the Social, Behavioral, and Economic Sciences and in Education and Human Resources (BCC-SBE/EHR)
"We seek to enable research communities to develop visions, teams, and prototype capabilities dedicated to creating and utilizing innovative and large-scale data resources and relevant analytic techniques to advance fundamental research for the SBE and EHR areas of research."

A three-stage program

Stage 1: Funded projects focus on bringing together cross-disciplinary communities to work on the design of cyberinfrastructure for data-intensive research. [2012 and 2013]
Stage 2: A selection (perhaps one-fourth) of these communities will be funded to develop prototypes of the facilities designed in Stage 1. [Beginning 2014, funding permitting]
Stage 3: An even smaller number of projects will be funded to develop the actual facility.

Roadmap for current project

The competition will be fierce across a wide range of disciplines.
In order to succeed in the second stage of the program, we must write a top-25% proposal.
Can we put ourselves in the shoes of potential reviewers and anticipate what the likely critiques of an AARDVARC implementation proposal might be?
If so, that could help us set an agenda for the problems we should be working on during the course of the current project.

Fast forward to implementation

The current AARDVARC proposal is not an implementation proposal
However, reading it through that lens sheds light on what would need to be addressed if it were
Reading the proposal in this way, I have imagined four show-stopping reviewer critiques that we want to be sure to avoid
This presentation discusses the requirements for an implementation proposal that would avoid these critiques

Critiques we want to avoid

The focus seems too narrow to be truly transformative.
The issues of sustainability are not adequately addressed.
It is not clear that automatic transcription of under-resourced languages is even possible.
There is not an adequate story about how the community will work on a large scale to fill the repository.

1. Find the right framing

Vision of CIF21: "transform our ability to effectively address and solve the many complex problems facing science and society"
Potential critique: The AARDVARC focus seems too narrow to be truly transformative.
Requirement: A successful proposal will need to frame the proposed cyberinfrastructure in terms that non-linguists will embrace as truly transformative.

Problem

The name AARDVARC frames the problem in terms of a repository for automatically annotated video and audio resources
Among non-linguists, is a framing in terms of automatic annotation likely to rise to the top 25% of cross-cutting problems?
Probably not, since solving the transcription bottleneck puts the focus on a means to the end, rather than the end itself
The true end is having a repository of data from every language

A more compelling framing

The AARDVARC name fails to name the main thing: language
The most fundamental problem for data-intensive research in the 21st century is that we lack a repository of interoperable data from every human language
Among non-linguists, would a framing like that rise to the top 25% of cross-cutting problems?
This seems much more likely
And others have already laid some groundwork

Human Language Project

Building by analogy to the Human Genome Project, Abney and Bird have proposed a Human Language Project to the computational linguistics community:
"We present a grand challenge to build a corpus that will include all of the world's languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics." (Abney and Bird 2010)
In two conference papers, they have argued the motivation for the project and specified basic formats for data

Language Commons

Building on the commons tradition, Bice, Bird, and Welcher have spearheaded the Language Commons
"The Language Commons is an international consortium that is creating a large collection of written and spoken language material, made available under open licenses. The content includes text and speech corpora, along with translations, lexicons and other linguistic resources that support large-scale investigation of the world's languages."
Currently an open collection in the Internet Archive
Browse: http://archive.org/details/LanguageCommons
Submit: http://upload.languagecommons.org/

We need to join forces

AARDVARC, Human Language Project, and the Language Commons are variations on the same fundamental vision
  A repository of interoperable data from every human language
Facing fierce competition with other disciplines
  We are too small to have competing visions; we need a single vision that others will find compelling
For an implementation proposal, we should all join forces to create a grand vision of cyberinfrastructure for language-related research in the 21st century that will embrace every language

References

The Human Language Project: Building a universal corpus of the world's languages. Steven Abney and Steven Bird. 2010. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 88-97, Uppsala, Sweden.
Towards a data model for the Universal Corpus. Steven Abney and Steven Bird. 2011. Proceedings of the 4th Workshop on Building and Using Comparable Corpora, 120-127, Portland, USA.
The Language Commons Wiki. Ed Bice and others. 2010. Presentation at Wikimania 2010, Gdańsk, Poland.
The Rosetta Project and The Language Commons. Laura Welcher. 2011. Presentation posted on The Long Now Foundation blog.

2. Ensure sustainability

Vision of CIF21: "provide a sustainable ... cyberinfrastructure"
Potential critique: The issues of sustainability are not adequately addressed.
Requirement: A successful proposal will need to give a convincing plan for the sustainability of the infrastructure and the resources it houses.

A repository is not enough

Simply building a repository does not ensure sustainability
It must also function as an archive that guarantees access far into the future
A huge NSF investment in the repository we envision would go to waste if it could not
  Continue operating after the grant money ran out
  Survive the inevitable upgrades to hardware and system software at the host institution
  Recover from a disaster (natural or institutional)

Non-use is also waste

Even deeper than the sustained functioning of a repository is the sustained use of the resources it houses
The huge investment would also go to waste if
  Resources deteriorate or slip into obsolete formats
  Potential users never discover relevant resources
  Users are unable to access discovered resources
  Users cannot make sense of resources they access
  Accessed resources are not compatible with the computational working environments of users

Conditions of sustainable use

A complete proposal would address the conditions of sustainable use (Simons & Bird 2008, sec. 3)
  Extant: Preserved through off-site backup, refreshing copies, format migration, fixity metadata
  Discoverable: Adequate descriptive metadata accessed through open and easy-to-use search
  Available: User has rights to access as well as a means of access
  Interpretable: Markup, encoding, abbreviations, terminology, methodologies are well documented
  Portable: File formats that are open (not proprietary) and work on all platforms

Checklist for responsible archiving

A good proposal would measure up against the criteria of the TAPS Checklist (Chang 2010, pp. 136-7)
Based on a review of mainstream tools for assessing archival practices, TAPS is a checklist of 16 points to help linguists evaluate whether a prospective home for their data will be a responsible archive
  Target: Are the mission and audience a good fit?
  Access: Will your audiences have adequate access?
  Preservation: Is the archive following best practices for ensuring long-term preservation?
  Sustainability: Is the institution well situated for the long term?

A repository or an aggregator?

Or should the infrastructure have an aggregator at the center rather than a single repository?
In today's web economy, being the aggregator (rather than a supplier) is the sweet spot (Simons 2007 paints a vision of such a cyberinfrastructure)
This would require community agreement on:
  Metadata standards (content, format, protocol): OLAC provides a starting point
  Data standards (contents, formats, protocols): Universal Corpus provides a starting point
Still needs a self-service default repository, e.g. Language Commons in Internet Archive

References

Toward a global infrastructure for the sustainability of language resources. Gary Simons and Steven Bird. 2008. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 20-22 November 2008, Cebu City, Philippines. Pages 87-100.
TAPS: Checklist for responsible archiving of digital language resources. Debbie Chang. 2010. MA thesis, Graduate Institute of Applied Linguistics. Dallas, TX.
Doing linguistics in the 21st century: Interoperation and the quest for the global riches of knowledge. Gary Simons. 2007. Proceedings of the E-MELD/DTS-L Workshop: Toward the Interoperability of Language Resources, 13-15 July 2007, Palo Alto, CA.

3. Focus on achievable automation

Purpose of BCC-SBE/EHR: "enable research communities to develop prototype capabilities"
Potential critique: It is not clear that automatic transcription of under-resourced languages is even possible.
Requirement: A successful proposal will need a compelling description of automated helps for annotation that can be implemented today.

The BCC-SBE/EHR vision

The Building Community and Capacity for Data-Intensive Research program is about activity in the present to support research in the future:
  Present activities: "We seek to enable research communities to develop visions, teams, and prototype capabilities"
  Present focus: "dedicated to creating and utilizing innovative and large-scale data resources and relevant analytic techniques"
  Future result: "to advance fundamental research for the SBE and EHR areas of research."

Setting the right target

Automated transcription of under-resourced languages is still in the future
  It is an advance in fundamental research that can be furthered by a data-intensive cyberinfrastructure
The follow-up proposal in the BCC program is an implementation proposal, not a research proposal
  It must focus on the automated helps for annotation that we can implement immediately
  It is not meant to be a request to support research on annotation tasks we cannot currently automate
  It should implement a framework into which we can plug the latter as that research comes to fruition

Sorting the tasks

During the AARDVARC project we should
  Identify annotation tasks that we can automate now
    Plan work modules for these in the proposed implementation grant
  Identify annotation tasks that are clearly in the future
    Pursue research grants on these through the normal research programs
    Implementation proposal would mention supplying data to future research as within its broader impacts
  Identify annotation tasks that are borderline
    Conduct proof-of-concept testing now to determine whether each belongs in the first set or the second set

Breaking the bottleneck

The repository should embrace all strategies for breaking the transcription bottleneck
  Focus on the end of data in every language, as opposed to a particular means for getting it
A promising new strategy is oral annotation
  Woodbury (2003) proposed this to turn a huge collection of tapes from 15 years of Cup'ik radio broadcasts into usable data
    Make running oral translations
    Do careful respeaking of hard-to-hear tapes
  This inspired the development of BOLD: Basic Oral Language Documentation

References

Defining documentary linguistics. Anthony Woodbury. 2003. In Peter Austin (ed.), Language Documentation and Description 1:35-51. London: SOAS.
The rise of documentary linguistics and a new kind of corpus. Gary Simons. 2008. Presented at 5th National Natural Language Research Symposium, De La Salle University, Manila, 25 Nov 2008.
Basic Oral Language Documentation. D. Will Reiman. 2010. Language Documentation and Conservation, Vol. 4, pp. 254-268.
A scalable method for preserving oral literature from small languages. Steven Bird. 2010. Proceedings of the 12th International Conference on Asia-Pacific Digital Libraries, 5-14, Gold Coast, Australia.
To BOLDly go where no one has gone before. Brenda Boerger. 2011. Language Documentation and Conservation, Vol. 5, pp. 208-233.

Example of respeaking

Original recording on first recorder
Careful respeaking on second recorder
  Original played back (with pauses) into left channel
  Respoken on mike into right channel
From fieldwork of Will Reiman on Kasanga [cji] language, Guinea-Bissau

A known best practice in field methods

"Instructions for the Recording of Linguistic Data", in Bouquiaux and Thomas (1976), trans. Roberts (1992). Studying and Describing an Unwritten Language. Dallas: Summer Institute of Linguistics.
"Go over this spontaneous recording, either with the narrator himself or with a qualified speaker, in order to have it repeated sentence by sentence, in a careful, relatively slow, yet normal manner, and to have it whistled (tone languages)." (p. 180)
Goes on to describe method using 2 tape recorders
This method may be even more essential today as we prepare recordings for automatic transcription

BOLD:PNG

A project led by Steven Bird; see www.boldpng.info
Trained university students to use low-cost digital recorders to go back to their home villages to make recordings and to annotate them orally
Problems:
  Managing all the files on all the recorders did not scale
  Two-recorder annotation was too complicated

Working on solutions

Language Preservation 2.0: Crowdsourcing Oral Language Documentation using Mobile Devices
http://lp20.org/
They have developed an Android app, Aikuma
  Files shared within community via Internet or local Wi-Fi hub; supports voting for what to release
  Annotate on a single device with a simple two-button tool
Blog post containing two demo videos from Bird's current field trip in the Amazon

4. Foster global collaboration

Purpose of BCC-SBE/EHR: "enable research communities to creat[e] new, large-scale, next-generation data resources"
Potential critique: There is not an adequate story about how the community will work on a large scale.
Requirement: A successful proposal will need a compelling account of how a global community of researchers, speakers, and citizen scientists will collaborate to fill the repository with annotated resources.

The real challenge

Building the repository is one thing, but filling it with resources from most languages will be quite another
  Funded staff will be able to implement the repository, but it will take thousands of volunteers to really fill it
Realizing the vision will depend on
  Mobilizing the research community to participate
  Mobilizing speaker communities to participate
  Mobilizing citizen scientists to participate
  Building an infrastructure that supports collaboration among all these players on a global scale

Resources as open-ended

Repository must support open-ended annotation
After initial deposit, other players should be able to
  Add careful respeaking
  Add a translation (either oral or written)
  Add a transcription (of text or of translation)
  Add a translation of the translation
  Invoke an automatic transcription or translation
  Check and revise the automatic output
Each addition should be a separate deposit (with its own metadata) that links back to what it annotates (i.e., stand-off markup)

Resource workflow

The types and languages of the complete set of annotations associated with a resource comprise the state of that resource
The annotation tasks are operators on that state
  Each annotation task has a prerequisite state
  Performing the task changes the state of the resource
This defines an implicit workflow
  For any resource, there is a set of possible next tasks
The infrastructure needs to manage that workflow

Supply and demand

We need to match up two things:
  The huge demand for annotation tasks to be done: all of the possible next tasks for all resources
  The supply of people worldwide who could do them
Our infrastructure needs to be a marketplace that matches supply with demand
  E.g., eBay, eHarmony, mTurk.com
  Match a user's language profile to find next tasks to do
E.g., TED's Open Translation Project using Amara
  Web tool to segment videos and add subtitles
  140 languages, ~10,000 translators, >50,000 translations

If we build it ... They won't necessarily come!

In addition to describing the infrastructure we would implement to match supply and demand, a compelling proposal would also:
  Describe the plans for organizing the people who participate (including governance)
  Describe plans for mobilizing the various target communities: researchers, speakers, citizens
  Describe incentives for participation, especially ones that are built into the design of the infrastructure

Conclusion

The AARDVARC project gives us the opportunity to build the vision and plans for a sustainable cyberinfrastructure to
  Collect and provide access to interoperable data resources from every human language
  Harness automation wherever possible to add the needed transcriptions and translations
  Create a marketplace that will permit thousands worldwide to collaborate in performing the annotation tasks that cannot be automated
Thus transforming our ability to address and solve language-related problems facing science and society
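
Illustration: resource workflow as code

The "Resource workflow" and "Supply and demand" ideas in section 4 can be made concrete with a small sketch. This is only one possible reading, not a design from the presentation: the class `Deposit`, the task catalogue `TASKS`, the functions `next_tasks`, `perform`, and `offers_for`, and the sample repository are all hypothetical names chosen for illustration. Each deposit is a stand-off record that links back to what it annotates; the kinds of deposits present make up the state of a resource; a task is offered when its prerequisite is present and its product is still missing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deposit:
    """One stand-off deposit with its own minimal metadata (hypothetical model)."""
    id: str
    kind: str                        # e.g. "recording", "respeaking"
    language: str                    # ISO 639-3 code
    annotates: Optional[str] = None  # id of the deposit it annotates; None for the original


# Hypothetical task catalogue: task name -> (prerequisite kind, produced kind).
TASKS = {
    "respeak": ("recording", "respeaking"),
    "transcribe": ("respeaking", "transcription"),
    "translate orally": ("recording", "oral translation"),
}


def next_tasks(deposits):
    """Possible next tasks: prerequisite is in the state, product is not yet."""
    kinds = {d.kind for d in deposits}
    return sorted(
        name for name, (needs, makes) in TASKS.items()
        if needs in kinds and makes not in kinds
    )


def perform(deposits, task, new_id, language=None):
    """Performing a task changes the state by adding one linked deposit."""
    needs, makes = TASKS[task]
    source = next(d for d in deposits if d.kind == needs)
    new = Deposit(new_id, makes, language or source.language, annotates=source.id)
    return deposits + [new]


def offers_for(user_languages, repository):
    """Match supply with demand: next tasks on resources in languages the user knows."""
    offers = []
    for deposits in repository.values():
        original = next(d for d in deposits if d.annotates is None)
        if original.language in user_languages:
            offers += [(original.id, task) for task in next_tasks(deposits)]
    return offers


# Sample data: "cji" is the Kasanga code used in the slides; "qaa" is a
# placeholder from the ISO 639 private-use range, not a real language.
repository = {
    "rec1": [Deposit("rec1", "recording", "cji")],
    "rec2": [
        Deposit("rec2", "recording", "qaa"),
        Deposit("rec2-r", "respeaking", "qaa", annotates="rec2"),
    ],
}

print(next_tasks(repository["rec1"]))
print(offers_for({"cji"}, repository))
```

A real infrastructure would need a richer state (for example, the target language of each translation) and a fuller user profile; the sketch only shows that the implicit workflow can be derived from the deposits themselves, without a separately maintained task list.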
