Overview: Requirements for implementing the AARDVARC vision. Gary Simons, SIL International. AARDVARC Workshop, 9–11 May 2013, Ypsilanti, MI.


Page 1: Overview: Requirements for implementing the AARDVARC vision (linguistlist.org/aardvarc/resources/Simons-AARDVARC-Vision.pdf)


Page 2: The context
- A cross-cutting, NSF-wide initiative called
  - Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21)
- Vision statement
  - “CIF21 will provide a comprehensive, integrated, sustainable, and secure cyberinfrastructure to accelerate research and education and new functional capabilities in computational and data-intensive science and engineering, thereby transforming our ability to effectively address and solve the many complex problems facing science and society.”

Page 3: The funding program
- AARDVARC grant was awarded by NSF’s program on Building Community and Capacity for Data-Intensive Research in the Social, Behavioral, and Economic Sciences and in Education and Human Resources (BCC-SBE/EHR)
  - We “seek to enable research communities to develop visions, teams, and prototype capabilities dedicated to creating and utilizing innovative and large-scale data resources and relevant analytic techniques to advance fundamental research for the SBE and EHR areas of research.”

Page 4: A three-stage program
1. Funded projects focus on bringing together cross-disciplinary communities to work on the design of cyberinfrastructure for data-intensive research. [2012 and 2013]
2. A selection (perhaps one-fourth) of these communities will be funded to develop prototypes of the facilities designed in Stage 1. [Beginning 2014, funding permitting]
3. An even smaller number of projects will be funded to develop the actual facility.

Page 5: Roadmap for current project
- The competition will be fierce across a wide range of disciplines.
- In order to succeed in the second stage of the program, we must write a top-25% proposal.
- Can we put ourselves in the shoes of potential reviewers and anticipate what the likely critiques of an AARDVARC implementation proposal might be?
- If so, that could help us set an agenda for the problems we should be working on during the course of the current project.

Page 6: Fast forward to implementation
- The current AARDVARC proposal is not an implementation proposal
  - However, reading it through that lens sheds light on what would need to be addressed if it were
- Reading the proposal in this way,
  - I have imagined four show-stopping reviewer critiques that we want to be sure to avoid
  - This presentation discusses the requirements for an implementation proposal that would avoid these critiques

Page 7: Critiques we want to avoid
1. The focus seems too narrow to be truly transformative.
2. The issues of sustainability are not adequately addressed.
3. It is not clear that automatic transcription of under-resourced languages is even possible.
4. There is not an adequate story about how the community will work on a large scale to fill the repository.

Page 8: 1. Find the right framing
- Vision of CIF21: “transform our ability to effectively address and solve the many complex problems facing science and society”
- Potential critique
  - The AARDVARC focus seems too narrow to be truly transformative.
- Requirement
  - A successful proposal will need to frame the proposed cyberinfrastructure in terms that non-linguists will embrace as truly transformative.

Page 9: Problem
- The name AARDVARC frames the problem in terms of a repository for automatically annotated video and audio resources
  - Among non-linguists, is a framing in terms of automatic annotation likely to rise to the top 25% of cross-cutting problems?
  - Probably not, since solving the transcription bottleneck puts the focus on a means to the end, rather than the end itself
- The true end is having a repository of data from every language

Page 10: A more compelling framing
- The AARDVARC name fails to name the main thing: language
  - The most fundamental problem for data-intensive research in the 21st century is that we lack a repository of interoperable data from every human language
- Among non-linguists, would a framing like that rise to the top 25% of cross-cutting problems?
  - This seems much more likely
  - And others have already laid some groundwork

Page 11: Human Language Project
- Building by analogy to the Human Genome Project, Abney and Bird have proposed a Human Language Project to the computational linguistics community:
  - “We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics.” (Abney and Bird 2010)
- In two conference papers, they have argued the motivation for the project and specified basic formats for data

Page 12: Language Commons
- Building on “the commons” tradition, Bice, Bird, and Welcher have spearheaded the Language Commons
  - “The Language Commons is an international consortium that is creating a large collection of written and spoken language material, made available under open licenses. The content includes text and speech corpora, along with translations, lexicons and other linguistic resources that support large-scale investigation of the world's languages.”
- Currently an open collection in the Internet Archive
  - Browse: http://archive.org/details/LanguageCommons
  - Submit: http://upload.languagecommons.org/

Page 13: We need to join forces
- AARDVARC, Human Language Project, and the Language Commons are variations on the same fundamental vision
  - A repository of interoperable data from every human language
- Facing fierce competition with other disciplines
  - We are too small to have competing visions; we need a single vision that others will find compelling
  - For an implementation proposal, we should all join forces to create a grand vision of cyberinfrastructure for language-related research in the 21st century that will embrace every language

Page 14: References
- The Human Language Project: Building a universal corpus of the world’s languages. Steven Abney and Steven Bird. 2010. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 88-97, Uppsala, Sweden.
- Towards a data model for the Universal Corpus. Steven Abney and Steven Bird. 2011. Proceedings of the 4th Workshop on Building and Using Comparable Corpora, 120-127, Portland, USA.
- The Language Commons Wiki. Ed Bice and others. 2010. Presentation at Wikimania 2010, Gdańsk, Poland.
- The Rosetta Project and The Language Commons. Laura Welcher. 2011. Presentation posted on The Long Now Foundation blog.

Page 15: 2. Ensure sustainability
- Vision of CIF21:
  - “provide a … sustainable ... cyberinfrastructure”
- Potential critique
  - The issues of sustainability are not adequately addressed.
- Requirement
  - A successful proposal will need to give a convincing plan for the sustainability of the infrastructure and the resources it houses.

Page 16: A repository is not enough
- Simply building a repository does not ensure sustainability
  - It must also function as an archive that guarantees access far into the future
- A huge NSF investment in the repository we envision would go to waste if it could not
  - Continue operating after the grant money ran out
  - Survive the inevitable upgrades to hardware and system software at the host institution
  - Recover from a disaster (natural or institutional)

Page 17: Non-use is also waste
- Even deeper than the sustained functioning of a repository is the sustained use of the resources it houses
- The huge investment would also go to waste if
  - Resources deteriorate or slip to obsolete formats
  - Potential users never discover relevant resources
  - Users are unable to access discovered resources
  - Users cannot make sense of resources they access
  - Accessed resources are not compatible with the computational working environments of users

Page 18: Conditions of sustainable use
- A complete proposal would address the conditions of sustainable use (Simons & Bird 2008, sec. 3)
  - Extant: Preserved through off-site backup, refreshing copies, format migration, fixity metadata
  - Discoverable: Adequate descriptive metadata accessed through open and easy-to-use search
  - Available: User has rights to access as well as a means of access
  - Interpretable: Markup, encoding, abbreviations, terminology, methodologies are well documented
  - Portable: File formats that are open (not proprietary) and work on all platforms
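As a small illustration of the "Extant" condition, fixity metadata is commonly a checksum recorded at deposit time and recomputed later to detect corruption. The sketch below is only one way this might look; the function names and the sample bytes are hypothetical and are not taken from the slides or from any particular archive's software.

```python
import hashlib
import io


def fixity_checksum(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(stream, recorded_checksum):
    """Compare a fresh checksum against the one stored at deposit time."""
    return fixity_checksum(stream) == recorded_checksum


# Hypothetical deposit: these bytes stand in for an audio file.
original = b"example audio bytes"
recorded = fixity_checksum(io.BytesIO(original))

print(is_intact(io.BytesIO(original), recorded))            # unchanged copy
print(is_intact(io.BytesIO(original + b"\x00"), recorded))  # altered copy
```

An archive would typically store the recorded checksum alongside the descriptive metadata and rerun the comparison whenever copies are refreshed or migrated.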

Page 19: Checklist for responsible archiving
- A good proposal would measure up against the criteria of the TAPS Checklist (Chang 2010, pp. 136-7)
  - Based on a review of mainstream tools for assessing archival practices, TAPS is a checklist of 16 points to help linguists evaluate whether a prospective home for their data will be a responsible archive
  - Target: Are the mission and audience a good fit?
  - Access: Will your audiences have adequate access?
  - Preservation: Is the archive following best practices for ensuring long-term preservation?
  - Sustainability: Is the institution well situated for the long term?

Page 20: A repository or an aggregator?
- Or should the infrastructure have an aggregator at the center rather than a single repository?
  - In today’s web economy, being the aggregator (rather than a supplier) is the sweet spot (Simons 2007 paints a vision of such a cyberinfrastructure)
  - This would require community agreement on:
    - Metadata standards (content, format, protocol): OLAC provides a starting point
    - Data standards (contents, formats, protocols): Universal Corpus provides a starting point
  - Still needs a self-service default repository
    - e.g. Language Commons in Internet Archive

Page 21: References
- Toward a global infrastructure for the sustainability of language resources. Gary Simons and Steven Bird. 2008. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 20–22 November 2008, Cebu City, Philippines. Pages 87–100.
- TAPS: Checklist for responsible archiving of digital language resources. Debbie Chang. 2010. MA thesis, Graduate Institute of Applied Linguistics. Dallas, TX.
- Doing linguistics in the 21st century: Interoperation and the quest for the global riches of knowledge. Gary Simons. 2007. Proceedings of the E-MELD/DTS-L Workshop: Toward the Interoperability of Language Resources, 13–15 July 2007, Palo Alto, CA.

Page 22: 3. Focus on achievable automation
- Purpose of BCC-SBE/EHR:
  - “enable research communities to develop … prototype capabilities”
- Potential critique
  - It is not clear that automatic transcription of under-resourced languages is even possible.
- Requirement
  - A successful proposal will need a compelling description of automated helps for annotation that can be implemented today.

Page 23: The BCC-SBE/EHR vision
- Building Community and Capacity for Data-Intensive Research program is about activity in the present to support research in the future:
  - Present activities: We “seek to enable research communities to develop visions, teams, and prototype capabilities
  - Present focus: dedicated to creating and utilizing innovative and large-scale data resources and relevant analytic techniques
  - Future result: to advance fundamental research for the SBE and EHR areas of research.”

Page 24: Setting the right target
- Automated transcription of under-resourced languages is still in the future
  - It is an advance in fundamental research that can be furthered by a data-intensive cyberinfrastructure
- The follow-up proposal in the BCC program is an implementation proposal, not a research proposal
  - It must focus on the automated helps for annotation that we can implement immediately
  - It is not meant to be a request to support research on annotation tasks we cannot currently automate
  - It should implement a framework into which we can plug the latter as that research comes to fruition

Page 25: Sorting the tasks
- During the AARDVARC project we should
  - Identify annotation tasks that we can automate now
    - Plan work modules for these in the proposed implementation grant
  - Identify annotation tasks that are clearly in the future
    - Pursue research grants on these through the normal research programs
    - Implementation proposal would mention supplying data to future research as within its broader impacts
  - Identify annotation tasks that are borderline
    - Conduct proof-of-concept testing now to determine whether each belongs in the first set or the second set

Page 26: Breaking the bottleneck
- The repository should embrace all strategies for breaking the transcription bottleneck
  - Focus on the end of data in every language, as opposed to a particular means for getting it
- A promising new strategy is oral annotation
  - Woodbury (2003) proposed this to turn a huge collection of tapes from 15 years of Cup’ik radio broadcasts into usable data
    - Make running oral translations
    - Do careful respeaking of “hard-to-hear tapes”
  - This inspired the development of BOLD:
    - Basic Oral Language Documentation

Page 27: References
- Defining documentary linguistics. Anthony Woodbury. 2003. In Peter Austin (ed.), Language Documentation and Description 1:35-51. London: SOAS.
- The rise of documentary linguistics and a new kind of corpus. Gary Simons. 2008. Presented at 5th National Natural Language Research Symposium, De La Salle University, Manila, 25 Nov 2008.
- Basic Oral Language Documentation. D. Will Reiman. 2010. Language Documentation and Conservation, Vol. 4, pp. 254-268.
- A scalable method for preserving oral literature from small languages. Steven Bird. 2010. Proceedings of the 12th International Conference on Asia-Pacific Digital Libraries, 5-14, Gold Coast, Australia.
- To BOLDly go where no one has gone before. Brenda Boerger. 2011. Language Documentation and Conservation, Vol. 5, pp. 208-233.

Page 28: Example of respeaking
- Original recording on first recorder
- Careful respeaking on second recorder
  - Original played back (with pauses) into left channel
  - Respoken on mike into right channel

From fieldwork of Will Reiman on Kasanga [cji] language, Guinea-Bissau

Page 29: A known best practice in field methods
- Instructions for the Recording of Linguistic Data
  - In Bouquiaux and Thomas (1976), trans. Roberts (1992). Studying and Describing an Unwritten Language. Dallas: Summer Institute of Linguistics.
  - “Go over this spontaneous recording, either with the narrator himself or with a qualified speaker, in order to have it repeated sentence by sentence, in a careful, relatively slow, yet normal manner, and to have it whistled (tone languages).” (p. 180)
  - Goes on to describe method using 2 tape recorders
- This method may be even more essential today as we prepare recordings for automatic transcription

Page 30: BOLD:PNG
- A project led by Steven Bird; see www.boldpng.info
- Trained university students to use low-cost digital recorders to go back to their home villages to make recordings and to annotate them orally
- Problems:
  - Managing all the files on all the recorders did not scale
  - Two-recorder annotation was too complicated

Page 31: Working on solutions
- Language Preservation 2.0: Crowdsourcing Oral Language Documentation using Mobile Devices
  - http://lp20.org/
- They have developed an Android app, Aikuma
  - Files shared within community via Internet or local Wi-Fi hub; supports voting for what to release
  - Annotate on a single device with a simple two-button tool
- Blog post containing two demo videos from Bird’s current field trip in the Amazon

Page 32: 4. Foster global collaboration
- Purpose of BCC-SBE/EHR:
  - “enable research communities … to creat[e] new, large-scale, next-generation data resources”
- Potential critique
  - There is not an adequate story about how the community will work on a large scale.
- Requirement
  - A successful proposal will need a compelling account of how a global community of researchers, speakers, and citizen scientists will collaborate to fill the repository with annotated resources.

Page 33: The real challenge
- Building the repository is one thing, but filling it with resources from most languages will be quite another
  - Funded staff will be able to implement the repository, but it will take thousands of volunteers to really fill it
- Realizing the vision will depend on
  - Mobilizing the research community to participate
  - Mobilizing speaker communities to participate
  - Mobilizing citizen scientists to participate
  - Building an infrastructure that supports collaboration among all these players on a global scale

Page 34: Resources as open-ended
- Repository must support open-ended annotation
- After initial deposit, other players should be able to
  - Add careful respeaking
  - Add a translation (either oral or written)
  - Add a transcription (of text or of translation)
  - Add a translation of the translation
  - Invoke an automatic transcription or translation
  - Check and revise the automatic output
- Each addition should be a separate deposit (with its own metadata) that links back to what it annotates (i.e., stand-off markup)
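The stand-off idea above can be sketched as a record type in which every annotation is its own deposit pointing back at what it annotates. This is only an illustrative sketch: the `Deposit` class, its field names, the identifiers, and the sample metadata are all assumptions made for the example, not a format specified in the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Deposit:
    """One repository item: a primary recording or an annotation of another item."""
    deposit_id: str
    kind: str                        # e.g. "recording", "respeaking", "translation"
    language: str                    # language code of this deposit's content
    annotates: Optional[str] = None  # id of the deposit this one annotates, if any
    metadata: dict = field(default_factory=dict)


def annotations_of(deposits, target_id):
    """Return ids of all deposits that directly or indirectly annotate target_id."""
    found = []
    frontier = [target_id]
    while frontier:
        current = frontier.pop()
        for d in deposits:
            if d.annotates == current:
                found.append(d.deposit_id)
                frontier.append(d.deposit_id)
    return sorted(found)


# Hypothetical repository contents: one recording and three later additions.
repo = [
    Deposit("r1", "recording", "cji"),
    Deposit("a1", "respeaking", "cji", annotates="r1"),
    Deposit("a2", "translation", "eng", annotates="r1", metadata={"mode": "oral"}),
    Deposit("a3", "transcription", "eng", annotates="a2"),
]
print(annotations_of(repo, "r1"))
```

Because the original recording is never modified, anyone can add a further layer later (for example a transcription of the oral translation) without coordinating with earlier depositors.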

Page 35: Resource workflow
- The types and languages of the complete set of annotations associated with a resource comprise the state of that resource
- The annotation tasks are operators on that state
  - Each annotation task has a prerequisite state
  - Performing the task changes the state of the resource
- This defines an implicit workflow
  - For any resource, there is a set of possible next tasks
  - The infrastructure needs to manage that workflow
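One way to picture this implicit workflow is to model a resource's state as a set of (annotation type, language) pairs and each task as an operator with a prerequisite. The sketch below is a simplified assumption for illustration: the task names, the prerequisites chosen for them, and the rule that a task is offered only while its output type is missing are all hypothetical, not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    requires: frozenset  # annotation types that must already exist
    produces: str        # annotation type this task adds


# Hypothetical task catalogue with assumed prerequisites.
TASKS = [
    Task("respeak", frozenset({"recording"}), "respeaking"),
    Task("translate orally", frozenset({"recording"}), "oral translation"),
    Task("transcribe", frozenset({"respeaking"}), "transcription"),
    Task("translate transcription", frozenset({"transcription"}), "written translation"),
]


def next_tasks(state):
    """Tasks whose prerequisites are met and whose output type is not yet present."""
    types_present = {kind for kind, _language in state}
    return [t.name for t in TASKS
            if t.requires <= types_present and t.produces not in types_present]


def perform(state, task, language):
    """Performing a task yields a new state with one more annotation."""
    return state | {(task.produces, language)}


state = frozenset({("recording", "cji")})
print(next_tasks(state))
state = perform(state, TASKS[0], "cji")
print(next_tasks(state))
```

A fuller model would also let prerequisites depend on language (for instance, a translation into a language that is not yet covered), which this sketch leaves out for brevity.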

Page 36: Supply and demand
- We need to match up two things:
  - The huge demand for annotation tasks to be done: all of the possible next tasks for all resources
  - The supply of people worldwide who could do them
- Our infrastructure needs to be a marketplace that matches supply with demand
  - E.g., eBay, eHarmony, mTurk.com
- Match a user’s language profile to find next tasks to do
  - E.g., TED’s Open Translation Project using Amara
    - Web tool to segment videos and add subtitles
    - 140 languages, ~10,000 translators, >50,000 translations
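Matching a user's language profile to open tasks can be sketched as a simple subset test: a task is offered when the user knows every language the task needs. The `OpenTask` type, the language codes, and the sample data below are hypothetical and serve only to illustrate the matching step; a real marketplace would also weigh skills, trust, and priorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenTask:
    resource_id: str
    name: str
    languages_needed: frozenset  # languages a volunteer must know for this task


def tasks_for(user_languages, open_tasks):
    """Return the open tasks whose language needs the user's profile covers."""
    profile = set(user_languages)
    return [t for t in open_tasks if t.languages_needed <= profile]


# Hypothetical demand side: possible next tasks across two resources.
open_tasks = [
    OpenTask("r1", "transcribe", frozenset({"cji"})),
    OpenTask("r1", "translate", frozenset({"cji", "por"})),
    OpenTask("r2", "translate", frozenset({"por", "eng"})),
]

# Hypothetical supply side: a volunteer who knows two languages.
for task in tasks_for({"por", "eng"}, open_tasks):
    print(task.resource_id, task.name)
```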

Page 37: If we build it …
- They won’t necessarily come!
- In addition to describing the infrastructure we would implement to match supply and demand, a compelling proposal would also:
  - Describe the plans for organizing the people who participate (including governance)
  - Describe plans for mobilizing the various target communities: researchers, speakers, citizens
  - Describe incentives for participation, especially ones that are built into the design of the infrastructure

Page 38: Conclusion
- The AARDVARC project gives us the opportunity to build the vision and plans for a sustainable cyberinfrastructure to
  - Collect and provide access to interoperable data resources from every human language
  - Harness automation wherever possible to add the needed transcriptions and translations
  - Create a marketplace that will permit thousands worldwide to collaborate in performing the annotation tasks that cannot be automated
- Thus transforming our ability to address and solve language-related problems facing science and society