Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014


Description

Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, CrowdTruth, based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.

Transcript of Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014

Page 1

Truth is a Lie
CrowdTruth: The 7 Myths of Human Annotation

Lora Aroyo

Page 2
Page 3
Page 4
Page 5
Page 6

Take Home Message

Human annotation of semantic interpretation tasks is a critical part of cognitive systems engineering:
– standard practice is based on the antiquated ideal of a single correct truth
– 7 myths of human annotation
– a new theory of truth: CrowdTruth

Page 7

I amar prestar aen... ("The world is changed...")

• the amount of data & scale of computation available have increased by a previously inconceivable amount
• CS & AI moved out of thought problems to empirical science
• current methods pre-date this fundamental shift
• the ideal of "one truth" is a lie
• crowdsourcing & semantics together correct the fallacy and improve analytic systems

The world has changed: there is a need to form a new theory of truth, appropriate to cognitive systems.

Page 8

Semantic Interpretation

Semantic interpretation is needed in all sciences:
– data is abstracted into categories
– patterns, correlations, associations & implications are extracted

Cognitive Computing: providing some way of scalable semantic interpretation

Page 9

Traditional Human Annotation

• Humans analyze examples: annotations for ground truth = the correct output for each example
• Machines learn from the examples
• Ground Truth Quality:
– measured by inter-annotator agreement (sketched below)
– founded on the ideal of a single, universally constant truth
– high agreement = high quality; disagreement must be eliminated

Current gold standard acquisition & quality evaluation are outdated.
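For concreteness, here is a minimal sketch (in Python, with invented toy judgments) of the traditional quality measure this slide refers to: Cohen's kappa, a chance-corrected inter-annotator agreement statistic for two annotators. Under the traditional view, a low kappa is read as a defect of the task or the annotators, not as signal.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(labels_a)
    # Observed agreement: fraction of examples where the annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[l] * freq_b[l] for l in freq_a.keys() | freq_b.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy binary judgments ("does the TREAT relation hold?") on 8 sentences:
a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohens_kappa(a, b))  # ~0.47 - traditionally a cue to "fix" the annotators
```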

Page 10

Need for Change

• Cognitive Computing increases the need for machines to handle the scale of data
• This results in an increasing need for new gold standards able to measure machine performance on tasks that require semantic interpretation

The New Ground Truth is CrowdTruth.

Page 11

7 Myths

• One truth: data collection efforts assume one correct interpretation for every example
• All examples are created equal: ground truth treats all examples the same – they either match the correct result or not
• Detailed guidelines help: if examples cause disagreement, add instructions to limit interpretations
• Disagreement is bad: increase the quality of annotation data by reducing disagreement among the annotators
• One is enough: most annotated examples are evaluated by one person
• Experts are better: annotators with domain knowledge provide better annotations
• Once done, forever valid: annotations are not updated; new data is not aligned with previous data

These myths directly influence the practice of collecting human-annotated data; they need to be revisited in the context of a changing world & in the face of a new theory of truth (CrowdTruth).

Page 12

1. One Truth
What if there are MORE?

Current ground truth collection efforts assume one correct interpretation for every example.

The ideal of one truth is a fallacy for semantic interpretation and needs to be changed.

Page 13

Which is the mood most appropriate for each song? Choose one:

Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable, good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense, anxious, intense, volatile, visceral
Other: does not fit into any of the 5 clusters

Results in: one truth? (Lee and Hu 2012)

Page 14

2. All Examples Are Created Equal
What if they are DIFFERENT?

• typically, annotators are asked whether a binary property holds for each example
• they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed
• the mathematics of using ground truth treats every example the same – it either matches the correct result or not
• poor quality examples tend to generate high disagreement

Disagreement allows us to weight sentences, i.e. the ability to train & evaluate a machine more flexibly (see the sketch below).
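As a hedged illustration of what "weighting sentences by disagreement" could look like in a training pipeline (this simple scheme is an assumption for illustration, not the exact CrowdTruth formula):

```python
def clarity_weight(votes):
    """Turn binary crowd votes on one sentence into a training weight.

    votes: list of 0/1 judgments ("does the property hold?").
    Unanimous sentences get weight 1.0; a 50/50 split gets weight 0.0,
    so ambiguous examples count less when training & evaluating a machine.
    """
    p = sum(votes) / len(votes)
    return abs(2 * p - 1)

print(clarity_weight([1] * 19 + [0]))       # 0.9 -> clearly expressed
print(clarity_weight([1] * 10 + [0] * 10))  # 0.0 -> maximally ambiguous
```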

Page 15

Is the TREAT relation expressed between the highlighted terms? Equal training data?

ANTIBIOTICS are the first line treatment for indications of TYPHUS. → clearly treats
With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS. → less clearly treats

Disagreement can indicate vagueness & ambiguity of sentences.

Page 16

3. Detailed Guidelines Help
What if they HURT?

• "perfuming" agreement scores by forcing annotators to make choices they may think are not valid
• low annotator agreement is addressed with detailed guidelines for annotators to consistently handle the cases that generate disagreement
• this removes potential signal on examples that are ambiguous

Precise annotation guidelines do eliminate disagreement, but they do not increase quality.

Page 17

 

 

Which mood cluster is most appropriate for a song? Do restricting guidelines help? (Lee and Hu 2012)

Instructions: Your task is to listen to the following 30-second music clips and select the most appropriate mood cluster that represents the mood of the music. Try to think about the mood carried by the music and please try to ignore any lyrics. If you feel the music does not fit into any of the 5 clusters please select "Other". The descriptions of the clusters are provided in the panel at the top of the page for your reference. Answer the questions carefully. Your work will not be accepted if your answers are inconsistent and/or incomplete.

Disagreement can indicate problems with the task.

Page 18

4. Disagreement is Bad
What if it is GOOD?

• traditionally, disagreement is considered a measure of poor quality – the task is poorly defined, or the annotators lack training – rather than being accepted as a natural property of semantic interpretation

This makes the elimination of disagreement the GOAL.

Page 19

 

Does each sentence express the TREAT relation? Is disagreement bad?

ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 95%
Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects. → agreement 80%
With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS. → agreement 50%

Disagreement can reflect the degree of clarity in a sentence (scored below).
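The CrowdTruth line of work scores sentences with vector representations of the crowd's judgments; the sketch below follows that spirit (a sentence vector of vote counts per candidate relation, scored by cosine similarity against the unit vector of the target relation), with invented vote counts:

```python
import math

def sentence_relation_score(sentence_vector, relation):
    """Cosine similarity between a sentence's vote vector and the unit
    vector for one relation: near 1.0 = clear, lower = contested."""
    norm = math.sqrt(sum(v * v for v in sentence_vector.values()))
    return sentence_vector.get(relation, 0) / norm if norm else 0.0

# Invented counts from 20 workers choosing among candidate relations:
clear     = {"TREAT": 19, "PREVENT": 1}
ambiguous = {"TREAT": 10, "PREVENT": 6, "NONE": 4}
print(sentence_relation_score(clear, "TREAT"))      # ~1.00, clear sentence
print(sentence_relation_score(ambiguous, "TREAT"))  # ~0.81, ambiguous sentence
```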

Page 20

5. One is Enough
What if it is NOT ENOUGH?

• over 90% of annotated examples are seen by only 1-2 annotators
• a small number overlap – to measure agreement

Five or six popular interpretations can't be captured by one or two people.

Page 21

One Quality?

Accumulated results for each relation across all the sentences: 20 workers/sentence (and higher) yields the same relative disagreement (simulated below).
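A small simulation (with a hypothetical pool of 50 judgments and an invented 60/40 split) of why one or two annotators cannot capture the spread of interpretations: the measurable pairwise disagreement only stabilizes near its true value (0.48 here) as the number of workers per sentence approaches 20.

```python
import random

def pairwise_disagreement(votes):
    """Chance that two randomly picked annotators gave different answers."""
    p = sum(votes) / len(votes)
    return 2 * p * (1 - p)

random.seed(0)
pool = [1] * 30 + [0] * 20  # hypothetical: 60% of 50 workers said "yes"
for n in (1, 2, 5, 10, 20, 40):
    estimates = [pairwise_disagreement(random.sample(pool, n)) for _ in range(1000)]
    print(n, round(sum(estimates) / len(estimates), 3))
# n=1 always measures 0 disagreement; estimates approach 0.48 around n=20
```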

Page 22

6. Experts Are Better
What if the CROWD IS BETTER?

• conventional wisdom: human annotators with domain knowledge provide better annotated data, e.g. medical texts should be annotated by medical experts
• but experts are expensive & don't scale

Multiple perspectives on data can be useful, beyond what experts believe is salient or correct.

Page 23

What is the (medical) relation between the highlighted (medical) terms? Are experts better than the crowd?

• 91% of expert annotations are covered by the crowd (coverage sketched below)
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this expert annotation agreement
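A sketch of how the coverage figure on this slide could be computed, assuming per-sentence label sets from experts and from the crowd (the annotations and the helper below are invented for illustration):

```python
def coverage(reference, candidate):
    """Fraction of reference annotations that also occur in candidate."""
    candidate_set = set(candidate)
    return sum(1 for ann in reference if ann in candidate_set) / len(reference)

# Invented (sentence_id, relation) annotations:
expert = [(1, "TREAT"), (2, "CAUSE"), (3, "PREVENT")]
crowd  = [(1, "TREAT"), (2, "CAUSE"), (3, "SIDE_EFFECT")]
print(coverage(expert, crowd))  # 0.67 on this toy data; the slide reports 91%
```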

Page 24

7. Once Done, Forever Valid
What if VALIDITY CHANGES?

• perspectives change over time – old training data might contain examples that are no longer valid, or only partially valid, later
• continuous collection of training data over time allows the adaptation of gold standards to changing times, e.g.:
– popularity of music
– levels of education

Page 25

 

Which are mentions of terrorists in this sentence? Forever valid?

OSAMA BIN LADEN used money from his own construction company to support the MUHAJADEEN in Afghanistan against Soviet forces.

1990: hero; 2011: terrorist. Both types should be valid – two roles for the same entity; adaptation of gold standards to changing times.

Page 26

crowdtruth.org (image: Jean-Marc Côté, 1899)

Page 27

crowdtruth.org

• annotator disagreement is signal, not noise
• it is indicative of the variation in human semantic interpretation of signs
• it can indicate ambiguity, vagueness, similarity, over-generality, as well as quality

Page 28

crowdtruth.org

Page 29
Page 30

http://crowd-watson.nl

The Team 2013

Page 31
Page 32

The Crew 2014

Page 33

The (almost complete) Team 2014

Page 34

crowdtruth.org

lora-aroyo.org slideshare.com/laroyo

@laroyo