Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014


Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, CrowdTruth, based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd provides a useful representation of their subjectivity and the range of reasonable interpretations.

Transcript of Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014

Truth is a Lie

CrowdTruth: The 7 Myths of Human Annotation

Lora Aroyo

Human annotation of semantic interpretation tasks is a critical part of cognitive systems engineering

– standard practice is based on an antiquated ideal of a single correct truth

– 7 myths of human annotation

– a new theory of truth: CrowdTruth

Take Home Message


I amar prestar aen... ("the world is changed")

• the amount of data & the scale of computation available have increased by a previously inconceivable amount

• CS & AI have moved out of thought problems into empirical science

• current methods pre-date this fundamental shift

• the ideal of "one truth" is a lie

• crowdsourcing & semantics together correct the fallacy and improve analytic systems

The world has changed: there is a need to form a new theory of truth appropriate to cognitive systems


Semantic interpretation is needed in all sciences

– Data abstracted into categories

– Patterns, correlations, associations & implications are extracted

Cognitive Computing: providing a way of doing semantic interpretation at scale

Semantic Interpretation


• Humans analyze examples: annotations for ground truth = the correct output for each example

• Machines learn from the examples

• Ground Truth Quality:

– measured by inter-annotator agreement (see the sketch below)

– founded on the ideal of a single, universally constant truth

– high agreement = high quality; disagreement must be eliminated

Traditional Human Annotation
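
As a concrete reference point for the agreement-based quality notion above, here is a minimal Python sketch (not from the talk; the annotation lists are invented for illustration) of two standard inter-annotator agreement measures for binary labels: raw observed agreement and Cohen's kappa, which corrects for chance agreement.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of examples on which two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented binary annotations (1 = relation present, 0 = absent)
ann1 = [1, 1, 0, 1, 0, 1, 1, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(observed_agreement(ann1, ann2))  # 0.75
print(cohens_kappa(ann1, ann2))        # ~0.47
```

Under the traditional view, a low score here is read as a defect to be engineered away; the rest of the talk argues it often carries signal instead.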


Current gold standard acquisition & quality evaluation are outdated

• Cognitive Computing increases the need for machines to handle data at scale

• This results in an increasing need for new gold standards able to measure machine performance on tasks that require semantic interpretation

Need for Change

The New Ground Truth is CrowdTruth

• One truth: data collection efforts assume one correct interpretation for every example

• All examples are created equal: ground truth treats all examples the same – they either match the correct result or not

• Detailed guidelines help: if examples cause disagreement, add instructions to limit interpretations

• Disagreement is bad: increase the quality of annotation data by reducing disagreement among the annotators

• One is enough: most of the annotated examples are evaluated by one person

• Experts are better: annotators with domain knowledge provide better annotations

• Once done, forever valid: annotations are not updated; new data is not aligned with previous data

7 Myths

These myths directly influence the practice of collecting human-annotated data; they need to be revisited in the context of a changing world and in the face of a new theory of truth (CrowdTruth)


current ground truth collection efforts assume one correct interpretation for every example

the ideal of truth is a fallacy for semantic interpretation and needs to be changed

1. One Truth

What if there are MORE?


Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable, good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense, anxious, intense, volatile, visceral
Other: does not fit into any of the 5 clusters

Choose  one:    

Which is the mood most appropriate for each song?

one  truth?  

Results  in:  

(Lee  and  Hu  2012)  
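
One way to act on "what if there are MORE?" is simply not to collapse the crowd's answers into one label. The sketch below (vote counts are invented; the cluster names follow the Lee and Hu 2012 setup) keeps the full distribution of mood votes for one song instead of a single "true" cluster.

```python
# Hypothetical crowd votes for one song over the five mood clusters + "Other"
votes = {"Cluster 1": 3, "Cluster 2": 9, "Cluster 3": 6,
         "Cluster 4": 1, "Cluster 5": 0, "Other": 1}

total = sum(votes.values())
distribution = {cluster: count / total for cluster, count in votes.items()}

# Forcing one truth keeps only the top vote and throws the rest of the signal away
single_truth = max(votes, key=votes.get)

print(single_truth)   # Cluster 2
print(distribution)   # Cluster 2: 0.45, Cluster 3: 0.30, Cluster 1: 0.15, ...
```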

• typically annotators are asked whether a binary property holds for each example

• often they are not given a chance to say that the property may partially hold, or holds but is not clearly expressed

• the mathematics of using ground truth treats every example the same – it either matches the correct result or not

• poor quality examples tend to generate high disagreement

 

disagreement allows us to weight sentences = the ability to train & evaluate a machine more flexibly (see the weighting sketch after the example below)

 

2. All Examples Are Created Equal

What if they are DIFFERENT?


ANTIBIOTICS are the first line treatment for indications of TYPHUS. With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS.

clearly treats

disagreement can indicate vagueness & ambiguity of sentences

less clear treats

Is the TREAT relation expressed between the highlighted terms?

equal training data?
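
The weighting idea mentioned above can be sketched as follows. The agreement numbers here are illustrative, and the classifier call is only indicated in a comment, assuming any learner that accepts per-example sample weights (most scikit-learn estimators do); none of the names come from the talk.

```python
# Illustrative training sentences for the TREAT relation, with the label a
# traditional gold standard would assign and the crowd's agreement on it
training_data = [
    ("ANTIBIOTICS ... first line treatment ... TYPHUS.",  1, 0.95),  # clear
    ("With ANTIBIOTICS in short supply, DDT ... TYPHUS.", 1, 0.50),  # unclear
]

labels  = [label     for _, label, agreement in training_data]
weights = [agreement for _, label, agreement in training_data]

# Rather than treating both sentences as equally good evidence, pass the
# agreement scores as per-example weights, e.g.:
#   clf.fit(X, labels, sample_weight=weights)   # X = feature vectors
print(weights)  # [0.95, 0.5] (the unclear sentence counts roughly half as much)
```

The same weights can be used at evaluation time, so a system is penalised less for missing sentences the crowd itself found unclear.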


• "Perfuming" agreement scores by forcing annotators to make choices they may think are not valid

• Low annotator agreement is addressed by detailed guidelines for annotators to consistently handle the cases that generate disagreement

• This removes potential signal on examples that are ambiguous

precise annotation guidelines do eliminate disagreement but do not increase quality

3. Detailed Guidelines Help

What if they HURT?


 

 

disagreement  can  indicate  problems  with  the  task  

Instructions: Your task is to listen to the following 30-second music clips and select the most appropriate mood cluster that represents the mood of the music. Try to think about the mood carried by the music and please try to ignore any lyrics. If you feel the music does not fit into any of the 5 clusters please select "Other". The descriptions of the clusters are provided in the panel at the top of the page for your reference. Answer the questions carefully. Your work will not be accepted if your answers are inconsistent and/or incomplete.

Which mood cluster is most appropriate for a song?

restricting guidelines help? (Lee and Hu 2012)


• rather than accepting disagreement as a natural property of semantic interpretation

• traditionally, disagreement is considered a measure of poor quality because:
  – the task is poorly defined, or
  – the annotators lack training

this makes the elimination of disagreement the GOAL

4. Disagreement is Bad

What if it is GOOD?


 

ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 95%
Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects. → agreement 80%
With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS. → agreement 50%

disagreement bad?

disagreement can reflect the degree of clarity in a sentence


Does each sentence express the TREAT relation?
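
The percentages above are per-sentence agreement scores. A minimal sketch of how such a score can be derived from raw crowd votes (the votes below are invented so that they reproduce the 95% / 80% / 50% figures): agreement is the share of workers giving the majority answer, so a low score points at an unclear sentence rather than at bad workers.

```python
from collections import Counter

def sentence_agreement(votes):
    """Share of workers who gave the most common answer for one sentence."""
    return Counter(votes).most_common(1)[0][1] / len(votes)

# Invented votes ("yes" = the sentence expresses TREAT), 20 workers per sentence
votes_per_sentence = {
    "first line treatment":   ["yes"] * 19 + ["no"] * 1,
    "exhibited side-effects": ["yes"] * 16 + ["no"] * 4,
    "DDT / insect vectors":   ["yes"] * 10 + ["no"] * 10,
}

for sentence_id, votes in votes_per_sentence.items():
    print(sentence_id, sentence_agreement(votes))
# -> 0.95, 0.8, 0.5: the same ordering as the agreement figures above
```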

• over 90% of annotated examples are seen by 1-2 annotators

• a small number overlap – to measure agreement

five or six popular interpretations can't be captured by one or two people

5. One is Enough

What if it is NOT ENOUGH?


One  Quality?  

accumulated results for each relation across all the sentences

20 workers/sentence (and higher) yields the same relative disagreement
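
The "20 workers per sentence" observation can be checked with a simple simulation: for each crowd size, measure how much the per-sentence agreement score still fluctuates between runs. The sketch below uses simulated votes (a sentence on which about 70% of workers would say "yes") purely as a stand-in for real crowd data; none of these numbers come from the talk.

```python
import random
from statistics import pstdev
from collections import Counter

def agreement(votes):
    """Majority-vote agreement for one sentence."""
    return Counter(votes).most_common(1)[0][1] / len(votes)

def simulate_votes(n_workers, p_yes=0.7):
    """Simulate one sentence where ~70% of the crowd would answer 'yes'."""
    return ["yes" if random.random() < p_yes else "no" for _ in range(n_workers)]

# For each crowd size, see how much the agreement score varies across runs;
# once the fluctuation stops shrinking much, extra workers buy little.
for n in (3, 5, 10, 20, 40):
    scores = [agreement(simulate_votes(n)) for _ in range(1000)]
    print(f"{n:2d} workers: agreement fluctuates by about {pstdev(scores):.2f}")
```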


• conventional wisdom: human annotators with domain knowledge provide better annotated data, e.g. medical texts should be annotated by medical experts

• but experts are expensive & don't scale

multiple perspectives on data can be useful, beyond what experts believe is salient or correct

6. Experts Are Better

What if the CROWD IS BETTER?


experts better than the crowd?

• 91% of expert annotations are covered by the crowd
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this

expert annotation agreement


What is the (medical) relation between the highlighted (medical) terms?
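
Coverage figures like the ones above come down to set comparisons between expert labels and crowd votes. A minimal sketch with invented relation annotations (the real study used medical relation annotations): "covered" checks whether the expert's label appears anywhere among the crowd's answers, and "top_match" whether the crowd's most popular answer agrees with the expert.

```python
from collections import Counter

# Invented data: one expert label per sentence, and all crowd votes per sentence
expert = {"s1": "treats", "s2": "causes", "s3": "prevents", "s4": "treats"}
crowd = {
    "s1": ["treats", "treats", "prevents", "treats"],
    "s2": ["causes", "causes", "side-effect", "causes"],
    "s3": ["diagnoses", "diagnoses", "prevents", "diagnoses"],
    "s4": ["treats", "symptom", "treats", "treats"],
}

# How often the expert's label occurs among the crowd's answers at all
covered = sum(expert[s] in crowd[s] for s in expert) / len(expert)

# How often the crowd's single most popular answer matches the expert
top_match = sum(Counter(crowd[s]).most_common(1)[0][0] == expert[s]
                for s in expert) / len(expert)

print(covered)    # 1.0  (every expert label occurs somewhere in the crowd votes)
print(top_match)  # 0.75 (the majority vote matches the expert in 3 of 4 cases)
```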

• perspectives change over time – old training data might contain examples that are no longer valid, or only partially valid, later

• continuous collection of training data over time allows the adaptation of gold standards to changing times, e.g. the popularity of music or levels of education

7. Once Done, Forever Valid

What if VALIDITY CHANGES?

 

OSAMA  BIN  LADEN used money from his own construction company to support the MUHAJADEEN in Afghanistan against Soviet forces.

 

forever valid? both types should be valid – two roles for the same entity

– adaptation of gold standards to changing times

1990: hero, 2011: terrorist


Which are mentions of terrorists in this sentence?



• annotator disagreement is signal, not noise
• it is indicative of the variation in human semantic interpretation of signs
• it can indicate ambiguity, vagueness, similarity, over-generality, as well as quality


http://crowd-watson.nl

The  Team  2013  

The Crew 2014

The  (almost  complete)  Team  2014  

crowdtruth.org

lora-aroyo.org slideshare.com/laroyo

@laroyo