Lecture: Summarization

51
Seman&c Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm Summarization Marina San(ni [email protected]fil.uu.se Department of Linguis(cs and Philology Uppsala University, Uppsala, Sweden Spring 2016

Transcript of Lecture: Summarization

Page 1: Lecture: Summarization

Seman&c  Analysis  in  Language  Technology  http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Summarization

Marina  San(ni  [email protected]  

 Department  of  Linguis(cs  and  

Philology  

Uppsala  University,  Uppsala,  Sweden  

 Spring  2016  

   

Page 2: Lecture: Summarization

Previous  Lecture:  Rela$on  Extrac$on  

2  

Page 3: Lecture: Summarization

What’s  a  rela$on?  •  A  rela(on  can  be  formally  defined  in  the  form  of  a  tuple    

•  t  =  (e1;  e2  …;  en)    •  where  the  ei  are  en((es  in  a  predefined  rela(on  r  within  

document  D.    

•  Most  rela(on  extrac(on  systems  focus  on  extrac(ng  binary  rela$ons.    •  Examples  of  binary  rela(ons  include  •  located-­‐in(CMU,  PiHsburgh),    •  father-­‐of(ManuelBlum,  Avrim  Blum).    

•  It  is  also  possible  to  go  to  higher-­‐order  rela(ons  as  well  and  extract  more  complex  rela(ons  (ex  biomedicine).    

3  

Page 4: Lecture: Summarization

Why  Rela$on  Extrac$on?  

•  There  exists  a  vast  amount  of  unstructured  electronic  text  on  the  Web,  including  newswire,  blogs  ,emails,  governmental  documents,  chats,  and  so  on.    

•  The  whole  idea  of  IE  is  turn  unstructured  text  into  structured  by  annota(ng  seman(c  informa(on.  

•  RE  is  the  task    of  recognizing  rela(ons  between  en((es  in  unstructured  text.    !If a query to a search engine is “When was Gandhi born ?”, then the expected answer would be“Gandhi was born in 1869”. The template of the answer is <PERSON> born-in <YEAR> which is nothing but the relational triple: !

born in(PERSON, YEAR) !where PERSON and YEAR are the entities. !

4  

Page 5: Lecture: Summarization

Watch  out!  

•  RE  =  extract  facts  from  unstructured  texts,  ie  rela(ons  that  exist  betw  en((es,  such  as  dates,  proper  names,  companies.    

•  Other  rela(ons  (related  to  Word  Senses):  seman(c  rela(ons  betw  concepts:  hyperonyms,  hyponyms,  etc.  like  in  Wordnet.    

5  

Page 6: Lecture: Summarization

How  to  build  rela$on  extractors  

1.  Hand-­‐wriHen  paHerns  2.  Supervised  machine  learning  3.  Semi-­‐supervised  and  unsupervised    •  Bootstrapping  (using  seeds)  •  Distant  supervision  •  Unsupervised  learning  from  the  web  

6  

Page 7: Lecture: Summarization

Seed-­‐based  or  bootstrapping  approaches  to  rela$on  extrac$on  

•  No  training  set?  Maybe  you  have:  •  A  few  seed  tuples    or  •  A  few  high-­‐precision  paHerns  

•  Can  you  use  those  seeds  to  do  something  useful?  •  Bootstrapping:  use  the  seeds  to  directly  learn  to  populate  a  rela(on  

7  

Roughly  said:  Use  seeds  to  ini(alize  a  process  of  annota(on,  then  refine  through  itera(ons  

Page 8: Lecture: Summarization

Dipre:  Extract  <author,book>  pairs  

•  Start  with  5  seeds:          

•  Find  Instances:  The  Comedy  of  Errors,  by    William  Shakespeare,  was  The  Comedy  of  Errors,  by    William  Shakespeare,  is  The  Comedy  of  Errors,  one  of  William  Shakespeare's  earliest  aHempts  The  Comedy  of  Errors,  one  of  William  Shakespeare's  most  

•  Extract  paHerns  (group  by  middle,  take  longest  common  prefix/suffix)  ?x , by ?y , ?x , one of ?y ‘s !

•  Now  iterate,  finding  new  seeds  that  match  the  paHern  !

Brin, Sergei. 1998. Extracting Patterns and Relations from the World Wide Web.

Author   Book  Isaac  Asimov   The  Robots  of  Dawn  David  Brin   Star(de  Rising  James  Gleick   Chaos:  Making  a  New  

Science  Charles  Dickens   Great  Expecta(ons  William  Shakespeare  

The  Comedy  of  Errors  

8  

Page 9: Lecture: Summarization

Prac$cal  Ac$vity  Search  for  phrasal  paHerns  on  the  web      Our  seeds:    "*  is  a  novel  by  *"    "*  wrote  the  novel  *"    "the  novel  *  was  wriHen  by  *"  op#onally  add  more  phrases…    Further  refinemets  that  we  felt  are  needed:    •  get  read  of  non-­‐informa(ve  text  included  in  the  returned  strings  

(maybe  via  adding  addi(onal  paHerns  in  the  regular  expressions)  •  Iden(fy  name  en((es  

•  Maybe  via  Reg  Expressions  (eg.  iden(fy  words  star(ng  with  uppercase)  •  Maybe  combining  seeds  and  a  NER  system  •  ect.  

9c  

Google is fantastic, but also unpredictable… à different behaviours depending on the machines, domains, and some “hidden” criteria…  

Page 10: Lecture: Summarization

End  of  previous  lecture  

10  

Page 11: Lecture: Summarization

Acknowledgements Most  slides  borrowed  or  adapted  from:  

Dan  Jurafsky  and  Christopher  Manning,  Coursera  

Some  inspira(on  from  Dragomir  Radev,  Coursera  ….    

   

 

J&M(2009)      

 

     

Page 12: Lecture: Summarization

Text  Summariza$on  

12  

Page 13: Lecture: Summarization

Summary  

13  

Page 14: Lecture: Summarization

News  Summariza$on  

14  

Page 15: Lecture: Summarization

Book  Summaries  

15  

Cliff’s  Notes  are  a  series  of  student  study  guides  available  primarily  in  the  United  States.  

Page 16: Lecture: Summarization

Movie  Summaries  

16  

Page 17: Lecture: Summarization

Search  Engine  Snippets  

17  

Page 18: Lecture: Summarization

Genres  

18  

Page 19: Lecture: Summarization

Types  of  Summaries  

19  

Page 20: Lecture: Summarization

Stages  

20  

Page 21: Lecture: Summarization

Summariza$on  

21  

Page 22: Lecture: Summarization

Human  Summariza$on  and  Abstrac$ng  

22  

Page 23: Lecture: Summarization

Extrac$ve  Summariza$on  

23  

Page 24: Lecture: Summarization

Question Answering

Summarization in Question

Answering

Page 25: Lecture: Summarization

Text  Summariza$on  

•  Goal:  produce  an  abridged  version  of  a  text  that  contains  informa(on  that  is  important  or  relevant  to  a  user.  

         

•  Summariza$on  Applica$ons  •  outlines  or  abstracts  of  any  document,  ar(cle,  etc  •  summaries  of  email  threads  •  ac$on  items  from  a  mee(ng  •  simplifying  text  by  compressing  sentences  

25  

Page 26: Lecture: Summarization

What  to  summarize?    Single  vs.  mul$ple  documents  

•  Single-­‐document  summariza$on  •  Given  a  single  document,  produce  •  abstract  •  outline  •  headline  

•  Mul$ple-­‐document  summariza$on  •  Given  a  group  of  documents,  produce  a  gist  of  the  content:  •  a  series  of  news  stories  on  the  same  event  •  a  set  of  web  pages  about  some  topic  or  ques(on  

26  

Page 27: Lecture: Summarization

Query-­‐focused  Summariza$on  &    Generic  Summariza$on  

•  Generic  summariza(on:  •   Summarize  the  content  of  a  document  

•  Query-­‐focused  summariza(on:  •   summarize  a  document  with  respect  to  an  informa(on  need  expressed  in  a  user  query.  •  a  kind  of  complex  ques(on  answering:  • Answer  a  ques(on  by  summarizing  a  document  that  has  the  informa(on  to  construct  the  answer    

27  

Page 28: Lecture: Summarization

Summariza$on  for  Ques$on  Answering:  Snippets  

•  Create  snippets  summarizing  a  web  page  for  a  query  •  Google:  156  characters  (about  26  words)  plus  (tle  and  link  

28  

Page 29: Lecture: Summarization

Summariza$on  for  Ques$on  Answering:  Mul$ple  documents  

Create  answers  to  complex  ques(ons  summarizing  mul(ple  documents.  •  Instead  of  giving  a  snippet  for  each  document  • Create  a  cohesive  answer  that  combines  informa(on  from  each  document  

29  

Page 30: Lecture: Summarization

Extrac$ve  summariza$on  &    Abstrac$ve  summariza$on  

•  Extrac(ve  summariza(on:  •  create  the  summary  from  phrases  or  sentences  in  the  source  document(s)  

•  Abstrac(ve  summariza(on:  •  express  the  ideas  in  the  source  documents  using  (at  least  in  part)  different  words  

30  

Page 31: Lecture: Summarization

Simple  baseline:  take  the  first  sentence  

31  

Page 32: Lecture: Summarization

Question Answering

Generating Snippets and other Single-

Document Answers

Page 33: Lecture: Summarization

Snippets:  query-­‐focused  summaries  

33  

Page 34: Lecture: Summarization

Summariza$on:  Three  Stages  

1.  content  selec(on:  choose  sentences  to  extract  from  the  document  

2.  informa(on  ordering:  choose  an  order  to  place  them  in  the  summary  

3.  sentence  realiza(on:  clean  up  the  sentences  

34  

DocumentSentence

SegmentationSentenceExtraction

All sentencesfrom documents

Extracted sentences

Information Ordering

Sentence Realization

Summary

Content Selection

Sentence Simplification

Page 35: Lecture: Summarization

Basic  Summariza$on  Algorithm  

1.  content  selec(on:  choose  sentences  to  extract  from  the  document  

2.  informa(on  ordering:  just  use  document  order  3.  sentence  realiza(on:  keep  original  sentences  

35  

DocumentSentence

SegmentationSentenceExtraction

All sentencesfrom documents

Extracted sentences

Information Ordering

Sentence Realization

Summary

Content Selection

Sentence Simplification

Page 36: Lecture: Summarization

Unsupervised  content  selec$on  

•  Intui(on  da(ng  back  to  Luhn  (1958):  •  Choose  sentences  that  have  salient  or  informa(ve  words  

•  Two  approaches  to  defining  salient  words  1.  o-­‐idf:  weigh  each  word  wi  in  document  j  by  o-­‐idf  

2.  topic  signature:  choose  a  smaller  set  of  salient  words  •  mutual  informa(on  •  log-­‐likelihood  ra(o  (LLR)    Dunning  (1993),  Lin  and  Hovy  (2000)  

36  

weight(wi ) = tfij × idfi

weight(wi ) =1 if -2 logλ(wi )>100 otherwise

!"#

$#

H.  P.  Luhn.  1958.  The  Automa(c  Crea(on  of  Literature  Abstracts.  IBM  Journal  of  Research  and  Development.  2:2,  159-­‐165.    

Page 37: Lecture: Summarization

Topic  signature-­‐based  content  selec$on  with  queries  

•  choose  words  that  are  informa(ve  either    •  by  log-­‐likelihood  ra(o  (LLR)  •  or  by  appearing  in  the  query  

•  Weigh  a  sentence  (or  window)  by  weight  of  its  words:  

37  

Conroy,  Schlesinger,  and  O’Leary  2006  

weight(wi ) =1 if -2 logλ(wi )>101 if wi ∈ question0 otherwise

"

#$$

%$$

weight(s) = 1S

weight(w)w∈S∑

(could  learn  more  complex  weights)  

Page 38: Lecture: Summarization

Supervised  content  selec$on  

•  Given:    •  a  labeled  training  set  of  good  summaries  for  each  document  

•  Align:  •  the  sentences  in  the  document  with  sentences  in  the  summary  

•  Extract  features  •  posi(on  (first  sentence?)    •  length  of  sentence  •  word  informa(veness,  cue  phrases  •  cohesion  

•  Train  

•  Problems:  •  hard  to  get  labeled  training  data  

•  alignment  difficult  •  performance  not  beHer  than  unsupervised  algorithms  

•  So  in  prac(ce:  •  Unsupervised  content  selec$on  is  more  common  

•  a  binary  classifier  (put  sentence  in  summary?  yes  or  no)    

Page 39: Lecture: Summarization

Question Answering

Evalua(ng  Summaries:  ROUGE  

Page 40: Lecture: Summarization

ROUGE  (Recall  Oriented  Understudy  for  Gis$ng  Evalua$on)    

•  Intrinsic  metric  for  automa(cally  evalua(ng  summaries  •  Based  on  BLEU  (a  metric  used  for  machine  transla(on)  •  Not  as  good  as  human  evalua(on  (“Did  this  answer  the  user’s  ques(on?”)  •  But  much  more  convenient  

•  Given  a  document  D,  and  an  automa(c  summary  X:  1.  Have  N  humans  produce  a  set  of  reference  summaries    of  D  2.  Run  system,  giving  automa(c  summary  X  3.  What  percentage  of  the  bigrams  from  the  reference  

summaries  appear  in  X?  

40  

Lin and Hovy 2003  

ROUGE − 2 =min(count(i,X),count(i,S))

bigrams i∈S∑

s∈{RefSummaries}∑

count(i,S)bigrams i∈S∑

s∈{RefSummaries}∑

Page 41: Lecture: Summarization

A  ROUGE  example:  Q:  “What  is  water  spinach?”  

Human  1:  Water  spinach  is  a  green  leafy  vegetable  grown  in  the  tropics.  Human  2:    Water  spinach  is  a  semi-­‐aqua(c  tropical  plant  grown  as  a  vegetable.  Human  3:  Water  spinach  is  a  commonly  eaten  leaf  vegetable  of  Asia.  

•  System  answer:  Water  spinach  is  a  leaf  vegetable  commonly  eaten  in  tropical  areas  of  Asia.  

•  ROUGE-­‐2    =  

41   10  +  9  +  9  3  +  3  +  6  

=  12/28  =  .43    

Page 42: Lecture: Summarization

Question Answering

Summarization for Complex Questions

Page 43: Lecture: Summarization

Defini$on  ques$ons  

Q:  What  is  water  spinach?  A:  Water  spinach  (ipomoea  aqua(ca)  is  a  semi-­‐aqua(c  leafy  green  plant  with  long  hollow  stems  and  spear-­‐  or  heart-­‐shaped  leaves,  widely  grown  throughout  Asia  as  a  leaf  vegetable.  The  leaves  and  stems  are  oten  eaten  s(r-­‐fried  flavored  with  salt  or  in  soups.  Other  common  names  include  morning  glory  vegetable,  kangkong  (Malay),  rau  muong  (Viet.),  ong  choi  (Cant.),  and  kong  xin  cai  (Mand.).  It  is  not  related  to  spinach,  but  is  closely  related  to  sweet  potato  and  convolvulus.    

Page 44: Lecture: Summarization

Medical  ques$ons  

Q:  In  children  with  an  acute  febrile  illness,  what  is  the  efficacy  of  single  medica(on  therapy  with  acetaminophen  or  ibuprofen  in  reducing  fever?  A:  Ibuprofen  provided  greater  temperature  decrement  and  longer  dura(on  of  an(pyresis  than  acetaminophen  when  the  two  drugs  were  administered  in  approximately  equal  doses.  (PubMedID:  1621668,  Evidence  Strength:  A)  

Demner-­‐Fushman  and  Lin  (2007)    

Page 45: Lecture: Summarization

Other  complex  ques$ons  

1.  How  is  compost  made  and  used  for  gardening  (including  different  types  of  compost,  their  uses,  origins  and  benefits)?  

2.  What  causes  train  wrecks  and  what  can  be  done  to  prevent  them?  

3.  Where  have  poachers  endangered  wildlife,  what  wildlife  has  been  endangered  and  what  steps  have  been  taken  to  prevent  poaching?  

4.  What  has  been  the  human  toll  in  death  or  injury  of  tropical  storms  in  recent  years?    

45  

Modified  from  the  DUC  2005  compe((on  (Hoa  Trang  Dang  2005)  

Page 46: Lecture: Summarization

Answering  harder  ques$ons:  Query-­‐focused  mul$-­‐document  summariza$on  

•  The  (boHom-­‐up)  snippet  method  •  Find  a  set  of  relevant  documents  •  Extract  informa(ve  sentences  from  the  documents  •  Order  and  modify  the  sentences  into  an  answer  

•  The  (top-­‐down)  informa(on  extrac(on  method  •  build  specific  answerers  for  different  ques(on  types:  •  defini(on  ques(ons  •  biography  ques(ons    •  certain  medical  ques(ons  

Page 47: Lecture: Summarization

Query-­‐Focused  Mul$-­‐Document  Summariza$on  

47  

•  a  Document

DocumentDocument

DocumentDocumentInput Docs

Sentence Segmentation

All sentencesfrom documents

Sentence Simplification

Content Selection

SentenceExtraction:LLR, MMR

Extracted sentences

Information Ordering

Sentence Realization

Summary

All sentencesplus simplified versions

Query

Page 48: Lecture: Summarization

Informa$on  Ordering  

•  Chronological  ordering:  •  Order  sentences  by  the  date  of  the  document  (for  summarizing  news)..          (Barzilay,  Elhadad,  and  McKeown  2002)  

•  Coherence:  •  Choose  orderings  that  make  neighboring  sentences  similar  (by  cosine).  •  Choose  orderings  in  which  neighboring  sentences  discuss  the  same  en(ty  (Barzilay  and  Lapata  2007)    

•  Topical  ordering  •  Learn  the  ordering  of  topics  in  the  source  documents  

48  

Page 49: Lecture: Summarization

Domain-­‐specific  answering:  The  Informa$on  Extrac$on  method  

•  a  good  biography  of  a  person  contains:  •  a  person’s  birth/death,  fame  factor,  educa$on,  na$onality  and  so  on  

•  a  good  defini$on  contains:  •  genus  or  hypernym  •  The  Hajj  is  a  type  of  ritual  

•  a  medical  answer  about  a  drug’s  use  contains:  •  the  problem  (the  medical  condi(on),    •  the  interven$on  (the  drug  or  procedure),  and    •  the  outcome  (the  result  of  the  study).  

Page 50: Lecture: Summarization

Informa$on  that  should  be  in  the  answer  for  3  kinds  of  ques$ons  

Page 51: Lecture: Summarization

The end