CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning&...

16
CS544: Informa.on Extrac.on Zornitsa Kozareva USC/ISI Marina del Rey, CA [email protected] www.isi.edu/~kozareva January 18, 2011 Search Engine Who is Jerry Hobbs? Jerry R. Hobbs. Address: USC/ISI 4676 Admiralty Way ... Jerry R. Hobbs hobbs@isi .edu AMW.com Jerry HobbsFugiDve America's Most Wanted is a longrunning American TV Alleged Killer Dad Denied Bond, Services Set for Two LiVle Girld. A judge had denied Jerry Hobbs Wikipedia, the free encyclopedia Dr. Jerry R. Hobbs (born 25 January 1942) is a prominent researcher in the fields of l World MarDal Arts Games Rank: 3rd Dan, Discipline: JuJutsu Issued By: Jerry Hobbs 10th Dan Rank: 1st Dr. Jerry R. Hobbs (born 25 January 1942) is a prominent researcher in the fields of computaDonal linguisDcs, discourse analysis FugiDve America's Most Wanted is a longrunning American TV show produced World MarDal Arts Games Rank: 3rd Dan Discipline: JuJutsu Issued By: Dentokan MarDal Arts AssociaDon / Roy Jerry Hobbs 10th Dan Rank: 1st Dan Discipline researcher professior fugiDve assassin taekwondo teacher ? ? ? Named EnDty RecogniDon and ClassificaDon IdenDfy menDons in text and classify them into a predefined set of categories of interest: Person Names: Prof. Jerry Hobbs, Jerry Hobbs OrganizaDons: Hobbs corpora.on, FbK LocaDons: Ohio Date and Dme expressions: February <PER>Prof. Jerry Hobbs</PER> will teach CS544 during <DATE>February</DATE>. <PER>Jerry Hobbs</PER> killed his daughter in <LOC>Ohio</LOC>. <ORG>Hobbs corpora.on</ORG> bought <ORG>FbK</ORG>. Email: [email protected] Web address: www.usc.edu Names of drugs: paracetamol Names of ships: Queen Marry IdenDfy menDons in text and classify them into a predefined set of categories of interest: IdenDfy menDons in text and classify them into a predefined set of categories of interest: Person Names: OrganizaDons: LocaDons: Date and Dme expressions:

Transcript of CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning&...

Page 1: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

CS544:  Informa.on  Extrac.on  

Zornitsa Kozareva!USC/ISI!

Marina del Rey, [email protected]!

www.isi.edu/~kozareva!

January  18,  2011  

Search  Engine  

Who  is  Jerry  Hobbs?  Jerry  R.  Hobbs.  Address:  USC/ISI  4676  Admiralty  Way  ...  Jerry  R.  Hobbs  hobbs@isi  .edu  

AMW.com  Jerry  Hobbs-­‐FugiDve  America's  Most  Wanted  is  a  long-­‐running  American  TV  show  produced  by  20th  Century  Fox,  and  is  the  longest-­‐running  program  of  any  kind  in  the  history  of  the  ...  

Alleged  Killer  Dad  Denied  Bond,  Services  Set   for  Two  LiVle  Girld.  A     judge  had  denied  bond   for   Jerry   Hobbs.   ...   Jerry   Hobbs,   who   is   accused   of   killing   his   9-­‐year-­‐oldin   this  undated  ...  

Jerry  Hobbs-­‐  Wikipedia,   the  free  encyclopedia  Dr.  Jerry  R.  Hobbs   (born  25  January  1942)  is  a  prominent  researcher  in  the  fields  of  l    World  MarDal  Arts  Games  Rank:  3rd  Dan,  Discipline:  Ju-­‐Jutsu  Issued  By:  Jerry  Hobbs  

10th  Dan  Rank:  1st  

Dr.  Jerry  R.  Hobbs  (born  25  January  1942)  is  a  prominent  researcher  in  the  fields  of  computaDonal  linguisDcs,  discourse  analysis  

FugiDve  America's  Most  Wanted  is  a  long-­‐running  American  TV  show  produced  

World  MarDal  Arts  Games  Rank:  3rd  Dan  Discipline:  Ju-­‐Jutsu  Issued  By:  Dentokan  MarDal  Arts  AssociaDon  /  Roy  Jerry  Hobbs  10th  Dan  Rank:  1st  Dan  Discipline  

researcher     professior   fugiDve  assassin  taekwondo  teacher  

?  ?  

?  

Named  EnDty  RecogniDon  and  ClassificaDon  

•  IdenDfy   menDons   in   text   and   classify   them   into   a  predefined  set  of  categories  of  interest:  –  Person  Names:  Prof.  Jerry  Hobbs,  Jerry  Hobbs  

–  OrganizaDons:  Hobbs  corpora.on,  FbK  –  LocaDons:  Ohio  –  Date  and  Dme  expressions:  February    

Prof.  Jerry  Hobbs  will  teach  CS544  during  February.  Jerry  Hobbs  killed  his  daughter  in  Ohio.  Hobbs  corporaDon  bought  FbK.  

<PER>Prof.  Jerry  Hobbs</PER>  will  teach  CS544  during  <DATE>February</DATE>.    <PER>Jerry  Hobbs</PER>  killed  his  daughter  in  <LOC>Ohio</LOC>.  <ORG>Hobbs  corpora.on</ORG>  bought  <ORG>FbK</ORG>.  

–  E-­‐mail:  [email protected]  

–  Web  address:  www.usc.edu  

–  Names  of  drugs:  paracetamol  –  Names  of  ships:  Queen  Marry  

–  …  

•  IdenDfy   menDons   in   text   and   classify   them   into   a  predefined  set  of  categories  of  interest:  

•  IdenDfy   menDons   in   text   and   classify   them   into   a  predefined  set  of  categories  of  interest:  –  Person  Names:  

–  OrganizaDons:  –  LocaDons:  –  Date  and  Dme  expressions:  

Page 2: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

Named  EnDty  DiscriminaDon  

•  Discover  the  underlying  meaning  of  a  proper  name  

Prof.  Jerry  Hobbs  will  teach  CS544  during  February.    Jerry  Hobbs  killed  his  daughter  in  Ohio.  Hobbs  corpora.on  bought  FbK.  

?  ?  

?  

Prof.  Jerry  Hobbs  will  teach  CS544  during  February.    Jerry  Hobbs  killed  his  daughter  in  Ohio.  Hobbs  corpora.on  bought  FbK.  

•  The  number  of  enDDes  and  their  meaning  is  unknown  

Knowledge  HarvesDng  

•  Given   the  name   Jerry  Hobbs,   learn  automaDcally   from  the  Web  semanDc  classes  and  members  similar  to  it  

company  

entertainment  firm  

person  

assassin  father  professor   researcher  

InformaDon  ExtracDon  

Page 3: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

What  is  “InformaDon  ExtracDon”?  

•  Goal:  idenDfy  specific  pieces  of  informaDon  from  the  content  of  unstructured  or  semi-­‐structured  textual  documents.  

•  Input:  –  scenario  of  extracDon  (template  schema  to  be  filled)  

–  document  collecDon  

•  Output:  –  a  set  of  instanDated  templates  

A  bomb  went  off   this  morning  near  a  power   tower   in  San  Salvador  leaving  a  large  part  of  the  city  without  energy,  but  no  casualDes  have  been  reported.  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650.  

Incident type: Date: Time: Location: Perpetrator: Physical target: Human target: Effect on physical target: Effect on human target: Instrument:

A  bomb  went  off  this  morning  near  a  power  tower   in  San  Salvador  leaving  a  large  part  of  the  city  without  energy,  but  no  casual.es  have  been  reported.  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650.  

Incident type: bombing Date: January 18, 2011 Time: 0650 Location: San Salvador (city) Perpetrator: urban guerrilla commandos Physical target: power tower Human target: - Effect on physical target: destroyed Effect on human target: no injury or death Instrument: bomb

Page 4: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

MUC  •  Message   Understanding   Conference   (MUC)   was  an   annual   compeDDon   for   IE   systems   funded   by  DARPA  

– MUC-­‐1  (1987)  and  MUC-­‐2  (1989)  •  Messages  about  naval  opera.ons  

– MUC-­‐3  (1991)  and  MUC-­‐4  (1992)  •  News  arDcles  about  terrorist  aTacks  

– MUC-­‐5  (1993)  •  News  arDcles  about  joint  ventures  and  microelectronics  

– MUC-­‐6  (1995)  •  News  arDcles  about  management  changes  

– MUC-­‐7  (1997)  •  News  arDcles  about  space  vehicle  and  missile  launches  

hTp://www.google.com/squared  

Today  

Today  hTp://www.factual.com/  

Page 5: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

Other  ApplicaDons  •  Job  posDngs  •  Seminar  announcements  

•  Conference  call  for  papers  •  Company  informaDon  

•  Apartment  rental  adds  

•  Social  event  announcements  

•  USC  alert  system  

•  …  

Subject:  US-­‐TN-­‐SOFTWARE  PROGRAMMER  Date:  17  Nov  2009  17:37:29  GMT  OrganizaDon:  Reference.Com  PosDng  Service  Message-­‐ID:  <[email protected]>  

SOFTWARE  PROGRAMMER  

PosiDon   available   for   Sopware   Programmer  experienced     in   generaDng   sopware   for   PC-­‐Based  Voice  Mail   systems.    Experienced   in  C  Programming.    Must   be   familiar   with   communicaDng   with   and  controlling  voice  cards;  preferable  Dialogic,  however,  experience  with  others  such  as  Rhetorix  and  Natural  Microsystems  is  okay.    Prefer   5   years   or   more   experience   with   PC   Based  Voice  Mail,  but  will  consider  as  liVle  as  2  years.    Need  to  find  a  Senior  level  person  who  can  come  on  board  and   pick   up   code   with   very   liVle   training.   Present  OperaDng  System  is  DOS.    May  go  to  OS-­‐2  or  UNIX  in  future.  

Please  reply  to:  Kim  Anderson  AdNET  (901)  458-­‐2888  fax  [email protected]  

computer_science_job_template  id:                                                                                                                        Dtle:  salary:  company:  recruiter:  state:    city:  country:    language:    plasorm:  applicaDon:  area:  req_years_experience:    desired_years_experience:    req_degree:  desired_degree:  post_date:  

Example  is  adapted  from  Raymond  Mooney  

Subject:  US-­‐TN-­‐SOFTWARE  PROGRAMMER  Date:  17  Nov  2009  17:37:29  GMT  OrganizaDon:  Reference.Com  PosDng  Service  Message-­‐ID:  <[email protected]>  

SOFTWARE  PROGRAMMER  

PosiDon   available   for   Sopware   Programmer  experienced     in   generaDng   sopware   for   PC-­‐Based  Voice  Mail  systems.    Experienced   in  C  Programming.    Must   be   familiar   with   communicaDng   with   and  controlling  voice  cards;  preferable  Dialogic,  however,  experience  with  others  such  as  Rhetorix  and  Natural  Microsystems  is  okay.    Prefer   5   years   or   more   experience   with   PC   Based  Voice  Mail,  but  will  consider  as  liVle  as  2  years.    Need  to  find  a  Senior  level  person  who  can  come  on  board  and   pick   up   code   with   very   liVle   training.   Present  OperaDng  System  is  DOS.    May  go  to  OS-­‐2  or  UNIX  in  future.  

Please  reply  to:  Kim  Anderson  AdNET  (901)  458-­‐2888  fax  [email protected]  

computer_science_job_template  id:  [email protected]  Dtle:  SOFTWARE  PROGRAMMER  salary:  company:  recruiter:  state:  TN  city:  country:  US  language:  C  plasorm:  PC  \  DOS  \  OS-­‐2  \  UNIX  applicaDon:  area:  Voice  Mail  req_years_experience:  2  desired_years_experience:  5  req_degree:  desired_degree:  post_date:    17  Nov  2009  

Example  is  adapted  from  Raymond  Mooney  

Page 6: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

Two  general  approaches  to  IE  

15  

Machine  Learning  

Use  sequence  tagging  models  to  classify  individual  tokens  as  to  whether  or  not  they  should  be  extracted.  

PaTern-­‐Based  Systems  

Use  paTerns  and  rules  which  are  applied  to  text.  

Incident type: bombing Date: January 18, 2011 Time: 0650 Location: San Salvador (city) Perpetrator: urban guerrilla commandos Physical target: power tower Human target: - Effect on physical target: destroyed Effect on human target: no injury or death Instrument: bomb

A  bomb  went  off  this  morning  near  a  power  tower   in  San  Salvador  leaving  a  large  part  of  the  city  without  energy,  but  no  casual.es  have  been  reported.  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650.  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

Page 7: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

LaSIE  InformaDon  ExtracDon  System  

TokenizaDon  -­‐  idenDfy  word  boundaries  in  text  •       white  spaces  indicate  token  boundaries  •       full  stops  indicate  sentences  boundaries    

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

                       (not  always  true  for  example,  1.  September;  Nov.  1998)  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

GazeVeer  Lookup  –  recognize  phrases  and  keywords  related  to  named  enDDes  which  were  previously  stored  in  its  lists  (gazeVeers)  

A  bomb  went  off  this  morning  near  a  power  tower   in  San  Salvador  leaving   a   large   part   of   the   city   without   energy   ,   but   no   casual.es  have  been  reported  .  According   to   unofficial   sources   ,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650  .  

Page 8: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

GazeVeer  Lookup  –  recognize  phrases  and  keywords  related  to  named  enDDes  which  were  previously  stored  in  its  lists  (gazeVeers)  •     advantage  –  simple,  fast,  language  independent  as  one  just  has  to        create  the  lists  

•     disadvantage  –  collecDon  and  maintainance  is  Dme  consuming,        cannot  deal  with  name  variants,  cannot  resolve  ambiguity  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

Sentence  spliVer  -­‐  given  a  text,  returns  a  list  of  strings  where  each  element  is  a  sentence.  

•   uses  a  set  of  rules  like  the  occurrences  of      “.”,      “?”  and    “!”  are  indicators  of  sentence  delimiters  

(not  so  simple,  the  “.”    in    “B.  Clinton”  or  “U.S.”  does  not  have  this  role)  

Sentence1:  A  bomb  went  off  this  morning  near  a  power  tower   in  San  Salvador  leaving   a   large   part   of   the   city   without   energy   ,   but   no   casual.es  have  been  reported  .  

Sentence2:  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650  .  

Page 9: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

Part-­‐of-­‐speech  tagging  –  idenDfy  and  mark  up  the  words  in   a   text  with   the   correponding   part   of   speeach   such   as   noun,  verb,  adjecDve,  adverb  etc.  

According_to-­‐adv  unofficial-­‐adj  source[s]-­‐n  ,  the-­‐det  bomb-­‐n  allegedly-­‐adv    detonate[ed]-­‐v   by-­‐prep   urban-­‐adj   guerrilla-­‐n   commando[s]-­‐n   blow_up-­‐v  a-­‐det   power_tower-­‐n   in-­‐prep   the-­‐det   northwestern-­‐adj   part-­‐n   of-­‐prep    San  Salvador-­‐loc  at-­‐prep  0650-­‐.me  

Why  do  we  care?  

Sentence1:  A  bomb  went  off  this  morning  near  a  power  tower   in  San  Salvador  leaving   a   large   part   of   the   city   without   energy   ,   but   no   casual.es  have  been  reported  .  

Sentence2:  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650  .  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

SyntacDco-­‐semanDc  interpretaDon  •   boVom-­‐up  chart  parser  •   cascade  of  NERC  grammars  (eg.  aircrap,  person,  money,  Dme)    

According_to-­‐adv   unofficial-­‐adj   source[s]-­‐n   ,   the-­‐det   bomb-­‐n   allegedly-­‐adv    detonate[ed]-­‐v  by-­‐prep  urban-­‐adj   guerrilla-­‐n   commando[s]-­‐n  blow_up-­‐v   a-­‐det  power_tower-­‐n  in-­‐prep  the-­‐det  northwestern  part  of    San  Salvador-­‐loc  at-­‐prep  0650-­‐.me   NE1  

NE2  

Page 10: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

SyntacDco-­‐semanDc  interpretaDon  •   cascade  of  parDal  grammars  (NPs,  PPs,  complex  NP,  VPs,  complex              VPs,  RelClauses,  Sentence)    

S(According_to-­‐adv   NP(unofficial-­‐adj   source[s]-­‐n)   ,   NP(the-­‐det   bomb-­‐n)    allegedly-­‐adv   VP(detonate[ed]-­‐v)   PP(by-­‐prep   NP(urban-­‐adj   guerrilla-­‐n  commando[s]-­‐n))  -­‐    VP(blow_up-­‐v)  NP(a-­‐det  power_tower-­‐n)  PP(in-­‐prep  NP(the-­‐det  NE1-­‐loc))  PP(at-­‐prep  NP(NE2-­‐.me)))  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

SyntacDco-­‐semanDc  interpretaDon  •   boVom-­‐up  chart  parser  •   cascade  of  NERC  grammars  (eg.  aircrap,  person,  money,  Dme)  •   cascade  of  parDal  grammars  (NPs,  PPs,  complex  NP,  VPs,  complex              VPs,  RelClauses,  Sentence)  •   logic  form    

Event(E1),  detonate(E1,Y,X),  urban_guerrilla_comando(X),  bomb(Y)  

Event(E2),  blow_up(E2,Y,Z),  power_tower(Z),  locaDon_of(Z,NE1),  Dme_of(E2,NE2)    

According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by  urban   guerrilla   commandos   blew   up   a   power   tower   in   the   north  western  part  of  San  Salvador  at  0650  .  

Event(E1),  detonate(E1,Y,X),  urban_guerrilla_comando(X),  bomb(Y)  

Event(E2),  blow_up(E2,Y,Z),  power_tower(Z),  locaDon_of(Z,NE1),  Dme_of(E2,NE2)    

NE1   NE2  

Page 11: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

Name  matcher  –  does  not  recognize  new  proper  names,  just  adds  idenDty  relaDons  between  those  found  by  the  parser  

•     first  token  of  the  name  matches  the  second  name          “Pepsi  Cola”  equals  “Pepsi”  •     one  of  the  names  is  an  acronym  of  the  other          “ISI”  is  equivalent  to  “Informa6on  Sciences  Ins6tute”  

•       one  name  is  a  reversal  of  the  other          “Defense  Department”  equals  “Department  of  Defense”  •       one  name  consists  of  concatenated  contracDons  of  the  other                                “Pan  America”  equals  “Pan  Am”  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

Discourse  interpreter  –  translates  the  semanDc  representaDons  produced  by  the  parser  into  

•   representaDon  of  instances,  their  ontological  classes  and  aVributes  •   coreference  resoluDon  Event(E1),  detonate(E1,Y,X),  urban_guerrilla_comando(X),  bomb(Y),    Event(E2),  blow_up(E2,Y,Z),  power_tower(Z),  locaDon_of(Z,NE1),  Dme_of(E2,NE2)    

locaDon  of  event  destroy  

bombing  event  

isa  

implies  

implies  

LaSIE  InformaDon  ExtracDon  System  

Sentence  spliVer  

 buChart      parser    

Name  matcher  

Discourse  interpreter  

Template  writer  

lexicon   conceptual  hierarchy  gazeVeers  

GazeVeer  lookup  

TE    TR    ST  

Brill  tagger  

Tagged  Morph  

straDfied  grammar  

Output  template  generaDon  •   procedure  that  writes  the  templates  in  the  desired  format  

Incident type: bombing Date: March 11, 2010 Time: 0650 Location: San Salvador (city) Perpetrator: urban guerrilla commandos Physical target: power tower Human target: - Effect on physical target: destroyed Effect on human target: no injury or death Instrument: bomb

Page 12: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

How  well  does  this  work1?  

•  Evaluate  system’s  performance  on  independent  manually-­‐annotated   test   data   which   was   not  used  during  system  development  

•  IE   systems   are   typically   evaluated   in   terms   of  Precision  (P)  and  Recall  (R)  

1hVp://www-­‐nlpir.nist.gov/related_projects/muc/proceedings/st_score_report.html  

LaSIE  in  MUC-­‐6  

Task   Precision   Recall  

Named  EnDty   .94   .84  

Co-­‐reference  resoluDon   .71   .51  

Template  Elements   .74   .66  

Scenario  Templates   .73   .37  

LaSIE  Named  EnDty  

No.   Sehng   Precision   Recall  

1   GazeVeer  Look  Up   .74   .37  

2   1  +  Parsing   .93   .80  

3   2  +  Name  matching   .93   .88  

4   3  +  Discourse  interpretaDon   .93   .89  

•   Results  for  the  Named  EnDty  task  over  30  texts  •   Each  se�ng  indicates  the  contribuDon  of  LaSIE’s  components  

Page 13: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

Can  I  test  an  exisDng  IE  system?  

hTp://services.gate.ac.uk/annie/index.jsp  

Page 14: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

Rule-­‐based  IE:  Pros  and  Cons  

39  

CONS:  

 -­‐    rules  need  to  be  wriTen  by  hand  

 -­‐    requires  experienced  grammar  developers  

 -­‐    difficult  to  port  to  a  different  domain  

PROS:  

 +    clearly  understood  technology  

 +    hand-­‐wriTen  rules  are  rela.vely  precise  

 +    people  can  write  rules  with  a  reasonable  amount  of  training  

Can  we  automaDcally  learn  IE?  

•  In   the   mid-­‐1990’s   supervised   IE   systems   were  created.    

•  Supervised   learning   requires   annotated   training  data.    

•  Trade-­‐off:   annotaDng   texts   vs.  manual   knowledge  engineering  

     –  weeks  vs.  months    

     –  domain  experts  vs.  computaDonal  linguists  

AnnotaDng  Texts  for  IE  

 Alleged  guerilla  urban  commandos  launched  

 two  highpower  bombs  against  a  car  dealership  

 in  downtown  San  Salvador  this  morning  .  A  

police  report  said  that  the  aVack  set  the  building  

 on  fire  but  did  not  result  any  causaliDes    .  

perpetrator  

weapon  target  

loca.on  date  

damage  injury  

Page 15: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

AnnotaDng  Texts  for  IE  

 Alleged  guerilla  urban  commandos  launched  

 two  highpower  bombs  against  a  car  dealership  

 in  downtown  San  Salvador  this  morning  .  A  

police  report  said  that  the  aVack  set  the  building  

 on  fire  but  did  not  result  any  causaliDes    .  

perpetrator  

weapon   target  

loca.on   date  

damage  

injury  

PaVern  Learning  Algorithms  

•  A   variety   of   different   paVern/rule  representaDons   have   been   developed,   but  very  commonly:  –  IE  systems  learn  paVerns  by  beginning  with  highly  specialized   paVerns   and   iteraDvely   generalizing  them.  

–  rule-­‐learning   stops   when   a   set   of   paVerns   has  been  generated  to  sufficiently  “cover”  the  training  examples.  

AutoSlog  [Riloff  1993]  

44  

Annotated  text   Parser  

SUBJ:        The  World  Trade  Center  VP:                was  bombed  PP:                by  terrorists  

SyntacDc  Rules  

ExtracDon  PaVern:  <target>  was  bombed  was  bombed  by  <perpetrator>  

What  happens  if  you  apply  on  new  text?  

 <Kuwait>    was  bombed        <Newcastle>        <Hiroshima>  <Pearl  Harbor>  

         …              <Belgrade>  

Page 16: CS544:InformaonExtracon · Two’general’approaches’to’IE’ 15 Machine&Learning& Usesequencetaggingmodelsto classifyindividualtokensasto& whetherornottheyshouldbe extracted.&

<subject>  passive-­‐vp          <target>  was  bombed  <subject>  acDve-­‐vp          <perpetrator>  bombed  <subject>  acDve-­‐vp  dobj        <perpetrator>  threw  dynamite    <subject>  acDve-­‐vp  infiniDve    <perpetrator>  tried  to  kill    <subject>  passive-­‐vp  infiniDve      <perpetrator>  was  hired  to  kill    <subject>  auxiliary  dobj        <vicDm>  was  fatality  

acDve-­‐vp  <dobj>            bombed  <target>    infiniDve  <dobj>            to  kill  <vicDm>    acDve-­‐vp  infiniDve  <dobj>        tried  to  kill  <vicDm>    

passive-­‐vp  infiniDve  <dobj>      was  hired  to  kill  <vicDm>  subject  auxiliary  <dobj>        fatality  was  <vicDm>  passive-­‐vp  prep  <np>        was  killed  by  <perpetrator>    acDve-­‐vp  prep  <np>          exploded  in  <target>    infiniDve  prep  <np>            to  kill  with  <weapon>    noun  prep  <np>            assassinaDon  of  <vicDm>   45  

Examples  of  learned  paVers  with  AutoSlog: