Extending the Espresso Method for Greater Recall
![Page 1: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/1.jpg)
Relationship Extraction from Text
Extending the Espresso Method for Greater Recall
Derek Springer
UCLA Computer Science Department
November 19, 2009
![Page 2: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/2.jpg)
Related Works
• Ganapathi, Swathi. “Relationship Extraction from Text: Comparison and Experimental Evaluation of the State-of-the-Art.” UCLA comp exam. March 2009.
• Chu, A., Sakurai, S., Cárdenas, A. F., "Automatic Detection of Treatment Relationships in Patent Retrieval." 2008 CIKM Patent Information Retrieval Workshop. October 2008.
![Page 3: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/3.jpg)
Related Works, cont'd
• Girju, R. "Automatic Detection of Causal Relations for Question Answering." In the proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003). Workshop on "Multilingual Summarization and Question Answering - Machine Learning and Beyond". 2003.
• Pantel, Patrick and Pennacchiotti, Marco. "Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations." In Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06). pp. 113-120. Sydney, Australia. 2006.
![Page 4: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/4.jpg)
Relationship Extraction
• The task of recognizing the assertion of a particular relationship between two or more entities in text.
• Can aid in the development of standalone, intelligent, automated and adaptable user-specific content retrieval systems.
• We focus on extracting treatment relationships → A (subject) used to treat B (object).
![Page 5: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/5.jpg)
Goals and Contributions
• Extended state-of-the-art Espresso relationship extraction system originally implemented by Ganapathi.
• Did an in-depth experimental evaluation of the developed system while comparing it to prior work (Chu, Ganapathi).
• Future goal is to use the system developed here as a plug-in relationship feature extractor in iScore.
![Page 6: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/6.jpg)
Integration Into iScore
• iScore presents additional articles based on an aggregate score of “interestingness.”
• We believe filtering articles based on relationships can improve the results of iScore.
• We hypothesize that extending the Espresso system implemented by Swathi Ganapathi will improve the ability of a system such as iScore to utilize relationship extraction as a feature.
![Page 7: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/7.jpg)
Comparison Criteria
• Performance: Want system to have high precision and recall
• Minimal Supervision: Want system to require little to no human supervision
• Breadth: Want system to extract relations from varying corpus sizes, domains and formats.
• Generality: Want system to extract wide variety of relation types without losing its edge in any of the above criteria.
![Page 8: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/8.jpg)
The Espresso Algorithm
• General purpose algorithm which can be used to extract a wide variety of binary relations.
• Requires minimal supervision. Only input is a small seed set of known relations.
• By looking at individual sentences in detecting relationships, works well on all kinds of corpora.
• On tests conducted by the creators of the algorithm, Espresso generated balanced precision and recall.
![Page 9: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/9.jpg)
The Espresso Method
![Page 10: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/10.jpg)
Extending Espresso
• Ganapathi's Implementation: 37.8%
• Extension: 91.2%
![Page 11: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/11.jpg)
Ganapathi's Implementation
• Ganapathi's approach uses lexico-syntactic patterns of the form NP1 VP NP2 (Verb category in Table 1).
• The VP contains the treatment verb or pattern, and the two NPs contain the subject and object.
• This structure is a very common relationship, accounting for 37.8% of all relationships.
![Page 12: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/12.jpg)
Extension
• There still remains a large number of relationships that may provide fruitful results.
• Expanding the implementation to include the following relationship types:
- Noun+Prep, e.g. "X settlement with Y"
- Verb+Prep, e.g. "X moved to Y"
- Infinitive, e.g. "X plans to acquire Y"
- Modifier, e.g. "X is Y winner"
• Retrieves 91.2% of common relationships.
![Page 13: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/13.jpg)
Test Corpora
• Patent Corpus: Developed by Shigeo
o 50,000 drug patent documents from 2008 from Class 424 & 514 of the U.S. Patents Classification: “drug, bio-affecting and body treating compositions” and their subclasses.
o Patents were pre-filtered to only contain keywords “diabetes”, “metastatic”, “cancer”, “tuberculosis”, “lung”, “bronchitis”, “coronary artery”
o All sentences from each document added to a sentence table in the schema
• PubMed Corpus: Developed by Gustavo
o Comprised of medical abstracts from PubMed
o Each abstract was parsed and all sentences from each abstract were stored as individual tuples in the sentence table
![Page 14: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/14.jpg)
Performance Measures
![Page 15: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/15.jpg)
Seed Treatment Relationships
• (Xanax, Anxiety)
• (Ambien, Insomnia)
• (Effexor, Depression)
• (Paxil, Depression)
• (Lexapro, Depression)
• (Caffeine, Depression)
• (Zoloft, Depression)
• (Imipramine, Depression)
• (Glycoside, Depression)
• (Ibuprofen, Arthritis)
• (Ibuprofen, Headache)
• (Tylenol, Fever)
• (Tylenol, Headache)
• (Antibody, Inflammation)
• (Ibuprofen, Inflammation)
• (Surgery, Glaucoma)
![Page 16: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/16.jpg)
Procedure
1. Re-tag original data set to incorporate extended relationship types.
2. Re-run Ganapathi's baseline Espresso implementation to compare against updated data set.
3. Run extended Espresso implementation to compare against updated data set.
![Page 17: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/17.jpg)
Experiment #1: Extraction on Drug Patent Corpus
• Drug Patent corpus used.
• Algorithm was run with seed relations and 12 verbs were extracted as being relevant (verbs with rπ greater than 0.2).
• These treatment verbs were used to create a test sentence set of 120 sentences, i.e. 10 sentences containing a treatment verb for every relevant treatment verb.
• 358 possible relations were extracted, for each of which we calculated the ri score.
• 208 relations were obtained with ri score greater than the threshold, out of which 126 were actually correct (through manual tagging).
• Of the original 358 relations, manual tagging determined that 213 of them were correct treatment relations.
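The counts above are enough to work out precision, recall and F-score by hand. The sketch below assumes the standard definitions and assumes that the 213 manually confirmed relations are the recall denominator; the function name is illustrative and the slide's own results table is not reproduced in this transcript.

```python
def precision_recall_f1(num_extracted, num_correct_extracted, num_correct_total):
    """Standard precision, recall and balanced F-score from raw counts."""
    precision = num_correct_extracted / num_extracted
    recall = num_correct_extracted / num_correct_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Experiment #1: 208 relations above the threshold,
# 126 of them correct, 213 correct relations among all 358 candidates.
precision, recall, f1 = precision_recall_f1(208, 126, 213)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.606 recall=0.592 f1=0.599
```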
![Page 18: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/18.jpg)
Experiment #1 Results
![Page 19: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/19.jpg)
Experiment #2: Number of Relationships and Performance
• Drug Patent corpus used.
• Test the performance of the system under smaller and larger data loads.
• Started with initial set of 120 sentences obtained from Drug Patent corpus (10 sentences for each verb, 12 verbs as in test #1)
• Increased the number of sentences for each verb by 10 in each case, so that we had sentence sets of 240 and 360 sentences each
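A minimal sketch of how the three sentence sets could be assembled is shown below. It assumes the larger sets simply take more sentences per verb from the same pool; the slides do not say how sentences were selected, and the verb names and sentences here are placeholders.

```python
def build_test_set(sentences_by_verb, per_verb):
    """Collect the first `per_verb` sentences for every treatment verb."""
    test_set = []
    for verb, sentences in sentences_by_verb.items():
        test_set.extend(sentences[:per_verb])
    return test_set


# Placeholder pool (hypothetical): 12 verbs with 30 candidate sentences each.
sentences_by_verb = {
    f"verb_{v}": [f"sentence {v}-{s}" for s in range(30)] for v in range(12)
}

sizes = [len(build_test_set(sentences_by_verb, n)) for n in (10, 20, 30)]
print(sizes)  # [120, 240, 360]
```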
![Page 20: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/20.jpg)
Experiment #2 Results
![Page 21: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/21.jpg)
Experiment #2 Analysis
• Performance of the system and the number of relationships are inversely related.
• ri scores are inversely affected by the max pmi across all relationship instances, so having more relationship instances in a set may lower the ri for all of those relationships.
• More relationships => chance of a greater max pmi => lowered ri for all relationship instances.
• Not worried → articles likely won't have 200 relations of the same type.
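The inverse dependence on max pmi can be illustrated with a small sketch. It assumes the instance-reliability definition from the cited Espresso paper (Pantel and Pennacchiotti, 2006), in which each pattern's pmi with the instance is divided by the maximum pmi, weighted by the pattern's reliability rπ, and averaged over the patterns. The pattern strings and all numeric values are hypothetical.

```python
def instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi):
    """Average over patterns of (pmi(i, p) / max_pmi) * r_pi(p)."""
    total = sum(
        (pmi_by_pattern.get(pattern, 0.0) / max_pmi) * r_pi
        for pattern, r_pi in pattern_reliability.items()
    )
    return total / len(pattern_reliability)


# Hypothetical pattern reliabilities and pmi values for a single instance.
pattern_reliability = {"X treats Y": 0.8, "X for treating Y": 0.5}
pmi_by_pattern = {"X treats Y": 2.0, "X for treating Y": 1.0}

# Same instance, scored against a small and a large relation set. A larger
# set has a chance of containing a higher maximum pmi.
r_small_set = instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi=2.0)
r_large_set = instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi=4.0)
print(r_small_set, r_large_set)
```

With these toy numbers, doubling the max pmi halves the instance's ri even though its own pmi values are unchanged, which is the effect described in the analysis.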
![Page 22: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/22.jpg)
Experiment #3: Extraction on PubMed Corpus
• PubMed corpus used.
• Want to test the performance of the system on a corpus of a different type and size
• Algorithm was run with input seed relations on this corpus and 10 verbs with the topmost rπ values were extracted
• We constructed a test sentence set of 80 sentences (8 sentences for every relevant verb)
• We then extracted a total of 162 relations from this test set and calculated their ri scores.
• The average ri score was used as the threshold value
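Using the average ri score as the cut-off can be sketched as follows. The relation names and scores are placeholders, and keeping only relations strictly above the threshold is an assumption based on the "greater than the threshold" wording in Experiment #1.

```python
from statistics import mean


def filter_by_mean_threshold(scores):
    """Keep relations whose score is strictly above the mean of all scores."""
    threshold = mean(scores.values())
    kept = {relation: s for relation, s in scores.items() if s > threshold}
    return threshold, kept


# Hypothetical ri scores for four candidate relations.
scores = {
    ("drug_a", "disease_a"): 0.9,
    ("drug_b", "disease_b"): 0.6,
    ("drug_c", "disease_c"): 0.4,
    ("drug_d", "disease_d"): 0.1,
}

threshold, kept = filter_by_mean_threshold(scores)
print(sorted(kept))
```

Here the mean is about 0.5, so only the first two relations pass.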
![Page 23: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/23.jpg)
Experiment #3 Results
![Page 24: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/24.jpg)
Comparison Over Both Corpora
![Page 25: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/25.jpg)
Experiment #3 Analysis
• Performance is worse on PubMed corpus.
• Patent corpus dealt with drugs and cures for diseases.
• Therefore, there was an abundance of treatment type relations in patent corpus.
• PubMed had more general medical data and only contained abstracts => less info.
• Therefore, there were fewer treatment relations in PubMed which affected performance.
![Page 26: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/26.jpg)
Comparison with Previous Work
* signifies our contribution
![Page 27: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/27.jpg)
Analysis
• F-score of Ganapathi's version of Espresso fell nearly 10% → due to lower recall, as predicted.
• Results of extension over the re-tagged data are on par with Ganapathi's original results.
• Given that Ganapathi's system dropped nearly 10% on the re-tagged data, this seems to indicate that the extension is more general-purpose than the original version.
![Page 28: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/28.jpg)
Success
• Recall of system is more important than precision, especially when it comes to using relationships as a feature in iScore.
• Method is almost completely automated.
• Easily expanded to extract other relationship types by changing the input seed relations.
• Initial results seem insignificant, but analysis indicates that extended system has the potential to be a general-purpose relationship extraction feature.
![Page 29: Extending the Espresso Method for Greater Recall](https://reader035.fdocuments.us/reader035/viewer/2022062418/556b899bd8b42a6c7c8b5137/html5/thumbnails/29.jpg)
Future Work
• Development of a relationship feature extractor for iScore.
• Relations will have to be syntactically and semantically compared with relations present in other articles and the best article matches will be returned as “interesting” choices for a user.
• Optimizations: algorithm design improvements, database connection optimizations and parallelization.