Overview of Text Data Mining
(CS591-CXZ Text Data Mining Seminar)
Sept. 1, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Most Data are Unstructured (Text) or Semi-Structured…
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• …
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
• …
(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)
The more data we have, the more likely we are to find patterns in it
Text Management Applications
• Access: Select information
• Mining: Create Knowledge
• Organization: Add Structure/Annotations
Elements of Text Info Management Technologies
(Diagram; labels recovered from the figure)
• Input: Text
• Natural Language Content Analysis
• Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization
• Retrieval Applications, Mining Applications
• Information Access, Knowledge Acquisition, Information Organization
What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
(Slide from Rebecca Hwa’s “Intro to Text Mining”)
Text Mining vs. NLP, IR, DM…
•How does it relate to data mining in general?
•How does it relate to computational linguistics?
•How does it relate to information retrieval?
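(Table: kind of data vs. kind of finding)
• Non-textual data
– Finding Patterns: General data-mining
– Finding Novel “Nuggets”: Exploratory Data Analysis
– Finding Non-Novel “Nuggets”: Database queries
• Textual data
– Finding Patterns: Computational Linguistics
– Finding Non-Novel “Nuggets”: Information Retrieval
(Adapted from Rebecca Hwa’s “Intro to Text Mining”)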
Challenges in Text Mining
•Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured
– Natural language text contains ambiguities on many levels
• Lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text typically need annotated training examples
• Consider bootstrapping techniques
•What to mine?
(adapted from Rebecca Hwa’s “Intro to Text Mining”)
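The bootstrapping idea mentioned above can be illustrated with a small self-training loop: start from a few labeled seed examples, label the unlabeled texts the current model is most confident about, add them to the training set, and repeat. This is only a minimal sketch under stated assumptions; the word-frequency scorer, the function names (train, score, bootstrap) and the min_margin threshold are hypothetical choices made for illustration, not a method described in the slides.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an illustrative simplification)."""
    return text.lower().split()


def train(labeled):
    """Count word frequencies per label from (text, label) pairs."""
    counts = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def score(counts, text):
    """Return (best_label, margin over the runner-up) for one text."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(c[w] / total for w in tokenize(text))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best - runner_up


def bootstrap(seed, unlabeled, min_margin=0.1, max_rounds=5):
    """Self-training: repeatedly absorb confidently labeled texts."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        counts = train(labeled)
        confident, remaining = [], []
        for text in pool:
            label, margin = score(counts, text)
            if margin >= min_margin:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


# Hypothetical toy data, invented for this example.
seed = [("refund broken product", "complaint"),
        ("great service thanks", "praise")]
unlabeled = ["broken screen refund please", "thanks great support",
             "product arrived broken", "meeting at noon"]
labeled, leftover = bootstrap(seed, unlabeled)
print(labeled)
print(leftover)
```

Texts that share no vocabulary with the seeds stay in the leftover pool. The margin threshold is the main design choice: a low value grows the training set quickly but lets early mistakes reinforce themselves.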
Applications of Text Mining
•Direct applications
– Domain-dependent (Bioinformatics, Business Intelligence, etc.)
– Data-dependent (WWW, literature, email, customer reviews, etc.)
•Indirect applications
– Assist information access
– Assist information organization
Text Mining for Hypertext Creation
(Diagram: a concept map, consisting of a general topic with Subtopic 1, ..., Subtopic i, ..., Subtopic M, and a hypertext of documents Doc 1, Doc 2, ..., Doc N)
Type of Links
(Diagram: a general topic with Subtopic 1, ..., Subtopic i, ..., Subtopic M, and documents Doc 1, Doc 2, ..., Doc N, annotated with four kinds of links)
• Term-Term Links
• Doc-Term Links
• Term-Doc Links
• Doc-Doc Links
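As a rough illustration of the four link types, the sketch below derives them from a toy collection. The slides do not say how links are computed, so everything here is an assumption: terms are whitespace tokens, term-term links are counted by co-occurrence within a document, and doc-doc links are weighted by Jaccard overlap of term sets. The function name build_links and its min_shared parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_links(docs, min_shared=1):
    """docs maps doc_id -> text; returns the four link structures."""
    # Doc-Term links: the set of terms each document contains.
    doc_term = {d: set(text.lower().split()) for d, text in docs.items()}

    # Term-Doc links: an inverted index from term to documents.
    term_doc = defaultdict(set)
    for d, terms in doc_term.items():
        for t in terms:
            term_doc[t].add(d)

    # Term-Term links: number of documents in which two terms co-occur.
    term_term = defaultdict(int)
    for terms in doc_term.values():
        for a, b in combinations(sorted(terms), 2):
            term_term[(a, b)] += 1

    # Doc-Doc links: Jaccard overlap of term sets, kept when the two
    # documents share at least min_shared terms.
    doc_doc = {}
    for a, b in combinations(sorted(doc_term), 2):
        shared = doc_term[a] & doc_term[b]
        if len(shared) >= min_shared:
            doc_doc[(a, b)] = len(shared) / len(doc_term[a] | doc_term[b])

    return doc_term, dict(term_doc), dict(term_term), doc_doc


# Hypothetical toy documents, invented for this example.
docs = {
    "d1": "text mining finds patterns",
    "d2": "text retrieval finds documents",
    "d3": "protein names",
}
doc_term, term_doc, term_term, doc_doc = build_links(docs)
print(term_doc["text"])
print(term_term[("finds", "text")])
print(doc_doc)
```

Term pairs are stored with the two terms in alphabetical order, so each undirected term-term link appears once. A real system would likely replace raw counts with a weighted measure, but the data structures would look similar.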
Examples of Linkages in Text
Related Areas/Conferences
•Natural Language Processing (NLP): ACL, EMNLP, COLING
•Information Retrieval: SIGIR, CIKM
•Machine Learning: ICML, NIPS, UAI
•Data Mining & Knowledge Discovery: SIGKDD
•World Wide Web: WWW
•Bioinformatics: ISMB, PSB
Candidate Papers – SIGKDD 04
• Probabilistic Author-Topic Models for Information Discovery
Authors: Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths
• Mining Reference Tables for Automatic Text Segmentation
Authors: Eugene Agichtein, Venkatesh Ganti
• Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods
Authors: William Cohen, Sunita Sarawagi
• Mining and Summarizing Customer Reviews
Authors: Minqing Hu, Bing Liu
• Cluster-based Concept Invention for Statistical Relational Learning
Authors: Alexandrin Popescul, Lyle Ungar
Candidate Papers – WWW 04
• Unsupervised Learning of Soft Patterns for Generating Definitions from Online News (page 90)
H. Cui, M.-Y. Kan, T.-S. Chua, National University of Singapore. WWW 2004
• Web-Scale Information Extraction in KnowItAll (Preliminary Results) (page 100)
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, University of Washington. WWW 2004
• LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora (page 184)
C.-C. Huang, S.-L. Chuang, Academia Sinica; L.-F. Chien, Academia Sinica, National Taiwan University. WWW 2004
• Towards the Self-Annotating Web (page 462)
P. Cimiano, S. Handschuh, University of Karlsruhe; S. Staab, University of Karlsruhe, Ontoprise GmbH
• Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty (page 482)
E. Gabrilovich, Technion, Microsoft Research; S. Dumais, E. Horvitz, Microsoft Research
• A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (page 658)
K. Kummamuru, R. Lotlikar, S. Roy, IBM India Research Lab; K. Singal, IIT-Guwahati; R. Krishnapuram, IBM India Research Lab. WWW 2004
Candidate Papers – PSB & ISMB 03/04
• Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity, O. Tuason, L. Chen, H. Liu, J.A. Blake, and C. Friedman; Pacific Symposium on Biocomputing 9:238-249 (2004).
• Playing Biology's Name Game: Identifying Protein Names in Scientific Text, D. Hanisch, J. Fluck, H.T. Mevissen, R. Zimmer; Pacific Symposium on Biocomputing 8:403-414 (2003).
• Mining Terminological Knowledge in Large Biomedical Corpora, H. Liu, C. Friedman; Pacific Symposium on Biocomputing 8:415-426 (2003).
• A Biological Named Entity Recognizer, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker; Pacific Symposium on Biocomputing 8:
• A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, A.S. Schwartz, M.A. Hearst; Pacific Symposium on Biocomputing 8:451-462 (2003).
• Evaluation of Text Data Mining for Database Curation: Lessons Learned from the KDD Challenge Cup, Alexander Yeh, Lynette Hirschman, Alexander Morgan; ISMB 2003.
• Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein; ISMB 2003.
• Mining MEDLINE for Implicit Links between Dietary Substances and Diseases, Padmini Srinivasan (University of Iowa), Bisharah Libbus (National Library of Medicine); ISMB 2004.
• Protein Names Precisely Peeled Off Free Text, Sven Mika (Columbia University), Burkhard Rost (CUBIC/C2B2/NESG, Dept Biochemistry and Molecular Biophysics, Columbia University); ISMB 2004.