Overview of Text Data Mining (CS591-CXZ Text Data Mining Seminar) Sept. 1, 2004 ChengXiang Zhai...

Overview of Text Data Mining

(CS591-CXZ Text Data Mining Seminar)

Sept. 1, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Most Data are Unstructured (Text) or Semi-Structured…

•Email

• Insurance claims

•News articles

•Web pages

•Patent portfolios

•…

•Customer complaint letters

•Contracts

•Transcripts of phone calls with customers

•Technical documents

•…

(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)

The more data we have, the more likely we are to find patterns in it

Text Management Applications

•Access: Select information

•Mining: Create knowledge

•Organization: Add structure/annotations

Elements of Text Info Management Technologies

Search

Text

Filtering

Categorization

Summarization

Clustering

Natural Language Content Analysis

Extraction

Mining

Visualization

Retrieval Applications

Mining Applications

Information Access

Knowledge Acquisition

Information Organization

What Is Text Mining?

“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

Text Mining vs. NLP, IR, DM…

•How does it relate to data mining in general?

•How does it relate to computational linguistics?

•How does it relate to information retrieval?

| Finding Patterns | Finding “Nuggets” (Novel) | Finding “Nuggets” (Non-Novel)

Non-textual data | General data mining | Exploratory data analysis | Database queries

Textual data | Computational linguistics | | Information retrieval

(Adapted from Rebecca Hwa’s “Intro to Text Mining”)

Challenges in Text Mining

•Data collection is “free text”

– Data is not well-organized

• Semi-structured or unstructured

– Natural language text contains ambiguities on many levels

• Lexical, syntactic, semantic, and pragmatic

– Learning techniques for processing text typically need annotated training examples

• Consider bootstrapping techniques

•What to mine?

(adapted from Rebecca Hwa’s “Intro to Text Mining”)

Applications of Text Mining

•Direct applications

– Domain-dependent (Bioinformatics, Business Intelligence, etc.)

– Data-dependent (WWW, literature, email, customer reviews, etc.)

•Indirect applications

– Assist information access

– Assist information organization

Text Mining for Hypertext Creation

[Figure: concept map (A general topic; Subtopic 1, …, Subtopic i, …, Subtopic M) and hypertext (Doc 1, Doc 2, …, Doc N)]

Types of Links

[Figure: A general topic with Subtopic 1, …, Subtopic i, …, Subtopic M, and Doc 1, Doc 2, …, Doc N, annotated with four link types: Doc-Doc links, Term-Term links, Doc-Term links, Term-Doc links]
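The four link types named on this slide (Doc-Doc, Term-Term, Doc-Term, Term-Doc) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and does not come from the slides: the toy corpus, the whitespace tokenizer, Jaccard overlap as the Doc-Doc link weight, and document co-occurrence counts as the Term-Term link weight.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus; the document names and texts are made up.
docs = {
    "doc1": "text mining finds patterns in text",
    "doc2": "information retrieval finds documents",
    "doc3": "text retrieval and text mining",
}


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


# Doc-Term links: the set of distinct terms in each document.
doc_terms = {name: set(tokenize(text)) for name, text in docs.items()}

# Term-Doc links: an inverted index from each term to the documents containing it.
term_docs = defaultdict(set)
for name, terms in doc_terms.items():
    for term in terms:
        term_docs[term].add(name)


def jaccard(a, b):
    """Jaccard overlap of two term sets, used here as a Doc-Doc link weight."""
    return len(a & b) / len(a | b)


# Doc-Doc links: one weight per unordered pair of documents.
doc_doc = {
    (d1, d2): jaccard(doc_terms[d1], doc_terms[d2])
    for d1, d2 in combinations(sorted(docs), 2)
}

# Term-Term links: number of documents in which both terms occur.
term_term = defaultdict(int)
for terms in doc_terms.values():
    for t1, t2 in combinations(sorted(terms), 2):
        term_term[(t1, t2)] += 1

if __name__ == "__main__":
    print(sorted(term_docs["text"]))
    print(doc_doc[("doc1", "doc2")])
    print(term_term[("mining", "text")])
```

Set-based links keep the sketch short. Weighted variants (for example, term frequencies instead of sets) would follow the same structure.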

Examples of Linkages in Text

Related Areas/Conferences

•Natural Language Processing (NLP): ACL, EMNLP, COLING

•Information Retrieval: SIGIR, CIKM

•Machine Learning: ICML, NIPS, UAI

•Data Mining & Knowledge Discovery: SIGKDD

•World Wide Web: WWW

•Bioinformatics: ISMB, PSB

Candidate Papers – SIGKDD 04

• Probabilistic Author-Topic Models for Information Discovery

Authors: Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths

• Mining Reference Tables for Automatic Text Segmentation

Authors: Eugene Agichtein, Venkatesh Ganti

• Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods

Authors: William Cohen, Sunita Sarawagi

• Mining and Summarizing Customer Reviews

Authors: Minqing Hu, Bing Liu

• Cluster-based Concept Invention for Statistical Relational Learning

Authors: Alexandrin Popescul, Lyle Ungar

Candidate Papers – WWW 04

• Unsupervised Learning of Soft Patterns for Generating Definitions from Online News (page 90). H. Cui, M.-Y. Kan, T.-S. Chua (National University of Singapore); WWW 2004

• Web-Scale Information Extraction in KnowItAll (Preliminary Results) (page 100). O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates (University of Washington); WWW 2004

• LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora (page 184). C.-C. Huang, S.-L. Chuang (Academia Sinica); L.-F. Chien (Academia Sinica, National Taiwan University); WWW 2004

• Towards the Self-Annotating Web (page 462). P. Cimiano, S. Handschuh (University of Karlsruhe); S. Staab (University of Karlsruhe, Ontoprise GmbH)

• Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty (page 482). E. Gabrilovich (Technion, Microsoft Research); S. Dumais, E. Horvitz (Microsoft Research)

• A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (page 658). K. Kummamuru, R. Lotlikar, S. Roy (IBM India Research Lab); K. Singal (IIT-Guwahati); R. Krishnapuram (IBM India Research Lab); WWW 2004

Candidate Papers – PSB & ISMB 03/04

• Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity, O. Tuason, L. Chen, H. Liu, J.A. Blake, and C. Friedman; Pacific Symposium on Biocomputing 9:238-249 (2004).

• Playing Biology's Name Game: Identifying Protein Names in Scientific Text, D. Hanisch, J. Fluck, HT. Mevissen, R. Zimmer; Pacific Symposium on Biocomputing 8:403-414 (2003).

• Mining Terminological Knowledge in Large Biomedical Corpora, H. Liu, C. Friedman; Pacific Symposium on Biocomputing 8:415-426 (2003).

• A Biological Named Entity Recognizer, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker; Pacific Symposium on Biocomputing 8:

• A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, A.S. Schwartz, M.A. Hearst; Pacific Symposium on Biocomputing 8:451-462 (2003).

• Evaluation of Text Data Mining for Database Curation: Lessons Learned from the KDD Challenge Cup, Alexander Yeh, Lynette Hirschman, Alexander Morgan; ISMB 2003

• Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein; ISMB 2003

• Mining MEDLINE for Implicit Links between Dietary Substances and Diseases, Padmini Srinivasan (University of Iowa), Bisharah Libbus (National Library of Medicine); ISMB 2004

• Protein Names Precisely Peeled Off Free Text, Sven Mika (Columbia University), Burkhard Rost (CUBIC/C2B2/NESG, Dept. of Biochemistry and Molecular Biophysics, Columbia University); ISMB 2004
