Text Feature Extraction. Text Classification Text classification has many applications –Spam email...


Text Feature Extraction


Text Classification

• Text classification has many applications
  – Spam email detection
  – Automated tagging of streams of news articles, e.g., Google News
  – Automated creation of Web-page taxonomies

• Data Representation
  – “Bag of words” most commonly used: either counts or binary
  – Can also use “phrases” for commonly occurring combinations of words

• Classification Methods
  – Naïve Bayes widely used (e.g., for spam email)
    • Fast and reasonably accurate
  – Support vector machines (SVMs)
    • Typically the most accurate method in research studies
    • But more complex computationally
  – Logistic Regression (regularized)
    • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
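
The "bag of words" representation listed above can be sketched in a few lines of Python. This is only an illustration: the function name `bag_of_words`, the whitespace tokenization, and the lowercasing are assumptions made for the example and are not specified in the slides.

```python
from collections import Counter

def bag_of_words(text, binary=False):
    """Map a document to {term: count}, or {term: 1} when binary=True."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    counts = Counter(tokens)
    if binary:
        # Binary variant: record only whether each term is present.
        return {term: 1 for term in counts}
    return dict(counts)

doc = "cheap pills buy cheap pills now"
counts = bag_of_words(doc)                  # count representation
presence = bag_of_words(doc, binary=True)   # binary representation
print(counts)
print(presence)
```

Both variants discard word order, which is why the slide treats them as the same family of representation.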


Further Reading on Text Classification

• Web-related text mining in general
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
  – See chapter 5 for discussion of text classification

• General references on text and language modeling
  – Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
  – Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000.

• SVMs for text classification
  – T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002


Common Data Sets used for Evaluation

• Reuters
  – 10700 labeled documents
  – 10% documents with multiple class labels

• Yahoo! Science Hierarchy
  – 95 disjoint classes with 13,598 pages

• 20 Newsgroups data
  – 18800 labeled USENET postings
  – 20 leaf classes, 5 root level classes

• WebKB
  – 8300 documents in 7 categories such as “faculty”, “course”, “student”.

• Industry
  – 6449 home pages of companies partitioned into 71 classes


Trimming the Vocabulary

• Stopword removal:
  – remove “non-content” words
    • very frequent “stop words” such as “the”, “and”, …
  – remove very rare words, e.g., that only occur a few times in 100k documents

• Stemming:
  – Reduce all variants of a word to a single term
  – E.g., {draw, drawing, drawings} -> “draw”
  – Porter stemming algorithm (1980)
    • relies on a preconstructed suffix list with associated rules
    • e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE
      – BINARIZATION => BINARIZE

• This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem!
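
The trimming steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a full Porter stemmer: it implements only the single IZATION -> IZE rule quoted on the slide, and the stop-word list, the helper names (`has_vowel_then_consonant`, `stem_ization`, `trim_vocabulary`) and the `min_doc_freq` threshold are hypothetical choices made for the example.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}  # illustrative, not a standard list
VOWELS = set("aeiou")

def has_vowel_then_consonant(stem):
    """True if some vowel in the stem is immediately followed by a consonant."""
    return any(a in VOWELS and b not in VOWELS for a, b in zip(stem, stem[1:]))

def stem_ization(word):
    """Apply only the IZATION -> IZE rule; a real Porter stemmer has many more rules."""
    suffix = "ization"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if has_vowel_then_consonant(stem):
            return stem + "ize"
    return word

def trim_vocabulary(docs, min_doc_freq=2):
    """Drop stop words, stem, and keep terms found in at least min_doc_freq documents."""
    tokenized = [
        [stem_ization(t) for t in doc.lower().split() if t not in STOP_WORDS]
        for doc in docs
    ]
    # Document frequency: count each term once per document.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    return sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)

docs = [
    "the binarization of images",
    "image binarization and thresholding",
    "rare zebra",
]
vocabulary = trim_vocabulary(docs)
print(stem_ization("binarization"), vocabulary)
```

On a realistic corpus the rare-word threshold would be set relative to the collection size, as the slide suggests with its 100k-document example.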


Feature Selection

• Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
  – See classification results later in these slides

• Greedy search
  – Start from empty set or full set and add/delete one at a time
  – Heuristics for adding/deleting
  – Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
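
A greedy forward search of the kind described above might look like the sketch below. The slides do not name a specific heuristic, so the scoring function here (the gap in document frequency between one class and the rest) is an assumption chosen for simplicity, as are the function names and the toy documents.

```python
def doc_freq_gap(term, docs, labels, positive):
    """Heuristic score: |share of positive docs with term - share of other docs with term|.

    Assumes at least one document inside and one outside the positive class.
    """
    pos = [set(d.split()) for d, y in zip(docs, labels) if y == positive]
    neg = [set(d.split()) for d, y in zip(docs, labels) if y != positive]
    pos_rate = sum(term in d for d in pos) / len(pos)
    neg_rate = sum(term in d for d in neg) / len(neg)
    return abs(pos_rate - neg_rate)

def greedy_forward_selection(docs, labels, positive, k):
    """Start from the empty set and add the best-scoring remaining term, up to k times."""
    candidates = {t for d in docs for t in d.split()}
    selected = []
    while candidates and len(selected) < k:
        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates),
                   key=lambda t: doc_freq_gap(t, docs, labels, positive))
        selected.append(best)
        candidates.remove(best)
    return selected

docs = ["stocks rate rise", "interest rate stocks",
        "match score goal", "goal rate keeper"]
labels = ["finance", "finance", "sports", "sports"]
selected = greedy_forward_selection(docs, labels, "finance", k=2)
print(selected)
```

Because this particular score does not depend on the terms already chosen, the loop is equivalent to ranking all terms once and taking the top k; a heuristic that re-evaluates classifier performance after each addition would make the greedy order matter.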


Example of Role of Feature Selection

9600 documents from US Patent database
20,000 raw features (terms)


Classifying Term Vectors

• Typically multiple different words may be helpful in classifying a particular class, e.g.,
  – Class = “finance”
  – Words = “stocks”, “return”, “interest”, “rate”, etc.
  – Thus, classifiers that combine multiple features often do well, e.g.,
    • Naïve Bayes, Logistic regression, SVMs, etc.
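
To illustrate how a classifier combines evidence from several words, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. The training documents, the function names `train_naive_bayes` and `predict`, and the whitespace tokenization are all assumptions made for the example; the slides do not prescribe an implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-class word counts and class counts from whitespace-tokenized docs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(model, doc):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train_docs = ["stocks return interest rate", "interest rate rise stocks",
              "match goal score", "team goal win match"]
train_labels = ["finance", "finance", "sports", "sports"]
model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "stocks interest rate"))
```

Each word contributes one additive term to the log score, so several weakly indicative words ("stocks", "interest", "rate") can together decide the class.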


In-Class Practice: Format your own Text Data

• Data
  – your own collected text data

• Method
  – Stop words removal
  – Stemming
  – Key words frequency calculation

• Software
  – Coding or by Text editor


Format your own Text Data: Requirements

• File Format: Pure text
• Length of sample: Maximum length for one instance: 250 words
• Delimiter: single space
• Data Clean: Stop words removed
• Class Label: Folder name
• Example: Text_Example.txt provided on Moodle
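
For those who choose the coding route, the requirements above could be met with a short script like the following sketch. The stop-word list, the function names, the lowercasing step, and the choice to truncate after stop-word removal are assumptions made for illustration; the provided Text_Example.txt remains the reference for the expected format.

```python
import tempfile
from pathlib import Path

STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}  # illustrative list only
MAX_WORDS = 250  # maximum instance length stated in the requirements

def format_instance(raw_text):
    """Remove stop words, truncate to MAX_WORDS, and join with single spaces."""
    words = [w for w in raw_text.lower().split() if w not in STOP_WORDS]
    return " ".join(words[:MAX_WORDS])

def save_instance(root, label, name, raw_text):
    """Write one plain-text instance into a folder named after its class label."""
    folder = Path(root) / label
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.txt"
    path.write_text(format_instance(raw_text), encoding="utf-8")
    return path

# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as root:
    path = save_instance(root, "finance", "doc1",
                         "The  return of the stocks\nis rising")
    saved_text = path.read_text(encoding="utf-8")
    saved_label = path.parent.name

print(saved_label, "->", saved_text)
```

Splitting on any whitespace and rejoining with a single space also normalizes tabs, newlines, and repeated spaces in the collected text.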