Improving Gender Classification of Blog Authors
Arjun Mukherjee, Bing Liu
UIC
Introduction
• Dataset
  – 3100 blogs (1588 by men, 1512 by women)
• Related work
  – Current systems use POS n-grams, word classes, and personality types to capture the stylistic behavior of authors' writing for gender classification.
  – However, each of these works uses only one class of features or a subset of the classes. None of them uses all features for classification learning.
Feature Engineering and Mining
• F-measure (not the accuracy measure used to evaluate classifiers)
  – The F-measure is used to measure contextuality and formality
  – F = 0.5 * [(freq.noun + freq.adj + freq.prep + freq.art) - (freq.pron + freq.verb + freq.adv + freq.int) + 100]
• Stylistic features
  – Words such as lol and hmm, as well as smileys, indicate a person's writing style
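The formality F-measure formula on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `formality_f_measure` and the coarse tag keys (`noun`, `adj`, and so on) are hypothetical, and the slide does not state the units of `freq.x`, so the sketch assumes each frequency is a percentage of all tagged words in the text.

```python
def formality_f_measure(tag_counts):
    """Formality F-measure from coarse POS tag counts for one text.

    Assumption: freq.x is the percentage of all tagged words that
    belong to category x (the slide does not state the units).
    """
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("tag_counts must contain at least one tagged word")

    def pct(tag):
        # Percentage frequency of one POS category; missing tags count as 0.
        return 100.0 * tag_counts.get(tag, 0) / total

    formal = sum(pct(t) for t in ("noun", "adj", "prep", "art"))
    contextual = sum(pct(t) for t in ("pron", "verb", "adv", "int"))
    return 0.5 * (formal - contextual + 100)


# Hypothetical counts for a 100-word post.
counts = {"noun": 30, "adj": 10, "prep": 10, "art": 10,
          "pron": 10, "verb": 20, "adv": 5, "int": 5}
score = formality_f_measure(counts)
```

Under the percentage assumption the score falls between 0 (only pronouns, verbs, adverbs, and interjections) and 100 (only nouns, adjectives, prepositions, and articles).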
Features (contd.)
• Gender-preferential features
  – Women tend to post more emotionally, with frequent use of intensive adverbs (terribly, awfully, …), whereas men tend to be more provocative
Features (contd.)
• Factor analysis and word classes
  – Factor analysis, or word factor analysis, refers to the process of finding groups of similar words that tend to occur in similar documents.
• POS sequence pattern
  – A POS sequence pattern is a sequence of consecutive POS tags that satisfies some constraints
  – POS n-grams are good at capturing heavy stylistic and syntactic information. Instead of using all such n-grams, we want to discover all those patterns that represent true regularities, and we also want the patterns to have flexible lengths
Features (contd.)
• POS sequence patterns
  – The mining algorithm finds all patterns that satisfy the user-specified minimum support (minsup) and minimum adherence (minadherence) thresholds.
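The two thresholds can be illustrated with a small level-wise miner over contiguous POS n-grams. This is a sketch under stated assumptions, not the algorithm from the slides: the slide does not define adherence, so the code assumes a fairSCP-style measure (squared support divided by the mean product of the supports of every prefix/suffix split), takes support to be the fraction of documents containing the pattern, and adds a hypothetical `max_len` cap. All function names are illustrative.

```python
from collections import Counter


def doc_ngrams(tags, n):
    """Distinct contiguous POS n-grams in one document."""
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}


def support(docs, n):
    """Fraction of documents that contain each contiguous POS n-gram."""
    counts = Counter()
    for tags in docs:
        counts.update(doc_ngrams(tags, n))
    return {p: c / len(docs) for p, c in counts.items()}


def adherence(pattern, sup):
    """Assumed fairSCP-style adherence; not defined on the slide."""
    n = len(pattern)
    if n == 1:
        return 1.0  # a single tag has no split to compare against
    splits = [sup[pattern[:i]] * sup[pattern[i:]] for i in range(1, n)]
    avg = sum(splits) / (n - 1)
    return sup[pattern] * sup[pattern] / avg


def mine_patterns(docs, minsup, minadherence, max_len=4):
    """Patterns of flexible length meeting both thresholds."""
    sup, kept = {}, []
    for n in range(1, max_len + 1):
        level = {p: s for p, s in support(docs, n).items() if s >= minsup}
        if not level:
            break  # no frequent n-gram, so no longer one can be frequent
        # Every contiguous sub-pattern of a frequent pattern is itself
        # frequent, so the lookups in adherence() always succeed.
        sup.update(level)
        kept += [p for p in level if adherence(p, sup) >= minadherence]
    return kept


# Hypothetical tagged documents (Penn Treebank-style tags).
docs = [["PRP", "VBD", "DT", "NN"],
        ["PRP", "VBD", "JJ"],
        ["DT", "NN", "VBD"]]
patterns = mine_patterns(docs, minsup=0.6, minadherence=0.5)
```

Raising `minadherence` drops patterns whose parts co-occur mostly by chance: in the toy data `("PRP", "VBD")` has adherence of about 0.67 because `VBD` appears in every document, while `("DT", "NN")` scores 1.0.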
POS sequence algo.
Feature Selection
• The system uses the EFS (Ensemble Feature Selection) algorithm to select features.
  – EFS is a hybrid of filter and wrapper techniques
  – Criteria used for feature selection include information gain, mutual information, and the chi-square test
• Feature value assignment
  – Feature values are either Boolean or term frequencies
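The filter side of such a hybrid can be sketched as follows. This is not the EFS algorithm itself, whose details the slides do not give: the sketch only ranks Boolean features by two of the named criteria (chi-square and information gain) and takes the union of the top-ranked features from each, leaving out the wrapper step that would evaluate candidate sets with a classifier. All function names and the `top_k` parameter are hypothetical.

```python
import math


def contingency(feature, docs, labels, positive):
    """2x2 counts: (present & pos, present & neg, absent & pos, absent & neg)."""
    n11 = n10 = n01 = n00 = 0
    for feats, y in zip(docs, labels):
        if feature in feats:
            if y == positive:
                n11 += 1
            else:
                n10 += 1
        elif y == positive:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 feature/class table."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0


def entropy(counts):
    """Shannon entropy (bits) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def info_gain(n11, n10, n01, n00):
    """Information gain of a Boolean feature for a binary class."""
    n = n11 + n10 + n01 + n00
    gain = entropy([n11 + n01, n10 + n00])
    for part in ((n11, n10), (n01, n00)):
        if sum(part):
            gain -= sum(part) / n * entropy(part)
    return gain


def filter_select(docs, labels, positive, top_k):
    """Union of the top_k features under each filter criterion."""
    vocab = sorted(set().union(*docs))
    selected = set()
    for criterion in (chi_square, info_gain):
        ranked = sorted(
            vocab,
            key=lambda f: criterion(*contingency(f, docs, labels, positive)),
            reverse=True,
        )
        selected.update(ranked[:top_k])
    return selected


# Hypothetical Boolean feature sets (a feature is either present or not).
docs = [{"lol", "awfully"}, {"awfully", "the"}, {"the", "dude"}, {"dude", "the"}]
labels = ["F", "F", "M", "M"]
chosen = filter_select(docs, labels, positive="F", top_k=2)
```

Here each document is a set, which corresponds to Boolean feature values; term-frequency values would need a count per feature instead.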
Experiments and results
• Classifiers
  – Naïve Bayes
  – SVM
  – SVM regression
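As an illustration of the first classifier in the list, here is a minimal multinomial Naïve Bayes sketch over term-frequency features. The slides do not say which Naïve Bayes variant or smoothing was used, so the multinomial model and add-one smoothing are assumptions, and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class and per-class feature counts from token lists."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, y in zip(docs, labels):
        feat_counts[y].update(feats)  # repeated tokens add term frequency
        vocab.update(feats)
    return class_counts, feat_counts, vocab


def predict_nb(model, feats):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, feat_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        denom = sum(feat_counts[y].values()) + len(vocab)
        score = math.log(n_y / total_docs)
        for f in feats:
            if f in vocab:  # features unseen in training are ignored
                score += math.log((feat_counts[y][f] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best


# Hypothetical training posts, already reduced to selected features.
docs = [["lol", "awfully", "awfully"], ["awfully", "cute"],
        ["dude", "game"], ["game", "dude", "dude"]]
labels = ["F", "F", "M", "M"]
model = train_nb(docs, labels)
guess = predict_nb(model, ["awfully", "lol"])
```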
Results
Comparison results