Linguistic Ethnography: Identifying Dominant Word Classes in Text
Rada Mihalcea, University of Michigan
Stephen Pulman, Oxford University
Linguistic Ethnography?
• Finding and understanding patterns in given types of text
  – Find the characteristics of a text
  – Reflective of behavior or style
• Examples
  – Female vs. male authored texts (gender)
  – Texts describing happy vs. sad moods (mood)
  – Humorous vs. non-humorous text (comic)
  – Introvert vs. extrovert authors (psychology)
Linguistic Ethnography vs. Text Classification
• Text classification:
  – Automatic separation of classes of text
  – Supervised or semi-supervised algorithms (Naïve Bayes, SVM, perceptron, etc.)
  – Feature weighting and selection
• Linguistic ethnography
  – Identification of classes of words over salient features
  – Understand the characteristics of the texts
  – Insights into the properties and behaviors modeled by those texts
Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts
mood:
Home alone for too many hours, all week long ... screaming child, headache, tears that just won’t let themselves loose.... and now I’ve lost my wedding band. I hate this.
mood:
An Example: Finding Happiness
Corpus-derived Happiness Factors
Word       Factor    Word      Factor
yay        86.67     goodbye   18.81
shopping   79.56     hurt      17.39
awesome    79.71     tears     14.35
birthday   78.37     cried     11.39
lovely     77.39     upset     11.12
concert    74.85     sad       11.11
cool       73.72     cry       10.56
cute       73.20     died      10.07
lunch      73.02     lonely     9.50
books      73.02     crying     5.50
Identifying Word Classes in Text
• Foreground corpus: corpus of texts of interest
• Background corpus: “neutral” texts
  – Collection of texts that do not have the property shared by the foreground corpus
  – Balanced corpus
    • Mix of texts
• Goal: identify word classes that are dominant in the foreground corpus
Word Class Dominance
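• Word class: C = {W1, W2, …, Wn}
• Coverage of C in the foreground corpus F and in the background corpus B:

\[ \mathrm{Coverage}_F(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_F(W_i)}{\mathrm{Size}_F} \]

\[ \mathrm{Coverage}_B(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_B(W_i)}{\mathrm{Size}_B} \]

• Dominance of C in the foreground corpus:

\[ \mathrm{Dominance}_F(C) = \frac{\mathrm{Coverage}_F(C)}{\mathrm{Coverage}_B(C)} \]

• Score significantly higher than 1: word classes that are dominant in the foreground corpus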
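The dominance score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: corpus size is taken to be the number of word tokens, tokenization is a plain whitespace split, and the function names and toy corpora are hypothetical rather than taken from the slides.

```python
from collections import Counter

def coverage(word_class, tokens):
    """Share of the corpus tokens that belong to the word class."""
    counts = Counter(tokens)
    # Assumption: corpus size is measured in word tokens.
    return sum(counts[w] for w in set(word_class)) / len(tokens)

def dominance(word_class, foreground_tokens, background_tokens):
    """Foreground coverage divided by background coverage."""
    cov_f = coverage(word_class, foreground_tokens)
    cov_b = coverage(word_class, background_tokens)
    if cov_b == 0:
        # The class never occurs in the background: the ratio is undefined
        # (or unbounded), which this sketch reports as NaN or infinity.
        return float("inf") if cov_f > 0 else float("nan")
    return cov_f / cov_b

# Toy corpora (made up for illustration only).
foreground = "you know you are in trouble when you smile".split()
background = "the market closed higher and you could see why the index rose".split()
you_class = {"you", "thou", "thy", "thee"}

score = dominance(you_class, foreground, background)
print(round(score, 2))  # well above 1, so the class counts as dominant here
```

On real data the background corpus would need to be large enough that most classes have non-zero coverage; otherwise the ratio is unstable.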
Lexical Resources for Word Classes
• Roget
  – Thesaurus of English language
  – 100,000 words grouped based on synonymy and other semantic relations
• Linguistic Inquiry and Word Count (LIWC)
  – Lexicon developed for psycholinguistic analysis (Pennebaker et al.)
  – 2,200 words grouped into 70 classes
• WordNet Affect
  – Resource built on top of WordNet
  – Annotations with the emotions in the classification of Ortony
  – Focus on: anger, disgust, fear, joy, sadness, surprise
Word Class Examples
• Roget:
  – PERFECTION: perfection, purity, integrity, impeccability, …
  – MEDIOCRITY: mediocrity, dullness, indifference, inferiority, …
• LIWC:
  – OPTIMISM: accept, best, confidence, glorious, hope, …
  – SOCIAL: adult, advice, affair, boy, buddies, comrade, …
• WordNet-Affect:
  – ANGER: offense, temper, irritation, fury, rage, …
  – JOY: worship, adoration, sympathy, tenderness, respect, love, …
A Case Study: Verbal Humour
• Gain insights into the “language of humour”
• Find classes of words that are dominant in humorous text
• Foreground corpus: humorous text
  – Two types of verbal humour:
    • One-liners
    • Humorous news articles
• Background corpus: non-humorous text
  – A mix of data from non-humorous sources: Reuters newspapers, British National Corpus, proverbs, Open Mind Common Sense
Humorous Data: One-liners
• “He who smiles in a crisis has found someone to blame”
• Short sentence, simple syntax
• Deliberate use of rhetorical devices (alliteration, rhyme)
• Frequent use of creative language
• Comic effect
• Web-based bootstrapping
  • Start with a few manually selected seeds
  • Identify a list of Web pages including at least one seed
  • Parse Web pages and find new one-liners
  • Repeat
  – 16,000 one-liners
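The bootstrapping loop above can be simulated on a small in-memory collection. The sketch below is a toy illustration, not the actual collection pipeline: the pages are a hypothetical dictionary instead of fetched Web pages, and the length filter used to pick candidate one-liners is an assumption standing in for whatever page parsing and filtering was really used.

```python
def bootstrap_one_liners(seeds, pages, max_len=100, max_rounds=10):
    """Grow a set of one-liners from seeds.

    pages: dict mapping a page id to the list of text lines on that page
    (a stand-in for parsed Web pages).
    """
    known = set(seeds)
    for _ in range(max_rounds):
        # Pages that include at least one already-known one-liner.
        matching = [lines for lines in pages.values() if known.intersection(lines)]
        # Assumption: any short, non-empty line on such a page is a candidate.
        found = {line for lines in matching for line in lines
                 if 0 < len(line) <= max_len}
        if found <= known:
            break  # nothing new was found, so stop repeating
        known |= found
    return known

# Hypothetical pages, reusing one-liners quoted in the slides.
pages = {
    "page_a": ["He who smiles in a crisis has found someone to blame",
               "Only adults have trouble with child-proof bottles."],
    "page_b": ["Only adults have trouble with child-proof bottles.",
               "When everything comes your way, you are in the wrong lane."],
    "page_c": ["Markets closed higher on Monday."],
}
seeds = ["He who smiles in a crisis has found someone to blame"]

collected = bootstrap_one_liners(seeds, pages)
print(len(collected))
```

Here the seed pulls in page_a, whose second line then pulls in page_b, while page_c is never reached because it shares no one-liner with the growing set.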
Humorous Data: News stories
• “The Onion”
  – “the best source of humour out there” (Jeff Greenfield, CNN)
• Canadian Prime Minister Jean Chrétien and Indian President Abdul Kalam held a subdued press conference in the Canadian Capitol building Monday to announce that the two nations have peacefully and sheepishly resolved a dispute over their common border. "We are - well, I guess proud isn't the word - relieved, I suppose, to restore friendly relations with India after the regrettable dispute over the exact coordinates of our shared border," said Chrétien, who refused to meet reporters' eyes as he nervously crumpled his prepared statement. "The border that, er... Well, I guess it turns out that we don't share a border after all." Chrétien then officially withdrew his country's demand that India hand over a 20-mile-wide stretch of land that was to have served as a demilitarized buffer zone between the two nations.
  – 1,125 news articles from August 2005 to March 2006
    • 1,000-10,000 characters
Dominant Roget Word Classes in Humorous Text
• anonymity 3.48 : you, person, cover, anonymous, unknown, unidentified, unspecified
• odor 3.36 : nose, smell, strong, breath, inhale, stink, pong, perfume, flavor
• secrecy 2.96 : close, wall, secret, meeting, apart, ourselves, security, censorship
• wrong 2.83 : wrong, illegal, evil, terrible, shame, beam, incorrect, pity, horror
• unorthodoxy 2.52 : error, non, err, wander, pagan, fallacy, atheism, erroneous, fallacious
• overestimation 2.45 : think, exaggerate, overestimated, overestimate, exaggerated
• disarrangement 2.18 : trouble, throw, ball, bug, insanity, confused, upset, mess, confuse
Dominant LIWC Word Classes in Humorous Text
• you 3.17 : you, thou, thy, thee, thin
• I 2.84 : myself, mine
• swear 2.81 : hell, ass, butt, suck, dick, arse, bastard, sucked, sucks, boobs
• self 2.23 : our, myself, mine, lets, ourselves, ours
• sexual 2.07 : love, loves, loved, naked, butt, gay, dick, boobs, cock, horny, fairy
• groom 2.06 : soap, shower, perfume, makeup
• cause 1.99 : why, how, because, found, since, product, depends, thus, cos
• humans 1.79 : man, men, person, children, human, child, kids, baby, girl, boy
Dominant WordNet-Affect Word Classes in Humorous Text
• surprise 3.31 : stupid, wonder, wonderful, beat, surprised, surprise, amazing, terrific
Evaluation
• How good are these classes?
• Derive word classes from different data sets and measure correlation
  • Split the one-liners in two: 8,000 one-liners vs. 8,000 one-liners
  • Split the news stories in two: 550 stories vs. 550 stories
  • 16,000 one-liners vs. 1,100 news stories

                                 Roget   LIWC
one-liners vs. one-liners        0.95    0.96
news stories vs. news stories    0.84    0.88
one-liners vs. news stories      0.63    0.42
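One way to run this kind of comparison is to correlate the per-class dominance scores obtained from two data sets. The slides do not say which correlation measure was used; the sketch below assumes Pearson correlation, and the class scores in it are made-up numbers for illustration, not results from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def class_score_correlation(scores_a, scores_b):
    """Correlate dominance scores over the classes present in both sets."""
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson([scores_a[c] for c in shared],
                   [scores_b[c] for c in shared])

# Hypothetical dominance scores from two halves of a corpus.
half_a = {"you": 3.0, "swear": 2.5, "groom": 2.0, "cause": 1.5}
half_b = {"you": 3.2, "swear": 2.4, "groom": 2.1, "cause": 1.4}

r = class_score_correlation(half_a, half_b)
print(round(r, 2))
```

A value near 1 means the two data sets rank and scale the word classes similarly, which is the sense in which the split-half comparisons above indicate stable classes.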
Characteristics of Verbal Humour
• Observed by analyzing the word classes
• Human-centeredness
  – YOU, I, SELF, HUMANS
  • you occurs in more than 25% of the one-liners
  • “You can always find what you are not looking for.”
  • professional communities
  • “It was so cold last winter, that I saw a lawyer with his hands in his own pockets.”
Characteristics of Verbal Humour
• Negative polarity
  – WRONG, UNORTHODOXY, DISARRANGEMENT
  • “Only adults have trouble with child-proof bottles.”
  • “When everything comes your way, you are in the wrong lane.”
Dominant Classes in Humour
– Human-centeredness: human-related semantic classes found dominant in humorous text as compared to non-humorous text
– Negative polarity: semantic classes with negative orientation
• Humour as “natural therapy” where tensions related to negative scenarios concerning us humans are relieved through laughter
• Correlation with empirical observations from previous work
  • Human-centeredness, negative polarity, sexual vocabulary, swear words, surprise
Conclusions
• Find the dominant word classes in types of text
  • Reflective of behavior or style
  • Systematic and portable
• Case study on humour:
  • Good correlation among classes derived from different corpora
  • Correlation with empirical observations from previous work
“A conclusion is simply the place where you got tired of thinking.”