Joshua Wortman Industrial Engineering and Management, Technion Prof. Alon Itai Computer Science,...
Joshua Wortman
Industrial Engineering and Management, Technion
Prof. Alon Itai
Computer Science, Technion
The contents of this presentation may not be reused or redistributed in whole or in part without expressed written permission of the authors. Please contact jwortman-at-technion.ac.il for more information. 2010.
Film classification using subtitles and automatically generated language factors

Contents
Background and motivations
Components of Analysis
Classification models
Conclusions

Background & Motivations
The data:
Subtitle files for 1062 films
Genres from IMDB
The challenge:
Label each film with its genres.

Background & Motivations: What is a genre?
Drama
Thriller
Comedy
Action
Crime
Romance
Animation
Family
War
Sci-Fi
Fantasy
Horror
Adventure
Mystery
Genre is a theme or style, not a topic.

Background & Motivations: There are no prototypes
Films labeled with both genres (counts):

| | Drama | Comedy | Thriller | Action | Romance | Adventure | Crime | Sci-Fi | Fantasy | Horror | Mystery | Family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Comedy | 122 | | | | | | | | | | | |
| Thriller | 161 | 48 | | | | | | | | | | |
| Action | 89 | 75 | 193 | | | | | | | | | |
| Romance | 139 | 138 | 38 | 27 | | | | | | | | |
| Adventure | 42 | 80 | 77 | 147 | 33 | | | | | | | |
| Crime | 121 | 69 | 155 | 83 | 21 | 12 | | | | | | |
| Sci-Fi | 23 | 23 | 81 | 91 | 9 | 67 | 9 | | | | | |
| Fantasy | 31 | 56 | 41 | 52 | 30 | 71 | 4 | 24 | | | | |
| Horror | 25 | 12 | 91 | 35 | 4 | 18 | 8 | 30 | 28 | | | |
| Mystery | 49 | 12 | 84 | 12 | 13 | 13 | 34 | 20 | 20 | 35 | | |
| Family | 16 | 66 | 7 | 18 | 22 | 70 | 4 | 15 | 41 | 1 | 6 | |
| Animation | 9 | 27 | 7 | 13 | 8 | 46 | 2 | 17 | 25 | 0 | 2 | 49 |

Background & Motivations

| Genre | Count | % | Examples |
|---|---|---|---|
| Drama | 464 | 43% | The Shawshank Redemption, 1994; Passion of the Christ, 2004 |
| Comedy | 401 | 37% | Next Friday, 2000; Legally Blonde, 2001 |
| Thriller | 391 | 37% | The Glass House, 2001; Red Eye, 2005 |
| Action | 340 | 32% | Sniper 2, 2002; A Better Way to Die, 2000 |
| Romance | 244 | 23% | Meet the Parents, 2000; The Notebook, 2004 |
| Adventure | 240 | 22% | Indiana Jones and the Temple of Doom, 1984 |
| Crime | 230 | 21% | Get Shorty, 1995; Gangs of New York, 2002 |
| Sci-Fi | 142 | 13% | Battlestar Galactica: The Second Coming, 1999 |
| Fantasy | 140 | 13% | Big, 1988; Bruce Almighty, 2003 |
| Horror | 114 | 11% | Secret Window, 2004; Alien: Resurrection, 1997 |
| Mystery | 110 | 10% | The Sixth Sense, 1999; Flightplan, 2005 |
| Family | 107 | 10% | Charlie and the Chocolate Factory, 2005 |
| Animation | 66 | 6% | Monsters, Inc., 2001; Snow White and the Seven Dwarfs, 1937 |
| War | 49 | 5% | Kippur, 2000; Saving Private Ryan, 1998 |

Background & Motivations: What tools and methods do we have available?
Topical:
Bag of words
Entity extraction (Feldman et al.; various)
WordNet (Katsiouli, Tsetsos, & Hadjiefthymiades, 2007)
Stylistic:
Lexicographic analysis, corpus linguistics (Biber, Conrad & Reppen, 2004)
POS tagging
Usage of linguistic word categories in text: Linguistic Inquiry and Word Count (LIWC)

Background & Motivations: Factor Approach with LIWC
Pennebaker & King. Linguistic Styles: Language Use as an Individual Difference. Journal of Personality and Social Psychology (1999).
Mairesse et al. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research (2007).

| Group | Category | Examples | Total stems |
|---|---|---|---|
| Linguistic processes | Personal pronouns | I, them, her | 70 |
| Linguistic processes | 1st pers singular | I, me, mine | 12 |
| Linguistic processes | 2nd person | You, your, thou | 20 |
| Linguistic processes | 3rd pers singular | She, her, him | 17 |
| Linguistic processes | Past tense | Went, ran, had | 145 |
| Linguistic processes | Present tense | Is, does, hear | 169 |
| Linguistic processes | Future tense | Will, gonna | 48 |
| Linguistic processes | Adverbs | Very, really, quickly | 69 |
| Linguistic processes | Prepositions | To, with, above | 60 |
| Linguistic processes | Conjunctions | And, but, whereas | 28 |
| Linguistic processes | Negations | No, not, never | 57 |
| Linguistic processes | Quantifiers | Few, many, much | 89 |
| Linguistic processes | Swear words | Damn, piss, … | 53 |
| Linguistic processes | Assent | Agree, OK, yes | 30 |
| Psychological processes | Social processes | Mate, talk, they, child | 455 |
| Psychological processes | Family | Daughter, husband, aunt | 64 |
| Psychological processes | Friends | Buddy, friend, neighbor | 37 |
| Psychological processes | Humans | Adult, baby, boy | 61 |
| Psychological processes | Affective processes | Happy, cried, abandon | 915 |
| Psychological processes | Positive emotion | Love, nice, sweet | 406 |
| Psychological processes | Negative emotion | Hurt, ugly, nasty | 499 |
| Psychological processes | Anxiety | Worried, fearful, nervous | 91 |
| Psychological processes | Anger | Hate, kill, annoyed | 184 |
| Psychological processes | Sadness | Crying, grief, sad | 101 |
| Psychological processes | Cognitive processes | Cause, know, ought | 730 |
| Personal concerns | Work | Job, majors, xerox | 327 |
| Personal concerns | Achievement | Earn, hero, win | 186 |
| Personal concerns | Leisure | Cook, chat, movie | 229 |
| Personal concerns | Home | Apartment, kitchen, family | 93 |
| Personal concerns | Money | Audit, cash, owe | 173 |
| Personal concerns | Religion | Altar, church, mosque | 159 |
| Personal concerns | Death | Bury, coffin, kill | 62 |

Background & Motivations
The same LIWC category table, with example categories marked as probabilities:
p(future)
p(assent)
p(anxiety)

Components of Analysis: Factor Approach
For a set of films $i = 1, 2, \dots$ and genres $g \in \{\text{Drama}, \text{Comedy}, \dots\}$:
$d_i$ is subtitle file $i$ (1 document)
$D$: set of subtitle files from a training set
$D = \{ d_i : i \in \text{training set} \}$
$D_g$: set of subtitle files from the same training set, where each is labeled with $g$ (template)
$D_g = \{ d_i : i \in g \cap \text{training set} \}$
Example: $d_i$ contains the words s, a, c, s, a, b, x, w, z, u, x, y, r, n, a, f, m. This film is 17 words long.
$\text{factor}_\alpha = \{a, b, c, d\}$
$\#(\text{factor}_\alpha \mid d_i) = 5$

Components of Analysis: Factor Approach
Probability for $\text{factor}_\alpha$ in $d_i$ is:
$p_{i,\alpha} = p(\text{factor}_\alpha \mid d_i) = \frac{\#(\text{factor}_\alpha \mid d_i)}{|d_i|}$
Example: $d_i$ is 17 words long, $\text{factor}_\alpha = \{a, b, c, d\}$ and $\#(\text{factor}_\alpha \mid d_i) = 5$, so $p_{i,\alpha} = 5/17 = 29\%$.
The film $i$ is represented as a vector of probabilities:
$p_i = (p_{i,1}, p_{i,2}, \dots, p_{i,\alpha}, \dots, p_{i,m}) \in [0,1]^m$
Probability for $\text{factor}_\alpha$ in $D$ is:
$\hat{p}_\alpha = p(\text{factor}_\alpha \mid D) = \frac{\#(\text{factor}_\alpha \mid D)}{|D|}$
Probability for $\text{factor}_\alpha$ in $D_g$ is:
$p_{g,\alpha} = p(\text{factor}_\alpha \mid D_g) = \frac{\#(\text{factor}_\alpha \mid D_g)}{|D_g|}$
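The probability of a factor in a document can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `factor_probability` and `corpus_probability` are hypothetical, documents are assumed to be pre-tokenized word lists, and |D| is assumed to be the total word count of the pooled documents.

```python
def factor_probability(words, factor):
    """Return #(factor | d) / |d| for one tokenized document d."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in factor)
    return hits / len(words)


def corpus_probability(docs, factor):
    """Pool a set of documents (D or D_g) and return #(factor | D) / |D|.

    Assumption: |D| is the total number of words over all pooled documents.
    """
    pooled = [w for words in docs for w in words]
    return factor_probability(pooled, factor)


# The 17-word toy document and the factor {a, b, c, d} from the slide.
d_i = "s a c s a b x w z u x y r n a f m".split()
factor_alpha = {"a", "b", "c", "d"}

p_i_alpha = factor_probability(d_i, factor_alpha)
print(f"{p_i_alpha:.0%}")  # 29%
```

Applying `factor_probability` once per factor yields the vector $p_i$ for a film; applying `corpus_probability` to the training set or to a genre subset yields the corresponding corpus-level values.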

Components of Analysis
[Figure: the films Scooby-Doo, Scream, Cinderella and Top Gun and the genre template MYSTERY plotted as points in a space with axes factor$_1$, factor$_2$, …, factor$_m$.]
$\text{MYSTERY} = (p_{\text{MYSTERY},1}, p_{\text{MYSTERY},2}, \dots, p_{\text{MYSTERY},m})$
$\text{Scooby-Doo} = (p_{\text{SCOOBY},1}, p_{\text{SCOOBY},2}, \dots, p_{\text{SCOOBY},m})$

Components of Analysis: Drill down into LIWC category “SWEAR_WORDS”
[Figure: bar chart of $p_{g,\text{SWEAR}}$ per genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Sci-Fi, Fantasy, Horror, Mystery, Family, Animation, War); vertical axis from 0.0% to 4.5%; annotations: $p_{\text{CRIME},\text{SWEAR}}$, 0.63%.]
$p_{g,\text{SWEAR}} = \frac{\#(\text{SWEAR} \mid D_g)}{|D_g|}$
$\hat{p}_{\text{SWEAR}} = \frac{\#(\text{SWEAR} \mid D)}{|D|}$
Maybe log likelihood can help?
$\log \frac{p_{\text{CRIME},\text{SWEAR}}}{\hat{p}_{\text{SWEAR}}}$

Components of Analysis: Drill down into LIWC category “SWEAR_WORDS”
Michelson Contrast Function
Amplifies low signals
Bounds max signal: values lie in $[-1, 1]^m$
Contrast for film $i$:
$\frac{p_{i,\text{SWEAR}} - \hat{p}_{\text{SWEAR}}}{p_{i,\text{SWEAR}} + \hat{p}_{\text{SWEAR}}}$
Contrast for the genre Crime:
$\frac{p_{\text{CRIME},\text{SWEAR}} - \hat{p}_{\text{SWEAR}}}{p_{\text{CRIME},\text{SWEAR}} + \hat{p}_{\text{SWEAR}}}$
[Figure: Michelson Contrast Function vs. Log Likelihood Ratio; horizontal axis from 0.0 to 1.0, vertical axis from -1.5 to 1.5; series: Contrast, Log Likelihood; annotation: X = 0.05.]
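The difference between the two transforms can be sketched numerically. This is an illustration under stated assumptions: the function names are hypothetical, the reference value 0.05 is taken from the "X = 0.05" annotation on the plot, and the logarithm base (10 here) is an assumption because the slide does not state it.

```python
import math


def michelson_contrast(p, p_hat):
    """(p - p_hat) / (p + p_hat); always within [-1, 1] for non-negative inputs."""
    total = p + p_hat
    if total == 0:
        return 0.0  # assumption: a factor absent everywhere carries no signal
    return (p - p_hat) / total


def log_ratio(p, p_hat):
    """log10(p / p_hat); unbounded, and undefined when p == 0."""
    return math.log10(p / p_hat)


p_hat = 0.05  # reference probability, as annotated on the plot
for p in (0.001, 0.01, 0.05, 0.2, 1.0):
    print(p, round(michelson_contrast(p, p_hat), 3), round(log_ratio(p, p_hat), 3))
```

The contrast stays finite when a film never uses the factor (`p = 0` gives -1), whereas the log ratio diverges there, which is one practical reason to prefer the bounded form.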

Components of Analysis: Drill down into LIWC category “SWEAR_WORDS”
[Figure, panel "Probability": bar chart of $p_{g,\text{SWEAR}}$ per genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Sci-Fi, Fantasy, Horror, Mystery, Family, Animation, War); vertical axis from 0.0% to 5.0%; annotation: 0.63%.]
$p_{g,\text{SWEAR}} = \frac{\#(\text{SWEAR} \mid D_g)}{|D_g|}$
$\hat{p}_{\text{SWEAR}} = \frac{\#(\text{SWEAR} \mid D)}{|D|}$
[Figure, panel "Contrast": bar chart of the contrast per genre for the same genres; vertical axis from -0.7 to 0.7.]
$\frac{p_{g,\text{SWEAR}} - \hat{p}_{\text{SWEAR}}}{p_{g,\text{SWEAR}} + \hat{p}_{\text{SWEAR}}}$

Components of Analysis
[Figure: the films Scooby-Doo, Scream, Cinderella and Top Gun and the genre template MYSTERY plotted in a space with axes MC(·, x$_1$), MC(·, x$_2$), …, MC(·, x$_m$); the corpus values $\hat{p}_1, \hat{p}_2, \dots, \hat{p}_m$ are marked on the axes.]
MYSTERY and Scooby-Doo are now vectors of contrast values, one component for each factor $1, 2, \dots, m$.
$\text{sim}(\text{MYST}, \text{Scooby}) = \frac{\text{MYST} \cdot \text{Scooby}}{\|\text{MYST}\| \, \|\text{Scooby}\|}$
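A similarity of this form (dot product divided by the product of the vector lengths, i.e. cosine similarity) can be sketched as follows. The three-component vectors are invented for illustration and are not values from the presentation.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical contrast vectors over m = 3 factors (illustrative numbers only).
mystery_template = [0.4, -0.2, 0.1]
scooby_doo = [0.3, -0.1, 0.5]
top_gun = [-0.5, 0.6, -0.1]

print(round(cosine_similarity(mystery_template, scooby_doo), 3))
print(round(cosine_similarity(mystery_template, top_gun), 3))
```

A film's score for a genre is then its similarity to that genre's template vector, and the score is compared against a per-genre threshold.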

Classification Models
10 fold cross validation:
70% train set: creates $D$ and each $D_g$
20% threshold training: identify optimal classification threshold
10% test set: used to generate performance values
65 LIWC categories, targeting optimal F-score
[Figure: bar chart of Precision, F-score and Recall per genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Fantasy, Sci-Fi, Horror, Family, Mystery, Animation, War); vertical axis from 0% to 90%.]
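The threshold-training step can be sketched as a search over candidate thresholds on the held-out 20% split. This is one simple way to do it, not necessarily the authors' procedure: the function names are hypothetical, a film is assumed to be tagged when its score is strictly above the threshold, and the scores and labels below are invented.

```python
def f_score_at(scores, labels, threshold):
    """F-score when every film with score > threshold is tagged with the genre."""
    hits = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    false_alarms = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    misses = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    if hits == 0:
        return 0.0
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try one threshold below all scores and the midpoints between sorted scores."""
    xs = sorted(set(scores))
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: f_score_at(scores, labels, t))


# Hypothetical similarity scores for one genre on the threshold-training split.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False]

t = best_threshold(scores, labels)
print(round(t, 2), round(f_score_at(scores, labels, t), 3))
```

Running this once per genre gives one threshold per genre, which is then frozen before the test split is scored.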

Classification Models
Precision = Hits / (Hits + FA) = 3 / (3 + 2) = 0.6
Recall = Hits / (Hits + Misses) = 3 / (3 + 1) = 0.75
F-score = 2 * Precision * Recall / (Precision + Recall) = 0.67
65 LIWC categories, targeting optimal F-score
[Figure: bar chart of Precision, F-score and Recall per genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Fantasy, Sci-Fi, Horror, Family, Mystery, Animation, War); vertical axis from 0% to 90%.]
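The worked example on this slide (3 hits, 2 false alarms, 1 miss) can be checked with a short sketch; the function name is hypothetical.

```python
def precision_recall_f(hits, false_alarms, misses):
    """Return (precision, recall, F-score) from the three counts."""
    tagged = hits + false_alarms
    actual = hits + misses
    precision = hits / tagged if tagged else 0.0
    recall = hits / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = precision_recall_f(hits=3, false_alarms=2, misses=1)
print(precision, recall, round(f_score, 2))  # 0.6 0.75 0.67
```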

Classification Models: 65 LIWC categories, targeting optimal F-score
On the slide each genre is marked as CAUGHT, MISSED or EXCESS.

| Title | Genres | Prec | Recl |
|---|---|---|---|
| Sorority Boys | Comedy Romance | 50% | 100% |
| The Cooler | Drama Comedy Romance Crime Thriller | 50% | 67% |
| The Art of War | Action Thriller | 100% | 100% |
| Metallic Blues | Drama Comedy Thriller | 50% | 50% |
| Gladiator | Action Drama Adventure Mystery Thriller War | 40% | 67% |
| Meet Joe Black | Romance Fantasy Mystery Comedy Drama | 33% | 33% |
| National Security | Action Crime Thriller Comedy Drama | 75% | 75% |
| Requiem for a Dream | Crime Drama Action Horror Thriller | 40% | 100% |
| Resident Evil: Apocalypse | Action Sci-Fi Horror Drama Mystery Thriller | 40% | 67% |
| Jeepers Creepers | Drama Horror Thriller Adventure Action Crime Mystery | 50% | 75% |
| Constantine | Action Drama Horror Thriller Fantasy Mystery | 80% | 80% |
| Ocean's Eleven | Comedy Thriller Crime Drama | 67% | 67% |
| Chitty Chitty Bang Bang | Family Comedy Fantasy Adventure Animation Romance | 50% | 100% |
| Clockstoppers | Adventure Sci-Fi Thriller Comedy Family Romance | 0% | 0% |
| Walk the Line | Drama Romance Comedy | 67% | 100% |
| Scooby-Doo | Adventure Comedy Family Fantasy Mystery Action Thriller | 50% | 40% |
| Daddy Day Care | Comedy Family Drama Romance | 50% | 100% |
| Batman Begins | Action Thriller Crime Adventure Comedy Drama Fantasy | 33% | 67% |
| Close Encounters of the Third Kind | Adventure Sci-Fi Drama Action Comedy Thriller | 40% | 67% |
| Bad Company | Adventure Comedy Thriller Action Animation Crime Romance | 40% | 50% |
| Waking Ned | Comedy Drama Romance | 33% | 100% |
| I, Robot | Action Mystery Sci-Fi Thriller Drama History | 80% | 100% |
| Bad Boys II | Action Crime Thriller Comedy Drama Horror | 60% | 75% |
| The Exorcist | Horror Thriller Drama Mystery | 50% | 100% |
| The Talented Mr. Ripley | Romance Crime Drama Mystery Thriller Animation Comedy Family | 25% | 20% |

Classification Models: Strength metric for factor selection
Using $D$ and $D_g$, split each $\text{factor}_k$ into two sets:
$H_{k,g} = \{ w \in \text{factor}_k \mid p(w \mid g) > p(w) \}$
$L_{k,g} = \{ w \in \text{factor}_k \mid p(w \mid g) \le p(w) \}$
where
$p(H_{k,g}) = \sum_{w \in H_{k,g}} p(w \mid g)$
$p(L_{k,g}) = \sum_{w \in L_{k,g}} p(w \mid g)$
Possibilities:
$p(H_{k,g}) = p(L_{k,g})$
$p(H_{k,g}) > p(L_{k,g})$
$p(H_{k,g}) < p(L_{k,g})$
Sort $k$ by:
$\left| \log \frac{p(H_{k,g})}{p(L_{k,g})} \right|$
40.5 factors used: 37% savings!
[Figure: F-score (0.2 to 0.8) against number of factors (0 to 70); series: Most Predictive, Least Predictive.]
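The strength metric can be sketched as below. The function name, the dictionary-based inputs and the toy probabilities are all hypothetical; the handling of a factor whose words fall entirely on one side (ranked as infinitely strong) is an assumption, since the slide does not say how that case is treated.

```python
import math


def factor_strength(factor_words, p_w_given_g, p_w):
    """|log(p(H) / p(L))| for one factor and one genre.

    H holds the factor's words with p(w|g) > p(w); L holds the rest.
    Both sums are taken over p(w|g), following the slide.
    """
    high = sum(p_w_given_g[w] for w in factor_words if p_w_given_g[w] > p_w[w])
    low = sum(p_w_given_g[w] for w in factor_words if p_w_given_g[w] <= p_w[w])
    if high == 0 and low == 0:
        return 0.0       # factor never occurs in the genre
    if high == 0 or low == 0:
        return math.inf  # assumption: a one-sided factor ranks first
    return abs(math.log(high / low))


# Hypothetical word probabilities for one genre and for the whole corpus.
p_w_given_g = {"a": 0.02, "b": 0.02, "c": 0.04, "d": 0.01}
p_w = {"a": 0.01, "b": 0.03, "c": 0.01, "d": 0.02}
factors = {"balanced": {"a", "b"}, "skewed": {"c", "d"}}

ranked = sorted(factors, key=lambda k: factor_strength(factors[k], p_w_given_g, p_w),
                reverse=True)
print(ranked)  # ['skewed', 'balanced']
```

Keeping only the top-ranked factors per genre is what reduces the number of dimensions used by the classifier.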

Classification Models: Generating factors from a graph
1. Select a set of useful words W
2. Represent relationship between members of W as a graph
3. Cluster words from graph
Implement vector model with these clusters
Two variants:
General Graph
Genre-specific Graph

Classification Models (general method)
1. Selecting useful words (no “love” methodology):
For each word $w \in D$:
$r(w)$ = percentage of films with $\#(w \mid d) \ge c$
If $2\% < r(w) < 45\%$, add $w$ to $W$.
Stop words include:

| word | df(w) | r(w) |
|---|---|---|
| please | 96% | 88% |
| sorry | 95% | 85% |
| life | 94% | 77% |
| love | 90% | 74% |
| kill | 85% | 55% |
| girl | 83% | 54% |
| wanted | 81% | 53% |
| shit | 67% | 51% |
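The word-selection rule can be sketched as a document-frequency filter. The function name and the toy corpus are hypothetical, and `c=2` is a placeholder: the presentation treats c as an empirical constant and does not give its value.

```python
def select_useful_words(docs, c=2, low=0.02, high=0.45):
    """Keep words used at least c times in more than `low` and fewer than `high` of films.

    docs: list of {word: count} dictionaries, one per subtitle file.
    """
    n_films = len(docs)
    films_with_word = {}
    for counts in docs:
        for word, count in counts.items():
            if count >= c:
                films_with_word[word] = films_with_word.get(word, 0) + 1
    return {w for w, k in films_with_word.items() if low < k / n_films < high}


# Toy corpus of 10 films: "love" is frequent in 9, "commander" in 3, "zeppelin" appears once.
docs = [{"love": 3} for _ in range(6)]
docs += [{"love": 4, "commander": 2} for _ in range(3)]
docs += [{"zeppelin": 1}]

print(sorted(select_useful_words(docs)))  # ['commander']
```

Words such as "love" are dropped because they are frequent in too many films to discriminate, and words below the count c in every film never enter the vocabulary.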

Classification Models (general method)
2. Building GWG with words in W:
For each word pair $w_i, w_j \in W$:
$r(w_i, w_j)$ = percentage of films with both $\#(w_i \mid d) \ge c$ and $\#(w_j \mid d) \ge c$
The co-occurrence ratio:
$\frac{r(w_i, w_j)}{r(w_i) \cdot r(w_j)}$
GWG contains the edge $(w_i, w_j)$ when the co-occurrence ratio reaches a threshold. Both $c$ and the threshold are empirical constants.
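Building the word graph from co-occurrence ratios can be sketched as follows. The function name is hypothetical, and both `c=2` and `min_ratio=1.2` are placeholder values: the presentation describes them only as empirical constants.

```python
from itertools import combinations


def build_word_graph(docs, words, c=2, min_ratio=1.2):
    """Return edges (w_i, w_j) whose co-occurrence ratio is at least min_ratio.

    docs: list of {word: count} dictionaries, one per film.
    The ratio is joint share / (share of w_i * share of w_j), i.e. how much more
    often the two words share a film than independence would predict.
    """
    n_films = len(docs)
    films = {w: {i for i, d in enumerate(docs) if d.get(w, 0) >= c} for w in words}
    edges = set()
    for wi, wj in combinations(sorted(words), 2):
        if not films[wi] or not films[wj]:
            continue
        joint = len(films[wi] & films[wj]) / n_films
        expected = (len(films[wi]) / n_films) * (len(films[wj]) / n_films)
        if joint / expected >= min_ratio:
            edges.add((wi, wj))
    return edges


docs = [
    {"ship": 3, "space": 2},
    {"ship": 2, "space": 5},
    {"love": 2},
    {"love": 3, "ship": 2},
]
print(build_word_graph(docs, {"ship", "space", "love"}))  # {('ship', 'space')}
```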

Classification Models (general method)
3. Clustering words of GWG into sets:
Create the set of maximum size cliques for each $w \in$ GWG
Merge highly similar cliques
Maximum sized cliques for “commander” (9 words each):
commander contact heading national position states strike target weapons
commander contact heading major national states strike target weapons
commander contact heading launch position states strike target weapons
commander contact heading launch major states strike target weapons
Relaxed clique containing “commander” using θ = 0.7:
attack base begin bomb bridge build built captain center commander complete contact crew destroy destroyed earth emergency energy escape force forward holding immediately impossible launch lieutenant main necessary planet prepare project ship signal space speed system weapons
This will be a factor in our custom model.
Modeling performance with these factors equals LIWC performance.
Strength metric reduces dimensionality: 38% of 220 factors used.
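Clique enumeration and merging can be sketched as below. Two points are assumptions rather than details from the presentation: maximal cliques are listed with the basic Bron-Kerbosch algorithm, and "highly similar" is interpreted as an overlap coefficient (shared words divided by the smaller clique's size) of at least θ. The presentation gives θ = 0.7 but does not define the similarity measure.

```python
def maximal_cliques(adj):
    """List all maximal cliques of a graph given as {vertex: set(neighbours)}."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def merge_similar(cliques, theta=0.7):
    """Greedily union cliques whose overlap coefficient is at least theta."""
    merged = [set(c) for c in sorted(cliques, key=len, reverse=True)]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if len(a & b) / min(len(a), len(b)) >= theta:
                    merged[i] = a | b
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


# The four 9-word cliques for "commander" listed on the slide.
commander_cliques = [
    "commander contact heading national position states strike target weapons".split(),
    "commander contact heading major national states strike target weapons".split(),
    "commander contact heading launch position states strike target weapons".split(),
    "commander contact heading launch major states strike target weapons".split(),
]
relaxed = merge_similar(commander_cliques, theta=0.7)
print(len(relaxed), sorted(relaxed[0]))
```

With this interpretation the four cliques collapse into a single 11-word cluster; the 37-word relaxed clique on the slide comes from merging many more cliques of the full graph.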

Classification Models (genre-specific method)
1. Selecting useful words:
For each word $w \in D_g$ (where $D_g = \{ d_i : i \in g \}$):
$r_g(w)$ = percentage of films in $g$ with $\#(w \mid d) \ge c$
$r_{\neg g}(w)$ = percentage of films not in $g$ with $\#(w \mid d) \ge c$ (that is, $r(w) - r_g(w)$)
If $r_g(w) \ge 25\%$ and $r_g(w) / r_{\neg g}(w) > 1.4$, add $w$ to $W_g$.

Classification Models (genre-specific method)
2. Building the GS word graph from $W_g$
For each pair $w, v \in W_g$, calculate the new word pair relationship value:
$\sum_{i \in g} \log(\#(w \mid d_i) + 1) \cdot \log(\#(v \mid d_i) + 1)$
Normalize them:
$\delta_g(w,v) = \frac{\sum_{i \in g} \log(\#(w \mid d_i) + 1) \cdot \log(\#(v \mid d_i) + 1)}{\sqrt{\sum_{i \in g} \log(\#(w \mid d_i) + 1)^2 \cdot \sum_{i \in g} \log(\#(v \mid d_i) + 1)^2}}$
GS contains the set of edges $(w_i, w_j)$ for which $\delta_g(w_i, w_j)$ reaches a genre-specific threshold.
3. Create cliques and relaxed cliques…
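The normalized pair value can be sketched as a cosine between log-scaled count vectors. The function name and the toy counts are hypothetical, and the square root in the denominator is an assumption (a standard cosine normalization); the extracted slide shows only the two sums of squares.

```python
import math


def normalized_pair_value(counts_w, counts_v):
    """Cosine-style similarity of log(count + 1) vectors over the films of one genre.

    counts_w[i] and counts_v[i] are #(w | d_i) and #(v | d_i) for film i in g.
    """
    xs = [math.log(k + 1) for k in counts_w]
    ys = [math.log(k + 1) for k in counts_v]
    raw = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))
    return raw / norm if norm else 0.0


# Hypothetical per-film counts of three words across four films of one genre.
commander = [5, 0, 3, 0]
weapons = [4, 0, 2, 1]
kiss = [0, 6, 0, 2]

print(round(normalized_pair_value(commander, weapons), 3))
print(round(normalized_pair_value(commander, kiss), 3))
```

The log scaling keeps a single film with a very high count from dominating the relationship between two words.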

Classification Models (genre-specific method)
Results using GS factors:
Performance also equals LIWC performance!
Strength metric reduces dimensionality: 55% of factors used.
Comparing GWG and GS factors:

| | GWG Factors | GS Factors |
|---|---|---|
| Unique words | 539 | 420 |
| Total usage | 3.57% | 3.98% |
| Clusters | 220 | 200 |
| Average cluster size | 7.1 | 10.1 |
| Largest | 37 | 33 |

Conclusion: let’s try mixing them…

Classification Models (mixed model)

| | GWG Factors | GS Factors | Mixed Factors |
|---|---|---|---|
| Unique words | 539 | 420 | 752 |
| Total usage | 3.6% | 4.0% | 5.9% |
| Factors | 220 | 200 | 375 |
| Factors used | 38% | 55% | 49% |

Performance level by model (values read from the bar chart):

| Model | Precision | Recall | F-score |
|---|---|---|---|
| LIWC, selected factors | 48.6% | 67.5% | 53.2% |
| Auto Factors 1, selected factors | 49.7% | 68.4% | 54.3% |
| Auto Factors 2, selected factors | 49.2% | 68.1% | 53.6% |
| Mixed 1 & 2, selected factors | 52.3% | 70.2% | 56.4% |

Mixed model performance is significantly better than all previous methods (p<0.003).
Compared to the LIWC model, error is reduced by:
8.1% for precision
6.6% for recall
7.4% for F-score

Classification Models (mixed model continued)
On the slide each genre is marked as CORRECT (CAUGHT), MISSED or EXCESS.

| Title | Genres | Precision | Recall |
|---|---|---|---|
| "Sorority Boys" | Comedy Romance | 50% | 100% |
| "The Cooler" | Comedy Drama Romance Crime | 50% | 33% |
| "The Art of War" | Action Thriller Drama Sci-Fi | 50% | 100% |
| "Metallic Blues" | Drama Comedy Thriller | 50% | 50% |
| "Gladiator" | Action Adventure Drama Fantasy War | 50% | 67% |
| "Meet Joe Black" | Romance Fantasy Mystery Comedy Drama | 33% | 33% |
| "National Security" | Action Crime Thriller Comedy | 100% | 75% |
| "Requiem for a Dream" | Drama Crime Action Horror Thriller | 25% | 50% |
| "Resident Evil: Apocalypse" | Action Horror Sci-Fi Crime Drama Mystery Thriller | 33% | 67% |
| "Jeepers Creepers" | Drama Horror Thriller Adventure Comedy Crime Mystery | 50% | 75% |
| "Constantine" | Action Horror Thriller Drama Fantasy Mystery | 75% | 60% |
| "Ocean's Eleven" | Crime Thriller Comedy Action Drama | 50% | 67% |
| "Chitty Chitty Bang Bang" | Fantasy Comedy Family Adventure Drama Romance | 40% | 67% |
| "Clockstoppers" | Adventure Sci-Fi Thriller Comedy | 0% | 0% |
| "Walk the Line" | Romance Drama Comedy | 67% | 100% |
| "Scooby-Doo" | Adventure Comedy Family Fantasy Mystery Animation Romance | 67% | 80% |
| "Daddy Day Care" | Comedy Family Action | 50% | 50% |
| "Batman Begins" | Action Thriller Crime Drama | 67% | 67% |
| "Close Encounters of the Third Kind" | Adventure Sci-Fi Drama Action Horror Thriller | 40% | 67% |
| "Bad Company" | Comedy Thriller Action Adventure | 100% | 50% |
| "Waking Ned" | Comedy Drama | 50% | 100% |
| "I, Robot" | Action Mystery Sci-Fi Thriller Adventure Crime Drama Horror | 50% | 100% |
| "Bad Boys II" | Action Crime Thriller Comedy Drama | 75% | 75% |
| "The Exorcist" | Horror Thriller Action Drama Mystery | 40% | 100% |
| "The Talented Mr. Ripley" | Drama Mystery Romance Thriller Crime Comedy | 75% | 60% |

Classification Models (mixed model continued)
Unmasking Scooby-doo…

| Genre | Score | Thresh | Tag | Correct |
|---|---|---|---|---|
| Animation | 0.784 | 0.783 | X | |
| Fantasy | 0.730 | 0.534 | X | C |
| Family | 0.678 | 0.633 | X | C |
| Adventure | 0.557 | 0.276 | X | C |
| Sci-Fi | 0.425 | 0.474 | | C |
| Romance | 0.395 | 0.380 | X | |
| Horror | 0.342 | 0.456 | | C |
| Comedy | 0.193 | 0.093 | X | C |
| Mystery | 0.079 | 0.638 | | |
| Action | -0.056 | -0.039 | | C |
| War | -0.312 | 0.674 | | C |
| Thriller | -0.317 | -0.015 | | C |
| Crime | -0.332 | 0.162 | | C |
| Drama | -0.358 | -0.085 | | C |

The Mystery score is far below the Mystery threshold.
For Scooby-Doo, CA = 78.6%.
Over all films, Classification Accuracy is 79.4%.
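The Scooby-Doo table can be replayed in a few lines to show how the tags and the 78.6% figure arise. The scores and thresholds are copied from the table; the set of true genres is inferred from the Correct column, and the rule "tag when score exceeds threshold" is an assumption that happens to reproduce every X in the table.

```python
# (genre, score, threshold) rows from the Scooby-Doo table.
rows = [
    ("Animation", 0.784, 0.783), ("Fantasy", 0.730, 0.534), ("Family", 0.678, 0.633),
    ("Adventure", 0.557, 0.276), ("Sci-Fi", 0.425, 0.474), ("Romance", 0.395, 0.380),
    ("Horror", 0.342, 0.456), ("Comedy", 0.193, 0.093), ("Mystery", 0.079, 0.638),
    ("Action", -0.056, -0.039), ("War", -0.312, 0.674), ("Thriller", -0.317, -0.015),
    ("Crime", -0.332, 0.162), ("Drama", -0.358, -0.085),
]
# Inferred from the Correct column: tagged-and-correct genres plus the missed Mystery.
true_genres = {"Adventure", "Comedy", "Family", "Fantasy", "Mystery"}

tagged = {genre for genre, score, threshold in rows if score > threshold}
correct = sum((genre in tagged) == (genre in true_genres) for genre, _, _ in rows)
accuracy = correct / len(rows)

hits = len(tagged & true_genres)
precision = hits / len(tagged)
recall = hits / len(true_genres)

print(sorted(tagged))
print(f"CA={accuracy:.1%} precision={precision:.0%} recall={recall:.0%}")
```

The precision and recall computed this way match the 67% and 80% listed for Scooby-Doo in the mixed-model results table.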

Conclusions: Should automatically generated factors replace LIWC?
"You can't begin to imagine the thousands of hours that have gone into the making of these dictionaries. And when I see that we apparently missed "yourself" I wonder how it is possible that it happened."
(Prof. James Pennebaker, LIWC, personal communication)
Our methods may be used for…
thematic document classification
personality research
building a better search engine
creating a movie recommendation system
~ Thank you ~