Joshua Wortman

Industrial Engineering and Management, Technion

Prof. Alon Itai

Computer Science, Technion

The contents of this presentation may not be reused or redistributed in whole or in part without expressed written permission of the authors. Please contact jwortman-at-technion.ac.il for more information. 2010.

Film classification using subtitles and automatically generated language factors


Contents

Background and motivations

Components of Analysis

Classification models

Conclusions


Background & Motivations

The data:

Subtitle files for 1062 films

Genres from IMDB

The challenge:

Label each film with its genres.


Background & Motivations: What is a genre?

Drama, Thriller, Comedy, Action, Crime, Romance, Animation, Family, War, Sci-Fi, Fantasy, Horror, Adventure, Mystery

Genre is a theme or style, not a topic.


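Background & Motivations: There are no prototypes

Genre co-occurrence counts (row genre paired with column genre):

| | Drama | Comedy | Thriller | Action | Romance | Adventure | Crime | Sci-Fi | Fantasy | Horror | Mystery | Family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Comedy | 122 | | | | | | | | | | | |
| Thriller | 161 | 48 | | | | | | | | | | |
| Action | 89 | 75 | 193 | | | | | | | | | |
| Romance | 139 | 138 | 38 | 27 | | | | | | | | |
| Adventure | 42 | 80 | 77 | 147 | 33 | | | | | | | |
| Crime | 121 | 69 | 155 | 83 | 21 | 12 | | | | | | |
| Sci-Fi | 23 | 23 | 81 | 91 | 9 | 67 | 9 | | | | | |
| Fantasy | 31 | 56 | 41 | 52 | 30 | 71 | 4 | 24 | | | | |
| Horror | 25 | 12 | 91 | 35 | 4 | 18 | 8 | 30 | 28 | | | |
| Mystery | 49 | 12 | 84 | 12 | 13 | 13 | 34 | 20 | 20 | 35 | | |
| Family | 16 | 66 | 7 | 18 | 22 | 70 | 4 | 15 | 41 | 1 | 6 | |
| Animation | 9 | 27 | 7 | 13 | 8 | 46 | 2 | 17 | 25 | 0 | 2 | 49 |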


Background & Motivations

| Genre | Count | % | Examples |
|---|---|---|---|
| Drama | 464 | 43% | The Shawshank Redemption, 1994; Passion of the Christ, 2004 |
| Comedy | 401 | 37% | Next Friday, 2000; Legally Blonde, 2001 |
| Thriller | 391 | 37% | The Glass House, 2001; Red Eye, 2005 |
| Action | 340 | 32% | Sniper 2, 2002; A Better Way to Die, 2000 |
| Romance | 244 | 23% | Meet the Parents, 2000; The Notebook, 2004 |
| Adventure | 240 | 22% | Indiana Jones and the Temple of Doom, 1984 |
| Crime | 230 | 21% | Get Shorty, 1995; Gangs of New York, 2002 |
| Sci-Fi | 142 | 13% | Battlestar Galactica: The Second Coming, 1999 |
| Fantasy | 140 | 13% | Big, 1988; Bruce Almighty, 2003 |
| Horror | 114 | 11% | Secret Window, 2004; Alien: Resurrection, 1997 |
| Mystery | 110 | 10% | The Sixth Sense, 1999; Flightplan, 2005 |
| Family | 107 | 10% | Charlie and the Chocolate Factory, 2005 |
| Animation | 66 | 6% | Monsters, Inc., 2001; Snow White and the Seven Dwarfs, 1937 |
| War | 49 | 5% | Kippur, 2000; Saving Private Ryan, 1998 |


Background & Motivations: What tools and methods do we have available?

Topical:

Bag of words

Entity extraction (Feldman et al.; various)

WordNet (Katsiouli, Tsetsos, & Hadjiefthymiades, 2007)

Stylistic:

Lexicographic analysis, corpus linguistics (Biber, Conrad & Reppen, 2004)

POS tagging

Usage of linguistic word categories in text: Linguistic Inquiry and Word Count (LIWC)


Background & Motivations: Factor Approach with LIWC

Pennebaker & King. Linguistic Style: Language Use as an Individual Difference. J. of Personality and Social Psychology (1999).

Mairesse et al. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research (2007).

| Group | Category | Examples | Total stems |
|---|---|---|---|
| Linguistic Processes | Personal pronouns | I, them, her | 70 |
| | 1st pers singular | I, me, mine | 12 |
| | 2nd person | You, your, thou | 20 |
| | 3rd pers singular | She, her, him | 17 |
| | Past tense | Went, ran, had | 145 |
| | Present tense | Is, does, hear | 169 |
| | Future tense | Will, gonna | 48 |
| | Adverbs | Very, really, quickly | 69 |
| | Prepositions | To, with, above | 60 |
| | Conjunctions | And, but, whereas | 28 |
| | Negations | No, not, never | 57 |
| | Quantifiers | Few, many, much | 89 |
| | Swear words | Damn, piss, … | 53 |
| | Assent | Agree, OK, yes | 30 |
| Psychological Processes | Social processes | Mate, talk, they, child | 455 |
| | Family | Daughter, husband, aunt | 64 |
| | Friends | Buddy, friend, neighbor | 37 |
| | Humans | Adult, baby, boy | 61 |
| | Affective processes | Happy, cried, abandon | 915 |
| | Positive emotion | Love, nice, sweet | 406 |
| | Negative emotion | Hurt, ugly, nasty | 499 |
| | Anxiety | Worried, fearful, nervous | 91 |
| | Anger | Hate, kill, annoyed | 184 |
| | Sadness | Crying, grief, sad | 101 |
| | Cognitive processes | Cause, know, ought | 730 |
| Personal Concerns | Work | Job, majors, xerox | 327 |
| | Achievement | Earn, hero, win | 186 |
| | Leisure | Cook, chat, movie | 229 |
| | Home | Apartment, kitchen, family | 93 |
| | Money | Audit, cash, owe | 173 |
| | Religion | Altar, church, mosque | 159 |
| | Death | Bury, coffin, kill | 62 |


Background & Motivations

The LIWC category table is repeated on this slide, annotated with example category probabilities:

p(future)

p(assent)

p(anxiety)


Components of Analysis: Factor Approach

For a set of films $i = 1, 2, \ldots$ and genres $g \in \{\text{Drama}, \text{Comedy}, \ldots\}$:

$d_i$ is subtitle file $i$ (1 document)

$D$: set of subtitle files from a training set

$D = \{d_i : i \in \text{training set}\}$

$D_g$: set of subtitle files from the same training set, where each is labeled with $g$ (template)

$D_g = \{d_i : i \in g \cap \text{training set}\}$

Example document $d_i$ with the words: s a c s a b x w z u x y r n a f m

This film is 17 words long

$\text{factor}_\alpha = \{a, b, c, d\}$

$\#(\text{factor}_\alpha \mid d_i) = 5$


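Components of Analysis: Factor Approach

Probability for $\text{factor}_\alpha$ in $d_i$ is:

$$p_{i,\alpha} = p(\text{factor}_\alpha \mid d_i) = \frac{\#(\text{factor}_\alpha \mid d_i)}{|d_i|}$$

Example: $d_i$ is 17 words long, $\text{factor}_\alpha = \{a, b, c, d\}$ and $\#(\text{factor}_\alpha \mid d_i) = 5$, so $p_{i,\alpha} = 5/17 = 29\%$.

The film $i$ is represented as a vector of probabilities:

$$p_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,\alpha}, \ldots, p_{i,m}) \in [0,1]^m$$

Probability for $\text{factor}_\alpha$ in $D$ is:

$$\hat{p}_\alpha = p(\text{factor}_\alpha) = \frac{\#(\text{factor}_\alpha)}{|D|}$$

Probability for $\text{factor}_\alpha$ in $D_g$ is:

$$p_{g,\alpha} = p(\text{factor}_\alpha \mid D_g) = \frac{\#(\text{factor}_\alpha \mid D_g)}{|D_g|}$$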
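The per-film factor probability can be sketched as follows. This is a minimal illustration, assuming a film is already tokenized into a list of words and a factor is a plain set of words; the function name `factor_probability` is hypothetical and not taken from the slides.

```python
from collections import Counter

def factor_probability(words, factor):
    """Share of the tokens in `words` that belong to the word set `factor`."""
    if not words:
        return 0.0  # assumption: an empty document carries no signal
    counts = Counter(words)
    hits = sum(counts[w] for w in factor)
    return hits / len(words)

# The 17-word toy document and factor from the slide
d_i = "s a c s a b x w z u x y r n a f m".split()
factor_alpha = {"a", "b", "c", "d"}

p = factor_probability(d_i, factor_alpha)
print(f"{p:.0%}")  # 5 of 17 words fall in the factor
```

Applying the same function to every factor gives the probability vector that represents one film.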


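Components of Analysis

[Figure: the MYSTERY genre template and the films Scooby-Doo, Scream, Cinderella and Top Gun plotted as points in a space with axes factor 1, factor 2, ..., factor m.]

$$\text{MYSTERY} = (p_{\text{MYSTERY},1}, p_{\text{MYSTERY},2}, \ldots, p_{\text{MYSTERY},m})$$

$$\text{Scooby-Doo} = (p_{\text{SCOOBY},1}, p_{\text{SCOOBY},2}, \ldots, p_{\text{SCOOBY},m})$$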


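Components of Analysis: Drill down into LIWC category “SWEAR_WORDS”

[Figure: bar chart of $p_{g,\text{SWEAR}}$ by genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Sci-Fi, Fantasy, Horror, Mystery, Family, Animation, War); vertical axis 0.0% to 4.5%; annotations: $p_{\text{CRIME},\text{SWEAR}}$ and 0.63%.]

$$p_{g,\text{SWEAR}} = \frac{\#(\text{SWEAR} \mid D_g)}{|D_g|}$$

$$\hat{p}_{\text{SWEAR}} = \frac{\#(\text{SWEAR})}{|D|}$$

Maybe log likelihood can help?

$$\log \frac{p_{\text{CRIME},\text{SWEAR}}}{\hat{p}_{\text{SWEAR}}}$$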


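Components of Analysis: Drill down into LIWC category “SWEAR_WORDS”

Michelson Contrast Function:

Amplifies low signals

Bounds max signal (each film or genre becomes a vector in $[-1, 1]^m$)

For a film $i$:

$$\mathrm{MC}(p_{i,\text{SWEAR}}, \hat{p}_{\text{SWEAR}}) = \frac{p_{i,\text{SWEAR}} - \hat{p}_{\text{SWEAR}}}{p_{i,\text{SWEAR}} + \hat{p}_{\text{SWEAR}}}$$

For a genre, for example Crime:

$$\mathrm{MC}(p_{\text{CRIME},\text{SWEAR}}, \hat{p}_{\text{SWEAR}}) = \frac{p_{\text{CRIME},\text{SWEAR}} - \hat{p}_{\text{SWEAR}}}{p_{\text{CRIME},\text{SWEAR}} + \hat{p}_{\text{SWEAR}}}$$

[Figure: "Michelson Contrast Function vs. Log Likelihood Ratio": curves "Contrast" and "Log Likelihood" over a horizontal axis from 0.0 to 1.0 and a vertical axis from -1.5 to 1.5; annotation X = 0.05.]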
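A small numeric comparison of the two transforms may help. This is only a sketch: the reference probability 0.05 is a hypothetical value, the base-10 logarithm is an assumption (the slides do not state the base), and the function names are illustrative.

```python
import math

def michelson_contrast(p, p_hat):
    """(p - p_hat) / (p + p_hat); lies in [-1, 1] for non-negative inputs."""
    total = p + p_hat
    if total == 0:
        return 0.0  # assumption: a factor that never occurs carries no signal
    return (p - p_hat) / total

def log_ratio(p, p_hat):
    """log10(p / p_hat); unbounded, and undefined when p == 0."""
    return math.log10(p / p_hat)

p_hat = 0.05  # hypothetical reference probability
for p in (0.001, 0.05, 0.5, 1.0):
    print(p, round(michelson_contrast(p, p_hat), 3), round(log_ratio(p, p_hat), 3))

# The contrast is still defined when the factor is absent from a film
print(michelson_contrast(0.0, p_hat))
```

Both transforms are zero when a probability equals its reference value; the contrast stays bounded at both extremes, while the log ratio diverges as the probability approaches zero.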


Components of Analysis: Drill down into LIWC category “SWEAR_WORDS”

[Figure: two bar charts by genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Sci-Fi, Fantasy, Horror, Mystery, Family, Animation, War). "Probability": $p_{g,\text{SWEAR}} = \#(\text{SWEAR} \mid D_g) / |D_g|$ shown with $\hat{p}_{\text{SWEAR}} = \#(\text{SWEAR}) / |D|$, vertical axis 0.0% to 5.0%, annotation 0.63%. "Contrast": $(p_{g,\text{SWEAR}} - \hat{p}_{\text{SWEAR}}) / (p_{g,\text{SWEAR}} + \hat{p}_{\text{SWEAR}})$, vertical axis -0.7 to 0.7.]


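Components of Analysis

[Figure: the MYSTERY genre template and the films Scooby-Doo, Scream, Cinderella and Top Gun plotted in contrast space, with axes $\mathrm{MC}(\cdot, x_1)$, $\mathrm{MC}(\cdot, x_2)$, ..., $\mathrm{MC}(\cdot, x_m)$ and axis annotations $\hat{p}_1$, $\hat{p}_2$, ..., $\hat{p}_m$.]

$$\text{MYSTERY} = (\mathrm{MC}_{\text{MYSTERY},1}, \mathrm{MC}_{\text{MYSTERY},2}, \ldots, \mathrm{MC}_{\text{MYSTERY},m})$$

$$\text{Scooby-Doo} = (\mathrm{MC}_{\text{SCOOBY},1}, \mathrm{MC}_{\text{SCOOBY},2}, \ldots, \mathrm{MC}_{\text{SCOOBY},m})$$

$$\mathrm{sim}(\text{Scooby}, \text{MYST}) = \frac{\text{Scooby} \cdot \text{MYST}}{\|\text{Scooby}\| \, \|\text{MYST}\|}$$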
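The film-to-genre similarity appears to be a cosine between two contrast vectors. The sketch below works under that assumption; the three-factor vectors are invented purely for illustration and are not values from the study.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical contrast vectors over m = 3 factors
mystery_template = [0.4, -0.2, 0.1]
film_a = [0.3, -0.1, 0.2]    # leans the same way as the template
film_b = [-0.5, 0.3, -0.1]   # leans the opposite way

print(round(cosine_similarity(film_a, mystery_template), 3))
print(round(cosine_similarity(film_b, mystery_template), 3))
```

A score near 1 means the film deviates from the corpus baseline in the same directions as the genre template; a negative score means it deviates in the opposite directions.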


Classification Models

10-fold cross validation:

70% train set: creates D and each Dg

20% threshold training: identify optimal classification threshold

10% test set: used to generate performance values
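The threshold-training step can be sketched as a search over candidate thresholds on the held-out 20% split, keeping the one with the best F-score. This is an assumed procedure, not necessarily the authors' exact one: the rule "tag when score >= threshold", the candidate set and the toy scores are all illustrative.

```python
def f_score(hits, false_alarms, misses):
    """Harmonic mean of precision and recall, 0.0 when there are no hits."""
    if hits == 0:
        return 0.0
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try each observed score as a threshold; return (threshold, F-score)."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        hits = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        misses = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f = f_score(hits, false_alarms, misses)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Hypothetical similarity scores for one genre on the threshold-training split
scores = [0.9, 0.7, 0.4, 0.2, -0.1]
labels = [True, True, False, True, False]
threshold, f = best_threshold(scores, labels)
print(threshold, round(f, 3))
```

One threshold is fitted per genre, so a rare genre can end up with a very different cut-off from a common one.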

65 LIWC categories, targeting optimal F-score

[Figure: bar chart of Precision, F-score and Recall per genre (Drama, Comedy, Thriller, Action, Romance, Adventure, Crime, Fantasy, Sci-Fi, Horror, Family, Mystery, Animation, War); vertical axis 0% to 90%.]


Classification Models

Precision = Hits / (Hits + FA) = 3 / (3 + 2) = 0.6

Recall = Hits / (Hits + Misses) = 3 / (3 + 1) = 0.75

F-score = 2 * Precision * Recall / (Precision + Recall) = 0.67

(FA = false alarms)



Classification Models: 65 LIWC categories, targeting optimal F-score

On the slide each genre label is color-coded as CAUGHT, MISSED or EXCESS.

| Title | Genres | Prec | Recl |
|---|---|---|---|
| Sorority Boys | Comedy, Romance | 50% | 100% |
| The Cooler | Drama, Comedy, Romance, Crime, Thriller | 50% | 67% |
| The Art of War | Action, Thriller | 100% | 100% |
| Metallic Blues | Drama, Comedy, Thriller | 50% | 50% |
| Gladiator | Action, Drama, Adventure, Mystery, Thriller, War | 40% | 67% |
| Meet Joe Black | Romance, Fantasy, Mystery, Comedy, Drama | 33% | 33% |
| National Security | Action, Crime, Thriller, Comedy, Drama | 75% | 75% |
| Requiem for a Dream | Crime, Drama, Action, Horror, Thriller | 40% | 100% |
| Resident Evil: Apocalypse | Action, Sci-Fi, Horror, Drama, Mystery, Thriller | 40% | 67% |
| Jeepers Creepers | Drama, Horror, Thriller, Adventure, Action, Crime, Mystery | 50% | 75% |
| Constantine | Action, Drama, Horror, Thriller, Fantasy, Mystery | 80% | 80% |
| Ocean's Eleven | Comedy, Thriller, Crime, Drama | 67% | 67% |
| Chitty Chitty Bang Bang | Family, Comedy, Fantasy, Adventure, Animation, Romance | 50% | 100% |
| Clockstoppers | Adventure, Sci-Fi, Thriller, Comedy, Family, Romance | 0% | 0% |
| Walk the Line | Drama, Romance, Comedy | 67% | 100% |
| Scooby-Doo | Adventure, Comedy, Family, Fantasy, Mystery, Action, Thriller | 50% | 40% |
| Daddy Day Care | Comedy, Family, Drama, Romance | 50% | 100% |
| Batman Begins | Action, Thriller, Crime, Adventure, Comedy, Drama, Fantasy | 33% | 67% |
| Close Encounters of the Third Kind | Adventure, Sci-Fi, Drama, Action, Comedy, Thriller | 40% | 67% |
| Bad Company | Adventure, Comedy, Thriller, Action, Animation, Crime, Romance | 40% | 50% |
| Waking Ned | Comedy, Drama, Romance | 33% | 100% |
| I, Robot | Action, Mystery, Sci-Fi, Thriller, Drama, History | 80% | 100% |
| Bad Boys II | Action, Crime, Thriller, Comedy, Drama, Horror | 60% | 75% |
| The Exorcist | Horror, Thriller, Drama, Mystery | 50% | 100% |
| The Talented Mr. Ripley | Romance, Crime, Drama, Mystery, Thriller, Animation, Comedy, Family | 25% | 20% |


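Classification Models: Strength metric for factor selection

Using $D$ and $D_g$, split each $\text{factor}_k$ into two sets:

$$H_{k,g} = \{w \in \text{factor}_k \mid p(w \mid g) > p(w)\}$$

$$L_{k,g} = \{w \in \text{factor}_k \mid p(w \mid g) \le p(w)\}$$

with

$$p(H_{k,g}) = \sum_{w \in H_{k,g}} p(w \mid g), \qquad p(L_{k,g}) = \sum_{w \in L_{k,g}} p(w \mid g)$$

Possibilities:

$p(H_{k,g}) = p(L_{k,g})$

$p(H_{k,g}) > p(L_{k,g})$

$p(H_{k,g}) < p(L_{k,g})$

Sort $k$ by:

$$\left| \log \frac{p(H_{k,g})}{p(L_{k,g})} \right|$$

40.5 factors used: 37% savings!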

[Figure: F-score (vertical axis, 0.2 to 0.8) versus # Factors (horizontal axis, 0 to 70); labels: "Most Predictive", "Least Predictive".]
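The strength metric can be sketched as below. The word probabilities are invented for illustration, the function name `factor_strength` is hypothetical, and the handling of an empty H or L set is an assumption, since the slides do not say what happens when the ratio is undefined.

```python
import math

def factor_strength(factor, p_word_given_genre, p_word):
    """|log(p(H) / p(L))| for one factor and one genre.

    H collects the factor's words that are more likely in the genre than
    overall, L collects the rest; both sums use p(w | g).
    """
    high = sum(p_word_given_genre.get(w, 0.0) for w in factor
               if p_word_given_genre.get(w, 0.0) > p_word.get(w, 0.0))
    low = sum(p_word_given_genre.get(w, 0.0) for w in factor
              if p_word_given_genre.get(w, 0.0) <= p_word.get(w, 0.0))
    if high == 0 and low == 0:
        return 0.0       # assumption: a factor unseen in the genre has no strength
    if high == 0 or low == 0:
        return math.inf  # assumption: a one-sided factor is maximally strong
    return abs(math.log(high / low))

# Hypothetical probabilities for a three-word factor
factor = {"damn", "hell", "heck"}
p_given_genre = {"damn": 0.004, "hell": 0.002, "heck": 0.001}
p_overall = {"damn": 0.002, "hell": 0.002, "heck": 0.002}

strength = factor_strength(factor, p_given_genre, p_overall)
print(round(strength, 4))
```

Sorting factors by this value in descending order and keeping only the top of the list is one way to read the reported reduction in the number of factors used.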


Classification Models: Generating factors from a graph

1. Select a set of useful words W

2. Represent relationship between members of W as a graph

3. Cluster words from graph

Implement vector model with these clusters

Two variants: General Graph and Genre-specific Graph


Classification Models (general method)

1. Selecting useful words (no “love” methodology):

For $w \in D$:

$\rho(w)$ = percentage of films with $\#(w \mid d) \ge c$

if $2\% < \rho(w) < 45\%$: add $w$ to $W$

Stop words include:

| word | df(w) | ρ(w) |
|---|---|---|
| please | 96% | 88% |
| sorry | 95% | 85% |
| life | 94% | 77% |
| love | 90% | 74% |
| kill | 85% | 55% |
| girl | 83% | 54% |
| wanted | 81% | 53% |
| shit | 67% | 51% |
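The word-selection filter can be sketched as follows, assuming each film is a list of tokens. The count cut-off `c = 2` and the toy bounds used in the example are hypothetical (the slides call c an empirical constant and do not give its value); the 2% and 45% bounds from the slide are kept as defaults.

```python
from collections import Counter

def select_words(films, c=2, low=0.02, high=0.45):
    """Keep words that occur at least c times in a moderate share of films."""
    film_counts = [Counter(tokens) for tokens in films]
    vocabulary = set().union(*film_counts)
    selected = set()
    for word in vocabulary:
        share = sum(1 for counts in film_counts if counts[word] >= c) / len(films)
        if low < share < high:
            selected.add(word)
    return selected

# Four toy films; bounds widened because the toy corpus is tiny
films = [
    "love love ship ship".split(),
    "love love ship".split(),
    "love love bomb bomb".split(),
    "love love".split(),
]
W = select_words(films, c=2, low=0.2, high=0.8)
print(sorted(W))
```

Here "love" is dropped because it reaches the count cut-off in every film, which mirrors how frequent words such as "love" end up as stop words in the table above.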


Classification Models (general method)

2. Building GWG with words in W:

For each word pair $w_i, w_j \in W$:

$\rho(w_i, w_j)$ = percentage of films with both $\#(w_i \mid d) \ge c$ and $\#(w_j \mid d) \ge c$

The co-occurrence ratio:

$$\frac{\rho(w_i, w_j)}{\rho(w_i) \cdot \rho(w_j)}$$

GWG contains the set of edges $(w_i, w_j)$ whose co-occurrence ratio reaches a threshold.

The count cut-off $c$ and the edge threshold are empirical constants.
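Building the word graph from co-occurrence ratios can be sketched as below. It assumes each film has already been reduced to the set of words that pass the count cut-off; the threshold `min_ratio = 1.2` and the toy films are hypothetical, since the slides only say the constants were chosen empirically.

```python
from itertools import combinations

def build_word_graph(film_word_sets, words, min_ratio):
    """Return edges (wi, wj) whose joint share exceeds chance by min_ratio.

    film_word_sets: one set per film with the words that pass the count cut-off.
    """
    n = len(film_word_sets)
    if n == 0:
        return set()
    share = {w: sum(w in s for s in film_word_sets) / n for w in words}
    edges = set()
    for wi, wj in combinations(sorted(words), 2):
        if share[wi] == 0 or share[wj] == 0:
            continue
        joint = sum(wi in s and wj in s for s in film_word_sets) / n
        if joint / (share[wi] * share[wj]) >= min_ratio:
            edges.add((wi, wj))
    return edges

# Four toy films
film_word_sets = [{"ship", "captain"}, {"ship", "captain"}, {"kiss"}, {"kiss", "ship"}]
edges = build_word_graph(film_word_sets, {"ship", "captain", "kiss"}, min_ratio=1.2)
print(edges)
```

A ratio of 1 means two words co-occur exactly as often as independence would predict, so only pairs that co-occur clearly more often than chance receive an edge.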


Classification Models (general method)

3. Clustering words of GWG into sets:

Create the set of maximum size cliques for each $w \in$ GWG

Merge highly similar cliques

Maximum sized cliques for “commander” (9 words each):

commander contact heading national position states strike target weapons

commander contact heading major national states strike target weapons

commander contact heading launch position states strike target weapons

commander contact heading launch major states strike target weapons

Relaxed clique containing “commander” using θ = 0.7:

attack base begin bomb bridge build built captain center commander complete contact crew destroy destroyed earth emergency energy escape force forward holding immediately impossible launch lieutenant main necessary planet prepare project ship signal space speed system weapons

This will be a factor in our custom model

Modeling performance with these factors equals LIWC performance

Strength metric reduces dimensionality: 38% of 220 factors used
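Finding the largest cliques that contain a given word can be sketched with the classic Bron-Kerbosch enumeration of maximal cliques. This is a generic technique swapped in for illustration; the slides do not say which clique algorithm was used, and the merging of similar cliques into relaxed cliques (θ = 0.7) is not reproduced here. The small graph is invented.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting. adj maps each node to its neighbour set."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques

def largest_cliques_for(word, adj):
    """Maximal cliques containing `word` that have the greatest size."""
    containing = [c for c in maximal_cliques(adj) if word in c]
    if not containing:
        return []
    size = max(len(c) for c in containing)
    return [c for c in containing if len(c) == size]

# Hypothetical word graph: two triangles sharing "commander" and "target"
adj = {
    "commander": {"strike", "target", "launch"},
    "strike": {"commander", "target"},
    "target": {"commander", "strike", "launch"},
    "launch": {"commander", "target"},
    "kiss": set(),
}
for clique in largest_cliques_for("commander", adj):
    print(sorted(clique))
```

As in the “commander” example on the slide, one word can sit in several equally large cliques that differ by only a word or two, which is what motivates merging them.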


Classification Models (genre-specific method)

1. Selecting useful words:

For $w \in D_g$ (where $D_g = \{d_i : i \in g\}$):

$\rho_g(w)$ = percentage of films in $g$ with $\#(w \mid d) \ge c$

$\rho_{\neg g}(w)$ = percentage of films not in $g$ with $\#(w \mid d) \ge c$ (derived from $\rho(w) - \rho_g(w)$)

if $\rho_g(w) \ge 25\%$ and $\rho_g(w) / \rho_{\neg g}(w) > 1.4$: add $w$ to $W_g$
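The genre-specific filter can be sketched as below. It assumes each film is reduced to the set of words that pass the count cut-off, and that both shares are measured within their own group of films; that reading, and the treatment of words never seen outside the genre, are assumptions. The toy films are invented.

```python
def select_genre_words(in_genre, out_genre, min_share=0.25, min_ratio=1.4):
    """Keep words that are common inside the genre and rarer outside it.

    in_genre / out_genre: lists of sets, one set of qualifying words per film.
    """
    vocabulary = set().union(*in_genre)
    selected = set()
    for word in vocabulary:
        share_in = sum(word in s for s in in_genre) / len(in_genre)
        share_out = (sum(word in s for s in out_genre) / len(out_genre)
                     if out_genre else 0.0)
        # assumption: a word never seen outside the genre passes the ratio test
        if share_in >= min_share and (share_out == 0 or share_in / share_out > min_ratio):
            selected.add(word)
    return selected

# Hypothetical Horror films versus all other films
horror = [{"ghost", "house"}, {"ghost"}, {"house", "car"}, {"car"}]
others = [{"car"}, {"car", "house"}, {"car"}, set()]
W_g = select_genre_words(horror, others)
print(sorted(W_g))
```

In this toy corpus "car" is frequent inside the genre but even more frequent outside it, so it is rejected even though it clears the 25% bar.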


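Classification Models (genre-specific method)

2. Building the GS word graph from $W_g$

For each pair $w, v \in W_g$, calculate new word pair relationship value:

$$\sum_{i \in g} \log(\#(w \mid d_i) + 1) \, \log(\#(v \mid d_i) + 1)$$

Normalize them:

$$\delta_g(w, v) = \frac{\sum_{i \in g} \log(\#(w \mid d_i) + 1) \, \log(\#(v \mid d_i) + 1)}{\sqrt{\sum_{i \in g} \log(\#(w \mid d_i) + 1)^2 \; \sum_{i \in g} \log(\#(v \mid d_i) + 1)^2}}$$

GS contains the set of edges $(w_i, w_j)$ whose normalized value $\delta_g(w_i, w_j)$ reaches a genre-specific threshold.

3. Create cliques and relaxed cliques…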
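The normalized word-pair value looks like a cosine computed on log-damped counts. The sketch below follows that reading; the square root in the denominator is an assumption made so that the value is bounded by 1, and the per-film counts are invented.

```python
import math

def log_count_similarity(counts_w, counts_v):
    """Cosine of the vectors log(count + 1), taken over the films of one genre."""
    log_w = [math.log(c + 1) for c in counts_w]
    log_v = [math.log(c + 1) for c in counts_v]
    numerator = sum(a * b for a, b in zip(log_w, log_v))
    denominator = math.sqrt(sum(a * a for a in log_w) * sum(b * b for b in log_v))
    return numerator / denominator if denominator else 0.0

# Hypothetical counts of three words in four films of one genre
ghost = [5, 0, 3, 0]
haunted = [2, 0, 4, 1]
beach = [0, 6, 0, 0]

print(round(log_count_similarity(ghost, haunted), 3))
print(round(log_count_similarity(ghost, beach), 3))
```

Taking the logarithm of the counts keeps one film that repeats a word many times from dominating the similarity.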


Classification Models (genre-specific method)

Results using GS factors:

Performance also equals LIWC performance!

Strength metric reduces dimensionality: 55% of factors used

Comparing GWG and GS factors:

| | GWG Factors | GS Factors |
|---|---|---|
| Unique words | 539 | 420 |
| Total usage | 3.57% | 3.98% |
| Clusters | 220 | 200 |
| Average cluster size | 7.1 | 10.1 |
| Largest | 37 | 33 |

Conclusion: let’s try mixing them…


Classification Models (mixed model)

| | GWG Factors | GS Factors | Mixed Factors |
|---|---|---|---|
| Unique words | 539 | 420 | 752 |
| Total usage | 3.6% | 4.0% | 5.9% |
| Factors | 220 | 200 | 375 |
| Factors used | 38% | 55% | 49% |

Mixed model performance is significantly better than all previous methods (p<0.003).

Compared to the LIWC model, error is reduced by:

8.1% for precision

6.6% for recall

7.4% for F-score

[Figure: bar chart of performance level (vertical axis 45% to 75%) for Precision, Recall and F-score; series: "LIWC selected factors", "Auto Factors 1, selected factors", "Auto Factors 2, selected factors", "Mixed 1 & 2, select factors". Data labels in extraction order: 48.6%, 67.5%, 53.2%; 49.7%, 68.4%, 54.3%; 49.2%, 68.1%, 53.6%; 52.3%, 70.2%, 56.4%.]


Classification Models (mixed model continued)

On the slide each genre label is color-coded as CORRECT/CAUGHT, MISSED or EXCESS.

| Title | Genres | Precision | Recall |
|---|---|---|---|
| "Sorority Boys" | Comedy, Romance | 50% | 100% |
| "The Cooler" | Comedy, Drama, Romance, Crime | 50% | 33% |
| "The Art of War" | Action, Thriller, Drama, Sci-Fi | 50% | 100% |
| "Metallic Blues" | Drama, Comedy, Thriller | 50% | 50% |
| "Gladiator" | Action, Adventure, Drama, Fantasy, War | 50% | 67% |
| "Meet Joe Black" | Romance, Fantasy, Mystery, Comedy, Drama | 33% | 33% |
| "National Security" | Action, Crime, Thriller, Comedy | 100% | 75% |
| "Requiem for a Dream" | Drama, Crime, Action, Horror, Thriller | 25% | 50% |
| "Resident Evil: Apocalypse" | Action, Horror, Sci-Fi, Crime, Drama, Mystery, Thriller | 33% | 67% |
| "Jeepers Creepers" | Drama, Horror, Thriller, Adventure, Comedy, Crime, Mystery | 50% | 75% |
| "Constantine" | Action, Horror, Thriller, Drama, Fantasy, Mystery | 75% | 60% |
| "Ocean's Eleven" | Crime, Thriller, Comedy, Action, Drama | 50% | 67% |
| "Chitty Chitty Bang Bang" | Fantasy, Comedy, Family, Adventure, Drama, Romance | 40% | 67% |
| "Clockstoppers" | Adventure, Sci-Fi, Thriller, Comedy | 0% | 0% |
| "Walk the Line" | Romance, Drama, Comedy | 67% | 100% |
| "Scooby-Doo" | Adventure, Comedy, Family, Fantasy, Mystery, Animation, Romance | 67% | 80% |
| "Daddy Day Care" | Comedy, Family, Action | 50% | 50% |
| "Batman Begins" | Action, Thriller, Crime, Drama | 67% | 67% |
| "Close Encounters of the Third Kin | Adventure, Sci-Fi, Drama, Action, Horror, Thriller | 40% | 67% |
| "Bad Company" | Comedy, Thriller, Action, Adventure | 100% | 50% |
| "Waking Ned" | Comedy, Drama | 50% | 100% |
| "I, Robot" | Action, Mystery, Sci-Fi, Thriller, Adventure, Crime, Drama, Horror | 50% | 100% |
| "Bad Boys II" | Action, Crime, Thriller, Comedy, Drama | 75% | 75% |
| "The Exorcist" | Horror, Thriller, Action, Drama, Mystery | 40% | 100% |
| "The Talented Mr. Ripley" | Drama, Mystery, Romance, Thriller, Crime, Comedy | 75% | 60% |


Classification Models (mixed model continued)

Unmasking Scooby-doo…

| Genre | Score | Thresh | Tag | Correct |
|---|---|---|---|---|
| Animation | 0.784 | 0.783 | X | |
| Fantasy | 0.730 | 0.534 | X | C |
| Family | 0.678 | 0.633 | X | C |
| Adventure | 0.557 | 0.276 | X | C |
| Sci-Fi | 0.425 | 0.474 | | C |
| Romance | 0.395 | 0.380 | X | |
| Horror | 0.342 | 0.456 | | C |
| Comedy | 0.193 | 0.093 | X | C |
| Mystery | 0.079 | 0.638 | | |
| Action | -0.056 | -0.039 | | C |
| War | -0.312 | 0.674 | | C |
| Thriller | -0.317 | -0.015 | | C |
| Crime | -0.332 | 0.162 | | C |
| Drama | -0.358 | -0.085 | | C |

The Mystery score is far below the Mystery threshold.

For this film, CA = 78.6%.

Over all films, Classification Accuracy is 79.4%.
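The Scooby-Doo figures can be checked directly from the Score and Thresh columns. In this sketch the tagging rule "score > threshold" and the true-genre flags are inferred from the Tag and Correct columns of the table, so they are assumptions rather than something the slide states.

```python
# (genre, score, threshold, is_true_genre); the last field is inferred from
# the Tag and Correct columns: a genre is true if it was tagged and correct,
# or untagged and incorrect
rows = [
    ("Animation", 0.784, 0.783, False),
    ("Fantasy", 0.730, 0.534, True),
    ("Family", 0.678, 0.633, True),
    ("Adventure", 0.557, 0.276, True),
    ("Sci-Fi", 0.425, 0.474, False),
    ("Romance", 0.395, 0.380, False),
    ("Horror", 0.342, 0.456, False),
    ("Comedy", 0.193, 0.093, True),
    ("Mystery", 0.079, 0.638, True),
    ("Action", -0.056, -0.039, False),
    ("War", -0.312, 0.674, False),
    ("Thriller", -0.317, -0.015, False),
    ("Crime", -0.332, 0.162, False),
    ("Drama", -0.358, -0.085, False),
]

tagged = {genre for genre, score, thresh, _ in rows if score > thresh}
truth = {genre for genre, _, _, is_true in rows if is_true}

hits = len(tagged & truth)
precision = hits / len(tagged)
recall = hits / len(truth)
accuracy = sum((genre in tagged) == (genre in truth) for genre, *_ in rows) / len(rows)

print(f"{precision:.0%} {recall:.0%} {accuracy:.1%}")  # prints "67% 80% 78.6%"
```

The precision and recall agree with the Scooby-Doo row of the mixed-model table, and the accuracy over the 14 genres agrees with the CA value on this slide.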


Conclusions: Should automatically generated factors replace LIWC?

You can't begin to imagine the thousands of hours that have gone into the making of these dictionaries. And when I see that we apparently missed "yourself" I wonder how it is possible that it happened. – Prof. James Pennebaker, LIWC, personal communication.

Our methods may be used for…

thematic document classification

personality research

building a better search engine

creating a movie recommendation system


~ Thank you ~