Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis
Damian Trilling & Jeroen Jonkman
[email protected] @damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap, Universiteit van Amsterdam
WAPOR, Buenos Aires, 16–19 June 2015
Overview Problems Sample implementation: INFRA Empirical example Conclusions
Automated Framing analysis
Deductive
• simple: word lists and search strings
• advanced: supervised machine learning
Inductive
• word frequencies and co-occurrences
• visualizations
• principal component analysis
• cluster analysis
• latent Dirichlet allocation
• . . .
Inductive analysis is the focus of our study
Packing and Unpacking the Bag of Words Trilling & Jonkman
Methodological issues
What constitutes a frame? And how does this translate to an operationalization?
• Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling)
• Do we expect each element to occur in one and only one frame? (⇒ PCA)
• Do we need to distinguish between actors, actions, . . ., or are all words taken into consideration equally?
• . . .
Practical issues
• no standard software (but: more and more R packages and Python modules)
• reliance on inaccessible, self-written, or proprietary software
• lack of knowledge in the field
• size of the datasets
A catalogue of criteria
A toolkit for automated framing analysis should. . .
1 not depend on commercial software
2 run on all major operating systems
3 be scalable: usable on a laptop, but also on powerful servers to analyze millions of documents
4 be flexible and open: adaptable to one's own needs
5 have a powerful database engine in the background
Design
Sample implementation: INFRA
To meet these criteria, we wrote INFRA in Python, using the NoSQL database MongoDB. The toolkit will be made freely available, both as source code and via a web interface.
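To illustrate why a schema-free document store fits this design, here is a minimal sketch of an import step. It is not INFRA's actual code: the field names, `article_to_document`, `import_articles`, and the in-memory stand-in are all assumptions made for illustration. The only MongoDB-specific assumption is that the collection object offers `insert_many`, as a pymongo collection does.

```python
from datetime import date


def article_to_document(source, published, title, text):
    """Turn one raw article into a schema-free, JSON-like document."""
    return {
        "source": source,
        "date": published.isoformat(),
        "title": title,
        "text": text,        # original text, kept untouched
        "text_clean": None,  # to be filled in later by the cleaning filters
    }


def import_articles(collection, raw_articles):
    """Store all articles in `collection` and return how many were stored.

    `collection` is assumed to behave like a pymongo collection,
    i.e. to offer insert_many(list_of_dicts).
    """
    documents = [article_to_document(**a) for a in raw_articles]
    if documents:
        collection.insert_many(documents)
    return len(documents)


class InMemoryCollection:
    """Stand-in for a MongoDB collection so the sketch runs without a server."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


# Hypothetical example article
raw = [
    {"source": "Newspaper A", "published": date(2013, 5, 2),
     "title": "Example headline", "text": "Example body text."},
]
articles = InMemoryCollection()
stored = import_articles(articles, raw)
```

With pymongo installed and a server running, a real collection such as `pymongo.MongoClient()["infra"]["articles"]` could be passed instead of the stand-in; the database and collection names here are hypothetical.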
Figure: INFRA workflow
Data management phase: Data (e.g., LexisNexis articles) → Import filter → NoSQL database → Cleaning and pre-processing filters → Cleaned NoSQL database
Analysis phase: word frequencies and co-occurrences; log likelihood visualizations; define details for analysis (e.g., important actors); dictionary filter/named entity recognition; latent Dirichlet allocation; principal component analysis; cluster analysis
Central storage
Data management phase handled on the server; analyses can be handled either on the server (SSH) or locally (INFRA)
Figure: External data → MongoDB server ↔ clients (Computer1, Computer2, Computer3, Computer4)
Server: Linux VM with MongoDB server; Clients: Python, INFRA, mongo client
Enjoying the advantages of BOW (bag of words) and overcoming its shortcomings
In the preprocessing phase
• all information is still available
• we can use custom regexp-based rules and filters, e.g.: if a text contains [list of synonyms of A] and [list of synonyms of B], replace [synonyms of A] with C
• extremely useful for unifying actors that are referred to in several ways
In the analysis phase
• work with a much faster dataset that contains only the necessary information
• no need to deal with misspellings and variations any more
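The conditional rule described above ("if a text contains synonyms of A and synonyms of B, replace synonyms of A with C") could look roughly like the following sketch. The function names, the synonym lists, and the code `ing_bank` are hypothetical and serve only as an illustration; they are not taken from INFRA.

```python
import re


def build_pattern(terms, flags=0):
    """Compile a whole-word pattern matching any of the given terms."""
    # Longest terms first, so that "ING Groep" wins over "ING".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags)


def conditional_replace(text, synonyms_a, synonyms_b, replacement):
    """Replace synonyms of A with `replacement`, but only if the text
    also contains at least one synonym of B."""
    pattern_a = build_pattern(synonyms_a)
    pattern_b = build_pattern(synonyms_b, re.IGNORECASE)
    if pattern_a.search(text) and pattern_b.search(text):
        return pattern_a.sub(replacement, text)
    return text


# Hypothetical rule: code "ING" as an actor only in a banking context.
A = ["ING", "ING Groep"]
B = ["bank", "banks"]

recoded = conditional_replace(
    "ING Groep raised its forecast, the bank said.", A, B, "ing_bank")
untouched = conditional_replace(
    "The ING Groep logo is orange.", A, B, "ing_bank")
```

Because the original text stays in the database, such a rule can be revised and re-run at any time without re-importing the data.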
Towards a “best practice” of inductive framing analysis
In the data management phase
• spend much time on re-coding relevant multi-word entities to avoid noise (of course, “Barack” and “Obama” occur together) and recode synonyms (how would you otherwise reliably estimate frequencies?)
⇒ especially important for questions like “how is actor X framed?”
• regular expressions instead of simple word lists!
• make an informed decision on how to harmonize the dataset (stopword removal, stemming (?), POS tagging (?))
And: share these procedures!
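One way to make recoding procedures shareable is to keep the rules as plain data and apply them in order. The sketch below is an assumption about how this might be done, not the toolkit's implementation; the rules and the example sentence are invented. It also shows the effect on frequency estimates.

```python
import re
from collections import Counter

# Hypothetical recoding rules, kept as data so they can be published.
# Order matters: full names are merged before the bare surname is recoded.
RULES = [
    (r"\b(?:President\s+)?Barack\s+Obama\b", "barack_obama"),
    (r"\bObama\b", "barack_obama"),
]


def recode(text, rules=RULES):
    """Apply all (pattern, replacement) rules in the given order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def token_counts(text):
    """Count lower-cased word tokens (underscores stay inside a token)."""
    return Counter(re.findall(r"\w+", text.lower()))


text = "President Barack Obama spoke. Obama said that Barack Obama will return."
before = token_counts(text)         # "barack" and "obama" counted separately
after = token_counts(recode(text))  # one code per mention of the actor
```

Before recoding, the actor is split over two tokens with different counts; afterwards every mention contributes to a single code.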
In the analysis phase
• background knowledge necessary (face validity)
• robustness: do slightly different parameters deliver similar results?
• too small dataset ⇒ sensitivity to atypical events (scandals etc.) ⇒ discovering topic rather than frame
• difference between statistical predictive power and meaningfulness
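The robustness question can be made concrete in several ways. One simple option, assumed here and not prescribed by the slides, is to compare the top words of the topics (or components) from two runs using Jaccard overlap. The word lists below are invented for illustration and are not results.

```python
def jaccard(a, b):
    """Jaccard overlap of two word collections (1.0 = identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match_overlap(run_a, run_b):
    """For each topic in run_a, the highest overlap with any topic in run_b."""
    return [max(jaccard(t, u) for u in run_b) for t in run_a]


# Hypothetical top words from two runs with slightly different parameters
run_a = [["bank", "crisis", "loan", "debt"],
         ["oil", "price", "barrel", "energy"]]
run_b = [["oil", "price", "energy", "gas"],
         ["bank", "loan", "debt", "crisis"]]

overlaps = best_match_overlap(run_a, run_b)
```

Values close to 1 for all topics suggest that the solution is stable; low values indicate that a topic appears in one run only.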
Empirical example: Dutch business news
Steps
Preprocessing steps
1 Ingest and parse all possibly relevant articles (≈ 500 000)
2 Compose list of ≈ 1 500 regular expressions to substitute synonyms and combinations to correctly code actors, allowing for conditional substitutions
3 Remove stopwords, punctuation, etc.
4 Determine part-of-speech, keep only nouns, adjectives, adverbs
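Step 3 can be sketched with the standard library alone. The stopword list below is a tiny illustrative sample, not the list used in the study, and part-of-speech filtering (step 4) is left out because it requires an external tagger.

```python
import re

# Tiny illustrative Dutch stopword list; a real list would be much longer.
STOPWORDS = {"de", "het", "een", "en", "van", "in", "op", "dat", "met"}


def clean(text, stopwords=STOPWORDS):
    """Lower-case the text, drop punctuation and numbers, remove stopwords."""
    # [^\W\d]+ matches runs of letters and underscores, so recoded
    # actor codes such as "ing_bank" survive as a single token.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if t not in stopwords]


tokens = clean("De winst van Philips steeg in 2012 met 5%.")
```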
Analysis steps
1 Determine relevant actors with frequency counts, filtering out all non-Dutch words (alternative: named entity recognition)
2 Conduct PCA, cluster analysis, and LDA; additionally, count frequency of actor mentions
3 Fine-tune, repeat, and choose the final model
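Counting actor mentions, as in steps 1 and 2, might be done per document so that one long article does not dominate. This is a sketch under that assumption; the function name, the actor codes, and the toy documents are hypothetical.

```python
from collections import Counter


def actor_document_frequency(documents, actors):
    """Count in how many documents each actor code occurs at least once."""
    actors = set(actors)
    counts = Counter()
    for tokens in documents:
        counts.update(actors & set(tokens))
    return counts


# Hypothetical cleaned documents (lists of tokens)
docs = [
    ["philips", "winst", "shell"],
    ["shell", "olie", "shell"],
    ["ing_bank", "lening"],
]
freq = actor_document_frequency(
    docs, ["philips", "shell", "ing_bank", "unilever"])
```

Actors whose document frequency falls below a chosen threshold can then be dropped before running PCA, cluster analysis, or LDA.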
Output
Example: Attention over time
Overview of news attention: attention to 100 firms in company news and entropy (red line) from 2007 to 2013.
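The slide does not state how entropy was computed. A common choice, assumed here, is the Shannon entropy of the distribution of mentions over firms within a time period: it is low when attention concentrates on a few firms and high when attention is spread evenly. The counts below are invented.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical mention counts for four firms in two periods
concentrated = shannon_entropy([97, 1, 1, 1])  # one firm dominates the news
even = shannon_entropy([25, 25, 25, 25])       # attention is spread evenly
```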
Example: Topics
Results of a topic model
Example: Components
Results of a principal component analysis
Example: co-occurrences
Results of a network visualization of co-occurrences
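A co-occurrence network needs a weighted edge list as input. The following sketch counts, for each pair of words, the number of documents in which both occur; the function name and the toy documents are assumptions, and the drawing itself is left to a network library.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents):
    """Map each word pair to the number of documents containing both words."""
    edges = Counter()
    for tokens in documents:
        # Sorting makes ("olie", "shell") and ("shell", "olie") the same edge.
        for pair in combinations(sorted(set(tokens)), 2):
            edges[pair] += 1
    return edges


# Hypothetical cleaned documents (lists of tokens)
docs = [["shell", "olie", "prijs"], ["shell", "olie"], ["philips", "prijs"]]
edges = cooccurrence_edges(docs)
```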
Conclusions
• We developed a toolkit that integrates all recent methods used for automated inductive framing analysis
• It is free
• It works with large-scale datasets
• It can be used by a whole group together
Next steps
• Regarding the tool: graphical interface
• Regarding the method: systematic validation study; comparing different approaches and settings
Questions?
[email protected] @damian0604
www.damiantrilling.net