"Mass Surveillance" through Distant Reading
“MASS SURVEILLANCE” THROUGH DISTANT READING
Shalin Hai-Jew • Aesthesia • March 2, 2017 • Marianna Kistler Beach Museum of Art • Kansas State University
OVERVIEW
Distant reading refers to the use of computers to “read” texts by counting words, identifying themes and subthemes (through topic modeling), extracting sentiment, applying psychological analysis to the author(s), and otherwise finding latent or hidden insights. This work draws on research into “mass surveillance” using five text sets: academic, mainstream journalism, microblogging, Wikipedia articles, and leaked government data. The purpose was to capture, in an indirect way, some insights about the collective social discussions occurring around this issue. This presentation uses a variety of data visualizations (article network graphs, word trees, dendrograms, treemaps, cluster diagrams, line graphs, bar charts, pie charts, and others) to show how machines read and the types of summary data they enable (at computational speeds, at machine scale, and in a reproducible way). Also, some computational linguistic analysis tools enable the creation of custom dictionaries for unique types of applied research. The tools used in this presentation include NVivo 11 Plus and LIWC2015.
SOME COMMON TYPES OF “DISTANT READING” AND APPLICATIONS
Linguistic analysis
Topic modeling: theme and subtheme extraction
Sentiment analysis: positive and negative
Text networks: word relationships
Authorship analysis (based on latent features): stylometry “fingerprinting”
Author gender identification
Psychological analysis
Cultural analysis, culturomics
History-based applications
Literary analysis: dialogue analysis
Geographical referencing and patterning
Character analysis
Predictive analytics: classification, trend
STUDIED PHENOMENA IN THE COMPUTATIONAL LINGUISTIC ANALYSIS RESEARCH LITERATURE
Political science, leader speech analysis (for profiling)
State-of-a-field research
Authorship identification
Plagiarism detection
Suicidality
Movie popularity, song popularity
Language studies
Law enforcement
Fraud detection
Threat detection, and others
WHY DISTANT READING?
Textual interpretation at computational speeds
Textual interpretation at computational scale
Reproducible, repeatable
Measures various analytical constructs in quantized ways
Surfacing latent (hidden) ideas and data patterns not seeable otherwise (such as by human “close reading”)
Results comparable against large textual datasets of particular types of text (such as comparing a Tweetstream against other social media texts or even microblogging texts)
Complementary to, and augmenting, human “close reading”
COMMON ANALYTICAL TRAJECTORIES
Curation of text sets (corpora) -> distant reading data summaries -> zoomed-in analysis (of concepts, names, dates, locations, symbols, and numbers, etc.) -> human close reading
General-to-specific trajectory
Baseline text set statistics based on curated text collections and text corpora
Comparisons across text sets
Relative data
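The general-to-specific trajectory above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-sentence corpus, the tiny stopword list, and the function names `top_terms` and `keyword_in_context` are all hypothetical, and this is not how NVivo 11 Plus or LIWC2015 work internally. The first function gives a distant-reading summary (term counts); the second zooms in on one term so a human can do the close reading.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "by"}

def top_terms(corpus, n=3):
    """Distant-reading summary: the n most frequent non-stopword terms."""
    tokens = [
        token
        for doc in corpus
        for token in re.findall(r"[a-z]+", doc.lower())
        if token not in STOPWORDS
    ]
    return Counter(tokens).most_common(n)

def keyword_in_context(corpus, keyword, width=2):
    """Zoomed-in view: each hit with up to `width` words on either side."""
    hits = []
    for doc in corpus:
        words = doc.split()
        for i, word in enumerate(words):
            # Strip punctuation before comparing, but keep it in the output.
            if re.sub(r"\W", "", word).lower() == keyword:
                hits.append(" ".join(words[max(0, i - width): i + width + 1]))
    return hits

# Hypothetical mini-corpus standing in for a curated text set.
corpus = [
    "Mass surveillance raises privacy concerns.",
    "Privacy advocates oppose bulk surveillance.",
    "Courts review surveillance programs.",
]

summary = top_terms(corpus, n=2)            # step 2: data summary
contexts = keyword_in_context(corpus, "privacy")  # step 3: zoomed-in analysis
```

The contexts returned by the second step are what a researcher would then hand over to human close reading.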
WHY “MASS SURVEILLANCE”?
A timely construct
A point of global discussion
A mixed group of competing stakeholders re: the issue
Wide public availability of five (somewhat) disparate text sets:
Academic
Mainstream journalism
Microblogging
Wikipedia articles
Leaked government data
Data table: readability scores for the five text sets
Text set | Gunning Fog Index | Coleman-Liau Index | Flesch-Kincaid Grade Level | ARI (Automated Readability Index) | SMOG Readability Formula | Flesch Reading Ease (/100)
Set 1: Academic article text set (partial) | 13.20 | 11.71 | 10.71 | 9.29 | 12.80 | 43.26
Set 2: Mainstream journalistic text set | 14.28 | 13.88 | 12.12 | 12.40 | 13.75 | 39.25
Set 3: Twitter microblogging hashtag discourse text set | 28.88 | 32.36 | 24.40 | 29.73 | 21.75 | -38.46 (on a 100-point scale)
Set 4: Wikipedia article network text set (partial) | 11.09 | 12.25 | 9.46 | 8.31 | 11.07 | 44.39
Set 5: Leaked U.S. government text set (partial) | 14.65 | 12.45 | 12.29 | 10.89 | 13.97 | 36.44
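To make two of the table's columns concrete, here is a minimal sketch of the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The tools behind the slide's scores are not shown in the slides, so this is not their implementation: the syllable counter is a crude vowel-run heuristic and the sentence splitter only looks for ".", "!", and "?" (both are assumptions). The sketch does suggest one plausible reason, not stated in the slides, for the out-of-range Twitter scores: text with little sentence-ending punctuation is treated as a few very long sentences.

```python
import re

def count_syllables(word):
    # Heuristic (an assumption): one syllable per run of vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for text."""
    # Sentences end at ., ! or ?; words are runs of letters (with inner apostrophes).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }

short_sentence = readability("The cat sat on the mat.")
# The same words with no sentence punctuation count as one 30-word sentence.
unpunctuated = readability("the cat sat on the mat " * 5)
```

With these inputs the unpunctuated run scores as harder to read (lower reading ease, higher grade level) even though the vocabulary is identical.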
[Figure: treemap diagram. Final Full Set of Mass-surveillance Article Network from Wikipedia: Themes and Subthemes Treemap]
[Figure: article network graph. Article-article network from Wikipedia (NodeXL, or “Network Overview, Discovery and Exploration for Excel”)]
[Figure: article histogram with main theme extractions. An Article Histogram of a Leaked Government Document. X-axis: auto-extracted top-level themes from a government document (A: content; B: dissemination; C: front door; D: hidden service; E: information; F: jflftflvjffdissemination; G: node; H: onion; I: r dissemination). Y-axis: number of mentions (0 to 8).]
[Figure: article histogram with main theme extractions. A Theme Histogram from a Government Document. Y-axis: auto-extracted top-level themes (A: event; B: facebook; C: msn; D: notification; E: sources; F: target). X-axis: counts of mentions of top-level themes (0 to 3.5).]
CONTRIBUTIONS TO THE “MASS SURVEILLANCE” TOPIC
Academic writing: legal, philosophical, technological, and practical implications
Mainstream journalistic articles: domestic and foreign government engagement with the issue (executive, legislative, judicial, and others)
Microblogging messages: global surveillance challenges, changing technologies (drones)
Wikipedia (open-source and crowdsourced encyclopedia): summary details, highlighted events, personages, URLs, and timely observations
Government documents: bureaucratese, technical capabilities
ABOUT THE RELATED TEXT SETS…FROM DISTANT READING
Different genres of writing, based on a particular topic, manifest differently on different textual dimensions. Some textual features seem to co-vary, which may be because they are general features of prose writing or because of other factors.
Analysis of different features of the text sets may be helpful in identifying source types that may be most useful for certain types of research or questions.
Social media “netspeak” has not yet been fully captured in the two commercial tools used for this analysis.
Average word counts per unit differed: academic (7,624 – 8,073 words per unit), mainstream journalistic articles (1,460 – 1,488 words per unit), microblogging hashtag discourse (44 – 61 words per user account), Wikipedia articles (6,710 – 7,216 words per article), and leaked government documents (1,711 – 1,800 words). The variation in word counts was due to the use of differing software programs to do the counts and to natural ambiguity in word identification.
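The ambiguity in word identification is easy to reproduce. The sketch below is illustrative only: the two counting rules and the sample sentence are assumptions, and neither rule is claimed to be what NVivo 11 Plus or LIWC2015 actually does. It shows how two reasonable definitions of a “word” give different counts for the same text.

```python
import re

def count_whitespace_tokens(text):
    # Rule 1: a word is anything separated by whitespace.
    return len(text.split())

def count_alphanumeric_runs(text):
    # Rule 2: a word is any unbroken run of letters or digits,
    # so hyphens and apostrophes split words apart.
    return len(re.findall(r"[A-Za-z0-9]+", text))

# Hypothetical sample with a hyphen, a contraction, and a hashtag.
sample = "Rate-limited APIs don't return #MassSurveillance tweets."

by_whitespace = count_whitespace_tokens(sample)
by_runs = count_alphanumeric_runs(sample)
```

Here the first rule counts 6 words and the second counts 8, because “Rate-limited” and “don't” are each split in two under the second rule.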
ABOUT THE RELATED TEXT SETS…FROM DISTANT READING (CONT.)
Computational analysis of the five text sets showed a spike in one of the human drives, “power,” across all sets. Because this applied across all five text sets, it may be that “power” is a driving issue of concern regarding “mass surveillance.”
According to analysis in NVivo 11 Plus, sentiment was most present in the following (in descending order): Wikipedia articles, academic articles, leaked government documents, mainstream journalism, and hashtag discourse. A different order was found using LIWC2015 (in descending order): mainstream journalism, Wikipedia articles, academic articles, leaked government documents, and hashtag discourse.
The only rank position of agreement was having hashtag discourse in last place with the least sentiment, which can partially be explained by the brevity of Tweets and the expression of emotion in emoticons and punctuation marks.
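One way to quantify how far the two tools disagree is a rank correlation over the two orderings listed above. The sketch below computes Spearman's rho by hand; choosing this statistic is an assumption made for illustration and is not part of the original analysis. Only the two orderings come from the slides.

```python
# Sentiment rankings as reported above, most sentiment first.
nvivo_order = [
    "Wikipedia articles",
    "academic articles",
    "leaked government documents",
    "mainstream journalism",
    "hashtag discourse",
]
liwc_order = [
    "mainstream journalism",
    "Wikipedia articles",
    "academic articles",
    "leaked government documents",
    "hashtag discourse",
]

def spearman_rho(order_a, order_b):
    """Spearman rank correlation for two rankings of the same items (no ties)."""
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    n = len(order_a)
    d_squared = sum((rank - rank_b[item]) ** 2 for rank, item in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

rho = spearman_rho(nvivo_order, liwc_order)
# Items that occupy the same rank position in both rankings.
agreements = [a for a, b in zip(nvivo_order, liwc_order) if a == b]
```

For these rankings the squared rank differences sum to 12, so rho works out to 0.4: a weak positive agreement, with hashtag discourse as the only item in the same position in both lists.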
ABOUT THE RELATED TEXT SETS…BASED IN PART ON SELECTED CLOSE READING
All five text sets—academic, mainstream journalistic, microblogging messages, Wikipedia articles, and the government documents—were informed by the source government documents.
The journalistic articles, with a rights narrative of deep intrusions into privacy, seem to have captured the readership’s attention, while academic and government documents were not consumed as broadly.
Journalistic articles ranked high in sociality measures, which may indicate why people see them as connecting with their lives.
Twitter was used to advertise writings from academia and mainstream journalism.
Some academic publications cited mainstream journalistic pieces, but fewer journalistic pieces cited academic works.
ABOUT THE RELATED TEXT SETS…BASED IN PART ON SELECTED CLOSE READING (CONT.)
Academia did not have a lot of pieces on this issue in the subscription databases and other sources that were checked.
It may be that more time has to pass for researchers to study the issues.
The technological complexity of the government documents required technology, legal, and policy experts to interpret them.
These documents were generally handled in a non-consumptive way for computational linguistic analysis. Non-consumptiveness refers to the extraction of statistical features of a text set without direct access to the underlying texts. For this analysis, the focus was on computational reading of the related documents, not a human interpretation of the text set or the related capabilities.
ABOUT USING COMPUTATIONAL LINGUISTIC ANALYSIS TO “READ” UP ON AN ISSUE
Selected text sets should be as comprehensive as possible in order to represent the topic. The text sets should be cleaned, so that irrelevant elements are eliminated. There should be clear documentation about how data was collected, processed, and handled. How the text sets are handled affects the results.
The bundling of particular text sets will affect results as well.
Because social media only attracts some to participate, there can be some large gaps in informational coverage. Social media platform APIs are often rate- and data-limited, so it’s important to review the terms of access to such data.
Using multiple software tools to conduct analysis makes sense because there are differences between tool designs which will affect what is observed or not. The “validity” and “reliability” of software tools vary…
ABOUT USING COMPUTATIONAL LINGUISTIC ANALYSIS TO “READ” UP ON AN ISSUE (CONT.)
How the researcher asks questions and wields the technology will affect what is seeable and seen. There is not an “objective” reading machine… Subjectivity and judgment play a role.
External validation may be an important piece of research using computational reading.
The data visualizations here are mostly interactive, and it is possible to link to original underlying data. All the data visualizations are informed by underlying data, and these should be accessed for deeper understandings.
These interactive features and underlying data should be engaged to fully benefit from the computational analyses. (Data visualizations are not used independent of the underlying data.)
“Non-consumptive” text analysis can sometimes be helpful even without the benefit of close reading and examination of the underlying text corpora used for the computational analysis.
ABOUT USING COMPUTATIONAL LINGUISTIC ANALYSIS TO “READ” UP ON AN ISSUE (CONT.)
Close reading is always a part of the work, even though distant reading is brought to bear. Each enhances the other, and there are many rich processing sequences to read.
What a human reader “sees” differs from what a computer “sees.”
SOME POSSIBLE EFFECTS OF THE RESEARCH
Different genres of texts may reach different parts of a population. Those who limit themselves to particular genres will only capture some aspects of information about a topic.
Those engaged in strategic communications would benefit from gaining a sense of which communications modes to engage in order to reach their target audience.
It helps to know what issues are trending at any particular time…and the collective emotions which are being expressed.
It helps to strategically target limited human close reading attention based on observations from distant reading.
WHY “MASS SURVEILLANCE” AND “DISTANT READING”?
There is an elision of mass surveillance and distant reading in this slideshow, in part because the same technological capabilities enable “mass surveillance” and dataveillance (data + surveillance, in a portmanteau term).
Practically speaking, human close reading would be wholly insufficient to interact with mass data. There are not enough human years to plough through the masses of structured and unstructured data being created today.
For complex data, human close reading requires close and slow attention (200 wpm, or words per minute).
Human close reading is not known for great objective accuracy. Rather, human reading is informed by a trained and subjective lens. Human reading is known for a unique perspective and voice.
WHY “MASS SURVEILLANCE” AND “DISTANT READING”? (CONT.)
Together, “distant” and “close” reading expand human power to read, interpret, and learn. Sometimes, these complementary efforts help solve very human challenges.
Computational distant reading does not “displace” people or what they can bring to research and analysis. Oftentimes, the findings from each diverge, resulting in different insights attained in different ways.
ABOUT NVIVO 11 PLUS
Enables the building of unstructured, semi-structured, and structured data (using SQL as the understructure on Windows)
Enables analysis of any data represented in UTF-8 (an encoding of the Unicode character set) but requires a main base language
Enables exact matches, stemmed words, synonyms, specializations, and generalizations
Enables the application of special characters and Boolean terms
Enables the building of an exportable code dictionary
Enables topic modeling, sentiment analysis, and “coding by existing pattern”
Enables “distant reading” and interactive data visualizations including word trees, dendrograms, treemaps, cluster diagrams, and others
ABOUT LIWC2015
Has a built-in linguistic analysis dictionary which has been built up over decades of refinement and empirical research
Summarizes datasets on four scores: Analytic, Clout, Authentic, and Tone
Includes psychological and socio-psychological elements
Includes sentiment and emotional analysis features
Includes gender reference counts
Includes human drives counts
Includes generic linguistic analysis counts (including for function words)
ABOUT LIWC2015 (CONT.)
Is back-stopped by decades of solid research
Is a thoroughly and smartly documented tool
Is set up as a processor and a dictionary
Enables the building of custom dictionaries to run against textual datasets to surface more unique insights
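The processor-plus-dictionary design can be sketched as follows. This is a hedged illustration, not LIWC2015's actual code or file format: the two categories and their entries are hypothetical, invented for this example, and the only conventions borrowed from LIWC are a trailing `*` as a stem wildcard and reporting each category as a percentage of total words.

```python
import re
from collections import Counter

# Hypothetical custom dictionary: category -> entries ("*" = any word ending).
CUSTOM_DICTIONARY = {
    "surveillance": ["surveil*", "monitor*", "wiretap*"],
    "privacy": ["privacy", "private", "encrypt*"],
}

def compile_entry(entry):
    # A trailing "*" matches any continuation of the stem; otherwise exact match.
    if entry.endswith("*"):
        return re.compile(re.escape(entry[:-1]) + r"\w*")
    return re.compile(re.escape(entry))

def category_percentages(text, dictionary):
    """Percentage of all tokens in text that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    patterns = {
        category: [compile_entry(entry) for entry in entries]
        for category, entries in dictionary.items()
    }
    counts = Counter()
    for token in tokens:
        for category, compiled in patterns.items():
            if any(pattern.fullmatch(token) for pattern in compiled):
                counts[category] += 1
    if not tokens:
        return {category: 0.0 for category in dictionary}
    return {category: 100 * counts[category] / len(tokens) for category in dictionary}

# Hypothetical input sentence (7 tokens).
result = category_percentages(
    "Agencies monitor encrypted traffic and surveillance grows.",
    CUSTOM_DICTIONARY,
)
```

In this sentence two of the seven tokens (“monitor”, “surveillance”) land in the surveillance category and one (“encrypted”) in the privacy category. Swapping in a different dictionary changes what the same processor surfaces.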
ABOUT LIWC2015 (CONT.)
Requires some in-depth reading of the related documentation: “The Development and Psychometric Properties of LIWC2015” and “Linguistic Inquiry and Word Count: LIWC2015”
Requires reading of years of research for the smoothest research applications
Requires experience in Excel, since results are exported to .xls or .xlsx files
There is no proprietary file format for saving an analysis in LIWC2015
CONTACT AND CONCLUSION
Dr. Shalin Hai-Jew
Instructional Designer
Kansas State University
785-532-5262
“Distant reading” is a term coined by Franco Moretti (co-founder of the Stanford Literary Lab) in his 2000 essay “Conjectures on World Literature.”
This slideshow is based on a research-based chapter forthcoming in 2017.