CAPTURING HUMAN BEHAVIOR AND LANGUAGE FOR
INTERACTIVE SYSTEMS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Ethan Fast
August 2018
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/bk979gs1829
© 2018 by Ethan Fast. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Michael Bernstein, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Maneesh Agrawala
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Eric Horvitz
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
From smart homes that prepare coffee when we wake, to phones that know not to interrupt us
during important conversations, our collective visions of human-computer interaction (HCI) imagine a
future in which computers understand a broad range of human behaviors. Today our systems fall
short of these visions, however, because this range of behaviors is too large for designers or
programmers to capture manually. In this thesis I will present three systems that mine and operationalize
an understanding of human life from large text corpora. The first system, Augur, focuses on what
people do in daily life: capturing many thousands of relationships between human activities (e.g.,
taking a phone call, using a computer, going to a meeting) and the scene context that surrounds
them. The second system, Empath, focuses on what people say: capturing hundreds of linguistic
signals through a set of pre-generated lexicons, and allowing computational social scientists to create
new lexicons on demand. The final system, Codex, explores how similar models can empower an
understanding of emergent programming practice across millions of lines of open source code.
Between these projects, I will demonstrate how semi-supervised and unsupervised learning can enable
many new applications and analyses for interactive systems.
Acknowledgments
This thesis is dedicated to the many people who made it possible:
• Binbin Chen, who in addition to everything else can always be counted on to help brainstorm
and refine ideas;
• Jon Bassen, whose influence is similarly present in the work;
• my parents Kevin and Kathy Fast, supporting me in everything I do;
• my advisor Michael Bernstein, who first introduced me to HCI and has shaped my thinking
in profound ways;
• the many mentors I have had over the course of my PhD, including Eric Horvitz, Alex Aiken,
Joel Brandt, and Maneesh Agrawala;
• the undergraduate and graduate students I have collaborated with, including Will McGrath,
Pranav Rajpurkar, Julia Mendelsohn, Daniel Steffee, Lucy Wang, and Colleen Lee;
• my many friends and colleagues in the Stanford HCI group;
• my undergraduate mentor and advisor Westley Weimer, who taught me how to do research.
I was supported by an NSF Graduate Fellowship and a Brown Institute Grant for Media Innovation
over my time at Stanford. Special thanks to these groups for funding my work.
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Human Life and Behavior
1.2 Human Language
1.3 Code Patterns
1.4 Thesis Overview
2 Related Work
2.1 Modeling Human Behavior
2.1.1 Mining community data
2.1.2 Ubiquitous computing interfaces
2.1.3 Knowledge representation
2.2 Modeling Human Language
2.2.1 Extracting signal from text
2.2.2 Text mining and modeling
2.3 Modeling Code Patterns
2.3.1 Mining software repositories
2.3.2 Bugfinding
2.3.3 Learning from code examples
2.3.4 Data-driven interfaces
3 Modeling Human Behavior
3.1 Augur
3.1.1 Human Activities
3.1.2 Object Affordances
3.1.3 Connections between activities
3.1.4 A data mining DSL for natural language
3.1.5 Mining activity patterns from text
3.1.6 Vector space model for retrieval
3.2 Augur API and Applications
3.2.1 Identifying Activities
3.2.2 Expanding Activities with Object Affordances
3.2.3 Predicting Future Activities
3.2.4 Applications
3.3 Evaluation
3.3.1 Bias of Fiction
3.3.2 Field test of A Soundtrack for Life
3.3.3 A stress test over #dailylife
3.4 Discussion
4 Modeling Signals in Human Language
4.1 Empath
4.1.1 Designing Empath’s categories
4.1.2 Refining categories with crowd validation
4.1.3 Empath API and web service
4.2 Empath Applications
4.2.1 Example 1: Understanding deception in hotel reviews
4.2.2 Example 2: Mood on Twitter and time of day
4.3 Evaluation
4.3.1 Comparing Empath and LIWC
4.4 Discussion
4.4.1 The role of human validation
4.4.2 Data-driven: who is actually driving?
4.4.3 Limitations
4.4.4 Statistical false positives
5 Modeling Patterns in Code
5.1 Codex
5.1.1 Indexing and Abstraction
5.1.2 Statistical Analysis Module
5.1.3 Pattern Finding Module
5.2 Codex Applications
5.2.1 Statistical Linting
5.2.2 Pattern Annotation
5.2.3 Library Generation
5.3 Evaluation
5.3.1 The Codex Database
5.3.2 Pattern Annotation
5.3.3 Statistical Linting
5.4 Discussion and Limitations
6 Discussion
6.1 Data Mining in HCI
6.2 Biases in Data-Driven Interfaces
6.3 Data Power vs. Modeling Power
7 Conclusion
Bibliography
List of Tables
3.1 We find average rates of 96% recall and 71% precision over common activities in the
dataset. Here Ground Truth Frames refers to the total number of frames labeled with
each activity.
3.2 As rated by external experts, the majority of Augur’s predictions are high-quality.
4.1 Empath can analyze text across hundreds of data-driven categories. Here we provide
a sample of representative terms in 8 sample categories.
4.2 Crowd workers found 95% of the words generated by Empath’s unsupervised model
to be related to its categories. However, machine learning is not perfect, and some
unrelated terms slipped through (“Did not pass” above), which the crowd then removed.
4.3 We compared the classifications of LIWC, EmoLex and Empath across thirteen
categories, finding strong correlation between tools. The first column represents
comparisons between Empath’s unsupervised model against LIWC, the second after crowd
filtering against LIWC, the third between EmoLex and LIWC, and the fourth between
the General Inquirer and LIWC.
5.1 Codex identifies common programming snippets automatically, then feeds them to
crowdsourced expert programmers for metadata such as the bolded title and
descriptive text.
5.2 A sample of functions from CodexLib, detected in emergent programming practice
and encapsulated into a new standard library.
5.3 The percent of snippets that are unique after normalization for common AST node
types.
5.4 Programmers from an expert crowdsourcing market annotated Codex’s idioms with
their usage type. The vast majority concern the use of standard, built-in libraries.
List of Figures
1.1 Augur mines human activities from a large dataset of modern fiction. Its statistical
associations give applications an understanding of when each activity might be
appropriate.
1.2 Empath learns word embeddings from 1.8 billion words of fiction, makes a vector
space from these embeddings that measures the similarity between words, uses seed
terms to define and discover new words for each of its categories, and finally filters
its categories using crowds.
1.3 Codex draws on millions of lines of open source code to create software engineering
interfaces that integrate emergent programming practice. Here, Codex’s pattern
annotation calls out popular idioms that appear in the user’s code.
3.1 Augur’s activity detection API translates a photo into a set of likely relevant activities.
For example, the user’s camera might automatically photojournal the food whenever
the user may be eating food. Here, Clarifai produced the object labels.
3.2 Augur’s APIs map input images through a deep learning object detector, then
initialize the returned objects into a query vector. Augur then compares that vector to
the vectors representing each activity in its database and returns those with lowest
cosine distance.
3.3 Augur’s object affordance API translates a photo into a list of possible affordances.
For example, Augur could help a blind user who is wearing an intelligent camera and
says they want to sit. Here, Clarifai produced the object labels.
3.4 A Soundtrack for Life is a Google Glass application that plays musicians based on
the user’s predicted activity, for example associating working with The Glitch Mob.
3.5 We deployed an Augur-powered wearable camera in a field test over common daily
activities, finding average rates of 96% recall and 71% precision for its classifications.
4.1 Deceptive reviews convey stronger sentiment across both positively and negatively
charged categories. In contrast, truthful reviews show a tendency towards more
mundane activities and physical objects.
4.2 We use Empath to replicate the work of Golder and Macy, investigating how mood
on Twitter relates to time of day. The signals reported by Empath and LIWC by
hour are strongly correlated for positive (r=0.87) and negative (r=0.90) sentiment.
4.3 Empath categories strongly agreed with LIWC, at an average Pearson correlation of
0.90. Here we plot Empath’s best and worst correlations with LIWC. Each dot in
the plot corresponds to one document. Empath’s counts are graphed on the x-axis,
LIWC’s on the y-axis.
5.1 The Codex IDE calls out a snippet of unlikely code by a yellow highlight in its gutter.
Warning text appears in the footer.
5.2 A plot of Codex’s hit rate as it indexes code over four random samples of file orderings.
The y-axis plots the database hit rate, and the x-axis plots the number of lines of
code indexed.
Chapter 1
Introduction
People don’t use systems in isolation. When we write an email, develop code, or interact with a
virtual assistant, we are engaging in activities that many other people have done before us—even
if they have not done so in exactly the same way. Often, the way we interact with these systems
becomes shared knowledge. This knowledge sharing can be explicit, as in sharing a code snippet
on StackOverflow, or it can be implicit, as in replying to an email using business jargon that we’ve
seen a colleague apply to a similar situation. In either case, the systems we use are embedded in a
broader context of how other people use them.
This shared context can allow systems to better understand or anticipate our needs. For example,
if you have been writing code using a cryptographic library that other people don’t use because it
is too slow, this is something you might like to know (and avoid). Or if you have just come back
from a long run on a hot day, a smart home might guess you’d like a glass of cold water. In both
cases, systems can learn from the behavior of others to predict information relevant to you. This
vision is far from new. Mark Weiser described a similar scenario decades ago [82], and other ideas
of intelligent interfaces have been around for even longer.
While these ideas have been successfully applied to a small set of activities known in advance
to a system designer [2], the path to achieving such predictions in more open-ended domains has
remained largely unexplored. Consider the domain of human life, something that a smart home ought
to understand. A useful system must encode knowledge about thousands of potential activities a
user might engage in (e.g., cooking dinner, reading a book, calling a friend), and beyond that, the
relationships between them. There are far too many of these activities and relationships to manually
define in a system. Similar problems exist in other open domains such as writing or programming.
Here again the set of potentially valuable signals and patterns is enormous and does not lend itself
to pre-specification by a system designer.
In this thesis, I explore how systems can leverage semi-supervised and unsupervised learning
techniques to better understand the communities of users that surround them. By applying these
techniques to model knowledge, systems can learn a large set of actionable concepts that do not
need to be defined in advance by a system designer. I present systems that address user needs
across three different domains—programming, ubiquitous computing, and text analysis—based on
similar unsupervised models applied to community data. In each of these domains, textual datasets
record what many people have said or done, allowing us to bootstrap a vocabulary of higher level
abstractions among such behaviors. For example, one of these systems can learn that paying bills
usually happens after ordering food without its designer ever deciding that these activities should
exist as concepts. Another has learned that a Ruby function that ends in a “!” should probably
modify its argument in place without anyone realizing that was an interesting syntactical analysis.
Yet another can tell a user that they are using language evocative of being a hipster.
Figure 1.1: Augur mines human activities from a large dataset of modern fiction. Its statistical associations give applications an understanding of when each activity might be appropriate.
1.1 Human Life and Behavior
Our most compelling visions of human-computer interaction depict worlds in which computers
understand the breadth of human life. Mark Weiser’s first example scenario of ubiquitous computing,
for instance, imagines a smart home that predicts its user may want coffee upon waking up [82].
Apple’s Knowledge Navigator similarly knows not to let the user’s phone ring during a conversation
[5]. In science fiction, technology plays us upbeat music when we are sad, adjusts our daily routines
to match our goals, and alerts us when we leave the house without our wallet. In each of these
visions, computers understand the actions people take, and when.
Many years have passed since these visions were first articulated, and yet interactive systems
still lack a broad understanding of human behavior. Today, interaction designers instead create
special-case rules and single-use machine learning models. The resulting systems can, for example,
teach a phone (or Knowledge Navigator) not to respond to calls during a calendar meeting. But
even the most clever developer cannot encode behaviors and responses for every human activity – we
also ignore calls while eating lunch with friends, doing focused work, or using the restroom, among
many other situations. To achieve this breadth, we need a knowledge base of human activities, the
situations in which they occur, and the causal relationships between them. Even the web and social
media, serving as large datasets of human record, do not offer this information readily.
To solve this problem, we show it is possible to create a broad knowledge base of human behavior
by text mining a large dataset of modern fiction. Fictional human lives provide surprisingly accurate
accounts of real human activities. While we tend to think about stories in terms of the dramatic
and unusual events that shape their plots, stories are also filled with prosaic information about how
we navigate and react to our everyday surroundings. Over many millions of words, these mundane
patterns are far more common than their dramatic counterparts. Characters in modern fiction turn
on the lights after entering rooms; they react to compliments by blushing; they do not answer their
phones when they are in meetings. Our knowledge base, Augur (Figure 1.1), learns these associations
by mining 1.8 billion words of modern fiction from the online writing community Wattpad.
There are far too many human activities to enumerate in advance, much less to train and validate
independent predictive models over. We use an unsupervised vector space model to model
relationships between activities and scene context. We first extract activities through subject-verb-object
sequences determined by a dependency parse, then train a neural network to predict relationships
between these activities and their context over millions of lines of fiction. The weights learned by
the neural network produce a vector space that provides a representation of activities in terms of
other activities and scene context. This vector space encodes many thousands of relationships: for
example, associating activities such as eating with hundreds of food items and relevant tools such as
cutlery, plates and napkins, or associating one activity such as enter store with many others, such
as shop, grab cart, and pay. We go on to demonstrate how these models can be leveraged by a new
class of interactive systems, such as an automatic food diary or a system that warns users about
their bank balance when they are about to spend money, and evaluate the system through a user
study and deployment on Google Glass.
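The retrieval idea behind this vector space can be sketched in a few lines. The example below is only a toy illustration under stated assumptions: the activities, objects, and counts are invented, and plain co-occurrence counts stand in for the weights that Augur learns with a neural network. It shows how a set of observed scene objects could be turned into a query vector and compared against activity vectors by cosine similarity.

```python
import math

# Hypothetical co-occurrence counts between activities and scene objects.
# These numbers are invented for illustration; they are not Augur's data.
ACTIVITY_VECTORS = {
    "eat food": {"plate": 9, "fork": 7, "napkin": 4, "cart": 0, "laptop": 0},
    "shop": {"plate": 0, "fork": 0, "napkin": 0, "cart": 8, "laptop": 0},
    "do work": {"plate": 0, "fork": 0, "napkin": 1, "cart": 0, "laptop": 9},
}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_activities(observed_objects):
    """Rank activities from most to least similar to the observed objects."""
    query = {obj: 1 for obj in observed_objects}
    scores = {act: cosine(query, vec) for act, vec in ACTIVITY_VECTORS.items()}
    return sorted(scores, key=scores.get, reverse=True)


# A scene containing a plate and a fork ranks "eat food" first.
print(rank_activities(["plate", "fork"]))
```

Ranking by similarity rather than exact match is what lets a model of this kind respond sensibly to object combinations it never saw together in training.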
Figure 1.2: Empath learns word embeddings from 1.8 billion words of fiction, makes a vector space from these embeddings that measures the similarity between words, uses seed terms to define and discover new words for each of its categories, and finally filters its categories using crowds.
1.2 Human Language
Just as there is breadth in human life, there is also breadth in human language. Language is rich
in subtle signals. The previous sentence, for example, conveys connotations of wealth (“rich”),
cleverness (“subtle”), communication (“language”, “signals”), and positive sentiment (“rich”). A
growing body of work in human-computer interaction, computational social science and social
computing uses tools to identify these signals: for example, detecting emotional contagion in status
updates or linguistic correlates of deception [49, 66].
High quality lexicons allow us to analyze language at scale and across a broad range of signals. For
example, researchers often use LIWC (Linguistic Inquiry and Word Count) to analyze social media
posts, counting words in lexical categories like sadness, health, and positive emotion [68]. LIWC
offers many advantages: it is fast, easy to interpret, and extensively validated. Researchers can
easily inspect and modify the terms in its categories — word lists that, for example, relate “scream”
and “war” to the emotion anger. But like other popular lexicons, LIWC is small: it has only 40
topical and emotional categories, many of which contain fewer than 100 words. Further, many
potentially useful categories like violence or social media don’t exist in current lexicons, requiring
the creation of new gold standard word lists. Other categories may benefit from updating with modern
terms like “paypal” for money or “selfie” for leisure.
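Lexicon-based analysis of this kind amounts to counting how many tokens in a document fall into each category's word list. The sketch below illustrates that counting step with a tiny invented lexicon. The words “scream”, “war”, “paypal”, and “selfie” come from the examples above; every other word, and the function name `category_counts`, is an assumption made for illustration and does not reflect the contents of LIWC.

```python
import re
from collections import Counter

# Toy lexicon. "scream"/"war" (anger), "paypal" (money), and "selfie"
# (leisure) follow the examples in the text; the remaining words are
# illustrative only and are not taken from LIWC or any published lexicon.
LEXICON = {
    "anger": {"scream", "war", "furious"},
    "money": {"paypal", "cash", "bank"},
    "leisure": {"selfie", "vacation"},
}


def category_counts(text, normalize=False):
    """Count tokens per category; optionally divide by document length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in LEXICON.items():
            if token in words:
                counts[category] += 1
    if normalize and tokens:
        return {cat: counts[cat] / len(tokens) for cat in LEXICON}
    return {cat: counts[cat] for cat in LEXICON}


print(category_counts("I took a selfie, then sent cash over PayPal."))
```

Because each category is just a set of words, a researcher can inspect or edit it directly, which is the interpretability advantage the paragraph above attributes to lexicons.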
To solve these problems, we have created Empath: a tool that allows researchers to generate and
validate new lexical categories on demand, using a combination of machine learning and
crowdsourcing. For example, using the seed terms “twitter” and “facebook,” we can generate and validate a
category for social media. Empath also analyzes text across 200 built-in, pre-validated categories
such as neglect (deprive, refusal), government (embassy, democrat), strength (tough, forceful), and
technology (ipad, android). Empath combines modern NLP techniques with the benefits of
handmade lexicons: its categories are word lists, easily extended and fast. And like LIWC (but unlike
other machine learning models), Empath’s contents have been validated by humans.
Empath is powered by a skip-gram network that captures words in a neural embedding [60]. This
embedding learns associations between words and their context, providing a model of connotation.
We use similarity comparisons in the resulting vector space to map a vocabulary of 59,690 words
onto Empath’s 200 categories (and beyond, onto user-defined categories). We then can filter these
relationships through the crowd to efficiently construct new, human validated dictionaries. We show
how Empath’s model can replicate and extend classic work in classifying deceptive language [66]
and analyzing mood on Twitter [33]. Finally, we further validate Empath by comparing its analyses
against LIWC, a lexicon of gold standard categories that have been psychometrically validated. We
find the correlation between Empath and LIWC across a mixed-corpus dataset is high both with
(r=0.906) and without (r=0.90) the crowd filter. In sum, Empath shares high correlation with gold
standard lexicons, yet it also offers analyses over a dynamic set of categories.
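The seed-term expansion described above can be illustrated with a small sketch. Everything here is hypothetical: the three-dimensional vectors are invented (a real skip-gram embedding has far more dimensions and is learned from a corpus), and averaging the seed vectors into a single centroid is one plausible way to combine seeds, not necessarily the method Empath uses. Crowd filtering of the candidates is not modeled.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration. A real
# skip-gram model would learn much larger vectors from a text corpus.
EMBEDDINGS = {
    "twitter": [0.9, 0.1, 0.0],
    "facebook": [0.8, 0.2, 0.1],
    "tweet": [0.85, 0.15, 0.05],
    "post": [0.7, 0.3, 0.1],
    "fork": [0.0, 0.1, 0.9],
    "embassy": [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_category(seeds, top_n=2):
    """Return the top_n non-seed words closest to the mean seed vector."""
    dim = len(EMBEDDINGS[seeds[0]])
    # Assumption: combine the seeds by averaging their vectors.
    centroid = [sum(EMBEDDINGS[s][i] for s in seeds) / len(seeds)
                for i in range(dim)]
    candidates = [(cosine(centroid, vec), word)
                  for word, vec in EMBEDDINGS.items() if word not in seeds]
    candidates.sort(reverse=True)
    return [word for _, word in candidates[:top_n]]


# Candidate terms for a "social media" category built from two seeds.
print(expand_category(["twitter", "facebook"]))
```

In a full pipeline, the candidate words returned here would be the ones passed to crowd workers for validation before joining the category.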
Figure 1.3: Codex draws on millions of lines of open source code to create software engineering interfaces that integrate emergent programming practice. Here, Codex’s pattern annotation calls out popular idioms that appear in the user’s code.
1.3 Code Patterns
Just as human language provides structure that we can leverage to build representations of activities
or encode relationships between words and concepts, program code gives us even more precise
structure that we can exploit to enable new kinds of programming tools and analyses.
In software development, the way people adapt to a system can be just as informative as its
original design. User practice and designer intention differ across several levels of abstraction:
programmers use library APIs in undocumented and unexpected ways [56], language idioms evolve
over time [81], and programmers repurpose source code for new tasks [8, 20]. Norms emerge for
programming systems that aren’t codified in documentation or on the web. What is the best library
to use for a task? Does this code follow common practice? How is a language being used today?
We can examine the ecosystem of open source software to find answers to these practice-driven
questions. The informal rules and conventions of programming languages and libraries are implicitly
present in open source projects, which, when analyzed, often illuminate the ways people code that are
too complex or uncommon to appear in official forms of documentation. We can then operationalize
this knowledge to support everyday programming practice.
To achieve this end, we present Codex: a knowledge base that models practice-driven knowledge
for the Ruby programming language. Codex provides a living, queryable database of how
programmers write code, informed by popular open source Ruby projects. The system normalizes program
abstract syntax trees (ASTs) to collapse similar idioms and identifiers, filters these idioms and
annotates them using paid crowd experts, and then allows applications to query its database in support
of new data-driven programming interfaces.
In the domain of programming, emergent practice develops both at the high level of idioms, for
example a code snippet that initializes a nested hash, and at the low level of syntactic combinations
of code, for example blocks that return the result of an addition operation. Codex seeks to capture
both higher-level patterns of reusable program components and lower-level combinations and chains
of more basic programming units. Through the pattern finding module, Codex identifies commonly
reused Ruby idioms. This module uses typicality analysis to identify idioms such as
Hash.new { |h,k| h[k] = {} }, the most accepted way to initialize a nested hash table. Expert crowds
then attach metadata to these idioms, such as a title, description, and measure of recommended usefulness.
Alternatively, using the statistical analysis module, Codex can compute the frequencies of AST
node combinations, describing the uniqueness of syntactic patterns.
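To make the normalize-then-count idea concrete, here is a minimal sketch. It is not Codex's implementation: Codex operates on Ruby ASTs, whereas this example uses Python's standard `ast` module as a stand-in, and the three-snippet corpus is invented. The sketch replaces identifiers and literal values with placeholders so that structurally identical statements collapse to the same normalized form, then counts how often each form occurs.

```python
import ast
from collections import Counter


class Normalize(ast.NodeTransformer):
    """Replace names and literal values with fixed placeholders."""

    def visit_Name(self, node):
        # Every identifier (including function names) becomes "VAR".
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        # Every literal value becomes the placeholder "CONST".
        return ast.copy_location(ast.Constant(value="CONST"), node)


def normalized_statements(source):
    """Return one normalized AST dump per top-level statement."""
    tree = Normalize().visit(ast.parse(source))
    return [ast.dump(stmt) for stmt in tree.body]


# Invented corpus: the first two snippets differ only in names and values.
CORPUS = [
    "total = total + 1",
    "count = count + 5",
    "names = sorted(names)",
]

counts = Counter()
for snippet in CORPUS:
    counts.update(normalized_statements(snippet))

pattern, frequency = counts.most_common(1)[0]
print(frequency, "snippets share the most common normalized form")
```

How aggressively to normalize is a design choice: collapsing every identifier, as this toy does, merges more snippets but can also erase distinctions (such as which library function is called) that a real system would likely want to keep.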
We present three applications that demonstrate how Codex supports programming practice and
software engineering interfaces. First, pattern annotation automatically annotates Ruby idioms
inside the IDE and presents these annotated snippets through a search interface. Second, statistical
linting identifies problematic syntax by checking code features (e.g., the kinds of AST nodes used
as function signatures or return values) against a large database of trusted and idiomatic snippets;
more generally, these statistics give programmers a tool to quantify the uniqueness of their code.
Finally, library generation pulls particularly common Ruby idioms into a new standard library —
authored not by individual developers but by emergent software practice — helping programmers
avoid the redefinition of common program components.
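The statistical linting idea can also be sketched briefly. The example below assumes a hypothetical database of feature counts, where each feature pairs a method name with the AST node type of its first argument; the counts, the feature encoding, and the 1% threshold are all invented for illustration and are not drawn from Codex. A feature is flagged when it accounts for an unusually small share of the observed uses of its method.

```python
from collections import Counter

# Hypothetical feature counts standing in for a database mined from a
# corpus. Each key pairs a method name with the AST node type of its first
# argument. All numbers are invented for illustration.
DATABASE = Counter({
    ("puts", "str"): 950,
    ("puts", "call"): 48,
    ("puts", "block"): 2,
})


def unlikely(feature, threshold=0.01):
    """Flag a feature that is rare relative to other uses of its method."""
    method = feature[0]
    total = sum(n for (m, _), n in DATABASE.items() if m == method)
    if total == 0:
        return False  # the method was never observed, so there is no evidence
    return DATABASE[feature] / total < threshold


print(unlikely(("puts", "block")))  # rare usage, flagged
print(unlikely(("puts", "str")))    # common usage, not flagged
```

Returning no warning for methods the database has never seen keeps a linter of this kind from penalizing code merely for being outside the corpus.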
Codex enables new software engineering applications that are supported by large-scale
programming behavior rather than sets of special-cased rules. While other projects have crowdsourced
documentation for existing library functions [65, 15], mined code to enable query-based searching
for patterns or examples [56, 78], or embedded example-finding tools into an IDE [12, 14, 35, 69],
Codex augments traditional data mining techniques with crowds, presenting a broad data-driven
window into programming convention. We demonstrate how these kinds of emergent behavior can
inform new design opportunities for user interfaces.
1.4 Thesis Overview
To begin, Chapter 2 introduces a set of challenges faced by systems that seek to take advantage of
unsupervised models trained on unstructured datasets such as text and code. It then situates the
contributions of this thesis within the context of prior systems and models. Following this:
• Chapter 3 introduces Augur: a system that captures the relationships between more than
50,000 human activities and surrounding objects. I show that fiction provides a surprisingly
accurate source of knowledge about the activities of daily life, and how this knowledge can be
captured through unsupervised text mining to enable a new class of applications.
• Chapter 4 extends these ideas to Empath: a tool that analyzes a broad set of signals in text and
allows computational social scientists to generate new lexical categories on demand through
an unsupervised word embedding model.
• Chapter 5 shows how similar ideas apply to code. Codex is a database that models practice-
driven knowledge for programming languages informed by unsupervised models trained on
open source projects. This system enables new software engineering applications that are
supported by large-scale programming behavior rather than sets of known rules.
Finally, the remaining chapters reflect on the contributions of this thesis. Chapter 6 discusses new
questions, challenges, and opportunities raised by this work. Chapter 7 concludes with a vision of
systems that feed data back to the communities they draw from, inspiring a virtuous cycle.
Chapter 2
Related Work
For an interactive system to reason about an open-ended domain such as human life, it needs to
understand both user vocabulary—how a user seeks to interact with it—and also system vocabulary,
the underlying domain language upon which a system operates. This thesis builds on a history of
work that has similarly mapped user vocabulary to the domain language of a system. For example,
query-feature graphs show how user terminology can be connected with commands run by an inter-
active system [28], and other systems such as CommandSpace expand upon this idea to show how
such a mapping can exist when both user language and the set of commands executed by a system
are learned from existing community resources and textual datasets [3, 57].
In prior work, however, the user and system vocabularies are known in advance to the model
under construction, either through explicit tags in the text under analysis or through manual entry
by the researchers constructing the model. It is an open challenge in many domains to instead
learn these high level patterns from lower level components, as the systems I present do through
their analyses. For example, Codex leverages the tree structure of code to mine large, common
subtrees that are used and repeated across many projects, then relates these subtrees to their
surrounding context. Similarly, human activities do not have a natural, high level representation in
text, a challenge Augur overcomes by combining the generality of learned vector space models with
patterns extracted through regularities in English language.
A second general challenge faced by work that attempts to bridge user and system vocabularies
is how the resulting models can then be used by interactive systems. For example, a system may
be able to relate the user word “mask” to the system command “layer mask” in an application like
Photoshop, which is useful for search. But are there applications for these models that extend beyond
information retrieval? This thesis engages with how such models can be embedded in interactive
systems, and what interactions are empowered by this new information. For example, Augur shows
how scene context provided by a Google Glass can feed activity predictions learned from an analysis
of fiction, leading to downstream applications such as an automatic food diary or an application
that warns you when you are spending too much money. Similarly, Codex shows how a linting tool
trained on millions of lines of open source code can warn users when their code deviates from
conventional idioms of a language.
In the following sections, I introduce three independent systems that work along these lines,
leveraging unsupervised and semi-supervised data mining techniques to develop a new class of inter-
active applications. Here I motivate the problems these systems solve and discuss how they extend
existing work in their domains.
2.1 Modeling Human Behavior
The first system I present in this thesis, Augur, is a knowledge base that uses fiction to connect
human activities to objects and their behaviors. This system draws on a large body of related work
in commonsense knowledge representation and ubiquitous computing, as well as prior work in data
mining and unsupervised language modeling.
2.1.1 Mining community data
Augur is inspired by many existing techniques for mining user behavior from data. For example,
query-feature graphs show how to encode the relationships between high-level descriptions of user
goals and underlying features of a system [28], even when these high-level descriptions are different
from an application’s domain language [3]. Researchers have applied these techniques to applications
such as AutoCAD [57] and Photoshop [3], where the user’s description of a domain and that domain’s
underlying mechanics are often disjoint. With Augur, we introduce techniques that mine real-world
human activities that typically occur outside of software.
Other systems have developed powerful domain-specific support by leveraging user traces. For
example, in the programming community, research systems have captured emergent practice in
open source code [27], drawn on community support for debugging computer programs [38], and
modeled how developers backtrack and revise their programs [84]. In mobile computing, the space
of user actions is small enough that it is often possible to predict upcoming actions [53]. In design,
a large dataset of real-world web pages can help guide designers to find appropriate ideas [50].
Creativity-support applications can use such data to suggest backgrounds or alternatives to the
current document [52, 73]. Augur complements these techniques by focusing on unstructured data
such as text and modeling everyday life rather than behavior within the bounds of one program.
2.1.2 Ubiquitous computing interfaces
Ubiquitous computing research and context-aware computing aim to empower interfaces to benefit
from the context in which they are being used [59, 2]. Their visions motivated the creation of
our knowledge base (e.g., [82, 5]). Some applications have aimed to model specific activities or
contexts such as jogging and cycling (e.g., [18]). Augur aims to augment these models with a
broader understanding of human life. For example, what objects might be nearby before someone
starts jogging? What activities do people perform before they decide to go jogging? Doing so could
improve the design and development of many such applications.
2.1.3 Knowledge representation
We draw on work in natural language processing, information extraction, and computer vision to
distill human activities from fiction. Prior work discusses how to extract patterns from text by parsing
sentences [16, 23, 7, 17]. We adapt and extend these approaches in our text mining domain-specific
language, producing an alternative that is more declarative and potentially easier to inspect and
reason about. Other work in NLP and CV has shown how vector space models can extract useful
patterns from text [61], or how other machine learning algorithms can generate accurate image labels
[45] and classify images given a small closed set of human actions [51]. Augur draws on insights
from these approaches to make conditional predictions over thousands of human activities.
Our research also benefits from prior work in commonsense knowledge representation. Existing
databases of linguistic and commonsense knowledge provide networks of facts that computers should
know about the world [54]. Augur captures a set of relations that focus more deeply on human be-
havior and the causal relationships between human activities. We draw on forms of commonsense
knowledge, like the WordNet hierarchy of synonym sets [62], to more precisely extract human ac-
tivities from fiction. Parts of this vocabulary may be mineable from social media, if they are of the
sort that people are likely to advertise on Twitter [46]. We find that fiction offers a broader set of
local activities.
2.2 Modeling Human Language
The second system I present in this thesis, Empath, analyzes text across hundreds of topics and
emotions. Like LIWC and other dictionary-based tools, it counts category terms in a text document.
However, Empath covers a broader set of categories than other tools, and users can generate and
validate new categories with a few seed words. Empath inherits from a rich ecosystem of tools
and applications for text analysis, and draws on the insights of prior work in data mining and
unsupervised language modeling.
2.2.1 Extracting signal from text
Text analysis via dictionary categories has a long history in academic research. LIWC, for example,
is an extensively validated dictionary that offers a total of 62 syntactic (e.g., present tense verbs,
pronouns), topical (e.g., home, work, family) and emotional (e.g., anger, sadness) categories [68].
The General Inquirer (GI) is another human curated dictionary that operates over a broader set
of topics than LIWC (e.g., power, weakness), but fewer emotions [76]. Other tools like EmoLex,
ANEW, and SentiWordNet are designed to analyze larger sets of emotional categories [63, 11, 22].
While Empath’s analyses are similarly driven by dictionary-based word counts, Empath operates
over a more extensive set of categories, and can generate and validate new categories on demand
using unsupervised language modeling.
Work in sentiment analysis has developed powerful techniques to classify text across positive
and negative polarity [75], but has also benefited from simpler, transparent models and rules [44].
Empath draws on the complementary strengths of these ideas, using the power of unsupervised
machine learning to create human-interpretable feature sets for the analysis of text. One of Empath’s
goals is to embed modern NLP techniques in a way that offers the transparency of dictionaries like
LIWC.
2.2.2 Text mining and modeling
A large body of prior work has investigated unsupervised language modeling. For example, re-
searchers have learned sentiment models from the relationships between words [40], classified the
polarity of reviews in an unsupervised fashion [79], discovered patterns of narrative in text [16], and
(more recently) used neural networks to model word meanings in a vector space [60]. We borrow
from the last of these approaches in constructing Empath’s unsupervised model.
Empath also takes inspiration from techniques for mining human patterns from data. Augur
likewise mines text on the web to learn human activities for interactive systems [26]. Augur’s
evaluation indicated that with regard to low-level behaviors such as actions, these data provide a
surprisingly accurate mirror of human behavior. Empath contributes a different perspective, that
text on the web can be an appropriate tool for learning a breadth of topical and emotional categories,
to the benefit of social science. In other research communities, systems have used unsupervised
models to capture emergent practice in open source code [27] or design [50]. In Empath, we adapt
these techniques to mine natural language for its relation to emotional and topical categories.
Finally, Empath also benefits from prior work in commonsense knowledge representation. Ex-
isting databases of linguistic and commonsense knowledge provide networks of facts that computers
should know about the world [54, 62, 22]. We draw on some of this knowledge, like the ConceptNet
hierarchy, when seeding Empath’s categories. Further, Empath itself captures a set of relations
on the topical and emotional connotations of words. Some aspects of these connotations may be
mineable from social media, if they are of the sort that people are likely to advertise on Twitter [46].
2.3 Modeling Code Patterns
The final system I present in this thesis, Codex, analyses millions of lines of open source code
to uncover undocumented norms of practice and convention. Codex builds upon related work in
software repository mining, program analysis, and data-driven interfaces.
2.3.1 Mining software repositories
Codex draws on techniques from software repository mining to extract patterns from a large body of
open source code. Other researchers have mined code for software patterns and redundant code using
code normalization or typicality [56, 8, 20, 41, 65, 15]. However, much of this research emphasizes
the discovery of known design patterns and is oriented towards applications such as refactoring of
duplicate code, while Codex discovers new patterns from the ground up. Further, Codex combines
typicality analysis with expert crowdsourcing to build its database, an approach independent of
any particular code normalization scheme.
Databases can also systematize knowledge about open source code. However, these databases
are usually designed to enable specific forms of code search [83, 78], example-finding [43, 36, 69], or
autocompletion [42], either query based or automatic. While tools designed for specific use cases
may be highly optimized for their tasks, Codex enables a broader set of applications, including
pattern annotation and detecting problematic code through statistical linting.
2.3.2 Bugfinding
One of Codex’s core applications is to help programmers avoid bugs. Much work has focused on
tools for static and dynamic analysis [6, 21]. Other work has focused on helping users debug their
programs through program analysis or crowdsourced aggregation of user activities [38, 4, 34, 48, 65].
Codex does not explicitly try to discover bugs in programs; rather, it notifies users when code
violates convention. This is a subtle but important difference: code may be syntactically correct but
semantically unusual and error-prone.
2.3.3 Learning from code examples
Codex takes inspiration from prior research on code example finding and reuse. Some of these tools
rely on official forms of documentation [12] and others focus on real code from the web [69, 39, 77].
Codex generalizes this work — it covers a broader set of examples than manually curated datasets
and can determine when an example is a one-off and when it represents more general practice. Codex
also enables a more powerful search over examples through AST analysis, benefits from the human-
powered filtering and annotation, and makes possible many applications besides example-finding.
Researchers have also addressed how programmers make use of example code, whether the code
is copy-pasted [47] or foraged from documentation or online examples [14, 13, 35]. By formalizing
embedded software practice, Codex is able to support programmers through a larger space of ex-
amples and lower-level conventions. Many of these idioms and code snippets may not have been
formally discussed on the web.
2.3.4 Data-driven interfaces
Codex draws on insights from data-driven interfaces in non-programming domains. Users can gain
much through querying and exploration. For example, Webzeitgeist allows designers to query a large
corpus of rendered web sites [50]. Crowd data also allows interactive systems to transform a partial
sketch of the user’s intent into a complete state, for example matching a sung melody against a large
database of music to produce an automatic backup band [73]. Algorithms can then identify patterns
in crowd behavior and percolate them up to the interface, for example answering a wide variety of
user queries, demonstrating how a given feature is used in practice [9, 28, 58], or predicting likely
actions from past history [37]. Codex demonstrates that the more structured nature of programming
languages provides a platform for more powerful interactive support such as error finding.
Chapter 3
Modeling Human Behavior
From smart homes that prepare coffee when we wake, to phones that know not to interrupt us
during important conversations, our collective visions of HCI imagine a future in which computers
understand a broad range of human behaviors. Today our systems fall short of these visions, however,
because this range of behaviors is too large for designers or programmers to capture manually. In this
chapter, we instead demonstrate it is possible to mine a broad knowledge base of human behavior
by analyzing more than one billion words of modern fiction. Our resulting knowledge base, Augur,
trains vector models that can predict many thousands of user activities from surrounding objects in
modern contexts: for example, whether a user may be eating food, meeting with a friend, or taking a
selfie. Augur uses these predictions to identify actions that people commonly take on objects in the
world and estimate a user’s future activities given their current situation. We demonstrate Augur-
powered, activity-based systems such as a phone that silences itself when the odds of you answering
it are low, and a dynamic music player that adjusts to your present activity. A field deployment of
an Augur-powered wearable camera resulted in 96% recall and 71% precision on its unsupervised
predictions of common daily activities. A second evaluation where human judges rated the system’s
predictions over a broad set of input images found that 94% were rated sensible.
3.1 Augur
Augur is a knowledge base that uses fiction to connect human activities to objects and their behav-
iors. We begin with an overview of the basic activities, objects, and object affordances in Augur,
then explain our approach to text mining and modeling.
3.1.1 Human Activities
Augur is primarily oriented around human activities, which we learn from verb phrases that have hu-
man subjects, for example “he opens the fridge” or “we turn off the lights.” Through co-occurrence
statistics that relate objects and activities, Augur can map contextual knowledge onto human be-
havior. For example, we can ask Augur for the five activities most related to the object “facebook”
(in modern fiction, characters use social media with surprising frequency):
Activity        Score   Frequency
message         0.71    1456
get message     0.53    4837
chat            0.51    4417
close laptop    0.45    1480
open laptop     0.39    1042
Here score refers to the cosine similarity between a vector-embedded query and activities in the
Augur knowledge base (we’ll soon explain how we arrive at this measure).
Like real people, fictional characters waste plenty of time messaging or chatting on Facebook.
They also engage in activities like post, block, accept, or scroll feed.
Similarly, we can look at relations that connect multiple objects. What activities occur around
a shirt and tie? Augur captures not only the obvious sartorial applications, but notices that shirts
and ties often follow specific other parts of the morning routine such as take shower:
Activity        Score   Frequency
wear            0.05    58685
change          0.04    56936
take shower     0.04    14358
dress           0.03    16701
slip            0.03    59965
In total, Augur relates 54,075 human activities to 13,843 objects and locations. While the head
of the distribution contributes many observed activities (e.g., extremely common activities like ask
or open door), a more significant portion lie in the bulk of the tail. These less common activities,
like reply to text message or take shower, make up much of the average fictional human’s existence.
Further out, as the tail diminishes, we find less frequent but still semantically interesting activities
like throw out flowers or file bankruptcy.
Augur associates each of its activities with many objects, even activities that appear relatively
infrequently. For example, unfold letter occurs only 203 times in our dataset, yet Augur connects
it to 1072 different objects (e.g., handwriting, envelope). A more frequent activity like take picture
occurs 10,249 times, and is connected with 5,250 objects (e.g., camera, instagram). The abundance
of objects in fiction allows us to make inferences for a large number of activities.
3.1.2 Object Affordances
Augur also contains knowledge about object affordances: actions that are strongly associated with
specific objects. To mine object affordances, Augur looks for subject-verb-object sentences with
objects either as their subject or direct object. Understanding these behaviors allows Augur to
reason about how humans might interact with their surroundings. For example, the ten most
related affordances for a car:
Activity            Score   Frequency
honk horn           0.38    243
buckle seat-belt    0.37    203
roll window         0.35    279
start engine        0.34    898
shut car-door       0.33    140
open car-door       0.33    1238
park                0.32    3183
rev engine          0.32    113
turn on radio       0.30    523
drive home          0.26    881
Cars undergo basic interactions like roll window and buckle seat-belt surprisingly often. These
are relatively mundane activities, yet abundant in fiction.
Like the distribution of human activities, the distribution of objects is heavy-tailed. The head of
this distribution contains objects such as phone, bag, book, and window, which all appear more than
one million times. The thick “torso” of the distribution is made of objects such as plate, blanket,
pill, and wine, which appear between 30,000 and 100,000 times. On the fringes of the distribution
are more idiosyncratic objects such as kindle (the e-book reader), heroin, mouthwash, and porno,
which appear between 500 and 1,500 times.
3.1.3 Connections between activities
Augur also contains information about the connections between human activities. To mine for
sequential activities, we can look at extracted activities that co-occur within a small span of words.
Understanding which activities occur around each other allows Augur to make predictions about
what a person might do next.
For example, we can ask Augur what happens after someone orders coffee:
Activity        Score   Frequency
eat             0.48    49347
take order      0.40    1887
take sip        0.39    11367
take bite       0.39    6914
pay             0.36    23405
Even fictional characters, it seems, must pay for their orders.
Likewise, Augur can use the connections between activities to determine which activities are
similar to one another. For example, we can ask for activities similar to the social media photography
trend of take selfie:
Activity          Score   Frequency
snap picture      0.78    1195
post picture      0.76    718
take photo        0.67    1527
upload picture    0.58    121
take picture      0.57    10249
By looking for activities with similar object co-occurrence patterns, we can find near-synonyms.
3.1.4 A data mining DSL for natural language
Creating Augur requires methods that can extract relevant information from large-scale text and
then model it. Exploring the patterns in a large corpus of text is a difficult and time consuming
process. While constructing Augur, we tested many hypotheses about the best way to capture
human activities. For example, we asked: what level of noun phrase complexity is best? Some
complexity is useful. The pattern run to the grocery store is more informative for our purposes than
run to the store. But too much complexity can hurt predictions. If we capture phrases like run
to the closest grocery store, our data stream becomes too sparse. Worse, when iterating on these
hypotheses, even the cleanest parser code tends not to be easily reusable or interpretable.
To help us more quickly and efficiently explore our dataset, we created TC (Text Combinator),
a data mining DSL for natural language. TC allows us to build parsers that capture patterns in a
stream of text data, along with aggregate statistics about these patterns, such as frequency and co-
occurrence counts, or the mutual information (MI) between relations. TC’s scripts can be easier to
understand and reuse than hand-coded parsers, and its execution can be streamed and parallelized
across a large text dataset.
TC programs can model syntactic and semantic patterns to answer questions about a corpus.
For example, suppose we want to figure out what kinds of verbs often affect laptops:
laptop = [DET]? ([ADJ]+)? "laptop"
verb_phrase = [VERB] laptop-
freq(verb_phrase)
Here the laptop parser matches phrases like “a laptop” or “the old broken laptop” and returns
exactly the matched phrase. The verb phrase parser matches phrases like “throw the broken laptop”
and returns just the verb in the phrase (e.g., “throw”). The freq aggregator keeps a count of unique
tokens in the output stream of the verb phrase parser. On a small portion of our corpus, we see as
output:
open 11
close 7
shut 6
restart 4
To clarify the syntax for this example: square brackets (e.g., [NOUN]) define a parser that matches
on a given part of speech, quotes (e.g., "laptop") match on an exact string, whitespace is an implicit
then-combinator (e.g., [NOUN] [NOUN] matches two sequential nouns), a question mark makes a parser
optional (e.g., [DET]? matches an article like “a” or “the”, but also matches on the empty string),
a plus repeats a parser (e.g., [VERB]+ matches on as many verbs as appear consecutively), and a minus
suppresses output (e.g., [NOUN]- matches on a noun but removes it from the returned match).
We wrote the compiler for TC in Python. Behind the scenes, our compiler transforms an input
program into a parser combinator, instantiates the parser as a Python generator, then runs the
generator to lazily parse a stream of text data. Aggregation commands (e.g., freq frequency counting
and MI for MI calculation) are also Python generators, which we compose with a parser at compile
time. Given many input files, TC also supports parallel parsing and aggregation.
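The compilation strategy described above can be illustrated with a minimal sketch. It is not TC's actual implementation: the helper names (pos, lit, seq, opt, plus, drop, scan), the token representation, and the sample tokens are all assumptions made for illustration. It shows how the laptop example might look once expressed as parser combinators, with the parser run as a lazy Python generator and freq as an aggregator over its output stream.

```python
from collections import Counter

# Hypothetical helper names throughout; this is not TC's actual implementation.
# Tokens are (word, part_of_speech) pairs. A parser takes (tokens, i) and
# returns (matched_words, next_i) on success or None on failure.

def pos(tag):                      # [TAG]: match one token by part of speech
    def parse(tokens, i):
        if i < len(tokens) and tokens[i][1] == tag:
            return [tokens[i][0]], i + 1
        return None
    return parse

def lit(word):                     # "word": match one token by exact string
    def parse(tokens, i):
        if i < len(tokens) and tokens[i][0] == word:
            return [word], i + 1
        return None
    return parse

def seq(*parsers):                 # whitespace: implicit then-combinator
    def parse(tokens, i):
        out = []
        for p in parsers:
            result = p(tokens, i)
            if result is None:
                return None
            words, i = result
            out.extend(words)
        return out, i
    return parse

def opt(p):                        # p?: match p or the empty string
    def parse(tokens, i):
        result = p(tokens, i)
        return ([], i) if result is None else result
    return parse

def plus(p):                       # p+: one or more consecutive matches
    def parse(tokens, i):
        result = p(tokens, i)
        if result is None:
            return None
        out, i = list(result[0]), result[1]
        while True:
            result = p(tokens, i)
            if result is None or result[1] == i:  # stop on failure or empty match
                return out, i
            out.extend(result[0])
            i = result[1]
    return parse

def drop(p):                       # p-: match p but remove it from the output
    def parse(tokens, i):
        result = p(tokens, i)
        return None if result is None else ([], result[1])
    return parse

def scan(parser, tokens):
    """Generator that lazily yields each match found in the token stream."""
    i = 0
    while i < len(tokens):
        result = parser(tokens, i)
        if result is None:
            i += 1
        else:
            yield " ".join(result[0])
            i = max(result[1], i + 1)

def freq(matches):
    """Aggregator: count unique matches in a parser's output stream."""
    return Counter(matches)

# laptop = [DET]? ([ADJ]+)? "laptop"
laptop = seq(opt(pos("DET")), opt(plus(pos("ADJ"))), lit("laptop"))
# verb_phrase = [VERB] laptop-
verb_phrase = seq(pos("VERB"), drop(laptop))

# Made-up, already tagged and lemmatized tokens.
tokens = [("she", "PRON"), ("open", "VERB"), ("the", "DET"), ("old", "ADJ"),
          ("laptop", "NOUN"), ("and", "CCONJ"), ("he", "PRON"), ("close", "VERB"),
          ("a", "DET"), ("laptop", "NOUN"), ("then", "ADV"), ("open", "VERB"),
          ("laptop", "NOUN")]

counts = freq(scan(verb_phrase, tokens))
print(counts.most_common())
```

Because scan is a generator, no match is computed until the aggregator consumes it, which is one plausible way to stream a parser over a corpus too large to hold in memory.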
3.1.5 Mining activity patterns from text
To build the Augur knowledge base, we index more than one billion words of fiction writing from
600,000 stories written by more than 500,000 writers on the Wattpad writing community1. Wattpad
is a community where amateur writers can share their stories, oriented mostly towards writers of
genre fiction. Our dataset includes work from 23 of these genres, including romance, science fiction,
and urban fantasy, all of which are set in the modern world.
Before processing these stories, we normalize them using the spaCy part of speech tagger and
lemmatizer2. The tagger labels each word with its appropriate part of speech given the context
of a sentence. Part of speech tagging is important for words that have multiple senses and might
otherwise be ambiguous. For example, “run” is a noun in the phrase, “she wants to go for a run”,
but a verb in the phrase “I run into the arms of my reviewers.” The lemmatizer converts each word
into its singular and present-tense form. For example, the plural noun “soldiers” can be lemmatized
to the singular “soldier” and the past tense verb “ran” to the present “run.”
1 http://wattpad.com
2 https://honnibal.github.io/spaCy/
Activity-Object statistics
Activity-object statistics connect commonly co-occurring objects and human activities. These statis-
tics will help Augur detect activities from a list of objects in a scene. We define activities as verb
phrases where the subject is a human, and objects as compound noun phrases, throwing away
adjectives. To generate these edges, we run the TC script:
human_pronoun = "he" | "she" | "i" | "we" | "they"
np = [DET]? ([ADJ]- [NOUN])+
vp = human_pronoun ([VERB] [ADP])+
MI(freq(co-occur(np, vp, 50)))
For example, backpack co-occurs with pack 2413 times, and radio co-occurs with singing 7987
times. Given the scale of our data, Augur’s statistics produce meaningful results by focusing just
on pronoun-based sentences.
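As a rough illustration of what the co-occur(np, vp, 50) step might compute, the sketch below counts every object and activity pair whose matches fall within 50 token positions of each other. The function name, the (position, text) match format, and the sample positions are assumptions, and the nested loop is deliberately naive rather than a description of how TC performs the count.

```python
from collections import Counter

def co_occur(np_matches, vp_matches, window=50):
    """Count (object, activity) pairs whose token positions are at most
    `window` apart. Each match is a (position, text) pair (assumed format)."""
    counts = Counter()
    for vp_pos, activity in vp_matches:
        for np_pos, obj in np_matches:
            if abs(np_pos - vp_pos) <= window:
                counts[(obj, activity)] += 1
    return counts

# Made-up match positions for illustration only.
np_matches = [(3, "backpack"), (40, "radio"), (120, "radio")]
vp_matches = [(1, "pack"), (110, "sing")]

counts = co_occur(np_matches, vp_matches)
print(counts)
```

Note that the count is symmetric in position: an object is paired with an activity whether it appears before or after it in the text.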
In this TC script, MI processes our final co-occurrence statistics to calculate the
mutual information of our relations, where A and B are the frequencies of two relations, and the
term AB is the frequency of collocation between A and B:
\[ MI(A, B) = \log\left(\frac{AB}{A \cdot B}\right) \]
MI describes how much one term of a co-occurrence tells us about the other. For example, if
people type with every kind of object in equal amounts, then knowing there is a computer in your
room doesn’t mean much about whether you are typing. However, if people type with computers
far more often than anything else, then knowing there is a computer in your room tells us significant
information, statistically, about what you might be doing.
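The typing example can be made concrete with a small worked sketch. It assumes that A, B, and AB are relative frequencies (counts divided by a total number of observations n); that normalization, the function name, and every count below are assumptions for illustration, not values from the Augur corpus.

```python
import math

def mutual_information(count_a, count_b, count_ab, n):
    """log(AB / (A * B)), with A, B, and AB taken as relative frequencies out of n."""
    # (count_ab / n) / ((count_a / n) * (count_b / n)) simplifies to the ratio below.
    return math.log(count_ab * n / (count_a * count_b))

n = 10_000        # made-up number of observations
typing = 1_000    # made-up count for the activity "type"
computer = 500    # made-up count for the object "computer"

# If typing were unrelated to the object, we would expect
# 1000 * 500 / 10000 = 50 co-occurrences, and MI is zero.
mi_independent = mutual_information(typing, computer, 50, n)

# If people type around computers far more often, MI becomes positive.
mi_associated = mutual_information(typing, computer, 400, n)

print(mi_independent, mi_associated)
```

Under this reading, a score near zero means that seeing the object says little about the activity, while a larger positive score means the object is a strong cue for it.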
Object-affordance statistics
The object-affordance statistic connects objects directly to their uses and behaviors, helping Augur
understand how humans can interact with the objects in a scene. We define object affordances as
verb phrases where an object serves as either the subject or direct object of the phrase, and we again
capture physical objects as compound noun phrases. To generate these edges, we run the TC script:
np = [DET]? ([ADJ]- [NOUN])+
vp = ([VERB] [ADP])+
svo = np vp np?
MI(freq(svo))
For example, coffee is spilled 229 times, and facebook is logged into 295 times.
Activity-Activity statistics
Activity-activity statistics count the times that an activity is followed by another activity, helping
Augur make predictions about what is likely to happen next. To generate these statistics, we run
the TC script:
human_pronoun = "he" | "she" | "i" | "we" | "they"
vp = human_pronoun ([VERB] [ADP])+
MI(freq(skip-gram(vp,2,50)))
Activity-activity statistics tend to be more sparse, but Augur can still uncover patterns. For
example, wash hair precedes blow dry hair 64 times, and get text (e.g., receive a text message)
precedes text back 34 times.
In this TC script, skip-gram(vp,2,50) constructs skip-grams of length n = 2 from sequential vp
matches within a window of size 50. Unlike co-occurrence counts, skip-grams are order-dependent,
helping Augur find potential causal relationships.
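A minimal sketch of the order-dependent counting described here, under stated assumptions: matches arrive as (position, text) pairs sorted by position, only the n = 2 case is handled, and the function name and sample data are hypothetical rather than taken from TC.

```python
from collections import Counter

def skip_grams(matches, window=50):
    """Yield ordered (earlier, later) pairs of matches at most `window`
    positions apart. `matches` must be sorted by position (assumed format)."""
    for i, (pos_a, first) in enumerate(matches):
        for pos_b, second in matches[i + 1:]:
            if pos_b - pos_a > window:
                break  # later matches are even farther away
            yield (first, second)

# Made-up activity matches for illustration only.
matches = [(0, "wash hair"), (12, "blow dry hair"),
           (90, "get text"), (95, "text back")]

counts = Counter(skip_grams(matches))
print(counts)
```

Each pair is emitted only in the order it occurs in the text, so (wash hair, blow dry hair) is counted while the reversed pair is not.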
3.1.6 Vector space model for retrieval
Augur’s three statistics are not enough by themselves to make useful predictions. These statistics
represent pairwise relationships and only allow prediction based on a single element of context (e.g.,
activity predictions from a single object), ignoring any information we might learn from similar
co-occurrences with other terms. For many applications it is important to have a more global view
of the data.
To make these global relationships available, we embed Augur’s statistics into a vector space
model (VSM), allowing Augur to enhance its predictions using the signal of multiple terms. Queries
based on multiple terms narrow the scope of possibility in Augur’s predictions, strengthening
predictions common to many query terms, and weakening those that are not.
VSMs encode concepts as vectors, where each dimension of the vector conveys a feature relevant
to the concept. For Augur, these dimensions are defined by MI > 0 with Laplace smoothing (by a
constant value of 10), which in practice reduces bias towards uncommon human activities [80].
Augur has three VSMs. (1) Object-Activity: each vector is a human activity and its dimensions
are smoothed MI between it and every object. (2) Object-Affordance: each vector is an affordance
and its dimensions are smoothed MI between it and every object. (3) Activity-Prediction: each
vector is an activity and its dimensions are smoothed MI between it and every other activity.
Figure 3.1: Augur’s activity detection API translates a photo into a set of likely relevant activities. For example, the user’s camera might automatically photojournal the food whenever the user may be eating food. Here, Clarifai produced the object labels.
Figure 3.2: Augur’s APIs map input images through a deep learning object detector, then initialize the returned objects into a query vector. Augur then compares that vector to the vectors representing each activity in its database and returns those with lowest cosine distance.
To query these VSMs, we construct a new empty vector, set the indices of the terms in the query
equal to 1, then find the closest vectors in the space by measuring cosine similarity.
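The query procedure can be sketched as follows. This is an illustrative toy, not Augur's code: the sparse dictionary representation, the function names, and the MI values in the tiny object-activity VSM are all made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def query(vsm, terms, k=5):
    """Build a query vector with 1 at each query term, then rank activities."""
    q = {term: 1.0 for term in terms}
    scored = [(cosine(q, vec), activity) for activity, vec in vsm.items()]
    return sorted(scored, reverse=True)[:k]

# Toy object-activity VSM; dimensions are objects, values are invented scores.
vsm = {
    "eat food":    {"plate": 2.0, "fork": 1.5, "steak": 1.0},
    "take selfie": {"phone": 2.0, "mirror": 1.0},
    "wash dish":   {"plate": 1.0, "sink": 2.0},
}

results = query(vsm, ["plate", "fork"])
for score, activity in results:
    print(f"{activity}\t{score:.2f}")
```

With two query terms, an activity that shares both dimensions (eat food) outranks one that shares only a single dimension (wash dish), which is the narrowing effect of multi-term queries described above.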
3.2 Augur API and Applications
Applications can draw from Augur’s contents to identify user activities, understand the uses of
objects, and make predictions about what a user might do next. To enable software development
under Augur, we present these three APIs and a proof-of-concept architecture that can augment
existing applications with if-this-then-that human semantics.
We begin by introducing the three APIs individually, then demonstrate additional example
applications. To more robustly evaluate Augur, we have built one of these applications,
A Soundtrack for Life, into Google Glass hardware.
3.2.1 Identifying Activities
What are you currently doing? If Augur can answer this question, applications can potentially help
you with that activity, or determine how to behave given the context around you.
Suppose a designer wants to help people stick to their diets, and she notices that people often
forget to record their meals. So the designer decides to create an automatic meal photographer.
She connects the user’s wearable camera to a scene-level object detection computer vision algorithm
such as R-CNN [32]. While she could program the system to fire a photo whenever the computer
vision algorithm recognizes an object class categorized as food, this would produce a large number
of false positives throughout the day, and would ignore a breadth of other signals such as silverware
and dining tables that might actually indicate eating.
So, the designer connects the computer vision output to Augur (Figure 3.1). Instead of program-
ming a manual set of object classes, the designer instructs Augur to fire a notification whenever the
user engages in the activity eat food. She refers to the activity using natural language, since this is
what Augur has indexed from fiction:
image = /* capture picture from user’s wearable camera */
if(augur.detect(image, "eat food"))
    augur.broadcast("take photo");
The application takes an image at regular intervals. The detect function processes the latest
image in that stream, pings a deep learning computer vision server (http://www.clarifai.com/),
then runs its object results through Augur’s object-activity VSM to return activity predictions. The
broadcast function broadcasts an object affordance request keyed on the activity take photo: in this
case, the wearable camera might respond by taking a photograph.
Now, the user sits down for dinner, and the computer vision algorithm detects a plate, steak and
broccoli (Figure 3.1). A query to Augur returns:
Activity      Score   Frequency
fill plate    0.39    203
put food      0.23    1046
take plate    0.15    1321
eat food      0.14    2449
set plate     0.12    740
cook          0.10    6566
The activity eat food appears as a strong prediction, as does (further down) the more general activity
eat. The objects in the ensemble reinforce one another: when the plate, steak and broccoli are combined
to form a query, eating has 1.4 times higher cosine similarity than for any of the objects individually.
The camera fires, and the meal is saved for later.
3.2.2 Expanding Activities with Object Affordances
How can you interact with your environment? If Augur knows how you can manipulate your sur-
roundings, it can help applications facilitate that interaction.
Figure 3.3: Augur’s object affordance API translates a photo into a list of possible affordances. For example, Augur could help a blind user who is wearing an intelligent camera and says they want to sit. Here, Clarifai produced the object labels.
Object affordances can be useful for creating accessible technology. For example, suppose a blind
user is wearing an intelligent camera and tells the application they want to sit (Figure 3.3). Many
possible objects would let this person sit down, and it would take a lot of designer effort to capture
them all. Instead, using Augur’s object affordance VSM, an application could scan nearby objects
and find something sittable:
image = /* capture picture from user’s wearable camera */
if(augur.affordance(image, "sit"))
    alert("sittable object ahead");
The affordance function will process the objects in the latest image, executing its block when
Augur notices an object with the specified affordance. Now, if the user happens to be within eyeshot
of a bench:
Activity      Score   Frequency
sit           0.13    600814
take seat     0.12    24257
spot          0.11    16132
slump         0.09    8985
plop          0.07    12213
Here the programmer didn’t need to stop and think about all the scenarios or objects where a
user might sit. Instead, they just stated the activity and Augur figured it out.
3.2.3 Predicting Future Activities
What will you do next? If Augur can predict your next activity, applications can react in advance
to better meet your needs in that situation. Activity predictions are particularly useful for helping
users avoid problematic behaviors, like forgetting their keys or spending too much money.
In Apple’s Knowledge Navigator [5], the agent ignores a phone call when it knows that it would
be an inappropriate time to answer. Could Augur support this?
answer = augur.predict("answer call")
ignore = augur.predict("ignore call")
if(ignore > answer)
    augur.broadcast("silence phone");
else
    augur.broadcast("unsilence phone");
The augur.predict function makes new activity predictions based on the user’s activities over the
past several minutes. If the current context suggests that a user is using the restroom, for example,
the prediction API will know that answering a call is an unlikely next action. When provided with
an activity argument, augur.predict returns a cosine similarity value reflecting the likelihood of
that activity happening in the near future. The activity ignore call has lower cosine similarity than
answer call for most queries to Augur. But if a query ever indicates a greater cosine similarity for
ignore call, the application can silence the phone. As before, Augur broadcasts the desired activity
to any listening devices (such as the phone).
Suppose your phone rings while you are talking to your best friend about their relationship issues.
Thoughtlessly, you curse, and your phone stops ringing instantly:
Activity          Score   Frequency
throw phone       0.24    3783
ignore call       0.18    567
ring              0.18    7245
answer call       0.17    4847
call back         0.17    1883
leave voicemail   0.17    146
Many reactions besides cursing might also trigger ignore call. In this case, adding curse to the
prediction mix shifts the odds between ignoring and answering significantly. Other results like throw
phone reflect the biases in fiction. We will investigate the impact of these biases in our Evaluation.
3.2.4 Applications
Augur allows developers to build situationally reactive applications across many activities and con-
texts. Here we present three more applications designed to illustrate the potential of its API. We
have deployed one of these applications, A Soundtrack for Life, as a Google Glass prototype.
The Autonomous Activity Journal
We often forget where we have gone and what we have done. Augur allows us to journal our activities
passively, automatically (and probabilistically):
predictions = augur.predict()
for(p in predictions where p.score > 0.8)
    file.write("journal", p.activity);
When Augur returns new predictions about our life, this program will write the most likely ones
to a log. We might search this log later, or use it to find patterns in our daily behavior. For example,
what days are we most likely to exercise? How often do we tend to go out to eat, or hang out with
friends? Some of Augur’s predictions will inevitably be false positives, but in aggregate they may
provide useful analytics into our lives.
The Coffee-Aware Smart Home
In Weiser’s ubiquitous computing vision [82], he introduces the idea of calm computing via a scenario
where a woman wakes up and her smart home asks if she wants coffee. Augur’s activity prediction
API can support this vision:
if(augur.predict("make coffee")) { askAboutCoffee(); }
Suppose that your alarm goes off, signaling to Augur that your activity is wake up. Your smart
coffeepot can start brewing when Augur predicts you want to make coffee:
Activity         Score   Frequency
want breakfast   0.38    852
throw blanket    0.38    728
shake awake      0.37    774
hear shower      0.36    971
take bath        0.35    1719
make coffee      0.34    779
check clock      0.34    2408
After people wake up in the morning, they are likely to make coffee. They may also want
breakfast, another task a smart home might help us with.
Spending Money Wisely
We often spend more money than we have. Augur can help us maintain a greater awareness of our
spending habits, and how they affect our finances. If we are reminded of our bank balance before
Figure 3.4: A Soundtrack for Life is a Google Glass application that plays musicians based on the user’s predicted activity, for example associating working with The Glitch Mob.
spending money, we may be less inclined to spend it on frivolous things:
if(augur.predict("pay")) {
    balance = secure_bank_query();
    speak("your balance is " + balance);
}
If Augur predicts we are likely to pay for something, it will tell us how much money we have left
in our account. What might trigger this prediction?
Activity   Score   Frequency
scan       0.19    5319
ring       0.19    7245
pay        0.17    23405
swipe      0.17    1800
shop       0.13    3761
For example, when you enter a store, you may be about to pay for something. The pay prediction
also triggers on ordering food or coffee, entering a cafe, gambling, and calling a taxi.
Activity       Score   Frequency
hail taxi      0.96    228
pay            0.96    181
call taxi      0.96    359
get taxi       0.96    368
tell address   0.95    463
get suitcase   0.82    586
A Soundtrack for Life
Many of life’s activities are accompanied by music: you might cook to the refined arpeggios of
Vivaldi, exercise to the dark ambivalence of St. Vincent, and work to the electronic pulse of the
Glitch Mob. Through an activity detection system we have built into Google Glass (Figure 3.4),
Augur can arrange a soundtrack for you that suits your daily preferences. We built a physical
prototype for this application as it takes advantage of the full range of activities Augur can detect.
var act2music = {
    "cook": "Vivaldi", "drive": "The Decemberists",
    "surfing": "Sea Wolf", "buy": "Atlas Genius",
    "work": "Glitch Mob", "exercise": "St. Vincent",
};
var act = augur.predict();
if (act in act2music) {
    play(act2music[act]);
}
For example, if you are brandishing a spoon before a pot on the stove, you are likely cooking.
Augur plays Vivaldi.
Activity   Score   Frequency
cook       0.50    6566
pour       0.39    757
place      0.37    25222
stir       0.37    2610
eat        0.34    49347
3.3 Evaluation
Can fiction tell us what we need in order to endow our interactive systems with basic knowledge
of human activities? In this section, we investigate this question through three studies. First, we
compare Augur’s activity predictions to human activity predictions in order to understand what
forms of bias fiction may have introduced. Second, we test Augur’s ability to detect common
activities over a two-hour window of daily life. Third, to stress test Augur over a wider range of
activities, we evaluate its activity predictions on a dataset of 50 images sampled from the Instagram
hashtag #dailylife.
3.3.1 Bias of Fiction
If fiction were truly representative of our lives, we might be constantly drawing swords and kissing
in the rain. Our first evaluation investigates the character and prevalence of fiction bias. We tested
how closely a distribution of 1000 activities sampled from Augur’s knowledge base compared against
human-reported distributions. While these human-reported distributions may differ somewhat from
the real world, they offer a strong sanity check for Augur’s predictions.
Method
To sample the distribution of activities in Augur, we first randomly sampled 100 objects from the
knowledge base. We then used Augur’s activity identification API to select 10 human activities most
related to each object by cosine similarity. In general, these selected activities tended to be relatively
common (e.g., cross and park for the object “street”). We normalized these sub-distributions such
that the frequencies of their activities summed to 100.
Next, for each object we asked five workers on Amazon Mechanical Turk to estimate the relative
likelihood of its selected activities. For example, given a piano: “Imagine a random person is around
a piano 100 times. For each action in this list, estimate how many times that action would be taken.
The overall counts must sum to 100.” We asked for integer estimates because humans tend to be
more accurate when estimating frequencies [31].
Finally, we computed the estimated true human distribution (ETH) as the mean distribution
across the five human estimates. We compared the mean absolute error (MAE) of Augur and the
individual human estimates against the ETH.
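The comparison can be made concrete with a small worked example. This sketch assumes MAE is averaged over the activities listed for an object, which the text does not spell out. The worker estimates, the raw frequencies, and the function names are hypothetical.

```python
def normalize(counts, total=100.0):
    """Rescale raw frequencies so that they sum to `total`."""
    s = sum(counts)
    return [total * c / s for c in counts]

def mean_distribution(estimates):
    """Element-wise mean across several estimates (the ETH)."""
    n = len(estimates)
    return [sum(column) / n for column in zip(*estimates)]

def mean_absolute_error(pred, reference):
    """Average absolute difference per activity, in percentage points."""
    return sum(abs(p - r) for p, r in zip(pred, reference)) / len(reference)

# Hypothetical estimates from five workers for three activities near one object.
workers = [[60, 30, 10], [50, 40, 10], [70, 20, 10], [60, 30, 10], [60, 30, 10]]
eth = mean_distribution(workers)

# Hypothetical raw frequencies from the knowledge base for the same activities.
augur = normalize([200, 100, 700])

augur_mae = mean_absolute_error(augur, eth)
human_maes = [mean_absolute_error(w, eth) for w in workers]
```

In this toy case the ETH is [60, 30, 10] and the model distribution is [20, 10, 70], so the model's MAE is (40 + 20 + 60) / 3 = 40 percentage points. Any individual worker's MAE is far smaller.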
Results
Augur’s MAE when compared to the ETH is 12.46%, which means that, on average, its predictions
relative to the true human distribution are off by slightly more than 12%. The mean MAE of the
individual human distributions when compared to the ETH is 6.47%, with a standard deviation of
3.53%. This suggests that Augur is biased, although its estimates are not far outside the variance
of individual humans.
Investigating the individual distributions of activities suggests that the vast majority of Augur’s
prediction error is caused by a few activities in which its predictions differ radically from the humans.
In fact, for 84% of the tested activities Augur’s estimate is within 4% of the ETH. What accounts
for these few radically different estimates?
The largest class of prediction error is caused by general activities such as look. For example,
when considering raw co-occurrence frequencies, people look at clocks much more often than they
check the time, because look occurs far more often in general. When estimating the distribution of
activities around clock, human estimators put most of their weight on check time, while Augur put
Figure 3.5: We deployed an Augur-powered wearable camera in a field test over common daily activities, finding average rates of 96% recall and 71% precision for its classifications.
nearly all its weight on look. Similar mistakes involved the common but understated activities of
getting into cars or going to stores. Human estimators favored driving cars and shopping at stores.
A second and smaller class of error is caused by strong connections between dramatic events that
take place more often in fiction than in real life. For example, Augur put nearly all of its prediction
weight for cats on hissing while humans distributed theirs more evenly across a cat’s possible activities.
In practice, we saw few of these overdramatized instances in Augur’s applications, and it may
be possible to use paid crowdsourcing to smooth them out. Further, this result suggests that
the ways fiction deviates from real life may be more at the macro-level of plot and situation, and
less at the level of micro-behaviors. Yes, fictional characters sometimes find themselves defending
their freedom in court against a murder charge. However, their actions within that courtroom do
tend to mirror reality — they don’t tend to leap onto the ceiling or draw swords.
3.3.2 Field test of A Soundtrack for Life
Our second study evaluates Augur through a field test of our Glass application, A Soundtrack for
Life. We recorded a two-hour sample of one user’s day, in which she walked around campus, ordered
coffee, drove to a shopping center, and bought groceries, among other activities (Figure 3.5).
Method
We gave a Google Glass loaded with A Soundtrack for Life to a volunteer and asked her, over a
two-hour period, to enact the following eight activities: walk, buy, eat, read, sit, work, order, and
drive. We then turned on the Glass, set the Soundtrack’s sampling rate to 1 frame every 10 seconds,
and recorded all data. The Soundtrack logged its predictions and images to disk.
Blind to Augur’s predictions, we annotated all image frames with a set of correct activities.
Frames could consist of no labeled activities, one activity, or several. For example, a subject sitting
at a table filled with food might be both sitting and eating. We included plausible activities among
this set. For example, when the subject approaches a checkout counter, we included pay both under
circumstances in which she did ultimately purchase something, and also under others in which she did
not. Over these annotated image frames, we computed precision and recall for Augur’s predictions.
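Because each frame can carry zero, one, or several labels, precision and recall are naturally computed per activity over sets of labels. The following is a minimal sketch of that bookkeeping. The frames and the `precision_recall` helper are hypothetical, and the way undefined ratios are handled (returning 0.0) is an assumption.

```python
def precision_recall(predicted, truth, activity):
    """Per-activity precision and recall over frames of label sets."""
    tp = sum(1 for p, t in zip(predicted, truth) if activity in p and activity in t)
    fp = sum(1 for p, t in zip(predicted, truth) if activity in p and activity not in t)
    fn = sum(1 for p, t in zip(predicted, truth) if activity not in p and activity in t)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0.0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical annotations: one set of activities per sampled frame.
truth = [{"walk"}, {"sit", "eat"}, {"sit"}, set()]
predicted = [{"walk"}, {"sit"}, {"sit", "eat"}, {"sit"}]

sit_precision, sit_recall = precision_recall(predicted, truth, "sit")
eat_precision, eat_recall = precision_recall(predicted, truth, "eat")
```

Here sit is predicted on three frames but is correct on only two, giving precision 2/3 and recall 1. That is the same high-recall, lower-precision shape the field test reports.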
Activity   Ground Truth Frames   Precision   Recall
Walk       787                   91%         99%
Drive      545                   63%         100%
Sit        374                   59%         86%
Work       115                   44%         97%
Buy        78                    89%         83%
Read       33                    82%         87%
Eat        12                    53%         83%
Average                          71%         96%
Table 3.1: We find average rates of 96% recall and 71% precision over common activities in the dataset. Here Ground Truth Frames refers to the total number of frames labeled with each activity.
Results
We find rates of 96% recall and 71% precision across activity predictions in the dataset (Table 3.1).
When we break up these rates by activity, Augur succeeds best at activities like walk, buy and read,
with precision and recall scores of at least 82%. On the other hand, we see that the activities work,
drive, and sit cause the majority of Augur’s errors. Work is triggered by a diverse set of contextual
elements. People work at cafes or grocery stores (for their jobs), or do construction work, or work
on intellectual tasks, like writing research papers on their laptops. Our image annotations did not
capture all these interpretations of work, so Augur’s disagreement with our labeling is not surprising.
Drive is also triggered by a large number of contextual elements, including broad scene descriptors
like “store” or “cafe,” presumably because fictional characters often drive to these places. And sit is
problematic mostly because it is triggered by the common scene element “tree” (real-world people
probably do this less often than fictional characters). We also observe simpler mistakes: for example,
our computer vision algorithm thought the bookstore our subject visited was a restaurant, causing
a large precision hit to eat.
3.3.3 A stress test over #dailylife
Our third evaluation investigates whether a broad set of inputs to Augur would produce meaningful
activity predictions. We tested the quality of Augur’s predictions on a dataset of 50 images sampled
from the Instagram hashtag #dailylife. These images were taken in a variety of environments across
the world, including homes, city streets, workplaces, restaurants, shopping malls and parks. First,
we sought to measure whether Augur predicts meaningful activities given the objects in the image.
Second, we compared Augur’s predictions to the human activity that best describes each scene.
Quality                            Samples   Percent Success
Augur VSM predictions              1000      94%
Augur VSM scene recall             50        82%
Computer vision object detection   50        62%
Table 3.2: As rated by external experts, the majority of Augur’s predictions are high-quality.
Method
To construct a dataset of images containing real daily activities, we sampled 50 scene images from the
most recent posts to the Instagram #dailylife hashtag³, skipping 4 images that did not represent
real scenes of people or objects, such as composite images and drawings.
We ran each image through an object detection service to produce a set of object tags, then
removed all non-object tags with WordNet. For each group of objects, we used Augur to generate
20 activity predictions, making 1000 in total.
We used two external evaluators to independently analyze each of these predictions as to their
plausibility given the input objects, and blind to the original photo. A third external evaluator
decided any disagreements. High quality predictions describe a human activity that is likely given
the objects in a scene: for example, using the objects street, mannequin, mirror, clothing, store
to predict the activity buy clothes. Low quality predictions are unlikely or nonsensical, such as
connecting car, street, ford, road, motor to the activity hop.
Next, we showed evaluators the original image and asked them to decide: 1) whether computer
vision had extracted the set of objects most important to understanding the scene, and 2) whether one
of Augur’s predictions accurately described the most important activity in each scene.
Results
The evaluators rated 94% of Augur’s predictions as high quality (Table 3.2). Among the 44 that
were low quality, many can be accounted for by tagging issues (e.g., “sink” being mistagged as a
verb). The others are largely caused by relatively uncommon objects connecting to frequent and
overly-abstract activities, for example the uncommon object “tableware” predicts “pour cereal”.
Augur makes activity predictions that accurately describe 82% of the images, despite the fact
that CV extracted the most important objects in only 62%. Augur’s knowledge base is able to
compensate for some noise in the neural net: across those images with good CV extraction, Augur
succeeded at correctly predicting the most relevant activity on 94%.
³ https://instagram.com/explore/tags/dailylife/
3.4 Discussion
Augur’s design presents a set of opportunities and limitations. First, we acknowledge that data-
driven approaches are not panaceas. Just because a pattern appears in data does not mean that
it is interpretable. For example, “boyfriend is responsible” is a statistical pattern in our text, but
it isn’t necessarily useful. Life is full of uninterpretable correlations, and developers using Augur
should be careful not to trigger unusual behaviors with such results. A crowdsourcing layer that
verifies Augur’s predictions in a specific topic area may help filter out any confusing artifacts.
Similarly, while fiction allows us to learn about an enormous and diverse set of activities, in some
cases it may present a vocabulary that is too open ended. Activities may have similar meanings,
or overly broad ones (like work in our evaluation). How does a user know which to use? In our
testing, we have found that choice of phrase is often unimportant. For example, the cosine similarity
between hail taxi and call taxi is 0.97, which means any trigger for one is in practice equivalent to
the other (or take taxi or get taxi). In this sense a large vocabulary is actively helpful. However, for
other activities choice of phrase does matter, and to identify and collapse these activities, we again
see potential for the refinement of Augur’s model through crowdsourcing.
In the process of pursuing this research, we found ourselves in many data mining dead ends.
Human behavior is complex, and natural language is complex. Our initial efforts included heavier-
handed integration with WordNet to identify object classes such as locations and people’s names;
unfortunately, “Virginia” is both. This resulted in many false positives. Likewise, activity prediction
requires an order of magnitude more data to train than the other APIs given the $N^2$ nature of
its skip-grams. Our initial result was that very few scenarios lent themselves to accurate activity
prediction. Our solution was to simplify our model (e.g., look only at pronouns) and gather ten
times the raw data from Wattpad. In this case, more data beat more modeling intelligence.
More broadly, Augur suggests a reinterpretation of our role as designers. Until now, the designer’s
goal in interactive systems has been to articulate the user’s goals, then fashion an interface specifically
to support those goals. Augur proposes a kind of “open-space design” where the behaviors may be
left open to the users to populate, and the designer’s goal is to design reactions that enable each of
these goals. To support such an open-ended design methodology, we see promise in Augur’s natural
language descriptions. Activities such as “sit down”, “order dessert” and “go to the movies” are
not complex activity codes but human-language descriptions. We speculate that each of Augur’s
activities could become a command. Suppose any device in a home could respond to a request to
“turn down the lights”. Today, Siri has tens of commands; Augur has potentially thousands.
Chapter 4
Modeling Signals in Human
Language
Human language is colored by a broad range of topics, but existing text analysis tools only focus
on a small number of them. Here we present Empath, a tool that can generate and validate new
lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate
the category violence). Empath draws connotations between words and phrases by learning a neural
embedding across billions of words on the web. Given a small set of seed words that characterize a
category, Empath uses its neural embedding to discover related terms, then validates the category
with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories
we have generated such as neglect, government, and social media. We show that Empath’s data-
driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
4.1 Empath
Empath analyzes text across hundreds of topics and emotions. Like LIWC and other dictionary-
based tools, it counts category terms in a text document. However, Empath covers a broader set of
categories than other tools, and can generate and validate new categories with a few seed words.
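Dictionary-based counting of this kind can be sketched in a few lines. This is an illustration only and not Empath's released library: the tiny lexicon below borrows a handful of sample terms from two of Empath's categories, and the tokenizer, the `count_categories` name, and the normalization by token count are assumptions.

```python
import re

def count_categories(text, lexicon, normalize=False):
    """Count how many tokens of `text` fall into each lexical category."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = {category: sum(1 for tok in tokens if tok in terms)
              for category, terms in lexicon.items()}
    if normalize and tokens:
        # Assumption: normalize by document length in tokens.
        counts = {category: c / len(tokens) for category, c in counts.items()}
    return counts

# A tiny hypothetical lexicon; real categories hold hundreds of terms.
lexicon = {
    "social media": {"facebook", "instagram", "notification", "selfie"},
    "war": {"attack", "soldier", "army", "enemy"},
}
text = "The soldier posted a selfie on Instagram before the attack."
raw_counts = count_categories(text, lexicon)
rates = count_categories(text, lexicon, normalize=True)
```

Here the ten-token sentence contains two social media terms and two war terms, so each category has a raw count of 2 and a normalized rate of 0.2.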
4.1.1 Designing Empath’s categories
Empath provides 200 human validated categories, which cover topics like violence, depression, or
femininity. We drew these categories from common concepts in the ConceptNet knowledge base and
Parrott’s hierarchy of emotions [71]. While Empath’s topical and emotional categories stem from
different sources of knowledge, we generate member terms for both kinds of categories in the same
social media   war           violence   technology   fear       pain       hipster    contempt
facebook       attack        hurt       ipad         horror     hurt       vintage    disdain
instagram      battlefield   break      internet     paralyze   pounding   trendy     mockery
notification   soldier       bleed      download     dread      sobbing    fashion    grudging
selfie         troop         broken     wireless     scared     gasp       designer   haughty
account        army          scar       computer     tremor     torment    artsy      caustic
timeline       enemy         hurting    email        despair    groan      1950s      censure
follower       civilian      injury     virus        panic      stung      edgy       sneer
Table 4.1: Empath can analyze text across hundreds of data-driven categories. Here we provide a sample of representative terms in 8 sample categories.
way. Given a set of seed terms (from ConceptNet or the Parrott hierarchy), Empath learns from a
large corpus of text to predict and validate hundreds of similar categorical terms.
We generate category terms by querying a vector space model trained by a neural network on
a large corpus of text. This model allows Empath to examine the similarity between words across
many dimensions of meaning. For example, given seed words like “facebook” and “twitter,” Empath
finds related terms like “pinterest” and “selfie.”
Training a neural word embedding model
To train Empath’s model, we adapt the skip-gram architecture introduced by Mikolov et al. [60].
This is an unsupervised learning task that teaches a neural network to predict co-occurring words in a
corpus. For example, the network might learn that “death” predicts a nearby occurrence of the word
“carrion,” but not of “incest.” Over the course of training, the network learns a representation of each word that
is predictive of its context, and we can then borrow these representations, called neural embeddings,
to map words onto a vector space.
More formally, for word $w$ and context $C$ in a network with negative sampling, a skip-gram
network will learn weights that maximize the dot product $w \cdot w_c$ and minimize $w \cdot w_n$ for
$w_c \in C$ and $w_n$ sampled randomly from the vocabulary. The context $C$ of a word is determined
by a sliding window over the document, of a size typically in (0,7).
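The objective can be illustrated with the standard negative-sampling loss from the skip-gram literature, in which each dot product is passed through a sigmoid. This is a sketch of that general formulation with toy two-dimensional vectors. It is not Empath's training code, and the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def context_window(tokens, i, size=5):
    """Words within `size` positions of tokens[i], excluding the word itself."""
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

def negative_sampling_loss(w, context_vecs, negative_vecs):
    """Loss that falls as w . w_c grows and as w . w_n shrinks."""
    loss = 0.0
    for w_c in context_vecs:      # observed context words: push w . w_c up
        loss -= math.log(sigmoid(dot(w, w_c)))
    for w_n in negative_vecs:     # random negatives: push w . w_n down
        loss -= math.log(sigmoid(-dot(w, w_n)))
    return loss

# Toy vectors: the first setting agrees with the objective, the second does not.
w = [1.0, 0.0]
aligned = negative_sampling_loss(w, [[1.0, 0.0]], [[-1.0, 0.0]])
misaligned = negative_sampling_loss(w, [[-1.0, 0.0]], [[1.0, 0.0]])
window = context_window(["a", "b", "c", "d"], 1, size=1)
```

Minimizing this loss by gradient descent is what drives a word's vector toward the vectors of its observed contexts and away from randomly sampled words.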
We train our network on data from Wattpad, Reddit, and the New York Times [26, 24, 25].
The network uses a hidden layer of 150 neurons (which defines the dimensionality of the embedding
space), a sliding window size of five, a minimum word count of thirty (i.e., a word must occur at
least thirty times to appear in the training set), negative sampling, and down-sampling of frequent
terms. These techniques reflect current best practices in language modeling [61].
Building categories with a vector space
We use the neural embeddings created by our skip-gram network to construct a vector space model
(VSM). Similar models trained on neural embeddings, such as word2vec, enable powerful forms of
analogical reasoning (e.g., the vector arithmetic for the terms “King - Man + Woman” produces a
vector close to “Queen”) [55]. This model allows Empath to discover member terms for categories.
Empath Category   Words that passed filter               Words removed
Domestic Work     chore, vacuum, scrubbing, laundry      find
Dance             ballet, rhythm, jukebox, dj, song      buds
Aggression        lethal, spite, betray, territorial     censure
Attractive        alluring, cute, swoon, dreamy, cute    defiantly
Nervousness       uneasiness, paranoid, fear, worry      nostalgia
Furniture         chair, mattress, desk, antique         crate
Heroic            underdog, gutsy, rescue, underdog      spoof
Exotic            aquatic, tourist, colorful, seaside    rural
Meeting           office, boardroom, presentation        homework
Fashion           stylist, shoe, tailor, salon, trendy   yoga
Table 4.2: Crowd workers found 95% of the words generated by Empath’s unsupervised model to be related to its categories. However, machine learning is not perfect, and some unrelated terms slipped through (“Words removed” above), which the crowd then removed.
VSMs encode concepts as vectors, where each dimension of the vector $v \in \mathbb{R}^n$ conveys a feature
relevant to the concept. For Empath, each vector $v$ is a word, and each of its dimensions defines the
weight of its connection to one of the hidden layer neurons. The space is a matrix $M$ of size
$n \times h$, where $n$ is the size of our vocabulary (40,000) and $h$ is the number of hidden nodes
in the network (150).
Empath’s VSM selects member terms for its categories (e.g., social media, violence, shame) by
using cosine similarity, a similarity measure over vector spaces, to find nearby terms in the space.
Concretely, we search the vector spaces on multiple seed terms by querying on the vector sum of
those terms—a kind of reasoning by analogy. From a small seed of words, Empath can gather
hundreds of terms related to a given category, and then use these terms for textual analysis.
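Querying on the vector sum of seed terms might look like the sketch below. The three-dimensional embeddings are invented for illustration (the real space has 150 dimensions), and `expand_category` is a hypothetical name rather than part of Empath's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_category(embeddings, seeds, top_k=2):
    """Rank non-seed words by cosine similarity to the sum of the seed vectors."""
    dims = len(next(iter(embeddings.values())))
    query = [sum(embeddings[s][d] for s in seeds) for d in range(dims)]
    candidates = [(word, cosine(query, vec))
                  for word, vec in embeddings.items() if word not in seeds]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical embeddings, for illustration only.
embeddings = {
    "facebook":  [0.90, 0.10, 0.00],
    "twitter":   [0.80, 0.20, 0.10],
    "pinterest": [0.85, 0.15, 0.05],
    "selfie":    [0.70, 0.30, 0.00],
    "army":      [0.00, 0.10, 0.90],
}
members = expand_category(embeddings, ["facebook", "twitter"])
```

Summing the seeds before searching favors words that are close to all of the seeds at once, and not to just one of them.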
4.1.2 Refining categories with crowd validation
Human-validated categories can ensure that accidental terms do not slip into a lexicon. By filtering
Empath’s categories through the crowd, we offer the benefits of both modern NLP and human
validation: increasing category precision, and more carefully validating category contents.
To validate each of Empath’s categories, we created a crowdsourcing pipeline on Amazon Me-
chanical Turk. We divided the total number of words to be filtered across many separate tasks,
where each task consists of twenty words to be rated for a given category. For each of these words,
workers select a relationship on a four point scale: not related, weakly related, related, and strongly
related. We ask three independent workers to complete each task at a cost of $0.14 per task. Prior
work has shown that three workers are enough for reliable results in labeling tasks, given high quality
contributors [72]. So, if we want to filter a category of 200 words, we would have 200/20 = 10 tasks,
which must be completed by three workers, at a total cost of 10 × 3 × $0.14 = $4.20 for this category. We
limit tasks to Masters workers to ensure quality and aggregate crowdworker feedback by majority
vote. Workers demonstrated high agreement on the labeling task (81%).
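The cost arithmetic and the vote aggregation can be written out as a short sketch. The defaults mirror the figures above (twenty words per task, three workers, $0.14 per task). The function names are hypothetical, and returning `None` when no label wins a strict majority is an assumption, since the text does not say how three-way splits are resolved.

```python
import math
from collections import Counter

def validation_cost(num_words, words_per_task=20, workers_per_task=3,
                    cost_per_task=0.14):
    """Total cost in dollars of filtering one category through the crowd."""
    tasks = math.ceil(num_words / words_per_task)
    return tasks * workers_per_task * cost_per_task

def majority_vote(ratings):
    """Label chosen by more than half of the workers, else None (assumption)."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None

cost = validation_cost(200)
agreed = majority_vote(["related", "related", "not related"])
split = majority_vote(["related", "weakly related", "not related"])
```

For a 200-word category this reproduces the 10 × 3 × $0.14 = $4.20 figure. Categories whose size is not a multiple of twenty round up to a whole extra task.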
4.1.3 Empath API and web service
Finally, to help researchers analyze text over new kinds of categories, we have released Empath as
a web service and open source library. The web service¹ allows users to analyze documents across
Empath’s built-in categories, generate new unsupervised categories, and request new categories be
validated using our crowdsourcing pipeline. The open source library² is written in Python and
similarly returns document counts across Empath’s built-in validated categories.
4.2 Empath Applications
To motivate the opportunities that Empath creates, we first present two example analyses that
illustrate its breadth and flexibility. In general, Empath allows researchers to perform text analyses
over a broader set of topical and emotional categories than existing tools, and also to create and val-
idate new categories on demand. Following this section, we explain the techniques behind Empath’s
model in more detail.
4.2.1 Example 1: Understanding deception in hotel reviews
What kinds of words accompany our lies? In our first example, we use Empath to analyze a dataset
of deceptive hotel reviews reported previously by Ott et al. [66]. This dataset contains 3200 truthful
hotel reviews mined from TripAdvisor.com and deceptive reviews created by workers on Amazon
Mechanical Turk, split among positive and negative ratings. The original study found that liars tend
to write more imaginatively, use less concrete language, and incorporate less spatial information into
their lies.
Exploring the deception dataset
We ran Empath’s full set of categories over the truthful and deceptive reviews, and produced ag-
gregate statistics for each. Using normalized means of the category counts for each group, we then
computed odds ratios and p-values for the categories most likely to appear in deceptive and truthful
reviews. All the results we report are significant after a Bonferroni correction (α = 2.5e−5).
Our results provide new evidence in support of the Ott et al. study, suggesting that deceptive
reviews convey stronger sentiment across both positively and negatively charged categories, and tend
towards exaggerated language (Figure 4.1). The liars more often use language that is tormented (2.5
odds) or joyous (2.3 odds), for example “it was torture hearing the sounds of the elevator which
just would never stop” or “I got a great deal and I am so happy that I stayed here.” The truth-
tellers more often discuss concrete ideas and phenomena like the ocean (1.6 odds), vehicles (1.7
odds) or noises (1.7 odds), for example “It seemed like a nice enough place with reasonably close
beach access” or “they took forever to Valet our car.” We see a tendency towards more mundane
activities among the truth-tellers through categories like eating (1.3 odds), cleaning (1.3 odds), or
hygiene (1.2 odds), for example “I ran the shower for ten minutes without ever receiving any hot
water.” For the liars, interactions seem to be more evocative, involving death (1.6 odds) or partying
(1.3 odds), for example “The party that keeps you awake will not be your favorite band practicing
for their next concert.”
Figure 4.1: Deceptive reviews convey stronger sentiment across both positively and negatively charged categories. In contrast, truthful reviews show a tendency towards more mundane activities and physical objects.
¹ http://empath.stanford.edu
² https://github.com/Ejhfast/empath
For exploratory research questions, Empath provides a high-level view over many potential cat-
egories, some of which a researcher may not have thought to investigate. Lying hotel reviewers, for
example, may not have realized they give themselves away by fixating on smell (1.4 odds), “the
room was pungent with what smelled like human excrement”, or their systematic overuse of emo-
tional terms, producing significantly higher odds ratios for 13 of Empath’s 32 emotional categories.
Truthful reviews, on the other hand, display higher odds ratios for none of Empath’s emotional
categories.
Spatial language in lies
While the original study provided some evidence that liars use less spatially descriptive language, it
wasn’t able to test the theory directly. Using Empath, we can generate a new set of human validated
terms that capture this idea, creating a new spatial category. To do so, we tell Empath to seed the
category with the terms “big”, “small”, and “circular”. Empath then discovers a series of related
terms and uses the crowd to validate them, producing the cluster:
circular, small, big, large, huge, gigantic, tiny, rectangular, rectangle, massive, giant, enormous, smallish,
rounded, middle, oval, sized, size, miniature, circle, colossal, center, triangular, shape, boxy, round,
shaped, decorative, ...
When we then add the new spatial category to our analysis, we find it favors truthful reviews
by 1.2 odds (p < 0.001). Truth-tellers use more spatial language, for example, “the room that we
originally were in had a huge square cut out of the wall that had exposed pipes, bricks, dirt and
dust.” In aggregate, liars are less apt to include these concrete details.
Figure 4.2: We use Empath to replicate the work of Golder and Macy, investigating how mood on Twitter relates to time of day. The signals reported by Empath and LIWC by hour are strongly correlated for positive (r=0.87) and negative (r=0.90) sentiment.
4.2.2 Example 2: Mood on Twitter and time of day
In our final example, we use Empath to investigate the relationship between mood on Twitter and
time of day, replicating the work of Golder and Macy [33]. While the corpus of tweets analyzed by
the original paper is not publicly available, we reproduce the paper’s findings on a smaller corpus
of 591,520 tweets from the PST time zone, running LIWC on our data as an additional benchmark
(Figure 4.2).
The original paper shows a low of negative sentiment in the morning that rises over the rest
of the day. We find a similar relationship on our data with both Empath and LIWC: a low in
the morning (around 8am), peaking to a high around 11pm. The signals reported by Empath and
LIWC over each hour are strongly correlated (r=0.90). Using a 1-way ANOVA to test for changes in
mean negative affect by hour, Empath reports a highly significant difference (F (23, 591520) = 17.2,
p < 0.001), as does LIWC (F = 6.8, p < 0.001). For positive sentiment, Empath and LIWC again
replicate similarly with strong correlation between tools (r=0.87). Both tools once more report
highly significant ANOVAs by hour: Empath F = 5.9, p < 0.001; LIWC F = 7.3, p < 0.001.
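For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as below. This is the textbook formula, not code from the study; the grouping of per-tweet scores by hour and the helper name anova_f are assumptions, and the sample data is invented.

```ruby
# One-way ANOVA F statistic for a list of groups (arrays of numbers).
# F = (between-group sum of squares / (k - 1)) /
#     (within-group sum of squares / (n - k))
def anova_f(groups)
  all = groups.flatten
  n = all.size
  k = groups.size
  grand_mean = all.sum.to_f / n

  ss_between = groups.sum do |g|
    mean = g.sum.to_f / g.size
    g.size * (mean - grand_mean)**2
  end
  ss_within = groups.sum do |g|
    mean = g.sum.to_f / g.size
    g.sum { |x| (x - mean)**2 }
  end

  df_between = k - 1
  df_within  = n - k
  { f: (ss_between / df_between) / (ss_within / df_within),
    df_between: df_between, df_within: df_within }
end

# Invented scores for three groups (e.g., three hours of the day).
p anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```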
4.3 Evaluation
Here we evaluate Empath’s crowd filtered and unsupervised predictions against gold standard cate-
gories in LIWC.
4.3.1 Comparing Empath and LIWC
The broad reach of our dataset allows Empath to classify documents among a large number of
categories. But how accurate are these categorical associations? Human inspection and crowd
filtering of Empath’s categories (Table 4.2) provide some evidence, but ideally we would like to
answer this question in a more quantitative way.
Fortunately, LIWC has been extensively validated by researchers [68], so we can use it to bench-
mark Empath’s predictions across the categories that they share in common. If we can demonstrate
that Empath provides very similar results across these categories, this would suggest that Empath’s
predictions are close to achieving gold standard accuracy. Here we compare the predictions of
Empath and LIWC over 12 shared categories: sadness, anger, positive emotion, negative emotion,
sexual, money, death, achievement, home, religion, work, and health.
Method
To compare all tools, we created a mixed textual dataset evenly divided among tweets [64], StackEx-
change opinions [19], movie reviews [67], hotel reviews [66], and chapters sampled from four classic
novels on Project Gutenberg (David Copperfield, Moby Dick, Anna Karenina, and The Count of
Monte Cristo) [1]. This mixed corpus contains more than 2 million words in total across 4500
individual documents.
Next we selected two parameters for Empath: the minimum cosine similarity for category inclu-
sion and the seed words for each category (we fixed the size of each category at a maximum of 200
words). To choose these parameters, we divided our mixed text dataset into a training corpus of
900 documents and a test corpus of 3500 documents. We selected up to five seed words that best
approximated each LIWC category, and found that a minimum cosine similarity of 0.5 offered the
best performance. We then also created crowd filtered versions of these categories.
We ran all tools over the documents in the test corpus, recorded their category word counts,
then used these counts to compute Pearson correlations between all shared categories, as well as
aggregate overall correlations. Pearson’s r measures the linear correlation between two variables,
and returns a value in the range [−1, 1], where 1 is total positive correlation, 0 is no correlation, and −1 is
total negative correlation. These correlations speak to how well one tool approximates another.
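As a concrete reference, Pearson's r between two tools' per-document counts for one category can be computed as follows. The helper name pearson_r and the counts are invented for illustration; only the formula itself is standard.

```ruby
# Pearson correlation between two equally long lists of numbers.
def pearson_r(xs, ys)
  raise ArgumentError, "inputs must have equal length" unless xs.size == ys.size

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Hypothetical per-document word counts from two lexicons for one category.
tool_a_counts = [3, 0, 7, 2, 5]
tool_b_counts = [4, 1, 6, 2, 5]
puts pearson_r(tool_a_counts, tool_b_counts).round(3)
```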
LIWC Category   Empath   Empath+Crowd   EmoLex   General Inquirer
Positive        0.944    0.950          0.955    0.971
Negative        0.941    0.936          0.954    0.945
Sadness         0.890    0.907          0.852    -
Anger           0.889    0.894          0.837    -
Achievement     0.915    0.903          -        0.817
Religion        0.893    0.908          -        0.902
Work            0.859    0.820          -        0.745
Home            0.919    0.941          -        -
Money           0.902    0.878          -        -
Health          0.866    0.898          -        -
Sex             0.928    0.935          -        -
Death           0.856    0.901          -        -
Average         0.900    0.906          0.899    0.876
Table 4.3: We compared the classifications of LIWC, EmoLex and Empath across twelve categories, finding strong correlation between tools. The first column represents comparisons between Empath’s unsupervised model and LIWC, the second after crowd filtering against LIWC, the third between EmoLex and LIWC, and the fourth between the General Inquirer and LIWC.
To anchor this analysis, we collected benchmark Pearson correlations against LIWC for GI and
EmoLex (two existing human validated lexicons). We found a benchmark correlation of 0.876 be-
tween GI and LIWC over positive emotion, negative emotion, religion, work, and achievement, and a
correlation of 0.899 between EmoLex and LIWC over positive emotion, negative emotion, anger, and
sadness. While EmoLex and GI are commonly regarded as gold standards, they correlate imperfectly
with LIWC. We take this as evidence that gold standard lexicons can disagree: if Empath approx-
imates their performance against LIWC, it agrees with LIWC as well as other carefully-validated
dictionaries agree with LIWC.
Finally, to test the importance of choosing seed terms, we re-ran our evaluation while permuting
the seed words in Empath’s categories. Over one trial, we dropped one seed term from each category.
Over another, we replaced one term from each category with a similar alternative (e.g., “church” to
“chapel”, or “kill” to “murder”).
Results
Empath shares overall average Pearson correlations of 0.90 (unsupervised) and 0.906 (crowd) with
LIWC (Table 4.3). Over the emotional categories, Empath and LIWC agree at correlations of 0.884
(unsupervised) and 0.90 (crowd), comparing favorably with EmoLex’s correlation of 0.899. Over GI’s
benchmark categories, Empath reports 0.893 (unsupervised) and 0.91 (crowd) correlations against
LIWC, stronger performance than GI (0.876). On average, adding a crowd filter to Empath improves
its correlations with LIWC by 0.006. We plot Empath’s best and worst category correlations with
LIWC in Figure 4.3. These scores indicate that Empath and LIWC are strongly correlated – similar
to the correlation between LIWC and other published and validated tools.
In permuting Empath’s seed terms, we found it retained high unsupervised agreement with
LIWC (between 0.82 and 0.88). The correlation between tools was most strongly affected when we
dropped seeds that added a unique meaning to a category. For example, death is seeded with the
words “bury”, “coffin”, “kill”, and “corpse.” When we removed “kill” from death’s seed list,
Empath lost the adversarial aspects of death (embodied in words like “war”, “execute”, or “murder”)
and fell to 0.82 correlation with LIWC for that category. Removing death’s other seed words did
not have nearly so strong an effect. On the other hand, replacing seeds with alternative forms or
synonyms (e.g., “hate” to “hatred”, or “kill” to “murder”) usually had little impact on Empath’s
correlations with LIWC.
Figure 4.3: Empath categories strongly agreed with LIWC, at an average Pearson correlation of 0.90. Here we plot Empath’s best and worst correlations with LIWC. Each dot in the plot corresponds to one document. Empath’s counts are graphed on the x-axis, LIWC’s on the y-axis.
4.4 Discussion
Empath demonstrates an approach that crosses traditional text analysis metaphors with advances
in deep learning. Here we discuss our results and the limitations of our approach.
4.4.1 The role of human validation
While adding a crowd filter to Empath improves its overall correlations with LIWC, the improve-
ment is not statistically significant. Even more surprisingly, the crowd does not always improve
agreement at the level of individual categories. For example, across the categories negative emotion,
achievement, and work, the crowd filter slightly decreases Empath’s agreement with LIWC. When we
inspected the output of the crowd filtering step to determine what had caused this effect, we found
a small number of cases in which the crowd was overzealous. For example, the word “semester”
appears in LIWC’s work category, but the crowd removed it from Empath. Should “semester” be
in a work category? This disagreement highlights the inherent ambiguity of constructing lexicons.
In our case, when the crowd filters out a common word shared by LIWC (like “semester”), this
causes overall agreement across the corpus to decrease (through additional false negatives), despite
the appropriate removal of many other less common words.
As we see in our results, this scenario does not happen often, and when it does happen the
effect size is small. We suggest that crowd validation offers the qualitative benefit of removing false
positives from analyses, while on the whole performing almost identically to (and usually slightly
better than) the unfiltered version of Empath.
4.4.2 Data-driven: who is actually driving?
Empath, like any data-driven system, is ultimately at the mercy of its data – garbage in, garbage
out. While fiction allows Empath to learn an approximation of the gold-standard categories that
define tools like LIWC, its data-driven reasoning may succeed less well on corner cases of analysis
and connotation. Just because fictional characters often pull guns out of gloveboxes, for example,
does not mean the two should be strongly connected in Empath’s categories.
Contrary to this critique, we have found that fiction is a useful training dataset for Empath given
its abundance of concrete descriptors and emotional terms. When we replaced the word embeddings
learned by our model with alternative embeddings trained on Google News [60], we found its average
unsupervised correlation with LIWC decreased to 0.84. The Google News embeddings performed
better after significance testing on only one category, death (0.91), and much worse on several of the
others, including religion (0.78) and work (0.69). This may speak to the limited influence of fiction
bias. Fiction may suffer from the overly fanciful plot events and motifs that surround death (e.g.
suffocation, torture), but it captures more relevant words around most categories.
4.4.3 Limitations
Empath’s design decisions suggest a set of limitations. First, while Empath reports high Pearson
correlations with LIWC’s categories, it is possible that other more qualitative properties are im-
portant to lexical categories. Two lexicons can be statistically similar on the basis of word counts,
and yet one might be easier to interpret than the other, offer more representative words, or present
fewer false positives or negatives. At a higher level, the number and kinds of categories available
in Empath present a related concern. We created these categories in a data-driven manner. Do
they offer the right balance and breadth of topics? We have not evaluated Empath over these more
qualitative aspects of usability.
Second, we have not tested how well Empath’s categories generalize beyond the core set it shares
with LIWC. Do these new categories perform as well in practice? While Empath’s categories are all
generated and validated in the same way, we have seen through our evaluation that choice of seed
words can be important. What makes for a good set of seed terms? And how do we best discover
them? In future work, we hope to investigate these questions more closely.
Finally, while fiction provides a powerful model for generating lexical categories, we have also
seen that, for certain topics (e.g. death in Google News), other corpora may have even greater
potential. Could different datasets be targeted at specific categories? Mining an online fashion
forum, for example, might allow Empath to learn a more comprehensive sense of style, or Hacker
News might give it a more nuanced view of technology and startups. We see potential for training
Empath on other text beyond fiction.
4.4.4 Statistical false positives
Social science aims to avoid Type I errors — false claims that statistically appear to be true. Because
Empath expands the number of categories available for analysis, it is important to consider the risk
of a scientist analyzing so many categories that one of them, through sheer randomness, appears
to be elevated in the text. In this paper, we used Bonferroni correction to handle the issue, but
there are more mature methods available. For example, Holm’s method and FDR are often used in
statistical genomics to test thousands of hypotheses. In the case of regression analysis, it is likewise
important not to do so-called “garbage can regressions” that include every possible predictor. In this
case, models that penalize complexity (e.g., non-zero coefficients) are most appropriate, for example
LASSO or logistic regression with an L1 penalty.
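To make the difference between the two corrections concrete, the sketch below applies Bonferroni and Holm's step-down method to the same list of p-values. The p-values and the helper names are invented for the example; both procedures are implemented from their standard definitions rather than from any code used in this work.

```ruby
# Bonferroni: reject each hypothesis whose p-value is at most alpha / m.
def bonferroni(p_values, alpha: 0.05)
  threshold = alpha / p_values.size
  p_values.map { |p| p <= threshold }
end

# Holm step-down: visit p-values in ascending order, compare the one at
# 0-based rank i with alpha / (m - i), and stop at the first failure.
def holm(p_values, alpha: 0.05)
  m = p_values.size
  rejected = Array.new(m, false)
  ranked = p_values.each_with_index.sort_by { |p, _| p }
  ranked.each_with_index do |(p, original_index), rank|
    break unless p <= alpha / (m - rank)
    rejected[original_index] = true
  end
  rejected
end

p_values = [0.01, 0.04, 0.02, 0.005] # invented for illustration
p bonferroni(p_values)
p holm(p_values)
```

On these invented values Bonferroni rejects two of the four hypotheses while Holm rejects all four, which illustrates why the step-down procedure is never less powerful at the same family-wise error rate.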
Chapter 5
Modeling Patterns in Code
Interfaces need explicit rules to support users, yet common practices are uncodified across many do-
mains such as programming and writing. We hypothesize that by modeling this emergent practice,
interfaces can support a far broader set of user needs. To explore this idea, we built Codex, a knowl-
edge base that records common practice for the Ruby programming language by indexing over three
million lines of popular code. Codex enables new data-driven interfaces for programming systems:
statistical linting, identifying code that is unlikely to occur in practice and may constitute a bug;
pattern annotation, automatically discovering common programming idioms and annotating them
with metadata using expert crowdsourcing; and library generation, constructing a utility package
that encapsulates and reflects emergent software practice. We evaluate these applications to find
Codex captures a broad swath of programming practice, statistical linting detects problematic code
snippets, and pattern annotation discovers nontrivial idioms such as basic HTTP authentication and
database migration templates. Our work suggests that operationalizing practice-driven knowledge
in structured domains such as programming can enable a new class of user interfaces.
5.1 Codex
Norms of practice and convention emerge for software systems that aren’t codified in documentation.
Codex uncovers these norms by processing and aggregating millions of lines of open source code from
popular Ruby projects on Github.
5.1.1 Indexing and Abstraction
To build its database, Codex indexes more than 3,000,000 lines of code from 100 popular Ruby
projects on Github. It gathers these projects through the Github API by sorting all Ruby projects
on the number of watchers and then selecting the 100 projects most watched by other Github
users. Codex first breaks apart a project recursively into all constituent AST nodes and annotates
these nodes with metadata; next, it normalizes all the AST nodes and collapses those that share a
normalized form into a single generalized database entry. The unparsed representation of each of
these normalized nodes is a Codex snippet.
Snippets of Ruby source code tend to be syntactically unique due to high variance in identifier
names and primitive values. Pattern finding tools usually need to abstract away some properties if
they are to find meaningful statistical patterns [38, 8, 20]. While we might implement normalization
in many different ways, Codex groups together snippets that are functionally similar by standardizing
the names of local variables and primitives. For some snippets (e.g., variable assignment) Codex
also keeps track of the original identifiers to enable variable name analysis.
Specifically, Codex’s normalization renames variable identifiers, strings, symbols, and numbers.
The first unique variable in a snippet would be renamed var0, the next var1, the first string str0,
and so on. Codex does not normalize class constants and function calls, as these abstractions
provide information important to Codex’s task-oriented search functionality and statistical linting.
As programmers use many different variable names and primitive values when accomplishing a
specific task, abstracting away these names helps Codex represent the core behavior of a snippet.
For instance, consider the Ruby snippet:
[:CHI, :UIST].map do |z|
  z.to_s + "is a conference"
end
After normalization, this snippet will be:
[:sym0, :sym1].map do |var0|
  var0.to_s + "str0"
end
Normalization works less well when such primitives (e.g., specific string or number values) are
vital to the interpretation of a snippet. In the future, we will only normalize snippet variable names
and identifiers if there is sufficient entropy in their definitions across similar snippets. Snippets with
vital identifiers are likely to be more consistent. Other normalization schemes may succeed as well,
but we find that this approach successfully collapses most similar snippets together.
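The renaming scheme can be approximated with a short sketch. Codex itself normalizes AST nodes; the version below instead works on the token stream from Ripper (part of Ruby's standard library), which keeps the example self-contained but is cruder. The helper name normalize_snippet and the "num" prefix are hypothetical, and one known simplification is that a method called without a receiver (e.g., puts x) would be renamed as if it were a variable.

```ruby
require "ripper"

# Rename local identifiers, symbols, string contents, and numbers in order of
# first appearance (var0, var1, sym0, str0, ...). Method names after "." and
# constants are left untouched.
def normalize_snippet(source)
  names = Hash.new { |hash, prefix| hash[prefix] = {} }
  rename = lambda do |prefix, original|
    table = names[prefix]
    table[original] ||= "#{prefix}#{table.size}"
  end

  previous = nil
  tokens = Ripper.lex(source).map do |(_position, type, text, _state)|
    replacement =
      case type
      when :on_ident
        if previous == :on_period
          text                        # method call on a receiver
        elsif previous == :on_symbeg
          rename.call("sym", text)    # symbol such as :name
        else
          rename.call("var", text)    # treated as a local variable
        end
      when :on_const
        previous == :on_symbeg ? rename.call("sym", text) : text
      when :on_tstring_content
        rename.call("str", text)
      when :on_int, :on_float
        rename.call("num", text)
      else
        text
      end
    previous = type unless type == :on_sp
    replacement
  end
  tokens.join
end

puts normalize_snippet(<<~RUBY)
  [:CHI, :UIST].map do |z|
    z.to_s + "is a conference"
  end
RUBY
```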
Codex applies a map-reduce to the database, collapsing AST nodes with the same normalized
form into a single AST entry. We collect additional parameters as part of the map-reduce step:
files, a list of each file in which the snippet occurs; projects, a list of projects in which the snippet
appears; count, the total number of times a snippet has appeared; file count, the number of unique
files in which a snippet has appeared; and project count, the number of unique projects in which a
snippet has appeared. Codex uses these parameters to enable the statistical and pattern finding modules.
Codex uses the Parser and AST Ruby gems by whitequark for AST processing. We have deployed
the Codex database on Heroku, using RethinkDB and MongoHQ.
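The collapse step can be pictured as a group-and-summarize over snippet occurrences. The following in-memory sketch is an assumption about the shape of the data, not Codex's database code: the Occurrence struct, the aggregate helper, and the sample records are all hypothetical.

```ruby
# One occurrence of a normalized snippet at one location in the index.
Occurrence = Struct.new(:normalized, :file, :project)

# "Map": key every occurrence by its normalized form.
# "Reduce": fold each group into one entry carrying the summary parameters.
def aggregate(occurrences)
  occurrences.group_by(&:normalized).map do |normalized, group|
    files    = group.map(&:file).uniq
    projects = group.map(&:project).uniq
    [normalized, {
      files: files,
      projects: projects,
      count: group.size,
      file_count: files.size,
      project_count: projects.size
    }]
  end.to_h
end

sample = [
  Occurrence.new("var0 = var1.to_s", "a.rb", "proj1"),
  Occurrence.new("var0 = var1.to_s", "a.rb", "proj1"),
  Occurrence.new("var0 = var1.to_s", "b.rb", "proj2"),
  Occurrence.new("var0.split(str0)", "b.rb", "proj2")
]
p aggregate(sample)["var0 = var1.to_s"]
```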
5.1.2 Statistical Analysis Module
Codex has modules that enable high-level and low-level pattern detection. First we describe the
low-level module, which focuses on syntactical patterns that occur among AST nodes.
The statistical analysis module allows Codex to warn users when code is unlikely. Codex decides
this likelihood using a set of statistics: the frequency of the snippet and also the frequencies of
component forms of the snippet (e.g., .to_s and .split for .split.to_s). When a snippet’s compo-
nent forms are sufficiently common and the snippet itself is sufficiently uncommon, Codex labels it
unlikely; that is, a snippet s must have occurred fewer than t times and each of its component pieces
c_i must have occurred at least t_i times.
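Read as a predicate, the rule says that the whole snippet is rare while each of its parts is common. The sketch below takes "sufficiently common" to mean a lower bound on every component's count; the helper name unlikely? and the threshold values are hypothetical, since the text does not give concrete values for t or the per-component thresholds.

```ruby
# A snippet is "unlikely" when it is rare as a whole (count below t) even
# though every component is common (count at or above its own threshold).
def unlikely?(snippet_count, component_counts, t:, component_thresholds:)
  rare_whole   = snippet_count < t
  common_parts = component_counts.zip(component_thresholds).all? do |count, threshold|
    count >= threshold
  end
  rare_whole && common_parts
end

# Hypothetical counts: a chain seen once whose two parts are each common.
puts unlikely?(1, [57, 100_000], t: 5, component_thresholds: [50, 1_000])
```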
Detecting Surprisingly Unlikely Code
Codex indexes many kinds of AST nodes (e.g., blocks, conditionals, assignment statements, function
calls, function definitions), but it conducts syntactic analysis upon a subset of these nodes. The
function by which a snippet of unlikely code is declared surprising differs based upon the type of
node in question. We discuss four representative analyses we have built to demonstrate the system:
1. Function Call Analysis: This analysis checks how many times a function has been called with
a given “type signature”, which Codex defines as the kind of AST nodes passed as arguments
(not the runtime type of the expression), relative to the number of times the function has been
called with other kinds of signatures. If a sufficiently common function appears with a type
signature that is very rarely observed by Codex, this may suggest problematic code. In
split(' ', 2), s is split(string, number); c1 is the name of the function, split; and c2 is the function
signature, e.g., [string, number]. Codex checks how many times split is called with string and number
arguments relative to other kinds of arguments.
2. Function Chaining Analysis: This analysis checks how many times one function has been
chained with another; that is, the result of some first function is used as the caller of some
second function. Here s is the function chain, e.g., split.to_s; c1 is the first function, e.g.,
split; and c2 is the second, e.g., to_s. Two functions that are often used but never chained
together suggest unusual code.
3. Block Return Value Analysis: This analysis checks how many times a certain kind of block has
returned a certain kind of value. For instance, it would be legal but unusual to write the code
things.each { |x| x.to_s }, which does transform every element in the things list to a string,
but does not alter things itself since to_s does not change the state of its caller (to change
the values in things, a programmer might instead use things.map! { |x| x.to_s }).
Here s is a kind of block with a particular return type, e.g., an each block with return
type of the to_s function; c1 is a kind of block, e.g., an each block; and c2 is a kind of block
return type, e.g., blocks returning the to_s function.
4. Identifier Analysis: This analysis checks how many times a variable identifier has been assigned
with a certain type of primitive. Often variable names suggest the type of the variable that they
reference; this analysis allows Codex to warn programmers about misleading or unconventional
variable names (e.g., str = 0 or my_array = {}). Here s is the variable name as assigned to a
particular type, e.g., str = 0; and c1 is the variable name, e.g., str.
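The block return value case is easy to verify directly in Ruby. The short sketch below shows why a value-returning block inside each is suspicious: each discards the block's result, whereas map and map! use it. The helper names are made up for the example.

```ruby
# each ignores the block's value and returns the receiver unchanged.
def stringify_with_each(things)
  things.each { |x| x.to_s }
end

# map collects the block's values into a new array; the receiver is untouched.
def stringify_with_map(things)
  things.map { |x| x.to_s }
end

# map! replaces each element in place with the block's value.
def stringify_in_place(things)
  things.map! { |x| x.to_s }
end

numbers = [1, 2, 3]
p stringify_with_each(numbers) # the original integers come back
p stringify_with_map(numbers)  # a new array of strings
p stringify_in_place(numbers)  # numbers itself now holds strings
```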
5.1.3 Pattern Finding Module
Whereas the statistical analysis module focuses on low-level syntactical structure, the pattern finding
module detects a set of high-level Ruby idioms and example snippets commonly reused by program-
mers. By constructing an appropriate query over the normalized snippets in its database, Codex can
find snippets that isolate common programming idioms. The pattern finding module also enables
other specific kinds of queries based on context (e.g., searching for certain library methods called
from within a map block.)
The general form of Codex’s pattern finding consists of a single query that is applied to the
database of abstracted snippets; we intend it to filter out snippets that programmers are less likely
to find interesting or useful. The query has five parameters, corresponding to attributes stored in
the database, and ordered here by their selectivity:
1. Project Count : the number of unique projects in which an abstracted snippet has occurred. A
lower bound of 2% of the number of projects indexed by Codex filters out snippets that tend
to be longer and more idiosyncratic.
2. Total Count : the total number of times an abstracted snippet has occurred. An upper bound
of the 90th percentile filters out overly trivial snippets (e.g., var0 = var1).
3. File Count : the total number of unique files in which an abstracted snippet has occurred. An upper
bound of 20% of the count of an abstracted snippet filters out snippets that are reused quite
a bit within one or more files; these snippets tend to be overly domain specific.
4. Token Count : the number of unique variables, function calls, and primitives that occur in an
abstracted snippet. An upper bound of the 80th percentile of all snippet token counts filters
out overly domain specific code.
5. Function Count : the number of unique function calls in a snippet. A lower bound of 2 filters
out trivial snippets.
These snippets are then passed to expert crowds, who attach metadata such as a title, description,
and measure of recommended usefulness.
Together, these parameters produce 9,693 abstracted snippets from the Codex database, corre-
sponding to 79,720 original snippets in the index. This query is designed to produce general purpose
snippets; other queries might be constructed differently to produce more domain specific results.
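A query of this shape can be sketched as a filter over snippet records. The version below is a partial illustration under stated assumptions: the hash keys, the helper names, and the nearest-rank percentile definition are hypothetical, and the file count criterion is deliberately left out because its exact form would have to be guessed. It therefore shows four of the five parameters.

```ruby
# Nearest-rank percentile (one of several common definitions; an assumption).
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Keep snippets that pass the project, total, token, and function count bounds.
def find_patterns(snippets, projects_indexed:)
  max_count  = percentile(snippets.map { |s| s[:count] }, 90)
  max_tokens = percentile(snippets.map { |s| s[:token_count] }, 80)

  snippets.select do |s|
    s[:project_count] >= 0.02 * projects_indexed && # not idiosyncratic
      s[:count] <= max_count &&                     # not overly trivial
      s[:token_count] <= max_tokens &&              # not overly domain specific
      s[:function_count] >= 2                       # not trivial
  end
end

# Hypothetical records; real entries would come from the snippet database.
records = (1..10).map do |i|
  { id: i, project_count: i, count: i, token_count: i, function_count: i }
end
p find_patterns(records, projects_indexed: 100).map { |s| s[:id] }
```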
Figure 5.1: The Codex IDE calls out a snippet of unlikely code by a yellow highlight in its gutter. Warning text appears in the footer.
5.2 Codex Applications
To ground the opportunities that Codex creates, we begin by introducing three software engineering
applications that draw on the Codex data model and high or low level code analysis. In general,
Codex enables interfaces and applications that are supported by emergent programming behavior
rather than a set of special-cased rules. Following this section, we discuss the techniques behind
these applications in more detail.
5.2.1 Statistical Linting
Sometimes, developers program badly: they write code that performs in unexpected ways or violates
language conventions. Poorly written code causes significant damage to software projects; bugs tax
programmers’ time and energy, and code written in an abstruse or non-idiomatic style is more
difficult to maintain [70, 30]. Given the complexity of programming languages, rule-based linters
can’t catch much of this unusual or non-idiomatic code.
Codex operates on the insight that poorly written code is often syntactically different from well
written code. For example, functions might be used in the wrong combination or order. So if we
collect and index a set of code that is representative of best practices, bad code will often diverge
syntactically from the code in this index. Not all syntactically divergent code is bad — the space
of well written Ruby programs is very large — but by applying high-precision detectors to a few
general AST patterns, Codex can detect syntactically divergent code that is likely to be problematic.
Function Chaining and Composition
Programmers frequently chain and compose functions and operators to create complex algorithmic
pipelines, but chaining the wrong kinds of functions together will often cause subtle program bugs.
For example, bugs might arise from functions chained in the wrong order, or variables added or
assigned in ways they should not be. Codex helps programmers find potential bugs in function
chains by identifying unlikely combinations of functions.
For example, if Ava is querying a database that has been normalized to lower case, she needs to
convert a string held by the variable name to lower case form. She intends to assign the lower case
variant of name to the variable lower_case_name and use this new variable in her query. The Ruby
methods downcase and downcase! will both convert a string variable to lower case, and without
thinking too deeply, Ava codes: lower_case_name = name.downcase!
Unfortunately, Ava has forgotten that downcase! has a side-effect: it changes the variable name
in place and returns itself. The function she ought to have used is downcase, which returns a new
lower cased string and does not change name. When Ava later uses name elsewhere in her program,
it doesn’t hold the value she expects.
Codex indicates that the line of code is statistically unlikely: downcase! is not commonly chained
with an assignment statement (although such code is not technically incorrect). Codex notifies Ava
that it has observed downcase! 57 times, and the abstraction var = var.any_method more than
100,000 times, but it has only encountered one variant of Ava’s combined snippet. However, Codex
has encountered variants of the correct snippet, lower_case_name = name.downcase, more than 200
times.
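One way to picture this check is the sketch below. It is not Codex's actual algorithm, which this section does not spell out; the unlikely_combination? helper and its thresholds are hypothetical, and the counts are the ones quoted above:

```ruby
# Hypothetical helper: flag a combination whose parts are individually
# common in the index but which almost never appear together.
def unlikely_combination?(part_counts, joint_count, min_part: 50, max_joint: 1)
  part_counts.all? { |count| count >= min_part } && joint_count <= max_joint
end

# Counts quoted in the text: downcase! seen 57 times, the abstraction
# var = var.any_method more than 100,000 times, the combination once.
warn_bang = unlikely_combination?([57, 100_000], 1)

# The variant using downcase was seen more than 200 times. The count for
# downcase alone is not given in the text, so 200 is used as a lower bound.
warn_plain = unlikely_combination?([200, 100_000], 200)
```

Under these assumed thresholds, the downcase! assignment is flagged and the downcase assignment is not.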
Block Return Value Analysis
Ruby programmers often manipulate data by passing blocks (lambda-like closures) to functions, but
using the wrong kind of block, or passing a block to the wrong kind of function, can process data in
unintended ways. Codex identifies unlikely pairings of functions and block return values.
For example, as part of a data analysis pipeline, Ash wants to raise every number in a list to the
power of 2. He tries to do this with a map block, but encounters a problem (he uses the operator ^ in
place of **) and adds a puts (print) statement inside the map block to help him debug his mistake:
new_nums =
  nums.map do |x|
    x^2
    puts x
  end
In doing so, Ash has introduced another bug. The puts method returns nil, which means that
new_nums will be a list of nil values. When Ash runs his code, this new error complicates his old problem.
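Both of Ash's problems can be reproduced in a few lines. This is a minimal sketch with an assumed input list; the final line shows one possible fix, not necessarily the one Ash would choose:

```ruby
nums = [1, 2, 3]

# The last expression in the block is puts, which returns nil,
# so map collects nil for every element (and x ^ 2 is discarded).
buggy = nums.map do |x|
  x ^ 2
  puts x
end

# ^ is bitwise XOR on integers; ** is exponentiation.
xor_result = 3 ^ 2
power_result = 3**2

# One possible fix: make the squared value the block's last expression.
new_nums = nums.map { |x| x**2 }
```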
Codex returns a warning: most programmers do not return the method puts from a map block.
We can anchor this concern in data: Codex has observed map blocks 4297 times and puts statements
5335 times, but it has never observed puts as the last line (an implicit return) of a map block. However,
Codex observes that puts statements are a common return value of blocks that are predominantly
used for control flow, like each (observed 272 times), so it produces a warning.
Function Type Analysis
Passing the wrong kinds of arguments to a function, or passing positional arguments in the wrong
order, can lead to many subtle bugs — especially in duck typed languages like Ruby. However,
by analyzing the kinds of AST nodes passed as positional arguments to functions, Codex can warn
users about unlikely function signatures.
For example, Severian wants to divide a few hundred datapoints into ten buckets, depending on
their id number. To do this, he needs to initialize an array of ten elements, where each element is a
hash. Severian codes: Array.new({},10).
Unfortunately, Severian doesn’t often initialize arrays with specific lengths and values, and he has
reversed the arguments of Array.new. When he executes his code, it fails with the error, “TypeError:
can’t convert Hash into Integer.”
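The failure and a working alternative can be sketched as follows. The exact wording of the TypeError message varies across Ruby versions, so the sketch records only the exception class:

```ruby
# Reversed arguments: the first positional argument must be the size,
# so passing a Hash there raises TypeError.
error_class = begin
  Array.new({}, 10)
  nil
rescue TypeError => e
  e.class
end

# Size-first order. Array.new(10, {}) reuses one Hash object for every
# slot, whereas the block form builds a separate Hash per bucket.
shared = Array.new(10, {})
buckets = Array.new(10) { {} }
```

For Severian's bucketing task the block form is probably the safer choice, since writing into one bucket then leaves the others unchanged.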
Codex could have told Severian that programmers don’t often pass Array.new an argument list
composed of a string and integer. While Codex observes Array.new 674 times, it has never observed
Array.new with string and integer arguments. However, Codex observes the parameterization
Array.new(integer,string) several times, and this size-first argument order is the one Severian needs.
Variable Name Analysis
Good variable names provide important signals about how a variable should be treated and lead
to more readable code [70]. Likewise, badly named variables can lead to poor code readability and
downstream program errors. By analyzing variable name associations with primitive values (e.g.,
Strings, Integers, Hashes), Codex can warn programmers about violations of naming conventions.
For example, Azazel is writing a complicated function to process a large dataset from a database
call. He is collecting the data in an Array called array. However, he later realizes that a hash
would be simpler to manage and changes the variable type. In a rush, Azazel changes the variable’s
type but doesn’t bother to change its name: array = {}. Later, Ash, who is Azazel’s coworker, is
looking elsewhere in the function and sees a line array.keys { ... }. He wonders, does an Array
have keys? He hadn’t thought so.
With Codex, Azazel would instead be notified that most programmers do not initialize a variable named array
with a Hash value. While Codex observes initializations to variables named array 116 times and
variables assigned a Hash value many thousands of times, it has never observed the two together.
By contrast, Codex observes array = [] 46 times.
It is not wrong to assign a Hash value to a variable named array, but code that does so is likely
less readable and might lead to downstream errors. Codex can determine that such an assignment
violates Ruby convention. Likewise, Codex would notice integers stored in str or common loop
counters like i being initialized with values of other types.
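A rough sketch of such a name-to-type association follows. The observation list, the naming_violation? helper, and its threshold are all hypothetical; only the count of 46 for array = [] comes from the text:

```ruby
# Hypothetical observations of (variable name, assigned literal type).
# The "array" row follows the text (array = [] seen 46 times);
# the other rows are invented for illustration.
observations =
  [["array", :Array]] * 46 +
  [["opts", :Hash]] * 30 +
  [["i", :Integer]] * 80

# Tally how often each name is initialized with each type.
name_type_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
observations.each { |name, type| name_type_counts[name][type] += 1 }

# Hypothetical rule: warn when a frequently seen name is paired with a
# type that has never been observed for it.
def naming_violation?(counts, name, type, min_seen: 10)
  seen = counts.fetch(name, {})
  seen.values.sum >= min_seen && !seen.key?(type)
end

array_as_hash = naming_violation?(name_type_counts, "array", :Hash)
array_as_array = naming_violation?(name_type_counts, "array", :Array)
```

With these assumed numbers, array = {} is flagged while array = [] is not.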
The Codex IDE integrates CodexLint (Figure 5.1), allowing users to call up statistics about any
line of code in the editor. The linter also runs behind the scenes during development, highlighting
any unlikely code with a yellow overlay on the window gutter. When the cursor moves over a marked
line, a small message appears on the lower bar of the Codex window, e.g., “We have seen the function
split 30,000 times and strip 20,000 times, but we’ve never seen them chained together.”
Example Codex Annotated Snippets

HTTP Basic Auth
    if var0.user
      var1.basic_auth(var0.user, var0.password)
    end
Sets the basic-auth parameters (username and password) before making an HTTP request, perhaps using Net::HTTP

Popping Options Hash from Arguments
    if Hash === var0.last
      var0.pop
    else
      {}
    end
Pops the last element from the list 'var0' if it is a Hash. Gives an empty hash if the last element is not a Hash

Raise StandardError
    raise(StandardError.new("str0"))
Raise a StandardError exception using "str0" as exception message

Configure action controller to disable caching
    config.action_controller.perform_caching=(false)
This will set a global configuration related to caching in action controller to false

Table 5.1: Codex identifies common programming snippets automatically, then feeds them to crowd-sourced expert programmers for metadata such as the bolded title and descriptive text.
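To make one of these normalized snippets concrete, the sketch below places the "popping options hash" idiom inside a method. The connect method and its arguments are hypothetical and serve only to show the idiom in context:

```ruby
# Hypothetical method built around the "popping options hash" idiom:
# a trailing Hash in the argument list is treated as options.
def connect(*args)
  options =
    if Hash === args.last
      args.pop
    else
      {}
    end
  [args, options]
end

with_options = connect("example.org", 80, { timeout: 5 })
without_options = connect("example.org", 80)
```

In the first call the trailing hash is removed from the positional arguments; in the second the options default to an empty hash.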
5.2.2 Pattern Annotation
Many valuable programming idioms are not collected in documentation or on the web. While users
can access standard library documentation for core abstractions (e.g., for Ruby, http://ruby-doc.org/),
and libraries often ship with similar kinds of documentation provided by their maintainers, the
common idioms by which basic functions may be combined and extended often remain uncodified.
Instead, these idioms live in the minds of programmers and — sometimes — on the message boards
of communities and forums. Novice users of languages and libraries must “mind the gap” present in
official forms of documentation.
Codex fills in gaps of practice-driven knowledge by detecting common idioms as it indexes code
and sending them out to be filtered and annotated by a crowd of human experts. Codex finds
these idioms by selecting snippets in its database with query parameters such as commonality and
complexity. These selected snippets (e.g., that appear in a large number of unique projects and are
sufficiently nontrivial) are primed for annotation and human filtering. For instance, over the course
of its indexing, Codex identifies inject { |x,y| x + y } as a common snippet, occurring 15 times
across 4 projects.
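The selection step can be pictured with the following sketch. The Snippet record, its field names, the complexity scores, and the thresholds are assumptions made for illustration; only the inject counts (15 occurrences, 4 projects) come from the text:

```ruby
# Hypothetical record for an indexed snippet; field names are assumptions.
Snippet = Struct.new(:code, :occurrences, :projects, :complexity)

index = [
  Snippet.new("inject { |x,y| x + y }", 15, 4, 6),  # counts from the text; complexity invented
  Snippet.new("var0.to_s", 900, 80, 1),              # invented: common but trivial
  Snippet.new("var0.frobnicate(var1)", 2, 1, 5)      # invented: nontrivial but rare
]

# Keep snippets that appear in enough distinct projects and are nontrivial.
def annotation_candidates(snippets, min_projects: 3, min_complexity: 3)
  snippets.select do |s|
    s.projects >= min_projects && s.complexity >= min_complexity
  end
end

candidates = annotation_candidates(index).map(&:code)

# The selected idiom itself sums the elements of a list.
total = [1, 2, 3].inject { |x, y| x + y }
```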
Next, Codex sends these snippets—strings of Ruby code, along with examples of them in use—to
a Ruby expert on oDesk, a paid expert crowdsourcing platform. The worker annotates the snippet
with a title (e.g., “sum all the elements in a list”), a description, and a vote of how useful the snippet
would be for an everyday Ruby programmer. Codex stores these annotations in its index along with
the original source snippet, making previously implicit knowledge explicit. Eventually, we envision
a community of Ruby programmers that annotates snippets of interest.
The Codex IDE uses this annotated snippet information to provide higher-level interpretability
to code. The annotations appear whenever a programmer opens a file containing the idiom. Users
benefit from annotated code under many different scenarios: perhaps using code scavenged from
a web tutorial; opening an unfamiliar file passed on by a collaborator; revisiting a segment of
copy/pasted code; or trying to recall the use of an idiosyncratic library function.
Consider one such user, Morwenna, who is collaborating with a colleague on a Ruby on Rails
application. Morwenna hasn’t had much experience with Rails, so she begins navigating the many files
of her colleague’s code in an attempt to build familiarity with the framework. While visiting
config/application.rb, Morwenna comes across the snippet config.action_controller.perform_caching =
false and wonders what this means. Codex indicates the line has an annotation, so she asks to see
it. The annotation reads, “Turns off default Rails caching.”
The Codex IDE calls out and displays any available and relevant annotations (Figure 1.3). When
the cursor moves over a line where annotations are available, a user can call them into the sidebar.
We present examples of these annotated snippets in Table 5.1. In general, Codex’s annotation
system uncovers higher-level connections between more basic program components. For instance,
human workers can infer the relation of a snippet to some outside library, providing context that isn’t
explicitly present (e.g., Net::HTTP or Ruby on Rails). Similarly, Codex allows for the documentation
of higher-level idioms, where programmers can find each component in documentation but not the
snippet itself, like the combination of raise and StandardError.new.
Querying for Understanding
In addition to the automatic idiom detection provided by the pattern finding module, users can query
Codex directly to better understand community practices around a line or block of code. Queryable
parameters include the type of AST node (e.g., a block, conditional, or function call), the body
string of the normalized code associated with a node, the body string of original code, the amount
of information contained in an AST node (i.e., a measure of code complexity), and the frequency of
a node’s occurrence across files and projects.
For instance, from a library-driven standpoint, suppose that programmers want to know more
about how people use the Net::HTTP class. They can query for all blocks that contain Net::HTTP.new,
sorting on the ones that occur most often. By the diversity of this result set, programmers gain
a sense of the kinds of context in which Net::HTTP is used, even more so if any of the results
have been annotated by Codex’s crowdsourcing engine. This is a more query-driven approach to
example-driven development [12, 65].

Function                      Description
Array#sort_by_index(idx)      Sort an array by the value at idx
Array#convert_join(str)       Converts each array element to a string then joins them all on str
Array#upto_size               Create a range, same size as the array
String#capital_tokens(str)    Capitalize all tokens in a string
Hash.nested                   Create a hash with default value {}
Hash#get(key)                 Retrieve based on :key or “key”
File#try_close                Close a file if it’s open
Table 5.2: A sample of functions from CodexLib, detected in emergent programming practice and encapsulated into a new standard library.
Queries also have applications in other more IDE-specific components like auto-complete, where
the IDE might attempt to find the most common completion for a snippet of code, given additional
context. For example, with the line Hash.new and an open block, Codex suggests the completion block
{ |h,k| h[k] = [] }, which initializes the default value of a hash to a new empty Array. Codex’s
user query system enables a broad set of functionalities including code search, auto-complete, and
example discovery.
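The suggested completion is useful for grouping values under keys. A minimal sketch with an assumed word list:

```ruby
# Each missing key gets its own fresh Array on first access.
words_by_length = Hash.new { |h, k| h[k] = [] }

%w[map each inject select].each do |word|
  words_by_length[word.length] << word
end
```

Without the default block, the first append for each new key would fail because the lookup would return nil.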
5.2.3 Library Generation
Many of the Ruby snippets discovered by Codex are modular, reusable components. The recompos-
able nature of these snippets suggests that programmers might benefit from their encapsulation in a
new standard library that reflects the “missing” functionality that Ruby programmers actually use.
Programmers may sometimes engage in unnecessary work: both the mechanical work of typing out
repetitive syntax, and also the mental work of caching task-oriented semantics in working memory.
Here we present CodexLib, a library created by emergent practice (Table 5.2). Unlike human
language, which evolves over time (e.g., “personal computer” becomes “PC” and “smartphone”
emerges to describe a new class of devices), programming languages and libraries often remain more
static. CodexLib suggests programming libraries can similarly evolve based on actual usage.
Consider one common Ruby idiom, creating a new Hash object where its default lookup value is
another empty Hash. This nested hash object allows programmers to write code in a matrix-like style,
e.g., hash["Gaiman"]["Coraline"] = true. Programmers usually create a nested hash with the
snippet, Hash.new { |h,k| h[k] = {} }. The nested hash idiom is 22 characters long and involves
some nontrivial tracking of syntactic details, yet it appears 66 times across 12 projects. Programmers
would likely benefit from the creation of a shorter library function. Using CodexLib, they can create
a new nested hash with the code Hash.nested, which is only 10 characters long and has far fewer
opportunities for error.
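A plausible definition of Hash.nested simply wraps the idiom above. The sketch below is an illustrative guess at such a method, not the CodexLib source:

```ruby
# A possible definition of Hash.nested, wrapping the idiom from the text.
# This is an assumption for illustration, not the CodexLib implementation.
class Hash
  def self.nested
    new { |h, k| h[k] = {} }
  end
end

hash = Hash.nested
hash["Gaiman"]["Coraline"] = true
```

As with the original idiom, only the first level of lookup is defaulted; the inner hashes created this way are ordinary hashes.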
Alternatively, consider the Ruby idiom to capitalize each word token in a string, which occurs
10 times across 5 different projects:
var0.split(/str0/).map do |var1|
  var1.capitalize
end.join("str0")
This idiom is dense and not immediately self-descriptive; it contains four function calls and a
block within three lines. The code splits var0 on str0 (in practice, usually “ ”) to produce an array,
applies capitalize to each element in this array, then uses join to knit the array into a new string again
using str0. Programmers might benefit from a simpler way to express this task. Using CodexLib
they can achieve the same result with the shorthand code: var0.capital_tokens("str0").
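A sketch of how such a method might encapsulate the idiom follows. The default separator and the use of a string rather than a regular expression in split are assumptions, and this is not the CodexLib source:

```ruby
class String
  # Hypothetical encapsulation of the split/map/join idiom.
  # The default separator of a single space is an assumption.
  def capital_tokens(sep = " ")
    split(sep).map do |token|
      token.capitalize
    end.join(sep)
  end
end

title = "the graveyard book".capital_tokens(" ")
slug = "modeling-patterns".capital_tokens("-")
```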
CodexLib is a layer on top of the Codex snippet database. To construct it, we extract the most
popular idioms and their crowdsourced descriptions from the database. For this small number of
functions, it is feasible to manually write function signatures and encapsulate them in new class
methods for Hash, Array, String, Float, File, and IO (Table 5.2). Programmers can download this
library as a Ruby gem at http://hci.st/codexlib.
5.3 Evaluation
Codex hypothesizes that we can build new software engineering interfaces by using databases that
model practice-driven knowledge for programming languages. In this section, we provide evidence
for three claims:
1. The 3,000,000 snippets in the Codex database are sufficient to characterize and analyze a
broad swath of program behavior. We measure the redundancy of AST nodes as Codex indexes
increasing amounts of code.
2. Codex captures a set of snippets that are recomposable and task-oriented. We ask oDesk Ruby
experts to describe and review a subset of the Codex patterns.
3. Codex allows us to identify unlikely code, without too many false positives. We evaluate the
number and kinds of warnings that Codex throws across a test set of 49,735 lines of code.
5.3.1 The Codex Database
The Codex database is composed of more than 3,000,000 lines of open source code, indexed from
100 popular Ruby projects on GitHub. These projects come from a diverse set of application areas,
including programming languages, plugins, webservers, web applications, databases, testing suites,
and API wrappers.
Figure 5.2: A plot of Codex’s hit rate as it indexes code over four random samples of file orderings. The y-axis plots the database hit rate, and the x-axis plots the number of lines of code indexed.
We designed Codex to reflect programming practice. Programming is open ended — the number
of valid strings of source code in most languages is infinite — so no database can hold information
about every possible AST node or program. However, programming is also highly redundant when
examined at a small enough level of granularity [29]. Of the approximately 7 million AST nodes that
Codex has indexed, only 13% are unique after normalization. Among the more complex types of
AST nodes we see variability in this redundancy. For example, among block nodes 74% are unique,
and among class nodes 85% are unique (Table 5.3).
To evaluate the breadth of code that Codex knows about, we examine the overall hit rate of its
database as it indexes more code. That is, when indexing N lines of code, what percentage of its
normalized AST nodes have already been seen before they are added to the database? We analyzed
the raw Codex dataset for values ranging from 92 to 3,000,000 lines of code across four random
samples of file ordering.
Codex’s hit rate exceeds 80% after 500,000 lines of code (Figure 5.2), meaning that Codex had
already observed 80% of the AST nodes after normalization. Different AST node types display
slightly different curves, with the same overall shape. Many of the nodes we are interested in for
statistical analysis are more complex, and so they are less amenable to the leveling of this curve.
However, were Codex to index more code, its hit rate would increase even further.
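The hit rate measurement can be sketched as a single pass over normalized nodes in indexing order. The hit_rate helper and the example stream are hypothetical, and this cumulative form is an assumption about how the curve in Figure 5.2 is computed:

```ruby
require "set"

# Returns the fraction of nodes that were already in the index when added.
def hit_rate(normalized_nodes)
  seen = Set.new
  hits = 0
  normalized_nodes.each do |node|
    hits += 1 if seen.include?(node)
    seen << node
  end
  normalized_nodes.empty? ? 0.0 : hits.to_f / normalized_nodes.size
end

# Invented stream of normalized snippets, in indexing order.
stream = ["var0.downcase", "var0.split(str0)", "var0.downcase",
          "var0.downcase", "var0.split(str0)"]

rate = hit_rate(stream)
unique_share = stream.uniq.size.to_f / stream.size
```

In this toy stream three of the five nodes are repeats, so the hit rate is 0.6 and the share of unique nodes is 0.4.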
5.3.2 Pattern Annotation
We asked professional Ruby programmers on the oDesk expert crowdsourcing marketplace to an-
notate 500 Codex snippets randomly sampled from the approximately 10,000 snippets that passed
Codex’s general pattern finding filter.
First, we asked crowdworkers to label each snippet with one of the categories: Data or Control
Flow, Standard library, External library, and Other (Table 5.4). The majority of snippets address
standard library tasks (76%), followed by external library tasks (14%), and tasks involving data or
control flow (9%). None fell outside these categories (Other = 0%).

Node Type              Percent Unique
Class definition       85%
Rescue statement       78%
Block statement        74%
Function definition    69%
If statement           66%
Interpolated string    29%
Function call          28%
Inlined hash           17%
Table 5.3: The percent of snippets that are unique after normalization for common AST node types.

Category               Percent of Snippets
Standard Library       76%
External Library       14%
Data or Control Flow   9%
Table 5.4: Programmers from an expert crowdsourcing market annotated Codex’s idioms with their usage type. The vast majority concern the use of standard, built-in libraries.
Next, we asked oDesk crowdworkers to answer: 1) Is this snippet a useful programming task or
idiom? 2) Can this snippet be encapsulated into a separate standalone function? 3) Is there a more
common way to write this snippet?
The oDesk Ruby experts reported that 86% of the snippets queued for annotation are useful,
96% are recomposable, and 91% have no more common form. These statistics indicate that Codex’s
pattern finding module produces snippets that are generally recomposable and reflective of good
programming practice.
5.3.3 Statistical Linting
Statistical linting relies upon the low-level properties of millions of lines of code to warn users about
code that is unlikely. Codex defines a general approach for detecting unlikely code, on which it
implements analyses for: type signatures, variable names, function chains, and block return types.
Here we evaluate to what extent CodexLint produces false positives on a test set of 49,735
lines of code.
As Codex seeks to identify unlikely code, and not program bugs, the distinction between true
positives and false positives is largely subjective. Inevitably, some users will want to be warned
about these properties, while others will not. Here we test the statistical linter against code known
to be of high quality. Supposing the number of warnings CodexLint suggests is small, relative to
the number of lines of code analyzed, this provides evidence that the statistical linting tool does not
suggest too many false positives.
We based our CodexLint test set on 6 projects randomly sampled and withheld from the 100
repositories collected to build Codex’s index. The test set projects contain a total of 49,735 lines of
code, and all of these projects are popular and widely used, with more than 100 watchers on GitHub
(as is the case for all the projects selected for indexing by Codex). Since 90% of the snippets annotated
through Codex’s pattern finding module are found by crowdsourced experts to be idiomatic, and
over 85% are rated as useful, we can safely assume that these projects generally do contain high-
quality code — the null hypothesis would be the principle, “garbage in, garbage out.” By treating
each warning it throws as a false positive, we arrive at a conservative estimate of the error rate.
Running CodexLint against the test set, we find that it generates 1248 warnings over 49,735 lines
of code; this suggests a conservative false positive rate of 2.5%.
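The rate follows directly from the two counts reported above, treating every warning as a false positive:

```ruby
warnings = 1248
lines_of_code = 49_735

# Warnings per line of code, expressed as a percentage.
false_positive_rate = (warnings.to_f / lines_of_code * 100).round(1)
```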
The most common category of false positive involves functions and blocks that appear at least
a few times across a number of projects, but that haven’t been observed enough for Codex to
appropriately model their behavior. For example, nodes and uri are part of an HTML parsing
library that Codex has only seen used in a few files, and the system throws a warning about their
combination, e.g., nodes.uri. We are working on a new technique to detect sparse functions based on
library dependencies and additional program context that will handle them separately in analysis.
The second most common false positive occurs when Codex observes two AST nodes, neither
of them particularly uncommon, together in a new and valid way, e.g., lambda blocks returning a
function call to rand, which did not appear at all in Codex’s index. Programming is an open-ended
task, and there will always be valid combinations of expressions that a system like Codex has not
encountered.
Other false positives are more ambiguous. For example, one project passes the map function
a string, which would usually produce an error. This project had overridden map to support new
functionality. Similarly, another file assigns a variable named @requests an integer value, and Codex
has only ever observed @requests as an array. Programmers might be well served by changing their
code in response to these warnings.
Finally, this false positive rate will decrease as the size of Codex’s index grows and fewer correct
code paths surprise it. As the statistical linting algorithm is based upon probability thresholds, users
can make the linter even more conservative by adjusting these thresholds — analogous to adjusting
the parameters of traditional linters.
5.4 Discussion and Limitations
The approach that Codex takes has limitations, many of which we plan to address with future work.
First, while we have collected evidence that suggests Codex’s index is large enough to encompass a
broad swath of program behavior, it is likely that many applications — such as pattern annotation
and statistical linting — would benefit from a larger index of code. We have tested Codex with
indexes as large as ten million lines of code, with no significant difference in the kinds of nodes and
statistical properties it detects. However, as the size of the index grows, there will be fewer and
fewer edge cases and false positives, and Codex will more easily detect idioms and make precise
statistical statements about combinations of AST nodes. Codex must balance its desire for more
coverage against the danger of indexing lower-quality code.
Second, many more kinds of program analyses can be defined beyond Codex’s current abstrac-
tions. All the analyses tested in the current version of Codex rely upon local properties of AST
nodes, and not the surrounding program context. By incorporating more of this context into analy-
ses, we might detect more complex properties (e.g., detecting that a user hasn’t initialized a MySQL
database wrapper).
Third, due to the subjective nature of CodexLint’s warnings, we have not determined a precise
rate of true positives and false positives. In future work, we might ask programmers to evaluate these
warnings, to better determine how often they are useful. Moreover, this paper does not address the
general question: do programmers really find it useful to know when they are violating convention?
We can determine the answer more concretely through longitudinal study.
Finally, while Codex models practice-driven knowledge for the Ruby programming language, our
techniques for processing AST nodes and generating statistics are applicable to any AST structure
or language. For example, it might be feasible to generate a Codex database for JavaScript by
crawling highly-trafficked web pages. Moreover, while we focused on a dynamic language due to its
popularity and flexibility of naturalistic usage, static languages provide additional metadata that
Codex could leverage. Extending Codex’s analyses to these other languages remains future work.
Chapter 6
Discussion
This thesis presents three systems that contribute techniques for modeling user behavior at scale,
operationalizing these models to enable new applications across human behavior, language, and code.
These systems solve a number of challenges, but introduce and motivate many others. For example,
how can we choose good representations for modeling system knowledge in open-ended domains such
as human life? And how can we build useful systems on top of such open-ended models, when we
do not know in advance what kind of information they may encode? In this section, I motivate and
discuss a series of these open questions, lessons, and challenges.
6.1 Data Mining in HCI
My work draws on data mining techniques to advance research in human-computer interaction.
These techniques allow systems to better understand user behavior: for instance, in the domains of
writing, programming, or ubiquitous computing. Systems can then leverage this understanding to
adapt and react to current or future behavior.
Today, data mining is most often applied in the service of low level interactions. For example, a
device might learn from a user’s history of touch interactions to better decide what they are trying
to click on. The reason for this is two-fold: first, such targeted interactions provide a large amount
of easily interpretable training data; and second, improving the accuracy of such small interactions
reliably creates significant impact when improvements are deployed at scale.
The work I have presented offers a different perspective on the opportunities that data mining
presents to HCI, imagining how these techniques might help interfaces understand and help users
with higher level behaviors. This approach more often allows systems to engage in new kinds of
interactions with users, as opposed to refining existing interactions. For example, in our Augur
work, we reflect on what ubiquitous computing systems could do if they could understand the
many thousands of activities that people engage in, and the relationships between those activities.
However, achieving a high level understanding of user behavior through data mining is challenging
for exactly the same reasons that achieving low level understanding is tractable. You need to
answer some tough questions. Where does the data come from? How do you represent the high level
patterns you are interested in? And how reliable are the discovered associations?
Our work on Codex, focused as it was on programming tools, had the easiest time answering
these questions. Open source code is abundant on the web and can be represented through high
level AST-based parses, and interactions designed to help users can be drawn into an IDE in a
relatively innocuous way. For example, if a linting suggestion based on a high level code pattern
is wrong, the worst that might happen is you waste a user’s time by bringing it to their attention.
In contrast, our Augur work, which was focused on human life and behavior in the broadest sense,
had a difficult time with these same questions. Data about human behavior is not abundant, has
no natural representation, and inferences made on the basis of such behaviors can have damaging
real world consequences (even something as simple as turning off a light can be quite bad under
the wrong circumstances). So with Augur we needed to be much more creative about the source of
data—fiction—and how we could represent human behavior in a useful way.
Along these lines, the greatest opportunities in data mining for HCI will likely bring new datasets
and creative insights to old problems, as Augur brought fiction to ubiquitous computing. These
opportunities may also lie in domains where the data in question falls more naturally into a useful
high level representation (such as program ASTs) that can be applied to known problems. Our
Empath work provides some supporting evidence for this idea, as we applied the lessons we learned
in Augur to a much narrower problem—the creation of new lexicons—and produced a tool with
significant impact, used by many researchers including Facebook’s Data Science team.
6.2 Biases in Data-Driven Interfaces
It is important to understand the biases in our datasets and the models that we generate from them.
All of the work I have presented here contains such biases. Augur makes predictions about human
life based on the actions that characters take in fiction, learning from a source biased by drama
and stereotype. Empath draws on word associations learned from a wide variety of texts written by
many thousands of individuals, yet will often generate lexical categories that succumb to stereotypes
of race and gender, reflecting the broader attitudes of society. Even Codex is biased by the common
tendencies of the programmers who published the code it analyzed. And of course, interfaces based
upon the models learned from such data will be biased themselves.
Understanding biases is most important when models are trained on datasets that are quite
different from the ones they are being applied to. For example, the recent spectacular failures of
some computer vision algorithms can be traced back to training them on datasets that did not
contain enough racial diversity to match the populations they were analyzing [74]. This kind of
issue is particularly relevant to projects such as Augur, which draw their strength from the fact
that they are using novel and abundant sources of data. Fiction is of great benefit to Augur in that
it allows the system to turn a microscope on thousands of human lives without a large-scale data
collection effort or the need to invade anyone’s privacy. But fiction is also a great weakness in that
these human lives under analysis may not be realistic ones.
Many researchers are working on techniques that seek to bridge this disconnect. For example,
suppose we could collect a small but realistic distribution of the relationships between human ac-
tivities. If we then had a method that could compare the distribution drawn from fiction with the
real distribution, and identify dimensions along which the model exhibits dramatic bias, or gender
bias, or bias towards violence, we might use that information to transform the fictional distribution
and de-bias the model. Such an approach has been taken to remove gender stereotypes from word
embeddings [10], a technique which could be directly applied to tools like Empath, and might be
modified to apply to the fiction-based models.
However, even if we remove all the biases we can quantify, we still need to deal with the biases
we cannot, such as biases of absence. If a model is built upon books written only in the nineteenth
century, for instance, it is unlikely to contain much information about interactions between members
of same sex couples. And even if we know that a bias of absence exists, there is no way to address
that absence without simply finding a better dataset. In fact, we encountered this issue in our work.
Commercial fiction is not the most abundant source of mundane activities, and so we found more
suitable data: amateur fiction writers are less experienced in the craft of writing, and tend to leave
more of those details in. Sometimes finding more or better data is the only good solution.
As interactive systems become ever more driven by user and community data, we must consider
the potential biases that may spread from the data to the system. Analyzing and correcting for such
biases should become an important step in the design process. This is especially true for any work
that follows from this thesis, which relies heavily on unsupervised or semi-supervised learning.
6.3 Data Power vs. Modeling Power
One recurring challenge in this thesis work has been whether to put more effort into finding more
data or into building more sophisticated models. In some projects, such as Codex, coming up with
a better model made all the difference and increased the power of the system. In other projects,
such as Augur, we only succeeded once we had acquired several orders of magnitude more data and
ultimately threw away much of the original model’s complexity.
For many of the unsolved problems at the intersection of machine learning and HCI, the limiting
factor is the data. Give an off-the-shelf recurrent neural network (RNN) enough fiction, and it should
have little problem generating a realistic set of character behaviors over time. Give a similar RNN
enough mappings between English and code, and it will soon be translating your language into code
fragments. Choice of model can be important on the margins, but not as important as having a
large dataset that captures the kind of relationships you need.
However, the representational choices for a model’s features, inputs, and outputs remain critical
for defining the interaction boundaries of a system. This is often the hardest part of an HCI project
that mixes modeling and design, as it dictates the space of possible interactions. For example,
Augur reasons about activities that are defined as verb phrases with a human subject, such as
the phrase park car. These human activities are its basic units of reasoning and so determine the
kinds of predictions the model can make and the ways it can empower interactions. You give the
model a signal for park car and it might predict something like open door, perhaps allowing a
sufficiently intelligent car to open the door for you. You give the model a few pieces of context,
maybe screen, desk, and computer, and it can predict something like working, perhaps enforcing
a set of notification preferences you have assigned to that context. These kinds of input/output
relationships do not appear out of nowhere. They require deep thinking about the design space you
want to enable, and how you might gather the necessary signals from the data. The process is far
more involved than simply throwing a neural network at a new dataset.
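To make the input/output relationship concrete, the sketch below predicts a likely next activity from counts of which activity follows which. It assumes activities are already extracted as short verb phrases; the sequences, function names, and counting scheme are hypothetical and much simpler than Augur's actual model.

```python
from collections import Counter, defaultdict


def build_model(sequences):
    """Count, for each activity, which activities directly follow it."""
    following = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, activity, k=1):
    """Return up to k activities most often observed after `activity`."""
    counts = model.get(activity, Counter())
    return [name for name, _ in counts.most_common(k)]


# Hypothetical activity sequences, standing in for phrases mined from text.
sequences = [
    ["park car", "open door", "walk inside"],
    ["park car", "open door", "grab bag"],
    ["park car", "lock car"],
]
model = build_model(sequences)
prediction = predict_next(model, "park car")
```

Even in this toy form, the choice of unit (a verb phrase such as park car) fixes what the model can be asked and what it can answer, which is the representational decision the paragraph above describes.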
With systems like those presented in this thesis work, designers no longer need to plan in advance
every possible behavior they want an interface to understand. However, as researchers and meta-
designers of systems that enable systems, we still need to think ahead to the space of behaviors we
want to capture in the models that we create. This space represents the power of a model, and its
potential to enable new and useful interactions.
Chapter 7
Conclusion
Interfaces can benefit from understanding user needs across a diverse set of domains: activity pre-
diction, writing, and programming, among many others. While supervised learning techniques and
other statistical models provide powerful tools for learning patterns from user data, they still re-
quire a system designer to formulate a set of hypotheses in advance: a set of questions upon which
those models can be trained. In contrast, this thesis shows how interfaces can operationalize semi-
supervised or unsupervised models trained on data drawn from these domains to reason about user
actions in ways unanticipated by any system designer.
Moving forwards, I aim to explore how we can extend this approach to draw data from community
resources in a way that goes on to empower those resources, creating bidirectional interactions
between systems and their sources of data. I envision a future where systems engage in a virtuous
cycle: a system first learns from a community, then goes on to empower work in that community,
and finally learns again from what it has empowered the community to do.
Bibliography
[1] Project Gutenberg. https://www.gutenberg.org/.
[2] Gregory D Abowd, Anind K Dey, Peter J Brown, Nigel Davies, Mark Smith, and Pete Steggles.
Towards a better understanding of context and context-awareness. In Handheld and ubiquitous
computing, pages 304–307. Springer, 1999.
[3] Eytan Adar, Mira Dontcheva, and Gierad Laput. CommandSpace: Modeling the relationships
between tasks, descriptions and features. In Proc. UIST 2014.
[4] Marzieh Ahmadzadeh, Dave Elliman, and Colin Higgins. An analysis of patterns of debugging
among novice computer science students. In Proc. ITiCSE 2005.
[5] Apple Computer. Knowledge Navigator, 1987.
[6] Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix, and William Pugh.
Using static analysis to find bugs. In IEEE Software 2008.
[7] Niranjan Balasubramanian, Stephen Soderland, and Oren Etzioni. Rel-grams: A probabilistic
model of relations in text. AKBC-WEKEX Workshop 2012.
[8] Ira D. Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. Clone
detection using abstract syntax trees. In Proc. ICSM 1998.
[9] Michael S Bernstein, Jaime Teevan, Susan Dumais, Daniel Liebling, and Eric Horvitz. Direct
answers for search queries in the long tail. In Proc. CHI 2012.
[10] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.
Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In
Advances in Neural Information Processing Systems, pages 4349–4357, 2016.
[11] Margaret M Bradley and Peter J Lang. Affective norms for english words (anew): Instruction
manual and affective ratings. In Technical Report C-1, The Center for Research in Psychophys-
iology, University of Florida, 1999.
[12] Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. Example-centric
programming: integrating web search into the development environment. In Proc. CHI 2010.
[13] Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. Oppor-
tunistic programming: Writing code to prototype, ideate, and discover. In IEEE Software
2009.
[14] Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. Two
studies of opportunistic programming: interleaving web foraging, learning, and writing code.
In Proc. CHI 2009.
[15] Raymond P. L. Buse and Westley Weimer. Synthesizing api usage examples. In Proc. ICSE
2012.
[16] Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative schemas and their
participants. In Proc. ACL 2009.
[17] Angel X. Chang and Christopher D. Manning. Tokensregex: Defining cascaded regular expres-
sions over tokens. Technical Report 2014.
[18] Sunny Consolvo, David W. McDonald, Tammy Toscos, Mike Y. Chen, Jon Froehlich, Beverly
Harrison, Predrag Klasnja, Anthony LaMarca, Louis LeGrand, Ryan Libby, Ian Smith, and
James A. Landay. Activity sensing in the wild: A field trial of Ubifit Garden. In Proc. CHI
’08, pages 1797–1806, 2008.
[19] Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher
Potts. A computational approach to politeness with application to social factors. In Proc.
ACL 2013.
[20] Stephane Ducasse, Matthias Rieger, and Serge Demeyer. A language independent approach for
detecting duplicated code. In Proc. ICSM 1999.
[21] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as
deviant behavior: a general approach to inferring errors in systems code. In Proc. SOSP 2001.
[22] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical resource for
opinion mining. In Proceedings of LREC 2006.
[23] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information
extraction. In Proc. EMNLP 2011.
[24] Ethan Fast and Eric Horvitz. Identifying dogmatism in social media: Signals and models. In
Proc. EMNLP 2016.
[25] Ethan Fast and Eric Horvitz. Long-term trends in the public perception of artificial intelligence.
In AAAI, pages 963–969, 2017.
[26] Ethan Fast, Will McGrath, Pranav Rajpurkar, and Michael Bernstein. Mining human behaviors
from fiction to power interactive systems. In Proc. CHI 2016.
[27] Ethan Fast, Daniel Steffe, Lucy Wang, Michael Bernstein, and Joel Brandt. Emergent, crowd-
scale programming practice in the ide. In Proc. CHI 2014.
[28] Adam Fourney, Richard Mann, and Michael Terry. Query-feature graphs: bridging user vocab-
ulary and system functionality. In Proc. UIST 2011.
[29] Mark Gabel and Zhendong Su. A study of the uniqueness of source code. In Proc. FSE 2010.
[30] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements
of Reusable Object-Oriented Software. 1994.
[31] Gerd Gigerenzer. How to make cognitive illusions disappear: Beyond heuristics and biases. In
European Review of Social Psychology 1991.
[32] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In Proc. CVPR ’14, 2014.
[33] Scott A. Golder and Michael W. Macy. Diurnal and seasonal mood vary with work, sleep, and
daylength across diverse cultures. In Science, volume 333, pages 1878–1881, 2011.
[34] Max Goldman, Greg Little, and Robert C. Miller. Collabode: collaborative coding in the
browser. In Proc. CHASE 2011.
[35] Max Goldman and Robert C. Miller. Codetrail: Connecting source code and web resources. In
Proc. VL/HCC 2009.
[36] Mark Grechanik, Chen Fu, Qing Xie, Collin McMillan, Denys Poshyvanyk, and Chad Cumby.
Exemplar: Executable examples archive. In Proc. ICSE 2010.
[37] S. Greenberg and I. H. Witten. How users repeat their actions on computers: Principles for
design of history mechanisms. In Proc. CHI ’88.
[38] Björn Hartmann, Daniel MacDougall, Joel Brandt, and Scott R. Klemmer. What would other
programmers do: suggesting solutions to error messages. In Proc. CHI 2010.
[39] Björn Hartmann, Leslie Wu, Kevin Collins, and Scott R. Klemmer. Programming by a sample:
rapidly creating web applications with d.mix. In Proc. UIST 2007.
[40] Vasileios Hatzivassiloglou and Kathleen R McKeown. Predicting the semantic orientation of
adjectives. In Proceedings of the 35th annual meeting of the association for computational
linguistics and eighth conference of the european chapter of the association for computational
linguistics, pages 174–181. Association for Computational Linguistics, 1997.
[41] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the
naturalness of software. In Proc. ICSE 2012.
[42] Reid Holmes, Robert J. Walker, and Gail C. Murphy. Strathcona example recommendation
tool. In Proc. FSE 2005.
[43] Oliver Hummel, Werner Janjic, and Colin Atkinson. Code conjurer: Pulling reusable software
out of thin air.
[44] C. Hutto and Eric Gilbert. Vader: A parsimonious rule-based model for sentiment analysis of
social media text. In Proc. AAAI 2014.
[45] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descrip-
tions.
[46] Emre Kiciman. Towards learning a knowledge base of actions from experiential microblogs. In
AAAI Spring Symposium, 2015.
[47] Miryung Kim, Lawrence Bergman, Tessa Lau, and David Notkin. An ethnographic study of
copy and paste programming practices in oopl. In Proc. ISESE 2004.
[48] Andrew J. Ko and Brad A. Myers. Designing the whyline: a debugging interface for asking
questions about program behavior. In Proc. CHI 2004.
[49] Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of
massive-scale emotional contagion through social networks. In Proceedings of the National
Academy of Sciences, volume 111, pages 8788–8790, 2014.
[50] Ranjitha Kumar, Arvind Satyanarayan, Cesar Torres, Maxine Lim, Salman Ahmad, Scott R
Klemmer, and Jerry O Talton. Webzeitgeist: Design Mining the Web. In Proc. CHI 2013.
[51] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic
human actions from movies. In Proc. CVPR 2008.
[52] Yong Jae Lee, C Lawrence Zitnick, and Michael F Cohen. Shadowdraw: real-time user guidance
for freehand drawing. In Proc. SIGGRAPH ’11, 2011.
[53] Yang Li. Reflection: Enabling event prediction as an on-device service for mobile interaction.
In Proc. UIST ’14, pages 689–698, 2014.
[54] H. Liu and P. Singh. Conceptnet – a practical commonsense reasoning tool-kit. In BT Tech-
nology Journal 2004.
[55] Qun Luo and Weiran Xu. Learning word vectors efficiently using shared representations and
document representations. In Proc. AAAI 2015.
[56] David Mandelin, Lin Xu, Rastislav Bodík, and Doug Kimelman. Jungloid mining: helping to
navigate the api jungle. In Proc. PLDI 2005.
[57] Justin Matejka, Wei Li, Tovi Grossman, and George Fitzmaurice. CommunityCommands. In
Proc. UIST 2009.
[58] Justin Matejka, Wei Li, Tovi Grossman, and George Fitzmaurice. CommunityCommands. In
Proc. UIST 2009.
[59] William McGrath, Mozziyar Etemadi, Shuvo Roy, and Björn Hartmann. fabryq: Using phones
as gateways to communicate with smart devices from the web. In Proc. EICS 2015.
[60] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Proc. NIPS 2013.
[61] Tomas Mikolov, Wen tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space
word representations. In Proc. NAACL-HLT 2013.
[62] George A. Miller. WordNet: A lexical database for English. In Commun. ACM 1995.
[63] Saif M. Mohammad and Peter D. Turney. Crowdsourcing a word-emotion association lexicon.
In Computational Intelligence, volume 29, pages 436–465, 2013.
[64] Saif M Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. Sentiment, emotion,
purpose, and style in electoral tweets. In Information Processing & Management. Elsevier, 2014.
[65] Mathew Mooty, Andrew Faulring, Jeffrey Stylos, and Brad A. Myers. Calcite: Completing code
completion for constructors using crowds. In Proc. VL/HCC 2010.
[66] Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. Finding deceptive opinion spam
by any stretch of the imagination. In Proc. ACL 2011.
[67] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification
using machine learning techniques. Proc. ACL 2002.
[68] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word
count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71, 2001.
[69] Naiyana Sahavechaphan and Kajal Claypool. Xsnippet: mining for sample code. In Proc.
OOPSLA 2006.
[70] Robert C. Seacord, Daniel Plakosh, and Grace A. Lewis. Modernizing Legacy Systems: Software
Technologies, Engineering Process and Business Practices. 2003.
[71] Phillip Shaver, Judith Schwartz, Donald Kirson, and Cary O’Connor. Emotion knowledge:
further exploration of a prototype approach. In Journal of personality and social psychology,
volume 52, page 1061. American Psychological Association, 1987.
[72] Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving data
quality and data mining using multiple, noisy labelers. In Proc. SIGKDD 2008.
[73] Ian Simon, Dan Morris, and Sumit Basu. MySong: automatic accompaniment generation for
vocal melodies. In Proc. CHI 2008.
[74] Tom Simonite. When it comes to gorillas, Google remains blind. In Wired.
[75] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, An-
drew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over
a sentiment treebank. Proc. EMNLP 2013.
[76] Philip J Stone, Dexter C Dunphy, and Marshall S Smith. The general inquirer: A computer
approach to content analysis. MIT press, 1966.
[77] Jeffrey Stylos and Brad A. Myers. Mica: A web-search tool for finding api components and
examples. In Proc. VL/HCC 2006.
[78] Suresh Thummalapenta and Tao Xie. Parseweb: a programmer assistant for reusing open source
code on the web. In Proc. ASE 2007.
[79] Peter D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised
classification of reviews. In Proc. ACL 2002.
[80] Peter D Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of
semantics. Journal of artificial intelligence research, 37(1):141–188, 2010.
[81] Raoul-Gabriel Urma and Alan Mycroft. Programming language evolution via source code query
languages. In Proc. PLATEAU 2012.
[82] Mark Weiser. The computer for the 21st century. In Scientific American, volume 265, pages
94–104. Nature Publishing Group, 1991.
[83] Yunwen Ye and Gerhard Fischer. Supporting reuse by delivering task-relevant and personalized
information. In Proc. ICSE 2002.
[84] YoungSeok Yoon and Brad A. Myers. A longitudinal study of programmers’ backtracking. In
Proc. VL/HCC 2014.