CAPTURING HUMAN BEHAVIOR AND LANGUAGE FOR

INTERACTIVE SYSTEMS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Ethan Fast

August 2018

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/bk979gs1829

© 2018 by Ethan Fast. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Michael Bernstein, Primary Adviser


Maneesh Agrawala


Eric Horvitz

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

From smart homes that prepare coffee when we wake, to phones that know not to interrupt us dur-

ing important conversations, our collective visions of human-computer interaction (HCI) imagine a

future in which computers understand a broad range of human behaviors. Today our systems fall

short of these visions, however, because this range of behaviors is too large for designers or program-

mers to capture manually. In this thesis I will present three systems that mine and operationalize

an understanding of human life from large text corpora. The first system, Augur, focuses on what

people do in daily life: capturing many thousands of relationships between human activities (e.g.,

taking a phone call, using a computer, going to a meeting) and the scene context that surrounds

them. The second system, Empath, focuses on what people say: capturing hundreds of linguistic

signals through a set of pre-generated lexicons, and allowing computational social scientists to create

new lexicons on demand. The final system, Codex, explores how similar models can empower an

understanding of emergent programming practice across millions of lines of open source code. Be-

tween these projects, I will demonstrate how semi-supervised and unsupervised learning can enable

many new applications and analyses for interactive systems.

Acknowledgments

This thesis is dedicated to the many people who made it possible:

• Binbin Chen, who in addition to everything else can always be counted on to help brainstorm

and refine ideas;

• Jon Bassen, whose influence is similarly present in the work;

• my parents Kevin and Kathy Fast, supporting me in everything I do;

• my advisor Michael Bernstein, who first introduced me to HCI and has shaped my thinking

in profound ways;

• the many mentors I have had over the course of my PhD, including Eric Horvitz, Alex Aiken,

Joel Brandt, and Maneesh Agrawala;

• the undergraduate and graduate students I have collaborated with, including Will McGrath,

Pranav Rajpurkar, Julia Mendelsohn, Daniel Steffee, Lucy Wang, and Colleen Lee;

• my many friends and colleagues in the Stanford HCI group;

• my undergraduate mentor and advisor Westley Weimer, who taught me how to do research.

I was supported by an NSF Graduate Fellowship and a Brown Institute Grant for Media Innovation

over my time at Stanford. Special thanks to these groups for funding my work.

Contents

Abstract

Acknowledgments

1 Introduction

1.1 Human Life and Behavior

1.2 Human Language

1.3 Code Patterns

1.4 Thesis Overview

2 Related Work

2.1 Modeling Human Behavior

2.1.1 Mining community data

2.1.2 Ubiquitous computing interfaces

2.1.3 Knowledge representation

2.2 Modeling Human Language

2.2.1 Extracting signal from text

2.2.2 Text mining and modeling

2.3 Modeling Code Patterns

2.3.1 Mining software repositories

2.3.2 Bugfinding

2.3.3 Learning from code examples

2.3.4 Data-driven interfaces

3 Modeling Human Behavior

3.1 Augur

3.1.1 Human Activities

3.1.2 Object Affordances

3.1.3 Connections between activities

3.1.4 A data mining DSL for natural language

3.1.5 Mining activity patterns from text

3.1.6 Vector space model for retrieval

3.2 Augur API and Applications

3.2.1 Identifying Activities

3.2.2 Expanding Activities with Object Affordances

3.2.3 Predicting Future Activities

3.2.4 Applications

3.3 Evaluation

3.3.1 Bias of Fiction

3.3.2 Field test of A Soundtrack for Life

3.3.3 A stress test over #dailylife

3.4 Discussion

4 Modeling Signals in Human Language

4.1 Empath

4.1.1 Designing Empath’s categories

4.1.2 Refining categories with crowd validation

4.1.3 Empath API and web service

4.2 Empath Applications

4.2.1 Example 1: Understanding deception in hotel reviews

4.2.2 Example 2: Mood on Twitter and time of day

4.3 Evaluation

4.3.1 Comparing Empath and LIWC

4.4 Discussion

4.4.1 The role of human validation

4.4.2 Data-driven: who is actually driving?

4.4.3 Limitations

4.4.4 Statistical false positives

5 Modeling Patterns in Code

5.1 Codex

5.1.1 Indexing and Abstraction

5.1.2 Statistical Analysis Module

5.1.3 Pattern Finding Module

5.2 Codex Applications

5.2.1 Statistical Linting

5.2.2 Pattern Annotation

5.2.3 Library Generation

5.3 Evaluation

5.3.1 The Codex Database

5.3.2 Pattern Annotation

5.3.3 Statistical Linting

5.4 Discussion and Limitations

6 Discussion

6.1 Data Mining in HCI

6.2 Biases in Data-Driven Interfaces

6.3 Data Power vs. Modeling Power

7 Conclusion

Bibliography

List of Tables

3.1 We find average rates of 96% recall and 71% precision over common activities in the dataset. Here Ground Truth Frames refers to the total number of frames labeled with each activity.

3.2 As rated by external experts, the majority of Augur’s predictions are high-quality.

4.1 Empath can analyze text across hundreds of data-driven categories. Here we provide a sample of representative terms in 8 sample categories.

4.2 Crowd workers found 95% of the words generated by Empath’s unsupervised model to be related to its categories. However, machine learning is not perfect, and some unrelated terms slipped through (“Did not pass” above), which the crowd then removed.

4.3 We compared the classifications of LIWC, EmoLex and Empath across thirteen categories, finding strong correlation between tools. The first column represents comparisons between Empath’s unsupervised model against LIWC, the second after crowd filtering against LIWC, the third between EmoLex and LIWC, and the fourth between the General Inquirer and LIWC.

5.1 Codex identifies common programming snippets automatically, then feeds them to crowdsourced expert programmers for metadata such as the bolded title and descriptive text.

5.2 A sample of functions from CodexLib, detected in emergent programming practice and encapsulated into a new standard library.

5.3 The percent of snippets that are unique after normalization for common AST node types.

5.4 Programmers from an expert crowdsourcing market annotated Codex’s idioms with their usage type. The vast majority concern the use of standard, built-in libraries.

List of Figures

1.1 Augur mines human activities from a large dataset of modern fiction. Its statistical associations give applications an understanding of when each activity might be appropriate.

1.2 Empath learns word embeddings from 1.8 billion words of fiction, makes a vector space from these embeddings that measures the similarity between words, uses seed terms to define and discover new words for each of its categories, and finally filters its categories using crowds.

1.3 Codex draws on millions of lines of open source code to create software engineering interfaces that integrate emergent programming practice. Here, Codex’s pattern annotation calls out popular idioms that appear in the user’s code.

3.1 Augur’s activity detection API translates a photo into a set of likely relevant activities. For example, the user’s camera might automatically photojournal the food whenever the user may be eating food. Here, Clarifai produced the object labels.

3.2 Augur’s APIs map input images through a deep learning object detector, then initialize the returned objects into a query vector. Augur then compares that vector to the vectors representing each activity in its database and returns those with lowest cosine distance.

3.3 Augur’s object affordance API translates a photo into a list of possible affordances. For example, Augur could help a blind user who is wearing an intelligent camera and says they want to sit. Here, Clarifai produced the object labels.

3.4 A Soundtrack for Life is a Google Glass application that plays musicians based on the user’s predicted activity, for example associating working with The Glitch Mob.

3.5 We deployed an Augur-powered wearable camera in a field test over common daily activities, finding average rates of 96% recall and 71% precision for its classifications.

4.1 Deceptive reviews convey stronger sentiment across both positively and negatively charged categories. In contrast, truthful reviews show a tendency towards more mundane activities and physical objects.

4.2 We use Empath to replicate the work of Golder and Macy, investigating how mood on Twitter relates to time of day. The signals reported by Empath and LIWC by hour are strongly correlated for positive (r=0.87) and negative (r=0.90) sentiment.

4.3 Empath categories strongly agreed with LIWC, at an average Pearson correlation of 0.90. Here we plot Empath’s best and worst correlations with LIWC. Each dot in the plot corresponds to one document. Empath’s counts are graphed on the x-axis, LIWC’s on the y-axis.

5.1 The Codex IDE calls out a snippet of unlikely code by a yellow highlight in its gutter. Warning text appears in the footer.

5.2 A plot of Codex’s hit rate as it indexes code over four random samples of file orderings. The y-axis plots the database hit rate, and the x-axis plots the number of lines of code indexed.

Chapter 1

Introduction

People don’t use systems in isolation. When we write an email, develop code, or interact with a

virtual assistant, we are engaging in activities that many other people have done before us—even

if they have not done so in exactly the same way. Often, the way we interact with these systems

becomes shared knowledge. This knowledge sharing can be explicit, as in sharing a code snippet

on StackOverflow, or it can be implicit, as in replying to an email using business jargon that we’ve

seen a colleague apply to a similar situation. In either case, the systems we use are embedded in a

broader context of how other people use them.

This shared context can allow systems to better understand or anticipate our needs. For example,

if you have been writing code using a cryptographic library that other people don’t use because it

is too slow, this is something you might like to know (and avoid). Or if you have just come back

from a long run on a hot day, a smart home might guess you’d like a glass of cold water. In both

cases, systems can learn from the behavior of others to predict information relevant to you. This

vision is far from new. Mark Weiser described a similar scenario decades ago [82], and other ideas

of intelligent interfaces have been around for even longer.

While these ideas have been successfully applied to a small set of activities known in advance

to a system designer [2], the path to achieving such predictions in more open-ended domains has

remained largely unexplored. Consider the domain of human life, something that a smart home ought

to understand. A useful system must encode knowledge about thousands of potential activities a

user might engage in (e.g., cooking dinner, reading a book, calling a friend), and beyond that, the

relationships between them. There are far too many of these activities and relationships to manually

define in a system. Similar problems exist in other open domains such as writing or programming.

Here again the set of potentially valuable signals and patterns is enormous and does not lend itself

to pre-specification by a system designer.

In this thesis, I explore how systems can leverage semi-supervised and unsupervised learning

techniques to better understand the communities of users that surround them. By applying these

techniques to model knowledge, systems can learn a large set of actionable concepts that do not

need to be defined in advance by a system designer. I present systems that address user needs

across three different domains—programming, ubiquitous computing, and text analysis—based on

similar unsupervised models applied to community data. In each of these domains, textual datasets

record what many people have said or done, allowing us to bootstrap a vocabulary of higher level

abstractions among such behaviors. For example, one of these systems can learn that paying bills

usually happens after ordering food without its designer ever deciding that these activities should

exist as concepts. Another has learned that a Ruby function whose name ends in “!” should probably

modify its argument in place without anyone realizing that was an interesting syntactical analysis.

Yet another can tell a user that they are using language evocative of being a hipster.

Figure 1.1: Augur mines human activities from a large dataset of modern fiction. Its statistical associations give applications an understanding of when each activity might be appropriate.

1.1 Human Life and Behavior

Our most compelling visions of human-computer interaction depict worlds in which computers un-

derstand the breadth of human life. Mark Weiser’s first example scenario of ubiquitous computing,

for instance, imagines a smart home that predicts its user may want coffee upon waking up [82].

Apple’s Knowledge Navigator similarly knows not to let the user’s phone ring during a conversation

[5]. In science fiction, technology plays us upbeat music when we are sad, adjusts our daily routines

to match our goals, and alerts us when we leave the house without our wallet. In each of these

visions, computers understand the actions people take, and when.

Many years have passed since these visions were first articulated, and yet interactive systems

still lack a broad understanding of human behavior. Today, interaction designers instead create


special-case rules and single-use machine learning models. The resulting systems can, for example,

teach a phone (or Knowledge Navigator) not to respond to calls during a calendar meeting. But

even the most clever developer cannot encode behaviors and responses for every human activity – we

also ignore calls while eating lunch with friends, doing focused work, or using the restroom, among

many other situations. To achieve this breadth, we need a knowledge base of human activities, the

situations in which they occur, and the causal relationships between them. Even the web and social

media, serving as large datasets of human record, do not offer this information readily.

To solve this problem, we show it is possible to create a broad knowledge base of human behavior

by text mining a large dataset of modern fiction. Fictional human lives provide surprisingly accurate

accounts of real human activities. While we tend to think about stories in terms of the dramatic

and unusual events that shape their plots, stories are also filled with prosaic information about how

we navigate and react to our everyday surroundings. Over many millions of words, these mundane

patterns are far more common than their dramatic counterparts. Characters in modern fiction turn

on the lights after entering rooms; they react to compliments by blushing; they do not answer their

phones when they are in meetings. Our knowledge base, Augur (Figure 1.1), learns these associations

by mining 1.8 billion words of modern fiction from the online writing community Wattpad.

There are far too many human activities to enumerate in advance, much less to train and validate

independent predictive models over. We use an unsupervised vector space model to model relation-

ships between activities and scene context. We first extract activities through subject-verb-object

sequences determined by a dependency parse, then train a neural network to predict relationships

between these activities and their context over millions of lines of fiction. The weights learned by

the neural network produce a vector space that provides a representation of activities in terms of

other activities and scene context. This vector space encodes many thousands of relationships: for

example, associating activities such as eating with hundreds of food items and relevant tools such as

cutlery, plates and napkins, or associating one activity such as enter store with many others, such

as shop, grab cart, and pay. We go on to demonstrate how these models can be leveraged by a new

class of interactive systems, such as an automatic food diary or a system that warns users about

their bank balance when they are about to spend money, and evaluate the system through a user

study and deployment on Google Glass.
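As a rough illustration of how a vector space can relate activities to scene context, the following Python sketch builds sparse count vectors from a handful of made-up (activity, context term) pairs and ranks activities by cosine similarity to a set of observed scene terms. It substitutes raw co-occurrence counts for the learned neural network weights described above, and the data and the names `build_vectors`, `cosine`, and `rank_activities` are hypothetical rather than part of Augur.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (activity, context term) pairs standing in for the output of
# a dependency parse over fiction. These are illustrative, not Augur's data.
observations = [
    ("eat food", "plate"), ("eat food", "fork"), ("eat food", "napkin"),
    ("eat food", "plate"), ("enter store", "cart"), ("enter store", "cashier"),
    ("pay", "cashier"), ("pay", "wallet"), ("shop", "cart"), ("shop", "cashier"),
]

def build_vectors(pairs):
    """Map each activity to a sparse count vector over context terms."""
    vectors = defaultdict(Counter)
    for activity, term in pairs:
        vectors[activity][term] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_activities(vectors, scene_terms):
    """Return activities ordered from most to least similar to the scene."""
    query = Counter(scene_terms)
    return sorted(vectors, key=lambda a: cosine(query, vectors[a]), reverse=True)

vectors = build_vectors(observations)
print(rank_activities(vectors, ["plate", "fork"])[0])
```

Under these toy counts, a scene containing a plate and a fork ranks eat food first because no other activity shares any context term with the query.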

Figure 1.2: Empath learns word embeddings from 1.8 billion words of fiction, makes a vector space from these embeddings that measures the similarity between words, uses seed terms to define and discover new words for each of its categories, and finally filters its categories using crowds.


1.2 Human Language

Just as there is breadth in human life, there is also breadth in human language. Language is rich

in subtle signals. The previous sentence, for example, conveys connotations of wealth (“rich”),

cleverness (“subtle”), communication (“language”, “signals”), and positive sentiment (“rich”). A

growing body of work in human-computer interaction, computational social science and social com-

puting uses tools to identify these signals: for example, detecting emotional contagion in status

updates or linguistic correlates of deception [49, 66].

High quality lexicons allow us to analyze language at scale and across a broad range of signals. For

example, researchers often use LIWC (Linguistic Inquiry and Word Count) to analyze social media

posts, counting words in lexical categories like sadness, health, and positive emotion [68]. LIWC

offers many advantages: it is fast, easy to interpret, and extensively validated. Researchers can

easily inspect and modify the terms in its categories — word lists that, for example, relate “scream”

and “war” to the emotion anger. But like other popular lexicons, LIWC is small: it has only 40

topical and emotional categories, many of which contain fewer than 100 words. Further, many

potentially useful categories like violence or social media don’t exist in current lexicons, requiring

the creation of new gold standard word lists. Other categories may benefit from updating with modern

terms like “paypal” for money or “selfie” for leisure.
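To make lexicon-based analysis concrete, the sketch below counts how many tokens of a text fall into each category of a tiny word-list lexicon. The three categories reuse the example terms mentioned above, but the lexicon itself, the tokenizer, and the function name `category_counts` are assumptions for illustration. Real lexicons such as LIWC are far larger and are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon built from the example terms in the text.
lexicon = {
    "anger": {"scream", "war"},
    "money": {"paypal"},
    "leisure": {"selfie"},
}

def category_counts(text, lexicon):
    """Count tokens of `text` that appear in each category's word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return counts

print(category_counts("I wanted to scream when PayPal froze my account", lexicon))
```

Because a category is just a word list, a researcher can inspect or extend it by editing a set, which is the property the paragraph above attributes to hand-made lexicons. This sketch matches exact lowercase tokens only.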

To solve these problems, we have created Empath: a tool that allows researchers to generate and

validate new lexical categories on demand, using a combination of machine learning and crowdsourc-

ing. For example, using the seed terms “twitter” and “facebook,” we can generate and validate a

category for social media. Empath also analyzes text across 200 built-in, pre-validated categories

such as neglect (deprive, refusal), government (embassy, democrat), strength (tough, forceful), and

technology (ipad, android). Empath combines modern NLP techniques with the benefits of hand-

made lexicons: its categories are word lists, easily extended and fast. And like LIWC (but unlike

other machine learning models), Empath’s contents have been validated by humans.

Empath is powered by a skip-gram network that captures words in a neural embedding [60]. This

embedding learns associations between words and their context, providing a model of connotation.

We use similarity comparisons in the resulting vector space to map a vocabulary of 59,690 words

onto Empath’s 200 categories (and beyond, onto user-defined categories). We then can filter these

relationships through the crowd to efficiently construct new, human validated dictionaries. We show

how Empath’s model can replicate and extend classic work in classifying deceptive language [66]

and analyzing mood on Twitter [33]. Finally, we further validate Empath by comparing its analyses

against LIWC, a lexicon of gold standard categories that have been psychometrically validated. We

find the correlation between Empath and LIWC across a mixed-corpus dataset is high both with

(r=0.906) and without (r=0.90) the crowd filter. In sum, Empath shares high correlation with gold

standard lexicons, yet it also offers analyses over a dynamic set of categories.
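The idea of growing a category from seed terms can be sketched as a nearest-neighbor search in an embedding space. The example below uses invented three-dimensional vectors in place of a trained skip-gram embedding, and it forms the query by summing the seed vectors, which is one plausible choice rather than a description of Empath's implementation. The names `embeddings` and `expand_category` are hypothetical.

```python
import math

# Invented toy embeddings; a real model would learn these from a corpus.
embeddings = {
    "twitter":   [0.90, 0.10, 0.00],
    "facebook":  [0.80, 0.20, 0.10],
    "instagram": [0.85, 0.15, 0.05],
    "tweet":     [0.70, 0.30, 0.00],
    "embassy":   [0.00, 0.90, 0.30],
    "fork":      [0.10, 0.00, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_category(seeds, embeddings, top_n=2):
    """Return the top_n non-seed words closest to the summed seed vector."""
    dims = len(next(iter(embeddings.values())))
    query = [sum(embeddings[s][i] for s in seeds) for i in range(dims)]
    candidates = [w for w in embeddings if w not in seeds]
    ranked = sorted(candidates,
                    key=lambda w: cosine(query, embeddings[w]),
                    reverse=True)
    return ranked[:top_n]

print(expand_category(["twitter", "facebook"], embeddings))
```

With these toy vectors the seeds “twitter” and “facebook” pull in “instagram” and “tweet”. In a pipeline like the one described above, such candidate terms would then be passed to crowd workers for filtering.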


Figure 1.3: Codex draws on millions of lines of open source code to create software engineering interfaces that integrate emergent programming practice. Here, Codex’s pattern annotation calls out popular idioms that appear in the user’s code.

1.3 Code Patterns

Just as human language provides structure that we can leverage to build representations of activities

or encode relationships between words and concepts, program code gives us even more precise

structure that we can exploit to enable new kinds of programming tools and analyses.

In software development, the way people adapt to a system can be just as informative as its

original design. User practice and designer intention differ across several levels of abstraction:

programmers use library APIs in undocumented and unexpected ways [56], language idioms evolve

over time [81], and programmers repurpose source code for new tasks [8, 20]. Norms emerge for

programming systems that aren’t codified in documentation or on the web. What is the best library

to use for a task? Does this code follow common practice? How is a language being used today?

We can examine the ecosystem of open source software to find answers to these practice-driven

questions. The informal rules and conventions of programming languages and libraries are implicitly

present in open source projects, which, when analyzed, often illuminate the ways people code that are

too complex or uncommon to appear in official forms of documentation. We can then operationalize

this knowledge to support everyday programming practice.

To achieve this end, we present Codex: a knowledge base that models practice-driven knowledge

for the Ruby programming language. Codex provides a living, queryable database of how program-

mers write code, informed by popular open source Ruby projects. The system normalizes program

abstract syntax trees (ASTs) to collapse similar idioms and identifiers, filters these idioms and anno-

tates them using paid crowd experts, and then allows applications to query its database in support

of new data-driven programming interfaces.

In the domain of programming, emergent practice develops at both the high level of idioms, for

example a code snippet that initializes a nested hash, and at the low level of syntactical combinations

of code, for example blocks that return the result of an addition operation. Codex seeks to capture

both higher-level patterns of reusable program components and lower-level combinations and chains

of more basic programming units. Through the pattern finding module, Codex identifies commonly

reused Ruby idioms. This module uses typicality analysis to identify idioms such as Hash.new { |h,k|

h[k] = {} }, the most accepted way to initialize a nested hash table. Expert crowds then attach

metadata to these idioms, such as a title, description, and measure of recommended usefulness.

Alternatively, using the statistical analysis module, Codex can compute the frequencies of AST

node combinations, describing the uniqueness of syntactical patterns.
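The effect of AST normalization can be illustrated with a small sketch. Codex targets Ruby, but for a self-contained example the code below uses Python's built-in `ast` module on Python snippets. It replaces every variable name with a placeholder and every literal with the name of its type, so that snippets differing only in identifiers or constant values collapse to the same normalized form, and then counts how often each form occurs. The class `Normalizer` and the helper functions are hypothetical and do not reproduce Codex's actual normalization rules.

```python
import ast
from collections import Counter

class Normalizer(ast.NodeTransformer):
    """Collapse identifiers and literals so similar snippets compare equal."""

    def visit_Name(self, node):
        # Replace every variable name with the placeholder VAR.
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        # Replace every literal with the name of its type, e.g. 'int'.
        placeholder = ast.Constant(value=type(node.value).__name__)
        return ast.copy_location(placeholder, node)

def normalized_form(snippet):
    """Return a string dump of the snippet's normalized syntax tree."""
    tree = Normalizer().visit(ast.parse(snippet))
    return ast.dump(tree)

def snippet_frequencies(snippets):
    """Count how many snippets share each normalized form."""
    return Counter(normalized_form(s) for s in snippets)

# Hypothetical snippets: the first two differ only in names and constants.
snippets = ["total = total + 1", "count = count + 2", "print(name)"]
frequencies = snippet_frequencies(snippets)
print(frequencies.most_common(1)[0][1])
```

In this sketch attribute and method names are left untouched, so calls to different library methods stay distinct while local variable names are abstracted away.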

We present three applications that demonstrate how Codex supports programming practice and

software engineering interfaces. First, pattern annotation automatically annotates Ruby idioms

inside the IDE and presents these annotated snippets through a search interface. Second, statistical

linting identifies problematic syntax by checking code features (e.g., the kinds of AST nodes used

as function signatures or return values) against a large database of trusted and idiomatic snippets;

more generally, these statistics give programmers a tool to quantify the uniqueness of their code.

Finally, library generation pulls particularly common Ruby idioms into a new standard library —

authored not by individual developers but by emergent software practice — helping programmers

avoid the redefinition of common program components.
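A minimal sketch of the statistical linting idea follows, again using Python's `ast` module as a stand-in for Ruby tooling. It indexes the AST node type of each return value in a small, invented corpus, then warns about return values in new code whose node type accounts for less than a chosen share of the index. The corpus, the 20% threshold, and the function names are all assumptions for illustration.

```python
import ast
from collections import Counter

def return_kinds(source):
    """Yield the AST node type name of each explicit return value."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Return) and node.value is not None:
            yield type(node.value).__name__

def build_index(corpus):
    """Count return-value node types across a corpus of source strings."""
    index = Counter()
    for source in corpus:
        index.update(return_kinds(source))
    return index

def lint(source, index, threshold=0.2):
    """List return-value node types rarer than `threshold` in the index."""
    total = sum(index.values())
    warnings = []
    for kind in return_kinds(source):
        share = index[kind] / total if total else 0.0
        if share < threshold:
            warnings.append(kind)
    return warnings

# Hypothetical corpus of trusted code: three BinOp returns, one Name return.
corpus = [
    "def f(a, b):\n    return a + b",
    "def g(x):\n    return x * 2",
    "def h(x):\n    return x",
    "def k(x, y):\n    return x - y",
]
index = build_index(corpus)
print(lint("def m(x):\n    return lambda: x", index))
```

Here returning a lambda is flagged because no function in the toy corpus does so, while returning an arithmetic expression passes. A production linter would need a far larger index and richer features than a single node type.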

Codex enables new software engineering applications that are supported by large-scale program-

ming behavior rather than sets of special-cased rules. While other projects have crowdsourced

documentation for existing library functions [65, 15], mined code to enable query-based searching

for patterns or examples [56, 78], or embedded example-finding tools into an IDE [12, 14, 35, 69],

Codex augments traditional data mining techniques with crowds, presenting a broad data-driven

window into programming convention. We demonstrate how these kinds of emergent behavior can

inform new design opportunities for user interfaces.

1.4 Thesis Overview

To begin, Chapter 2 introduces a set of challenges faced by systems that seek to take advantage of

unsupervised models trained on unstructured datasets such as text and code. It then situates the

contributions of this thesis within the context of prior systems and models. Following this:

• Chapter 3 introduces Augur: a system that captures the relationships between more than

50,000 human activities and surrounding objects. I show that fiction provides a surprisingly

accurate source of knowledge about the activities of daily life, and how this knowledge can be

captured through unsupervised text mining to enable a new class of applications.

• Chapter 4 extends these ideas to Empath: a tool that analyzes a broad set of signals in text and

allows computational social scientists to generate new lexical categories on demand through

an unsupervised word embedding model.

• Chapter 5 shows how similar ideas apply to code. Codex is a database that models practice-

driven knowledge for programming languages informed by unsupervised models trained on

open source projects. This system enables new software engineering applications that are

supported by large-scale programming behavior rather than sets of known rules.

Finally, the remaining chapters reflect on the contributions of this thesis. Chapter 6 discusses new

questions, challenges, and opportunities raised by this work. Chapter 7 concludes with a vision of

systems that feed data back to the communities they draw from, inspiring a virtuous cycle.

Chapter 2

Related Work

For an interactive system to reason about an open-ended domain such as human life, it needs to

understand both user vocabulary—how a user seeks to interact with it—and also system vocabulary,

the underlying domain language upon which a system operates. This thesis builds on a history of

work that has similarly mapped user vocabulary to the domain language of a system. For example,

query-feature graphs show how user terminology can be connected with commands run by an inter-

active system [28], and other systems such as CommandSpace expand upon this idea to show how

such a mapping can exist when both user language and the set of commands executed by a system

are learned from existing community resources and textual datasets [3, 57].

In prior work, however, the user and system vocabularies are known in advance to the model

under construction, either through explicit tags in the text under analysis or through manual entry

by the researchers constructing the model. It is an open challenge in many domains to instead

learn these high level patterns from lower level components, as the systems I present do through

their analyses. For example, Codex leverages the tree structure of code to mine large, common

subtrees that are used and repeated across many projects, then relates these subtrees to their

surrounding context. Similarly, human activities do not have a natural, high level representation in

text, a challenge Augur overcomes by combining the generality of learned vector space models with

patterns extracted through regularities in English language.

A second general challenge faced by work that attempts to bridge user and system vocabularies

is how the resulting models can then be used by interactive systems. For example, a system may

be able to relate the user word “mask” to the system command “layer mask” in an application like

Photoshop, which is useful for search. But are there applications for these models that extend beyond

information retrieval? This thesis engages with how such models can be embedded in interactive

systems, and what interactions are empowered by this new information. For example, Augur shows

how scene context provided by a Google Glass can feed activity predictions learned from an analysis

of fiction, leading to downstream applications such as an automatic food diary or an application


that warns you when you are spending too much money. Similarly, Codex shows how a linting tool

trained on millions of lines of open source code can warn users when their code deviates from

conventional idioms of a language.

In the following sections, I introduce three independent systems that work along these lines,

leveraging unsupervised and semi-supervised data mining techniques to develop a new class of inter-

active applications. Here I motivate the problems these systems solve and discuss how they extend

existing work in their domains.

2.1 Modeling Human Behavior

The first system I present in this thesis, Augur, is a knowledge base that uses fiction to connect

human activities to objects and their behaviors. This system draws on a large body of related work

in commonsense knowledge representation and ubiquitous computing, as well as prior work in data

mining and unsupervised language modeling.

2.1.1 Mining community data

Augur is inspired by many existing techniques for mining user behavior from data. For example,

query-feature graphs show how to encode the relationships between high-level descriptions of user

goals and underlying features of a system [28], even when these high-level descriptions are different

from an application’s domain language [3]. Researchers have applied these techniques to applications

such as AutoCAD [57] and Photoshop [3], where the user’s description of a domain and that domain’s

underlying mechanics are often disjoint. With Augur, we introduce techniques that mine real-world

human activities that typically occur outside of software.

Other systems have developed powerful domain-specific support by leveraging user traces. For

example, in the programming community, research systems have captured emergent practice in

open source code [27], drawn on community support for debugging computer programs [38], and

modeled how developers backtrack and revise their programs [84]. In mobile computing, the space

of user actions is small enough that it is often possible to predict upcoming actions [53]. In design,

a large dataset of real-world web pages can help guide designers to find appropriate ideas [50].

Creativity-support applications can use such data to suggest backgrounds or alternatives to the

current document [52, 73]. Augur complements these techniques by focusing on unstructured data

such as text and modeling everyday life rather than behavior within the bounds of one program.

2.1.2 Ubiquitous computing interfaces

Ubiquitous computing research and context-aware computing aim to empower interfaces to benefit

from the context in which they are being used [59, 2]. Their visions motivated the creation of


our knowledge base (e.g., [82, 5]). Some applications have aimed to model specific activities or

contexts such as jogging and cycling (e.g., [18]). Augur aims to augment these models with a

broader understanding of human life. For example, what objects might be nearby before someone

starts jogging? What activities do people perform before they decide to go jogging? Doing so could

improve the design and development of many such applications.

2.1.3 Knowledge representation

We draw on work in natural language processing, information extraction, and computer vision to

distill human activities from fiction. Prior work discusses how to extract patterns from text by parsing

sentences [16, 23, 7, 17]. We adapt and extend these approaches in our text mining domain-specific

language, producing an alternative that is more declarative and potentially easier to inspect and

reason about. Other work in NLP and CV has shown how vector space models can extract useful

patterns from text [61], or how other machine learning algorithms can generate accurate image labels

[45] and classify images given a small closed set of human actions [51]. Augur draws on insights

from these approaches to make conditional predictions over thousands of human activities.

Our research also benefits from prior work in commonsense knowledge representation. Existing

databases of linguistic and commonsense knowledge provide networks of facts that computers should

know about the world [54]. Augur captures a set of relations that focus more deeply on human be-

havior and the causal relationships between human activities. We draw on forms of commonsense

knowledge, like the WordNet hierarchy of synonym sets [62], to more precisely extract human ac-

tivities from fiction. Parts of this vocabulary may be mineable from social media, if they are of the

sort that people are likely to advertise on Twitter [46]. We find that fiction offers a broader set of

local activities.

2.2 Modeling Human Language

The second system I present in this thesis, Empath, analyzes text across hundreds of topics and

emotions. Like LIWC and other dictionary-based tools, it counts category terms in a text document.

However, Empath covers a broader set of categories than other tools, and users can generate and

validate new categories with a few seed words. Empath inherits from a rich ecosystem of tools

and applications for text analysis, and draws on the insights of prior work in data mining and

unsupervised language modeling.

2.2.1 Extracting signal from text

Text analysis via dictionary categories has a long history in academic research. LIWC, for example,

is an extensively validated dictionary that offers a total of 62 syntactic (e.g., present tense verbs,


pronouns), topical (e.g., home, work, family) and emotional (e.g., anger, sadness) categories [68].

The General Inquirer (GI) is another human curated dictionary that operates over a broader set

of topics than LIWC (e.g., power, weakness), but fewer emotions [76]. Other tools like EmoLex,

ANEW, and SentiWordNet are designed to analyze larger sets of emotional categories [63, 11, 22].

While Empath’s analyses are similarly driven by dictionary-based word counts, Empath operates

over a more extensive set of categories, and can generate and validate new categories on demand

using unsupervised language modeling.

Work in sentiment analysis has developed powerful techniques to classify text across positive

and negative polarity [75], but has also benefited from simpler, transparent models and rules [44].

Empath draws on the complementary strengths of these ideas, using the power of unsupervised

machine learning to create human-interpretable feature sets for the analysis of text. One of Empath’s

goals is to embed modern NLP techniques in a way that offers the transparency of dictionaries like

LIWC.

2.2.2 Text mining and modeling

A large body of prior work has investigated unsupervised language modeling. For example, re-

searchers have learned sentiment models from the relationships between words [40], classified the

polarity of reviews in an unsupervised fashion [79], discovered patterns of narrative in text [16], and

(more recently) used neural networks to model word meanings in a vector space [60]. We borrow

from the last of these approaches in constructing Empath’s unsupervised model.

Empath also takes inspiration from techniques for mining human patterns from data. Augur

likewise mines text on the web to learn human activities for interactive systems [26]. Augur’s

evaluation indicated that with regard to low-level behaviors such as actions, these data provide a

surprisingly accurate mirror of human behavior. Empath contributes a different perspective, that

text on the web can be an appropriate tool for learning a breadth of topical and emotional categories,

to the benefit of social science. In other research communities, systems have used unsupervised

models to capture emergent practice in open source code [27] or design [50]. In Empath, we adapt

these techniques to mine natural language for its relation to emotional and topical categories.

Finally, Empath also benefits from prior work in commonsense knowledge representation. Ex-

isting databases of linguistic and commonsense knowledge provide networks of facts that computers

should know about the world [54, 62, 22]. We draw on some of this knowledge, like the ConceptNet

hierarchy, when seeding Empath’s categories. Further, Empath itself captures a set of relations

on the topical and emotional connotations of words. Some aspects of these connotations may be

mineable from social media, if they are of the sort that people are likely to advertise on Twitter [46].


2.3 Modeling Code Patterns

The final system I present in this thesis, Codex, analyses millions of lines of open source code

to uncover undocumented norms of practice and convention. Codex builds upon related work in

software repository mining, program analysis, and data-driven interfaces.

2.3.1 Mining software repositories

Codex draws on techniques from software repository mining to extract patterns from a large body of

open source code. Other researchers have mined code for software patterns and redundant code using

code normalization or typicality [56, 8, 20, 41, 65, 15]. However, much of this research emphasizes

the discovery of known design patterns and is oriented towards applications such as refactoring of

duplicate code, while Codex discovers new patterns from the ground up. Further, Codex combines

typicality analysis with expert crowdsourcing to build its database — an approach independent of

any particular code normalization scheme.

Databases can also systematize knowledge about open source code. However, these databases

are usually designed to enable specific forms of code search [83, 78], example-finding [43, 36, 69], or

autocompletion [42], either query based or automatic. While tools designed for specific use cases

may be highly optimized for their tasks, Codex enables a broader set of applications, including

pattern annotation and detecting problematic code through statistical linting.

2.3.2 Bugfinding

One of Codex’s core applications is to help programmers avoid bugs. Much work has focused on

tools for static and dynamic analysis [6, 21]. Other work has focused on helping users debug their

programs through program analysis or crowdsourced aggregation of user activities [38, 4, 34, 48, 65].

Codex does not explicitly try to discover bugs in programs; rather, it notifies users when code

violates convention. This is a subtle but important difference: code may be syntactically correct but

semantically unusual and error-prone.

2.3.3 Learning from code examples

Codex takes inspiration from prior research on code example finding and reuse. Some of these tools

rely on official forms of documentation [12] and others focus on real code from the web [69, 39, 77].

Codex generalizes this work — it covers a broader set of examples than manually curated datasets

and can determine when an example is a one-off and when it represents more general practice. Codex

also enables a more powerful search over examples through AST analysis, benefits from

human-powered filtering and annotation, and makes possible many applications besides example-finding.


Researchers have also addressed how programmers make use of example code, whether the code

is copy-pasted [47] or foraged from documentation or online examples [14, 13, 35]. By formalizing

embedded software practice, Codex is able to support programmers through a larger space of ex-

amples and lower-level conventions. Many of these idioms and code snippets may not have been

formally discussed on the web.

2.3.4 Data-driven interfaces

Codex draws on insights from data-driven interfaces in non-programming domains. Users can gain

much through querying and exploration. For example, Webzeitgeist allows designers to query a large

corpus of rendered web sites [50]. Crowd data also allows interactive systems to transform a partial

sketch of the user’s intent into a complete state, for example matching a sung melody against a large

database of music to produce an automatic backup band [73]. Algorithms can then identify patterns

in crowd behavior and percolate them up to the interface, for example answering a wide variety of

user queries, demonstrating how a given feature is used in practice [9, 28, 58], or predicting likely

actions from past history [37]. Codex demonstrates that the more structured nature of programming

languages provides a platform for more powerful interactive support such as error finding.

Chapter 3

Modeling Human Behavior

From smart homes that prepare coffee when we wake, to phones that know not to interrupt us

during important conversations, our collective visions of HCI imagine a future in which computers

understand a broad range of human behaviors. Today our systems fall short of these visions, however,

because this range of behaviors is too large for designers or programmers to capture manually. In this

chapter, we instead demonstrate it is possible to mine a broad knowledge base of human behavior

by analyzing more than one billion words of modern fiction. Our resulting knowledge base, Augur,

trains vector models that can predict many thousands of user activities from surrounding objects in

modern contexts: for example, whether a user may be eating food, meeting with a friend, or taking a

selfie. Augur uses these predictions to identify actions that people commonly take on objects in the

world and estimate a user’s future activities given their current situation. We demonstrate Augur-

powered, activity-based systems such as a phone that silences itself when the odds of you answering

it are low, and a dynamic music player that adjusts to your present activity. A field deployment of

an Augur-powered wearable camera resulted in 96% recall and 71% precision on its unsupervised

predictions of common daily activities. A second evaluation where human judges rated the system’s

predictions over a broad set of input images found that 94% were rated sensible.

3.1 Augur

Augur is a knowledge base that uses fiction to connect human activities to objects and their behav-

iors. We begin with an overview of the basic activities, objects, and object affordances in Augur,

then explain our approach to text mining and modeling.


3.1.1 Human Activities

Augur is primarily oriented around human activities, which we learn from verb phrases that have hu-

man subjects, for example “he opens the fridge” or “we turn off the lights.” Through co-occurrence

statistics that relate objects and activities, Augur can map contextual knowledge onto human be-

havior. For example, we can ask Augur for the five activities most related to the object “facebook”

(in modern fiction, characters use social media with surprising frequency):

Activity Score Frequency

message 0.71 1456

get message 0.53 4837

chat 0.51 4417

close laptop 0.45 1480

open laptop 0.39 1042

Here score refers to the cosine similarity between a vector-embedded query and activities in the

Augur knowledge base (we’ll soon explain how we arrive at this measure).

Like real people, fictional characters waste plenty of time messaging or chatting on Facebook.

They also engage in activities like post, block, accept, or scroll feed.

Similarly, we can look at relations that connect multiple objects. What activities occur around

a shirt and tie? Augur captures not only the obvious sartorial applications, but notices that shirts

and ties often follow specific other parts of the morning routine such as take shower:


Activity Score Frequency

wear 0.05 58685

change 0.04 56936

take shower 0.04 14358

dress 0.03 16701

slip 0.03 59965

In total, Augur relates 54,075 human activities to 13,843 objects and locations. While the head

of the distribution contributes many observed activities (e.g., extremely common activities like ask

or open door), a more significant portion lie in the bulk of the tail. These less common activities,

like reply to text message or take shower, make up much of the average fictional human’s existence.

Further out, as the tail diminishes, we find less frequent but still semantically interesting activities

like throw out flowers or file bankruptcy.

Augur associates each of its activities with many objects, even activities that appear relatively

infrequently. For example, unfold letter occurs only 203 times in our dataset, yet Augur connects

it to 1072 different objects (e.g., handwriting, envelope). A more frequent activity like take picture


occurs 10,249 times, and is connected with 5,250 objects (e.g., camera, instagram). The abundance

of objects in fiction allows us to make inferences for a large number of activities.

3.1.2 Object Affordances

Augur also contains knowledge about object affordances: actions that are strongly associated with

specific objects. To mine object affordances, Augur looks for subject-verb-object sentences with

objects either as their subject or direct object. Understanding these behaviors allows Augur to

reason about how humans might interact with their surroundings. For example, the ten most

related affordances for a car:


honk horn 0.38 243

buckle seat-belt 0.37 203

roll window 0.35 279

start engine 0.34 898

shut car-door 0.33 140

open car-door 0.33 1238

park 0.32 3183

rev engine 0.32 113

turn on radio 0.30 523

drive home 0.26 881

Cars undergo basic interactions like roll window and buckle seat-belt surprisingly often. These

are relatively mundane activities, yet abundant in fiction.

Like the distribution of human activities, the distribution of objects is heavy-tailed. The head of

this distribution contains objects such as phone, bag, book, and window, which all appear more than

one million times. The thick “torso” of the distribution is made of objects such as plate, blanket,

pill, and wine, which appear between 30,000 and 100,000 times. On the fringes of the distribution

are more idiosyncratic objects such as kindle (the e-book reader), heroin, mouthwash, and porno,

which appear between 500 and 1,500 times.

3.1.3 Connections between activities

Augur also contains information about the connections between human activities. To mine for

sequential activities, we can look at extracted activities that co-occur within a small span of words.

Understanding which activities occur around each other allows Augur to make predictions about

what a person might do next.

For example, we can ask Augur what happens after someone orders coffee:



Activity Score Frequency

eat 0.48 49347

take order 0.40 1887

take sip 0.39 11367

take bite 0.39 6914

pay 0.36 23405

Even fictional characters, it seems, must pay for their orders.

Likewise, Augur can use the connections between activities to determine which activities are

similar to one another. For example, we can ask for activities similar to the social media photography

trend of take selfie:


Activity Score Frequency

snap picture 0.78 1195

post picture 0.76 718

take photo 0.67 1527

upload picture 0.58 121

take picture 0.57 10249

By looking for activities with similar object co-occurrence patterns, we can find near-synonyms.

3.1.4 A data mining DSL for natural language

Creating Augur requires methods that can extract relevant information from large-scale text and

then model it. Exploring the patterns in a large corpus of text is a difficult and time consuming

process. While constructing Augur, we tested many hypotheses about the best way to capture

human activities. For example, we asked: what level of noun phrase complexity is best? Some

complexity is useful. The pattern run to the grocery store is more informative for our purposes than

run to the store. But too much complexity can hurt predictions. If we capture phrases like run

to the closest grocery store, our data stream becomes too sparse. Worse, when iterating on these

hypotheses, even the cleanest parser code tends not to be easily reusable or interpretable.

To help us more quickly and efficiently explore our dataset, we created TC (Text Combinator),

a data mining DSL for natural language. TC allows us to build parsers that capture patterns in a

stream of text data, along with aggregate statistics about these patterns, such as frequency and co-

occurrence counts, or the mutual information (MI) between relations. TC’s scripts can be easier to

understand and reuse than hand-coded parsers, and its execution can be streamed and parallelized

across a large text dataset.

TC programs can model syntactic and semantic patterns to answer questions about a corpus.


For example, suppose we want to figure out what kinds of verbs often affect laptops:

laptop = [DET]? ([ADJ]+)? "laptop"

verb_phrase = [VERB] laptop-

freq(verb_phrase)

Here the laptop parser matches phrases like “a laptop” or “the old broken laptop” and returns

exactly the matched phrase. The verb phrase parser matches phrases like “throw the broken laptop”

and returns just the verb in the phrase (e.g., “throw”). The freq aggregator keeps a count of unique

tokens in the output stream of the verb phrase parser. On a small portion of our corpus, we see as

output:

open 11

close 7

shut 6

restart 4

To clarify the syntax for this example: square brackets (e.g., [NOUN]) define a parser that matches

on a given part of speech, quotes (e.g., "laptop") match on an exact string, whitespace is an implicit

then-combinator (e.g., [NOUN] [NOUN] matches two sequential nouns), a question mark makes a match

optional (e.g., [DET]? matches an article like “a” or “the”, or the empty string), a plus matches one or

more repetitions (e.g., [VERB]+ matches as many verbs as appear consecutively), and a minus matches

but discards (e.g., [NOUN]- matches on a noun but removes it from the returned match).

We wrote the compiler for TC in Python. Behind the scenes, our compiler transforms an input

program into a parser combinator, instantiates the parser as a Python generator, then runs the

generator to lazily parse a stream of text data. Aggregation commands (e.g., freq frequency counting

and MI for MI calculation) are also Python generators, which we compose with a parser at compile

time. Given many input files, TC also supports parallel parsing and aggregation.
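
As a rough illustration of the compilation strategy described above, the following Python sketch rebuilds the laptop example from generator-based parser combinators. It is not TC's implementation: the combinator names (pos, lit, seq, opt, many1, drop), the (word, tag) token format, the scan driver, and the sample tokens are all assumptions made for this example.

```python
from collections import Counter

# A parser is a function (tokens, i) -> generator of (next_index, matched_words).
# Tokens are (word, part_of_speech) pairs; this format is assumed for the sketch.

def pos(tag):
    """[TAG]: match one token with the given part-of-speech tag."""
    def parse(tokens, i):
        if i < len(tokens) and tokens[i][1] == tag:
            yield i + 1, [tokens[i][0]]
    return parse

def lit(word):
    """"word": match one token with exactly this text."""
    def parse(tokens, i):
        if i < len(tokens) and tokens[i][0] == word:
            yield i + 1, [word]
    return parse

def seq(*parsers):
    """Whitespace (then-combinator): match the parsers one after another."""
    def parse(tokens, i):
        def go(k, j, acc):
            if k == len(parsers):
                yield j, acc
                return
            for j2, out in parsers[k](tokens, j):
                yield from go(k + 1, j2, acc + out)
        yield from go(0, i, [])
    return parse

def opt(p):
    """?: try p first, then fall back to the empty match."""
    def parse(tokens, i):
        yield from p(tokens, i)
        yield i, []
    return parse

def many1(p):
    """+: one or more repetitions of p, longest match first."""
    def parse(tokens, i):
        for j, out in p(tokens, i):
            if j == i:
                continue  # guard against parsers that match the empty string
            for j2, rest in parse(tokens, j):
                yield j2, out + rest
            yield j, out
    return parse

def drop(p):
    """-: match p but remove it from the returned match."""
    def parse(tokens, i):
        for j, _ in p(tokens, i):
            yield j, []
    return parse

def scan(parser, tokens):
    """Lazily yield the first match that starts at each token position."""
    for i in range(len(tokens)):
        match = next(parser(tokens, i), None)
        if match is not None:
            yield " ".join(match[1])

def freq(matches):
    """Aggregator: count unique outputs of a match stream."""
    return Counter(matches)

# laptop = [DET]? ([ADJ]+)? "laptop"
laptop = seq(opt(pos("DET")), opt(many1(pos("ADJ"))), lit("laptop"))
# verb_phrase = [VERB] laptop-
verb_phrase = seq(pos("VERB"), drop(laptop))

# Hypothetical, already tagged and lemmatized input.
tokens = [("she", "PRON"), ("open", "VERB"), ("the", "DET"), ("old", "ADJ"),
          ("laptop", "NOUN"), ("and", "CCONJ"), ("he", "PRON"),
          ("close", "VERB"), ("a", "DET"), ("laptop", "NOUN"),
          ("then", "ADV"), ("open", "VERB"), ("laptop", "NOUN")]

counts = freq(scan(verb_phrase, tokens))
print(counts.most_common())
```

Because every combinator is a generator, no text is parsed until the aggregator pulls matches from the stream, which is what makes lazy, streaming evaluation possible in a design of this kind.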

3.1.5 Mining activity patterns from text

To build the Augur knowledge base, we index more than one billion words of fiction writing from

600,000 stories written by more than 500,000 writers on the Wattpad writing community1. Wattpad

is a community where amateur writers can share their stories, oriented mostly towards writers of

genre fiction. Our dataset includes work from 23 of these genres, including romance, science fiction,

and urban fantasy, all of which are set in the modern world.

Before processing these stories, we normalize them using the spaCy part of speech tagger and

lemmatizer2. The tagger labels each word with its appropriate part of speech given the context

1 http://wattpad.com

2 https://honnibal.github.io/spaCy/


of a sentence. Part of speech tagging is important for words that have multiple senses and might

otherwise be ambiguous. For example, “run” is a noun in the phrase, “she wants to go for a run”,

but a verb in the phrase “I run into the arms of my reviewers.” The lemmatizer converts each word

into its singular and present-tense form. For example, the plural noun “soldiers” can be lemmatized

to the singular “soldier” and the past tense verb “ran” to the present “run.”

Activity-Object statistics

Activity-object statistics connect commonly co-occurring objects and human activities. These statis-

tics will help Augur detect activities from a list of objects in a scene. We define activities as verb

phrases where the subject is a human, and objects as compound noun phrases, throwing away

adjectives. To generate these edges, we run the TC script:

human_pronoun = "he" | "she" | "i" | "we" | "they"

np = [DET]? ([ADJ]- [NOUN])+

vp = human_pronoun ([VERB] [ADP])+

MI(freq(co-occur(np, vp, 50)))

For example, backpack co-occurs with pack 2413 times, and radio co-occurs with singing 7987

times. Given the scale of our data, Augur’s statistics produce meaningful results by focusing just

on pronoun-based sentences.

In this TC script, MI processes our final co-occurrence statistics to calculate the

mutual information of our relations, where A and B are the frequencies of two relations, and the

term AB is the frequency of collocation between A and B:

\[ \mathrm{MI}(A, B) = \log\left(\frac{AB}{A \cdot B}\right) \]

MI describes how much one term of a co-occurrence tells us about the other. For example, if

people type with every kind of object in equal amounts, then knowing there is a computer in your

room doesn’t mean much about whether you are typing. However, if people type with computers

far more often than anything else, then knowing there is a computer in your room tells us significant

information, statistically, about what you might be doing.
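
The typing example can be made concrete with a small Python sketch of the MI formula. It assumes that A, B, and AB are relative frequencies (counts divided by the total number of observation windows) and that the logarithm is natural; the function name and all counts below are invented for illustration and are not taken from Augur's data.

```python
import math

def mutual_information(count_ab, count_a, count_b, total):
    """MI(A, B) = log(AB / (A * B)), with each term as a relative frequency."""
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log(p_ab / (p_a * p_b))

# Hypothetical corpus of 1000 windows in which "type" appears 100 times.
# "computer" appears 50 times and co-occurs with "type" in 40 of them.
mi_computer = mutual_information(40, 100, 50, 1000)   # log(8), about 2.08

# "spoon" also appears 50 times but co-occurs with "type" only 5 times,
# which is what we would expect if the two were independent.
mi_spoon = mutual_information(5, 100, 50, 1000)       # approximately 0

print(round(mi_computer, 2), round(mi_spoon, 2))
```

Under these made-up counts, seeing a computer is strong evidence of typing, while seeing a spoon carries essentially no information about it.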

Object-affordance statistics

The object-affordance statistic connects objects directly to their uses and behaviors, helping Augur

understand how humans can interact with the objects in a scene. We define object affordances as

verb phrases where an object serves as either the subject or direct object of the phrase, and we again

capture physical objects as compound noun phrases. To generate these edges, we run the TC script:

np = [DET]? ([ADJ]- [NOUN])+


vp = ([VERB] [ADP])+

svo = np vp np?

MI(freq(svo))

For example, coffee is spilled 229 times, and facebook is logged into 295 times.

Activity-Activity statistics

Activity-activity statistics count the times that an activity is followed by another activity, helping

Augur make predictions about what is likely to happen next. To generate these statistics, we run

the TC script:

human_pronoun = "he" | "she" | "i" | "we" | "they"

vp = human_pronoun ([VERB] [ADP])+

MI(freq(skip-gram(vp,2,50)))

Activity-activity statistics tend to be more sparse, but Augur can still uncover patterns. For

example, wash hair precedes blow dry hair 64 times, and get text (e.g., receive a text message)

precedes text back 34 times.

In this TC script, skip-gram(vp,2,50) constructs skip-grams of length n = 2 sequential vp

matches on a window size of 50. Unlike co-occurrence counts, skip-grams are order-dependent,

helping Augur find potential causal relationships.
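
To illustrate the order dependence of skip-grams, here is a minimal Python sketch of pair extraction over a stream of activity matches. It assumes each match carries its word position and that the window is measured in words; the function name, the positions, and the sample data are hypothetical, and TC's own skip-gram aggregator may differ in detail.

```python
from collections import Counter

def skip_grams(matches, window):
    """Yield ordered (earlier, later) activity pairs at most `window` words apart.

    `matches` is a list of (word_position, activity) tuples sorted by position.
    """
    for i, (pos_a, act_a) in enumerate(matches):
        for pos_b, act_b in matches[i + 1:]:
            if pos_b - pos_a > window:
                break  # later matches are even further away
            yield act_a, act_b

# Hypothetical activity matches from one story.
matches = [(3, "wash hair"), (20, "blow dry hair"), (45, "get text"),
           (60, "text back"), (200, "wash hair"), (230, "blow dry hair")]

pair_counts = Counter(skip_grams(matches, window=50))
print(pair_counts[("wash hair", "blow dry hair")])   # counted in this order
print(pair_counts[("blow dry hair", "wash hair")])   # never counted in reverse
```

A symmetric co-occurrence count would treat the two orderings as the same event, whereas the skip-gram count keeps them apart.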

3.1.6 Vector space model for retrieval

Augur’s three statistics are not enough by themselves to make useful predictions. These statistics

represent pairwise relationships and only allow prediction based on a single element of context (e.g.,

activity predictions from a single object), ignoring any information we might learn from similar

co-occurrences with other terms. For many applications it is important to have a more global view

of the data.

To make these global relationships available, we embed Augur’s statistics into a vector space

model (VSM), allowing Augur to enhance its predictions using the signal of multiple terms. Queries

based on multiple terms narrow the scope of possibility in Augur’s predictions, strengthening

predictions common to many query terms, and weakening those that are not.

VSMs encode concepts as vectors, where each dimension of the vector conveys a feature relevant

to the concept. For Augur, these dimensions are defined by MI > 0 with Laplace smoothing (by a

constant value of 10), which in practice reduces bias towards uncommon human activities [80].

Augur has three VSMs. 1) Object-Activity: each vector is a human activity and its dimensions

are smoothed MI between it and every object. 2) Object-Affordance: each vector is an affordance

and its dimensions are smoothed MI between it and every object. 3) Activity-Prediction: each

vector is an activity and its dimensions are smoothed MI between it and every other activity.


Figure 3.1: Augur’s activity detection API translates a photo into a set of likely relevant activities. For example, the user’s camera might automatically photojournal the food whenever the user may be eating food. Here, Clarifai produced the object labels.

Figure 3.2: Augur’s APIs map input images through a deep learning object detector, then initialize the returned objects into a query vector. Augur then compares that vector to the vectors representing each activity in its database and returns those with lowest cosine distance.

To query these VSMs, we construct a new empty vector, set the indices of the terms in the query

equal to 1, then find the closest vectors in the space by measuring cosine similarity.
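
The query procedure can be sketched in a few lines of Python. This is a toy version under stated assumptions: the four-object vocabulary, the three activity vectors, and their smoothed-MI values are invented, and the function names are hypothetical rather than Augur's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query(vsm, vocabulary, terms, k=2):
    """Build a 0/1 query vector over the vocabulary and rank activities by similarity."""
    q = [1.0 if obj in terms else 0.0 for obj in vocabulary]
    ranked = sorted(vsm, key=lambda activity: cosine(q, vsm[activity]), reverse=True)
    return ranked[:k]

# Each dimension is one object; values stand in for smoothed MI (made up).
vocabulary = ["plate", "steak", "laptop", "keyboard"]
vsm = {
    "eat food":  [2.0, 1.5, 0.0, 0.0],
    "type":      [0.0, 0.0, 1.0, 2.5],
    "set plate": [1.8, 0.0, 0.0, 0.0],
}

print(query(vsm, vocabulary, {"plate"}))            # single-term query
print(query(vsm, vocabulary, {"plate", "steak"}))   # second term shifts the ranking
```

With these toy values, the single object "plate" ranks set plate first, while adding "steak" moves eat food to the top, which is the narrowing effect of multi-term queries described above.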

3.2 Augur API and Applications

Applications can draw from Augur’s contents to identify user activities, understand the uses of

objects, and make predictions about what a user might do next. To enable software development

under Augur, we present these three APIs and a proof-of-concept architecture that can augment

existing applications with if-this-then-that human semantics.

We begin by introducing the three APIs individually, then demonstrate additional example ap-

plications to follow. To more robustly evaluate Augur, we have built one of these applications,

Soundtrack for Life, into Google Glass hardware.

3.2.1 Identifying Activities

What are you currently doing? If Augur can answer this question, applications can potentially help

you with that activity, or determine how to behave given the context around you.

Suppose a designer wants to help people stick to their diets, and she notices that people often

forget to record their meals. So the designer decides to create an automatic meal photographer.


She connects the user’s wearable camera to a scene-level object detection computer vision algorithm

such as R-CNN [32]. While she could program the system to fire a photo whenever the computer

vision algorithm recognizes an object class categorized as food, this would produce a large number

of false positives throughout the day, and would ignore a breadth of other signals such as silverware

and dining tables that might actually indicate eating.

So, the designer connects the computer vision output to Augur (Figure 3.1). Instead of program-

ming a manual set of object classes, the designer instructs Augur to fire a notification whenever the

user engages in the activity eat food. She refers to the activity using natural language, since this is

what Augur has indexed from fiction:

image = /* capture picture from user's wearable camera */;

// Ask listening devices to take a photo whenever Augur detects eating.
if (augur.detect(image, "eat food")) {
    augur.broadcast("take photo");
}

The application takes an image at regular intervals. The detect function processes the latest

image in that stream, pings a deep learning computer vision server (http://www.clarifai.com/),

then runs its object results through Augur’s object-activity VSM to return activity predictions. The

broadcast function broadcasts an object affordance request keyed on the activity take photo: in this

case, the wearable camera might respond by taking a photograph.

Now, the user sits down for dinner, and the computer vision algorithm detects a plate, steak and

broccoli (Figure 3.1). A query to Augur returns:


fill plate 0.39 203

put food 0.23 1046

take plate 0.15 1321

eat food 0.14 2449

set plate 0.12 740

cook 0.10 6566

The activity eat food appears as a strong prediction, as does (further down) the more general activity

eat. The objects in the ensemble reinforce one another: when the plate, steak and broccoli are combined

to form a query, eating has 1.4 times higher cosine similarity than for any of the objects individually.

The camera fires, and the meal is saved for later.

3.2.2 Expanding Activities with Object Affordances

How can you interact with your environment? If Augur knows how you can manipulate your sur-

roundings, it can help applications facilitate that interaction.


Figure 3.3: Augur’s object affordance API translates a photo into a list of possible affordances. For example, Augur could help a blind user who is wearing an intelligent camera and says they want to sit. Here, Clarifai produced the object labels.

Object affordances can be useful for creating accessible technology. For example, suppose a blind

user is wearing an intelligent camera and tells the application they want to sit (Figure 3.3). Many

possible objects would let this person sit down, and it would take a lot of designer effort to capture

them all. Instead, using Augur’s object affordance VSM, an application could scan nearby objects

and find something sittable:

image = /* capture picture from user's wearable camera */;

// Alert the user when a nearby object affords sitting.
if (augur.affordance(image, "sit")) {
    alert("sittable object ahead");
}

The affordance function will process the objects in the latest image, executing its block when

Augur notices an object with the specified affordance. Now, if the user happens to be within eyeshot

of a bench:


sit 0.13 600814

take seat 0.12 24257

spot 0.11 16132

slump 0.09 8985

plop 0.07 12213

Here the programmer didn’t need to stop and think about all the scenarios or objects where a

user might sit. Instead, they just stated the activity and Augur figured it out.

3.2.3 Predicting Future Activities

What will you do next? If Augur can predict your next activity, applications can react in advance

to better meet your needs in that situation. Activity predictions are particularly useful for helping

users avoid problematic behaviors, like forgetting their keys or spending too much money.


In Apple’s Knowledge Navigator [5], the agent ignores a phone call when it knows that it would

be an inappropriate time to answer. Could Augur support this?

// Compare how likely the user is to answer versus ignore a call.
answer = augur.predict("answer call");
ignore = augur.predict("ignore call");

if (ignore > answer) {
    augur.broadcast("silence phone");
} else {
    augur.broadcast("unsilence phone");
}

The augur.predict function makes new activity predictions based on the user’s activities over the

past several minutes. If the current context suggests that a user is using the restroom, for example,

the prediction API will know that answering a call is an unlikely next action. When provided with

an activity argument, augur.predict returns a cosine similarity value reflecting the possibility of

that activity happening in the near future. The activity ignore call has lower cosine similarity than

answer call for most queries to Augur. But if a query ever indicates a greater cosine similarity for

ignore call, the application can silence the phone. As before, Augur broadcasts the desired activity

to any listening devices (such as the phone).
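
As a rough sketch of the behavior just described, the code below scores a target activity against the activities observed within a recent time window. It is not Augur's implementation: the five-minute window, the `(timestamp, activity)` history format, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(activity_vsm, history, target, now, window_s=300):
    """Score `target` against activities seen in the last `window_s` seconds.

    history: list of (timestamp_seconds, activity) observations.
    """
    recent = {act: 1.0 for ts, act in history if 0 <= now - ts <= window_s}
    return cosine(recent, activity_vsm.get(target, {}))

# Toy activity-prediction vectors (invented weights).
toy_vsm = {
    "answer call": {"sit": 0.5, "work": 0.6, "curse": 0.1},
    "ignore call": {"curse": 0.7, "talk": 0.5, "work": 0.2},
}
history = [(0, "work"), (900, "talk"), (960, "curse")]

answer = predict(toy_vsm, history, "answer call", now=1000)
ignore = predict(toy_vsm, history, "ignore call", now=1000)
action = "silence phone" if ignore > answer else "unsilence phone"
```

Because the observation of work has aged out of the window by the time the call arrives, only talk and curse shape the query, and the comparison tips towards ignoring the call.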

Suppose your phone rings while you are talking to your best friend about their relationship issues.

Thoughtlessly, you curse, and your phone stops ringing instantly:


throw phone 0.24 3783

ignore call 0.18 567

ring 0.18 7245

answer call 0.17 4847

call back 0.17 1883

leave voicemail 0.17 146

Many reactions besides cursing might also trigger ignore call. In this case, adding curse to the

prediction mix shifts the odds between ignoring and answering significantly. Other results like throw

phone reflect the biases in fiction. We will investigate the impact of these biases in our Evaluation.

3.2.4 Applications

Augur allows developers to build situationally reactive applications across many activities and con-

texts. Here we present three more applications designed to illustrate the potential of its API. We

have deployed one of these applications, A Soundtrack for Life, as a Google Glass prototype.


The Autonomous Activity Journal

We often forget where we have gone and what we have done. Augur allows us to journal our activities

passively, automatically (and probabilistically):

predictions = augur.predict();
// Log only the high-confidence predictions.
for (p of predictions) {
    if (p.score > 0.8) file.write("journal", p.activity);
}

When Augur returns new predictions about our life, this program will write the most likely ones

to a log. We might search this log later, or use it to find patterns in our daily behavior. For example,

what days are we most likely to exercise? How often do we tend to go out to eat, or hang out with

friends? Some of Augur’s predictions will inevitably be false positives, but in aggregate they may

provide useful analytics into our lives.

The Coffee-Aware Smart Home

In Weiser’s ubiquitous computing vision [82], he introduces the idea of calm computing via a scenario

where a woman wakes up and her smart home asks if she wants coffee. Augur’s activity prediction

API can support this vision:

if (augur.predict("make coffee")) { askAboutCoffee(); }

Suppose that your alarm goes off, signaling to Augur that your activity is wake up. Your smart

coffeepot can start brewing when Augur predicts you want to make coffee:


want breakfast 0.38 852

throw blanket 0.38 728

shake awake 0.37 774

hear shower 0.36 971

take bath 0.35 1719

make coffee 0.34 779

check clock 0.34 2408

After people wake up in the morning, they are likely to make coffee. They may also want

breakfast, another task a smart home might help us with.

Spending Money Wisely

We often spend more money than we have. Augur can help us maintain a greater awareness of our

spending habits, and how they affect our finances. If we are reminded of our bank balance before


Figure 3.4: A Soundtrack for Life is a Google Glass application that plays music based on the user’s predicted activity, for example associating working with The Glitch Mob.

spending money, we may be less inclined to spend it on frivolous things:

// Remind the user of their balance when a payment looks likely.
if (augur.predict("pay")) {
    balance = secure_bank_query();
    speak("your balance is " + balance);
}

If Augur predicts we are likely to pay for something, it will tell us how much money we have left

in our account. What might trigger this prediction?


scan 0.19 5319

ring 0.19 7245

pay 0.17 23405

swipe 0.17 1800

shop 0.13 3761

For example, when you enter a store, you may be about to pay for something. The pay prediction

also triggers on ordering food or coffee, entering a cafe, gambling, and calling a taxi.


hail taxi 0.96 228

pay 0.96 181

call taxi 0.96 359

get taxi 0.96 368

tell address 0.95 463

get suitcase 0.82 586


A Soundtrack for Life

Many of life’s activities are accompanied by music: you might cook to the refined arpeggios of

Vivaldi, exercise to the dark ambivalence of St. Vincent, and work to the electronic pulse of the

Glitch Mob. Through an activity detection system we have built into Google Glass (Figure 3.4),

Augur can arrange a soundtrack for you that suits your daily preferences. We built a physical

prototype for this application as it takes advantage of the full range of activities Augur can detect.

var act2music = {
    "cook": "Vivaldi", "drive": "The Decemberists",
    "surfing": "Sea Wolf", "buy": "Atlas Genius",
    "work": "Glitch Mob", "exercise": "St. Vincent",
};

// Play the artist mapped to the currently predicted activity, if any.
var act = augur.predict();
if (act in act2music) {
    play(act2music[act]);
}

For example, if you are brandishing a spoon before a pot on the stove, you are likely cooking.

Augur plays Vivaldi.


cook 0.50 6566

pour 0.39 757

place 0.37 25222

stir 0.37 2610

eat 0.34 49347

3.3 Evaluation

Can fiction tell us what we need in order to endow our interactive systems with basic knowledge

of human activities? In this section, we investigate this question through three studies. First, we

compare Augur’s activity predictions to human activity predictions in order to understand what

forms of bias fiction may have introduced. Second, we test Augur’s ability to detect common

activities over a two-hour window of daily life. Third, to stress test Augur over a wider range of

activities, we evaluate its activity predictions on a dataset of 50 images sampled from the Instagram

hashtag #dailylife.


3.3.1 Bias of Fiction

If fiction were truly representative of our lives, we might be constantly drawing swords and kissing

in the rain. Our first evaluation investigates the character and prevalence of fiction bias. We tested

how closely a distribution of 1000 activities sampled from Augur’s knowledge base compared against

human-reported distributions. While these human-reported distributions may differ somewhat from

the real world, they offer a strong sanity check for Augur’s predictions.

Method

To sample the distribution of activities in Augur, we first randomly sampled 100 objects from the

knowledge base. We then used Augur’s activity identification API to select 10 human activities most

related to each object by cosine similarity. In general, these selected activities tended to be relatively

common (e.g., cross and park for the object “street”). We normalized these sub-distributions such

that the frequencies of their activities summed to 100.

Next, for each object we asked five workers on Amazon Mechanical Turk to estimate the relative

likelihood of its selected activities. For example, given a piano: “Imagine a random person is around

a piano 100 times. For each action in this list, estimate how many times that action would be taken.

The overall counts must sum to 100.” We asked for integer estimates because humans tend to be

more accurate when estimating frequencies [31].

Finally, we computed the estimated true human distribution (ETH) as the mean distribution

across the five human estimates. We compared the mean absolute error (MAE) of Augur and the

individual human estimates against the ETH.
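
The comparison can be made concrete with a small worked example. The numbers below are invented for a single object (a clock) and are not data from the study; the helper names are likewise hypothetical. Each distribution sums to 100, as in the worker instructions.

```python
def mean_distribution(estimates):
    """Average several distributions (dicts of activity -> share) key by key."""
    keys = estimates[0].keys()
    return {k: sum(e[k] for e in estimates) / len(estimates) for k in keys}

def mean_absolute_error(dist, reference):
    """Mean absolute difference, in percentage points, across activities."""
    return sum(abs(dist[k] - reference[k]) for k in reference) / len(reference)

# Five hypothetical worker estimates for the object "clock".
workers = [
    {"check time": 60, "look": 30, "wind": 10},
    {"check time": 70, "look": 20, "wind": 10},
    {"check time": 50, "look": 40, "wind": 10},
    {"check time": 65, "look": 25, "wind": 10},
    {"check time": 55, "look": 35, "wind": 10},
]
# A hypothetical system estimate that over-weights the general activity "look".
augur = {"check time": 10, "look": 85, "wind": 5}

eth = mean_distribution(workers)                  # estimated true human distribution
augur_mae = mean_absolute_error(augur, eth)       # system error against the ETH
worker_maes = [mean_absolute_error(w, eth) for w in workers]
```

Comparing `augur_mae` with the spread of `worker_maes` mirrors the analysis that follows: the system's error is read relative to how much individual people disagree with their own average.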

Results

Augur’s MAE when compared to the ETH is 12.46%, which means that, on average, its predictions

relative to the true human distribution are off by slightly more than 12%. The mean MAE of the

individual human distributions when compared to the ETH is 6.47%, with a standard deviation of

3.53%. This suggests that Augur is biased, although its estimates are not far outside the variance

of individual humans.

Investigating the individual distributions of activities suggests that the vast majority of Augur’s

prediction error is caused by a few activities in which its predictions differ radically from the humans.

In fact, for 84% of the tested activities Augur’s estimate is within 4% of the ETH. What accounts

for these few radically different estimates?

The largest class of prediction error is caused by general activities such as look. For example,

when considering raw co-occurrence frequencies, people look at clocks much more often than they

check the time, because look occurs far more often in general. When estimating the distribution of

activities around clock, human estimators put most of their weight on check time, while Augur put


Figure 3.5: We deployed an Augur-powered wearable camera in a field test over common daily activities, finding average rates of 96% recall and 71% precision for its classifications.

nearly all its weight on look. Similar mistakes involved the common but understated activities of

getting into cars or going to stores. Human estimators favored driving cars and shopping at stores.

A second and smaller class of error is caused by strong connections between dramatic events that

take place more often in fiction than in real life. For example, Augur put nearly all of its prediction

weight for cats on hissing while humans distributed theirs more evenly across a cat’s possible

activities. In practice, we saw few of these overdramatized instances in Augur’s applications and it may

be possible to use paid crowdsourcing to smooth them out. Further, this result suggests that

the ways fiction deviates from real life may be more at the macro-level of plot and situation, and

less at the level of micro-behaviors. Yes, fictional characters sometimes find themselves defending

their freedom in court against a murder charge. However, their actions within that courtroom do

tend to mirror reality — they don’t tend to leap onto the ceiling or draw swords.

3.3.2 Field test of A Soundtrack for Life

Our second study evaluates Augur through a field test of our Glass application, A Soundtrack for

Life. We recorded a two-hour sample of one user’s day, in which she walked around campus, ordered

coffee, drove to a shopping center, and bought groceries, among other activities (Figure 3.5).

Method

We gave a Google Glass loaded with A Soundtrack for Life to a volunteer and asked her, over a

two-hour period, to enact the following eight activities: walk, buy, eat, read, sit, work, order, and

drive. We then turned on the Glass, set the Soundtrack’s sampling rate to 1 frame every 10 seconds,

and recorded all data. The Soundtrack logged its predictions and images to disk.

Blind to Augur’s predictions, we annotated all image frames with a set of correct activities.

Frames could consist of no labeled activities, one activity, or several. For example, a subject sitting

at a table filled with food might be both sitting and eating. We included plausible activities among

this set. For example, when the subject approaches a checkout counter, we included pay both under

circumstances in which she did ultimately purchase something, and also under others in which she did

not. Over these annotated image frames, we computed precision and recall for Augur’s predictions.
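
Since a frame can carry zero, one, or several labels, precision and recall are computed per activity over sets of labels. The sketch below shows that computation on five invented frames; the function name and the data are hypothetical, not the study's annotations.

```python
def precision_recall(truth, predicted, activity):
    """Per-activity precision and recall over multi-label frames.

    truth, predicted: lists of sets of activity labels, one set per frame.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if activity in t and activity in p)
    fp = sum(1 for t, p in pairs if activity not in t and activity in p)
    fn = sum(1 for t, p in pairs if activity in t and activity not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented annotations and predictions for five frames.
truth = [{"sit", "eat"}, {"walk"}, set(), {"walk"}, {"sit"}]
predicted = [{"sit", "eat"}, {"walk", "sit"}, {"sit"}, {"walk"}, set()]

sit_scores = precision_recall(truth, predicted, "sit")
walk_scores = precision_recall(truth, predicted, "walk")
```

Here sit is predicted on two frames where it was not annotated and missed on one where it was, so its precision and recall both fall below those of walk.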


Activity  Ground Truth Frames  Precision  Recall

Walk      787                  91%        99%

Drive     545                  63%        100%

Sit       374                  59%        86%

Work      115                  44%        97%

Buy       78                   89%        83%

Read      33                   82%        87%

Eat       12                   53%        83%

Average                        71%        96%

Table 3.1: We find average rates of 96% recall and 71% precision over common activities in the dataset. Here Ground Truth Frames refers to the total number of frames labeled with each activity.

Results

We find rates of 96% recall and 71% precision across activity predictions in the dataset (Table 3.1).

When we break up these rates by activity, Augur succeeds best at activities like walk, buy and read,

with precision and recall scores of at least 82%. On the other hand, we see that the activities work,

drive, and sit cause the majority of Augur’s errors. Work is triggered by a diverse set of contextual

elements. People work at cafes or grocery stores (for their jobs), or do construction work, or work

on intellectual tasks, like writing research papers on their laptops. Our image annotations did not

capture all these interpretations of work, so Augur’s disagreement with our labeling is not surprising.

Drive is also triggered by a large number of contextual elements, including broad scene descriptors

like “store” or “cafe,” presumably because fictional characters often drive to these places. And sit is

problematic mostly because it is triggered by the common scene element “tree” (real-world people

probably do this less often than fictional characters). We also observe simpler mistakes: for example,

our computer vision algorithm thought the bookstore our subject visited was a restaurant, causing

a large precision hit to eat.

3.3.3 A stress test over #dailylife

Our third evaluation investigates whether a broad set of inputs to Augur would produce meaningful

activity predictions. We tested the quality of Augur’s predictions on a dataset of 50 images sampled

from the Instagram hashtag #dailylife. These images were taken in a variety of environments across

the world, including homes, city streets, workplaces, restaurants, shopping malls and parks. First,

we sought to measure whether Augur predicts meaningful activities given the objects in the image.

Second, we compared Augur’s predictions to the human activity that best describes each scene.


Quality                           Samples  Percent Success

Augur VSM predictions             1000     94%

Augur VSM scene recall            50       82%

Computer vision object detection  50       62%

Table 3.2: As rated by external experts, the majority of Augur’s predictions are high-quality.

Method

To construct a dataset of images containing real daily activities, we sampled 50 scene images from the

most recent posts to the Instagram #dailylife hashtag,³ skipping 4 images that did not represent

real scenes of people or objects, such as composite images and drawings.

We ran each image through an object detection service to produce a set of object tags, then

removed all non-object tags with WordNet. For each group of objects, we used Augur to generate

20 activity predictions, making 1000 in total.

We used two external evaluators to independently analyze each of these predictions as to their

plausibility given the input objects, and blind to the original photo. A third external evaluator

decided any disagreements. High quality predictions describe a human activity that is likely given

the objects in a scene: for example, using the objects street, mannequin, mirror, clothing, store

to predict the activity buy clothes. Low quality predictions are unlikely or nonsensical, such as

connecting car, street, ford, road, motor to the activity hop.

Next, we showed evaluators the original image and asked them to decide: 1) whether computer

vision had extracted the set of objects most important to understanding the scene 2) whether one

of Augur’s predictions accurately described the most important activity in each scene.

Results

The evaluators rated 94% of Augur’s predictions as high quality (Table 3.2). Among the 44 that

were low quality, many can be accounted for by tagging issues (e.g., “sink” being mistagged as a

verb). The others are largely caused by relatively uncommon objects connecting to frequent and

overly-abstract activities, for example the uncommon object “tableware” predicts “pour cereal”.

Augur makes activity predictions that accurately describe 82% of the images, despite the fact

that CV extracted the most important objects in only 62%. Augur’s knowledge base is able to

compensate for some noise in the neural net: across those images with good CV extraction, Augur

succeeded at correctly predicting the most relevant activity on 94%.

³ https://instagram.com/explore/tags/dailylife/


3.4 Discussion

Augur’s design presents a set of opportunities and limitations. First, we acknowledge that data-

driven approaches are not panaceas. Just because a pattern appears in data does not mean that

it is interpretable. For example, “boyfriend is responsible” is a statistical pattern in our text, but

it isn’t necessarily useful. Life is full of uninterpretable correlations, and developers using Augur

should be careful not to trigger unusual behaviors with such results. A crowdsourcing layer that

verifies Augur’s predictions in a specific topic area may help filter out any confusing artifacts.

Similarly, while fiction allows us to learn about an enormous and diverse set of activities, in some

cases it may present a vocabulary that is too open ended. Activities may have similar meanings,

or overly broad ones (like work in our evaluation). How does a user know which to use? In our

testing, we have found that choice of phrase is often unimportant. For example, the cosine similarity

between hail taxi and call taxi is 0.97, which means any trigger for one is in practice equivalent to

the other (or take taxi or get taxi). In this sense a large vocabulary is actively helpful. However, for

other activities choice of phrase does matter, and to identify and collapse these activities, we again

see potential for the refinement of Augur’s model through crowdsourcing.

In the process of pursuing this research, we found ourselves in many data mining dead ends.

Human behavior is complex, and natural language is complex. Our initial efforts included heavier-

handed integration with WordNet to identify object classes such as locations and people’s names;

unfortunately, “Virginia” is both. This results in many false positives. Likewise, activity prediction

requires an order of magnitude more data to train than the other APIs given the $N^2$ nature of

its skip-grams. Our initial result was that very few scenarios lent themselves to accurate activity

prediction. Our solution was to simplify our model (e.g., look only at pronouns) and gather ten

times the raw data from Wattpad. In this case, more data beat more modeling intelligence.

More broadly, Augur suggests a reinterpretation of our role as designers. Until now, the designer’s

goal in interactive systems has been to articulate the user’s goals, then fashion an interface specifically

to support those goals. Augur proposes a kind of “open-space design” where the behaviors may be

left open to the users to populate, and the designer’s goal is to design reactions that enable each of

these goals. To support such an open-ended design methodology, we see promise in Augur’s natural

language descriptions. Activities such as “sit down”, “order dessert” and “go to the movies” are

not complex activity codes but human-language descriptions. We speculate that each of Augur’s

activities could become a command. Suppose any device in a home could respond to a request to

“turn down the lights”. Today, Siri has tens of commands; Augur has potentially thousands.

Chapter 4

Modeling Signals in Human Language

Human language is colored by a broad range of topics, but existing text analysis tools only focus

on a small number of them. Here we present Empath, a tool that can generate and validate new

lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate

the category violence). Empath draws connotations between words and phrases by learning a neural

embedding across billions of words on the web. Given a small set of seed words that characterize a

category, Empath uses its neural embedding to discover related terms, then validates the category

with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories

we have generated such as neglect, government, and social media. We show that Empath’s data-

driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.

4.1 Empath

Empath analyzes text across hundreds of topics and emotions. Like LIWC and other dictionary-

based tools, it counts category terms in a text document. However, Empath covers a broader set of

categories than other tools, and can generate and validate new categories with a few seed words.
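
To illustrate what dictionary-based counting means in practice, here is a minimal sketch. It is not the Empath library's code: the function `count_categories`, the tokenizer, and the two-category lexicon are assumptions made for the example, with member terms borrowed from Table 4.1.

```python
import re

def count_categories(text, lexicon, normalize=False):
    """Count how many tokens of `text` belong to each category in `lexicon`.

    lexicon: dict mapping category name -> set of single-word member terms.
    normalize: if True, divide each count by the number of tokens in the text.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = {
        category: sum(1 for tok in tokens if tok in terms)
        for category, terms in lexicon.items()
    }
    if normalize and tokens:
        return {c: n / len(tokens) for c, n in counts.items()}
    return counts

# A tiny lexicon using a few of the terms listed in Table 4.1.
toy_lexicon = {
    "social media": {"facebook", "instagram", "notification", "selfie"},
    "war": {"attack", "battlefield", "soldier", "troop"},
}
sentence = "The soldier posted a selfie on Instagram after the attack."
result = count_categories(sentence, toy_lexicon)
```

Normalizing by document length, as in `count_categories(sentence, toy_lexicon, normalize=True)`, makes counts comparable across documents of different sizes.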

4.1.1 Designing Empath’s categories

Empath provides 200 human validated categories, which cover topics like violence, depression, or

femininity. We drew these categories from common concepts in the ConceptNet knowledge base and

Parrott’s hierarchy of emotions [71]. While Empath’s topical and emotional categories stem from

different sources of knowledge, we generate member terms for both kinds of categories in the same


social media  war          violence  technology  fear      pain      hipster   contempt

facebook      attack       hurt      ipad        horror    hurt      vintage   disdain

instagram     battlefield  break     internet    paralyze  pounding  trendy    mockery

notification  soldier      bleed     download    dread     sobbing   fashion   grudging

selfie        troop        broken    wireless    scared    gasp      designer  haughty

account       army         scar      computer    tremor    torment   artsy     caustic

timeline      enemy        hurting   email       despair   groan     1950s     censure

follower      civilian     injury    virus       panic     stung     edgy      sneer

Table 4.1: Empath can analyze text across hundreds of data-driven categories. Here we provide a sample of representative terms in 8 sample categories.

way. Given a set of seed terms (from ConceptNet or the Parrott hierarchy), Empath learns from a

large corpus of text to predict and validate hundreds of similar categorical terms.

We generate category terms by querying a vector space model trained by a neural network on

a large corpus of text. This model allows Empath to examine the similarity between words across

many dimensions of meaning. For example, given seed words like “facebook” and “twitter,” Empath

finds related terms like “pinterest” and “selfie.”

Training a neural word embedding model

To train Empath’s model, we adapt the skip-gram architecture introduced by Mikolov et al. [60].

This is an unsupervised learning method that trains a neural network to predict co-occurring words in a

corpus. For example, the network might learn that “death” predicts a nearby occurrence of the word

“carrion,” but not of “incest.” Over the course of training, the network learns a representation of each word that

is predictive of its context, and we can then borrow these representations, called neural embeddings,

to map words onto a vector space.

More formally, for word $w$ and context $C$ in a network with negative sampling, a skip-gram

network will learn weights that maximize the dot product $w \cdot w_c$ and minimize $w \cdot w_n$ for $w_c \in C$ and $w_n$ sampled randomly from the vocabulary. The context $C$ of a word is determined by a sliding

window over the document, of a size typically in (0,7).
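
One common way to turn this objective into a trainable loss is the log-sigmoid formulation used in word2vec-style negative sampling. The sketch below evaluates that loss for a single word vector; we assume this standard formulation for illustration and do not claim it matches our network's exact loss. The function names and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(w, context_vecs, negative_vecs):
    """Loss for one word vector w.

    The loss is small when w · w_c is large for every context vector w_c
    and w · w_n is small (very negative) for every sampled negative w_n.
    """
    loss = 0.0
    for wc in context_vecs:
        loss -= math.log(sigmoid(dot(w, wc)))    # reward high w · w_c
    for wn in negative_vecs:
        loss -= math.log(sigmoid(-dot(w, wn)))   # reward low w · w_n
    return loss

w = [1.0, 0.0]
# Context aligned with w, negative sample pointing away: low loss.
good = negative_sampling_loss(w, [[2.0, 0.0]], [[-2.0, 0.0]])
# Roles swapped: the context disagrees with w and the negative agrees.
bad = negative_sampling_loss(w, [[-2.0, 0.0]], [[2.0, 0.0]])
```

Gradient descent on this loss pushes $w \cdot w_c$ up and $w \cdot w_n$ down, which is the behavior described above.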

We train our network on data from Wattpad, Reddit, and the New York Times [26, 24, 25].

The network uses a hidden layer of 150 neurons (which defines the dimensionality of the embedding

space), a sliding window size of five, a minimum word count of thirty (i.e., a word must occur at

least thirty times to appear in the training set), negative sampling, and down-sampling of frequent

terms. These techniques reflect current best practices in language modeling [61].

Building categories with a vector space

We use the neural embeddings created by our skip-gram network to construct a vector space model

(VSM). Similar models trained on neural embeddings, such as word2vec, enable powerful forms of

analogical reasoning (e.g., the vector arithmetic for the terms “King - Man + Woman” produces a

vector close to “Queen”) [55]. This model allows Empath to discover member terms for categories.


Empath Category  Words that passed filter              Words removed

Domestic Work    chore, vacuum, scrubbing, laundry     find

Dance            ballet, rhythm, jukebox, dj, song     buds

Aggression       lethal, spite, betray, territorial    censure

Attractive       alluring, cute, swoon, dreamy, cute   defiantly

Nervousness      uneasiness, paranoid, fear, worry     nostalgia

Furniture        chair, mattress, desk, antique        crate

Heroic           underdog, gutsy, rescue, underdog     spoof

Exotic           aquatic, tourist, colorful, seaside   rural

Meeting          office, boardroom, presentation       homework

Fashion          stylist, shoe, tailor, salon, trendy  yoga

Table 4.2: Crowd workers found 95% of the words generated by Empath’s unsupervised model to be related to its categories. However, machine learning is not perfect, and some unrelated terms slipped through (“Words removed” above), which the crowd then removed.

VSMs encode concepts as vectors, where each dimension of the vector $v \in \mathbb{R}^n$ conveys a feature

relevant to the concept. For Empath, each vector $v$ is a word, and each of its dimensions defines the

weight of its connection to one of the hidden layer neurons. The space is $M(n \times h)$ where $n$ is the

size of our vocabulary (40,000), and $h$ the number of hidden nodes in the network (150).

Empath’s VSM selects member terms for its categories (e.g., social media, violence, shame) by

using cosine similarity, a similarity measure over vector spaces, to find nearby terms in the space.

Concretely, we search the vector space on multiple seed terms by querying on the vector sum of

those terms, a kind of reasoning by analogy. From a small seed of words, Empath can gather

hundreds of terms related to a given category, and then use these terms for textual analysis.
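
The seed-sum query can be sketched as follows. The three-dimensional embeddings are invented for illustration (the real space has 150 dimensions), and `expand_category` is a hypothetical helper rather than a function of the Empath library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_category(embeddings, seeds, top_k=2):
    """Return the top_k non-seed words closest to the sum of the seed vectors."""
    dims = len(next(iter(embeddings.values())))
    query = [sum(embeddings[s][i] for s in seeds) for i in range(dims)]
    candidates = [
        (word, cosine(query, vec))
        for word, vec in embeddings.items()
        if word not in seeds
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy embeddings with invented coordinates.
toy_embeddings = {
    "facebook": [0.9, 0.1, 0.0],
    "twitter": [0.8, 0.2, 0.1],
    "pinterest": [0.85, 0.15, 0.05],
    "selfie": [0.7, 0.3, 0.0],
    "soldier": [0.0, 0.1, 0.9],
}
members = expand_category(toy_embeddings, ["facebook", "twitter"])
```

Summing the seeds before querying favors words that are close to all seeds at once, rather than words that happen to sit near only one of them.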

4.1.2 Refining categories with crowd validation

Human-validated categories can ensure that accidental terms do not slip into a lexicon. By filtering

Empath’s categories through the crowd, we offer the benefits of both modern NLP and human

validation: increasing category precision, and more carefully validating category contents.

To validate each of Empath’s categories, we created a crowdsourcing pipeline on Amazon Me-

chanical Turk. We divided the total number of words to be filtered across many separate tasks,

where each task consists of twenty words to be rated for a given category. For each of these words,

workers select a relationship on a four point scale: not related, weakly related, related, and strongly

related. We ask three independent workers to complete each task at a cost of $0.14 per task. Prior

work has shown that three workers are enough for reliable results in labeling tasks, given high quality

contributors [72]. So, if we want to filter a category of 200 words, we would have 200/20 = 10 tasks,

which must be completed by three workers, at a total cost of 10 × 3 × $0.14 = $4.20 for this category. We

limit tasks to Masters workers to ensure quality and aggregate crowdworker feedback by majority

vote. Workers demonstrated high agreement on the labeling task (81%).
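
The cost arithmetic and the vote aggregation can be expressed compactly. The sketch below reproduces the 200-word example from the text; how ties among three workers are resolved is not stated above, so returning `None` when no label has a strict majority is an assumption, as are the function names.

```python
import math
from collections import Counter

def validation_cost(num_words, words_per_task=20, workers_per_task=3,
                    price_per_task=0.14):
    """Total cost in dollars of filtering one category through the crowd."""
    tasks = math.ceil(num_words / words_per_task)
    return tasks * workers_per_task * price_per_task

def majority_label(ratings):
    """Return the rating chosen by a strict majority of workers, else None."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes > len(ratings) / 2 else None

cost = validation_cost(200)  # 10 tasks x 3 workers x $0.14
agreed = majority_label(["related", "related", "not related"])
split = majority_label(["related", "weakly related", "not related"])
```

Rounding the task count up means a category whose size is not a multiple of twenty still pays for its final, partially filled task.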


4.1.3 Empath API and web service

Finally, to help researchers analyze text over new kinds of categories, we have released Empath as

a web service and open source library. The web service¹ allows users to analyze documents across

Empath’s built-in categories, generate new unsupervised categories, and request new categories be

validated using our crowdsourcing pipeline. The open source library² is written in Python and

similarly returns document counts across Empath’s built-in validated categories.

4.2 Empath Applications

To motivate the opportunities that Empath creates, we first present three example analyses that

illustrate its breadth and flexibility. In general, Empath allows researchers to perform text analyses

over a broader set of topical and emotional categories than existing tools, and also to create and val-

idate new categories on demand. Following this section, we explain the techniques behind Empath’s

model in more detail.

4.2.1 Example 1: Understanding deception in hotel reviews

What kinds of words accompany our lies? In our first example, we use Empath to analyze a dataset

of deceptive hotel reviews reported previously by Ott et al. [66]. This dataset contains 3200 truthful

hotel reviews mined from TripAdvisor.com and deceptive reviews created by workers on Amazon

Mechanical Turk, split among positive and negative ratings. The original study found that liars tend

to write more imaginatively, use less concrete language, and incorporate less spatial information into

their lies.

Exploring the deception dataset

We ran Empath’s full set of categories over the truthful and deceptive reviews, and produced ag-

gregate statistics for each. Using normalized means of the category counts for each group, we then

computed odds ratios and p-values for the categories most likely to appear in deceptive and truthful

reviews. All the results we report are significant after a Bonferroni correction (α = 2.5e−5).
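As a rough illustration of the two quantities used here, the sketch below computes an odds ratio from two normalized category means and a Bonferroni-corrected significance threshold. The text does not spell out the exact odds formula, so treating each normalized mean as a probability p with odds p / (1 - p) is an assumption, and the numbers in the usage lines are made up.

```ruby
# Odds corresponding to a normalized mean frequency p, with 0 < p < 1.
def odds(p)
  p / (1.0 - p)
end

# How much more likely a category is in group A than in group B.
def odds_ratio(p_a, p_b)
  odds(p_a) / odds(p_b)
end

# Bonferroni correction: divide the family-wise alpha by the number of tests.
def bonferroni_threshold(alpha, num_tests)
  alpha / num_tests
end

odds_ratio(0.02, 0.01)          # roughly 2.02 for these hypothetical means
bonferroni_threshold(0.05, 100) # per-test threshold for 100 hypothetical tests
```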

Our results provide new evidence in support of the Ott et al. study, suggesting that deceptive
reviews convey stronger sentiment across both positively and negatively charged categories, and tend
towards exaggerated language (Figure 4.1). The liars more often use language that is tormented (2.5
odds) or joyous (2.3 odds), for example “it was torture hearing the sounds of the elevator which
just would never stop” or “I got a great deal and I am so happy that I stayed here.” The truth-tellers
more often discuss concrete ideas and phenomena like the ocean (1.6 odds), vehicles (1.7
odds) or noises (1.7 odds), for example “It seemed like a nice enough place with reasonably close
beach access” or “they took forever to Valet our car.” We see a tendency towards more mundane
activities among the truth-tellers through categories like eating (1.3 odds), cleaning (1.3 odds), or
hygiene (1.2 odds): “I ran the shower for ten minutes without ever receiving any hot water.” For
the liars, interactions seem to be more evocative, involving death (1.6 odds) or partying (1.3 odds):
“The party that keeps you awake will not be your favorite band practicing for their next concert.”

Figure 4.1: Deceptive reviews convey stronger sentiment across both positively and negatively charged categories. In contrast, truthful reviews show a tendency towards more mundane activities and physical objects.

¹ http://empath.stanford.edu
² https://github.com/Ejhfast/empath

For exploratory research questions, Empath provides a high-level view over many potential cat-

egories, some of which a researcher may not have thought to investigate. Lying hotel reviewers, for

example, may not have realized they give themselves away by fixating on smell (1.4 odds), “the

room was pungent with what smelled like human excrement”, or their systematic overuse of emo-

tional terms, producing significantly higher odds ratios for 13 of Empath’s 32 emotional categories.

Truthful reviews, on the other hand, display higher odds ratios for none of Empath’s emotional

categories.

Spatial language in lies

Figure 4.2: We use Empath to replicate the work of Golder and Macy, investigating how mood on Twitter relates to time of day. The signals reported by Empath and LIWC by hour are strongly correlated for positive (r=0.87) and negative (r=0.90) sentiment.

While the original study provided some evidence that liars use less spatially descriptive language, it
wasn’t able to test the theory directly. Using Empath, we can generate a new set of human validated
terms that capture this idea, creating a new spatial category. To do so, we tell Empath to seed the
category with the terms “big”, “small”, and “circular”. Empath then discovers a series of related
terms and uses the crowd to validate them, producing the cluster:

circular, small, big, large, huge, gigantic, tiny, rectangular, rectangle, massive, giant, enormous, smallish,

rounded, middle, oval, sized, size, miniature, circle, colossal, center, triangular, shape, boxy, round,

shaped, decorative, ...

When we then add the new spatial category to our analysis, we find it favors truthful reviews

by 1.2 odds (p < 0.001). Truth-tellers use more spatial language, for example, “the room that we

originally were in had a huge square cut out of the wall that had exposed pipes, bricks, dirt and

dust.” In aggregate, liars are not as apt in these concrete details.

4.2.2 Example 2: Mood on Twitter and time of day

In our final example, we use Empath to investigate the relationship between mood on Twitter and

time of day, replicating the work of Golder and Macy [33]. While the corpus of tweets analyzed by

the original paper is not publicly available, we reproduce the paper’s findings on a smaller corpus

of 591,520 tweets from the PST time-zone, running LIWC on our data as an additional benchmark

(Figure 4.2).

The original paper shows a low of negative sentiment in the morning that rises over the rest

of the day. We find a similar relationship on our data with both Empath and LIWC: a low in

the morning (around 8am), peaking to a high around 11pm. The signals reported by Empath and

LIWC over each hour are strongly correlated (r=0.90). Using a 1-way ANOVA to test for changes in

mean negative affect by hour, Empath reports a highly significant difference (F(23, 591520) = 17.2,

p < 0.001), as does LIWC (F = 6.8, p < 0.001). For positive sentiment, Empath and LIWC again


replicate similarly with strong correlation between tools (r=0.87). Both tools once more report

highly significant ANOVAs by hour: Empath F = 5.9, p < 0.001; LIWC F = 7.3, p < 0.001.

4.3 Evaluation

Here we evaluate Empath’s crowd filtered and unsupervised predictions against gold standard cate-

gories in LIWC.

4.3.1 Comparing Empath and LIWC

The broad reach of our dataset allows Empath to classify documents among a large number of

categories. But how accurate are these categorical associations? Human inspection and crowd

filtering of Empath’s categories (Table 4.2) provide some evidence, but ideally we would like to

answer this question in a more quantitative way.

Fortunately, LIWC has been extensively validated by researchers [68], so we can use it to bench-

mark Empath’s predictions across the categories that they share in common. If we can demonstrate

that Empath provides very similar results across these categories, this would suggest that Empath’s

predictions are close to achieving gold standard accuracy. Here we compare the predictions of

Empath and LIWC over 12 shared categories: sadness, anger, positive emotion, negative emotion,

sexual, money, death, achievement, home, religion, work, and health.

Method

To compare all tools, we created a mixed textual dataset evenly divided among tweets [64], StackEx-

change opinions [19], movie reviews [67], hotel reviews [66], and chapters sampled from four classic

novels on Project Gutenberg (David Copperfield, Moby Dick, Anna Karenina, and The Count of

Monte Cristo) [1]. This mixed corpus contains more than 2 million words in total across 4500

individual documents.

Next we selected two parameters for Empath: the minimum cosine similarity for category inclu-

sion and the seed words for each category (we fixed the size of each category at a maximum of 200

words). To choose these parameters, we divided our mixed text dataset into a training corpus of

900 documents and a test corpus of 3500 documents. We selected up to five seed words that best

approximated each LIWC category, and found that a minimum cosine similarity of 0.5 offered the

best performance. We then also created crowd filtered versions of these categories.
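The interaction between seed words, the minimum cosine similarity, and the size cap can be sketched as follows. The two-dimensional vectors are toy values invented for the example, the method names are hypothetical, and summing the seed vectors into a single query vector is an assumption of this sketch, not a description of Empath's exact procedure.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

# Return up to max_size words whose vectors lie within min_similarity of the
# summed seed vectors, most similar first.
def build_category(embeddings, seeds, min_similarity: 0.5, max_size: 200)
  dims  = embeddings.values.first.size
  query = seeds.map { |w| embeddings.fetch(w) }
               .reduce(Array.new(dims, 0.0)) { |acc, v| acc.zip(v).map { |x, y| x + y } }
  embeddings
    .map { |word, vec| [word, cosine(query, vec)] }
    .select { |_, sim| sim >= min_similarity }
    .sort_by { |_, sim| -sim }
    .first(max_size)
    .map(&:first)
end

# Hypothetical toy embeddings.
TOY_EMBEDDINGS = {
  "big"   => [1.0, 0.1],
  "huge"  => [0.9, 0.2],
  "small" => [0.8, -0.1],
  "angry" => [-0.2, 1.0]
}.freeze

build_category(TOY_EMBEDDINGS, %w[big small])
```

With these toy vectors the size terms clear the 0.5 threshold and "angry" does not, which mirrors how a stricter threshold trades recall for precision.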

We ran all tools over the documents in the test corpus, recorded their category word counts,

then used these counts to compute Pearson correlations between all shared categories, as well as

aggregate overall correlations. Pearson’s r measures the linear correlation between two variables,

and returns a value in the range [-1, 1], where 1 is total positive correlation, 0 is no correlation, and -1 is

total negative correlation. These correlations speak to how well one tool approximates another.
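For reference, Pearson's r over two lists of per-document category counts can be computed as in the sketch below. The method name and the sample counts are illustrative only.

```ruby
# Pearson correlation between two equal-length lists of numbers.
def pearson(xs, ys)
  raise ArgumentError, "length mismatch" unless xs.size == ys.size

  n      = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov    = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x  = xs.sum { |x| (x - mean_x)**2 }
  var_y  = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Hypothetical category counts from two tools over the same four documents.
tool_a_counts = [1, 2, 3, 4]
tool_b_counts = [2, 4, 6, 8]
pearson(tool_a_counts, tool_b_counts) # perfectly linear, so r is 1
```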


LIWC Category   Empath   Empath+Crowd   EmoLex   General Inquirer
Positive        0.944    0.950          0.955    0.971
Negative        0.941    0.936          0.954    0.945
Sadness         0.890    0.907          0.852    -
Anger           0.889    0.894          0.837    -
Achievement     0.915    0.903          -        0.817
Religion        0.893    0.908          -        0.902
Work            0.859    0.820          -        0.745
Home            0.919    0.941          -        -
Money           0.902    0.878          -        -
Health          0.866    0.898          -        -
Sex             0.928    0.935          -        -
Death           0.856    0.901          -        -
Average         0.900    0.906          0.899    0.876

Table 4.3: We compared the classifications of LIWC, EmoLex and Empath across twelve categories, finding strong correlation between tools. The first column represents comparisons between Empath’s unsupervised model against LIWC, the second after crowd filtering against LIWC, the third between EmoLex and LIWC, and the fourth between the General Inquirer and LIWC.

To anchor this analysis, we collected benchmark Pearson correlations against LIWC for GI and

EmoLex (two existing human validated lexicons). We found a benchmark correlation of 0.876 be-

tween GI and LIWC over positive emotion, negative emotion, religion, work, and achievement, and a

correlation of 0.899 between EmoLex and LIWC over positive emotion, negative emotion, anger, and

sadness. While EmoLex and GI are commonly regarded as gold standards, they correlate imperfectly

with LIWC. We take this as evidence that gold standard lexicons can disagree: if Empath approx-

imates their performance against LIWC, it agrees with LIWC as well as other carefully-validated

dictionaries agree with LIWC.

Finally, to test the importance of choosing seed terms, we re-ran our evaluation while permuting

the seed words in Empath’s categories. Over one trial, we dropped one seed term from each category.

Over another, we replaced one term from each category with a similar alternative (e.g., “church” to

“chapel”, or “kill” to “murder”).

Results

Empath shares overall average Pearson correlations of 0.90 (unsupervised) and 0.906 (crowd) with

LIWC (Table 4.3). Over the emotional categories, Empath and LIWC agree at correlations of 0.884

(unsupervised) and 0.90 (crowd), comparing favorably with EmoLex’s correlation of 0.899. Over GI’s

benchmark categories, Empath reports 0.893 (unsupervised) and 0.91 (crowd) correlations against

LIWC, stronger performance than GI (0.876). On average, adding a crowd filter to Empath improves

its correlations with LIWC by 0.006. We plot Empath’s best and worst category correlations with

LIWC in Figure 4.3. These scores indicate that Empath and LIWC are strongly correlated – similar

to the correlation between LIWC and other published and validated tools.

Figure 4.3: Empath categories strongly agreed with LIWC, at an average Pearson correlation of 0.90. Here we plot Empath’s best and worst correlations with LIWC. Each dot in the plot corresponds to one document. Empath’s counts are graphed on the x-axis, LIWC’s on the y-axis.

In permuting Empath’s seed terms, we found it retained high unsupervised agreement with
LIWC (between 0.82 and 0.88). The correlation between tools was most strongly affected when we
dropped seeds that added a unique meaning to a category. For example, death is seeded with the
words “bury”, “coffin”, “kill”, and “corpse.” When we removed “kill” from death’s seed list,
Empath lost the adversarial aspects of death (embodied in words like “war”, “execute”, or “murder”)
and fell to 0.82 correlation with LIWC for that category. Removing death’s other seed words did
not have nearly so strong an effect. On the other hand, replacing seeds with alternative forms or
synonyms (e.g., “hate” to “hatred”, or “kill” to “murder”) usually had little impact on Empath’s
correlations with LIWC.

4.4 Discussion

Empath demonstrates an approach that crosses traditional text analysis metaphors with advances

in deep learning. Here we discuss our results and the limitations of our approach.

4.4.1 The role of human validation

While adding a crowd filter to Empath improves its overall correlations with LIWC, the improve-

ment is not statistically significant. Even more surprisingly, the crowd does not always improve

agreement at the level of individual categories. For example, across the categories negative emotion,


achievement, and work, the crowd filter slightly decreases Empath’s agreement with LIWC. When we

inspected the output of the crowd filtering step to determine what had caused this effect, we found

a small number of cases in which the crowd was overzealous. For example, the word “semester”

appears in LIWC’s work category, but the crowd removed it from Empath. Should “semester” be

in a work category? This disagreement highlights the inherent ambiguity of constructing lexicons.

In our case, when the crowd filters out a common word shared by LIWC (like “semester”), this

causes overall agreement across the corpus to decrease (through additional false negatives), despite

the appropriate removal of many other less common words.

As we see in our results, this scenario does not happen often, and when it does happen the

effect size is small. We suggest that crowd validation offers the qualitative benefit of removing false

positives from analyses, while on the whole performing almost identically to (and usually slightly

better than) the unfiltered version of Empath.

4.4.2 Data-driven: who is actually driving?

Empath, like any data-driven system, is ultimately at the mercy of its data – garbage in, garbage

out. While fiction allows Empath to learn an approximation of the gold-standard categories that

define tools like LIWC, its data-driven reasoning may succeed less well on corner cases of analysis

and connotation. Just because fictional characters often pull guns out of gloveboxes, for example,

does not mean the two should be strongly connected in Empath’s categories.

Contrary to this critique, we have found that fiction is a useful training dataset for Empath given

its abundance of concrete descriptors and emotional terms. When we replaced the word embeddings

learned by our model with alternative embeddings trained on Google News [60], we found its average

unsupervised correlation with LIWC decreased to 0.84. The Google News embeddings performed

better after significance testing on only one category, death (0.91), and much worse on several of the

others, including religion (0.78) and work (0.69). This may speak to the limited influence of fiction

bias. Fiction may suffer from the overly fanciful plot events and motifs that surround death (e.g.

suffocation, torture), but it captures more relevant words around most categories.

4.4.3 Limitations

Empath’s design decisions suggest a set of limitations. First, while Empath reports high Pearson

correlations with LIWC’s categories, it is possible that other more qualitative properties are im-

portant to lexical categories. Two lexicons can be statistically similar on the basis of word counts,

and yet one might be easier to interpret than the other, offer more representative words, or present

fewer false positives or negatives. At a higher level, the number and kinds of categories available

in Empath present a related concern. We created these categories in a data-driven manner. Do

they offer the right balance and breadth of topics? We have not evaluated Empath over these more

qualitative aspects of usability.


Second, we have not tested how well Empath’s categories generalize beyond the core set it shares

with LIWC. Do these new categories perform as well in practice? While Empath’s categories are all

generated and validated in the same way, we have seen through our evaluation that choice of seed

words can be important. What makes for a good set of seed terms? And how do we best discover

them? In future work, we hope to investigate these questions more closely.

Finally, while fiction provides a powerful model for generating lexical categories, we have also

seen that, for certain topics (e.g. death in Google News), other corpora may have even greater

potential. Could different datasets be targeted at specific categories? Mining an online fashion

forum, for example, might allow Empath to learn a more comprehensive sense of style, or Hacker

News might give it a more nuanced view of technology and startups. We see potential for training

Empath on other text beyond fiction.

4.4.4 Statistical false positives

Social science aims to avoid Type I errors — false claims that statistically appear to be true. Because

Empath expands the number of categories available for analysis, it is important to consider the risk

of a scientist analyzing so many categories that one of them, through sheer randomness, appears

to be elevated in the text. In this paper, we used Bonferroni correction to handle the issue, but

there are more mature methods available. For example, Holm’s method and FDR are often used in

statistical genomics to test thousands of hypotheses. In the case of regression analysis, it is likewise

important not to do so-called “garbage can regressions” that include every possible predictor. In this

case, models that penalize complexity (e.g., non-zero coefficients) are most appropriate, for example

LASSO or logistic regression with an L1 penalty.
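Holm's method, mentioned above as an alternative to Bonferroni correction, is a step-down procedure: p-values are sorted in ascending order and the k-th smallest is compared against alpha / (m - k + 1). A minimal sketch follows; the method name and the p-values are made up for illustration.

```ruby
# Returns the indices of the hypotheses rejected by Holm's step-down procedure.
def holm_reject(p_values, alpha = 0.05)
  m = p_values.size
  rejected = []
  ranked = p_values.each_with_index.sort_by { |pv, _| pv }
  ranked.each_with_index do |(pv, idx), rank|
    break if pv > alpha / (m - rank) # stop at the first p-value that fails
    rejected << idx
  end
  rejected.sort
end

# Plain Bonferroni (threshold 0.05 / 4 = 0.0125) would reject only the
# hypotheses with p = 0.01 and p = 0.005; Holm rejects all four here.
holm_reject([0.01, 0.04, 0.015, 0.005])
```

Holm's procedure controls the same family-wise error rate as Bonferroni while never rejecting fewer hypotheses, which is why it is often preferred when many categories are tested at once.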

Chapter 5

Modeling Patterns in Code

Interfaces need explicit rules to support users, yet common practices are uncodified across many do-

mains such as programming and writing. We hypothesize that by modeling this emergent practice,

interfaces can support a far broader set of user needs. To explore this idea, we built Codex, a knowl-

edge base that records common practice for the Ruby programming language by indexing over three

million lines of popular code. Codex enables new data-driven interfaces for programming systems:

statistical linting, identifying code that is unlikely to occur in practice and may constitute a bug;

pattern annotation, automatically discovering common programming idioms and annotating them

with metadata using expert crowdsourcing; and library generation, constructing a utility package

that encapsulates and reflects emergent software practice. We evaluate these applications to find

Codex captures a broad swath of programming practice, statistical linting detects problematic code

snippets, and pattern annotation discovers nontrivial idioms such as basic HTTP authentication and

database migration templates. Our work suggests that operationalizing practice-driven knowledge

in structured domains such as programming can enable a new class of user interfaces.

5.1 Codex

Norms of practice and convention emerge for software systems that aren’t codified in documentation.

Codex uncovers these norms by processing and aggregating millions of lines of open source code from

popular Ruby projects on Github.

5.1.1 Indexing and Abstraction

To build its database, Codex indexes more than 3,000,000 lines of code from 100 popular Ruby

projects on Github. It gathers these projects through the Github API by sorting all Ruby projects

on the number of watchers and then selecting the 100 projects most watched by other Github

users. Codex first breaks apart a project recursively into all constituent AST nodes and annotates

these nodes with metadata; next, it normalizes all the AST nodes and collapses those that share a

normalized form into a single generalized database entry. The unparsed representation of each of

these normalized nodes is a Codex snippet.

Snippets of Ruby source code tend to be syntactically unique due to high variance in identifier

names and primitive values. Pattern finding tools usually need to abstract away some properties if

they are to find meaningful statistical patterns [38, 8, 20]. While we might implement normalization

in many different ways, Codex groups together snippets that are functionally similar by standardizing

the names of local variables and primitives. For some snippets (e.g., variable assignment) Codex

also keeps track of the original identifiers to enable variable name analysis.

Specifically, Codex’s normalization renames variable identifiers, strings, symbols, and numbers.

The first unique variable in a snippet would be renamed var0, the next var1, the first string str0,

and so on. Codex does not normalize class constants and function calls, as these abstractions

provide information important to Codex’s task-oriented search functionality and statistical linting.

As programmers use many different variable names and primitive values when accomplishing a

specific task, abstracting away these names helps Codex represent the core behavior of a snippet.

For instance, consider the Ruby snippet:

[:CHI, :UIST].map do |z|
  z.to_s + "is a conference"
end

After normalization, this snippet will be:

[:sym1, :sym2].map do |var1|
  var1.to_s + "str1"
end
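A token-level approximation of this renaming can be written with Ripper, the lexer bundled with Ruby. This is a simplified sketch and not Codex's implementation, which works on AST nodes from the Parser gem: here any identifier that does not directly follow a "." is assumed to be a local variable, so bare method calls would be misclassified. Placeholders are numbered from 0, following the prose description above, and the method name `normalize` is hypothetical.

```ruby
require "ripper"

# Rename variables, symbols, strings, and numbers to canonical placeholders.
# Constants and method names that follow a "." are left untouched.
def normalize(source)
  names  = Hash.new { |h, kind| h[kind] = {} } # kind => { original => placeholder }
  rename = lambda do |kind, original|
    table = names[kind]
    table[original] ||= "#{kind}#{table.size}"
  end

  prev_type = nil
  Ripper.lex(source).map do |entry|
    type, text = entry[1], entry[2]
    out =
      case type
      when :on_ident, :on_const
        if prev_type == :on_symbeg then rename.call("sym", text)
        elsif type == :on_ident && prev_type != :on_period then rename.call("var", text)
        else text
        end
      when :on_tstring_content then rename.call("str", text)
      when :on_int, :on_float  then rename.call("num", text)
      else text
      end
    prev_type = type unless type == :on_sp # ignore spaces when tracking context
    out
  end.join
end

puts normalize(%([:CHI, :UIST].map do |z|\n  z.to_s + "is a conference"\nend\n))
```

Because the same placeholder table is reused within a snippet, both occurrences of `z` map to the same name, which is what lets structurally identical snippets collapse into one entry.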

Normalization works less well when such primitives (e.g., specific string or number values) are

vital to the interpretation of a snippet. In the future, we will only normalize snippet variable names

and identifiers if there is sufficient entropy in their definitions across similar snippets. Snippets with

vital identifiers are likely to be more consistent. Other normalization schemes may succeed as well,

but we find that this approach successfully collapses most similar snippets together.

Codex applies a map-reduce to the database, collapsing AST nodes with the same normalized

form into a single AST entry. We collect additional parameters as part of the map-reduce step:

files, a list of each file in which the snippet occurs; projects, a list of projects in which the snippet
appears; count, the total number of times a snippet has appeared; file count, the number of times a
snippet has appeared in unique files; and project count, the number of times a snippet has appeared in

unique projects. Codex uses these parameters to enable the statistical and pattern finding modules.
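The collapsing step can be pictured as a group-and-aggregate over snippet occurrences. The sketch below runs in memory on a handful of hypothetical records; the field names and the `collapse` method are illustrative and do not reflect Codex's actual database schema.

```ruby
# Each occurrence is a hypothetical record: a normalized form plus where it was seen.
def collapse(occurrences)
  occurrences.group_by { |o| o[:normalized] }.map do |form, group|
    files    = group.map { |o| o[:file] }.uniq
    projects = group.map { |o| o[:project] }.uniq
    {
      snippet: form,
      files: files,
      projects: projects,
      count: group.size,            # every occurrence
      file_count: files.size,       # distinct files
      project_count: projects.size  # distinct projects
    }
  end
end

occurrences = [
  { normalized: "var0.to_s",   file: "rails/a.rb",   project: "rails" },
  { normalized: "var0.to_s",   file: "rails/a.rb",   project: "rails" },
  { normalized: "var0.to_s",   file: "sinatra/b.rb", project: "sinatra" },
  { normalized: "var0 = var1", file: "rails/a.rb",   project: "rails" }
]
collapse(occurrences)
```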

Codex uses the Parser and AST Ruby gems by whitequark for AST processing. We have deployed

the Codex database on Heroku, using RethinkDB and MongoHQ.


5.1.2 Statistical Analysis Module

Codex has modules that enable high-level and low-level pattern detection. First we describe the

low-level module, which focuses on syntactical patterns that occur among AST nodes.

The statistical analysis module allows Codex to warn users when code is unlikely. Codex decides
this likelihood using a set of statistics: the frequency of the snippet and also the frequencies of
component forms of the snippet (e.g., .to_s and .split for .split.to_s). When a snippet’s component
forms are sufficiently common and the snippet itself is sufficiently uncommon, Codex labels it
unlikely; that is, a snippet s must have occurred fewer than t times and each of its component pieces
c_i must have occurred at least t_i times.
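Reading "sufficiently common" as a lower bound on each component's count and "sufficiently uncommon" as an upper bound on the snippet's count, the rule reduces to a short predicate. The thresholds in the usage lines are hypothetical values chosen for illustration; the text does not report the ones Codex uses.

```ruby
# snippet_count:    how often the whole snippet s was observed
# component_counts: how often each component c_i was observed
# t:                upper bound on the snippet count
# component_mins:   lower bounds t_i, one per component
def unlikely?(snippet_count, component_counts, t:, component_mins:)
  snippet_count < t &&
    component_counts.zip(component_mins).all? { |count, t_i| count >= t_i }
end

# A rare combination of two common pieces is flagged.
unlikely?(1, [57, 100_000], t: 5, component_mins: [50, 1_000])

# A rare combination involving a rare piece is not: there is too little evidence.
unlikely?(1, [3, 100_000], t: 5, component_mins: [50, 1_000])
```

Requiring the components to be common is what keeps the detector from flagging code merely because it uses an obscure function.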

Detecting Surprisingly Unlikely Code

Codex indexes many kinds of AST nodes (e.g., blocks, conditionals, assignment statements, function

calls, function definitions), but it conducts syntactic analysis upon a subset of these nodes. The

function by which a snippet of unlikely code is declared surprising differs based upon the type of

node in question. We discuss four representative analyses we have built to demonstrate the system:

1. Function Call Analysis: This analysis checks how many times a function has been called with

a given “type signature”, which Codex defines as the kind of AST nodes passed as arguments

(not the runtime type of the expression), relative to the number of times the function has been

called with other kinds of signatures. If a sufficiently common function appears with a type

signature that is very rarely observed by Codex, this may suggest problematic code. In
split(' ', 2), s is split(string, number); c1 is the name of the function, split; and c2 is the function
signature, e.g., [string, number]. Codex checks how many times split is called with string and integer
arguments relative to other kinds of arguments.

2. Function Chaining Analysis: This analysis checks how many times one function has been

chained with another; that is, the result of some first function is used as the caller of some

second function. Here s is the function chain, e.g., split.to_s; c1 is the first function, e.g.,
split; and c2 is the second, e.g., to_s. Two functions that are often used but never chained

together suggest unusual code.

3. Block Return Value Analysis: This analysis checks how many times a certain kind of block has

returned a certain kind of value. For instance, it would be legal but unusual to write the code

things.each { |x| x.to_s }, which does transform every element in the things list to a string,
but does not alter things itself since to_s does not change the state of its caller (to change
the values in things, a programmer might instead use things.map! { |x| x.to_s }). Here s is a
kind of block with a particular return type, e.g., an each block with return
type of the to_s function; c1 is a kind of block, e.g., an each block; and c2 is a kind of block
return type, e.g., blocks returning the to_s function.


4. Identifier Analysis: This analysis checks how many times a variable identifier has been assigned

with a certain type of primitive. Often variable names suggest the type of the variable that they

reference; this analysis allows Codex to warn programmers about misleading or unconventional

variable names (e.g., str = 0 or my_array = {}). Here s is the variable name as assigned to a

particular type, e.g., str = 0; and c1 is the variable name, e.g., str.
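As one concrete instance of these analyses, the function call check can be sketched as a relative-frequency test over observed signatures. The data layout, the method name, and both cutoffs (`min_calls`, `max_share`) are assumptions made for this example, and the counts are invented.

```ruby
# observed: hypothetical map of function name => { signature => times seen }
def rare_signature?(observed, function, signature, min_calls: 100, max_share: 0.01)
  counts = observed.fetch(function, {})
  total  = counts.values.sum
  return false if total < min_calls # the function itself is too rare to judge

  counts.fetch(signature, 0) / total.to_f <= max_share
end

observed = {
  "split" => {
    %w[string]        => 900,
    %w[string number] => 95,
    %w[number string] => 1
  }
}

rare_signature?(observed, "split", %w[number string]) # flagged as unusual
rare_signature?(observed, "split", %w[string number]) # common enough
```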

5.1.3 Pattern Finding Module

Whereas the statistical analysis module focuses on low-level syntactical structure, the pattern finding

module detects a set of high-level Ruby idioms and example snippets commonly reused by program-

mers. By constructing an appropriate query over the normalized snippets in its database, Codex can

find snippets that isolate common programming idioms. The pattern finding module also enables

other specific kinds of queries based on context (e.g., searching for certain library methods called

from within a map block.)

The general form of Codex’s pattern finding consists of a single query that is applied to the

database of abstracted snippets; we intend it to filter out snippets that programmers are less likely

to find interesting or useful. The query has five parameters, corresponding to attributes stored in

the database, and ordered here by their selectivity:

1. Project Count : the number of unique projects in which an abstracted snippet has occurred. A

lower bound of 2% of the number of projects indexed by Codex filters out snippets that tend

to be longer and more idiosyncratic.

2. Total Count : the total number of times an abstracted snippet has occurred. An upper bound

of the 90th percentile filters out overly trivial snippets (e.g., var0 = var1).

3. File Count : the total number of unique files in which an abstracted snippet has occurred. An upper

bound of 20% of the count of an abstracted snippet filters out snippets that are reused quite

a bit within one or more files; these snippets tend to be overly domain specific.

4. Token Count : the number of unique variables, function calls, and primitives that occur in an

abstracted snippet. An upper bound of the 80th percentile of all snippet token counts filters

out overly domain specific code.

5. Function Count : the number of unique function calls in a snippet. A lower bound of 2 filters

out trivial snippets.

These snippets are then passed to expert crowds, who attach metadata such as a title, description,

and measure of recommended usefulness.

Together, these parameters produce 9,693 abstracted snippets from the Codex database, corre-

sponding to 79,720 original snippets in the index. This query is designed to produce general purpose

snippets; other queries might be constructed differently to produce more domain specific results.
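The query can be approximated as a filter over snippet records. The sketch below covers four of the five parameters; the file count bound is left out because this example does not attempt to pin down how that ratio is computed. The nearest-rank percentile definition, the record fields, and the method names are all assumptions.

```ruby
# Nearest-rank percentile (an assumed definition; the text does not specify one).
def percentile(values, pct)
  sorted = values.sort
  rank   = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def candidate_patterns(snippets, projects_indexed:)
  count_cap = percentile(snippets.map { |s| s[:count] }, 90)
  token_cap = percentile(snippets.map { |s| s[:token_count] }, 80)
  snippets.select do |s|
    s[:project_count] >= 0.02 * projects_indexed && # seen across enough projects
      s[:count] <= count_cap &&                      # not overwhelmingly common
      s[:token_count] <= token_cap &&                # not too long or specific
      s[:function_count] >= 2                        # not trivial
  end
end

# Hypothetical snippet records.
snippets = [
  { id: "a", project_count: 40, count: 5000, token_count: 2,  function_count: 1 },
  { id: "b", project_count: 10, count: 120,  token_count: 4,  function_count: 2 },
  { id: "c", project_count: 1,  count: 30,   token_count: 12, function_count: 5 }
]
candidate_patterns(snippets, projects_indexed: 100).map { |s| s[:id] }
```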


Figure 5.1: The Codex IDE calls out a snippet of unlikely code by a yellow highlight in its gutter. Warning text appears in the footer.

5.2 Codex Applications

To ground the opportunities that Codex creates, we begin by introducing three software engineering

applications that draw on the Codex data model and high or low level code analysis. In general,

Codex enables interfaces and applications that are supported by emergent programming behavior

rather than a set of special-cased rules. Following this section, we discuss the techniques behind

these applications in more detail.

5.2.1 Statistical Linting

Sometimes, developers program badly: they write code that performs in unexpected ways or violates

language conventions. Poorly written code causes significant damage to software projects; bugs tax

programmers’ time and energy, and code written in an abstruse or non-idiomatic style is more

difficult to maintain [70, 30]. Given the complexity of programming languages, rule-based linters

can’t catch much of this unusual or non-idiomatic code.

Codex operates on the insight that poorly written code is often syntactically different from well

written code. For example, functions might be used in the wrong combination or order. So if we

collect and index a set of code that is representative of best practices, bad code will often diverge

syntactically from the code in this index. Not all syntactically divergent code is bad — the space

of well written Ruby programs is very large — but by applying high-precision detectors to a few

general AST patterns, Codex can detect syntactically divergent code that is likely to be problematic.

Function Chaining and Composition

Programmers frequently chain and compose functions and operators to create complex algorithmic

pipelines, but chaining the wrong kinds of functions together will often cause subtle program bugs.

For example, bugs might arise from functions chained in the wrong order, or variables added or

assigned in ways they should not be. Codex helps programmers find potential bugs in function

chains by identifying unlikely combinations of functions.

For example, if Ava is querying a database that has been normalized to lower case, she needs to

convert a string held by the variable name to lower case form. She intends to assign the lower case


variant of name to the variable lower_case_name and use this new variable in her query. The Ruby

methods downcase and downcase! will both convert a string variable to lower case, and without

thinking too deeply, Ava codes: lower_case_name = name.downcase!

Unfortunately, Ava has forgotten that downcase! has a side-effect: it changes the variable name

in place and returns itself. The function she ought to have used is downcase, which returns a new

lower cased string and does not change name. When Ava later uses name elsewhere in her program,

it doesn’t hold the value she expects.

Codex indicates that the line of code is statistically unlikely: downcase! is not commonly chained

with an assignment statement (although such code is not technically incorrect). Codex notifies Ava

that it has observed downcase! 57 times, and the abstraction var = var.any_method more than

100,000 times, but it has only encountered one variant of Ava’s combined snippet. However, Codex

has encountered variants of the correct snippet, lower_case_name = name.downcase, more than 200

times.
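The difference between the two methods is easy to confirm directly. The helper names below are hypothetical and exist only to contrast the two behaviors.

```ruby
# Buggy: downcase! lowercases the receiver in place, so the caller's string changes too.
def query_key_buggy(name)
  name.downcase!
end

# Intended: downcase returns a new string and leaves the receiver untouched.
def query_key(name)
  name.downcase
end

name = "Ava".dup # dup guarantees a mutable string
key  = query_key_buggy(name)
# name is now "ava", and key is the very same object as name

name = "Ava".dup
key  = query_key(name)
# name is still "Ava"; key is a separate string, "ava"
```

There is a second hazard in the buggy form: downcase! returns nil when the string is already lower case, so the assigned variable can end up holding nil.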

Block Return Value Analysis

Ruby programmers often manipulate data by passing blocks (lambda-like closures) to functions, but

using the wrong kind of block, or passing a block to the wrong kind of function, can process data in

unintended ways. Codex identifies unlikely pairings of functions and block return values.

For example, as part of a data analysis pipeline, Ash wants to raise every number in a list to the

power of 2. He tries to do this with a map block, but encounters a problem (he uses the operator ^ in

place of **) and adds a puts (print) statement inside the map block to help him debug his mistake:

new_nums =
  nums.map do |x|
    x^2
    puts x
  end

In doing so, Ash has introduced another bug. The puts method returns nil, which means that

new_nums will be a list of nil. When Ash runs his code, this new error complicates his old problem.

Codex returns a warning: most programmers do not return the method puts from a map block.

We can anchor this concern in data: Codex has observed map blocks 4297 times and puts statements

5335 times, but it has never observed puts as the last line (an implicit return) of a map block. However,

Codex observes that puts statements are a common return value of blocks that are predominantly

used for control flow, like each (observed 272 times), so it produces a warning.
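
Both of Ash's mistakes can be reproduced in a few lines. This is a minimal sketch with an assumed three-element input; the variable names other than nums are hypothetical.

```ruby
nums = [1, 2, 3]

# Ash's debugging version: puts is the last expression in the block,
# so every element of the result is nil (puts returns nil).
buggy_nums = nums.map do |x|
  x ** 2
  puts x
end

# Moving puts above the value restores the intended implicit return.
squared_nums = nums.map do |x|
  puts x
  x ** 2
end

# The original mistake: ^ is bitwise XOR on integers, not exponentiation.
xor_nums = nums.map { |x| x ^ 2 }
```

Here buggy_nums holds only nil values, while squared_nums holds the squares. The XOR version runs without any error, which is why the first bug was hard to spot.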

Function Type Analysis

Passing the wrong kinds of arguments to a function, or passing positional arguments in the wrong

order, can lead to many subtle bugs — especially in duck typed languages like Ruby. However,


by analyzing the kinds of AST nodes passed as positional arguments to functions, Codex can warn

users about unlikely function signatures.

For example, Severian wants to divide a few hundred datapoints into ten buckets, depending on

their id number. To do this, he needs to initialize an array of ten elements, where each element is a

hash. Severian codes: Array.new({},10).

Unfortunately, Severian doesn’t often initialize arrays with specific lengths and values, and he has

reversed the arguments of Array.new. When he executes his code, it fails with the error, “TypeError:

can’t convert Hash into Integer.”

Codex could have told Severian that programmers don’t often pass Array.new an argument list

composed of a string and integer. While Codex observes Array.new 674 times, it has never observed

Array.new with string and integer arguments. However, Codex observes the correct parameterization

Array.new(integer,string) several times, which is the correct version of Severian’s code.
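
Severian's error and two candidate fixes can be sketched as follows. The block form is an addition that the text above does not discuss: with Array.new(size, default), every slot refers to the same default object, which is probably not what a bucketing task wants. The variable names are illustrative.

```ruby
# Reversed arguments: the size must come first, so this raises TypeError.
reversed_error =
  begin
    Array.new({}, 10)
    nil
  rescue TypeError => e
    e
  end

# Size first, default value second. Every slot holds the SAME hash object.
shared_buckets = Array.new(10, {})
shared_buckets[0][:id] = 1

# Block form: each slot gets its own fresh hash.
buckets = Array.new(10) { {} }
buckets[0][:id] = 1
```

After these assignments, shared_buckets[1] also contains the :id key, while buckets[1] is still empty.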

Variable Name Analysis

Good variable names provide important signals about how a variable should be treated and lead

to more readable code [70]. Likewise, badly named variables can lead to poor code readability and

downstream program errors. By analyzing variable name associations with primitive values (e.g.,

Strings, Integers, Hashes), Codex can warn programmers about violations of naming conventions.

For example, Azazel is writing a complicated function to process a large dataset from a database

call. He is collecting the data in an Array called array. However, he later realizes that a hash

would be simpler to manage and changes the variable type. In a rush, Azazel changes the variable’s

type but doesn’t bother to change its name: array = {}. Later, Ash, who is Azazel’s coworker, is

looking elsewhere in the function and sees a line array.keys { ... }. He wonders, does an Array

have keys? He hadn’t thought so.

Instead, Codex notifies Azazel that most programmers do not initialize a variable named array

with a Hash value. While Codex observes initializations to variables named array 116 times and

variables assigned a Hash value many thousands of times, it has never observed the two together.

Instead, Codex observes array = [] 46 times.

It is not wrong to assign a Hash value to a variable named array, but code that does so is likely

less readable and might lead to downstream errors. Codex can determine that such an assignment

violates Ruby convention. Likewise, Codex would notice integers stored in str or common loop

count iterators like i being initialized with other variable types.
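
One way to picture this analysis is as a tally of which value types each variable name is initialized with. The sketch below is a toy under stated assumptions: the observations, the unconventional? helper, and the min_seen cutoff are all hypothetical, and Codex's actual rule is based on probability thresholds over its index rather than this simple count.

```ruby
# Hypothetical observations of (variable name, assigned value class) pairs,
# standing in for what an index of real code might contain.
observations = [
  ["array", Array], ["array", Array], ["array", Array],
  ["str", String], ["str", String],
  ["i", Integer], ["i", Integer], ["i", Integer]
]

# Count how often each name is initialized with each class.
name_type_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
observations.each { |name, klass| name_type_counts[name][klass] += 1 }

# Flag an assignment when the name is well known but has never been seen
# with this class. min_seen is an assumed cutoff, not a value from Codex.
def unconventional?(counts, name, klass, min_seen: 3)
  seen = counts.fetch(name, {})
  seen.values.sum >= min_seen && seen.fetch(klass, 0).zero?
end

warn_array_hash = unconventional?(name_type_counts, "array", Hash)   # flagged
warn_array_list = unconventional?(name_type_counts, "array", Array)  # conventional
warn_rare_name  = unconventional?(name_type_counts, "str", Integer)  # too few sightings to judge
```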

The Codex IDE integrates CodexLint (Figure 5.1), allowing users to call up statistics about any

line of code in the editor. The linter also runs behind the scenes during development, highlighting

any unlikely code with a yellow overlay on the window gutter. When the cursor moves over a marked

line, a small message appears on the lower bar of the Codex window, e.g., “We have seen the function

split 30,000 times and strip 20,000 times, but we’ve never seen them chained together.”
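
A message of this kind could be assembled from simple frequency counts. The following is a sketch only: chain_warning, its min_seen cutoff, and the sample counts are hypothetical and do not reproduce CodexLint's actual thresholds.

```ruby
# counts:      how often each function was observed (hypothetical numbers)
# pair_counts: how often two functions were observed chained together
# min_seen:    assumed cutoff for "common enough to judge"
def chain_warning(counts, pair_counts, first, second, min_seen: 1_000)
  a = counts.fetch(first, 0)
  b = counts.fetch(second, 0)
  together = pair_counts.fetch([first, second], 0)
  return nil unless a >= min_seen && b >= min_seen && together.zero?

  "We have seen the function #{first} #{a} times and #{second} #{b} times, " \
    "but we've never seen them chained together."
end

counts = { "split" => 30_000, "strip" => 20_000, "map" => 4_297 }
pair_counts = { ["split", "map"] => 120 }

message = chain_warning(counts, pair_counts, "split", "strip")
no_message = chain_warning(counts, pair_counts, "split", "map")
```

Requiring both functions to be common before warning is one simple way to avoid flagging rarely seen library calls.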


Example Codex Annotated Snippets

HTTP Basic Auth

if var0.user

var1.basic_auth(var0.user, var0.password)

end

Sets the basic-auth parameters (username and password) before making an HTTP request, perhaps using Net::HTTP

Popping Options Hash from Arguments

if Hash === var0.last

var0.pop

else

{}

end

Pops the last element from the list ’var0’ if it is a Hash. Gives an empty hash if the last element is not a Hash

Raise StandardError

raise(StandardError.new("str0"))

Raise a StandardError exception using “str0” as exception message

Configure action controller to disable caching

config.action_controller.perform_caching=(false)

This will set a global configuration related to caching in action controller to false

Table 5.1: Codex identifies common programming snippets automatically, then feeds them to crowd-sourced expert programmers for metadata such as the bolded title and descriptive text.

5.2.2 Pattern Annotation

Many valuable programming idioms are not collected in documentation or on the web. While users

can access standard library documentation for core abstractions (e.g., for Ruby, http://ruby-doc.org/),

and libraries often ship with similar kinds of documentation provided by their maintainers, the

common idioms by which basic functions may be combined and extended often remain uncodified.

Instead, these idioms live in the minds of programmers and — sometimes — on the message boards

of communities and forums. Novice users of languages and libraries must “mind the gap” present in

official forms of documentation.

Codex fills in gaps of practice-driven knowledge by detecting common idioms as it indexes code

and sending them out to be filtered and annotated by a crowd of human experts. Codex finds

these idioms by selecting snippets in its database with query parameters such as commonality and

complexity. These selected snippets (e.g., that appear in a large number of unique projects and are

sufficiently nontrivial) are primed for annotation and human filtering. For instance, over the course

of its indexing, Codex identifies inject { |x,y| x + y } as a common snippet, occurring 15 times


across 4 projects.
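
The selection step described above might look like the following sketch. The Snippet record, its field names, the complexity scores, and both cutoffs are assumptions made for illustration; only the inject snippet and its counts (15 occurrences across 4 projects) come from the text.

```ruby
# Hypothetical record for an indexed snippet; field names are assumptions.
Snippet = Struct.new(:code, :occurrences, :projects, :complexity)

snippets = [
  Snippet.new("inject { |x,y| x + y }", 15, 4, 6),
  Snippet.new("var0 = var1", 90_000, 100, 1),    # common but trivial
  Snippet.new("var0.frobnicate(var1)", 3, 1, 5)  # nontrivial but rare
]

# Keep snippets that appear in enough distinct projects and are
# sufficiently nontrivial. Both cutoffs are illustrative.
def annotation_candidates(snippets, min_projects: 3, min_complexity: 4)
  snippets.select do |s|
    s.projects >= min_projects && s.complexity >= min_complexity
  end
end

candidates = annotation_candidates(snippets).map(&:code)

# What the selected idiom does: sum the elements of a list.
total = [1, 2, 3].inject { |x, y| x + y }
```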

Next, Codex sends these snippets—strings of Ruby code, along with examples of them in use—to

a Ruby expert on oDesk, a paid expert crowdsourcing platform. The worker annotates the snippet

with a title (e.g., “sum all the elements in a list”), a description, and a vote of how useful the snippet

would be for an everyday Ruby programmer. Codex stores these annotations in its index along with

the original source snippet, making previously implicit knowledge explicit. Eventually, we envision

a community of Ruby programmers that annotates snippets of interest.

The Codex IDE uses this annotated snippet information to provide higher-level interpretability

to code. The annotations appear whenever a programmer opens a file containing the idiom. Users

benefit from annotated code under many different scenarios: perhaps using code scavenged from

a web tutorial; opening an unfamiliar file passed on by a collaborator; revisiting a segment of

copy/pasted code; or trying to recall the use of an idiosyncratic library function.

Consider one such user, Morwenna, who is collaborating with a colleague on a Ruby on Rails application.

Morwenna hasn’t had much experience with Rails, so she begins navigating the many files

of her colleague’s code in an attempt to build familiarity with the framework. While visiting

config/application.rb, Morwenna comes across the snippet config.action_controller.perform_caching =

false and wonders what this means. Codex indicates the line has an annotation, so she asks to see

it. The annotation reads, “Turns off default Rails caching.”

The Codex IDE calls out and displays any available and relevant annotations (Figure 1.3). When

the cursor moves over a line where annotations are available, a user can call them into the sidebar.

We present examples of these annotated snippets in Table 5.1. In general, Codex’s annotation

system uncovers higher-level connections between more basic program components. For instance,

human workers can infer the relation of a snippet to some outside library, providing context that isn’t

explicitly present (e.g., Net::HTTP or Ruby on Rails). Similarly, Codex allows for the documentation

of higher-level idioms, where programmers can find each component in documentation but not the

snippet itself, like the combination of raise and StandardError.new.

Querying for Understanding

In addition to the automatic idiom detection provided by the pattern finding module, users can query

Codex directly to better understand community practices around a line or block of code. Queryable

parameters include the type of AST node (e.g., a block, conditional, or function call), the body

string of the normalized code associated with a node, the body string of original code, the amount

of information contained in an AST node (i.e., a measure of code complexity), and the frequency of

a node’s occurrence across files and projects.

For instance, from a library-driven standpoint, suppose that programmers want to know more

about how people use the Net::HTTP class. They can query for all blocks that contain Net::HTTP.new,

sorting on the ones that occur most often. By the diversity of this result set, programmers gain

a sense of the kinds of context in which Net::HTTP is used, and even more so if any of the results

have been annotated by Codex’s crowdsourcing engine. This is a more query-driven approach to

example-driven development [12, 65].

Function                      Description
Array#sort_by_index(idx)      Sort an array by the value at idx
Array#convert_join(str)       Converts each array element to a string then joins them all on str
Array#upto_size               Create a range, same size as the array
String#capital_tokens(str)    Capitalize all tokens in a string
Hash.nested                   Create a hash with default value {}
Hash#get(key)                 Retrieve based on :key or “key”
File#try_close                Close a file if it’s open

Table 5.2: A sample of functions from CodexLib, detected in emergent programming practice and encapsulated into a new standard library.

Queries also have applications in other more IDE-specific components like auto-complete, where

the IDE might attempt to find the most common completion for a snippet of code, given additional

context. For example, with the line Hash.new and an open block, Codex suggests the completion block

{ |h,k| h[k] = [] }, which initializes the default value of a hash to a new empty Array. Codex’s

user query system enables a broad set of functionalities including code search, auto-complete, and

example discovery.
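
For readers unfamiliar with the suggested completion, the sketch below shows what the block form of Hash.new does, using made-up data. The contrast with Hash.new([]) is an addition beyond the text, included because it is a common source of confusion.

```ruby
# The suggested completion: missing keys default to a fresh empty Array,
# which is stored in the hash on first access.
groups = Hash.new { |h, k| h[k] = [] }

# Illustrative data: group words by their first letter.
%w[apple avocado banana].each { |word| groups[word[0]] << word }

# Contrast: Hash.new([]) shares ONE default array and never stores the key.
shared = Hash.new([])
shared[:a] << 1
```

After this runs, groups maps "a" and "b" to separate arrays, while shared still has no keys even though its single default array now contains 1.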

5.2.3 Library Generation

Many of the Ruby snippets discovered by Codex are modular, reusable components. The recomposable

nature of these snippets suggests that programmers might benefit from their encapsulation in a

new standard library that reflects the “missing” functionality that Ruby programmers actually use.

Programmers may sometimes engage in unnecessary work: both the mechanical work of typing out

repetitive syntax, and also the mental work of caching task-oriented semantics in working memory.

Here we present CodexLib, a library created by emergent practice (Table 5.2). Unlike human

language, which evolves over time (e.g., “personal computer” becomes “PC” and “smartphone”

emerges to describe a new class of devices), programming languages and libraries often remain more

static. CodexLib suggests programming libraries can similarly evolve based on actual usage.

Consider one common Ruby idiom, creating a new Hash object where its default lookup value is

another empty Hash. This nested hash object allows programmers to write code in a matrix-like style,

e.g., hash["Gaiman"]["Coraline"] = true. Programmers usually create a nested hash with the

snippet Hash.new { |h,k| h[k] = {} }. The nested hash idiom is 22 characters long (ignoring whitespace) and involves

some nontrivial tracking of syntactic details, yet it appears 66 times across 12 projects. Programmers

would likely benefit from the creation of a shorter library function. Using CodexLib, they can create

a new nested hash with the code Hash.nested, which is only 11 characters long and has far fewer


opportunities for error.
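
A helper of this kind could be defined in a few lines. The following is a sketch of one plausible definition that wraps the idiom quoted above; it is not the CodexLib source, and it nests only one level deep.

```ruby
class Hash
  # Sketch of a Hash.nested helper: missing keys default to a new empty
  # Hash, which is stored so that later writes persist.
  def self.nested
    new { |h, k| h[k] = {} }
  end
end

books = Hash.nested
books["Gaiman"]["Coraline"] = true
```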

Alternatively, consider the Ruby idiom to capitalize each word token in a string, which occurs

10 times across 5 different projects:

var0.split(/str0/).map do |var1|
  var1.capitalize
end.join("str0")

This idiom is dense and not immediately self-descriptive; it contains four function calls and a

block within three lines. The code splits var0 on str0 (in practice, usually “ ”) to produce an array,

applies capitalize to each element in this array, then uses join to knit the array into a new string again

using str0. Programmers might benefit from a simpler way to express this task. Using CodexLib

they can achieve the same result with the shorthand code: var0.capital_tokens("str0").
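
The shorthand could be implemented by wrapping the idiom directly. This is a sketch under the assumption that the separator is passed as a plain string (the idiom above splits on a regular expression); it is not the CodexLib source.

```ruby
class String
  # Sketch of a capital_tokens helper: split on sep, capitalize each
  # token, and rejoin with the same separator.
  def capital_tokens(sep)
    split(sep).map(&:capitalize).join(sep)
  end
end

title = "the book of the new sun".capital_tokens(" ")
```

Note that String#capitalize also lower-cases the remainder of each token, so mixed-case input such as "iPhone" would come back as "Iphone".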

CodexLib is a layer on top of the Codex snippet database. To construct it, we extract the most

popular idioms and their crowdsourced descriptions from the database. For this small number of

functions, it is feasible to manually write function signatures and encapsulate them in new class

methods for Hash, Array, String, Float, File, and IO (Table 5.2). Programmers can download this

library as a Ruby gem at http://hci.st/codexlib.

5.3 Evaluation

Codex hypothesizes that we can build new software engineering interfaces by using databases that

model practice-driven knowledge for programming languages. In this section, we provide evidence

for three claims:

1. The 3,000,000 lines of code in the Codex database are sufficient to characterize and analyze a

broad swath of program behavior. We measure the redundancy of AST nodes as Codex indexes

increasing amounts of code.

2. Codex captures a set of snippets that are recomposable and task-oriented. We ask oDesk Ruby

experts to describe and review a subset of the Codex patterns.

3. Codex allows us to identify unlikely code, without too many false positives. We evaluate the

number and kinds of warnings that Codex throws across a test set of 49,735 lines of code.

5.3.1 The Codex Database

The Codex database is composed of more than 3,000,000 lines of open source code, indexed from

100 popular Ruby projects on GitHub. These projects come from a diverse set of application areas,

including programming languages, plugins, webservers, web applications, databases, testing suites,

and API wrappers.

Figure 5.2: A plot of Codex’s hit rate as it indexes code over four random samples of file orderings. The y-axis plots the database hit rate, and the x-axis plots the number of lines of code indexed.

We designed Codex to reflect programming practice. Programming is open ended — the number

of valid strings of source code in most languages is infinite — so no database can hold information

about every possible AST node or program. However, programming is also highly redundant when

examined at a small enough level of granularity [29]. Of the approximately 7 million AST nodes that

Codex has indexed, only 13% are unique after normalization. Among the more complex types of

AST nodes we see variability in this redundancy. For example, among block nodes 74% are unique,

and among class nodes 85% are unique (Table 5.3).

To evaluate the breadth of code that Codex knows about, we examine the overall hit rate of its

database as it indexes more code. That is, when indexing N lines of code, what percentage of its

normalized AST nodes have already been seen by the time they are added to the database? We analyzed

the raw Codex dataset for values ranging from 92 to 3,000,000 lines of code across four random

samples of file ordering.

Codex’s hit rate exceeds 80% after 500,000 lines of code (Figure 5.2), meaning that Codex had

already observed 80% of the AST nodes after normalization. Different AST node types display

slightly different curves, with the same overall shape. Many of the nodes we are interested in for

statistical analysis are more complex, and so they are less amenable to the leveling of this curve.

However, were Codex to index more code, its hit rate would increase even further.
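
The hit rate measurement can be sketched as a single pass over a stream of normalized nodes. The hit_rate helper and the sample stream are hypothetical; the sketch assumes each node has already been normalized to a string.

```ruby
require "set"

# Fraction of nodes that were already in the index when they arrived.
# `nodes` is assumed to be an ordered list of normalized AST node strings.
def hit_rate(nodes)
  return 0.0 if nodes.empty?

  seen = Set.new
  # Set#add? returns nil when the element is already present (a "hit").
  hits = nodes.count { |node| seen.add?(node).nil? }
  hits.to_f / nodes.size
end

stream = [
  "var0 = str0", "var0.strip", "var0 = str0",
  "var0 = str0", "puts(var0)", "var0.strip"
]
rate = hit_rate(stream) # 3 of the 6 nodes repeat an earlier one
```

Shuffling the input and re-running the calculation at increasing prefixes would give one curve per file ordering, as in the four samples described above.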

5.3.2 Pattern Annotation

We asked professional Ruby programmers on the oDesk expert crowdsourcing marketplace to annotate

500 Codex snippets randomly sampled from the approximately 10,000 snippets that passed

Codex’s general pattern finding filter.

First, we asked crowdworkers to label each snippet with one of the categories: Data or Control

Flow, Standard library, External library, and Other (Table 5.4). The majority of snippets address

standard library tasks (76%), followed by external library tasks (14%), and tasks involving data or

control flow (9%). None fell outside these categories (Other = 0%).

Node Type              Percent Unique
Class definition       85%
Rescue statement       78%
Block statement        74%
Function definition    69%
If statement           66%
Interpolated string    29%
Function call          28%
Inlined hash           17%

Table 5.3: The percent of snippets that are unique after normalization for common AST node types.

Category                Percent of Snippets
Standard Library        76%
External Library        14%
Data or Control Flow    9%

Table 5.4: Programmers from an expert crowdsourcing market annotated Codex’s idioms with their usage type. The vast majority concern the use of standard, built-in libraries.

Next, we asked oDesk crowdworkers to answer: 1) Is this snippet a useful programming task or

idiom? 2) Can this snippet be encapsulated into a separate standalone function? 3) Is there a more

common way to write this snippet?

The oDesk Ruby experts reported that 86% of the snippets queued for annotation are useful,

96% are recomposable, and 91% have no more common form. These statistics indicate that Codex’s

pattern finding module produces snippets that are generally recomposable and reflective of good

programming practice.

5.3.3 Statistical Linting

Statistical linting relies upon the low-level properties of millions of lines of code to warn users about

code that is unlikely. Codex defines a general approach for detecting unlikely code, on which it

implements analyses for: type signatures, variable names, function chains, and block return types.

Here we evaluate to what extent CodexLint produces false positives on a test set of 49,735

lines of code.

As Codex seeks to identify unlikely code, and not program bugs, the distinction between true

positives and false positives is largely subjective. Inevitably, some users will want to be warned

about these properties, while others will not. Here we test the statistical linter against code known

to be of high quality. Supposing the number of warnings CodexLint suggests is small, relative to

the number of lines of code analyzed, this provides evidence that the statistical linting tool does not


suggest too many false positives.

We based our CodexLint test set on 6 projects randomly sampled and withheld from the 100

repositories collected to build Codex’s index. The test set projects contain a total of 49,735 lines of

code, and all of these projects are popular and widely used, with more than 100 watchers on GitHub

(as is the case for all the projects selected for indexing by Codex). Since 90% of the snippets annotated

through Codex’s pattern finding module are found by crowdsourced experts to be idiomatic, and

over 85% are rated as useful, we can safely assume that these projects generally do contain high-

quality code — the null hypothesis would be the principle, “garbage in, garbage out.” By treating

each warning it throws as a false positive, we arrive at a conservative estimate of the error rate.

Running CodexLint against the test set, we find that it generates 1248 warnings over 49,735 lines

of code; this suggests a conservative false positive rate of 2.5%.
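
The rate follows directly from the two reported counts, as this small calculation shows. The variable names are illustrative.

```ruby
warnings = 1248
lines_of_code = 49_735

# Every warning is counted as a false positive, so this is an upper bound
# on the per-line false positive rate, expressed as a percentage.
false_positive_rate = (warnings.to_f / lines_of_code * 100).round(1)
```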

The most common category of false positive involves functions and blocks that appear at least

a few times across a number of projects, but that haven’t been observed enough for Codex to

appropriately model their behavior. For example, nodes and uri are part of an HTML parsing

library that Codex has only seen used in a few files, and the system throws a warning about their

combination, e.g., nodes.uri. We are working on a new technique to detect sparse functions based on

library dependencies and additional program context that will handle them separately in analysis.

The second most common false positive occurs when Codex observes two AST nodes, neither

of them particularly uncommon, together in a new and valid way, e.g., lambda blocks returning a

function call to rand, which did not appear at all in Codex’s index. Programming is an open-ended

task, and there will always be valid combinations of expressions that a system like Codex has not

encountered.

Other false positives are more ambiguous. For example, one project passes the map function

a string, which would usually produce an error. This project had overridden map to support new

functionality. Similarly, another file assigns a variable named @requests an integer value, and Codex

has only ever observed @requests as an array. Programmers might be well served by changing their

code in response to these warnings.

Finally, this false positive rate will decrease as the size of Codex’s index grows and fewer correct

code paths surprise it. As the statistical linting algorithm is based upon probability thresholds, users

can make the linter even more conservative by adjusting these thresholds — analogous to adjusting

the parameters of traditional linters.

5.4 Discussion and Limitations

The approach that Codex takes has limitations, many of which we plan to address with future work.

First, while we have collected evidence that suggests Codex’s index is large enough to encompass a

broad swath of program behavior, it is likely that many applications — such as pattern annotation


and statistical linting — would benefit from a larger index of code. We have tested Codex with

indexes as large as ten million lines of code, with no significant difference in the kinds of nodes and

statistical properties it detects. However, as the size of the index grows, there will be fewer and

fewer edge cases and false positives, and Codex will more easily detect idioms and make precise

statistical statements about combinations of AST nodes. Codex must balance its desire for more

coverage against the danger of indexing lower-quality code.

Second, many more kinds of program analyses can be defined beyond Codex’s current abstractions.

All the analyses tested in the current version of Codex rely upon local properties of AST

nodes, and not the surrounding program context. By incorporating more of this context into analyses,

we might detect more complex properties (e.g., detecting that a user hasn’t initialized a MySQL

database wrapper).

Third, due to the subjective nature of CodexLint’s warnings, we have not determined a precise

rate of true positives and false positives. In future work, we might ask programmers to evaluate these

warnings, to better determine how often they are useful. Moreover, this work does not address the

general question: do programmers really find it useful to know when they are violating convention?

We can determine the answer more concretely through longitudinal study.

Finally, while Codex models practice-driven knowledge for the Ruby programming language, our

techniques for processing AST nodes and generating statistics are applicable to any AST structure

or language. For example, it might be feasible to generate a Codex database for JavaScript by

crawling highly-trafficked web pages. Moreover, while we focused on a dynamic language due to its

popularity and flexibility of naturalistic usage, static languages provide additional metadata that

Codex could leverage. Extending Codex’s analyses to these other languages remains future work.

Chapter 6

Discussion

This thesis presents three systems that contribute techniques for modeling user behavior at scale,

operationalizing these models to enable new applications across human behavior, language, and code.

These systems solve a number of challenges, but introduce and motivate many others. For example,

how can we choose good representations for modeling system knowledge in open-ended domains such

as human life? And how can we build useful systems on top of such open-ended models, when we

do not know in advance what kind of information they may encode? In this section, I motivate and

discuss a series of these open questions, lessons, and challenges.

6.1 Data Mining in HCI

My work draws on data mining techniques to advance research in human-computer interaction.

These techniques allow systems to better understand user behavior: for instance, in the domains of

writing, programming, or ubiquitous computing. Systems can then leverage this understanding to

adapt and react to current or future behavior.

Today, data mining is most often applied in the service of low level interactions. For example, a

device might learn from a user’s history of touch interactions to better decide what they are trying

to click on. The reason for this is two-fold: first, such targeted interactions provide a large amount

of easily interpretable training data; and second, improving the accuracy of such small interactions

reliably creates significant impact when improvements are deployed at scale.

The work I have presented offers a different perspective on the opportunities that data mining

presents to HCI, imagining how these techniques might help interfaces understand and help users

with higher level behaviors. This approach more often allows systems to engage in new kinds of

interactions with users, as opposed to refining existing interactions. For example, in our Augur

work, we reflect on what ubiquitous computing systems could do if they could understand the

many thousands of activities that people engage in, and the relationships between those activities.


However, achieving a high level understanding of user behavior through data mining is challenging

for exactly the same reasons that achieving low level understanding is tractable. You need to

answer some tough questions. Where does the data come from? How do you represent the high level

patterns you are interested in? And how reliable are the discovered associations?

Our work on Codex, focused as it was on programming tools, had the easiest time answering

these questions. Open source code is abundant on the web and can be represented through high

level AST-based parses, and interactions designed to help users can be drawn into an IDE in a

relatively innocuous way. For example, if a linting suggestion based on a high level code pattern

is wrong, the worst that might happen is you waste a user’s time by bringing it to their attention.

In contrast, our Augur work, which was focused on human life and behavior in the broadest sense,

had a difficult time with these same questions. Data about human behavior is not abundant, has

no natural representation, and inferences made on the basis of such behaviors can have damaging

real world consequences (even something as simple as turning off a light can be quite bad under

the wrong circumstances). So with Augur we needed to be much more creative about the source of

data—fiction—and how we could represent human behavior in a useful way.

Along these lines, the greatest opportunities in data mining for HCI will likely bring new datasets

and creative insights to old problems, as Augur brought fiction to ubiquitous computing. These

opportunities may also lie in domains where the data in question falls more naturally into a useful

high level representation (such as program ASTs) that can be applied to known problems. Our

Empath work provides some supporting evidence for this idea, as we applied the lessons we learned

in Augur to a much narrower problem—the creation of new lexicons—and produced a tool with

significant impact, used by many researchers including Facebook’s Data Science team.

6.2 Biases in Data-Driven Interfaces

It is important to understand the biases in our datasets and the models that we generate from them.

All of the work I have presented here contains such biases. Augur makes predictions about human

life based on the actions that characters take in fiction, learning from a source biased by drama

and stereotype. Empath draws on word associations learned from a wide variety of texts written by

many thousands of individuals, yet will often generate lexical categories that succumb to stereotypes

of race and gender, reflecting the broader attitudes of society. Even Codex is biased by the common

tendencies of the programmers who published the code it analyzed. And of course, interfaces based

upon the models learned from such data will be biased themselves.

Understanding biases is most important when models are trained on datasets that are quite

different from the ones they are being applied to. For example, the recent spectacular failures of

some computer vision algorithms can be traced back to training them on datasets that did not

contain enough racial diversity to match the populations they were analyzing [74]. This kind of


issue is particularly relevant to projects such as Augur, which draw their strength from the fact

that they are using novel and abundant sources of data. Fiction is of great benefit to Augur in that

it allows the system to turn a microscope on thousands of human lives without a large-scale data

collection effort or the need to invade anyone’s privacy. But fiction is also a great weakness in that

these human lives under analysis may not be realistic ones.

Many researchers are working on techniques that seek to bridge this disconnect. For example,

suppose we could collect a small but realistic distribution of the relationships between human activities.

If we then had a method that could compare the distribution drawn from fiction with the

real distribution, and identify dimensions along which the model exhibits dramatic bias, or gender

bias, or bias towards violence, we might use that information to transform the fictional distribution

and de-bias the model. Such an approach has been taken to remove gender stereotypes from word

embeddings [10], a technique which could be directly applied to tools like Empath, and might be

modified to apply to the fiction-based models.

However, even if we remove all the biases we can quantify, we still need to deal with the biases

we cannot, such as biases of absence. If a model is built upon books written only in the nineteenth

century, for instance, it is unlikely to contain much information about interactions between members

of same sex couples. And even if we know that a bias of absence exists, there is no way to address

that absence without simply finding a better dataset. In fact, we encountered this issue in our work.

Commercial fiction is not the most abundant source of mundane activities, and so we found more

suitable data: amateur fiction writers are less experienced in the craft of writing, and tend to leave

more of those details in. Sometimes finding more or better data is the only good solution.

As interactive systems become ever more driven by user and community data, we must consider

the potential biases that may spread from the data to the system. Analyzing and correcting for such

biases should become an important step in the design process. This is especially true for any work

that follows from this thesis, which relies heavily on unsupervised or semi-supervised learning.

6.3 Data Power vs. Modeling Power

One recurring challenge in this thesis work has been whether to put more effort into finding more

data or into building more sophisticated models. In some projects, such as Codex, coming up with

a better model made all the difference and increased the power of the system. In other projects,

such as Augur, we only succeeded once we had acquired several orders of magnitude more data and

ultimately threw away much of the original model’s complexity.

For many of the unsolved problems at the intersection of machine learning and HCI, the limiting

factor is the data. Give an off-the-shelf recurrent neural network (RNN) enough fiction, and it should

have little problem generating a realistic set of character behaviors over time. Give a similar RNN

enough mappings between English and code, and it will soon be translating your language into code


fragments. Choice of model can be important on the margins, but not as important as having a

large dataset that captures the kind of relationships you need.

However, the representational choices for a model’s features, inputs, and outputs remain critical

for defining the interaction boundaries of a system. This is often the hardest part of an HCI project

that mixes modeling and design, as it dictates the space of possible interactions. For example,

Augur reasons about activities that are defined as verb phrases with a human subject, such as

the phrase park car. These human activities are its basic units of reasoning and so determine the

kinds of predictions the model can make and the ways it can empower interactions. You give the

model a signal for park car and it might predict something like open door, perhaps allowing a

sufficiently intelligent car to open the door for you. You give the model a few pieces of context,

maybe screen, desk, and computer, and it can predict something like working, perhaps enforcing

a set of notification preferences you have assigned to that context. These kinds of input/output

relationships do not appear out of nowhere. They require deep thinking about the design space you

want to enable, and how you might gather the necessary signals from the data. The process is far

more involved than simply throwing a neural network at a new dataset.
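
To make these input/output relationships concrete, here is a toy Python sketch. It is not Augur's actual model: the class ToyActivityModel, its method names, and the simple counting scheme are assumptions chosen for illustration, and the observations are invented. The point is only that once activities are fixed as the unit of reasoning, the model's interface follows: an activity in, a likely next activity out; or a set of context objects in, a likely activity out.

```python
from collections import Counter, defaultdict


class ToyActivityModel:
    """Count-based stand-in for a model over human activities (hypothetical)."""

    def __init__(self):
        # activity -> counts of the activities observed immediately after it
        self.following = defaultdict(Counter)
        # object -> counts of the activities observed alongside it
        self.object_activity = defaultdict(Counter)

    def observe_sequence(self, activities):
        """Record each adjacent pair in an ordered list of activities."""
        for current, nxt in zip(activities, activities[1:]):
            self.following[current][nxt] += 1

    def observe_scene(self, objects, activity):
        """Record that an activity occurred in the presence of some objects."""
        for obj in objects:
            self.object_activity[obj][activity] += 1

    def predict_next(self, activity):
        """Return the most frequent follow-up activity, or None if unseen."""
        counts = self.following.get(activity)
        return counts.most_common(1)[0][0] if counts else None

    def predict_from_context(self, objects):
        """Return the activity with the highest summed count over objects."""
        scores = Counter()
        for obj in objects:
            scores.update(self.object_activity.get(obj, {}))
        return scores.most_common(1)[0][0] if scores else None


# Invented observations standing in for relationships mined from text.
model = ToyActivityModel()
model.observe_sequence(["park car", "open door", "walk inside"])
model.observe_sequence(["park car", "open door"])
model.observe_sequence(["park car", "lock car"])
model.observe_scene(["screen", "desk", "computer"], "working")
model.observe_scene(["screen", "couch"], "watch tv")

print(model.predict_next("park car"))
print(model.predict_from_context(["screen", "desk", "computer"]))
```

Changing the unit of reasoning, for example from verb phrases to single verbs, would change both what can be counted and what an application can ask of the model, which is the design decision this section emphasizes.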

With systems like those presented in this thesis work, designers no longer need to plan in advance

every possible behavior they want an interface to understand. However, as researchers and meta-

designers of systems that enable systems, we still need to think ahead to the space of behaviors we

want to capture in the models that we create. This space represents the power of a model, and its

potential to enable new and useful interactions.

Chapter 7

Conclusion

Interfaces can benefit from understanding user needs across a diverse set of domains: activity pre-

diction, writing, and programming, among many others. While supervised learning techniques and

other statistical models provide powerful tools for learning patterns from user data, they still re-

quire a system designer to formulate a set of hypotheses in advance: a set of questions upon which

those models can be trained. In contrast, this thesis shows how interfaces can operationalize semi-

supervised or unsupervised models trained on data drawn from these domains to reason about user

actions in ways unanticipated by any system designer.

Moving forwards, I aim to explore how we can extend this approach to draw data from community

resources in a way that goes on to empower those resources, creating bidirectional interactions

between systems and their sources of data. I envision a future where systems engage in a virtuous

cycle: a system first learns from a community, then goes on to empower work in that community,

and finally learns again from what it has empowered the community to do.

Bibliography

[1] Project Gutenberg. https://www.gutenberg.org/.

[2] Gregory D Abowd, Anind K Dey, Peter J Brown, Nigel Davies, Mark Smith, and Pete Steggles.

Towards a better understanding of context and context-awareness. In Handheld and ubiquitous

computing, pages 304–307. Springer, 1999.

[3] Eytan Adar, Mira Dontcheva, and Gierad Laput. CommandSpace: Modeling the relationships

between tasks, descriptions and features. In Proc. UIST 2014.

[4] Marzieh Ahmadzadeh, Dave Elliman, and Colin Higgins. An analysis of patterns of debugging

among novice computer science students. In Proc. ITiCSE 2005.

[5] Apple Computer. Knowledge Navigator, 1987.

[6] Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix, and William Pugh.

Using static analysis to find bugs. In IEEE Software 2008.

[7] Niranjan Balasubramanian, Stephen Soderland, and Oren Etzioni. Rel-grams: A probabilistic

model of relations in text. AKBC-WEKEX Workshop 2012.

[8] Ira D. Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. Clone

detection using abstract syntax trees. In Proc. ICSM 1998.

[9] Michael S Bernstein, Jaime Teevan, Susan Dumais, Daniel Liebling, and Eric Horvitz. Direct

answers for search queries in the long tail. In Proc. CHI 2012.

[10] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.

Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In

Advances in Neural Information Processing Systems, pages 4349–4357, 2016.

[11] Margaret M Bradley and Peter J Lang. Affective norms for english words (anew): Instruction

manual and affective ratings. In Technical Report C-1, The Center for Research in Psychophys-

iology, University of Florida, 1999.

[12] Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. Example-centric

programming: integrating web search into the development environment. In Proc. CHI 2010.

[13] Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer.

Opportunistic programming: Writing code to prototype, ideate, and discover. In IEEE Software

2009.

[14] Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. Two

studies of opportunistic programming: interleaving web foraging, learning, and writing code.

In Proc. CHI 2009.

[15] Raymond P. L. Buse and Westley Weimer. Synthesizing api usage examples. In Proc. ICSE

2012.

[16] Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative schemas and their

participants. In Proc. ACL 2009.

[17] Angel X. Chang and Christopher D. Manning. Tokensregex: Defining cascaded regular expres-

sions over tokens. Technical Report 2014.

[18] Sunny Consolvo, David W. McDonald, Tammy Toscos, Mike Y. Chen, Jon Froehlich, Beverly

Harrison, Predrag Klasnja, Anthony LaMarca, Louis LeGrand, Ryan Libby, Ian Smith, and

James A. Landay. Activity sensing in the wild: A field trial of Ubifit Garden. In Proc. CHI

’08, pages 1797–1806, 2008.

[19] Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher

Potts. A computational approach to politeness with application to social factors. In Proc.

ACL 2013.

[20] Stephane Ducasse, Matthias Rieger, and Serge Demeyer. A language independent approach for

detecting duplicated code. In Proc. ICSM 1999.

[21] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as

deviant behavior: a general approach to inferring errors in systems code. In Proc. SOSP 2001.

[22] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical resource for

opinion mining. In Proceedings of LREC 2006.

[23] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information

extraction. In Proc. EMNLP 2011.

[24] Ethan Fast and Eric Horvitz. Identifying dogmatism in social media: Signals and models. In

Proc. EMNLP 2016.

[25] Ethan Fast and Eric Horvitz. Long-term trends in the public perception of artificial intelligence.

In AAAI, pages 963–969, 2017.

[26] Ethan Fast, Will McGrath, Pranav Rajpurkar, and Michael Bernstein. Mining human behaviors

from fiction to power interactive systems. In Proc. CHI 2016.

[27] Ethan Fast, Daniel Steffe, Lucy Wang, Michael Bernstein, and Joel Brandt. Emergent, crowd-

scale programming practice in the ide. In Proc. CHI 2014.

[28] Adam Fourney, Richard Mann, and Michael Terry. Query-feature graphs: bridging user vocab-

ulary and system functionality. In Proc. UIST 2011.

[29] Mark Gabel and Zhendong Su. A study of the uniqueness of source code. In Proc. FSE 2010.

[30] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements

of Reusable Object-Oriented Software. 1994.

[31] Gerd Gigerenzer. How to make cognitive illusions disappear: Beyond heuristics and biases. In

European review of social psychology 1991.

[32] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for

accurate object detection and semantic segmentation. In Proc. CVPR ’14, 2014.

[33] Scott A. Golder and Michael W. Macy. Diurnal and seasonal mood vary with work, sleep, and

daylength across diverse cultures. In Science, volume 333, pages 1878–1881, 2011.

[34] Max Goldman, Greg Little, and Robert C. Miller. Collabode: collaborative coding in the

browser. In Proc. CHASE 2011.

[35] Max Goldman and Robert C. Miller. Codetrail: Connecting source code and web resources. In

Proc. VL/HCC 2009.

[36] Mark Grechanik, Chen Fu, Qing Xie, Collin McMillan, Denys Poshyvanyk, and Chad Cumby.

Exemplar: Executable examples archive. In Proc. ICSE 2010.

[37] S. Greenberg and I. H. Witten. How users repeat their actions on computers: Principles for

design of history mechanisms. In Proc. CHI ’88.

[38] Björn Hartmann, Daniel MacDougall, Joel Brandt, and Scott R. Klemmer. What would other

programmers do: suggesting solutions to error messages. In Proc. CHI 2010.

[39] Björn Hartmann, Leslie Wu, Kevin Collins, and Scott R. Klemmer. Programming by a sample:

rapidly creating web applications with d.mix. In Proc. UIST 2007.

[40] Vasileios Hatzivassiloglou and Kathleen R McKeown. Predicting the semantic orientation of

adjectives. In Proceedings of the 35th annual meeting of the association for computational

linguistics and eighth conference of the european chapter of the association for computational

linguistics, pages 174–181. Association for Computational Linguistics, 1997.

[41] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the

naturalness of software. In Proc. ICSE 2012.

[42] Reid Holmes, Robert J. Walker, and Gail C. Murphy. Strathcona example recommendation

tool. In Proc. FSE 2005.

[43] Oliver Hummel, Werner Janjic, and Colin Atkinson. Code conjurer: Pulling reusable software

out of thin air.

[44] C. Hutto and Eric Gilbert. Vader: A parsimonious rule-based model for sentiment analysis of

social media text. In Proc. AAAI 2014.

[45] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descrip-

tions.

[46] Emre Kiciman. Towards learning a knowledge base of actions from experiential microblogs. In

AAAI Spring Symposium, 2015.

[47] Miryung Kim, Lawrence Bergman, Tessa Lau, and David Notkin. An ethnographic study of

copy and paste programming practices in oopl. In Proc. ISESE 2004.

[48] Andrew J. Ko and Brad A. Myers. Designing the whyline: a debugging interface for asking

questions about program behavior. In Proc. CHI 2004.

[49] Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of

massive-scale emotional contagion through social networks. In Proceedings of the National

Academy of Sciences, volume 111, pages 8788–8790, 2014.

[50] Ranjitha Kumar, Arvind Satyanarayan, Cesar Torres, Maxine Lim, Salman Ahmad, Scott R

Klemmer, and Jerry O Talton. Webzeitgeist: Design Mining the Web. In Proc. CHI 2013.

[51] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic

human actions from movies. In Proc. CVPR 2008.

[52] Yong Jae Lee, C Lawrence Zitnick, and Michael F Cohen. Shadowdraw: real-time user guidance

for freehand drawing. In Proc. SIGGRAPH ’11, 2011.

[53] Yang Li. Reflection: Enabling event prediction as an on-device service for mobile interaction.

In Proc. UIST ’14, pages 689–698, 2014.

[54] H. Liu and P. Singh. Conceptnet – a practical commonsense reasoning tool-kit. In BT Tech-

nology Journal 2004.

[55] Qun Luo and Weiran Xu. Learning word vectors efficiently using shared representations and

document representations. In Proc. AAAI 2015.

[56] David Mandelin, Lin Xu, Rastislav Bodík, and Doug Kimelman. Jungloid mining: helping to

navigate the api jungle. In Proc. PLDI 2005.

[57] Justin Matejka, Wei Li, Tovi Grossman, and George Fitzmaurice. CommunityCommands. In

Proc. UIST 2009.

[58] Justin Matejka, Wei Li, Tovi Grossman, and George Fitzmaurice. CommunityCommands. In

Proc. UIST 2009.

[59] William McGrath, Mozziyar Etemadi, Shuvo Roy, and Björn Hartmann. fabryq: Using phones

as gateways to communicate with smart devices from the web. In Proc. EICS 2015.

[60] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed repre-

sentations of words and phrases and their compositionality. In Proc. NIPS 2013.

[61] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space

word representations. In Proc. NAACL-HLT 2013.

[62] George A. Miller. WordNet: A lexical database for english. In Commun. ACM 1995.

[63] Saif M. Mohammad and Peter D. Turney. Crowdsourcing a word-emotion association lexicon.

In Computational Intelligence, volume 29, pages 436–465, 2013.

[64] Saif M Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. Sentiment, emotion,

purpose, and style in electoral tweets. In Information Processing & Management. Elsevier, 2014.

[65] Mathew Mooty, Andrew Faulring, Jeffrey Stylos, and Brad A. Myers. Calcite: Completing code

completion for constructors using crowds. In Proc. VL/HCC 2010.

[66] Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. Finding deceptive opinion spam

by any stretch of the imagination. In Proc. ACL 2011.

[67] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification

using machine learning techniques. Proc. ACL 2002.

[68] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word

count: Liwc 2001. In Mahwah: Lawrence Erlbaum Associates 71 2001.

[69] Naiyana Sahavechaphan and Kajal Claypool. Xsnippet: mining for sample code. In Proc.

OOPSLA 2006.

[70] Robert C. Seacord, Daniel Plakosh, and Grace A. Lewis. Modernizing Legacy Systems: Software

Technologies, Engineering Process and Business Practices. 2003.

[71] Phillip Shaver, Judith Schwartz, Donald Kirson, and Cary O’Connor. Emotion knowledge:

further exploration of a prototype approach. In Journal of personality and social psychology,

volume 52, page 1061. American Psychological Association, 1987.

[72] Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving data

quality and data mining using multiple, noisy labelers. In Proc. SIGKDD 2008.

[73] Ian Simon, Dan Morris, and Sumit Basu. MySong: automatic accompaniment generation for

vocal melodies. In Proc. CHI 2008.

[74] Tom Simonite. When it comes to gorillas, google remains blind. In Wired.

[75] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, An-

drew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over

a sentiment treebank. Proc. EMNLP 2013.

[76] Philip J Stone, Dexter C Dunphy, and Marshall S Smith. The general inquirer: A computer

approach to content analysis. MIT press, 1966.

[77] Jeffrey Stylos and Brad A. Myers. Mica: A web-search tool for finding api components and

examples. In Proc. VL/HCC 2006.

[78] Suresh Thummalapenta and Tao Xie. Parseweb: a programmer assistant for reusing open source

code on the web. In Proc. ASE 2007.

[79] Peter D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised

classification of reviews. In Proc. ACL 2002.

[80] Peter D Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of

semantics. Journal of artificial intelligence research, 37(1):141–188, 2010.

[81] Raoul-Gabriel Urma and Alan Mycroft. Programming language evolution via source code query

languages. In Proc. PLATEAU 2012.

[82] Mark Weiser. The computer for the 21st century. In Scientific American, volume 265, pages

94–104. Nature Publishing Group, 1991.

[83] Yunwen Ye and Gerhard Fischer. Supporting reuse by delivering task-relevant and personalized

information. In Proc. ICSE 2002.

[84] YoungSeok Yoon and Brad Myers. A longitudinal study of programmers’ backtracking. In

Proc. VL/HCC 2014.
