Studying archives of online behavior

Computational Qualitative Research Seminar

James Howison
University of Texas at Austin

Link to slides on Twitter: @jameshowison

Readings at https://www.dropbox.com/sh/1gx9s2zlnxvumbz/AAAV9uSAJHsiPeJhSsNnnM9Pa?dl=0

Readings

• The presentation and discussion will draw on:

– Howison, J., & Crowston, K. (2014). Collaboration through open superposition: A theory of the open source way. MIS Quarterly, 38(1), 29–50.

– Howison, J., Wiggins, A., & Crowston, K. (2011). Validity Issues in the Use of Social Network Analysis with Digital Trace Data. Journal of the Association for Information Systems, 12(12), Article 2.

– Geiger, R. S., & Ribes, D. (2011). Trace Ethnography: Following Coordination through Documentary Practices. In Proceedings of the 44th Hawaii International Conference on System Sciences (HICSS 2011) (pp. 1–10). Waikoloa, HI. http://doi.org/10.1109/HICSS.2011.455

– Annabi, H., Crowston, K., & Heckman, R. (2008). Depicting What Really Matters: Using Episodes to Study Latent Phenomenon. In Proceedings of the International Conference on Information Systems (ICIS).

– The methodological appendix for the Howison and Crowston Superposition article.

To the Archives!

The evidence is here, somewhere.

CC Credit: http://www.flickr.com/photos/hamadryades/

Opportunities of online archive studies

• Quantity
• Granularity
• Accessibility
– Much is openly available
– Or the organization can provide bulk access
– (compare to ethnography and getting individual cooperation)
• Emic'ness

Emic'ness?

Emic: in their words (from the inside)
Etic: in your words (from the outside)

Naturalistic: the archives are primary to the users and the activity themselves:

“documentary traces are the primary mechanism in which users themselves know their distributed communities and act within them.” (Geiger and Ribes, 2011)

Yet, many challenges

We are using the system (and the system that archived and presents the traces) as a data collection method.

But the systems were not built for research.
So we need to ask, for any research question:
How well do the archives represent the activity, as it happened?

Individual Exercise (6 mins)

1. Pick a system that renders online archives of something you are interested in.

– Can be your project for this course or something you choose right now.

– Slight preference for an archive showing traces from more than 1 person

2. Go and find a specific archive page and read it.

3. Write a sentence or two about what is happening there.

Quick Group discussion (4 mins)

• Let’s hear from a few participants about their choices.

Individual exercise II (6 mins)

• How might archives diverge from experience?
1. How did the system record activity at the time?
2. How did the conversion to archives occur?
3. How is your experience of reading the archives different from the experience of the participants in the activity that was archived?

Discussion in groups

• Discuss the questions as a group (go question by question, not person by person)
– How recorded? (each person speaks)
– How converted? …
– How is reading experience different?

Most surprising?

• One person from each group reports back on the aspect that was most surprising.

Archival transformation

• Deletions
– Some data is periodically purged from databases; after all, they are running a website, not a research database.
• Overlaps
– When database dumps are pulled periodically
• Re-calculations
– Historical depictions on a site (e.g., counts of messages, members, or other data such as downloads) might be later creations or re-calculations
– Can you rely on participants having seen those figures at the time?
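One way to handle overlaps and deletions together is to merge the periodic dumps by record id and keep track of which dump last contained each record. The sketch below assumes each dump is a list of rows with a stable `id` field and an ISO-formatted pull date; the function name, field names, and sample rows are all hypothetical, not taken from any particular site.

```python
def merge_dumps(dumps):
    """Merge periodic dumps into one record set keyed by record id.

    Each dump is (pulled_at, rows); rows are dicts with an "id" key.
    For overlapping ids the most recently pulled copy wins. Ids that are
    absent from the latest dump are reported as possibly purged.
    """
    merged = {}
    # ISO date strings sort chronologically, so oldest dumps are applied first.
    for pulled_at, rows in sorted(dumps, key=lambda dump: dump[0]):
        for row in rows:
            merged[row["id"]] = {**row, "last_seen_in_dump": pulled_at}
    latest_pull = max(pulled_at for pulled_at, _ in dumps)
    possibly_purged = sorted(
        record_id
        for record_id, row in merged.items()
        if row["last_seen_in_dump"] != latest_pull
    )
    return merged, possibly_purged


# Hypothetical dumps: record 2 overlaps, record 1 vanished from the later dump.
dumps = [
    ("2014-01-01", [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]),
    ("2014-02-01", [{"id": 2, "text": "second (edited)"}, {"id": 3, "text": "third"}]),
]
merged, possibly_purged = merge_dumps(dumps)
```

Keeping `last_seen_in_dump` on every record means a later reader can still ask whether a record was visible at a given time, rather than silently treating the merged set as what participants saw.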

Database schemas are not research ontologies

• Databases (or websites) often use words that are very exciting for research
– “Friends”, “Followers”, “Assignment”, “Member”
• But their meaning may have very, very little to do with the sociological/theoretical concept
– At best they are a hint that something interesting is happening, but often are interpreted literally!
• Examples from Sourceforge
– use of “assigned to” field on close.
– “member list” does not show who is active (no one was ever removed!)

Non-archived activity

Reasoning with missing/complete data

• Trouble both ways
• Assuming that the data are complete (rather than a system-selected sample)
• Can miss important activities or whole archives that need to be integrated.
• Oddly enough, when data are complete issues can also emerge
– See discussion in the JAIS paper on validity issues in SNA (Howison, Wiggins, & Crowston, 2011).

Hidden readership

• Archives almost never tell you who read what, and when they read it.
– Might be key to interpretation (or might be irrelevant)
– Definitely crucial to any argument about information flow (and almost all interpretations of SNA measures are about information flow).
• You may be able to impute readership from responses, but it’s a weak signal.

Activity traces scattered through archives

• Participants experience a flow of activities across different systems
– Linked by time and order that they occur
• But they are archived by different systems
– If you just read the mailing list you miss so much
– And yet so many studies *want* their archive to be the only one (so much easier to analyze).
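To get closer to the flow participants experienced, events from separate archives can be interleaved by time. The sketch below assumes each archive has already been parsed into a time-sorted list of `(timestamp, system, summary)` tuples with comparable clocks; the system names and events are hypothetical, and real archives may need time zone and clock-skew corrections first.

```python
import heapq
from datetime import datetime


def interleave(*archives):
    """Merge per-system event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*archives, key=lambda event: event[0]))


# Hypothetical events from three systems on the same day.
mailing_list = [
    (datetime(2005, 3, 1, 9, 0), "mailing list", "proposal posted"),
    (datetime(2005, 3, 1, 15, 30), "mailing list", "reply: looks good"),
]
tracker = [
    (datetime(2005, 3, 1, 10, 15), "tracker", "bug opened"),
]
version_control = [
    (datetime(2005, 3, 1, 14, 0), "version control", "patch committed"),
]

timeline = interleave(mailing_list, tracker, version_control)
for timestamp, system, summary in timeline:
    print(timestamp.isoformat(), system, summary)
```

Read from the mailing list alone, the reply appears to follow the proposal directly; in the merged timeline a bug report and a commit sit between them.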

Pacing of activities

• Participant observation in an open source project highlighted the role of pacing.
– Rapid replies indicated interest and importance but also availability
– Very long gaps (sometimes years) indicated deferral and return.
• In other work I was reading archives and found pacing hard to appreciate; it was very salient in participant observation but hidden in studies relying on trace data alone.

An episode

How to represent pacing?

Time stamps

Representing pacing

• Calculate gaps?

Just reading the gaps doesn’t help; they are easy to ignore. How can we make them harder to ignore?
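As a starting point, gaps can be computed from the time stamps and turned into coarse labels that stand out when reading an episode. This is a minimal sketch: the thresholds and label wording are arbitrary assumptions for illustration, and the time stamps are made up.

```python
from datetime import datetime


def gaps_between(timestamps):
    """Return the time differences between consecutive events, in time order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def describe_gap(gap):
    """Render a timedelta as a coarse label (thresholds are illustrative only)."""
    seconds = gap.total_seconds()
    if seconds < 3600:
        return "minutes later"
    if seconds < 86400:
        return "hours later"
    if seconds < 86400 * 30:
        return "days later"
    if seconds < 86400 * 365:
        return "months later"
    return "YEARS later"


# Hypothetical episode: a quick reply, a follow-up two days on, then a long deferral.
episode = [
    datetime(2004, 5, 1, 10, 0),
    datetime(2004, 5, 1, 10, 20),
    datetime(2004, 5, 3, 9, 0),
    datetime(2006, 7, 15, 12, 0),
]
labels = [describe_gap(gap) for gap in gaps_between(episode)]
```

Printing such labels between the documents of an episode puts the deferral in the reader's path instead of leaving it buried in a date column.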

Visualize events

What is to be done?

• Sufficient engagement with the system and community to adequately interpret the traces.
• Use a system and see how your data is archived.
• When you think a phenomenon/construct can be operationalized computationally, at least show some narrative examples from the dataset.
• Complement archives with interviews and/or surveys
– Archives make great prompts for interviews
– Lakhani and Wolf (2003) survey immediately after a post.
• Gaskin et al. (2014) “Zooming in and out of sociomaterial routines” MISQ.

An ontology for trace data studies

• Document
– Archived content. E.g., an e-mail message, tracker comment, release note, pull-request, log entry.
– Provides evidence for events and actions.
– One document may provide evidence for multiple events and actions.
• Event
– An event causes documents to be archived. E.g., sending an email, releasing a version.
• Action
– The contextualized meaning of an event. E.g., contributing code, showing leadership (can be at quite different conceptual levels in different studies.)
• Participant
– An actor (typically a person, but could be a machine or bot)
• Identifier
– A string associated with a participant.
– Many identifiers could refer to one participant (e.g., email and username)
– but many participants may act through one identifier (e.g., “admin account”)
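The ontology above can be sketched as plain data classes to make the many-to-many link between identifiers and participants concrete. The class fields, the helper function, and every name and address below are hypothetical illustrations, not the schema used in the studies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Identifier:
    value: str  # a string seen in the archive, e.g. an email address or username


@dataclass
class Participant:
    label: str
    identifiers: List[Identifier] = field(default_factory=list)


@dataclass
class Document:
    url: str   # where the archived content can be viewed
    kind: str  # e.g. "email", "tracker comment", "release note"


@dataclass
class Event:
    description: str  # e.g. "sending an email"
    identifier: Identifier
    documents: List[Document]


@dataclass
class Action:
    meaning: str  # the contextualized, interpreted meaning of the event
    event: Event


def participants_for(identifier, participants):
    """All participants who may have acted through the given identifier."""
    return [p for p in participants if identifier in p.identifiers]


# Hypothetical data: one person with two identifiers, and a shared admin account.
email = Identifier("dev-a@example.org")
username = Identifier("dev_a")
admin = Identifier("admin")
people = [
    Participant("Developer A", [email, username, admin]),
    Participant("Developer B", [Identifier("dev_b"), admin]),
]
release = Event("releasing a version", admin,
                [Document("http://example.org/releases/1.0", "release note")])
action = Action("showing leadership", release)
who = [p.label for p in participants_for(release.identifier, people)]
```

Because the release was made through the shared account, the lookup returns both developers; deciding who actually acted is interpretive work that the trace alone cannot settle.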

Episodes

• A unit of analysis, facilitating comparison and summary (e.g., counting)
– Compare to content analysis or NLP that counts mentions of concepts, database queries that count documents, surveys that measure attitudes.
– The detail provided by trace data makes episodes more accessible and lets research be more granular, closer to the work.
• Ideally emic (meaningful to and recognizable by participants)

Ok, but how to store this?

• Moving from documents and events to actions and outcomes is interpretative work
– I do the qualitative first, then hope to make it computable (e.g., through machine learning)
• It is akin to content analysis but with a much more complicated ontology
– Content analysis (classic or grounded theory) assigns Codes to Documents
– Software like ATLAS.ti has trouble handling coding of structured data (dates, linked documents like threads, multiple identifiers for a single participant).

I use RDF

• Resource Description Framework
– Triples: James hasEmail [email protected]
– URLs working natively (making viewing original archives easy)
• Retains original data structure
– e.g., Document in thread by Identifier
– Allows ad-hoc addition of structure (schemaless)
– Allows inheritance (e.g., MailingListEvent a CommunicationEvent)
• Allows you to overlay higher level structure
– e.g., Action(s) in (ordered) Episode by Participant
– And then apply codes to Actions (storing when, who, why)
• Querying via SPARQL, Validation via RDF rules (aka SPIN)
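To show the idea of triples and inheritance without committing to any particular toolkit, here is a minimal sketch using plain Python tuples as a stand-in for a triple store. It is not SPARQL and not an RDF library API; the `ex:` resource names and the `ex:sentBy` property are hypothetical, and a real study would use an actual triple store and query language.

```python
# Each triple is (subject, predicate, object), written as prefixed names.
triples = {
    ("ex:msg42", "rdf:type", "ex:MailingListEvent"),
    ("ex:MailingListEvent", "rdfs:subClassOf", "ex:CommunicationEvent"),
    ("ex:release7", "rdf:type", "ex:ReleaseEvent"),
    ("ex:msg42", "ex:sentBy", "ex:identifier1"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def instances_of(triples, cls):
    """Resources typed as cls or as any (transitive) subclass of cls."""
    classes, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for subclass, _, _ in match(triples, p="rdfs:subClassOf", o=current):
            if subclass not in classes:
                classes.add(subclass)
                frontier.append(subclass)
    return sorted(s for s, _, o in match(triples, p="rdf:type") if o in classes)


communication_events = instances_of(triples, "ex:CommunicationEvent")

# Ad-hoc addition of structure: a new kind of statement needs no schema change.
triples.add(("ex:msg42", "ex:codedAs", "ex:ShowingLeadership"))
codes = match(triples, s="ex:msg42", p="ex:codedAs")
```

The message is found as a communication event only through the subclass statement, which is the inheritance the slide refers to; the later `ex:codedAs` triple illustrates overlaying interpretation on top of the retained original structure.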

An episode

Showing an example
