Craig Evans CAS587 – Culture As Data Project Results 4 December 2012.


We Are More Than Our Features

Craig Evans
CAS587 – Culture As Data
Project Results
4 December 2012

Challenge: Finding the Right Data Set

Wide variety of data types presented: global, national, local; big data, personal data

Discussed varying technologies: data mining, text mining, machine learning, visualisation

All very abstract …

Motivation: Something Personal/Relatable

Never lose sight of the data

It's not about the technology

Technology is a tool, not an endpoint

Choose data that we can all see something in

So …

Goal

CAS587 is an interdisciplinary class

We have different interests and focuses: do they come out through our readings analysis?

Analyse the writings of the CAS587 class, and see if there is any apparent trend in their writing.

Importance …

To the student:
Who else in the class has a similar interest?
Who has expressed skills that are complementary?
Who would you reach out to to build a team later?

To the instructor:
Has the right message been communicated?
Have your goals in educating the class been met?

To the wider population:
This is an example of how data can be used in unintended ways.
Would you write differently if you knew the text was going to be used for this purpose?
Would you choose to post anonymously instead?

Data Appropriateness

It is a “raw” data set

No previous preprocessing

It is not what the data was intended for

It is a little “random” in nature: not a traditional structured dataset found in an online repository

CAS587 – The Data Set

Starts as a PDF file

Converted to standard ASCII text file

Manual cleanup of data required

Removal of heading/footer information

Result? 150 files, 96677 words, 1150150 chars

The Process

1. PDFs submitted to CAS587 website

2. Results exported to plain text files (used trial version of publicly available PDF2Text tool)

3. Results imported to database (MySQL)

4. Results analyzed in custom Java application
• Text parsed to individual words
• Text stemmed using WordNet
• tf*idf weightings used to generate keywords per person/article
• If time permits: run sentiment analysis over corpus

5. Results returned to database

6. Results returned to Excel / visualisation tool (Excel is easy, but once the data is processed, I can have some fun with the visualisation)
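As a rough illustration of the analysis step (parse to words, weight with tf*idf, keep the top terms per article), here is a minimal Python sketch. It is not the custom Java application from the project: the function names `tokenize` and `top_keywords`, the regular-expression tokenizer, and the base-10 logarithm are all assumptions made for the example, and the WordNet stemming step is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, k=3):
    """Return the k highest-scoring tf*idf terms for each document.

    docs maps a (hypothetical) document name to its raw text.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    n_docs = len(docs)

    keywords = {}
    for name, words in tokens.items():
        counts = Counter(words)
        scores = {
            term: (count / len(words)) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[name] = ranked[:k]
    return keywords


# Toy input, invented for illustration only (not the class corpus).
docs = {
    "week7": "coupon habit coupon prediction the data",
    "week10": "privacy policy privacy the data",
}
print(top_keywords(docs, k=2))
```

Words that occur in every document (here "the" and "data") get an inverse document frequency of zero, which is why they never surface as keywords.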

tf*idf … term frequency × inverse document frequency (from Wikipedia)

… a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is used as a weighting factor in information retrieval and text mining. The tf*idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.

Example: Consider a document containing 100 words where the word cow appears 3 times. The term frequency (TF) for cow is then (3 / 100) = 0.03. Now, assume we have 10 million documents and cow appears in one thousand of these. Then, the inverse document frequency is calculated with a base-10 logarithm as log10(10 000 000 / 1 000) = 4. The tf*idf score is the product of these quantities: 0.03 × 4 = 0.12.
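The cow example can be checked in a few lines of Python. This is only a sketch of the arithmetic quoted above; the function name `tf_idf` and its parameter names are made up for the illustration, and the base-10 logarithm is assumed because it is what yields the value 4 in the example.

```python
import math


def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """tf*idf for one term in one document, using a base-10 logarithm."""
    tf = term_count / doc_length              # 3 / 100 = 0.03
    idf = math.log10(n_docs / docs_with_term)  # log10(10 000 000 / 1 000) = 4
    return tf * idf


score = tf_idf(term_count=3, doc_length=100,
               n_docs=10_000_000, docs_with_term=1_000)
print(round(score, 2))  # 0.12
```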

The Class Week 2-6

Week 2: What is Culture as Data?
filter, comparative, autism, scholarship, writer, closed, overload, library, net, outward, air, inside, coin, ecology, region

Week 3: Social Media - culture, trends, and data
activism, movie, stock, flu, market, happy, mood, tweet, Trends, predict, happier, weak, Democrats, happiness, television

Week 4: Visualization, the challenges of visualizing culture - the challenges of manipulating large amounts of data
visualization, template, analyst, analytic, seer, visual, computing, dot, cloud, distort, manipulate, viewer, map, lie, trap

Week 5: Books, Music, Images, Movies
music, dementia, rating, alzheimer, movie, taste, playlist, political, novel, Books, musical, affiliation, preference, listen, writing

Week 6: Data as Culture: Curating, Scrubbing, and Sampling
classification, hire, narrative, card, database, replicate, icd, scientific, decline, finding, poetic, viscosity, replication, solution, electronic

The Class Week 7-11

Week 7: Prediction
customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger

Week 8: Personal data online. Conversations and Persistence. Interpretations of personal data.
Spider, thesis, speaker, oatmeal, report, communicative, annual, persona, public, email, private, eat, analyzeword, mouth, wife

Week 9: History of Big Data Critiques
skull, friction, reductionism, craniology, maturity, downfall, shimmering, positivism, introspectometer, domain, inaccurate, conflict, economics, igy, dominate

Week 10: Life After Privacy
obfuscation, protect, privacy, car, policy, setting, private, default, public, option, breach, anonymize, identifiable, regulation, photo

Week 11: Art as Data; Data as Art
art, wind, transfinite, installation, artistic, cascade, choir, hint, visualization, rose, color, contents, flow, beautiful

Picking on an Individual: Craig Evans – Total Corpus

Keywords from total corpus
cent, visualisation, suspect, secondary, teach, zip, irb, material, illustrate, interestingly, openly, playlist, artwork, profile, century, experience, lose, computationally, reuse

Most negative sentiment …
not, lose, suspect, base, dementia, secondary, paranoid, bias, present, present, disturbing, insufficient, paranoia, difficult, number

Most positive sentiment …
model, interesting, good, well, better, researcher, accurate, aware, time, time, beneficial, enable, teach, illustrate, find, method, read, add, excellent, art

Picking on an Individual: Craig Evans – Week 7

Week 7: Prediction … Keywords
customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger

Keywords against rest of corpus
model, influence, buying, paper, predictive, joke, series, economist, valid, pregnant, resource, woman, link

Most negative sentiment …
bias, difficult, not, base, nefarious, invalid, defunct, savage, hard, blue, miss, number, scale, pregnant

Most positive sentiment …
model, find, color, joke, read, sound, accurate, interesting, valid, valid, privacy, improve, influence, compare, reasoning, group, improvement, absolute

CAS587 Wordle – Just for Karrie

Questions?