Using to Save Lives Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir...

Post on 29-Jan-2016



Using Digg to Save Lives

Or, Using Digg to find interesting events.

Presented by: Luis Zaman, Amir Khakpour, and John Felix

Outline

Explanation

Digg is a social web-media discovery tool based on user-submitted content.

1 or 2 submissions a minute

Half-life of “interest” is about a day

Digg aggregates “interesting” content.

But how do we find interesting Events and know their Themes?

Motivation

The collaborative nature of social media can scour the WWW very thoroughly.

But this generates A LOT of data (you’ll see).

It would be cool to find emergencies or critical situations based on this collaborative media.

The Apple topic on Digg seems like a pretty good starting point.

Approach

Preprocessing - Digg API

REST API

http://services.digg.com/stories/topic/apple?count=10

XML response

<?xml version="1.0" encoding="utf-8" ?>
<users timestamp="1176998598" total="1" offset="0" count="1">
  <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" />
</users>

Limitations

100 results per request

1 hour of time series data

Can’t go fast, or else.
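A minimal sketch of how the 100-results-per-request limit could be worked around by paging, and how a response like the one above could be parsed. The endpoint URL and the sample XML come from the slide; the `offset` query parameter is an assumption inferred from the `offset` attribute in the response, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://services.digg.com/stories/topic/apple"  # endpoint shown on the slide
MAX_COUNT = 100  # per-request cap listed under Limitations


def page_urls(total, page_size=MAX_COUNT):
    """Build one URL per page of results.

    Assumption: the API accepts an `offset` query parameter next to `count`.
    """
    size = min(page_size, MAX_COUNT)
    urls = []
    for offset in range(0, total, size):
        query = urlencode({"count": size, "offset": offset})
        urls.append(f"{BASE_URL}?{query}")
    return urls


def parse_users(xml_text):
    """Return (envelope attributes, list of per-user attribute dicts)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return dict(root.attrib), [dict(user.attrib) for user in root.iter("user")]


# The sample response from the slide, used here instead of a live request.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8" ?>'
    '<users timestamp="1176998598" total="1" offset="0" count="1">'
    '<user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" '
    'registered="1135702996" profileviews="3104" /></users>'
)

if __name__ == "__main__":
    for url in page_urls(250):
        print(url)
    envelope, users = parse_users(SAMPLE)
    print(envelope["total"], users[0]["name"])
```

In a real crawl, a pause between requests (for example `time.sleep`) would be the obvious way to respect the "can’t go fast" limitation.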

Preprocessing

Time Series

Each digg is the event (only 100 at a time)

Rows: each story’s digg count

Columns: every hour (2,207 of them from August 08 – November 08)

Clustering

Rows: each story that was dugg at any point in the time series

Columns: the words in the title and description of this story
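The time-series layout described above (one row per story, one column per hour) can be sketched as follows. This is an illustration only: the input format, a list of `(story_id, unix_timestamp)` pairs, and the function name are assumptions, not the presenters' actual pipeline.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_counts(diggs, start, n_hours):
    """Bucket individual digg events into per-story hourly counts.

    diggs:   iterable of (story_id, unix_timestamp) pairs (assumed format)
    start:   unix timestamp of the first hour in the window
    n_hours: number of hourly columns
    Returns {story_id: [count_in_hour_0, count_in_hour_1, ...]}.
    """
    rows = defaultdict(lambda: [0] * n_hours)
    for story_id, timestamp in diggs:
        hour = (timestamp - start) // SECONDS_PER_HOUR
        if 0 <= hour < n_hours:  # ignore events outside the window
            rows[story_id][hour] += 1
    return dict(rows)


if __name__ == "__main__":
    events = [("a", 0), ("a", 10), ("a", 3600), ("b", 7199), ("b", 99999)]
    print(hourly_counts(events, start=0, n_hours=3))
```

With 2,207 hourly columns, most stories would have only a few non-zero cells, so a sparse representation would likely be preferable to the dense lists used here.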

Preprocessing - Challenges

SLOW

Really Dirty Data

Different Formats of Data

REALLY SLOW

Introduction to Document Clustering

Unlike structured data, text documents pose several clustering challenges:

Volume

Dimensionality

Sparsity

Complex semantics

In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).

The result is a huge sparse matrix, so we store only the non-zero values.

Text documents are converted to an $m \times n$ matrix $A$, where $m$ is the number of documents and $n$ is the total number of words (or phrases); each element $x_{i,j}$ is the frequency of the $j$th term in the $i$th document.
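The term-frequency matrix just described can be sketched with one dictionary per document, so that only non-zero entries are stored. The tokenizer and the toy titles below are assumptions for illustration; the presenters' actual preprocessing is not shown in the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplistic choice)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_sparse_tf(docs):
    """Return (vocab, rows): vocab maps term -> column index j,
    rows[i] maps column index j -> frequency of term j in document i."""
    vocab = {}
    rows = []
    for text in docs:
        row = {}
        for term, freq in Counter(tokenize(text)).items():
            j = vocab.setdefault(term, len(vocab))
            row[j] = freq
        rows.append(row)
    return vocab, rows


def density(rows, n_terms):
    """Fraction of the m x n matrix that is non-zero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * n_terms)


if __name__ == "__main__":
    titles = [
        "Apple releases new iPhone",
        "New Apple store opens",
        "iPhone iPhone sale",
    ]
    vocab, rows = build_sparse_tf(titles)
    print(len(vocab), rows[2], density(rows, len(vocab)))
```

Storing only the non-zero cells is what makes a matrix with tens of thousands of rows and columns manageable in memory.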

Clustering Dataset

Number of stories (m): 25470

Total number of unique words (n): 55557

Nonzero values: 469323 (0.03214%)

Clustering using Cluto software

Using K-means and bisecting K-means

Calculating centroids and SSE

A C++ program is run on “black”
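The centroid and SSE calculation can be illustrated with a short sketch. The presenters used a C++ program for this step; the Python below is a stand-in on small dense toy vectors, and the function names are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        center = centroid(members)
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


if __name__ == "__main__":
    data = [[0, 0], [0, 2], [10, 0], [10, 4]]
    print(sse(data, [0, 0, 1, 1]))  # two clusters
    print(sse(data, [0, 0, 0, 0]))  # everything in one cluster
```

A lower SSE for the same number of clusters indicates tighter clusters, which is presumably how the K-means and bisecting K-means results were compared.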

Document Clustering by Optimizing Criterion Functions

According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it:

Internal Criterion Functions (I): maximizing the internal similarity function

External Criterion Functions (E): minimizing the external similarity function

Hybrid Criterion Functions (H): maximizing $I / E$
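The formulas on this slide did not survive in the transcript. For orientation, one set of criterion functions from Zhao and Karypis's work on document clustering is sketched below; whether the presenters used exactly these variants is an assumption. Here $S_r$ is the $r$th of $k$ clusters, $n_r$ its size, $D_r = \sum_{d \in S_r} d$ its composite vector, $C_r = D_r / n_r$ its centroid, and $D$ and $C$ the composite vector and centroid of the whole collection, with documents taken as unit-length vectors.

```latex
\begin{align*}
\mathcal{I}_2 &= \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r)
               = \sum_{r=1}^{k} \lVert D_r \rVert
  && \text{(maximize)} \\
\mathcal{E}_1 &= \sum_{r=1}^{k} n_r \cos(C_r, C)
               = \sum_{r=1}^{k} n_r \, \frac{D_r^{t} D}{\lVert D_r \rVert \, \lVert D \rVert}
  && \text{(minimize)} \\
\mathcal{H}_2 &= \frac{\mathcal{I}_2}{\mathcal{E}_1}
  && \text{(maximize)}
\end{align*}
```

The hybrid function rewards clusters that are internally cohesive while keeping their centroids far from the centroid of the whole collection.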

Experiments

SSE for I (K-Means vs Bisecting K-Means)

Visualization

What we used

jQuery: JavaScript library for DOM manipulation and Ajax requests

PHP/MySQL: scripting language and database backend

Google Visualization API: zoomable time series graph

Timepedia Chronoscope: clickable

Conclusions

Success?

Of course we think so

Future Work

Save lives?

Better clustering

Cleaner data

More data

Make it scalable and dynamic

On-line and on the fly?