Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Using Cassandra to Support Crisis Informatics Research Kenneth M. Anderson Associate Professor Department of Computer Science Co-Director of The Center for Software and Society Co-Director of Project EPIC Director of CU’s Big Data Initiative Happy Ada Lovelace Day!

description

Crisis Informatics is an area of research that investigates how members of the public make use of social media during times of crisis. The amount of social media data generated by a single event is significant: millions of tweets and status updates accompanied by gigabytes of photos and video. To investigate the types of digital behaviors that occur around these events requires a significant investment in designing, developing, and deploying large-scale software infrastructure for both data collection and analysis. Project EPIC at the University of Colorado has been making use of Cassandra since Spring 2012 to provide a solid foundation for Project EPIC's data collection and analysis activities. Project EPIC has collected terabytes of social media data associated with hundreds of disaster events that must be stored, processed, analyzed, and visualized. This talk will cover how Project EPIC makes use of Cassandra and discuss some of the architectural, modeling, and analysis challenges encountered while developing the Project EPIC software infrastructure.

Transcript of Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Page 1: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Using Cassandra to Support Crisis Informatics Research

Kenneth M. Anderson Associate Professor

Department of Computer Science

Co-Director of The Center for Software and Society

Co-Director of Project EPIC

Director of CU’s Big Data Initiative

Happy Ada Lovelace Day!

Page 2: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Ken Anderson

Associate Professor; Department of Computer Science

‣ Research Interests

• Software Architecture and Software Design

• Data-Intensive Systems and Crisis Informatics

‣ Teaching Interests

• Software Engineering; OO A&D; Data Engineering

‣ Active in Broadening Participation in Computer Science

• Led the creation of the BA in CS degree at CU

- 450 new CS majors in two years; 900 CS majors on campus

Page 3: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Project EPIC

‣ Empowering the Public with Information in Crisis

• Largest NSF-Funded Project on Crisis Informatics

- ~4M since Fall 2009

‣ Results

• ~60 research publications, 2 PostDocs, 5 PhD graduates, 4 MS graduates, 13 current PhD students

• Tweak the Tweet; 100+ data sets (~1.5B tweets)

• Software: Data collection, analytics, NLP, GIS

Page 4: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Crisis Informatics

The study of how technology is changing the way the world responds to mass emergency events

Page 5: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
Page 6: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
Page 7: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
Page 8: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

70K Geotagged Tweets prior/during/after Hurricane Sandy Landfall

Page 9: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

[Figure: 2013 Colorado Floods, First Nine Days. Tweets Per Minute (y-axis, 0 to 140) plotted from 9/12/13 12:00 AM to 9/20/13 12:00 PM in 12-hour steps.]

Average Tweets Per Minute for each of the nine days: 51, 31, 15, 17, 11, 7, 7, 5, 3

Page 10: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
Page 11: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
Page 12: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Project EPIC Software Infrastructure

‣ EPIC Collect

• Twitter data collection infrastructure capable of collecting 24/7 with 99.9% uptime (since 2010)

- Built on top of Cassandra and designed for scalability, availability, and flexibility

‣ EPIC Analyze

• A scalable and flexible data analytics environment that allows Project EPIC analysts to browse, search, filter, annotate, and process EPIC Collect data sets

- Built on top of DataStax Enterprise, Redis, Rails, & Postgres

Page 13: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Project EPIC Software Architecture

Logical Arrangement of Components; Deployed across seven servers in a CU Data Center

[Diagram: Application Layer (EPIC Event Editor, EPIC Analyze, Splunk), Service Layer, and Storage Layer; other components shown: Twitter, EPIC Collect, Redis, PostgreSQL, Cassandra, Solr, Pig, Hadoop, DataStax Enterprise.]

Page 14: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

EPIC Collect

Page 15: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

[Diagram: Twitter; the Twitter Collection Service and its Log; four Cassandra nodes in the Data Center; the Project EPIC Event Editor.]

Page 16: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Why Cassandra?

Flexibility. Immune to changes in Tweet metadata.

{ “id”: … }

Page 17: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Why Cassandra?

Availability. Tweets can be written to any node in the cluster.

Page 18: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Why Cassandra?

Scalability. Need more disk space? Add more nodes!

[Diagram: the Data Center now shown with eight Cassandra nodes instead of four.]

Page 19: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Why Cassandra?

Robustness. Data on nodes automatically replicated.

Page 20: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

However…

Page 21: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Data Modeling is Wicked Hard

Page 22: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Getting Row Keys Right

Page 23: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Cassandra Data Model

It’s hash tables all the way down…

Row Key 1 | Column Name A → Value; ••• ; Column Name X → Value

•••

Row Key N | Column Name B → Value; ••• ; Column Name Y → Value

The design of row keys is critical.
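The "hash tables all the way down" picture can be sketched as a dictionary of dictionaries. This is only an in-memory illustration; `column_family`, `get_column`, and `put_column` are hypothetical names, not part of any Cassandra API.

```python
# Hypothetical in-memory picture of a column family:
# outer hash table: row key -> row; inner hash table: column name -> value.
column_family = {
    "row_key_1": {"column_name_a": "value", "column_name_x": "value"},
    "row_key_n": {"column_name_b": "value", "column_name_y": "value"},
}


def get_column(cf, row_key, column_name):
    """Two hash lookups: first the row, then the column within that row."""
    return cf[row_key][column_name]


def put_column(cf, row_key, column_name, value):
    """Rows are created on first write; each row may hold different columns."""
    cf.setdefault(row_key, {})[column_name] = value


put_column(column_family, "row_key_2", "column_name_c", "another value")
```

Note that every read starts from the row key, which is why its design matters so much.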

Page 24: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Why?

‣ Row keys determine what you can retrieve

• They are your primary means to make a query and retrieve relevant data; their structure determines query expressivity

• It should be easy to generate them from elements of your problem domain

‣ Row keys determine how “wide” your rows are

• This is important because Cassandra replicates rows

‣ Row keys are partitioned across your cluster’s nodes

• A “bad” row key design can negatively impact performance

Page 25: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Row Keys Should Reflect Problem Domain

‣ You need to easily be able to generate row keys based on information in your problem domain

<region_name>:<entity_name>:<time_collected>

vs

751e8446ede178f10fd44e3a37affb6b15ed30ce

‣ The former: easily generated from domain objects

• easily reconstructed at query time

‣ The latter might be easily generated

• but not easily reconstructed
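A minimal sketch of the contrast, under stated assumptions: the `HHMM_MMDDYYYY` time layout and the function names are illustrative choices, and SHA-1 stands in for whatever produced the 40-character digest on the slide.

```python
import hashlib
from datetime import datetime


def domain_row_key(region_name, entity_name, time_collected):
    """Build <region_name>:<entity_name>:<time_collected> from domain values."""
    # The HHMM_MMDDYYYY layout is an assumption made for this sketch.
    return f"{region_name}:{entity_name}:{time_collected:%H%M_%m%d%Y}"


def opaque_row_key(payload):
    """Hex digest of the payload: easy to generate, but it cannot be rebuilt
    at query time unless the exact payload bytes are still at hand."""
    return hashlib.sha1(payload).hexdigest()


key = domain_row_key("US_EastCoast", "Invoices", datetime(2014, 1, 1, 0, 0))
opaque = opaque_row_key(b'{"invoice": 42}')
```

The first key can be rebuilt by anyone who knows the region, entity, and time; the second can only be looked up if it was recorded somewhere else.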

Page 26: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

The Reason?

‣ No easy way to ask Cassandra for all row keys in a column family

• If you want to get this information, you have to query Cassandra for it, in batches, until all row keys have been retrieved

- This is not an O(1) operation!

‣ Instead, it’s better if you can skip this step and reconstruct from your problem domain

• US_EastCoast:Invoices:0000_01012014 to US_EastCoast:Invoices:2359_12312014
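Reconstructing a key range from the problem domain might look like the sketch below. It assumes one row key per minute and reads the example keys as `HHMM_MMDDYYYY`; both are interpretations of the slide, and `row_keys_between` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def row_keys_between(region, entity, start, end, step=timedelta(minutes=1)):
    """Yield every row key from start to end (inclusive), computed locally
    instead of paging through the column family to discover them."""
    current = start
    while current <= end:
        yield f"{region}:{entity}:{current:%H%M_%m%d%Y}"
        current += step


keys = list(row_keys_between(
    "US_EastCoast", "Invoices",
    datetime(2014, 1, 1, 0, 0), datetime(2014, 1, 1, 0, 2),
))
```

Because the function is a generator, a full-year range can be walked lazily rather than materialized all at once.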

Page 27: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Wide vs. Narrow

‣ You can design “wide” rows or “narrow” rows

• This corresponds to returning a LOT of information for a given key or a limited amount of information

- Wide: fb_users_dk → user 1; user 2; … user 100,000; …

- Narrow: ken_age_ht → age; height

• Wide rows can be useful, for instance, if your domain has lots of “events” on a given day or within a given hour

Page 28: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

The Rub? Rows Get Replicated

As previously mentioned, rows get replicated

For wide rows, this can be a performance concern.

How wide is too wide? Depends on size of cluster and network bandwidth

Page 29: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Row Keys Get Partitioned

‣ The nodes in your cluster divide up the key space between them

• The value of a row key determines where it will get stored

‣ You have to be cognizant of this partition because often Cassandra is being used in situations where a LOT of data is being written to it

• You need to make sure your row key design does not overburden any one node in your cluster

Page 30: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Imagine your row_key is a monotonically increasing integer

Say, for instance, tweet ids

Over a single day, all tweets might be saved on just one node in the cluster; the others would remain idle!

[Diagram: the Twitter Collection Service and four Cassandra nodes.]

Page 31: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Instead, you want enough variation that keys get evenly distributed across the cluster

[Diagram: a Writer and a Reader; row_key_1, row_key_a, row_key_$, and row_key_2 spread across four Cassandra nodes.]
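The hot-spot effect can be simulated in a few lines. This sketch assumes keys are placed by contiguous ranges (as with an order-preserving partitioner) and shows hash-based placement for contrast; `NODES`, `range_node`, and `hashed_node` are hypothetical and do not model Cassandra's actual token assignment.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def range_node(tweet_id, id_space=2**32):
    """Order-preserving placement: each node owns one contiguous slice of ids."""
    return min(tweet_id * NODES // id_space, NODES - 1)


def hashed_node(row_key):
    """Hash-based placement: the node is derived from an MD5 digest of the key."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


# One "day" of monotonically increasing ids: a narrow band of the id space.
day_ids = range(1_000_000, 1_010_000)
range_load = Counter(range_node(i) for i in day_ids)
hashed_load = Counter(hashed_node(str(i)) for i in day_ids)
```

Under range placement every id in the band maps to the same node, while the hashed keys land on all four; printing the two counters makes the imbalance visible.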

Page 32: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Design of Row Key for EPIC Collect

‣ For Project EPIC, we make use of a “hybrid” row key

• The first part of the row_key is a keyword used to collect tweets for a given event

- earthquake, flood, cowx, obama, …

• The second part of the row_key is the Julian day that a tweet was collected on

- January 1, 2014 equals “2014001”; February 1, 2014 equals “2014032”; etc.

• The third part of the row_key is the last digit of an MD5 hash of the entire Tweet JSON object

- i.e. 0-9, a-f; This is used to distribute tweets across the cluster
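The three parts above can be assembled as in this sketch. The `:` separator follows the `keyword:day:tag` layout shown for the Tweets column family; the UTF-8 encoding, the JSON serialization, and the name `epic_row_key` are assumptions for illustration, not Project EPIC's actual code.

```python
import hashlib
import json
from datetime import date


def epic_row_key(keyword, collected_on, tweet_json):
    """Return keyword:julian_day:tag for one collected tweet."""
    # Year plus zero-padded day of year, e.g. February 1, 2014 -> "2014032".
    julian_day = collected_on.strftime("%Y%j")
    # Last hex digit (0-9, a-f) of the MD5 of the whole tweet JSON string.
    tag = hashlib.md5(tweet_json.encode("utf-8")).hexdigest()[-1]
    return f"{keyword}:{julian_day}:{tag}"


tweet_json = json.dumps({"id": 1, "text": "This flood is ..."})
row_key = epic_row_key("flood", date(2014, 2, 1), tweet_json)
```

Since the tag depends only on the tweet's content, tweets for the same keyword and day are split across up to sixteen rows.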

Page 33: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Tweets Column Family

keyword:day:tag | Tweet Id 1 → JSON; ••• ; Tweet Id N → JSON

•••

keyword2:day:tag | Tweet Id 1 → JSON; ••• ; Tweet Id M → JSON

•••

‣ keyword: a word of interest for an event; e.g. “flood”

‣ julian_day: the day of the year a tweet was collected

‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
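One consequence of this layout is that reading everything for a keyword and day means visiting all sixteen tagged rows. A sketch, using a plain dictionary as a stand-in for the column family; `row_keys_for` and `tweets_for` are hypothetical helpers.

```python
from datetime import date

HEX_TAGS = "0123456789abcdef"


def row_keys_for(keyword, day):
    """Rebuild the sixteen row keys for one keyword and day from domain values."""
    return [f"{keyword}:{day:%Y%j}:{tag}" for tag in HEX_TAGS]


def tweets_for(column_family, keyword, day):
    """Merge the columns (tweet id -> JSON) of all sixteen rows."""
    merged = {}
    for row_key in row_keys_for(keyword, day):
        merged.update(column_family.get(row_key, {}))
    return merged


# Stand-in data: two of the sixteen rows hold tweets, the rest are absent.
store = {
    "flood:2014002:a": {"101": '{"text": "This flood is ..."}'},
    "flood:2014002:3": {"102": '{"text": "Stay safe"}'},
}
found = tweets_for(store, "flood", date(2014, 1, 2))
```

No key discovery is needed: the keyword and date alone are enough to rebuild every row key.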

Page 34: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Row Key Distribution

[Diagram: the sixteen row keys flood:002:0 through flood:002:f spread across four Cassandra nodes.]

Replication: flood:002:0, flood:002:1, …

Page 35: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

EPIC Analyze

Page 36: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

EPIC Analyze

A data analytics environment for large Twitter Data Sets

‣ Provides a scalable and extensible analysis environment

• Aims to partially automate Project EPIC’s analysis work

- Automatically calculate common metrics on all data sets

- Apply new analysis algorithms to entire data sets at once

- Support filtering/sampling on large data sets

- Support shared data set annotation by a team of analysts

• Provide these features while

- supporting data sets of millions of tweets

- with fast performance so as not to interrupt analysis work

Page 37: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

[Diagram: DataStax Enterprise (Hadoop, Cassandra, Solr) together with Pig, Redis, Project EPIC Web Apps, 3rd Party Analytics Apps, Twitter, and Facebook.]

Page 38: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Challenges

‣ Recall: goal of EPIC Collect is to store events in a reliable, scalable fashion

‣ Data not necessarily structured to support analysis

• Implication: Need for Migration/Duplication to enable features such as searching, filtering, analysis, etc.

Page 39: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Data Migration and Duplication

‣ With EPIC Collect, we chose to have fairly “wide” rows

• Each row stores the tweets that contain a given keyword for a given day

- “All tweets that contain the word “flood” collected on 01/01/14”

- We use the “tag” to keep the row from growing too large, but there can still be 100s of 1000s of tweets in each row

‣ To support searching/filtering, we want to use Solr

• however, Solr requires “narrow” structured rows

- one tweet per row, each column defined by a schema

Page 40: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

We go from…

row_key flood:2014002:a | tweet_1, tweet_2, tweet_3, … , tweet_999999, …, tweet_N

where each tweet is stored as JSON: {“text” : “This flood is …” …}

To this…

row_key_for_tweet_1 | tweet_1_attributes

row_key_for_tweet_2 | tweet_2_attributes

row_key_for_tweet_3 | tweet_3_attributes

…

row_key_for_tweet_N | tweet_N_attributes

Page 41: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Implications

‣ Each time a data set is “imported” into EPIC Analyze

• we must launch a script that reformats each tweet into the “narrow row” format required by Solr

- In the future, we’ll modify collection to write tweets both ways

‣ It’s not a complete duplication

• we only store those attributes that we want to search on

‣ but it’s still significant

‣ the benefit is that we can then apply all of Solr’s powerful search capabilities to our data sets
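A reformatting step of this kind could look roughly like the sketch below. It is not the actual import script: the attribute list, the use of the tweet id as the narrow row key, and the `source_row` back-pointer are all assumptions made for illustration.

```python
import json

SEARCHABLE = ("id", "text", "lang")  # hypothetical subset of attributes to index


def narrow_rows(wide_row_key, wide_row, attributes=SEARCHABLE):
    """Yield (row_key, record) pairs, one per tweet stored in a wide row."""
    for tweet_id, tweet_json in wide_row.items():
        tweet = json.loads(tweet_json)
        # Copy only the attributes chosen for searching, skipping missing ones.
        record = {name: tweet[name] for name in attributes if name in tweet}
        # Hypothetical extra column pointing back to the row with the full JSON.
        record["source_row"] = wide_row_key
        yield str(tweet_id), record


wide_row = {
    "101": json.dumps({"id": 101, "text": "This flood is ...", "lang": "en", "extra": 1}),
    "102": json.dumps({"id": 102, "text": "Stay safe"}),
}
rows = dict(narrow_rows("flood:2014002:a", wide_row))
```

Each wide row fans out into one small record per tweet, which is why the duplication is partial but still significant for rows holding hundreds of thousands of tweets.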

Page 42: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Conclusions

Page 43: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Cassandra: Strong Foundation for Project EPIC

‣ With migration to Cassandra in 2012, EPIC Collect has been running 24/7 with minimal downtime

• Downtime usually related to network outages

• Cassandra keeps right on ticking!

‣ Has provided Project EPIC with a reliable environment to perform a wide range of crisis informatics research

• leading to new understanding of how people use Twitter to coordinate and collaborate during times of disaster

Page 44: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Cassandra: Strong Foundation for Project EPIC

‣ An excellent NoSQL technology but you must take time to understand Cassandra’s advantages and its data model

• Provides flexibility, availability, scalability, and robustness

• Row keys

- difficult to get right (but that’s true of all data modeling tasks!)

- design to reflect your problem domain

- to determine width of rows (and speed of replication)

- and to partition data across your cluster

Page 45: Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Thank You

Ken Anderson <[email protected]>

Project Epic: <http://epic.cs.colorado.edu>

Department of Computer Science University of Colorado

@epiccolorado