IRE 2012 Unstructured Data Talk

Post on 11-Jul-2015


Analyzing Unstructured Data for Stories

eugenewu@mit.edu

What Am I Talking About?

• Example

• Structured Data 101

• Structured Data Continuum

• More Examples

http://projects.propublica.org/drywall/

http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint

1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama

541 Lynn Hurst Court, Montgomery, Alabama 36117

37.0625, -95.677068

Jefferson County
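Pulling addresses like the one above out of the complaint text can be sketched with a regular expression. This is a minimal sketch, not the method ProPublica used: the pattern assumes every address follows the wording "located at <street>, <city>, <state> <ZIP>", and the names `ADDRESS_RE` and `extract_addresses` are made up for illustration. Turning each address into a lat/lon would additionally need a geocoding service, which is not shown.

```python
import re

# Two sentences taken from the complaint excerpt quoted earlier.
COMPLAINT = (
    "1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of "
    "Alabama and together own real property located at 541 Lynn Hurst Court, "
    "Montgomery, Alabama 36117. "
    "1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns "
    "real property located at 2105 Lane Avenue, Birmingham, Alabama 35217."
)

# Assumed pattern: "located at <street>, <city>, <state> <5-digit ZIP>".
ADDRESS_RE = re.compile(
    r"located at (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Za-z ]+) (?P<zip>\d{5})"
)


def extract_addresses(text):
    """Return one dict (street, city, state, zip) per address found."""
    return [match.groupdict() for match in ADDRESS_RE.finditer(text)]


addresses = extract_addresses(COMPLAINT)
for address in addresses:
    print(address)
```

Each result is a small record with named attributes, which is the kind of structure the rest of the talk builds on.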

http://projects.propublica.org/drywall/

Scanned Documents → Addresses → Google Maps

Unstructured Information → Structured Data → Visualization

Who cares?

What is it?

Who Cares?

• Software

– Store: Databases, PANDA

– Analyze: Fusion tables, Excel, Databases, R/Python/Ruby

• Visualization

• Mashups

Who Cares?

• Software

• Visualization

• Mashups: Tainted House Data + Economic Data + Health Stats + Crime Stats + Corruption Data

Structured Data

Attribute

Name

Data type

Consistent

Florida’s Lee County has 1518 addresses

Structured Data

Attribute

Name

Data type

Consistent

Numeric (integers, dollars, …)

Date/Time

Lat, Lon

Structured strings (Florida)

Structured Data

Attribute

Name

Data type

Consistent

FLORIDA         5
FL              10
Flroida         1
FloridaState    1
Florida’s       1
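Making a string attribute consistent can be sketched as a lookup from every observed spelling to one canonical value. This sketch reads the numbers on the slide as occurrence counts, which is an assumption, and the `ALIASES` table and `normalize` function are hypothetical names written by hand for this example.

```python
from collections import Counter

# Spellings from the slide, with the numbers read as occurrence counts
# (an assumption about what the slide shows).
raw_counts = Counter(
    {"FLORIDA": 5, "FL": 10, "Flroida": 1, "FloridaState": 1, "Florida’s": 1}
)

# Hypothetical hand-written alias table: lowercased spelling -> canonical value.
ALIASES = {
    "florida": "Florida",
    "fl": "Florida",
    "flroida": "Florida",
    "floridastate": "Florida",
    "florida's": "Florida",
}


def normalize(value):
    """Map a raw spelling to its canonical form, or return it unchanged."""
    key = value.strip().lower().replace("’", "'")
    return ALIASES.get(key, value)


clean_counts = Counter()
for value, count in raw_counts.items():
    clean_counts[normalize(value)] += count

print(clean_counts)
```

Unknown spellings pass through unchanged, so they stay visible for manual review instead of being silently merged.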

What Am I Talking About?

• Structured Data 101

• Structured Data Continuum

• More Examples

Continuum

unstructured ↔ structured

Images

http://www.whatisstephenharperreading.ca/2010/03/01/book-number-76-one-day-in-the-life-of-ivan-denisovich-by-alexander-solzhenitsyn/

Images, Text Blob

1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the

Images, Text Blob, Email

Subject: Re: IRE conference in Boston
Date: June 1, 3:08PM
From: jaimi@ire.org
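Email sits partway along the continuum because its headers are already named attributes. A minimal sketch using Python's standard `email` module follows; the raw message is assembled from the three header values on the slide, and the body line is a placeholder added for illustration.

```python
from email import message_from_string

# Raw message built from the header values on the slide.
# The body line is a placeholder, not part of the original slide.
RAW_EMAIL = (
    "Subject: Re: IRE conference in Boston\n"
    "Date: June 1, 3:08PM\n"
    "From: jaimi@ire.org\n"
    "\n"
    "(body text would go here)\n"
)

message = message_from_string(RAW_EMAIL)

# The headers are structured; the body is still a text blob.
record = {
    "subject": message["Subject"],
    "date": message["Date"],
    "from": message["From"],
    "body": message.get_payload(),
}

print(record)
```

The date is kept as the string it arrived in; converting it to a real date/time type would be a separate step.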

Images, Text Blob, Email, Excel

“It’s sunny in texas”

Tweet                  Weather    Location
It’s sunny in texas    Sunny      Texas

Tweet                  Weather    Location
It’s sunny in texas    Sunny      (37.06, -95.67)
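Turning the tweet into a row with Weather and Location attributes can be sketched with two small lookup tables. Both tables (`WEATHER_WORDS`, `PLACES`) and the function name are hypothetical and cover only a few words; replacing the place name with a lat/lon pair would need a geocoding service, which is left out here.

```python
import re

# Hypothetical vocabularies; a real project would need far larger lists.
WEATHER_WORDS = {"sunny": "Sunny", "rainy": "Rainy", "snowy": "Snowy"}
PLACES = {"texas": "Texas", "boston": "Boston", "florida": "Florida"}


def tweet_to_record(tweet):
    """Return a dict with the tweet text plus weather and location if found."""
    words = re.findall(r"[a-z]+", tweet.lower())
    weather = next((WEATHER_WORDS[w] for w in words if w in WEATHER_WORDS), None)
    location = next((PLACES[w] for w in words if w in PLACES), None)
    return {"tweet": tweet, "weather": weather, "location": location}


record = tweet_to_record("It’s sunny in texas")
print(record)
```

Attributes that cannot be found come back as `None`, which keeps the row shape consistent across tweets.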

When: You have unstructured data

Ask: What structure do I need?

Find: Attributes with simple types

What Am I Talking About?

• Structured Data 101

• Structured Data Continuum

• More Examples

http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/

2011 State of the Union

Name    Type/Meaning
Word    String

Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans:

Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner. And as we mark this occasion, we're also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -- Gabby Giffords.

It's no secret that those of us here tonight have had our differences over the last two years. The debates have been contentious; we have fought fiercely for our beliefs. And that's a good thing. That's what a robust democracy

Word

Mr, Speaker, Vice, President, Members, Congress, Distinguished, Guests, Americans, People, Jobs, New, years
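Splitting the speech into a Word attribute and counting occurrences, which is the input a word cloud needs, can be sketched as below. The stop-word list is a small assumed set chosen for this excerpt, not the one behind the Boston.com graphic, and `word_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Opening of the 2011 State of the Union, as quoted in the slides.
SPEECH = (
    "Mr. Speaker, Mr. Vice President, members of Congress, distinguished "
    "guests, and fellow Americans: Tonight I want to begin by congratulating "
    "the men and women of the 112th Congress, as well as your new Speaker, "
    "John Boehner."
)

# Small assumed stop-word list; real lists are much longer.
STOP_WORDS = {"i", "of", "and", "to", "by", "the", "as", "well", "your", "want"}


def word_counts(text):
    """Count alphabetic words, case-insensitively, skipping stop words."""
    words = re.findall(r"\b[A-Za-z']+\b", text)
    return Counter(w.lower() for w in words if w.lower() not in STOP_WORDS)


counts = word_counts(SPEECH)
print(counts.most_common(5))
```

Each distinct word becomes a row with a string attribute and a numeric count, which a visualization tool can size or sort on.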

Bin Laden Tweets/Sec

http://www.flickr.com/photos/twitteroffice/5681263084/

Name    Type/Meaning
Time    Time

Deadly Day in Baghdad

http://www.nytimes.com/interactive/2010/10/24/world/1024-surge-graphic.html?pagewanted=all

Name          Type/Meaning
Location      Lat, Lon
Body Count    Number

http://www.nytimes.com/interactive/world/iraq-war-logs.html?pagewanted=all

14, 12

Killed in Action

Lat Lon

Sentiment of NZ Earthquake

http://twitinfo.csail.mit.edu/detail/4/

Name         Type/Meaning
Happiness    -1 to 1

Pattern Matching

“Great, 7AM meeting” → 7:00AM
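The step from "Great, 7AM meeting" to 7:00AM can be sketched with a regular expression that finds a 12-hour clock time and rewrites it in one format. The pattern and the `extract_time` name are assumptions for illustration; they handle forms like "7AM" and "10:30 pm" only.

```python
import re

# Assumed pattern: hour 1-12, optional ":MM", optional space, then AM or PM.
TIME_RE = re.compile(
    r"\b(?P<hour>1[0-2]|0?[1-9])(?::(?P<minute>[0-5]\d))?\s*(?P<ampm>AM|PM)\b",
    re.IGNORECASE,
)


def extract_time(text):
    """Return the first time in the text as 'H:MMAM' or 'H:MMPM', else None."""
    match = TIME_RE.search(text)
    if match is None:
        return None
    hour = int(match.group("hour"))
    minute = match.group("minute") or "00"
    return f"{hour}:{minute}{match.group('ampm').upper()}"


print(extract_time("Great, 7AM meeting"))
```

Pattern matching finds the time but says nothing about how the writer feels about it, which is the harder problem the next slides turn to.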

Interpret Meaning

“Great, 7AM meeting” → Not Happy

“Great, 7AM meeting” → Happy!

• It’s still new

• Lack of context

Earthquakes

Extracting meaning is by far the most difficult
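Why meaning is hard can be illustrated with a naive word-list scorer on the -1 to 1 happiness scale. This is a deliberately simple sketch, not the method behind TwitInfo: the `LEXICON` weights and the `naive_sentiment` name are made up for the example. It rates "Great, 7AM meeting" as fully happy because it sees the word "great" and has no context for sarcasm.

```python
import re

# Hypothetical word weights on a -1 (unhappy) to 1 (happy) scale.
LEXICON = {
    "great": 1.0,
    "happy": 1.0,
    "love": 1.0,
    "awful": -1.0,
    "sad": -1.0,
    "hate": -1.0,
}


def naive_sentiment(text):
    """Average the weights of known words; 0.0 when none are present."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


print(naive_sentiment("Great, 7AM meeting"))
```

Because the result is an average of weights that each lie between -1 and 1, the score always stays in that range.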

What if it’s just unstructured?

CrowdSourcing

Lots of humans do tasks computers suck at

Training

Quality Issues

Dealing with Forms

Entity Information

Pattern Matching

• Regex

– Describe and find patterns

– Killed in action

(?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA
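The regex above uses Python's named-group syntax, so it can be tried directly with the `re` module. In this sketch the sample strings are hypothetical, written only to exercise the pattern; they are not taken from the war logs, and `killed_in_action` is an illustrative name.

```python
import re

# The pattern from the slide: a 1-3 digit count, an optional short
# uppercase code, then the token KIA.
KIA_RE = re.compile(r"(?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA")


def killed_in_action(summary):
    """Return the first killed-in-action count as an int, or None."""
    match = KIA_RE.search(summary)
    return int(match.group("n")) if match else None


# Hypothetical report summaries, made up for this example.
samples = [
    "2 KIA",
    "IED ATTACK: 3 CIV KIA, 1 CIV WIA",
    "NO CASUALTIES",
]

for summary in samples:
    print(summary, "->", killed_in_action(summary))
```

The result is a numeric attribute that can be stored next to each report's lat/lon.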

DBTruck demo?

Structure = Super Valuable

When: You have unstructured data

Ask: What structure do I need?

Find: Attributes with simple types

tinyurl.com/iredatatipsheet

eugenewu@mit.edu

@sirrice
