Managing Unstructured Data

AnHai Doan, University of Wisconsin-Madison


Transcript of Managing Unstructured Data

Page 1: Managing Unstructured Data

AnHai Doan, University of Wisconsin-Madison

Managing Unstructured Data

Page 2: Managing Unstructured Data


Unstructured Data ...

Appears in many forms
– emails, Web pages, memos, call center text records, etc.

Is pervasive
– 80% of the world's data, and is growing

Is managed by many players
– SIGIR/WWW/KDD/AAAI, Google/Yahoo/Microsoft/IBM

We should work on it, or risk missing the boat!

But what sets us apart from the above guys?

Page 3: Managing Unstructured Data


Structure + System Focus!

Make it very easy to extract structures from raw data
– in raw form: keyword search / bag analysis
– many apps want to go beyond that; they want structure
– we should encourage this: back to our playground
– not just DB + IR, but DB + IR + IE

Instead of working on isolated research problems, let's build an end-to-end UDMS
– should repeat what we did with System R / Ingres: a system blueprint, followed by 20 years of rapid progress
– unifies & accelerates our research efforts
– keeps work grounded, makes impact
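To make the "beyond keyword search" point concrete, here is a minimal sketch contrasting raw-form access with extracted structure. Everything in it is an assumption made for illustration: the call-center notes are invented, and the hand-written regular expression only stands in for an IE component; it is not a method proposed in the talk.

```python
import re

# Hypothetical call-center notes, invented for this illustration.
RAW_NOTES = [
    "2008-03-02 caller Ann Lee reports printer P-100 jams; ticket 4521 opened",
    "2008-03-05 caller Bob Ray asks about printer P-200 warranty; ticket 4530 opened",
    "2008-03-09 caller Ann Lee says ticket 4521 still open, printer P-100 unusable",
]


def keyword_search(notes, keyword):
    """Raw-form access: return every note that contains the keyword."""
    return [note for note in notes if keyword.lower() in note.lower()]


# A hand-written pattern standing in for an extraction (IE) step.
# This is an assumption: a real extractor would be far more robust.
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) caller (?P<caller>[A-Z][a-z]+ [A-Z][a-z]+)"
    r".*?printer (?P<product>P-\d+)"
)


def extract(notes):
    """Turn each note that matches the pattern into a structured record."""
    records = []
    for note in notes:
        match = PATTERN.search(note)
        if match:
            records.append(match.groupdict())
    return records


def calls_per_product(records):
    """A structured query that keyword search alone cannot answer."""
    counts = {}
    for record in records:
        counts[record["product"]] = counts.get(record["product"], 0) + 1
    return counts


if __name__ == "__main__":
    print(keyword_search(RAW_NOTES, "printer"))   # whole notes, no structure
    print(calls_per_product(extract(RAW_NOTES)))  # counts grouped by product
```

Keyword search only returns whole notes; once records are extracted, DB-style questions (group by product, filter by caller or date) become possible, which is the DB + IR + IE combination argued for above.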

Page 4: Managing Unstructured Data


What Does this System Look Like?

[Diagram; labels recovered from the slide]
– Extraction + Integration
– Flexible modes of interaction
– Mass collaboration
– Best-effort, pay-as-you-go, improving over time
– Scale up to huge data (by running over clusters)
– Joe Hellerstein
– Joe Six-Pack

DB + IR + IE + II, in a best-effort, Web 2.0 fashion

Page 5: Managing Unstructured Data


Broader Impacts

Great for many current applications
– e-science, business, personal data, Web data, etc.

Great for many current research topics
– IR, integration, PIM, data spaces
– user interfaces, HCI, mashups
– provenance, uncertainty
– cluster management
– query processing
– monitoring, handling changes, pub/sub systems

Raises novel research issues
– mass collaboration, best-effort, extraction, helping Joe Six-Pack

Helps define data management principles in broader contexts