Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Unstructure :: smashing the boundaries of data :: 2014-03-07 - SxSWi Workshop Ian Varley - @thefutureian

description

When it comes to thinking about data, most software designers are stuck in a rigid, 2-dimensional mindset: "rows and columns." A shame, because breaking free from this "tyranny of the table" can bring our software to new heights: intuitive user experiences, fast development iterations, and cohesive apps.

In this workshop, we'll cover a few concepts that bring data design out of the 1970s, like: sparse representation, emergent schema, ultra-structure, prototype-driven design, graph theory, traversing the time dimension, and more. We'll run the gamut of philosophical approaches to understanding what is important in your mental (and software) model, and how to transcend your two-dimensional picture of data and trade it in for an N-dimensional one.

Working hands-on with a simple "mock company" and its new killer app, you'll learn:

* The basic concepts of data design: entities, relationships, attributes, and types (along with a few better ways to notate them)
* How to experiment with creating these data structures in a couple of existing cloud-based frameworks (e.g. Google App Engine, Force.com, Heroku)
* How emergent techniques like schema-on-read and ultra-structure can simplify modeling (or, sometimes, complicate it)
* How statistical techniques from the data mining world can loosen our insistence on rigid models
* Why the time dimension is important (in data as well as schema)

Transcript of Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Page 1: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Unstructure:: smashing the boundaries of data ::

2014-03-07 - SxSWi Workshop
Ian Varley - @thefutureian

Page 2: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Chapter 0: Intro & Logistics

Page 3: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hi. I’m Ian Varley.

I live in Austin, TX.

I work for Salesforce.com, doing data engineering. (Note: this presentation is entirely my own work and opinions, and doesn’t imply anything about Salesforce’s products.)

→ @thefutureian, ianvarley.com

Page 4: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

About me:

- BA in Philosophy
- MS in Software Engineering
- 15+ years database experience

Not really an authority on data structure, but

"You teach what you want to learn".

Page 5: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Why are we here?

To grok the structure of data, and then smash it.

Page 6: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Some logistics:

- 150 minutes, no breaks. (Feel free to get up, use the facilities, or leave if you're bored.)

- This will be dense. (Lots to cover, so we'll move fast.)

- But! Do interrupt at any time with questions. (If you’re lost, you’re not the only one.)

Page 7: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

All materials are available:

- This presentation: http://tiny.cc/unstructure-sxsw14-slides

- Live notes: http://tiny.cc/unstructure-sxsw14-notes

- Code & samples: https://github.com/ivarley/unstructure-sxsw14

Page 8: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

There’s some setup required, but we won’t need it right away.

Probably best if everyone starts trying to download & install stuff now, and do it in the background as I’m talking.

Page 9: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Setup, part 1: git

- Download & Install Git: http://git-scm.com/book/en/Getting-Started-Installing-Git

- Clone my repo:
  $ cd ~
  $ git clone https://github.com/ivarley/unstructure-sxsw14.git

Page 10: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Setup, part 2: heroku

- Download & Install Heroku Toolbelt: https://toolbelt.heroku.com/

- Create Heroku Account: https://id.heroku.com/signup

Page 11: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Setup, part 3: CouchDB

- Download & Install CouchDB: http://couchdb.apache.org/

Page 12: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Setup, part 4: miscellaneous

- Google Chrome: https://www.google.com/intl/en/chrome/browser/

- JSONView Plugin: http://goo.gl/K07fFs

Page 13: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Quick Survey:
- occupation: coders? designers? other?
- technical skill: low / medium / high
- know what a relational database is?
- know SQL?
- know what NoSQL means?
- have used a NoSQL database?
- have read Aristotle? :)

Page 14: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

There’s a spectrum in an audience like this ...

The Hacker: Wants hands on, doesn’t care about theory.

The Academic: Wants heady concepts, not comfortable with code.

What I’m aiming for:

The “Hackademic”: Wants enough theory to be grounded, and enough hacking to know when something is bullshit.

Page 15: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Where we’re headed:

Chapter 1 - Warming Up
Chapter 2 - Hierarchy
Chapter 3 - Relation
Chapter 4 - Mutation
Chapter 5 - Conclusion

Page 16: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Any questions before we get started?

Page 17: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Chapter 1: Warming Up

Page 18: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Talking about data can be a little … dry.

So, we’re going to use an example that most people can relate to easily.

Page 19: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Music.

Photo credit: Josh Haner/The New York Times

Page 20: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Why music?

● It's a domain we all know about.
  ○ But, informally (not usually for work or study)
● Lots of meaty concepts to think about:
  ○ Recording, Performances, Compositions, Bands, Venues, etc ...
● There are lots of music sites with data APIs:
  ○ Do512, EchoNest, Songkick, Sched.org, MusicBrainz, 7Digital, etc.
● It's on everyone's mind during SxSW.
● Also, I'm a musician and I felt like it.

Page 21: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Running Example:

● listen.up - Internet company for “all things music”.
  ○ Duh, this is fake, there’s no “.up” TLD
● What do we do? Everything!
● Including:
  ○ Recorded music catalogs, streaming, purchase ...
  ○ Live music performance, booking & tickets ...
  ○ Licensing, royalties, compositions, lyrics ...
  ○ Instruments, lessons, repairs, classifieds ...
  ○ Anything else you can think of.

Page 22: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Is this a good business model? No, but who cares!

Page 23: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You are my team of professional ontologists; you’re going to figure out what music data is out there in the world for us to store.

(I’ll pay you in stock. It’ll be worth a fortune, trust me.)

Page 24: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Normally, this is where we might do a big group brainstorm.

But it turns out those don’t work.

In fact, they hurt more than they help.

(https://en.wikipedia.org/wiki/Brainstorming#Challenges_to_Effective_Brainstorming)

Page 25: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So, we’ll do 3 steps:

1. Take 60 seconds and type as many music-related concepts as you can.

- concepts, not proper nouns (e.g. "band", "composer", "instrument"; not "Radiohead", "Beethoven", "guitar", etc.)

- make them singular ("band", not "bands")
- not sure if it’s music-related? put it anyway.

2. Dump them into a shared Google doc.
3. I’ll lowercase, dedupe, and publish.

Page 26: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Go!

I’ll add a few prompts in case you are getting stuck ...

Page 27: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So, we’ll do 3 steps:

1. Take 60 seconds and type as many music-related concepts as you can.

- concepts, not proper nouns (e.g. not "Radiohead", but "band"; not "Beethoven" but "composer"; not "guitar" but "instrument", etc.)

- make them singular ("band", not "bands")
- not sure if it’s music-related? put it anyway.

2. Dump them into this google doc: http://tiny.cc/unstructure-sxsw14-terms

3. I’ll lowercase, dedupe, and publish here.

Page 28: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Now, your job is to make some sense of this.

1. Break into groups of ~3 neighbors.
   a. Introduce yourselves like civilized human beings.

2. Organize this list however you want to!
   a. Group things together
   b. Indent things
   c. Draw lines in a drawing program
   d. etc.

Page 29: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Go!

We’ll take about 10 minutes for this.

Page 30: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Let’s discuss a few sample things people did. (Goal is to "sample", not for everyone to share! We don’t have all day.)

● How did you sort or group the terms?
● Did you end up with a flat list, or hierarchy?
● Did anything not fit in?
● Any higher level organization of terms?

(Note: there’s no right answer here ... yet.)

Page 31: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Good job. listen.up is off to a great start.

Now it’s time to get into the meat. What is structure? What is data?

Page 32: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This is a story in three parts:

Hierarchy
Relation
Mutation

Page 33: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

These are 3 successive viewpoints that will gradually open our eyes to the deep structure of data.

Page 34: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

“Structure? Isn’t this workshop supposed to be about unstructured data?”

Page 35: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Poppycock. You want to see some real unstructured data?

Page 36: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)
Page 37: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

What most people mean when they say unstructured data is:

“flexibly structured data”

Or possibly:

“data we don’t know the structure of yet”

(We’ll get to both of those; hold your horses.)

Page 38: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

There are lots of boundaries to smash in the world of data.

But we have to learn to structure before we can unstructure.

Image credit: Rodrigo Diaz Aravena

Page 39: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Our minds are very fluid, and can connect concepts in subtle ways.

Our computers aren't. They need concrete instructions to structure & connect data.

Page 40: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

The next three chapters will be a deep dive into how concepts are combined to let us represent the world in computers.

This is usually called “modeling”.

Modeling gets a bad rap.

Page 41: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Models cut away the accidental and leave the essential.

Model != diagram, drawing

Model == Skeleton, Essence, Abstraction

Page 42: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Most of the time, it’s good to remember: “the map is not the territory”.
- Alfred Korzybski

But for the next couple hours, for us, the map is precisely the territory.

We are data cartographers.

Page 44: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Chapter 2: Hierarchy

Page 45: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

We all have a pretty good intuitive understanding of “data”. What’s yours?

Page 46: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

data = plural of datum
datum = Latin for “given”

so ... data is “givens”?
aka “facts”?

Page 47: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

What’s the simplest fact? A bit:

1 / 0
on / off
yes / no
true / false

“Are the lights on in this room?”

Page 48: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

When you want more complex facts, you chunk together simpler ones.

In one dimension, that's a "list":
● Byte = list of bits → 01000010 = 66 = “B”
● Word = list of characters → [B,e,a,t,l,e,s]
● Phrase = list of words → “The Beatles are a band”
● And so forth ...
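To make the chunking concrete, here is a minimal Python sketch (my own illustration, not from the slides) that folds a list of bits into a byte, reads the byte as an ASCII character, and then treats a word and a phrase as lists too.

```python
# Chunk eight bits into a byte, then read the byte as a character.
bits = [0, 1, 0, 0, 0, 0, 1, 0]            # a list of bits

# Fold the bits (most significant first) into one integer.
byte_value = 0
for bit in bits:
    byte_value = byte_value * 2 + bit

letter = chr(byte_value)                   # interpret the number as ASCII
word = list("Beatles")                     # a word is a list of characters
phrase = "The Beatles are a band".split()  # a phrase is a list of words

print(byte_value, letter, word, phrase)
```

Each level is just a list of the level below it, which is all "one dimension" means here.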

Page 49: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Of course, a 1-dimensional list is just one (very simplistic) way to chunk things together.

Page 50: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Photo credit: http://thekingoflimbspart2.com/radiohead-setlists/radiohead-2012-setlists/radiohead-setlist-houston-texas-3032012/

Page 51: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So you just chunk datums together, and you get information, knowledge, wisdom … !?

Photo By Karora (Own work) [Public domain], via Wikimedia Commons

Page 52: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

No. It’s not quite that easy.

f∆ƒ˙73f˚=£ƒ••XMbritneysp3ars-giraffe

is a complex structure, but it lacks something: “meaning”

Page 53: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Getting a little more haughty, we could say that the “givens” of structured data are really pointers to concepts.

Without at least some concept, it’s not data: it’s noise.

Page 54: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But that raises all kinds of questions.

What are concepts?
What can we say about them?

What kinds of concepts are there?
What’s the difference between a concept and the thing it points at?

Who could answer such questions?

Page 55: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

When the epistemological going gets tough, the tough call ...

Aristotle, 384–322 BC

Page 56: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Aristotle wrote a treatise called

The Praedicamenta (The Categories)

It’s not clear exactly what he was categorizing (he didn’t say), but the list stands to this day as a pretty damn sensible way to, well, categorize.

Page 57: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Aristotle’s 10 "categories":

substance Stuff, essence; matter, but also universal concepts

quantity How much? How many?

quality What kind? Of what nature?

relation More, less, double, half, stronger, weaker, etc.

place Where?

time When?

position Being situated on, in, next to, sitting, touching, etc.

having Possession, state like “clothed” or “armed”

causing What did it do, make happen?

being caused What happened to it, what did it undergo?

Page 58: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You don’t have to agree with Aristotle’s categories (you’d be in good company).

(But you’re also unlikely to have a sudden inspiration about it that hasn’t already been the subject of 12 papers and a dissertation.)

Page 59: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But having some version of Aristotle’s list is hard to argue with, experientially. We sort the world into a hierarchy of concepts; everything in its right place.

Page 60: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Without concepts, we don’t have data.

We just have noise.

Page 61: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Of course, you can’t just store a concept on a disk, or send it over a network.

So, not long after we had computing machines, folks set to work figuring out how to map and store our concepts in the unforgiving realm of silicon.

Page 62: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Let’s take a trip back in time.

Photo By NASA Ames Research Center (NASA-ARC) (NIX A-28284) [Public domain], via Wikimedia Commons

Page 63: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

The year is 1966.

NASA is working on Saturn V and Apollo rockets, and they can’t figure out how to store this ginormous bill of materials. They ask:

Could these new “computers” help?

Page 64: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

IBM: “Hey NASA! We made a system for you to manage information!”

NASA: “Groovy! What’s it called?”

IBM: ... “Information Management System.”

Page 65: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Thus began the not-extremely-exciting era of hierarchical databases.
● Data is stored in records, which can have sub-records.
● There's a single strict hierarchical arrangement.
● To access data, you need to know the hierarchy.

Page 66: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

For example:

Show            Band          Time
House Of Vans   Charli XCX    4:15
House Of Vans   Pusha T       5:00

Show            Venue          Date
House Of Vans   The Mohawk     3/13/14
Chaos in Tejas  Iron And Lace  3/14/14

Band         Song        Order
Charli XCX   You         1
Charli XCX   Super Love  2

To get to the set list, you have to navigate through the show, to the band, to the song.

And, you only get to choose one hierarchy to store things in.
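A rough Python sketch of what "one hierarchy" means in practice. Nested dicts stand in for hierarchical records here; this is an analogy, not how IMS actually works, and the function names and the empty Pusha T set list are placeholders of mine.

```python
# A toy hierarchical store: show -> band -> ordered songs.
# The nesting order is fixed when the data is written.
shows = {
    "House Of Vans": {
        "venue": "The Mohawk",
        "date": "3/13/14",
        "bands": {
            "Charli XCX": {"time": "4:15", "songs": ["You", "Super Love"]},
            "Pusha T": {"time": "5:00", "songs": []},  # set list not given
        },
    },
}

def set_list(show, band):
    """Walk the one available path: show, then band, then songs."""
    return shows[show]["bands"][band]["songs"]

def shows_for_band(band):
    """Going against the hierarchy means scanning every show."""
    return [name for name, show in shows.items() if band in show["bands"]]

print(set_list("House Of Vans", "Charli XCX"))
print(shows_for_band("Pusha T"))
```

The first lookup follows the grain of the hierarchy and is direct; the second goes against it and has to look at everything.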

Page 67: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

That sounds arbitrarily restrictive. Why did they make it like that?

Page 68: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

To understand, we have to talk about dimensions.

Page 69: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

We already talked about zero dimensions:

point = bit = on/off = true/false

And about one dimension:

line = list = array

Page 70: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

It’s pretty straightforward to see how you’d represent these, with a series of zeros and ones.

What about 2 dimensions?

Page 71: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

It’s a plane! (aka table, grid, matrix, spreadsheet, etc)

Page 72: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Tabular data is everywhere.

Page 73: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)
Page 74: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You're certainly familiar with the world's most ubiquitous 2-dimensional data tool ...

Page 75: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Excel!

X dimension
Y dimension

Photo credit: http://decentralist.wordpress.com/2012/10/01/libreoffice-vs-openoffice-not-always-simple/

Page 77: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Spreadsheets are totally flexible.

This is a blessing and a curse.

Page 78: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

They can be used well ...

Page 79: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Or poorly ...

(Fun read: http://www.epmchannel.com/2013/02/22/is-excel-the-most-dangerous-piece-of-software-in-the-world/)

Page 80: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Or awesomely ...

Image credit: http://gadgetose.com/excel-stop-motion-music-video/

Page 81: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But you get the point. 2-dimensional data is everywhere.

Page 82: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

2-dimensional data doesn’t always look like a table or grid.

We just mean that it’s “conceptually planar”: two axes, each w/ a set coordinate system.

(Here, rows = “web results”, and columns = “link name”, “url”, “description”, and “image”. No link has two URLs, for example.)

Page 83: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

However! We are now faced with a choice, because we can still only actually store things in linear form (a single stream of bits).

So do we put rows inside columns, or columns inside rows?

Page 84: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

To linearize the two dimensions in a table, I can either ...

Page 85: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

<table>
  <row> <col>Creep</col> <col>1993</col> <col>Pablo Honey</col> </row>
  <row> <col>No Surprises</col> <col>1997</col> <col>OK Computer</col> </row>
  <row> <col>Lucky</col> <col>1997</col> <col>OK Computer</col> </row>
  <row> <col>Karma Police</col> ... </row>
</table>

Go row-wise ...

Page 86: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

<table>
  <col> <row>Creep</row> <row>No Surprises</row> <row>Lucky</row> <row>Karma Police</row> <row>Fake Plastic Trees</row> </col>
  <col> <row>1993</row> <row>1997</row> <row>1997</row> <row>1997</row> <row>1995</row> </col>
  <col> <row>Pablo Honey</row> ... </col>
</table>

Or column-wise ...

Page 87: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

<table>
  <row> <col>Creep</col> <col>1993</col> <col>Pablo Honey</col> </row>
  <row> <col>No Surprises</col> <col>1997</col> <col>OK Computer</col> </row>
  <row> <col>Lucky</col> <col>1997</col> <col>OK Computer</col> </row>
  <row> <col>Karma Police</col> ... </row>
</table>

But I can’t have it both ways.

<table>
  <col> <row>Creep</row> <row>No Surprises</row> <row>Lucky</row> <row>Karma Police</row> <row>Fake Plastic Trees</row> </col>
  <col> <row>1993</row> <row>1997</row> <row>1997</row> <row>1997</row> <row>1995</row> </col>
  <col> <row>Pablo Honey</row> ... </col>
</table>

? (Unless I store it twice.)

Page 88: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Because we read left-to-right and top-to-bottom, most of our systems store tables that way too (row-wise).

But it’s not mandatory, of course.

Page 89: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Whichever way you choose, you can’t get around the fact that you have to choose an ordering of dimensions.
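Here is a small Python sketch of that choice, using three of the songs from the earlier slides. The variable names are mine; the point is only that the same table flattens into two different linear orders.

```python
table = [
    ["Creep", 1993, "Pablo Honey"],
    ["No Surprises", 1997, "OK Computer"],
    ["Lucky", 1997, "OK Computer"],
]

# Row-wise: finish one row before starting the next.
row_major = [cell for row in table for cell in row]

# Column-wise: finish one column before starting the next.
column_major = [cell for column in zip(*table) for cell in column]

print(row_major)
print(column_major)
```

Both lists hold the same nine cells. What differs is which neighbors end up next to each other in storage.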

Page 90: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

That makes sense for 2 dimensions.

But what about … 3+ dimensions?

Page 91: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Same thing.

Page 92: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Computer science has had the idea of multi-dimensional arrays since the beginning.

It's a straightforward extension to a table, conceptually. It's very hard to visualize more than 3 unless you’re on dope*.

* - This is a verrrry funny joke because multidimensional arrays use locators called dope vectors. Ha ha ha hmm.

Page 93: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But you’re still making it linear when you store it.
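A minimal sketch of that linearization, assuming row-major order (the last dimension varies fastest). The function name and the example shape are made up for illustration.

```python
def flat_offset(index, shape):
    """Row-major position of an N-dimensional index in a flat list."""
    offset = 0
    for i, size in zip(index, shape):
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of range for size {size}")
        # Shift everything so far by this dimension's size, then add.
        offset = offset * size + i
    return offset

shape = (2, 3, 4)  # e.g. 2 venues x 3 shows x 4 bands (made-up sizes)
print(flat_offset((0, 0, 0), shape))
print(flat_offset((1, 2, 3), shape))
```

Swapping the order of the entries in `shape` (and in each index) gives a different, equally valid layout, which is the same ordering choice as before, just with more dimensions.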

Page 94: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Imagine storing the venues, shows, bands, and songs in one file.

That’s 4 dimensions.

(Each venue has many shows; each show has many bands; each band has many songs; etc.)

Page 95: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

By venue → show → band → song:

Venue: The Mohawk
  Show: Vans Day Party, 3/13/14
    Band: Eagulls (12:30 p.m.)
      Song: Nerve Endings
      Song: Tough Luck
    Band: DJ Rashad (1:15 p.m.)
      Song: Holiday
      Song: I Can Feel It
    Band: Kelela (2:15 p.m.)
      Song: ...
    Band: Charli XCX (3:15 p.m.)
      Song: ...
    Band: Dum Dum Girls (4:15 p.m.)
      Song: ...
    Band: Pusha T (5:00 p.m.)
  Show:
Venue: ...
  Show: ...
    Band: ...
      Song: ...
etc.

By song → band → show → venue:

Song: Nerve Endings
  Band: Eagulls
    Show: Vans Day Party, 3/13/14
      Venue: The Mohawk
Song: Nerve Endings
  Band: Eagulls
    Show: Vans Day Party, 3/13/14
      Venue: The Mohawk
    Show: Official Showcase, 3/14/14
      Venue: The Mohawk
Song: Holiday
  Band: DJ Rashad
    Show: Vans Day Party, 3/13/14
      Venue: The Mohawk
  Band: Cattle Decapitation
    Show: Chaos In Tejas, 3/10/14
      Venue: Iron And Lace
Song: ...
  Band ...
    Show: ...
etc.

Page 96: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

No matter how many dimensions, storing the data still requires that you pick a single primary orientation.

So Big Blue’s design choice makes a little more sense now, right?
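To see what switching orientation costs, here is a hedged Python sketch that re-nests a venue-first structure into a song-first one. Nested dicts stand in for the file, and `pivot_to_songs` is a name I made up; the data is a subset of the example above.

```python
by_venue = {
    "The Mohawk": {
        "Vans Day Party, 3/13/14": {
            "Eagulls": ["Nerve Endings", "Tough Luck"],
            "DJ Rashad": ["Holiday", "I Can Feel It"],
        },
    },
}

def pivot_to_songs(by_venue):
    """Re-nest venue -> show -> band -> song as song -> band -> show -> venue."""
    by_song = {}
    for venue, shows in by_venue.items():
        for show, bands in shows.items():
            for band, songs in bands.items():
                for song in songs:
                    by_song.setdefault(song, {}).setdefault(band, {})[show] = venue
    return by_song

by_song = pivot_to_songs(by_venue)
print(by_song["Holiday"])
```

Getting the other orientation means walking every record and writing the whole thing out again, so it is effectively a second copy of the data.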

Page 97: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

BTW, if you've used an ATM recently … you're an IMS user.

(It’s not as obsolete as it sounds.)

Page 98: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

How do you actually store the linearized data?

You put it in a format.

Page 99: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Creep•••••••••••••1993Pablo•Honey
No Surprises••••••1997OK•Computer
Lucky•••••••••••••1997OK•Computer
Karma Police••••••1997OK•Computer
Fake Plastic Trees1995The•Bends••

Fixed-width files were all the rage in the 1960s.

Row delimiter is a line break; column delimiter is a pre-set agreement about how many characters are in each column.

This is wasteful, brittle, and hard to read.
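A sketch of reading such a file in Python. The widths (18, 4, 11) are inferred from the sample rows on this slide, and the bullet character stands for padding as the slide draws it; both are assumptions about this example rather than any standard.

```python
# Column widths are the "pre-set agreement": 18 + 4 + 11 characters.
WIDTHS = [("song", 18), ("year", 4), ("album", 11)]
PAD = "\u2022"  # the slide draws padding as a bullet character

def parse_fixed_width(line):
    record, start = {}, 0
    for name, width in WIDTHS:
        # Slice by position, then strip the padding bullets.
        raw = line[start:start + width]
        record[name] = raw.replace(PAD, " ").strip()
        start += width
    return record

line = "Creep" + PAD * 13 + "1993" + "Pablo" + PAD + "Honey"
record = parse_fixed_width(line)
print(record)
```

The brittleness is visible in `WIDTHS`: a song title longer than 18 characters breaks every reader that shares the agreement.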

Page 100: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

"Creep","1993","Pablo Honey"
"No Surprises","1997","OK Computer"
"Lucky","1997","OK Computer"
"Karma Police","1997","OK Computer"
"Fake Plastic Trees","1995","The Bends"

Delimited files (e.g. CSV, comma separated values):

Row delimiter is still a line break; column delimiter is variable (a comma, in this case). Optionally, also “qualifiers” (quotes, here).

This is a decent format (but Microsoft’s version really screwed things up for everyone).
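In Python, the standard library's `csv` module handles both the delimiter and the quote qualifiers. A minimal sketch follows; the "Hello, Goodbye" row is an extra example of mine, not from the slides, added to show why qualifiers matter.

```python
import csv
import io

text = '"Creep","1993","Pablo Honey"\n"No Surprises","1997","OK Computer"\n'

# csv.reader handles the delimiter and the quote "qualifiers" for us.
rows = list(csv.reader(io.StringIO(text)))

# A comma inside quotes stays part of the value.
tricky = next(csv.reader(io.StringIO('"Hello, Goodbye","1967"\n')))

print(rows)
print(tricky)
```

Note that every value comes back as a string: the format carries no types, so "1993" is text until you decide otherwise.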

Page 101: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

<row> <col>Creep</col> <col>1993</col> <col>Pablo Honey</col> </row>
<row> <col>No Surprises</col> <col>1997</col> <col>OK Computer</col> </row>
<row> <col>Lucky</col> <col>1997</col> <col>OK Computer</col> </row>

There’s also markup (e.g. HTML)

“Tags” (<tag></tag>) give you the start and end of rows, and the start and end of columns within those rows.

SGML, HTML, XML, all follow this approach.

Page 102: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Song: Creep
Year: 1993
Album: Pablo Honey
%
Song: No Surprises
Album: OK Computer
Year: 1997
%
Year: 1997
Song: Lucky
Album: OK Computer
...

Or even YAML (“YAML Ain’t Markup Language”)

(Like email headers.)

Column pointers (names) are inline with the values; rows have many lines, and are delimited by another character (e.g. “%”).

This is obviously more flexible, but still inherently hierarchical.
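A hand-rolled Python sketch of reading this kind of record format. It is not a real YAML parser; it assumes the "%" separator sits on its own line, and `parse_records` is a name I made up.

```python
text = """Song: Creep
Year: 1993
Album: Pablo Honey
%
Song: No Surprises
Album: OK Computer
Year: 1997
"""

def parse_records(text, separator="%"):
    """Split on separator lines, then split each 'Name: value' line."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == separator:
            records.append(current)
            current = {}
        elif line:
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records

records = parse_records(text)
print(records)
```

Because each value carries its own name, the fields can arrive in any order (as in the second record) and a record can leave fields out.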

Page 103: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

And there are a couple of modern technologies that are hierarchical all the way down.

Page 104: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

In practice, the only form of markup people use for storing data is XML.

And the most prevalent subset of YAML people use for storing data is JSON.

(Technically JSON isn’t a subset of YAML but you shut up.)

Page 105: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

XML was the early obvious choice, because we were used to HTML, so we all "got" it.

<?xml version="1.0"?>
<venue name="The Mohawk">
  <show title="Vans Day Party, 3/13/14">
    <band name="Eagulls" time="12:30 p.m.">
      <song order="1">Nerve Endings</song>
      <song title="Tough Luck" order="2" />
    </band>
    <band name="DJ Rashad" time="1:15 p.m.">
      <song title="Holiday" order="1" />
      <song title="I Can Feel It" order="2" />
    </band>
  </show>
</venue>

Page 106: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But XML isn’t very human-friendly.
● It’s verbose
● The wrong things grab your eye
● It’s somewhat complicated to parse
● Distinction between attributes and tag contents is confusing.

“XML is crap. Really. There are no excuses. XML is nasty to parse for humans, and it's a disaster to parse even for computers. There's just no reason for that horrible crap to exist.” - Linus Torvalds, Yesterday (2014-03-06), on Google+
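The attribute-versus-contents point shows up as soon as you try to read the data. A small Python sketch using the standard `xml.etree.ElementTree` module, on a fragment of the venue document; the `song_title` helper is my own workaround, not a convention of XML.

```python
import xml.etree.ElementTree as ET

doc = """<band name="Eagulls" time="12:30 p.m.">
  <song order="1">Nerve Endings</song>
  <song title="Tough Luck" order="2" />
</band>"""

def song_title(song):
    """The title may live in an attribute or in the tag contents."""
    return song.get("title") or (song.text or "").strip()

band = ET.fromstring(doc)
titles = [song_title(song) for song in band.findall("song")]
print(titles)
```

Both `<song>` elements are valid XML, but a reader has to check two places for the same fact unless everyone agrees on one style up front.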

Page 107: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Now JSON is winning ...

{
  "venue": {
    "name": "The Mohawk",
    "show" : {
      "title": "Vans Day Party, 3/13/14",
      "bands": [{
        "name": "Eagulls",
        "time": "12:30 pm",
        "songs": [
          {"title": "Nerve Endings", "order": 1},
          {"title": "Tough Luck", "order": 2}
        ]}, {
        "name": "DJ Rashad",
        "time": "1:15 pm",
        "songs": [
          {"title": "Holiday", "order": 1},
          {"title": "I Can Feel It", "order": 2}
        ]}
      ]
    }
  }
}

Page 108: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

JSON:
● Where XML looks like a mass of text, JSON looks spacious (little clutter)
● Fast to parse, for humans and computers
● Self-describing, flexible format
● Extremely simple syntax (one page)
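For comparison, a minimal Python sketch of reading a trimmed copy of the venue document with the standard `json` module. The trimming (one band only) is mine, to keep the example short.

```python
import json

text = """
{"venue": {"name": "The Mohawk",
           "show": {"title": "Vans Day Party, 3/13/14",
                    "bands": [{"name": "Eagulls", "time": "12:30 pm",
                               "songs": [{"title": "Nerve Endings", "order": 1},
                                         {"title": "Tough Luck", "order": 2}]}]}}}
"""

data = json.loads(text)  # objects become dicts, arrays become lists

# Walking the result still follows the one hierarchy the file chose.
first_band = data["venue"]["show"]["bands"][0]
titles = [song["title"] for song in first_band["songs"]]
print(titles)
```

Parsing is one call and the result is plain dicts and lists, but the access path is still venue, then show, then band, then song.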

Page 109: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

from http://www.json.org/

Page 110: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

One more thing: so far, we’ve talked about a certain kind of hierarchical relationship: containment.

But there’s another kind worth mentioning: generalization.

Page 111: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)
Page 112: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)
Page 113: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This kind of relationship is common in programming (it’s called superclassing). But it’s uncommon (at least, explicitly) in database systems.

We’ll come back to it later.
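In code, generalization usually looks like subclassing. A small Python sketch follows; the class names are hypothetical and not taken from the slides.

```python
class Artist:
    """General concept: anything that can be billed on a show."""
    def __init__(self, name):
        self.name = name

    def billing(self):
        return self.name

class Band(Artist):
    """A more specific kind of Artist, with members."""
    def __init__(self, name, members):
        super().__init__(name)
        self.members = members

class SoloArtist(Artist):
    """Another specific kind of Artist."""

# Member names are omitted here; the list is a placeholder.
lineup = [Band("Eagulls", members=[]), SoloArtist("Pusha T")]
names = [artist.billing() for artist in lineup]
print(names)
```

A Band is not contained in an Artist; it is a kind of Artist, and that is the difference between generalization and containment.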

Page 114: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

You’re doing some corporate espionage for listen.up. You notice that do512.com seems to have a good way to organize data in their API. What can you steal, er, learn?

Note: the guys at do512 are friends of mine and I am in no way encouraging anyone to perform any actual corporate espionage, no matter how cool that sounds. Listen.up is a made-up company, do512’s JSON API is open, and this is an exercise for learning; no stealing anything. :)

Page 115: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

● Make sure you’ve got the JSONView extension in Chrome
● Go to: http://2014.do512.com/events.json
● Explore the hierarchical data that comes up
● Try “-” to collapse all, click “+” signs to unfold sections
● Also try:
  ○ http://2014.do512.com/venues.json
  ○ http://2014.do512.com/artists.json

Page 116: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

● What’s the hierarchical structure of this data?
● In the same way we talked about nesting bands, venues, shows, etc, … what are the objects being nested here?
● Would you store it differently?
● Are there any superclass / subclass relationships?

Page 117: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Questions?

Page 118: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Chapter 3: Relations

Page 119: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So, hierarchy! Pretty great, right?

Page 120: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Actually, no; it’s quite problematic if you use it as the method for storing data.
● Lots of stuff isn't naturally hierarchical.
● You can't change the organization without changing all the code that accesses data.
● Above 3 dimensions, the number of possible access paths goes up dramatically! (Exponentially, in fact.)
  ○ The academic literature of the 60s and 70s is full of papers describing how to do this better or faster.
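One way to make that growth concrete, assuming an "access path" means one nesting order of the dimensions (my reading; the slide doesn't spell it out): n dimensions can be nested in n! different orders, which grows at least as fast as an exponential.

```python
from itertools import permutations
from math import factorial

dimensions = ["venue", "show", "band", "song"]

# Each ordering is one way to nest the same data.
orderings = list(permutations(dimensions))
print(len(orderings), "ways to nest", len(dimensions), "dimensions")

for n in range(1, 7):
    print(n, "dimensions ->", factorial(n), "possible nestings")
```

With four dimensions there are already 24 possible hierarchies, and a hierarchical store makes you pick exactly one.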

Page 121: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But, what are you gonna do? Computers are just boring & hierarchical, so you’d better learn to deal with it.

Page 122: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Meanwhile, outside the offices of IBM, the revolution of the ‘60s was happening.

Page 123: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hierarchy, especially a single universal hierarchy, was seeming less and less desirable, and less and less feasible.

Page 124: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Reconciliation seemed impossible.

And then ...

Page 125: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

something wonderful happened

THE MONOLITH IN KUBRICK'S 2001: A SPACE ODYSSEY (1968)

Page 126: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Specifically, the math nerds beat the business jocks.

Page 127: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This is Dr. Edgar F. Codd (1923-2003)

He worked for IBM in the 1960s, and couldn’t stand the thought of our rich, interconnected world being subjugated to storage in hierarchical databases.

So he came up with a radical theory.

Image from Wikipedia: http://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg

Page 128: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Two intellectual ancestors:
● Set Theory
● Graph Theory

Page 129: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Set Theory:

Sets are collections of objects. You can precisely describe operations on sets:
● Union
● Intersection
● Difference
● Cartesian Product
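To make those four operations concrete, here is a small sketch using Python's built-in sets. The band line-ups are made up for illustration; only the operations themselves come from the slide.

```python
from itertools import product

# Hypothetical line-ups at two venues (illustrative data only).
mohawk = {"Radiohead", "Eagulls", "CRUD"}
beerland = {"Eagulls", "Metalface"}

union = mohawk | beerland          # bands playing at either venue
intersection = mohawk & beerland   # bands playing at both venues
difference = mohawk - beerland     # bands playing only at the Mohawk

# Cartesian product: every possible (band, venue) pairing.
venues = {"The Mohawk", "Beerland"}
pairs = set(product(union, venues))
```

The Cartesian product is the one that matters most later on: a JOIN is, conceptually, a filtered Cartesian product of two sets of rows.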

Page 130: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Graph Theory:

Graphs are collections of nodes, connected by edges.

Not this:

Page 131: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Graph Theory:

Graphs are collections of nodes, connected by edges.

This: Think: a social network where the nodes are people and the edges are friend relationships.

Page 132: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Codd’s genius was combining these and proposing a declarative (rather than imperative) access model.

● The relational model is a graph of sets
● Relations (tables) are sets of tuples (rows).
● Some attributes (columns) are edges that let you connect the sets in interesting ways.
● You never specify “how” to get to data, just “what” data to get, based on sets.
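The "how" versus "what" distinction is easier to see side by side. This is a rough sketch, assuming Python's built-in sqlite3 module as a stand-in relational engine; the venue rows are borrowed from later slides and the variable names are invented here.

```python
import sqlite3

# Venue rows borrowed from the slides (name, age requirement).
venues = [("The Mohawk", "21+"), ("Beerland", "All Ages")]

# Imperative: spell out *how* to walk the data and what to check at each step.
all_ages_imperative = []
for name, age_req in venues:
    if age_req == "All Ages":
        all_ages_imperative.append(name)

# Declarative: load the same rows into a relation and state *what* you want.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venue (name TEXT PRIMARY KEY, age_req TEXT)")
conn.executemany("INSERT INTO venue VALUES (?, ?)", venues)
all_ages_declarative = [
    row[0]
    for row in conn.execute("SELECT name FROM venue WHERE age_req = 'All Ages'")
]
```

Both give the same answer. The difference is that the second version leaves the access path up to the database.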

Page 133: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

When he proposed this as a way to actually store data in 1969,

people thought he was from Mars.

Everyone said “It’ll never work, computers are too slow.”

His employer, IBM, said “Thanks but no thanks; we’ll just keep selling IMS.”

Page 134: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But the haters didn’t bank on two things.

Page 135: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

1: Moore’s Law

That graph is exponential, not linear.

We can have plenty of CPU.

Page 136: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

2: Programmer Time

As programmer time became more valuable than computer time, ease of representing the problem domain became a dominating factor.

Page 137: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Due to some tomfoolery at IBM, Codd’s “Alpha” never made it.

But another group at IBM created a quasi-relational version called SEQUEL, which looked kind of like COBOL.

Then in 1979, Larry Ellison’s company copied the design to create Oracle, the first commercial database built on SQL (the name had been shortened from SEQUEL, which was trademarked). And the rest is history.

Page 138: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

For his part, Codd waged a decades-long battle to get a more true representation of the relational model adopted.

But, it never was. SQL was king.

Page 139: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So what is the relational model, then?

Page 140: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

First, one quick PSA ...

Page 141: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Occupational Hazard:

Abstraction Vertigo

The Structure of Metadata: Concepts → Entities, Attributes

Metadata: Entity: Band; Attributes: Name, Year Formed

Concrete Data: Band name: Radiohead; Year formed: 1985

Page 142: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Safety first.

If you feel dizzy, just ask a question.

Page 143: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

There are 3 foundational concepts:

● Entity
● Attribute
● Relationship

Page 144: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Entity (Attribute, Attribute) --[relationship]--> Entity (Attribute, Attribute)

Page 145: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Show (Start time, Cover $) --[is at]--> Venue (Address, Age Req.)

Page 146: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Vans Day Party (12pm, Free) --[is at]--> The Mohawk (123 Red River, All Ages)

Page 147: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Notice the subtle shift here.

Neither of those entities is “inside” the other. They’re both “first class” entities,

and they’re in a relationship.

The relationships are described at the level of sets, not ad hoc. Shows can be at venues, categorically.

Page 148: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Entities look pretty much exactly like 2-dimensional tables, except for the

concept of a “key”.

That’s the attribute (or set of attributes) that distinguishes this row from that row.

Page 149: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

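Show (Start time, Cover $) --[is at]--> Venue (Address, Age Req.)

 Show (key) | Start time | Cover $
------------+------------+---------
 1234       | 8pm        | $5
 5678       | 11pm       | $9

 Venue (key) | Address      | Age Req.
-------------+--------------+----------
 The Mohawk  | 123 Red Riv. | 21+
 Beerland    | 456 Red Riv. | All Ages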

Page 150: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Keys describe an entity’s identity.

In practice, most systems today use surrogate keys (i.e. IDs) to establish unambiguous identity.

eg: integers (123456), codes (X74-UUA2), GUIDs
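Here is a rough sketch of what a surrogate key buys you, assuming Python's built-in sqlite3 and uuid modules. The table layout and the second "Mohawk" row are invented for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE venue (venue_id TEXT PRIMARY KEY, name TEXT, address TEXT)"
)

# A GUID used as a surrogate key: it carries no meaning, only identity.
mohawk_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO venue VALUES (?, ?, ?)", (mohawk_id, "The Mohawk", "123 Red River")
)

# Reusing the same key is rejected: the key *is* the row's identity.
try:
    conn.execute(
        "INSERT INTO venue VALUES (?, ?, ?)", (mohawk_id, "Impostor", "nowhere")
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Two venues that share a name are fine, as long as their keys differ.
conn.execute(
    "INSERT INTO venue VALUES (?, ?, ?)",
    (str(uuid.uuid4()), "The Mohawk", "985 Congress"),
)
venue_count = conn.execute("SELECT COUNT(*) FROM venue").fetchone()[0]
```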

Page 151: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

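Show (Start time, Cover $) --[is at]--> Venue (Address, Age Req.)

 Show | Is At      | Start time | Cover $
------+------------+------------+---------
 1234 | The Mohawk | 8pm        | $5
 5678 | Stubb’s    | 11pm       | $9

 Venue      | Address      | Age Req.
------------+--------------+----------
 The Mohawk | 123 Red Riv. | 21+
 Beerland   | 456 Red Riv. | All Ages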

Page 152: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Then, to get connected information out, you traverse the relationships with something called a JOIN.

Page 154: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

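 Show | Is At      | Start time | Cover $
------+------------+------------+---------
 1234 | The Mohawk | 8pm        | $5
 5678 | Stubb’s    | 11pm       | $9

+

 Venue      | Address      | Age Req.
------------+--------------+----------
 The Mohawk | 123 Red Riv. | 21+
 Beerland   | 456 Red Riv. | All Ages

=

 Show | Is At      | Address      | Age Req. | Start time | Cover $
------+------------+--------------+----------+------------+---------
 1234 | The Mohawk | 123 Red Riv. | 21+      | 8pm        | $5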
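If you want to try that join yourself, here is a minimal sketch using Python's built-in sqlite3 module (an assumption; the hands-on later uses Postgres). The column names `show_id`, `is_at`, `start_time` and `cover` are invented spellings of the slide's headers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE venue (name TEXT PRIMARY KEY, address TEXT, age_req TEXT);
CREATE TABLE show (show_id INTEGER PRIMARY KEY, is_at TEXT, start_time TEXT, cover TEXT);
INSERT INTO venue VALUES ('The Mohawk', '123 Red Riv.', '21+');
INSERT INTO venue VALUES ('Beerland', '456 Red Riv.', 'All Ages');
INSERT INTO show VALUES (1234, 'The Mohawk', '8pm', '$5');
INSERT INTO show VALUES (5678, 'Stubb''s', '11pm', '$9');
""")

# INNER JOIN keeps only the shows whose venue appears in the venue table.
joined = conn.execute("""
    SELECT s.show_id, s.is_at, v.address, v.age_req, s.start_time, s.cover
    FROM show s
    INNER JOIN venue v ON v.name = s.is_at
""").fetchall()
```

Show 5678 drops out of the result because Stubb’s has no row in the venue table, which is why the joined table on the slide has only one row.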

Page 155: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Tuple is a fancy word for “row”.

(It comes from abstracting … septuple, octuple, N-tuple …)

Page 156: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Cardinality means, how many tuples of one relation can match each tuple in

another relation?

(In English: a show is at one venue, but a show can have many bands.)

Page 157: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

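Show (Start time, Cover $) --[played by]--> Band (Name, Genre)

 Show | Start time | Cover $
------+------------+---------
 1234 | 8pm        | $5
 5678 | 11pm       | $9

 Band ID | Name      | Genre
---------+-----------+------------
 RDOHD   | Radiohead | Rock
 EGLS    | Eagulls   | Indie Rock

many to many?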

Page 158: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

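Show (Start time, Cover $) --[played by]--> Band (Name, Genre)

 Show ID | Start time | Cover $
---------+------------+---------
 1234    | 8pm        | $5
 5678    | 11pm       | $9

 Band ID | Name      | Genre
---------+-----------+------------
 RDOHD   | Radiohead | Rock
 EGLS    | Eagulls   | Indie Rock

 Show ID | Band ID | Start Time
---------+---------+------------
 1234    | RDOHD   | 9:30pm
 5678    | EGLS    | 9pm
 5678    | RDOHD   | 11pm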
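That third table is the trick for many-to-many: it holds one row per (show, band) pair. A minimal sketch, again assuming Python's sqlite3 module; the table name `show_band` is made up here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE show (show_id INTEGER PRIMARY KEY, start_time TEXT, cover TEXT);
CREATE TABLE band (band_id TEXT PRIMARY KEY, name TEXT, genre TEXT);
-- The intersection table: its key is the *pair* of IDs.
CREATE TABLE show_band (
    show_id INTEGER REFERENCES show(show_id),
    band_id TEXT REFERENCES band(band_id),
    start_time TEXT,
    PRIMARY KEY (show_id, band_id)
);
INSERT INTO show VALUES (1234, '8pm', '$5'), (5678, '11pm', '$9');
INSERT INTO band VALUES ('RDOHD', 'Radiohead', 'Rock'), ('EGLS', 'Eagulls', 'Indie Rock');
INSERT INTO show_band VALUES
    (1234, 'RDOHD', '9:30pm'), (5678, 'EGLS', '9pm'), (5678, 'RDOHD', '11pm');
""")

# One show, many bands ...
bands_at_5678 = [r[0] for r in conn.execute("""
    SELECT b.name FROM show_band sb
    INNER JOIN band b ON b.band_id = sb.band_id
    WHERE sb.show_id = 5678 ORDER BY b.name
""")]

# ... and one band, many shows.
shows_for_radiohead = [r[0] for r in conn.execute(
    "SELECT show_id FROM show_band WHERE band_id = 'RDOHD' ORDER BY show_id"
)]
```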

Page 159: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Normalization is the idea that if something repeats (i.e. has a cardinality of more than 1), it should be expressed as another entity, not as repeating data.

Page 160: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

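No:

 Show                     | Date | Venue  | Street        | City   | State
--------------------------+------+--------+---------------+--------+-------
 Austinist Party! - Day 1 | 3/13 | Mohawk | 123 Red River | Austin | TX
 Austinist Party! - Day 2 | 3/14 | Mohawk | 123 Red River | Austin | TX
 Austinist Party! - Day 3 | 3/15 | Mohawk | 123 Red River | Austin | TX

Yes:

 Show                     | Date | Venue
--------------------------+------+--------
 Austinist Party! - Day 1 | 3/13 | Mohawk
 Austinist Party! - Day 2 | 3/14 | Mohawk
 Austinist Party! - Day 3 | 3/15 | Mohawk

 Venue  | Street        | City   | State
--------+---------------+--------+-------
 Mohawk | 123 Red River | Austin | TX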

Page 161: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Basically, normalization dictates that the same fact is never repeated

in more than one place.

Page 162: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

 Show                     | Date | Venue  | Street        | City   | State
--------------------------+------+--------+---------------+--------+-------
 Austinist Party! - Day 1 | 3/13 | Mohawk | 123 Red River | Austin | TX
 Austinist Party! - Day 2 | 3/14 | Mohawk | 123 Red River | Austin | TX
 Austinist Party! - Day 3 | 3/15 | Mohawk | 123 Red River | Austin | TX

There’s nothing wrong with this data per se, but we all know that the final 4 columns refer to the same thing.

Page 163: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

 Show                     | Date | Venue  | Street        | City   | State
--------------------------+------+--------+---------------+--------+-------
 Austinist Party! - Day 1 | 3/13 | Mohawk | 123 Red River | Austin | TX
 Austinist Party! - Day 2 | 3/14 | Mohawk | 985 Congress  | Austin | TX
 Austinist Party! - Day 3 | 3/15 | Mohawk | 123 Red River | Austin | TX

There’s nothing wrong with this data per se, but we all know that the final 4 columns refer to the same thing.

What would it mean if the street address were different in one of these? Are there two Mohawks?

Page 164: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Normalization gives you consistency, also known as relational integrity.

Certain kinds of problems (like that one) just can’t happen, because there’s

literally only one place where you store the address of the Mohawk.
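The "two Mohawks" problem can be reproduced in a few lines. This is a sketch with plain Python dictionaries standing in for tables, not anything from the workshop repo; the data is trimmed from the slides.

```python
# Denormalized: the venue's address is repeated on every show row.
shows_flat = [
    {"show": "Austinist Party! - Day 1", "venue": "Mohawk", "street": "123 Red River"},
    {"show": "Austinist Party! - Day 2", "venue": "Mohawk", "street": "123 Red River"},
]
# Somebody edits one copy and forgets the other ...
shows_flat[1]["street"] = "985 Congress"
flat_streets = {row["street"] for row in shows_flat if row["venue"] == "Mohawk"}

# Normalized: shows point at the venue; the address lives in exactly one place.
venues = {"Mohawk": {"street": "123 Red River", "city": "Austin", "state": "TX"}}
shows = [
    {"show": "Austinist Party! - Day 1", "venue": "Mohawk"},
    {"show": "Austinist Party! - Day 2", "venue": "Mohawk"},
]
venues["Mohawk"]["street"] = "985 Congress"
normalized_streets = {venues[s["venue"]]["street"] for s in shows}
```

In the flat version the Mohawk now has two addresses; in the normalized version every show sees the same single edit.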

Page 165: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

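 Show                     | Date | Venue
--------------------------+------+--------
 Austinist Party! - Day 1 | 3/13 | Mohawk
 Austinist Party! - Day 2 | 3/14 | Mohawk
 Austinist Party! - Day 3 | 3/15 | Mohawk

 Venue  | Street        | City   | State
--------+---------------+--------+-------
 Mohawk | 123 Red River | Austin | TX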

Page 166: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Normalization also means that the attributes of an entity are non-repeating.

(i.e. there are no repeating columns, or groups of columns)

Page 167: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

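No:

 Show           | Start time | Cover $ | Band #1   | Band #2      | Band #3
----------------+------------+---------+-----------+--------------+--------------
 House of Vans  | 8pm        | $5      | Eagulls   | Cyndi Lauper | Radiohead
 Chaos in Tejas | 11am       | $10     | Metalface | CRUD         | Decapitation

Yes:

 Show           | Start time | Cover $
----------------+------------+---------
 House of Vans  | 8pm        | $5
 Chaos in Tejas | 11am       | $10

 Show           | Band         | Order
----------------+--------------+-------
 House of Vans  | Eagulls      | 1
 House of Vans  | Cyndi Lauper | 2
 House of Vans  | Radiohead    | 3
 Chaos in Tejas | Metalface    | 1
 Chaos in Tejas | CRUD         | 2
 Chaos in Tejas | Decapitation | 3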

Page 168: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This makes sense, because …what if you had 4 bands? Or 400?

Page 169: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Do people break these rules?

All the time!

(We’ll get to valid reasons why they might want to do that.)

Page 170: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

One last thing:

Schema is the structure of the database itself. It is stored as ... data!

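 VenueID | Name   | Address
---------+--------+---------------
 123     | Mohawk | 123 Red River
 456     | Stubbs | 456 Red River

 BandID | Name    | # Members
--------+---------+-----------
 4321   | Eagulls | 4
 8765   | CRUD    | 17

 Table | Column    | Type
-------+-----------+---------
 Venue | VenueID   | ID
 Venue | Name      | String
 Venue | Address   | String
 Band  | BandID    | ID
 Band  | Name      | String
 Band  | # Members | Integer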
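You can see the "schema is data" idea for real by asking a database to describe itself. A small sketch, assuming SQLite (via Python's sqlite3), where the catalog is the `sqlite_master` table plus `PRAGMA table_info`; other databases expose the same thing differently (Postgres, for example, has `information_schema`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venue (venue_id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT, members INTEGER)")

# The list of tables is itself stored in a table.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Build the (table, column, type) listing from the slide by querying the catalog.
schema = []
for table in tables:
    for _cid, column, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        schema.append((table, column, col_type))
```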

Page 171: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

The final piece of the relational database puzzle:

SQL (Structured Query Language)

which is about how you get stuff out of this graph of sets.

Page 172: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

We obviously don’t have time to really learn SQL, but here’s the gist:

Page 173: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

SELECT S.cover, B.band
FROM Shows S
INNER JOIN ShowBands B ON B.show = S.show
WHERE S.start_time > '20:00'   -- i.e. later than 8pm
  AND B.band LIKE '%face%'
ORDER BY B."order" ASC;

Page 174: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

 Show           | Start time | Cover
----------------+------------+-------
 House of Vans  | 8pm        | $5
 Chaos in Tejas | 9pm        | $10

 Show           | Band         | Order
----------------+--------------+-------
 House of Vans  | Eagulls      | 1
 House of Vans  | Cyndi Lauper | 2
 Chaos in Tejas | Metalface    | 1
 Chaos in Tejas | CRUD         | 2

SELECT S.cover, B.band
FROM Shows S
INNER JOIN ShowBands B ON B.show = S.show
WHERE S.start_time > '20:00'   -- i.e. later than 8pm
  AND B.band LIKE '%face%'
ORDER BY B."order" ASC;

Page 177: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

SELECT S.cover, B.band
FROM Shows S
INNER JOIN ShowBands B ON B.show = S.show
WHERE S.start_time > '20:00'   -- i.e. later than 8pm
  AND B.band LIKE '%face%'
ORDER BY B."order" ASC;

Result:

 Cover | Band
-------+-----------
 $10   | Metalface

Page 178: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

SELECT S.cover, B.band
FROM Shows S
INNER JOIN ShowBands B ON B.show = S.show
WHERE S.start_time > '20:00'   -- i.e. later than 8pm
  AND B.band LIKE '%face%'
ORDER BY B."order" ASC;

Result:

 Cover | Band      | Order
-------+-----------+-------
 $10   | Metalface | 1
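Here is a runnable take on that walkthrough, assuming Python's built-in sqlite3 module. Two liberties are taken and both are assumptions: start times are stored as 24-hour strings so that "later than 8pm" becomes a plain comparison, and the `order` column is quoted because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shows (show TEXT PRIMARY KEY, start_time TEXT, cover TEXT);
CREATE TABLE show_bands (show TEXT REFERENCES shows(show), band TEXT, "order" INTEGER);
INSERT INTO shows VALUES
    ('House of Vans', '20:00', '$5'),
    ('Chaos in Tejas', '21:00', '$10');
INSERT INTO show_bands VALUES
    ('House of Vans', 'Eagulls', 1),
    ('House of Vans', 'Cyndi Lauper', 2),
    ('Chaos in Tejas', 'Metalface', 1),
    ('Chaos in Tejas', 'CRUD', 2);
""")

# Join the two sets, filter the combined rows, then sort what is left.
result = conn.execute("""
    SELECT s.cover, b.band
    FROM shows s
    INNER JOIN show_bands b ON b.show = s.show
    WHERE s.start_time > '20:00' AND b.band LIKE '%face%'
    ORDER BY b."order" ASC
""").fetchall()
```

House of Vans starts at exactly 8pm, so it fails the "later than" test; of the two Chaos in Tejas bands, only Metalface matches the pattern.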

Page 179: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So that’s relational databases and SQL, in a very small nutshell.

Page 180: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You can imagine, relational databases get very complicated for non-trivial things.

But you now know almost all there is to know: relational databases are graphs of sets, navigated via a declarative language.

Source: http://wiki.musicbrainz.org/-/images/5/52/ngs.png

Page 181: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Also, remember generalization? Theoretically, that’s just another type of

relationship between entities.

Page 182: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

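 Band      | Genre
-----------+-------
 Radiohead | Rock
 Metalface | Jazz

 Entertainer          | Since
----------------------+-------
 Radiohead            | 1985
 Metalface            | 2005
 Gonzo the Incredible | 1968
 Gob                  | 2002
 Penn & Teller        | 1990

 Magician             | Style
----------------------+-----------
 Gonzo the Incredible | Sorcery
 Gob                  | Fail
 Penn & Teller        | Conjuring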

Page 183: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

I say theoretically because nobody actually does this in practice, since

(unlike in object oriented programming) doing this carries a performance and

complexity burden in databases.

(But, you could do it, theoretically.)
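For the curious, this is roughly what "doing it theoretically" looks like: one table for the general type, one per specialized type, sharing a key. It is a sketch using Python's sqlite3 module, with table and column names invented to match the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entertainer (name TEXT PRIMARY KEY, since INTEGER);
-- Each specialized table reuses the general table's key.
CREATE TABLE band (name TEXT PRIMARY KEY REFERENCES entertainer(name), genre TEXT);
CREATE TABLE magician (name TEXT PRIMARY KEY REFERENCES entertainer(name), style TEXT);
INSERT INTO entertainer VALUES
    ('Radiohead', 1985), ('Metalface', 2005), ('Gonzo the Incredible', 1968),
    ('Gob', 2002), ('Penn & Teller', 1990);
INSERT INTO band VALUES ('Radiohead', 'Rock'), ('Metalface', 'Jazz');
INSERT INTO magician VALUES
    ('Gonzo the Incredible', 'Sorcery'), ('Gob', 'Fail'), ('Penn & Teller', 'Conjuring');
""")

# Getting a "whole" band back means joining the general and specific tables.
bands = conn.execute("""
    SELECT e.name, e.since, b.genre
    FROM entertainer e
    INNER JOIN band b ON b.name = e.name
    ORDER BY e.name
""").fetchall()
```

That extra join on every read is the performance and complexity burden mentioned above.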

Page 184: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

We’re going to spin up a relational database using Heroku, create and populate some tables, and show the data on a web site.

Page 185: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

About Heroku:
● Super simple hosted engine for running any code online
● Has built-in relational database capabilities with Postgres
● Uses a version control system called git
● Uses Amazon AWS to host the code
● It's owned by Salesforce.com (my company) but run separately; I'm not an expert
● Also note that this example is written in Ruby and I’m really not an expert on that. Hope it works!

Page 186: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Prerequisites:
1. You should already have git installed; do so now if not.
2. You should have the heroku toolbelt installed, and have created a heroku login. Do so now if you haven’t.
3. You should have already cloned my repo, but if not, do it now:
   $ cd ~
   $ git clone https://github.com/ivarley/unstructure-sxsw14

Page 187: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Steps:
1. Go to the directory you cloned the repo into:
   $ cd unstructure-sxsw14
2. Create and publish the app:
   $ heroku login
   $ heroku create
   $ git push heroku master
   $ heroku open

Page 188: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Congrats! You just published a web app.

Not convinced? Edit some of the visible text in the file:
app/views/welcome/index.html.erb

and republish:
$ git commit -am 'made an edit'
$ git push heroku master
$ heroku open

Page 189: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Now let’s add some data!

Heroku automatically created a Postgres database for you.
$ heroku pg:psql

opens a database prompt. Create a table and insert data:
CREATE TABLE band (bandid INT PRIMARY KEY, name VARCHAR, genre VARCHAR);
INSERT INTO band (bandid, name, genre) VALUES (1, 'Radiohead', 'Rock');
SELECT * FROM band;

Page 190: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Add some more tables and data:
CREATE TABLE venue (venueid INT PRIMARY KEY, name VARCHAR, address VARCHAR);
CREATE TABLE show (venueid INT, bandid INT, date VARCHAR, PRIMARY KEY (venueid, bandid, date));
INSERT INTO venue (venueid, name, address) VALUES (1, 'The Mohawk', '123 Red River, Austin, TX');
INSERT INTO show (venueid, bandid, date) VALUES (1, 1, '2014-03-07');

(The key on show covers venue, band and date together, so a venue can host more than one show.)

And create a view (basically, a saved SQL statement):
CREATE VIEW shows AS
  SELECT b.name AS band_name, v.name AS venue_name, s.date
  FROM band b
  INNER JOIN show s ON b.bandid = s.bandid
  INNER JOIN venue v ON s.venueid = v.venueid;

Page 192: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Now you can SELECT data:
SELECT * FROM shows;

And you’ll get:
 band_name | venue_name |    date
-----------+------------+------------
 Radiohead | The Mohawk | 2014-03-07

Note if you’re into nit-picking: I'm taking a shortcut here and saying that each record in "show" is a band/venue combination, with a date. If "show" were a proper entity (for example, if the show had a name, a promoter, etc.) then the proper “normalized” way to model it would be to create a show entity with an ID and a venue ID, show name, promoter, etc; and then have intersection tables between bands and shows (like a band_show table). But this is fine for now.

Page 193: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Leave the SQL prompt (type “\q”) and reopen the app:
$ heroku open

And (hopefully) voila! You see the results of:
SELECT * FROM shows;

Feel free to mess around now--add more data, to see how it’ll show up. If you’re really advanced, try changing the ruby code to show different SQL statements, etc. We’ll take about 10 minutes to play around, ask questions, etc.

Page 194: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Questions?

Page 195: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Chapter 4: Mutation

Page 196: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So, hierarchical databases were lame. But relational databases are awesome!

SQL Rules!! Right … ?

Page 197: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hrm. As awesome as they are, relational databases have a few warts.

And some of them have only become apparent recently.

Page 198: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

There’s a standard list of grievances:

● It’s really hard to work with recursive / graph relationships
● Results of SELECT queries are always flat tables, which means you have to reassemble nested structures yourself
● They don’t play well with object oriented programs
● They don’t support inheritance & superclassing!
● Relational modeling tools are generally quite sucky.
● They don’t scale well to extremely large data sets, because they promise things you can only do on a single (non-distributed) system.

Page 199: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

These are all interesting points; if we had a whole semester, we could spend weeks on any one of them.

Page 200: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But I want to spend the rest of our time on what is (IMO) the essential problem:

mutability.

(i.e. change, being mutated)

Page 201: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Let’s zoom way out for a minute.

By NASA [Public domain], via Wikimedia Commons

Page 202: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

We talked about hierarchies.
● Old school, classical
● Single top down view of reality

We talked about relations.
● Modern, networks, connections
● No single privileged access path or view

Page 203: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But both hierarchies and relations tend towards a commitment to a fixed representation of reality.

You’re modeling static concepts that exactly match the real world. Doing this with high fidelity is the crowning

achievement of the relational model.

Page 204: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But reality doesn’t hold still.

It changes, all the time.

Any fixed way of representing the world is doomed to become outdated.

Page 205: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Darwin knew what was up.

The nature of reality is to be dynamic, evolving.

Page 206: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

A species seems like a fixed thing, but that's just a label we attach.

They are fuzzy around the edges, and always changing.

Page 207: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Change is a fact of life at all levels of the abstraction ladder…

our understanding of structure itself

the structure of our software apps

facts about things in the real world

Page 208: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

And it’s only getting faster.

We can either hide our heads in the sand, or we can figure out what to do about it.

Page 209: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So let’s talk about 3 ways to smash the boundaries of data, in the mutation dimension:

● Non-Destructive Mutability
● Attribute Flexibility
● Model Agility

Page 210: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Non-Destructive Mutability

Page 211: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This is an UPDATE statement in SQL:
UPDATE band SET name = 'Jaydiohead' WHERE band_id = 5678 AND name = 'Radiohead';

Result:

Before:
 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

After:
 Band ID | Name
---------+------------
 1234    | Bjork
 5678    | Jaydiohead

Page 212: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You know what you can’t do?

Undo.

Edits in traditional relational databases are destructive. If you want to keep the old version, you have to do it yourself.

Note for data nerds: yes, databases keep transaction logs so you can undo and redo edits as part of transactions. But this is (a) implementation dependent, (b) not typically exposed to users in the relational model, and (c) not guaranteed to persist beyond the transaction itself.

Page 213: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This means that everyone has to either:

1. Add extra complexity to their data model, or
2. Accept that changes are destructive.

Both are pretty crappy options. Most people just do #2.

Page 214: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But, sometimes you can’t do that. Consider compliance.
● Who changed my salary?
● Who deleted this opportunity from the pipeline?
● Who took my stapler?

You might not care a bunch about that, but Sarbanes-Oxley sure does.

And if you don’t know what that is, count yourself lucky and go back to making pretty things.

Page 215: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So how do people store history now?

Option #1: audit columns (partial solution)

So this …

 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

Page 216: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So how do people store history now?

Option #1: audit columns (partial solution)

So this …

 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

becomes this.

 Band ID | Name       | Create Date | Created By | Modify Date | Modified By | Deleted?
---------+------------+-------------+------------+-------------+-------------+----------
 1234    | Bjork      | 3/6/14 2pm  | Ian Varley | 3/6/14 3pm  | Jan Jones   | false
 5678    | Jaydiohead | 3/5/14 1pm  | John Smith | 3/7/14 1pm  | Ian Varley  | false

Page 217: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So how do people store history now?

Option #2: History tables

So this …

 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

Page 218: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So how do people store history now?

Option #2: History tables

So this …

 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

becomes this.

 Band ID | Name       | Version | Date   | By
---------+------------+---------+--------+-------
 1234    | Bjork      | 1       | 3/6/14 | Ian V
 5678    | Radiohead  | 1       | 3/6/14 | Ian V
 5678    | Jaydiohead | 2       | 3/7/14 | Ian V
 1234    | Fjork      | 2       | 3/7/14 | Ian V
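To get a feel for what a history table costs you at query time, here is a sketch using Python's sqlite3 module. The table name `band_history`, the ISO-formatted dates and the helper `names_as_of` are all made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE band_history (
    band_id INTEGER, name TEXT, version INTEGER, changed_on TEXT, changed_by TEXT,
    PRIMARY KEY (band_id, version)
);
INSERT INTO band_history VALUES
    (1234, 'Bjork', 1, '2014-03-06', 'Ian V'),
    (5678, 'Radiohead', 1, '2014-03-06', 'Ian V'),
    (5678, 'Jaydiohead', 2, '2014-03-07', 'Ian V'),
    (1234, 'Fjork', 2, '2014-03-07', 'Ian V');
""")


def names_as_of(date):
    """Return {band_id: name} using each band's latest version on or before `date`."""
    rows = conn.execute("""
        SELECT h.band_id, h.name
        FROM band_history h
        WHERE h.version = (
            SELECT MAX(h2.version) FROM band_history h2
            WHERE h2.band_id = h.band_id AND h2.changed_on <= ?
        )
    """, (date,)).fetchall()
    return dict(rows)


yesterday = names_as_of("2014-03-06")
today = names_as_of("2014-03-07")
```

Nothing is ever overwritten, so the old names are still there. The price is that even "what is this band called?" now needs a correlated subquery.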

Page 219: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Neither of these is ideal.

They ugly up your data model, which is not just an inconvenience; it makes it harder to see the “real” stuff.

It also violates the spirit of normalization: if something means the same thing, don’t repeat it all over the place.

Page 220: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But what if … the database took care of this for you?

 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

(The slide repeats this same table eleven times.)

Page 221: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Some newer ones do:
● Google pioneered this with BigTable - time is a privileged dimension, part of the model, stored with every datum
● NoSQL stores like HBase follow suit
● Salesforce offers “field history” out of the box, as a meta-feature on any entity, expressed as a history table
● Document stores can store older versions
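As a toy illustration of "time stored with every datum", here is a tiny append-only store in plain Python. It is only a sketch of the idea; the class name and methods are invented here and are not the API of BigTable, HBase or any other real system.

```python
import bisect


class VersionedStore:
    """Append-only key/value store: a write never overwrites, it adds a version."""

    def __init__(self):
        self._cells = {}  # key -> list of (timestamp, value), kept sorted

    def put(self, key, value, timestamp):
        bisect.insort(self._cells.setdefault(key, []), (timestamp, value))

    def get(self, key, as_of=None):
        versions = self._cells.get(key, [])
        if as_of is None:
            return versions[-1][1] if versions else None
        # Count the versions written at or before `as_of`; the last one wins.
        timestamps = [ts for ts, _ in versions]
        i = bisect.bisect_right(timestamps, as_of)
        return versions[i - 1][1] if i else None


store = VersionedStore()
store.put(("band", 5678, "name"), "Radiohead", timestamp=1)
store.put(("band", 5678, "name"), "Jaydiohead", timestamp=2)

current_name = store.get(("band", 5678, "name"))
old_name = store.get(("band", 5678, "name"), as_of=1)
before_any_write = store.get(("band", 5678, "name"), as_of=0)
```

Reading "as of" an earlier timestamp is the same operation as reading the current value, just with a different cutoff.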

Page 222: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Does this use a lot of space? Yes.

But guess what? We have a lot of space.

Page 223: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

And, it turns out that for a lot of computing applications, making data immutable makes the problem way easier to reason about.

(For more on that, see Pat Helland’s talk, Immutability Changes Everything.)

Page 224: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Here’s the crazy part. If we do this across the board, we end up with ... data time travel.

“K-9, show me my accounts receivable as of last May, and compare it with today.”

Page 225: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Plain old “undo” is a subset of this. Ever wondered why Google web products all have “undo” and nobody else’s do?

(Full revision history is the fancy version. They have that too.)

Page 226: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

My pronouncement:

Henceforth, all databases should make time a privileged dimension, and retain older versions of data in a way that supports time travel.

Page 227: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Attribute Flexibility

Page 228: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Historically, databases have been something of a schoolmarm.

Image: Public domain. From the 1894 Laughable Lyrics: A Fourth Book of Nonsense Poems, Songs, Botany, Music, etc. by Edward Lear.

Page 229: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Some things are decided and enforced directly at the database level:

- what attributes an entity can have
- how many attributes there are
- what type of data to store in each attribute (string, number, date, currency, etc.)
- other constraints (max & min values, etc.)

Page 230: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But why is this the database's job?

It’s partially a historical accident.

Disk & memory used to be scarce and highly optimized, so record formats had to be prescriptive and fixed.

Page 231: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

It’s also partially a mindset.

Centralizing decisions about structure is very tempting; it’s easy to overestimate your ability to “get it right” the first time,

or find the “one true model” for all.

Page 232: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But …

What if a database's job is really to store whatever fields I give it?

What if I want to say, "Let me store any additional facts I can think of about bands, venues, shows"?

Page 233: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

If your model imposes less, then there's less to change when the world changes.

This is analogous to the difference between statically typed languages (like C and Java) and dynamically typed scripting languages (like Ruby and Python).

Sometimes you want to trade safety for flexibility.

Page 234: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

As an example, Thom Yorke’s booking agent might use address book software.

But what if they had to update the database schema for each new kind of social media service he decides to use?

Page 235: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You can picture just adding more attributes, like key-value pairs in JSON:

{
  "id": "ARP5KQF1187B9B4DD1",
  "name": "Explosions in the Sky",
  "genres": [{"name": "post rock"}],
  "years_active": [{"start": 1999}],
  "artist_location": {
    "location": "Austin, TX, US",
    "city": "Austin",
    "region": "Texas",
    "country": "United States"
  }
}

Page 236: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You can picture just adding more attributes, like key-value pairs in JSON:

{
  "id": "ARP5KQF1187B9B4DD1",
  "name": "Explosions in the Sky",
  "genres": [{"name": "post rock"}],
  "years_active": [{"start": 1999}],
  "artist_location": {
    "location": "Austin, TX, US",
    "city": "Austin",
    "region": "Texas",
    "country": "United States"
  },
  "familiarity": 0.687572,
  "favorite_color": "Blue"
}
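In code, "just adding more attributes" really is this small. A sketch with a plain Python dictionary standing in for the stored document, using a trimmed copy of the record above:

```python
import json

# A trimmed copy of the band document from the slide.
band = {
    "id": "ARP5KQF1187B9B4DD1",
    "name": "Explosions in the Sky",
    "genres": [{"name": "post rock"}],
}

# New facts are just new keys; nobody has to change a schema first.
band["familiarity"] = 0.687572
band["favorite_color"] = "Blue"

# Round-trip through JSON text to show the document survives unchanged.
restored = json.loads(json.dumps(band))
```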

Page 237: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

“Tagging” is the simplest version of this.

Each tag is a boolean (true/false), and there can be any number of them.

Page 238: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

You can already model attribute flexibility at a meta-level in a relational database...

 Band ID | Name
---------+-----------
 1234    | Bjork
 5678    | Radiohead

 Band ID | Attribute   | Value
---------+-------------+----------
 5678    | City        | Austin
 5678    | State       | TX
 5678    | Fav. Color  | Blue
 5678    | Familiarity | 0.687572

After all, it’s just another degree of cardinality, right?

Page 239: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But writing queries against it is a massive PITA (trust me).

And, it’s the same meta problem: if you did this for every entity in your model, your model would

be impossible to comprehend.
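To see why, here is a sketch of the query you need just to get "city" and "state" back as ordinary columns from an attribute/value table. It assumes Python's sqlite3 module; the table name `band_attribute` is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE band_attribute (
    band_id INTEGER REFERENCES band(band_id),
    attribute TEXT,
    value TEXT,  -- every value is a string; the types are gone
    PRIMARY KEY (band_id, attribute)
);
INSERT INTO band VALUES (1234, 'Bjork'), (5678, 'Radiohead');
INSERT INTO band_attribute VALUES
    (5678, 'City', 'Austin'), (5678, 'State', 'TX'),
    (5678, 'Fav. Color', 'Blue'), (5678, 'Familiarity', '0.687572');
""")

# Each attribute you want as a column needs its own CASE expression.
rows = conn.execute("""
    SELECT b.name,
           MAX(CASE WHEN a.attribute = 'City' THEN a.value END) AS city,
           MAX(CASE WHEN a.attribute = 'State' THEN a.value END) AS state
    FROM band b
    LEFT JOIN band_attribute a ON a.band_id = b.band_id
    GROUP BY b.band_id, b.name
    ORDER BY b.band_id
""").fetchall()
```

Two attributes already take a join, a GROUP BY and two CASE expressions; every further attribute adds another line.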

Page 240: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Attribute flexibility is “table stakes” for new DBs:
● Google did it with BigTable - the columns for a row are totally flexible at run time, and the values are simple byte arrays
● Most other NoSQL stores offer this too
● Some services make it the backbone of what they offer (e.g. keen.io - 1 entity, but any set of attributes you want to send)

Page 241: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

My second pronouncement:

Henceforth, most databases should really think about offering attribute flexibility, accepting writes and reads of “columns” that haven’t been declared in advance.

Page 242: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Aside: if columns aren’t predefined, how do we know their data type?

● Strings, Numbers, Dates, etc?
● JSON has the right idea:
  ○ If it’s in quotes, it’s a string
  ○ If it’s not, it’s either:
    ■ a number
    ■ true or false
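As a quick illustration, Python's standard `json` module recovers those types from the text alone, with no column declarations anywhere. The `start` and `active` fields below are made up for the example.

```python
import json

# No schema anywhere: the types come from how each value is written.
doc = json.loads(
    '{"name": "Explosions in the Sky", "familiarity": 0.687572, '
    '"start": 1999, "active": true}'
)

# Map each field to the name of the Python type the parser chose for it.
types = {key: type(value).__name__ for key, value in doc.items()}
```

Dates are the odd one out: JSON has no date literal, so they usually travel as strings and need a convention on top.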

Page 243: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

And … while we’re at it, why not just get rid of attributes altogether, and say that entities store JSON blobs, with nesting intact?

(We’ll come back to that …)

Page 244: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Model Agility

Page 245: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

"Schemas usually remain relatively stable over the lifetime of a database for most applications."

- S Navathe, 1992

"No one will need more than 637 kB of memory for a personal computer."

- B. Gates, 1979

Page 246: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

It’s true, though. Database schemas do tend to remain stable … because it sucks so bad to change them!

● Honestly, databases have always been the least “agile” part of software development.

● It's the final frontier of "BDUF" (big design up front)

Page 247: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

But, wait. Aren’t schemas stored as data? Can’t you just change them?

ALTER TABLE Band ADD COLUMN favorite_color VARCHAR(32);

Page 248: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

In theory, that works. In practice, it doesn’t.

● You can change the schema, but you can only ever have one schema at a time
● Some changes might require downtime
● For big tables, that could mean you’re offline for hours, days, or even weeks.
● And, by the way, if you follow my first pronouncement about data and time travel, what happens when you change the schema?

Page 249: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

What to do instead?

One option is to generalize: make your model so generic, you never have to alter a table!

Page 250: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This is what “architecture astronauts” do, and it doesn’t work, because you eventually end up with this model:

(You can sometimes find a sweet spot, but more likely you’re just pushing the essentialism and brittleness to another layer.)

Page 251: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

What we really need is a system where structure changes are also non-destructive.

(We’d need that anyway if we want to time travel with the data, right?)

Page 252: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

There’s actually an example of this!

http://couchdb.apache.org/

Page 253: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Apache CouchDB is a NoSQL database.

● The database is just a flat collection of JSON documents
● There’s no schema! Put in whatever you want!
● You create views (using javascript) that “materialize” certain access patterns across your documents.

It’s not perfect (scaling is tricky, etc.) but as an illustration of these points, it’s spot on.

Page 254: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

CouchDB hits all 3 of our mutability desires:

● All documents are versioned.
  ○ Non-destructive Mutation!
● Document JSON can have any structure
  ○ Attribute Flexibility!
● Schema-On-Read using views
  ○ Model Agility!
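To make the three ideas concrete, here is a toy in-memory sketch in Python. It is emphatically not how CouchDB is implemented; `ToyStore` and `by_genre` are hypothetical names, and the point is only to show the three properties side by side: every write is kept, documents can have any fields, and structure is imposed at read time by a view function.

```python
class ToyStore:
    """Hypothetical toy store; an illustration, not CouchDB's implementation."""

    def __init__(self):
        self._versions = {}  # doc id -> list of every version ever written

    def put(self, doc_id, doc):
        # Non-destructive mutation: append a new version, never overwrite.
        self._versions.setdefault(doc_id, []).append(dict(doc))

    def get(self, doc_id, version=-1):
        # Default is the latest version; older ones stay readable by index.
        return self._versions[doc_id][version]

    def view(self, map_fn):
        # Schema-on-read: the view function decides what structure it sees.
        rows = []
        for versions in self._versions.values():
            rows.extend(map_fn(versions[-1]))
        return rows


store = ToyStore()
store.put("a1", {"name": "Explosions in the Sky", "genres": ["post rock"]})
# Attribute flexibility: a field nobody declared in advance.
store.put("a1", {"name": "Explosions in the Sky", "genres": ["post rock"],
                 "favorite_color": "Blue"})

def by_genre(doc):
    """A 'view': one (genre, name) row per genre, skipping docs without genres."""
    return [(genre, doc["name"]) for genre in doc.get("genres", [])]

rows = store.view(by_genre)
```

After the second `put`, `store.get("a1", 0)` still returns the original document without `favorite_color`.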

Page 255: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

It also happens to be wicked easy to get started with, so let’s do one more hands-on exercise.

Page 256: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Let’s load some music data into CouchDB!

You should have already installed CouchDB, but if not, do so now.

Page 257: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Earlier, we used the Do512 API. This time we’ll use The Echo Nest, a huge music data repository. (Which was, incidentally, just bought by Spotify yesterday, March 6th, 2014! Too bad we didn’t get on this listen.up thing a little sooner amiright?)

They require setting up an API key for access, so as a shortcut I’ve done that part for you.

Page 258: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

The file mutation/echonest-artists-austin-v1.json is data on the top 100 “most familiar” artists with a hometown of Austin, TX, from this API call:

http://developer.echonest.com/api/v4/artist/search?api_key={MY_API_KEY}&format=json&results=100&start=0&bucket=familiarity&bucket=genre&bucket=artist_location&bucket=years_active&artist_location=austin

Full disclosure, I modified the result slightly, so it’d work immediately with CouchDB bulk load:

● removed the outer "response" wrapper from the API
● changed the name of the array from "artists" to "docs"
● changed all the "id" fields to "_id" so CouchDB would use them
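Those three tweaks could be scripted roughly like this. It is a sketch, not the script actually used for the workshop file, and it assumes the API result looks like `{"response": {"artists": [...]}}` and that only the top-level "id" of each artist needs renaming. The `to_bulk_docs` name is made up.

```python
def to_bulk_docs(api_result):
    """Reshape an (assumed) Echo Nest search result for CouchDB bulk load."""
    # 1. Drop the outer "response" wrapper.
    artists = api_result["response"]["artists"]

    docs = []
    for artist in artists:
        doc = dict(artist)  # copy, so the input is left untouched
        # 3. Rename "id" to "_id" so CouchDB uses it as the document id.
        doc["_id"] = doc.pop("id")
        docs.append(doc)

    # 2. The array is called "docs" instead of "artists".
    return {"docs": docs}


# Tiny hand-made sample in the assumed shape, not real API output.
sample = {
    "response": {
        "artists": [
            {"id": "ARP5KQF1187B9B4DD1", "name": "Explosions in the Sky"}
        ]
    }
}
bulk = to_bulk_docs(sample)
```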

Page 259: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Ensure CouchDB is installed and running:

http://127.0.0.1:5984/_utils/index.html

Create a new database called “listenup”.

Page 260: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Upload a bunch of data:

$ cd ~/unstructure-sxsw14/mutation/
$ curl -H "Content-Type: application/json" -d @echonest-artists-austin-v1.json -X POST http://127.0.0.1:5984/listenup/_bulk_docs

Reload the database web page:

http://127.0.0.1:5984/_utils/database.html?listenup
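If you would rather script the upload than use curl, a rough Python equivalent with only the standard library might look like this. The `build_bulk_request` helper and the one-document payload are hypothetical; the URL, database name, and header match the curl command above. The request is only built here, and sending it is left commented out because it needs CouchDB running locally.

```python
import json
import urllib.request

def build_bulk_request(payload, base_url="http://127.0.0.1:5984", db="listenup"):
    """Build (but do not send) a POST to the database's _bulk_docs endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{db}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Illustrative payload only; the workshop file holds the real artist data.
request = build_bulk_request({"docs": [{"_id": "example", "name": "Example Artist"}]})

# To actually send it (requires CouchDB running locally):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```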

Page 261: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

See what happens when you change a value. Let’s pick a band at random:

http://127.0.0.1:5984/_utils/document.html?listenup/ARJXDMJ11FF10D84F9

Double click the “genres” value, and change it to:

[{"name": "garage soul"}]

Then click “Save Document”.

Page 262: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Now there are two versions of this document:

Non-destructive mutability, in the flesh!

Page 263: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Now add a field that never existed before:

http://127.0.0.1:5984/_utils/document.html?listenup/ARJXDMJ11FF10D84F9

Click “Add Field”.

Add “Influences” as the Field name, and click “Save Document”.

Double click the “null” next to “Influences”, and change the value to:

[{"name": "Michael McDonald"}]

Click “Save Document”.

Page 264: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Attribute flexibility!

Notice that so far, we haven’t once had to specify a schema.

Page 265: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Now we query the data by writing code to implement materialized views. This is actually a little complicated ...

Page 266: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

(Yes, I know this is actually about Riak, not CouchDB. Why do you hate laughing?)

Comic by John Muellerleile / http://thinkdifferent.ly/fault-tolerance.png

Page 267: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

As a simple example, create a new document:

{
  "_id": "_design/application",
  "views": {
    "genre-view": {
      "map": "function(artist) {if(artist.name && artist.genres) {artist.genres.forEach(function(genre) {emit(genre, artist.name);});}}"
    }
  }
}

Then visit:

http://127.0.0.1:5984/listenup/_design/application/_view/genre-view
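If the one-line JavaScript is hard to read, here is roughly the same logic rendered in Python and run over two hand-made documents. This is only an emulation to show what rows the map function produces: CouchDB runs the JavaScript itself and also sorts the rows by key, which this sketch does not do. `genre_view_map` and the second document are made up for illustration.

```python
def genre_view_map(artist):
    """Python rendering of the slide's JavaScript map function."""
    rows = []
    # Schema-on-read: documents without a name or genres are simply skipped.
    if artist.get("name") and artist.get("genres"):
        for genre in artist["genres"]:
            rows.append((genre, artist["name"]))  # emit(genre, artist.name)
    return rows


docs = [
    {"_id": "ARP5KQF1187B9B4DD1", "name": "Explosions in the Sky",
     "genres": [{"name": "post rock"}]},
    {"_id": "no-genres-example", "name": "Hypothetical Band"},  # made-up doc
]

# One (key, value) row per genre per artist.
rows = [row for doc in docs for row in genre_view_map(doc)]
```

Note that the key is the whole genre object, `{"name": "post rock"}`, because that is what the map function emits. Emitting `genre.name` instead would give plain string keys, which may be easier to query.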

Page 268: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hands on!

Try making your own view. Some examples are here:

http://guide.couchdb.org/draft/cookbook.html

Page 269: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Questions?

Page 270: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Chapter 5: Conclusion & Future Directions

Page 271: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

So where have we been?

Page 272: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Hierarchy

Relation

Mutation

Page 273: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Before we go off to our subsequent debauchery, if you’ll indulge me, we actually have a tiny bit more smashing to do.

Page 274: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Now it’s time to go through the looking glass.

Image from Disney Wikia: http://disney.wikia.com/wiki/Alice

Page 275: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

(Some of these thoughts are adapted from my previous presentation, I’ve Always Wanted To Data Model)

Page 276: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

This is a technical book from the 1970s. It’s a philosophy book.

It opened my eyes to some of the real underlying questions.

Page 277: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Is data “true”?

Page 278: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Of course not, not categorically.

Official Radiohead Band Members

Member             Year Joined
Thom Yorke         1985
Jonny Greenwood    1985
Ed O'Brien         1985
Colin Greenwood    1985
Philip Selway      1985
Ian Varley         2014

Page 279: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Is data “real”?

Page 280: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Depends on what you mean.

Robin Hood’s Band Of Merry Men

Merry Man               Merry-ness
Robin Hood              High
Little John             Medium
Much The Miller’s Son   Medium
Friar Tuck              High
Arthur a Bland          Low
Maid Marian             Unknown

Page 281: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

My personal theory:

Data is “existential claims”.

The fact that this data, and not some other, is stored implies that someone (or something) is making a claim about the existential state of something. This claim may or may not correspond to the actual existential state of that something.

Page 282: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

As cool as the boundary-smashing tools we looked at today are, it’s also wise to remember that we don’t really know much of what’s going on. For example ...

Page 283: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Nobody actually knows what an “entity” really is.

Page 284: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

“Entity” is another word for Category, in linguistics terms.

And an important property of linguistic categories is that they are slippery.

See:

● Steven Pinker: The Stuff Of Thought
● Douglas Hofstadter: Surfaces & Essences
● George Lakoff: Women, Fire, and Dangerous Things

Page 285: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

part: an abstract definition of a connected set of physical materials that serve some purpose

part: one instance of a manufactured item, which either does or does not meet quality standards

Images: (1) Atwood Hydraulic Surge Brake Actuator, http://www.pacifictrailers.com/Atwood-Hydraulic-Brake-Actuator-Parts-List-and-Schematic/; (2) Ford Motor Company flywheel magneto assembly line 1913, source unknown

Page 286: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

And if you think you can “solve” the problem, I’ve got some World Trade Center insurance policies to sell you.

Page 287: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

That said, there are a couple techniques we could adopt that would help:

● First-class Sub- / Super-Typing
● First-class Scoping and Aliasing

(Not that there aren’t ways to do this in relational models, but they’re unobvious and not widely used.)

Page 288: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Discrete models encourage black & white thinking in a gray world

Page 289: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Sometimes the deep structure is actually ambiguous.

Image credit: By Chire (Own work) [Public domain or Public domain], via Wikimedia Commons

Page 290: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Our current tools are s#!t.

Boxes & lines aren’t how we actually think: our spatial processing of diagrams doesn’t map well to our temporal, spatial, and causal comprehension of data structure.

Page 291: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

F*** THAT NOISE.

Page 292: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

If we had the right tools, what would they look like?

Page 293: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

They’d have native support for ...

● My mutability requirements
  ○ Non-destructive mutation
  ○ Attribute Flexibility
  ○ Model Agility
● The 3 Ps:
  ○ Provenance, Provability, Probability

Page 294: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

They’d have native support for ...

● Supertyping
● Extensible Meta-Metadata
● Semantic Zoom
● Prototype Generation
● Model Versioning and Diffing

Page 295: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Interesting direction: RAML

● Textual (YAML-based)
● Strong, simple syntax
● Generates useful models
● Communication focussed

Could there be something like this for data models?

Page 296: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Who knows ...

Page 297: Unstructure: Smashing the Boundaries of Data (SxSWi 2014)

Thanks!

@thefutureian
ianvarley.com

(If you enjoyed this, please rate it a 5 … sxsw.com/rate)