Schema Design by Gary Murakami

Lead Engineer / Evangelist

Gary J. Murakami, Ph.D.

#MongoDB

Schema Design


http://www.ihsa.org/SportsActivities/Chess/RecordsHistory.aspx

http://en.wikipedia.org/wiki/Algorithms_+_Data_Structures_=_Programs


Chess 4.5 (Northwestern University)

Larry Atkin & Dave Slate

chessprogramming.wikispaces.com

http://en.wikipedia.org/wiki/Chess_(Northwestern_University)


http://chessprogramming.wikispaces.com/David+Slate



Agenda

• What is a Record?

• Core Concepts

• What is an Entity?

• Associating Entities

• General Recommendations

• Questions


All application development is Schema Design


Success comes from Proper Data Structure

What is a Record?


Key → Value

• One-dimensional

• Single value is a blob

• Query on key only

• No schema

• Value cannot be updated, only replaced

Key Blob


Relational

• Two-dimensional (tuples)

• Each field is a single value

• Query on any field

• Very structured schema (table)

• In-place updates *

• Normalization requires many tables, joins, and indexes, and results in poor data locality and performance

PrimaryKey


Document

• N-dimensional

• Each field can contain 0, 1, many, or embedded values

• Query on any field & level

• Flexible schema

• Inline updates *

• Embedding related data gives optimal data locality, requires fewer indexes, and performs better

_id
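The three record shapes above can be compared side by side. The following is a minimal sketch in plain Python, assuming in-memory stand-ins for each kind of store; the names `kv_store`, `contacts_table`, `addresses_table`, and `contact_doc` are hypothetical and not part of the talk.

```python
import json

# Key -> Value: the value is an opaque blob, reachable only through its key.
kv_store = {"contact:2": json.dumps({"name": "Steven Jobs", "city": "Cupertino"})}

# Relational: flat tuples with one value per column; related data sits in another table.
contacts_table = [(2, "Steven Jobs", 1)]      # (id, name, address_id)
addresses_table = [(1, "Cupertino", "CA")]    # (id, city, state)

# Document: a field may hold a single value, an array, or an embedded document.
contact_doc = {
    "_id": 2,
    "name": "Steven Jobs",
    "phones": ["408-996-1010"],
    "address": {"city": "Cupertino", "state": "CA"},
}


def city_from_kv(key):
    """The whole blob must be fetched and decoded before any field can be read."""
    return json.loads(kv_store[key])["city"]


def city_from_tables(contact_id):
    """A join across two tables is needed to reach the city."""
    for cid, _name, address_id in contacts_table:
        if cid == contact_id:
            for aid, city, _state in addresses_table:
                if aid == address_id:
                    return city
    return None


def city_from_doc(doc):
    """The embedded address travels with the contact."""
    return doc["address"]["city"]
```

All three functions return the same city; what differs is how much work is needed to reach it.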

Core Concepts


Traditional Schema Design
Focus on data storage


Document Schema Design
Focus on data use


Another way to think about it

Traditional: What answers do I have?

Document: What questions do I have?


Three Building Blocks of Document Schema Design


1 – Flexibility

• Choices for schema design

• Each record can have different fields

• Field names consistent for programming

• Common structure can be enforced by application

• Easy to evolve as needed


2 – Arrays
Multiple Values per Field

• Each field can be:
  – Absent
  – Set to null
  – Set to a single value
  – Set to an array of many values

• Query for any matching value
  – Can be indexed and each value in the array is in the index
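The idea that each value in an array gets its own index entry can be sketched in a few lines. This is an illustration in plain Python, not MongoDB's implementation; the `tags` field and the function name `build_multikey_index` are assumptions made for the example, and absent fields are simply skipped here.

```python
from collections import defaultdict


def build_multikey_index(docs, field):
    """Map each indexed value to the set of _ids whose field contains it."""
    index = defaultdict(set)
    for doc in docs:
        if field not in doc:
            continue  # simplification: absent fields are skipped in this sketch
        value = doc[field]
        # A single value is treated like a one-element array.
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].add(doc["_id"])
    return index


docs = [
    {"_id": 1, "tags": ["chess", "databases"]},  # array of many values
    {"_id": 2, "tags": "chess"},                 # single value
    {"_id": 3, "tags": None},                    # set to null
    {"_id": 4},                                  # absent
]
index = build_multikey_index(docs, "tags")
```

Looking up `index["chess"]` finds both the document with the array and the document with the single value, which is the behavior the slide describes as "query for any matching value".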


3 – Embedded Documents

• Any value can be a document

• Nested documents provide structure

• Query any field at any level
  – Can be indexed
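Querying a field at any level is usually written with a dotted path such as "address.city". A minimal sketch of how such a path can be resolved against nested documents follows; `get_path` and `find` are hypothetical helper names, and the matching is a plain equality test rather than a full query language.

```python
def get_path(doc, dotted_path):
    """Walk a nested document along a dotted path; return None if any step is missing."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, dotted_path, value):
    """Return the documents whose value at dotted_path equals value."""
    return [d for d in docs if get_path(d, dotted_path) == value]


contacts = [
    {"_id": 1, "name": "Steven Jobs", "address": {"city": "Cupertino", "state": "CA"}},
    {"_id": 2, "name": "Larry Page", "address": {"city": "Palo Alto", "state": "CA"}},
    {"_id": 3, "name": "No Address"},
]
```

For example, `find(contacts, "address.city", "Palo Alto")` returns only the second contact, and a document without an address is skipped rather than raising an error.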


Belle and Endgame tablebases

Play chess with God – Ken Thompson

http://en.wikipedia.org/wiki/Belle_(chess_machine)

http://en.wikipedia.org/wiki/Endgame_tablebase

http://cm.bell-labs.com/who/ken/



http://en.wikipedia.org/wiki/Ken_Thompson


What is an Entity?


An Entity

• Object in your model

• Associations with other entities

Referencing (Relational)     Embedding (Document)
has_one                      embeds_one
belongs_to                   embedded_in
has_many                     embeds_many
has_and_belongs_to_many

MongoDB has both referencing and embedding for universal coverage


Let's model something together
How about a business card?

Business Card


Contacts

{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "phone": "408-996-1010",
  "address_id": 1
}

Referencing


Addresses

{
  "_id": 1,
  "street": "10260 Bandley Dr",
  "city": "Cupertino",
  "state": "CA",
  "zip_code": "95014",
  "country": "USA"
}

Contacts

{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014",
    "country": "USA"
  },
  "phone": "408-996-1010"
}

Embedding
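The practical difference between the two business-card layouts is the number of lookups needed to read a contact's address. The sketch below counts fetches against a tiny in-memory stand-in for a collection; the `Collection` class and its `find_one` method are assumptions made for illustration and are not a MongoDB driver API.

```python
class Collection:
    """A tiny in-memory stand-in for a collection that counts lookups by _id."""

    def __init__(self, docs):
        self._by_id = {d["_id"]: d for d in docs}
        self.fetches = 0

    def find_one(self, _id):
        self.fetches += 1
        return self._by_id.get(_id)


# Referencing: the contact stores address_id and the address lives elsewhere.
addresses = Collection([{"_id": 1, "street": "10260 Bandley Dr", "city": "Cupertino"}])
contacts_by_reference = Collection([{"_id": 2, "name": "Steven Jobs", "address_id": 1}])

# Embedding: the address is part of the contact document.
contacts_embedded = Collection([
    {"_id": 2, "name": "Steven Jobs",
     "address": {"street": "10260 Bandley Dr", "city": "Cupertino"}},
])


def city_by_reference(contact_id):
    """Two lookups: the contact first, then the address it points to."""
    contact = contacts_by_reference.find_one(contact_id)
    address = addresses.find_one(contact["address_id"])
    return address["city"]


def city_embedded(contact_id):
    """One lookup: the address arrives with the contact."""
    return contacts_embedded.find_one(contact_id)["address"]["city"]
```

Both functions return the same city, but the referenced layout costs two fetches where the embedded layout costs one.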



Relational Schema

Contact
• name
• company
• title
• phone

Address
• street
• city
• state
• zip_code


Document Schema

Contact
• name
• company
• title
• phone
• address
  – street
  – city
  – state
  – zip_code


How are they different? Why?

Relational: Contact (name, company, title, phone) and Address (street, city, state, zip_code) are separate tables

Document: Contact (name, company, title, phone) contains address (street, city, state, zip_code)

{
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014"
  },
  "phone": "408-996-1010"
}

Schema Flexibility


{
  "name": "Larry Page",
  "url": "http://google.com/",
  "title": "CEO",
  "company": "Google!",
  "email": "[email protected]",
  "address": {
    "street": "555 Bryant, #106",
    "city": "Palo Alto",
    "state": "CA",
    "zip_code": "94301"
  },
  "phone": "650-618-1499",
  "fax": "650-330-0100"
}

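Schema flexibility means the two business cards can sit in the same collection even though one carries extra fields. A short sketch of how application code might cope with that, using trimmed-down versions of the two cards; the helper names `field_names` and `with_field` are hypothetical.

```python
contacts = [
    {"name": "Steven Jobs", "title": "VP, New Product Development",
     "company": "Apple Computer", "phone": "408-996-1010"},
    {"name": "Larry Page", "url": "http://google.com/", "title": "CEO",
     "company": "Google!", "phone": "650-618-1499", "fax": "650-330-0100"},
]


def field_names(docs):
    """Union of all field names seen across the documents, sorted for stable output."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return sorted(names)


def with_field(docs, field):
    """Names of the contacts that actually carry the given field."""
    return [d["name"] for d in docs if field in d]


def fax_or_default(doc, default="n/a"):
    """Optional fields are read with a default instead of assuming they exist."""
    return doc.get("fax", default)
```

The field names stay consistent across documents, so code written against "name" or "phone" works for every card, while optional fields such as "fax" are handled with a default.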


Longest “Database Endgame” Mate

• Augment schema with meta data
  – Distance to mate (DTM)
  – Distance to conversion (DTC)

• Retrograde analysis of DB

• Longest checkmate
  – 6 piece – 262 moves, KRNKNN
  – 7 piece – 517 moves, so far
    • Completion by 2015

http://timkr.home.xs4all.nl/chess2/diary_3.htm

http://timkr.home.xs4all.nl/chess2/diary_16.htm

Example


Let’s Look at an Address Book


Address Book

• What questions do I have?

• What are my entities?

• What are my associations?


Address Book Entity-Relationship

Contacts
• name
• company
• title

Associated entities (Contacts to entity)

• Addresses (1 to N): type, street, city, state, zip_code
• Phones (1 to N): type, number
• Emails (1 to N): type, address
• Thumbnails (1 to 1): mime_type, data
• Portraits (1 to 1): mime_type, data
• Twitters (1 to 1): name, location, web, bio
• Groups (N to N): name

Associating Entities


One to One




One to One
Schema Design Choices

• Reference: contact holds twitter_id (contact 1 to 1 twitter)

• Reference: twitter holds contact_id (contact 1 to 1 twitter)

• Redundant to track relationship on both sides
  – Both references must be updated for consistency

• Saves a fetch if no twitter

• Embed: contact embeds twitter (1)


One to One
General Recommendation

• Full contact info all at once
  – Contact embeds twitter

• Parent-child relationship
  – “contains”

• No additional data duplication

• Can query or index on embedded field
  – e.g., “twitter.name”

Contact embeds twitter (1)


One to Many




One to Many
Schema Design Choices

• Reference: contact holds phone_ids: [ ] (contact 1 to N phone)

• Reference: phone holds contact_id (contact 1 to N phone)

• Redundant to track relationship on both sides
  – Both references must be updated for consistency

• Not possible in relational DBs

• Saves a fetch if no phones

• Embed: contact embeds phones (N)


One to Many
General Recommendation

• Full contact info all at once
  – Contact embeds multiple phones

• Parent-children relationship
  – “contains”

• No additional data duplication

• Can query or index on any field
  – e.g., { “phones.type”: “mobile” }

Contact embeds phones (N)
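A query such as { “phones.type”: “mobile” } matches a contact when any one of its embedded phones has that type. The sketch below mimics that matching rule in plain Python over in-memory documents; it is an illustration of the semantics, not driver code, and `find_by_element` is a hypothetical name.

```python
def any_element_matches(doc, array_field, subfield, value):
    """True if at least one embedded document in the array has subfield == value."""
    items = doc.get(array_field) or []
    return any(isinstance(item, dict) and item.get(subfield) == value for item in items)


def find_by_element(docs, dotted, value):
    """Resolve a two-part path like 'phones.type' against an array of embedded documents."""
    array_field, subfield = dotted.split(".", 1)
    return [d for d in docs if any_element_matches(d, array_field, subfield, value)]


contacts = [
    {"name": "Ann", "phones": [{"type": "work", "number": "555-0100"},
                               {"type": "mobile", "number": "555-0101"}]},
    {"name": "Bob", "phones": [{"type": "work", "number": "555-0102"}]},
    {"name": "Cyd"},  # no phones at all
]
```

With this data, `find_by_element(contacts, "phones.type", "mobile")` returns only Ann's document, and the contact without a phones field is simply not matched.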


Many to Many




Many to Many
Traditional Relational Association

Join table

Contacts
• name
• company
• title
• phone

Groups
• name

GroupContacts
• group_id
• contact_id

X

Use arrays instead


Many to Many
Schema Design Choices

• Reference: group holds contact_ids: [ ] (group N to N contact)

• Reference: contact holds group_ids: [ ] (group N to N contact)

• Redundant to track relationship on both sides
  – Both references must be updated for consistency

• Embed: group embeds contacts (N)

• Embed: contact embeds groups (N)

• Redundant to track relationship on both sides
  – Duplicated data must be updated for consistency
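When the relationship is tracked on both sides, every change has to touch two documents. A small sketch of what "both references must be updated for consistency" implies for application code, using plain dicts; `add_member` and `is_consistent` are hypothetical helpers, not part of any library.

```python
def add_member(group, contact):
    """Record the membership on both sides so the two arrays stay in step."""
    if contact["_id"] not in group["contact_ids"]:
        group["contact_ids"].append(contact["_id"])
    if group["_id"] not in contact["group_ids"]:
        contact["group_ids"].append(group["_id"])


def is_consistent(groups, contacts):
    """Both sides agree when they describe exactly the same (group, contact) pairs."""
    forward = {(g["_id"], cid) for g in groups for cid in g["contact_ids"]}
    backward = {(gid, c["_id"]) for c in contacts for gid in c["group_ids"]}
    return forward == backward


group = {"_id": 1, "name": "chess club", "contact_ids": []}
contact = {"_id": 10, "name": "Ann", "group_ids": []}
```

Appending to only one of the two arrays leaves the data inconsistent, which is the cost the slide points out; tracking the relationship on one side avoids the problem.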


Many to Many
General Recommendation

• Depends on use case
  1. Simple address book
     • Contact references groups
  2. Corporate email groups
     • Group embeds contacts for performance

group N to N contact (contact holds group_ids: [ ])
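For the simple address book case, the contact holds an array of group ids and no join table is needed. A minimal sketch with in-memory documents follows; the sample names and the helpers `contacts_in_group` and `group_names_for` are assumptions made for the example.

```python
groups = [
    {"_id": 1, "name": "family"},
    {"_id": 2, "name": "work"},
]
contacts = [
    {"_id": 10, "name": "Ann", "group_ids": [1, 2]},
    {"_id": 11, "name": "Bob", "group_ids": [2]},
    {"_id": 12, "name": "Cyd"},  # belongs to no group
]


def contacts_in_group(contacts, group_id):
    """Group -> contacts: match any contact whose group_ids array contains the id."""
    return [c["name"] for c in contacts if group_id in c.get("group_ids", [])]


def group_names_for(contact, groups):
    """Contact -> groups: resolve the referenced ids against the groups collection."""
    by_id = {g["_id"]: g["name"] for g in groups}
    return [by_id[gid] for gid in contact.get("group_ids", []) if gid in by_id]
```

Both directions of the association can be answered from the single `group_ids` array, so only one side needs to be kept up to date.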



Contact embeds

• addresses (N): type, street, city, state, zip_code
• phones (N): type, number
• emails (N): type, address
• thumbnail (1): mime_type, data
• twitter (1): name, location, web, bio

Contact references

• Groups (N to N): name

Document model - holistic and efficient representation

{
  "name": "Gary J. Murakami, Ph.D.",
  "company": "10gen (the MongoDB company)",
  "title": "Lead Engineer and Ruby Evangelist",
  "twitter": {
    "name": "GaryMurakami",
    "location": "New Providence, NJ",
    "web": "http://www.nobell.org"
  },
  "portrait_id": 1,
  "addresses": [
    { "type": "work", "street": "229 W 43rd St.", "city": "New York", "zip_code": "10036" }
  ],
  "phones": [
    { "type": "work", "number": "1-866-237-8815 x8015" }
  ],
  "emails": [
    { "type": "work", "address": "[email protected]" },
    { "type": "home", "address": "[email protected]" }
  ]
}

Contact document example



Can We Solve Chess One Day?

• Chess tablebase problem
  – Chess programs often play worse
  – Search is not localized, poor cache performance, seeks
  – Working set too large for memory

• Endgame database size – big data
  – 5 piece: 7 GB compressed 75%
    • 157 MB Shredderbase – 1000x
    • 441 MB Shredderbase – 10,000x
  – 6 piece: 1.2 TB compressed
  – 7 piece: 70 TB estimated by 2015

http://rjlipton.wordpress.com/2010/05/12/can-we-solve-chess-one-day/


http://www.chessbase.com/newsdetail.asp?newsid=3224


Working Set

1. To reduce the working set
   – reference less-used data instead of embedding
     • extract into referenced child document
   – reference bulk data, e.g., portrait

2. To increase resources
   – read from secondaries in a replica set
   – use sharding
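Extracting bulk data into a referenced child document can be sketched as follows. The example assumes a contact that initially embeds its portrait and moves it into a separate `portraits` store keyed by id; the function name `extract_portrait` and the sample data are hypothetical.

```python
def extract_portrait(contact, portraits, portrait_id):
    """Move the embedded portrait into its own document and leave a reference behind."""
    lean = {k: v for k, v in contact.items() if k != "portrait"}
    portraits[portrait_id] = {"_id": portrait_id, **contact["portrait"]}
    lean["portrait_id"] = portrait_id
    return lean


portraits = {}
contact = {
    "name": "Ann",
    "phones": [{"type": "work", "number": "555-0100"}],
    "portrait": {"mime_type": "image/jpeg", "data": b"\x00" * 1024},  # bulk data
}
lean_contact = extract_portrait(contact, portraits, 1)
```

The lean contact keeps the frequently read fields together, while the portrait is only fetched when it is actually needed, which keeps the parent documents and the working set smaller.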

General Recommendations


Embedding over Referencing

• Embed
  – When “one” or “many” objects are viewed with their parent
  – For performance
  – For atomicity

• Reference
  – When you need more scaling: max document size is 16MB
  – For easy “many to many” associations
  – For smaller parent documents and working set


Legacy Migration

1. Copy existing schema & some data to MongoDB

2. Iterate schema design
   1. Measure performance and find bottlenecks
   2. Denormalize by embedding
      1. one to one associations first
      2. one to many associations next
      3. many to many associations last

3. Examine, measure and analyze, review concerns, scaling
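The "denormalize by embedding" step can be illustrated with rows copied from a legacy schema. This is a sketch under the assumption that parent and child rows arrive as plain dicts with `id` and foreign-key columns; `embed_children` and the sample rows are hypothetical.

```python
from collections import defaultdict


def embed_children(parents, children, foreign_key, field):
    """Fold child rows into their parent as an embedded array, dropping join columns."""
    by_parent = defaultdict(list)
    for row in children:
        child = {k: v for k, v in row.items() if k not in ("id", foreign_key)}
        by_parent[row[foreign_key]].append(child)
    docs = []
    for parent in parents:
        doc = {"_id": parent["id"]}
        doc.update({k: v for k, v in parent.items() if k != "id"})
        doc[field] = by_parent.get(parent["id"], [])
        docs.append(doc)
    return docs


contact_rows = [{"id": 2, "name": "Steven Jobs"}, {"id": 3, "name": "No Phones"}]
phone_rows = [
    {"id": 10, "contact_id": 2, "type": "work", "number": "408-996-1010"},
    {"id": 11, "contact_id": 2, "type": "mobile", "number": "555-0100"},
]
contact_docs = embed_children(contact_rows, phone_rows, "contact_id", "phones")
```

After this step the one-to-many join between the two tables is replaced by a `phones` array inside each contact document.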


New Application

1. Focus on your application
   1. Requests
   2. Responses
   3. Business-domain model objects / data structures

2. Then persist language object data to MongoDB
   1. Collections
   2. Associations
   3. Refactor for optimization and add indices


It’s All About Your Application

• Your schema is the impedance matcher
  – Design choices: normalize/denormalize, reference/embed
  – Melds programming with MongoDB for best of both
  – Flexible for development and change

• Programs+Databases = (Big) Data Applications






• Programs×MongoDB = Great Big Data Applications

• Play chess with God







• Play music with God – AAC


Questions?

“His pattern indicates two-dimensional thinking.” - Spock

Star Trek II: The Wrath of Khan

www.3dchessfederation.com

Thank you so much to our community who made An Evening with MongoDB Minneapolis possible:

• David Hussman
• Josh Kennedy
• Matthew Chimento
• Jeffrey Lemmerman
• Dan Chamberlain
• Christopher Rueber
• Erin Newkirk

Thank you DevJam for hosting our event!
