Schema Design by Gary Murakami


Page 1: Schema Design by Gary Murakami

Lead Engineer / Evangelist

Gary J. Murakami, Ph.D.

#MongoDB

Schema Design

Page 4: Schema Design by Gary Murakami

Schema Design – Gary Murakami

Agenda

• What is a Record?

• Core Concepts

• What is an Entity?

• Associating Entities

• General Recommendations

• Questions

Page 5: Schema Design by Gary Murakami

All application development is Schema Design

Page 6: Schema Design by Gary Murakami

Success comes from Proper Data Structure

Page 7: Schema Design by Gary Murakami

What is a Record?

Page 8: Schema Design by Gary Murakami

Key → Value

• One-dimensional

• Single value is a blob

• Query on key only

• No schema

• Value cannot be updated, only replaced

Key Blob

Page 9: Schema Design by Gary Murakami

Relational

• Two-dimensional (tuples)

• Each field is a single value

• Query on any field

• Very structured schema (table)

• In-place updates *

• Normalization requires many tables, joins, and indexes, and leads to poor data locality and performance

Primary Key

Page 10: Schema Design by Gary Murakami

Document

• N-dimensional

• Each field can contain 0, 1, many, or embedded values

• Query on any field & level

• Flexible schema

• Inline updates *

• Embedding related data has optimal data locality, requires fewer indexes, and has better performance

_id
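The three record models above can be compared side by side. This is a minimal sketch using plain Python containers rather than any real database; the names `kv_store`, `contacts_table`, `addresses_table`, and `contacts_collection` are made up for illustration.

```python
import json

# Key -> value: the store sees only an opaque blob, so lookups work by key alone.
kv_store = {"contact:2": json.dumps({"name": "Steven Jobs", "city": "Cupertino"})}

# Relational: flat tuples, one value per field, related data in a second table.
contacts_table = [(2, "Steven Jobs", 1)]      # (id, name, address_id)
addresses_table = [(1, "Cupertino", "CA")]    # (id, city, state)

# Document: nested fields live inside one record.
contacts_collection = [
    {"_id": 2, "name": "Steven Jobs", "address": {"city": "Cupertino", "state": "CA"}}
]


def kv_get(key):
    """Key-value access: fetch by key, then decode the whole blob."""
    return json.loads(kv_store[key])


def relational_city(contact_id):
    """Relational access: join contact to address through address_id."""
    for cid, _name, address_id in contacts_table:
        if cid == contact_id:
            for aid, city, _state in addresses_table:
                if aid == address_id:
                    return city
    return None


def document_city(contact_id):
    """Document access: the address is already inside the contact."""
    for doc in contacts_collection:
        if doc["_id"] == contact_id:
            return doc["address"]["city"]
    return None
```

All three return the same city; what differs is how much work sits between the request and the answer.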

Page 11: Schema Design by Gary Murakami

Core Concepts

Page 12: Schema Design by Gary Murakami

Traditional Schema Design

Focus on data storage

Page 13: Schema Design by Gary Murakami

Document Schema Design

Focus on data use

Page 14: Schema Design by Gary Murakami

Another way to think about it

Traditional: What answers do I have?

Document: What questions do I have?

Page 15: Schema Design by Gary Murakami

Three Building Blocks of Document Schema Design

Page 16: Schema Design by Gary Murakami

1 – Flexibility

• Choices for schema design

• Each record can have different fields

• Field names consistent for programming

• Common structure can be enforced by application

• Easy to evolve as needed

Page 17: Schema Design by Gary Murakami

2 – Arrays

Multiple Values per Field

• Each field can be:
– Absent
– Set to null
– Set to a single value
– Set to an array of many values

• Query for any matching value
– Can be indexed, and each value in the array is in the index
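The idea that each value in an array gets its own index entry can be sketched in a few lines. This is a simplified illustration in plain Python, not how MongoDB builds its indexes; the `tags` field and the helpers `values_of` and `build_index` are hypothetical, and the sketch simply skips documents where the field is absent.

```python
from collections import defaultdict

contacts = [
    {"_id": 1, "name": "A", "tags": ["work", "chess"]},  # array of many values
    {"_id": 2, "name": "B", "tags": "work"},             # single value
    {"_id": 3, "name": "C", "tags": None},               # set to null
    {"_id": 4, "name": "D"},                             # field absent
]


def values_of(doc, field):
    """Return the values a field contributes: every array element, or the single value."""
    if field not in doc:
        return []  # absent field contributes nothing in this sketch
    value = doc[field]
    return list(value) if isinstance(value, list) else [value]


def build_index(docs, field):
    """Map each individual value to the _ids of the documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for value in values_of(doc, field):
            index[value].add(doc["_id"])
    return index


tag_index = build_index(contacts, "tags")
```

Looking up `tag_index["work"]` finds both the document with a single value and the one holding an array, which is the "query for any matching value" behaviour described above.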

Page 18: Schema Design by Gary Murakami

3 – Embedded Documents

• Any value can be a document

• Nested documents provide structure

• Query any field at any level
– Can be indexed
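Querying "any field at any level" is usually written with a dotted path such as `address.city`. The sketch below imitates that over plain Python dicts; `get_path` and `find` are hypothetical helpers, not driver calls.

```python
def get_path(doc, path):
    """Follow a dotted path such as "address.city" through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def find(docs, path, expected):
    """Return the documents whose value at the dotted path equals expected."""
    return [doc for doc in docs if get_path(doc, path) == expected]


contacts = [
    {"name": "Steven Jobs", "address": {"city": "Cupertino", "state": "CA"}},
    {"name": "Larry Page", "address": {"city": "Palo Alto", "state": "CA"}},
]

palo_alto = find(contacts, "address.city", "Palo Alto")
```

The same path string works for top-level and nested fields, so the query style does not change as the document gains structure.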

Page 20: Schema Design by Gary Murakami

What is an Entity?

Page 21: Schema Design by Gary Murakami

An Entity

• Object in your model

• Associations with other entities

Referencing (Relational)    Embedding (Document)
has_one                     embeds_one
belongs_to                  embedded_in
has_many                    embeds_many
has_and_belongs_to_many

MongoDB has both referencing and embedding for universal coverage

Page 22: Schema Design by Gary Murakami

Let's model something together

How about a business card?

Page 23: Schema Design by Gary Murakami

Business Card

Page 24: Schema Design by Gary Murakami

Referencing

Contacts

{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "phone": "408-996-1010",
  "address_id": 1
}

Addresses

{
  "_id": 1,
  "street": "10260 Bandley Dr",
  "city": "Cupertino",
  "state": "CA",
  "zip_code": "95014",
  "country": "USA"
}

Page 25: Schema Design by Gary Murakami

Embedding

Contacts

{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014",
    "country": "USA"
  },
  "phone": "408-996-1010"
}
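The practical difference between the referencing and embedding versions of the business card is the number of lookups needed to read the address. A minimal sketch, with an in-memory dict standing in for the addresses collection; the helper names are hypothetical.

```python
# Stand-in for the addresses collection, keyed by _id.
addresses = {1: {"_id": 1, "street": "10260 Bandley Dr", "city": "Cupertino"}}

referencing_contact = {"_id": 2, "name": "Steven Jobs", "address_id": 1}

embedding_contact = {
    "_id": 2,
    "name": "Steven Jobs",
    "address": {"street": "10260 Bandley Dr", "city": "Cupertino"},
}


def city_by_reference(contact, address_lookup):
    """Second lookup: follow address_id into the addresses collection."""
    return address_lookup[contact["address_id"]]["city"]


def city_embedded(contact):
    """No second lookup: the address travels with the contact."""
    return contact["address"]["city"]
```

With referencing, the caller needs both the contact and access to the addresses; with embedding, the contact alone is enough.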

Page 26: Schema Design by Gary Murakami

Relational Schema

Contact
• name
• company
• title
• phone

Address
• street
• city
• state
• zip_code

Page 27: Schema Design by Gary Murakami

Document Schema

Contact
• name
• company
• title
• phone
• address
  • street
  • city
  • state
  • zip_code

Page 28: Schema Design by Gary Murakami

How are they different? Why?

Relational: Contact (name, company, title, phone) and Address (street, city, state, zip_code) are separate entities

Document: Contact (name, company, title, phone) contains an embedded address (street, city, state, zip_code)

Page 29: Schema Design by Gary Murakami

Schema Flexibility

{
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014"
  },
  "phone": "408-996-1010"
}

{
  "name": "Larry Page",
  "url": "http://google.com/",
  "title": "CEO",
  "company": "Google!",
  "email": "[email protected]",
  "address": {
    "street": "555 Bryant, #106",
    "city": "Palo Alto",
    "state": "CA",
    "zip_code": "94301"
  },
  "phone": "650-618-1499",
  "fax": "650-330-0100"
}
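Because the two business cards carry different fields, any common structure has to be checked by the application. The sketch below shows one way that might look; `REQUIRED_FIELDS` is an assumed application rule, not something the slides specify, and the documents are trimmed versions of the two cards.

```python
REQUIRED_FIELDS = {"name", "company"}  # hypothetical rule chosen by the application


def missing_fields(doc):
    """Return the required fields the document lacks, sorted for stable output."""
    return sorted(REQUIRED_FIELDS - set(doc))


def display_line(doc):
    """Render a contact, using optional fields only when they are present."""
    parts = [doc["name"], doc["company"]]
    if "url" in doc:
        parts.append(doc["url"])
    if "fax" in doc:
        parts.append("fax " + doc["fax"])
    return ", ".join(parts)


jobs = {"name": "Steven Jobs", "company": "Apple Computer", "phone": "408-996-1010"}
page = {
    "name": "Larry Page",
    "company": "Google!",
    "url": "http://google.com/",
    "fax": "650-330-0100",
}
```

Both documents pass the same check and go through the same rendering code, even though only one of them has `url` and `fax`.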

Page 30: Schema Design by Gary Murakami

Longest “Database Endgame” Mate

• Augment schema with metadata
– Distance to mate (DTM)
– Distance to conversion (DTC)

• Retrograde analysis of DB

• Longest checkmate
– 6 piece – 262 moves, KRNKNN
– 7 piece – 517 moves, so far
  • Completion by 2015

Page 31: Schema Design by Gary Murakami

Example

Page 32: Schema Design by Gary Murakami

Let’s Look at an Address Book

Page 33: Schema Design by Gary Murakami

Address Book

• What questions do I have?

• What are my entities?

• What are my associations?

Page 34: Schema Design by Gary Murakami

Address Book Entity-Relationship

Contacts: name, company, title

• Contacts 1 to N Addresses: type, street, city, state, zip_code
• Contacts 1 to N Phones: type, number
• Contacts 1 to N Emails: type, address
• Contacts 1 to 1 Thumbnails: mime_type, data
• Contacts 1 to 1 Portraits: mime_type, data
• Contacts N to N Groups: name
• Contacts 1 to 1 Twitters: name, location, web, bio

Page 35: Schema Design by Gary Murakami

Associating Entities

Page 36: Schema Design by Gary Murakami

One to One

(Address Book entity-relationship diagram, repeated from Page 34)

Page 37: Schema Design by Gary Murakami

One to One

Schema Design Choices

Referencing
• contact (twitter_id) 1 to 1 twitter
• contact 1 to 1 twitter (contact_id)

Redundant to track relationship on both sides
• Both references must be updated for consistency

• Saves a fetch if no twitter

Embedding
• Contact (twitter) embeds 1 twitter

Page 38: Schema Design by Gary Murakami

One to One

General Recommendation

• Full contact info all at once
– Contact embeds twitter

• Parent-child relationship
– “contains”

• No additional data duplication

• Can query or index on embedded field
– e.g., “twitter.name”

Contact (twitter) embeds 1 twitter

Page 39: Schema Design by Gary Murakami

One to Many

(Address Book entity-relationship diagram, repeated from Page 34)

Page 40: Schema Design by Gary Murakami

One to Many

Schema Design Choices

Referencing
• contact (phone_ids: [ ]) 1 to N phone
• contact 1 to N phone (contact_id)

Redundant to track relationship on both sides
• Both references must be updated for consistency

• Not possible in relational DBs
• Saves a fetch if no phones

Embedding
• Contact (phones) embeds N phone

Page 41: Schema Design by Gary Murakami

One to Many

General Recommendation

• Full contact info all at once
– Contact embeds multiple phones

• Parent-children relationship
– “contains”

• No additional data duplication

• Can query or index on any field
– e.g., { “phones.type”: “mobile” }

Contact (phones) embeds N phone
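A query such as { “phones.type”: “mobile” } has to look inside every phone embedded in the contact. The sketch below imitates that matching rule over plain Python data; `matches` is a hypothetical helper written for this illustration, and the sample contacts and numbers are invented.

```python
def matches(doc, path, expected):
    """True if any value reached by the dotted path equals expected.

    Lists met along the path are fanned out, so "phones.type" checks every phone.
    """
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    # A final value that is itself a list matches if any element equals expected.
    flat = []
    for value in values:
        flat.extend(value if isinstance(value, list) else [value])
    return expected in flat


contacts = [
    {"name": "Contact A", "phones": [{"type": "work", "number": "555-0100"}]},
    {
        "name": "Contact B",
        "phones": [
            {"type": "work", "number": "555-0101"},
            {"type": "mobile", "number": "555-0102"},
        ],
    },
]

mobile_owners = [c["name"] for c in contacts if matches(c, "phones.type", "mobile")]
```

The contact matches as a whole when any one of its embedded phones does, so the caller gets the full contact back in a single read.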

Page 42: Schema Design by Gary Murakami

Many to Many

(Address Book entity-relationship diagram, repeated from Page 34)

Page 43: Schema Design by Gary Murakami

Many to Many

Traditional Relational Association

Contacts
• name
• company
• title
• phone

Groups
• name

Join table: GroupContacts (crossed out)
• group_id
• contact_id

Use arrays instead

Page 44: Schema Design by Gary Murakami

Many to Many

Schema Design Choices

Referencing
• group (contact_ids: [ ]) N to N contact
• group N to N contact (group_ids: [ ])

Redundant to track relationship on both sides
• Both references must be updated for consistency

Embedding
• group (contacts) embeds N contact
• contact (groups) embeds N group

Redundant to track relationship on both sides
• Duplicated data must be updated for consistency

Page 45: Schema Design by Gary Murakami

Many to Many

General Recommendation

• Depends on use case

1. Simple address book
• Contact references groups

2. Corporate email groups
• Group embeds contacts for performance

group N to N contact (group_ids: [ ])
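For the simple address book case, where the contact references its groups, both directions of the association can be answered from the single `group_ids` array. A minimal sketch over in-memory lists; the sample names and ids are invented and `groups_of` and `members_of` are hypothetical helpers.

```python
groups = [{"_id": 10, "name": "family"}, {"_id": 11, "name": "chess club"}]

contacts = [
    {"_id": 1, "name": "Ann", "group_ids": [10, 11]},
    {"_id": 2, "name": "Bob", "group_ids": [11]},
    {"_id": 3, "name": "Cy", "group_ids": []},
]


def groups_of(contact, all_groups):
    """Contact -> groups: resolve the ids stored on the contact."""
    wanted = set(contact["group_ids"])
    return [g["name"] for g in all_groups if g["_id"] in wanted]


def members_of(group_id, all_contacts):
    """Group -> contacts: match the group id against each contact's array."""
    return [c["name"] for c in all_contacts if group_id in c["group_ids"]]
```

The relationship is stored on one side only, so there is no second copy to keep consistent and no join table.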

Page 46: Schema Design by Gary Murakami

Document model - holistic and efficient representation

Contacts: name, company, title

Embedded in Contacts
• addresses (N): type, street, city, state, zip_code
• phones (N): type, number
• emails (N): type, address
• thumbnail (1): mime_type, data
• twitter (1): name, location, web, bio

Referenced from Contacts
• Portraits (1 to 1): mime_type, data
• Groups (N to N): name

Page 47: Schema Design by Gary Murakami

Contact document example

{
  "name": "Gary J. Murakami, Ph.D.",
  "company": "10gen (the MongoDB company)",
  "title": "Lead Engineer and Ruby Evangelist",
  "twitter": {
    "name": "GaryMurakami",
    "location": "New Providence, NJ",
    "web": "http://www.nobell.org"
  },
  "portrait_id": 1,
  "addresses": [
    { "type": "work", "street": "229 W 43rd St.", "city": "New York", "zip_code": "10036" }
  ],
  "phones": [
    { "type": "work", "number": "1-866-237-8815 x8015" }
  ],
  "emails": [
    { "type": "work", "address": "[email protected]" },
    { "type": "home", "address": "[email protected]" }
  ]
}

Page 48: Schema Design by Gary Murakami

Can We Solve Chess One Day?

• Chess tablebase problem
– Chess programs often play worse
– Search is not localized, poor cache performance, seeks
– Working set too large for memory

• Endgame database size – big data
– 5 piece: 7 GB compressed 75%
  • 157 MB Shredderbase – 1000x
  • 441 MB Shredderbase – 10,000x
– 6 piece: 1.2 TB compressed
– 7 piece: 70 TB estimated by 2015

Page 49: Schema Design by Gary Murakami

Working Set

1. To reduce the working set
– reference less-used data instead of embedding
  • extract into referenced child document
– reference bulk data, e.g., portrait

2. To increase resources
– read from secondaries in a replica set
– use sharding
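Extracting bulk data into a referenced child document can be sketched as a small transformation. The example assumes a contact that starts out with an embedded `portrait` field; that starting shape, the image bytes, and the helper `extract_portrait` are illustrative assumptions.

```python
def extract_portrait(contact, portrait_id):
    """Split a contact into a slim parent and a separate portrait document."""
    slim = {k: v for k, v in contact.items() if k != "portrait"}
    slim["portrait_id"] = portrait_id  # the reference replaces the embedded bulk data
    portrait = {"_id": portrait_id, **contact["portrait"]}
    return slim, portrait


fat_contact = {
    "name": "Gary J. Murakami, Ph.D.",
    "portrait": {"mime_type": "image/jpeg", "data": b"\xff" * 50_000},  # placeholder bytes
}

slim_contact, portrait_doc = extract_portrait(fat_contact, portrait_id=1)
```

Most reads touch only the slim contact, so the large image stays out of memory until something asks for it by `portrait_id`.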

Page 50: Schema Design by Gary Murakami

General Recommendations

Page 51: Schema Design by Gary Murakami

Embedding over Referencing

• Embed
– When “one” or “many” objects are viewed with their parent
– For performance
– For atomicity

• Reference
– When you need more scaling: max document size is 16MB
– For easy “many to many” associations
– For smaller parent documents and working set

Page 52: Schema Design by Gary Murakami

Legacy Migration

1. Copy existing schema & some data to MongoDB

2. Iterate schema design
   1. Measure performance and find bottlenecks
   2. Denormalize by embedding
      1. one to one associations first
      2. one to many associations next
      3. many to many associations last

3. Examine, measure and analyze, review concerns, scaling
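The "denormalize by embedding" step can be sketched as two small transformations applied in the order the slide suggests: one to one first, then one to many. The table contents and the helpers `embed_one` and `embed_many` are hypothetical, and plain Python lists stand in for the copied legacy tables.

```python
def embed_one(parents, children, fk, field):
    """Replace a foreign key on each parent with the referenced child document."""
    by_id = {child["_id"]: child for child in children}
    result = []
    for parent in parents:
        doc = {k: v for k, v in parent.items() if k != fk}
        child = by_id.get(parent.get(fk))
        if child is not None:
            doc[field] = {k: v for k, v in child.items() if k != "_id"}
        result.append(doc)
    return result


def embed_many(parents, children, fk, field):
    """Gather the children that point at each parent into an embedded array."""
    result = []
    for parent in parents:
        doc = dict(parent)
        doc[field] = [
            {k: v for k, v in child.items() if k not in ("_id", fk)}
            for child in children
            if child[fk] == parent["_id"]
        ]
        result.append(doc)
    return result


# Rows as they might look right after copying the legacy schema.
contacts = [{"_id": 2, "name": "Steven Jobs", "address_id": 1}]
addresses = [{"_id": 1, "city": "Cupertino", "state": "CA"}]
phones = [{"_id": 7, "contact_id": 2, "type": "work", "number": "408-996-1010"}]

step1 = embed_one(contacts, addresses, "address_id", "address")  # one to one first
step2 = embed_many(step1, phones, "contact_id", "phones")        # one to many next
```

Each step can be measured on its own before moving to the next, which fits the iterate-and-measure loop in the list above.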

Page 53: Schema Design by Gary Murakami

New Application

1. Focus on your application
   1. Requests
   2. Responses
   3. Business-domain model objects / data structures

2. Then persist language object data to MongoDB
   1. Collections
   2. Associations
   3. Refactor for optimization and add indices

Page 54: Schema Design by Gary Murakami

It’s All About Your Application

• Your schema is the impedance matcher
– Design choices: normalize/denormalize, reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change

• Programs+Databases = (Big) Data Applications

Page 55: Schema Design by Gary Murakami

It’s All About Your Application

• Your schema is the impedance matcher
– Design choices: normalize/denormalize, reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change

• Programs×MongoDB = Great Big Data Applications

• Play chess with God

Page 56: Schema Design by Gary Murakami

It’s All About Your Application

• Your schema is the impedance matcher
– Design choices: normalize/denormalize, reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change

• Programs×MongoDB = Great Big Data Applications

• Play music with God – AAC

Page 57: Schema Design by Gary Murakami
Page 58: Schema Design by Gary Murakami

Lead Engineer / Evangelist

Gary J. Murakami, Ph.D.

#MongoDB

Questions?

“His pattern indicates two-dimensional thinking.” - Spock

Star Trek II: The Wrath of Khan

www.3dchessfederation.com

Page 59: Schema Design by Gary Murakami

Thank you so much to our community who made An Evening with MongoDB Minneapolis possible:

• David Hussman
• Josh Kennedy
• Matthew Chimento
• Jeffrey Lemmerman
• Dan Chamberlain
• Christopher Rueber
• Erin Newkirk

Thank you DevJam for hosting our event!