Schema Design by Gary Murakami
Transcript of Schema Design by Gary Murakami
Lead Engineer / Evangelist
Gary J. Murakami, Ph.D.
#MongoDB
Schema Design
Schema Design – Gary Murakami
Chess 4.5 (Northwestern University)
Larry Atkin & Dave Slate
chessprogramming.wikispaces.com
Agenda
• What is a Record?
• Core Concepts
• What is an Entity?
• Associating Entities
• General Recommendations
• Questions
All application development is Schema Design
Success comes from Proper Data Structure
What is a Record?
Key → Value
• One-dimensional
• Single value is a blob
• Query on key only
• No schema
• Value cannot be updated, only replaced
Key Blob
Relational
• Two-dimensional (tuples)
• Each field is a single value
• Query on any field
• Very structured schema (table)
• In-place updates *
• Normalization requires many tables, joins, and indexes, and leads to poor data locality and performance
Primary Key
Document
• N-dimensional
• Each field can contain 0, 1, many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data gives good data locality, requires fewer indexes, and performs better
_id
Core Concepts
Traditional Schema Design
Focus on data storage
Document Schema Design
Focus on data use
Another way to think about it
Traditional: What answers do I have?
Document: What questions do I have?
Three Building Blocks of Document Schema Design
1 – Flexibility
• Choices for schema design
• Each record can have different fields
• Field names consistent for programming
• Common structure can be enforced by application
• Easy to evolve as needed
2 – Arrays: Multiple Values per Field
• Each field can be:
  – Absent
  – Set to null
  – Set to a single value
  – Set to an array of many values
• Query for any matching value
  – Can be indexed, and each value in the array is in the index
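As a rough illustration of the array behaviour described above, the sketch below mimics an equality match against a field that may be absent, null, a single value, or an array. The `matches` helper and the sample documents are hypothetical, and this is plain Python rather than MongoDB driver code.

```python
def matches(doc, field, wanted):
    """Return True if doc[field] equals wanted, or is a list containing it."""
    if field not in doc:           # field absent: no match in this sketch
        return False
    value = doc[field]
    if isinstance(value, list):    # array: any element may match
        return wanted in value
    return value == wanted         # single value (or None)


docs = [
    {"name": "a"},                            # field absent
    {"name": "b", "tags": None},              # set to null
    {"name": "c", "tags": "work"},            # single value
    {"name": "d", "tags": ["home", "work"]},  # array of many values
]

hits = [d["name"] for d in docs if matches(d, "tags", "work")]
print(hits)
```

The same predicate handles the single-value and array cases, which is the point of the slide: the query does not need to know which shape a given document uses.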
3 – Embedded Documents
• Any value can be a document
• Nested documents provide structure
• Query any field at any level
  – Can be indexed
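To make "query any field at any level" concrete, here is a minimal sketch of resolving a dotted path through nested documents. The `get_path` helper is hypothetical and only imitates the dotted-field notation on plain Python dicts; it does not cover arrays.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None            # path does not exist in this document
        current = current[key]
    return current


contact = {
    "name": "Steven Jobs",
    "address": {"city": "Cupertino", "state": "CA"},
}

city = get_path(contact, "address.city")
missing = get_path(contact, "address.country")
print(city, missing)
```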
Belle and Endgame tablebases
Play chess with God – Ken Thompson
What is an Entity?
An Entity
• Object in your model
• Associations with other entities
Referencing (Relational)  | Embedding (Document)
has_one                   | embeds_one
belongs_to                | embedded_in
has_many                  | embeds_many
has_and_belongs_to_many   |
MongoDB has both referencing and embedding for universal coverage
Let's model something together
How about a business card?
Business Card
Referencing
Contacts
{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "phone": "408-996-1010",
  "address_id": 1
}
Addresses
{
  "_id": 1,
  "street": "10260 Bandley Dr",
  "city": "Cupertino",
  "state": "CA",
  "zip_code": "95014",
  "country": "USA"
}
Embedding
Contacts
{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014",
    "country": "USA"
  },
  "phone": "408-996-1010"
}
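The difference between the referencing and embedding versions of the business card can be sketched by counting lookups. The dictionaries below are in-memory stand-ins for collections, and the two function names are hypothetical; a real application would issue queries through a driver.

```python
# In-memory stand-ins for two collections, keyed by _id.
addresses = {1: {"_id": 1, "street": "10260 Bandley Dr", "city": "Cupertino"}}
contacts_ref = {2: {"_id": 2, "name": "Steven Jobs", "address_id": 1}}
contacts_emb = {
    2: {
        "_id": 2,
        "name": "Steven Jobs",
        "address": {"street": "10260 Bandley Dr", "city": "Cupertino"},
    }
}


def city_by_reference(contact_id):
    """Two lookups: the contact, then the address it points to."""
    contact = contacts_ref[contact_id]
    address = addresses[contact["address_id"]]
    return address["city"], 2


def city_by_embedding(contact_id):
    """One lookup: the address travels with the contact."""
    contact = contacts_emb[contact_id]
    return contact["address"]["city"], 1


print(city_by_reference(2))
print(city_by_embedding(2))
```

Both return the same city; the second element of each tuple is the number of lookups the approach needed.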
Relational Schema
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code
Document Schema
Contact
• name
• company
• title
• phone
• address
  • street
  • city
  • state
  • zip_code
How are they different? Why?
Schema Flexibility
{
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014"
  },
  "phone": "408-996-1010"
}
{
  "name": "Larry Page",
  "url": "http://google.com/",
  "title": "CEO",
  "company": "Google!",
  "email": "[email protected]",
  "address": {
    "street": "555 Bryant, #106",
    "city": "Palo Alto",
    "state": "CA",
    "zip_code": "94301"
  },
  "phone": "650-618-1499",
  "fax": "650-330-0100"
}
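One way an application might cope with documents that carry different fields is to treat the extras as optional. The sketch below is an assumption about application code, not something shown in the talk; `contact_line` is a hypothetical helper and the documents are trimmed versions of the two cards above.

```python
contacts = [
    {"name": "Steven Jobs", "company": "Apple Computer", "phone": "408-996-1010"},
    {"name": "Larry Page", "company": "Google!", "phone": "650-618-1499",
     "fax": "650-330-0100", "url": "http://google.com/"},
]


def contact_line(doc):
    """Format a contact, including optional fields only when present."""
    parts = [doc["name"], doc["company"], "phone " + doc["phone"]]
    for optional in ("fax", "url"):
        if optional in doc:
            parts.append(optional + " " + doc[optional])
    return ", ".join(parts)


lines = [contact_line(c) for c in contacts]
for line in lines:
    print(line)
```

Field names stay consistent across documents, so the same code path serves both cards; only the presence check differs.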
Longest “Database Endgame” Mate
• Augment schema with metadata
  – Distance to mate (DTM)
  – Distance to conversion (DTC)
• Retrograde analysis of DB
• Longest checkmate
  – 6 piece – 262 moves, KRNKNN
  – 7 piece – 517 moves, so far
    • Completion by 2015
Example
Let’s Look at an Address Book
Address Book
• What questions do I have?
• What are my entities?
• What are my associations?
Address Book Entity-Relationship
[Diagram: Contacts linked to the entities below; edges labeled 1 or N]
Contacts: name, company, title
Addresses: type, street, city, state, zip_code
Phones: type, number
Emails: type, address
Thumbnails: mime_type, data
Portraits: mime_type, data
Groups: name
Twitters: name, location, web, bio
Associating Entities
One to One
One to One: Schema Design Choices
• contact holds twitter_id (reference to twitter, 1:1)
  – Saves a fetch if no twitter
• twitter holds contact_id (reference to contact, 1:1)
• Redundant to track relationship on both sides
  – Both references must be updated for consistency
• Contact embeds twitter (1)
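The consistency burden of tracking a one-to-one relationship on both sides can be sketched as follows. The `link` and `is_consistent` helpers and the sample `_id` values are hypothetical; the sketch only shows that updating one side alone leaves the pair out of step.

```python
def link(contact, twitter):
    """Record the one-to-one association on both documents."""
    contact["twitter_id"] = twitter["_id"]
    twitter["contact_id"] = contact["_id"]


def is_consistent(contact, twitter):
    """True only if each side points at the other."""
    return (contact.get("twitter_id") == twitter["_id"]
            and twitter.get("contact_id") == contact["_id"])


contact = {"_id": 2, "name": "Steven Jobs"}
twitter = {"_id": 7, "name": "example_handle"}   # hypothetical handle

link(contact, twitter)
ok_after_link = is_consistent(contact, twitter)

contact["twitter_id"] = None                     # only one side updated
ok_after_partial_update = is_consistent(contact, twitter)
print(ok_after_link, ok_after_partial_update)
```

Embedding avoids the problem because there is only one document to update.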
One to One: General Recommendation
• Full contact info all at once
  – Contact embeds twitter
• Parent-child relationship
  – “contains”
• No additional data duplication
• Can query or index on embedded field
  – e.g., “twitter.name”
[Diagram: Contact embeds twitter (1)]
One to Many
One to Many: Schema Design Choices
• contact holds phone_ids: [ ] (array of references to phone, 1:N)
  – Not possible in relational DBs
  – Saves a fetch if no phones
• phone holds contact_id (reference to contact, 1:N)
• Redundant to track relationship on both sides
  – Both references must be updated for consistency
• Contact embeds phones (N)
One to Many: General Recommendation
• Full contact info all at once
  – Contact embeds multiple phones
• Parent-children relationship
  – “contains”
• No additional data duplication
• Can query or index on any field
  – e.g., { “phones.type”: “mobile” }
[Diagram: Contact embeds phones (N)]
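A query such as { “phones.type”: “mobile” } selects contacts with at least one embedded phone of that type. The sketch below imitates that rule on plain Python lists; the contact names, phone numbers, and the `with_phone_type` helper are made up for illustration.

```python
contacts = [
    {"name": "Ann", "phones": [{"type": "work", "number": "555-0100"},
                               {"type": "mobile", "number": "555-0101"}]},
    {"name": "Bob", "phones": [{"type": "home", "number": "555-0102"}]},
    {"name": "Cy"},   # no phones at all
]


def with_phone_type(docs, phone_type):
    """Names of contacts having at least one embedded phone of the given type."""
    return [
        d["name"]
        for d in docs
        if any(p.get("type") == phone_type for p in d.get("phones", []))
    ]


mobile_owners = with_phone_type(contacts, "mobile")
print(mobile_owners)
```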
Many to Many
Many to Many: Traditional Relational Association
Join table
Contacts: name, company, title, phone
Groups: name
GroupContacts (marked X): group_id, contact_id
Use arrays instead
Many to Many: Schema Design Choices
• group holds contact_ids: [ ] (references to contact, N:N)
• contact holds group_ids: [ ] (references to group, N:N)
• Redundant to track relationship on both sides
  – Both references must be updated for consistency
• group embeds contacts (N)
• contact embeds groups (N)
• Redundant to track relationship on both sides
  – Duplicated data must be updated for consistency
Many to Many: General Recommendation
• Depends on use case
  1. Simple address book
     • Contact references groups
  2. Corporate email groups
     • Group embeds contacts for performance
[Diagram: contact holds group_ids: [ ] referencing group (N:N)]
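For the simple address book case, where a contact holds an array of group references, both directions of the association can be answered without a join table. This is a minimal sketch with made-up names and ids; `groups_of` and `members_of` are hypothetical helpers over in-memory data.

```python
groups = {
    1: {"_id": 1, "name": "family"},
    2: {"_id": 2, "name": "work"},
}
contacts = [
    {"name": "Ann", "group_ids": [1, 2]},
    {"name": "Bob", "group_ids": [2]},
    {"name": "Cy", "group_ids": []},
]


def groups_of(contact):
    """Resolve a contact's group references to group names."""
    return [groups[gid]["name"] for gid in contact["group_ids"]]


def members_of(group_id):
    """Scan contacts for those whose group_ids array contains group_id."""
    return [c["name"] for c in contacts if group_id in c["group_ids"]]


print(groups_of(contacts[0]))
print(members_of(2))
```

The relationship is stored on one side only, so there is a single place to update when membership changes.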
Document model - holistic and efficient representation
[Diagram: Contacts with embedded entities; edges labeled 1 or N]
Contacts: name, company, title
  addresses: type, street, city, state, zip_code
  phones: type, number
  emails: type, address
  thumbnail: mime_type, data
  twitter: name, location, web, bio
Portraits: mime_type, data
Groups: name
Contact document example
{
  "name": "Gary J. Murakami, Ph.D.",
  "company": "10gen (the MongoDB) company",
  "title": "Lead Engineer and Ruby Evangelist",
  "twitter": {
    "name": "GaryMurakami",
    "location": "New Providence, NJ",
    "web": "http://www.nobell.org"
  },
  "portrait_id": 1,
  "addresses": [
    { "type": "work", "street": "229 W 43rd St.", "city": "New York", "zip_code": "10036" }
  ],
  "phones": [
    { "type": "work", "number": "1-866-237-8815 x8015" }
  ],
  "emails": [
    { "type": "work", "address": "[email protected]" },
    { "type": "home", "address": "[email protected]" }
  ]
}
Can We Solve Chess One Day?
• Chess tablebase problem
  – Chess programs often play worse
  – Search is not localized, poor cache performance, seeks
  – Working set too large for memory
• Endgame database size – big data
  – 5 piece: 7 GB compressed 75%
    • 157 MB Shredderbase – 1000x
    • 441 MB Shredderbase – 10,000x
  – 6 piece: 1.2 TB compressed
  – 7 piece: 70 TB estimated by 2015
Working Set
1. To reduce the working set
   – reference less-used data instead of embedding
     • extract into referenced child document
   – reference bulk data, e.g., portrait
2. To increase resources
   – read from secondaries in a replica set
   – use sharding
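Referencing bulk data to shrink the working set might look like the sketch below, which moves an embedded portrait into a separate store and leaves only a `portrait_id` behind. The `extract_portrait` helper, the toy id scheme, and the sample payload are assumptions for illustration.

```python
def extract_portrait(contact, portraits):
    """Move the embedded portrait into `portraits` and leave a reference behind."""
    slim = {k: v for k, v in contact.items() if k != "portrait"}
    if "portrait" in contact:
        portrait_id = len(portraits) + 1        # toy id scheme for the sketch
        portraits[portrait_id] = dict(contact["portrait"], _id=portrait_id)
        slim["portrait_id"] = portrait_id
    return slim


portraits = {}
fat = {
    "name": "Gary J. Murakami, Ph.D.",
    "portrait": {"mime_type": "image/jpeg", "data": b"\x00" * 100_000},
}
slim = extract_portrait(fat, portraits)
print(sorted(slim), len(portraits[slim["portrait_id"]]["data"]))
```

The contact document that is read most often no longer carries the image bytes; the portrait is fetched only when it is actually needed.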
General Recommendations
Embedding over Referencing
• Embed
  – When “one” or “many” objects are viewed with their parent
  – For performance
  – For atomicity
• Reference
  – When you need more scaling: max document size is 16MB
  – For easy “many to many” associations
  – For smaller parent documents and working set
Legacy Migration
1. Copy existing schema & some data to MongoDB
2. Iterate schema design
   1. Measure performance and find bottlenecks
   2. Denormalize by embedding
      1. one to one associations first
      2. one to many associations next
      3. many to many associations last
3. Examine, measure and analyze, review concerns, scaling
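The "denormalize by embedding" step for a one-to-many association could be sketched as below: rows from a child table are grouped under their parent and the foreign key is dropped. The table layouts, column names, and the `embed_phones` helper are assumed for illustration and do not come from the talk.

```python
contact_rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
]
phone_rows = [
    {"contact_id": 1, "type": "work", "number": "555-0100"},
    {"contact_id": 1, "type": "mobile", "number": "555-0101"},
]


def embed_phones(contacts, phones):
    """Build one document per contact with its phone rows embedded as an array."""
    by_contact = {}
    for row in phones:
        entry = {k: v for k, v in row.items() if k != "contact_id"}
        by_contact.setdefault(row["contact_id"], []).append(entry)
    return [
        {"_id": c["id"], "name": c["name"], "phones": by_contact.get(c["id"], [])}
        for c in contacts
    ]


documents = embed_phones(contact_rows, phone_rows)
print(documents)
```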
New Application
1. Focus on your application
   1. Requests
   2. Responses
   3. Business-domain model objects / data structures
2. Then persist language object data to MongoDB
   1. Collections
   2. Associations
   3. Refactor for optimization and add indices
It’s All About Your Application
• Your schema is the impedance matcher
  – Design choices: normalize/denormalize, reference/embed
  – Melds programming with MongoDB for best of both
  – Flexible for development and change
• Programs+Databases = (Big) Data Applications
• Programs×MongoDB = Great Big Data Applications
• Play chess with God
• Play music with God – AAC
Lead Engineer / Evangelist
Gary J. Murakami, Ph.D.
#MongoDB
Questions?
“His pattern indicates two-dimensional thinking.” - Spock
Star Trek II: The Wrath of Khan
www.3dchessfederation.com
Thank you so much to our community who made An Evening with MongoDB Minneapolis possible:
• David Hussman
• Josh Kennedy
• Matthew Chimento
• Jeffrey Lemmerman
• Dan Chamberlain
• Christopher Rueber
• Erin Newkirk
Thank you DevJam for hosting our event!