MongoDB Days UK: Jumpstart: Schema Design

46
Schema Design (and its performance implications) Andrew Erlichson VP, Engineering [email protected] @erlichson

Transcript of MongoDB Days UK: Jumpstart: Schema Design

Page 1: MongoDB Days UK: Jumpstart: Schema Design

Schema Design(and its performance implications)

Andrew ErlichsonVP, [email protected]@erlichson

Page 2: MongoDB Days UK: Jumpstart: Schema Design

2

Agenda

1. Example Problem

2. MongoDB Schema Design vs. Relational

3. Modeling Relationships

4. Schema Design and Performance

Page 3: MongoDB Days UK: Jumpstart: Schema Design

Example Problem

Page 4: MongoDB Days UK: Jumpstart: Schema Design

4

Medical Records• Collects all patient information in a central repository• Provide central point of access for

– Patients– Care providers: physicians, nurses, etc.– Billing– Insurance reconciliation

• Hospitals, physicians, patients, procedures, records

PatientRecords

Medications

Lab Results

Procedures

Hospital Records

Physicians

Patients

Nurses

Billing

Page 5: MongoDB Days UK: Jumpstart: Schema Design

5

Medical Record Data

• Hospitals – have physicians

• Physicians– Have patients– Perform procedures– Belong to hospitals

• Patients– Have physicians– Are the subject of procedures

• Procedures– Associated with a patient– Associated with a physician– Have a record– Variable meta data

• Records– Associated with a procedure– Binary data– Variable fields

Page 6: MongoDB Days UK: Jumpstart: Schema Design

6

Lot of Variability

Page 7: MongoDB Days UK: Jumpstart: Schema Design

Relational View

Page 8: MongoDB Days UK: Jumpstart: Schema Design

Schema Design:

MongoDB vs. Relational

Page 9: MongoDB Days UK: Jumpstart: Schema Design

MongoDB Relational

Collections Tables

Documents Rows

Data Use Data Storage

What questions do I have? What answers do I have?

MongoDB versus Relational

Page 10: MongoDB Days UK: Jumpstart: Schema Design

Complex Normalized Schemas

Page 11: MongoDB Days UK: Jumpstart: Schema Design

Complex Normalized Schemas

Page 12: MongoDB Days UK: Jumpstart: Schema Design

13

Documents are Rich Data Structures{ first_name: ‘Paul’, surname: ‘Miller’, cell: ‘+447557505611’ city: ‘London’, location: [45.123,47.232], Profession: [banking, finance, trader], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } ]}

Fields can contain an array of sub-documents

Fields

Typed field values

Fields can contain arrays

String

Number

Geo-Coordinates

Page 13: MongoDB Days UK: Jumpstart: Schema Design

Relationships

Page 14: MongoDB Days UK: Jumpstart: Schema Design

Modeling One-to-One Relationships

Page 15: MongoDB Days UK: Jumpstart: Schema Design

16

Referencing

Procedure• patient• date• type• physician• type

Results• dataType• size• content: {…}

Use two collections with a reference

Similar to relational

Page 16: MongoDB Days UK: Jumpstart: Schema Design

17

Procedure• patient• date• type• results

• equipmentId• data1• data2

• physician

• Results• type• size• content: {…}

Embedding

Document Schema

Page 17: MongoDB Days UK: Jumpstart: Schema Design

18

Referencing (1:1)

Procedure

{ "_id" : 333, "date" : "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : 134}

Results

{ “_id” : 134 "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }

Page 18: MongoDB Days UK: Jumpstart: Schema Design

19

Embedding (1:1)Procedure{ "_id" : 333, "date" : "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : { "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }}

Page 19: MongoDB Days UK: Jumpstart: Schema Design

20

Embedding (1:1)

• Advantages– Retrieve all relevant information in a single query/document– Avoid implementing joins in application code– Update related information as a single atomic operation

• MongoDB doesn’t offer multi-document transactions

• Limitations– Large documents mean more overhead if most fields are not relevant– 16 MB document size limit

Page 20: MongoDB Days UK: Jumpstart: Schema Design

23

Referencing (1:1)

• Advantages– Smaller documents– Less likely to reach 16 MB document limit– Infrequently accessed information not accessed on every query

• Limitations– Two queries required to retrieve information– Cannot update related information atomically

Page 21: MongoDB Days UK: Jumpstart: Schema Design

24

One to One: General Recommendations

• Embed– No additional data duplication– Can query or index on

embedded field• e.g., “result.type”

• Exceptional cases…• Embedding results in large

documents• Set of infrequently access

fields

{ "_id" : 333, "date" : "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : { "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }}

Page 22: MongoDB Days UK: Jumpstart: Schema Design

Modeling One-to-Many Relationships

Page 23: MongoDB Days UK: Jumpstart: Schema Design

26

{ _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [ { id: 12345, date: 2015-02-15, type: “Cat scan”,

…}, { id: 12346, date: 2015-02-15, type: “blood test”,

…}]}

Pat

ient

s

Embed

One-to-Many RelationshipsModeled in 2 possible ways

{ _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [12345, 12346]}

{ _id: 12345, date: 2015-02-15, type: “Cat scan”, …} { _id: 12346, date: 2015-02-15, type: “blood test”, …}

Pat

ient

s

Reference

Pro

cedu

res

Page 24: MongoDB Days UK: Jumpstart: Schema Design

27

One to Many: General Recommendations

• Embed, when possible– Access all information in a single query– Take advantage of update atomicity– No additional data duplication– Can query or index on any field

• e.g., { “phones.type”: “mobile” }

• Exceptional cases:– 16 MB document size– Large number of infrequently accessed fields

{ _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [ { id: 12345, date: 2015-02-15, type: “Cat scan”,

…}, { id: 12346, date: 2015-02-15, type: “blood test”,

…}]}

Page 25: MongoDB Days UK: Jumpstart: Schema Design

Modeling Many-to-Many Relationships

Page 26: MongoDB Days UK: Jumpstart: Schema Design

29

Many to ManyTraditional Relational Association

Join table

Physiciansnamespecialtyphone

Hospitalsname

HosPhysicanRelhospitalIdphysicianIdX

Use arrays instead

Page 27: MongoDB Days UK: Jumpstart: Schema Design

30

{ _id: 1, name: “Oak Valley Hospital”, city: “New York”, beds: 131, physicians: [ { id: 12345, name: “Joe Doctor”, address: {…},

…}, { id: 12346, name: “Mary Well”, address: {…},

…}]}

Many-to-Many RelationshipsEmbedding physicians in hospitals collection

{ _id: 2, name: “Plainmont Hospital”, city: “Omaha”, beds: 85, physicians: [ { id: 63633, name: “Harold Green”, address: {…},

…}, { id: 12345, name: “Joe Doctor”, address: {…},

…}]}

Data Duplication

Page 28: MongoDB Days UK: Jumpstart: Schema Design

31

{ _id: 1, name: “Oak Valley Hospital”, city: “New York”, beds: 131, physicians: [12345, 12346]}

Many-to-Many RelationshipsReferencing

{ id: 63633, name: “Harold Green”, address: {…}, …}

Hospitals

{ _id: 2, name: “Plainmont Hospital”, city: “Omaha”, beds: 85, physicians: [63633, 12345]}

Physicians

{ id: 12345, name: “Joe Doctor”, address: {…}, …}

{ id: 12346, name: “Mary Well”, address: {…}, …}

Page 29: MongoDB Days UK: Jumpstart: Schema Design

32

Many to ManyGeneral Recommendation

• Use case determines whether to reference or embed:1. Data Duplication

• Embedding may result in data duplication

• Duplication may be okay if reads dominate updates

2. Referencing may be required if many related items

3. Hybrid approach• Potentially do both

{ _id: 2, name: “Oak Valley Hospital”, city: “New York”, beds: 131, physicians: [12345, 12346]}

{ _id: 12345, name: “Joe Doctor”, address: {…}, …} { _id: 12346, name: “Mary Well”, address: {…}, …}

Hos

pita

ls

Reference

Phy

sici

ans

Page 30: MongoDB Days UK: Jumpstart: Schema Design

What If I Want to Store Large Files in MongoDB?

Page 31: MongoDB Days UK: Jumpstart: Schema Design

34

GridFS

Driv

erGridFS APIdoc.jpg(meta data)

doc.jpg(1)doc.jpg

(1)doc.jpg(1)

fs.files fs.chunksdoc.jpg

mongofiles utility provides command line GridFS interface

Page 32: MongoDB Days UK: Jumpstart: Schema Design

Schema Design and Performance

Two Examples

Page 33: MongoDB Days UK: Jumpstart: Schema Design

Example 1: Hybrid Approach

Embed and Reference

Page 34: MongoDB Days UK: Jumpstart: Schema Design

37

Healthcare Example

patients

procedures

Page 35: MongoDB Days UK: Jumpstart: Schema Design

Tailor Schema to Queries (cont.)

{ "_id" : 593340651, "first" : "Gregorio", "last" : "Lang", "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 }, "physicians" : [10387 33456], "procedures” : ["551ac”, “343fs”]}

{ "_id" : "551ac”, "date" :"2000-04-26”, "hospital" : 161, "patient" : 593340651, "physician" : 10387, "type" : "Chest X-ray", "records" : [ “67bc6”]}

Patient Procedure

Find all patients from NH that have had chest x-rays

Page 36: MongoDB Days UK: Jumpstart: Schema Design

Tailor Schema to Queries (cont.)

{ "_id" : 593340651, "first" : "Gregorio", "last" : "Lang", "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 }, "physicians" : [10387 33456], "procedures” : [ {id : "551ac”, type : “Chest X-ray”}, {id : “343fs”, type : “Blood Test”}]}

{ "_id" : "551ac”, "date" :"2000-04-26”, "hospital" : 161, "patient" : 593340651, "physician" : 10387, "type" : "Chest X-ray", "records" : [ “67bc6”]}

Patient Procedure

Find all patients from NH that have had chest x-rays

Page 37: MongoDB Days UK: Jumpstart: Schema Design

Example 2: Time Series Data

Medical Devices

Page 38: MongoDB Days UK: Jumpstart: Schema Design

41

Vital Sign Monitoring Device

Vital Signs Measured:• Blood Pressure• Pulse• Blood Oxygen Levels

Produces data at regular intervals• Once per minute

Page 39: MongoDB Days UK: Jumpstart: Schema Design

42

We have a hospital(s) of devices

Page 40: MongoDB Days UK: Jumpstart: Schema Design

43

Data From Vital Signs Monitoring Device

{ deviceId: 123456, spO2: 88, pulse: 74, bp: [128, 80], ts: ISODate("2013-10-16T22:07:00.000-0500")}

• One document per minute per device

• Relational approach

Page 41: MongoDB Days UK: Jumpstart: Schema Design

44

Document Per Hour (By minute)

{ deviceId: 123456, spO2: { 0: 88, 1: 90, …, 59: 92}, pulse: { 0: 74, 1: 76, …, 59: 72}, bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]}, ts: ISODate("2013-10-16T22:00:00.000-0500")}

• Store per-minute data at the hourly level

• Update-driven workload

• 1 document per device per hour

Page 42: MongoDB Days UK: Jumpstart: Schema Design

45

Characterizing Write Differences

• Example: data generated every minute• Recording the data for 1 patient for 1 hour:

Document Per Event60 inserts

Document Per Hour1 insert, 59 updates

Page 43: MongoDB Days UK: Jumpstart: Schema Design

46

Characterizing Read Differences

• Want to graph 24 hour of vital signs for a patient:

• Read performance is greatly improved

Document Per Event 1440 reads

Document Per Hour24 reads

Page 44: MongoDB Days UK: Jumpstart: Schema Design

47

Characterizing Memory and Storage Differences

Document Per Minute Document Per HourNumber Documents 52.6 B 876 M

Total Index Size 6364 GB 106 GB

_id index 1468 GB 24.5 GB

{ts: 1, deviceId: 1} 4895 GB 81.6 GB

Document Size 92 Bytes 758 Bytes

Database Size 4503 GB 618 GB

• 100K Devices • 1 years worth of data

100000 * 365 * 24 * 60

100000 * 365 * 24

100000 * 365 * 24 * 60 * 130

100000 * 365 * 24 * 130

100000 * 365 * 24 * 60 * 92

100000 * 365 * 24 * 758

Page 45: MongoDB Days UK: Jumpstart: Schema Design

48

Summary• Relationships can be modeled by embedding or references

• Decision should be made in context of application data and query workload– Tailor schema to application workload

• It is okay recommended to violate RDBMS schema design principles– No duplication of data– Normalization

• Different schemas may result in dramatically different– Query performance– Hardware requirements

Page 46: MongoDB Days UK: Jumpstart: Schema Design

[email protected]

@erlichson