Weather of the Century: Design and Performance
Consulting Engineer, MongoDB
André Spiegel
#MongoDB
The Weather of the Century: Design and High Performance
What was the weather when you were born?
Data Format: Raw and in MongoDB
0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
{ "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" }}
Station Identifier (»NYC Central Park«)
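The mapping from the raw record to the document can be sketched in a few lines of JavaScript. The slice offsets below are assumptions, inferred by lining up this one sample record with the document shown; the "u" prefix on the station id is likewise copied from the sample, not explained in the slides. Only the first 105 characters of the truncated record are used, and `parseRecord` is a hypothetical helper name.

```javascript
// First 105 characters of the sample record from the slide.
const SAMPLE =
  "0303725053947282013060322517+40779-073969FM-15+0048KNYC V030" +
  "9999C00005030485MN0080475N5+02115+02005100975";

// Zero-based offsets, assumed from the sample (not taken from a format spec).
function parseRecord(line) {
  const station = line.slice(4, 10);   // "725053"
  const date = line.slice(15, 23);     // "20130603"
  const time = line.slice(23, 27);     // "2251"
  const ts = new Date(Date.UTC(
    Number(date.slice(0, 4)),
    Number(date.slice(4, 6)) - 1,      // JavaScript months are zero-based
    Number(date.slice(6, 8)),
    Number(time.slice(0, 2)),
    Number(time.slice(2, 4))
  ));
  return {
    st: "u" + station,
    ts: ts,
    // Values are stored as signed integers in tenths of a unit.
    airTemperature: {
      value: Number(line.slice(87, 92)) / 10,
      quality: line.slice(92, 93),
    },
    atmosphericPressure: {
      value: Number(line.slice(99, 104)) / 10,
      quality: line.slice(104, 105),
    },
  };
}

console.log(JSON.stringify(parseRecord(SAMPLE)));
```

In the mongo shell the `ts` value would be shown as an `ISODate(...)`; plain JavaScript uses `Date` for the same instant.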
How Big Is It?
• 2.5 billion data points
• 4 terabytes (1.6 kB per document)
• “moderately big”
How to do this with MongoDB?
First Deployment
• A single server with a really big disk
• Application: c3.8xlarge
• mongod: i2.8xlarge (251 GB RAM, 6 TB SSD)
Second Deployment
• A really big cluster where everything is in RAM
• Application / mongos: c3.8xlarge
• mongod: 100 x r3.2xlarge (61 GB RAM, 100 GB disk each)
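A quick back-of-the-envelope check shows why the cluster can hold everything in RAM while the single server cannot. This is a rough sketch using only the figures from the slides; it assumes 1 TB = 1000 GB and ignores indexes and storage overhead.

```javascript
// Figures from the slides.
const documents = 2.5e9;          // data points
const kbPerDocument = 1.6;        // average document size
const singleServerRamGB = 251;    // i2.8xlarge
const clusterRamGB = 100 * 61;    // 100 x r3.2xlarge

// Data set size in GB (1 GB = 1e6 kB), rounded to absorb floating-point noise.
const dataSetGB = Math.round((documents * kbPerDocument) / 1e6);

const fitsInRam = (ramGB) => ramGB >= dataSetGB;

console.log("data set:", dataSetGB, "GB");
console.log("single server fits in RAM:", fitsInRam(singleServerRamGB));
console.log("cluster fits in RAM:", fitsInRam(clusterRamGB));
```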
Now... how much would you pay?
• Single server: $60,000 / yr
• Cluster: $700,000 / yr
Use Cases
• Bulk loading
  – getting all data into the system
• Latency and throughput for queries
  – point in space-time
  – one station, one year
  – the whole world, once upon a time
• Aggregation and exploration
  – warmest and coldest day ever, etc.
Bulk Loading: Principles
• On the application side:
  – batch size
  – number of client threads
  – use unordered bulk writes
• On the server side:
  – journaling off (temporarily!)
  – index later
  – in cluster: pre-split, no balancing
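The application-side principles can be sketched as follows. This is a minimal illustration, not the loader used for the talk: `toBatches` and `bulkLoad` are hypothetical helpers, and `collection` stands for any object with an `insertMany(docs, options)` method, such as a Node.js driver collection. Passing `{ ordered: false }` asks for an unordered bulk write. Running several such loaders in parallel would correspond to the "number of client threads" knob, which is not modeled here.

```javascript
// Split an array of documents into batches of at most `batchSize`.
function toBatches(docs, batchSize) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Insert all documents batch by batch, using unordered bulk writes.
// Returns the number of documents handed to the collection.
async function bulkLoad(collection, docs, batchSize) {
  let sent = 0;
  for (const batch of toBatches(docs, batchSize)) {
    await collection.insertMany(batch, { ordered: false });
    sent += batch.length;
  }
  return sent;
}
```

The batch size is left as a parameter because the slides treat it, together with the thread count, as something to tune by measurement.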
Bulk Loading: Single Server
[Chart: throughput as a function of batch size and number of threads]
8 threads, batch size 100 → 85,000 doc/s
Bulk Loading: Single Server
• Settings: 8 threads, batch size 100
• Total loading time: 10 h 20 min
• Documents per second: 70,000
• Index build time: 7 h 40 min (ts_1_st_1)
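The index is given only by its name, ts_1_st_1. MongoDB derives default index names by joining each field with its direction, so the name suggests a compound index on `ts` and `st`, both ascending. The helper below is a hypothetical illustration of that naming rule, and the key specification is an inference from the name, not something stated on the slide.

```javascript
// Build the default-style index name from a key specification.
function defaultIndexName(keySpec) {
  return Object.entries(keySpec)
    .map(([field, direction]) => field + "_" + direction)
    .join("_");
}

// Assumed key specification behind the name on the slide.
const keySpec = { ts: 1, st: 1 };
console.log(defaultIndexName(keySpec));

// In the mongo shell the index would then be created after loading with:
//   db.data.createIndex({ "ts" : 1, "st" : 1 })
```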
Bulk Loading: Cluster
144 threads, batch size 200 → 220,000 doc/s
Bulk Loading: Cluster
• Shard Key: Station ID, hashed
• Settings: 10 mongos @ 144 threads, batch size 200
• Total loading time: 3 h 10 min
• Documents per second: 228,000
• Index build time: 5 min (ts_1_st_1)
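The reported loading times can be cross-checked against the reported rates. The sketch below divides the 2.5 billion documents by each documents-per-second figure; the results land a little under the reported 10 h 20 min and 3 h 10 min. It also compares the two index build times. All inputs come from the slides; nothing else is assumed.

```javascript
const totalDocuments = 2.5e9;

// Hours needed to load everything at a constant rate (documents per second).
function loadHours(docsPerSecond) {
  return totalDocuments / docsPerSecond / 3600;
}

const singleServerHours = loadHours(70000);
const clusterHours = loadHours(228000);

// Index build: 7 h 40 min on the single server vs. 5 min on the cluster.
const indexBuildSpeedup = (7 * 60 + 40) / 5;

console.log("single server:", singleServerHours.toFixed(1), "h");
console.log("cluster:", clusterHours.toFixed(1), "h");
console.log("index build speed-up:", indexBuildSpeedup + "x");
```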
Queries: Point in Space-Time
db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
[Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster; axis 0 to 1.6 ms]
Max. throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
Queries: One Station, One Year
db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
[Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster; axis 0 to 4000 ms]
Max. throughput: 20/s (single server), 430/s (cluster, 10 mongos)
Targeted query
Queries: The Whole World, Once Upon...
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
[Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster; axis 0 to 8000 ms]
Max. throughput: 8/s (single server), 310/s (cluster, 10 mongos)
Scatter/gather query
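Why the first two queries are targeted and this one is scatter/gather follows from the shard key, the hashed station id. A query that pins `st` to a single value can be routed to one shard; a query without it must be sent to all shards. The function below is a deliberately simplified model of that decision, not the actual mongos routing logic; `routing` is a hypothetical name, and `Date` stands in for the shell's `ISODate`.

```javascript
const SHARD_KEY_FIELD = "st"; // hashed shard key from the cluster setup

// Simplified rule: an equality match on the shard key is targeted,
// anything else (missing key, or an operator object) goes to all shards.
function routing(query) {
  const value = query[SHARD_KEY_FIELD];
  const isEquality =
    value !== undefined && (typeof value !== "object" || value === null);
  return isEquality ? "targeted" : "scatter/gather";
}

const pointInSpaceTime = { st: "u747940", ts: new Date("1969-07-16T12:00:00Z") };
const oneStationOneYear = {
  st: "u103840",
  ts: { $gte: new Date("1989-01-01"), $lt: new Date("1990-01-01") },
};
const wholeWorld = { ts: new Date("2000-01-01T00:00:00Z") };

console.log(routing(pointInSpaceTime));
console.log(routing(oneStationOneYear));
console.log(routing(wholeWorld));
```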
Analytics and Exploration
• Analytics means ad-hoc queries for which we do not have an index
  – Find all tornados
  – Maximum reported temperature
• We cannot just index everything
  – memory
  – write performance
Analytics: Find all Tornados
db.data.find({ "presentWeatherObservation.condition" : "99" })
• Single server: 1 h 28 min
• Cluster: 47 s
Analytics: Maximum Temperature
db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ])
61.8 °C = 143 °F
• Single server: 4 h 45 min
• Cluster: 2 min
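What the pipeline computes can be illustrated on a handful of in-memory documents: keep only readings whose quality code is "1" or "5", then take the maximum value. The three sample documents are invented for this sketch, and `maxTemperature` is a hypothetical helper. The unit conversion at the end reproduces the figure quoted above.

```javascript
// Invented sample readings; the middle one has a quality code that is filtered out.
const readings = [
  { airTemperature: { value: 21.1, quality: "5" } },
  { airTemperature: { value: 999.9, quality: "9" } },
  { airTemperature: { value: 35.0, quality: "1" } },
];

// In-memory counterpart of the $match and $group/$max stages.
// Returns null when no reading passes the quality filter.
function maxTemperature(docs) {
  const accepted = docs.filter(
    (d) => d.airTemperature && ["1", "5"].includes(d.airTemperature.quality)
  );
  if (accepted.length === 0) return null;
  return accepted.reduce(
    (max, d) => Math.max(max, d.airTemperature.value),
    -Infinity
  );
}

const toFahrenheit = (celsius) => (celsius * 9) / 5 + 32;

console.log(maxTemperature(readings));
console.log(Math.round(toFahrenheit(61.8)));
```

Because no index covers `airTemperature.quality`, the real pipeline has to read every document, which is what makes the cluster's parallel scan so much faster.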
Summary: Single Server
Pro
• Cost-effective
• Very good latency for single queries
Con
• Some operations are prohibitive:
  – Indexing
  – Table scans
Summary: Cluster
Con
• High cost
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible
Thank you.