MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
Arctic: High-performance IoT and financial data storage with Python and MongoDB
© Man 2015
Arctic: High-performance IoT and financial data storage
the Python + MongoDB TimeSeries store
July 2015
@ManAHLTech
@jimmybb James Blackburn
1. Overview: Data @ AHL
2. Arctic: a fast Python library for TimeSeries (and anything else)
3. Demo
4. Performance
Overview
Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – total throughput
• Historical data
• New data
• ... want control of storing their own data
Trading system
• Auditable – SVN for data
• Stable
Users
Data sizes we deal with…
• ~1MB x 1000s – 1x a day price data, 10k rows (30 years)
• ~0.5GB x 1000s – 1-minute data, 4M rows (20 years)
• ~1GB x 1000s – 10k x 10k data matrices, 100M cells (30 years)
• ~30TB – tick data, 100k msgs/s, 800M msgs/day
… and different shapes
• Time series of prices
• Event data
• News data
• Metadata
• What's next?
Data sizes
lib.read('US Equity Adjusted Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8103)
Equity prices: 77M float64s – 600MB of data ~= 5 Gbits! (a quick arithmetic check follows below)
Problems - Scale
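The scale claim is simple arithmetic – a quick back-of-the-envelope check in plain Python (no Arctic needed), using the frame dimensions from the listing above:

rows, cols = 9605, 8103
n_values = rows * cols          # ~77.8M float64 values
size_mb = n_values * 8 / 1e6    # 8 bytes each -> ~623 MB
size_gbit = size_mb * 8 / 1e3   # -> ~5.0 Gbits to move over the network
print(n_values, size_mb, size_gbit)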
Using all the technology:
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches
Can we build one system to rule them all?
Databases
Requirements
• Scalable – unbounded in data-size and number of clients
• Agile – any data shape; new shapes; iterative development
• Fast – as fast as local files
• Easy to use – and we mean easy
Project Requirements
Goals
• 20 years of 1 minute data in <1s
• 200 instruments x all history x once a day data <1s
• Single data store for all data types – from 1x day data to tick data
• Data versioning + Audit
Project Goals
Data bucketed into named Libraries
• One minute
• Daily
• User-data: jbloggs.EOD
• Metadata Index
Pluggable Library types:
• VersionStore
• TickStore
• Pickle Store
• … pluggable … (see the sketch below)
https://github.com/manahl/arctic/blob/master/howtos/how_to_custom_arctic_library.py
Arctic Libraries
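Library types are pluggable via the howto linked above. Below is a minimal sketch of what a custom store might look like, assuming the register_library_type hook and the arctic_lib.get_top_level_collection() helper shown in that howto; the store name and methods here are illustrative, not Arctic's full library protocol:

from arctic import Arctic, register_library_type

class KeyValueStore(object):
    """Toy library type: one document per symbol."""

    @classmethod
    def initialize_library(cls, arctic_lib, **kwargs):
        pass  # one-off setup, e.g. index creation

    def __init__(self, arctic_lib):
        self._collection = arctic_lib.get_top_level_collection()

    def write(self, symbol, data):
        self._collection.update_one({'symbol': symbol},
                                    {'$set': {'data': data}}, upsert=True)

    def read(self, symbol):
        return self._collection.find_one({'symbol': symbol})['data']

register_library_type('KeyValueStore', KeyValueStore)
store = Arctic('localhost')
store.initialize_library('jbloggs.kv', 'KeyValueStore')
lib = store['jbloggs.kv']
lib.write('SYMBOL', {'a': 1})
print(lib.read('SYMBOL'))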
Document ~= Python Dictionary
Flexible schema – rapid prototyping (see the pymongo sketch below)
Open-source database
Great support
#1 NoSQL DB (#3 overall) http://db-engines.com/en/ranking
Why MongoDB
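Because a MongoDB document maps naturally onto a Python dict, prototyping is one insert away. A minimal sketch using pymongo against a local server (database and collection names here are made up):

from pymongo import MongoClient

client = MongoClient('localhost')   # assumes a local mongod
coll = client.research.prices       # hypothetical db/collection

# Any dict-shaped record can be stored as-is – no schema migration needed
coll.insert_one({'symbol': 'SYMBOL', 'close': 101.25, 'volume': 1200})
print(coll.find_one({'symbol': 'SYMBOL'}))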
Arctic key-value store
from arctic import Arctic

a = Arctic('research')        # Connect to the data store
a.list_libraries()            # What data libraries are available
library = a['jbloggs.EOD']    # Get a Library

library.list_symbols()                       # List symbols
library.write('SYMBOL', <TS or other data>)  # Write
library.read('SYMBOL', version=…)            # Read, with an optional version

library.snapshot('snapshot-name')  # Create a named snapshot of the library
library.list_snapshots()           # List the library's snapshots
https://github.com/manahl/arctic/blob/master/howtos/how_to_use_arctic.py
Arctic API
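End to end, that looks like this – a minimal sketch assuming a MongoDB server on localhost; the library, symbol, and snapshot names are made up, and note that the released library spells the version argument as_of (a version number, snapshot name, or datetime):

import pandas as pd
from arctic import Arctic

a = Arctic('localhost')
a.initialize_library('jbloggs.EOD')                  # one-off: create the library
library = a['jbloggs.EOD']

ts = pd.DataFrame({'close': [100.0, 101.5]},
                  index=pd.to_datetime(['2015-07-01', '2015-07-02']))
library.write('ACME', ts)                            # version 1
library.snapshot('demo-snap')                        # freeze the library's current state
library.write('ACME', ts * 1.01)                     # version 2

print(library.read('ACME').data)                     # latest version of the data
print(library.read('ACME', as_of='demo-snap').data)  # as of the snapshot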
Implementation – a chunk
{
    ID: ObjectId('52b1d39eed5066ab5e87a56d'),
    SYMBOL: 'symbol',
    INDEX: Binary('...', 0),
    COLUMNS: {
        ASK: {
            DATA: Binary('...', 0),
            DTYPE: '<f8',
            ROWMASK: Binary('...', 0)
        },
        ...
    },
    START: DateTime(...),
    END: DateTime(...),
    SEGMENT: 1386933906826L,
    SHA: 1386933906826L,
    VERSION: 3,
}
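Each column's DATA blob is an lz4-compressed raw numpy buffer, so a column can be rebuilt by decompressing and reinterpreting the bytes. A sketch of the idea (not Arctic's actual reader) against a chunk document like the one above, using the same old-style lz4 module as the later slides:

import lz4
import numpy as np

def read_column(chunk, name):
    # 'chunk' is one BSON document as above; 'name' is e.g. 'ASK'
    col = chunk['COLUMNS'][name]
    raw = lz4.decompress(bytes(col['DATA']))       # undo the lz4 compression
    return np.frombuffer(raw, dtype=col['DTYPE'])  # reinterpret the bytes as '<f8' floats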
Low latency:
- 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- OneMinute / Tick data: 1s for 3.5M rows in Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows in Java
Parallel access:
- Cluster with 256+ concurrent data accesses
- Consistent throughput – little load on the Mongo server
Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s – see the sketch below)
Conclusions
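The lz4 figure is easy to sanity-check on your own hardware – a rough micro-benchmark sketch using the modern lz4.block API (numbers will vary with the machine and with how compressible the data is):

import time
import lz4.block
import numpy as np

data = np.random.randn(10 * 1000 * 1000).tobytes()  # ~80MB of float64 bytes
packed = lz4.block.compress(data, mode='high_compression')

start = time.perf_counter()
for _ in range(10):
    lz4.block.decompress(packed)
elapsed = time.perf_counter() - start
print('decompression: %.1f GB/s' % (10 * len(data) / elapsed / 1e9))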
Like data? We’re [email protected]
Questions?
@ManAHLTech
https://github.com/manahl/arctic
@jimmybb James Blackburn
import lz4
import cPickle
from bson.binary import Binary

_CHUNK_SIZE = 15 * 1024 * 1024  # 15MB – keeps each segment under Mongo's 16MB document limit

class PickleStore(object):

    def write(self, collection, version, symbol, item):
        # Try to pickle it. This is best effort
        pickled = lz4.compressHC(cPickle.dumps(item))
        # Store the compressed blob as a series of <16MB segment documents
        for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
            segment = {'data': Binary(pickled[i * _CHUNK_SIZE:(i + 1) * _CHUNK_SIZE])}
            segment['segment'] = i
            # Content-addressed by SHA: identical segments are shared between versions
            sha = checksum(symbol, segment)  # checksum() is an Arctic helper defined elsewhere
            collection.update({'symbol': symbol, 'sha': sha},
                              {'$set': segment,
                               '$addToSet': {'parent': version['_id']}},
                              upsert=True)
Implementation – PickleStore
import pymongo

class PickleStore(object):

    def read(self, collection, version, symbol):
        # Fetch this version's segments in order and reassemble the compressed blob
        data = ''.join([x['data'] for x in collection.find({'symbol': symbol,
                                                            'parent': version['_id']},
                                                           sort=[('segment', pymongo.ASCENDING)])])
        # Decompress and unpickle
        return cPickle.loads(lz4.decompress(data))
https://github.com/manahl/arctic/blob/master/arctic/store/_pickle_store.py
Implementation – PickleStore
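To see the two halves round-trip, here is a minimal sketch against a local MongoDB (Python 2, matching the slide code), with a trivial stand-in for Arctic's checksum helper and a bare-bones version document – both are simplifications of what the real VersionStore supplies:

import hashlib
from bson import ObjectId
from pymongo import MongoClient

def checksum(symbol, segment):  # stand-in for Arctic's real helper
    return hashlib.sha1(symbol + segment['data']).hexdigest()

collection = MongoClient('localhost').research.jbloggs_pickle  # hypothetical names
version = {'_id': ObjectId()}   # bare-bones version document

store = PickleStore()
store.write(collection, version, 'MY_SYMBOL', {'answer': 42, 'xs': range(10)})
print store.read(collection, version, 'MY_SYMBOL')  # the object round-trips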