Post on 15-Feb-2017
©2016 Couchbase Inc. 1
Memory-optimized indexes: how they work
Sarath Lakshman, Senior Software Engineer, Couchbase
Agenda
• Architecture of Global Secondary Index
• What exactly is a Memory-Optimized Index?
• Architecture of Nitro Storage Engine
• Scalability and Performance
• Operational aspects of Memory-Optimized Index
Global Secondary Index (GSI)
An architecture overview
GSI Overview
• Speed up your N1QL queries using fast indexes ordered by secondary JSON fields
• Workload isolation and independent scaling for document access/modifications and index operations
• Ensure read availability by creating replica indexes
• Global Indexes offer scalable performance, while local indexes degrade query performance as more nodes are added due to scatter/gather
• Asynchronously updated indexes with high throughput and low latency
Multi Dimensional Scaling (MDS)
• Indexes can scale independently from document data
• Workloads for different services are isolated
[Figure: six Couchbase Server nodes, each with a Cluster Manager, managed cache, and storage, running the Data Service, Query Service, or Index Service on separate nodes.]
Components of GSI
• Projector
  • Transforms document mutations into secondary index items and routes them to index nodes based on index definitions
• Indexer
  • Updates indexes corresponding to the document changes
  • Provides point-in-time index scan snapshots
  • Handles index DDLs
• GSI Client
  • Smart client that is aware of the global index topology
  • Helps N1QL interact with GSI indexes
  • Facilitates index scan operations
  • Manages scan connection pooling
Indexer update pipeline
Index Service: Index Port → Mutation Queue → Extraction Worker → Index Queue → Storage Updater Worker → ForestDB/Nitro
Update index ONLY IF key has changed
{"LastName": "Adams", "Phone": "323-180-9978"}
{"LastName": "Adams"}
{"Phone": "323-180-9978"}
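The "update only if the key has changed" rule can be sketched roughly as follows. This is a minimal illustration, not the indexer's code: the function names are hypothetical, the second phone number is made up, and comparing two document versions is a simplification (a real indexer would more likely compare against the key it already stored for the document).

```python
from typing import Any, Optional

def secondary_key(doc: dict, field: str) -> Optional[Any]:
    """Return the value of the indexed field, or None if the document lacks it."""
    return doc.get(field)

def needs_index_update(old_doc: Optional[dict], new_doc: dict, field: str) -> bool:
    """True only when the mutation changes the key that the index stores."""
    old_key = secondary_key(old_doc, field) if old_doc is not None else None
    return old_key != secondary_key(new_doc, field)

before = {"LastName": "Adams", "Phone": "323-180-9978"}
after = {"LastName": "Adams", "Phone": "323-180-0000"}  # hypothetical new number

# An index on LastName is untouched; an index on Phone must be updated.
last_name_changed = needs_index_update(before, after, "LastName")
phone_changed = needs_index_update(before, after, "Phone")
```

With this check, a mutation that only touches Phone costs nothing for an index defined on LastName.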
Memory Optimized Index
An introduction
Why Memory-Optimized Index?
• Performance!
• Server hardware is constantly evolving, with many CPU cores and large amounts of DRAM
• Single-index write performance matters because the index has to keep up with the rate of document mutations sent by many data service nodes
• The data service offers very high write performance, demanding fast index updates
• Indexes hold a small subset of document data (e.g., a secondary field), so it is possible to hold indexes completely in memory
• Disk-oriented storage engines such as the Standard Index are optimized for faster disk access with a paging mechanism, on the assumption that the entire dataset cannot fit in memory
• Providing more DRAM and CPU cores will not speed up single-index performance with standard indexes
What exactly is a Memory-Optimized Index?
• Memory-resident index
• Throw more DRAM and CPU cores at it: can I scale single-index performance linearly? Yes
• Designed for high performance and multicore scalability
• Fast writes and low-latency index scans
• Architecturally very different from disk-oriented storage engines
• Fast backup and recovery on disk/SSD
• Supports index snapshots at 20ms latency (200ms for standard index)
• Avoids the need to partition an index to scale throughput
• Written in Golang/C
• Every component of the index storage engine can scale seamlessly with many CPU cores
Nitro Storage Engine
The storage engine that powers Memory Optimized Indexes (MOI)
A VLDB 2016 paper (Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index)
Design considerations
• Multiple writers for high performance
  • Utilize the inherent parallelism in the Database Change Protocol (DCP)
  • Scalable single-index write performance by using available CPU cores
• Lock-free data structures for high concurrency
  • Writers and readers never block
  • Maximize utilization of multicore CPUs
• Fast snapshots
  • Minimize latency for index queries / reduce staleness of the index
  • Create read snapshots at the rate of 100/second
• Leverage optimizations for memory-resident data structures
Nitro Architecture
• Create backups from snapshots and recover Nitro after a restart/crash
• Free items once they are GCed and no longer referenced
• Remove items from the skiplist that belong to unused snapshots
• Create point-in-time immutable snapshots for index scans
• Avoid phantoms and provide scan stability
• Manage index snapshot versions in use
• Implements Insert, Delete, Lookup, Range Iteration
• Concurrent partitioned visitors
• Concurrent bottom-up skiplist build
Skiplist
• Probabilistic, balanced, ordered search data structure
• Search is similar to binary search over linked lists (O(log n))
• Item-granular operations, unlike a B+Tree (page oriented)
• The lock-free skiplist is implemented using atomic compare-and-swap and atomic add-and-fetch
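To make the skiplist search concrete, here is a minimal single-threaded sketch. It is not Nitro's implementation and is not lock-free; the class names, the level cap of 8, and the 0.5 promotion probability are assumptions chosen for illustration.

```python
import random

MAX_LEVEL = 8  # assumed cap on tower height for this sketch

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level  # one forward pointer per level

class SkipList:
    def __init__(self, seed=0):
        self.head = Node(None, MAX_LEVEL)  # sentinel; its key is never compared
        self.rng = random.Random(seed)

    def _random_level(self):
        # Each extra level is kept with probability 1/2.
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def _find_preds(self, key):
        """For every level, find the last node whose key is smaller than key."""
        preds = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):  # start high, drop down
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            preds[lvl] = node
        return preds

    def insert(self, key):
        preds = self._find_preds(key)
        node = Node(key, self._random_level())
        for lvl in range(len(node.next)):
            node.next[lvl] = preds[lvl].next[lvl]
            preds[lvl].next[lvl] = node

    def contains(self, key):
        cand = self._find_preds(key)[0].next[0]
        return cand is not None and cand.key == key

    def items(self):
        node = self.head.next[0]
        while node is not None:
            yield node.key
            node = node.next[0]

sl = SkipList()
for k in [30, 10, 20, 15, 32]:
    sl.insert(k)
ordered = list(sl.items())
```

Each insert touches only the new node and its predecessors, which is the item-granular behavior contrasted with page-oriented B+Trees. A lock-free version would replace the plain pointer writes with compare-and-swap.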
Lock-free data structure fundamentals
• Step 1: Mark as deleted
• Step 2: Removal
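The two deletion steps can be illustrated on a singly linked list. This is a sketch under loud assumptions: Python has no hardware compare-and-swap, so the hypothetical AtomicRef class emulates one with a lock, and the code shows only the order of operations, not a production lock-free list.

```python
import threading

class AtomicRef:
    """Emulates an atomic (next node, deleted mark) pair.
    The lock stands in for a hardware compare-and-swap instruction."""
    def __init__(self, nxt=None):
        self._value = (nxt, False)
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value[0] is expected[0] and self._value[1] == expected[1]:
                self._value = new
                return True
            return False

class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.ref = AtomicRef(nxt)

def delete(head, key):
    while True:
        pred, curr = head, head.ref.load()[0]
        while curr is not None and curr.key != key:
            pred, curr = curr, curr.ref.load()[0]
        if curr is None:
            return False
        succ, marked = curr.ref.load()
        if marked:
            return False  # another deleter already claimed this node
        # Step 1: mark as deleted (logical delete on curr's own pointer).
        if not curr.ref.compare_and_swap((succ, False), (succ, True)):
            continue  # curr changed underneath us; retry from the head
        # Step 2: removal (swing the predecessor past curr). If this CAS
        # fails, the node stays marked and a later pass would unlink it.
        pred.ref.compare_and_swap((curr, False), (succ, False))
        return True

def keys(head):
    out, node = [], head.ref.load()[0]
    while node is not None:
        nxt, marked = node.ref.load()
        if not marked:  # readers skip logically deleted nodes
            out.append(node.key)
        node = nxt
    return out

head = Node(None, Node(10, Node(20, Node(30))))
first = delete(head, 20)
second = delete(head, 20)
remaining = keys(head)
```

Marking first means a concurrent insert after the doomed node fails its own CAS instead of being silently lost when the node is unlinked.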
Multi Versions Management (MVCC)
• Define lifetime metadata in each skiplist node (i.e., bornSn and deadSn)
• Create Snapshot 1
  V=10 (bornSn=1, deadSn=0)
  V=20 (bornSn=1, deadSn=0)
  V=30 (bornSn=1, deadSn=0)
• Create Snapshot 2
  V=10 (bornSn=1, deadSn=0)
  V=20 (bornSn=1, deadSn=0)
  V=30 (bornSn=1, deadSn=0)
  V=15 (bornSn=2, deadSn=0)
  V=32 (bornSn=2, deadSn=0)
Multi Versions Management (MVCC)
• Create Snapshot 3
  V=10 (bornSn=1, deadSn=0)
  V=20 (bornSn=1, deadSn=3)
  V=30 (bornSn=1, deadSn=0)
  V=15 (bornSn=2, deadSn=0)
  V=32 (bornSn=2, deadSn=3)
  V=32 (bornSn=3, deadSn=0)
Multi Versions Management (MVCC)
• Index scan for Snapshot 1
  Visibility: Iterator (Sn=1)
  V=10 (bornSn=1, deadSn=0)
  V=20 (bornSn=1, deadSn=3)
  V=30 (bornSn=1, deadSn=0)
  V=15 (bornSn=2, deadSn=0)
  V=32 (bornSn=2, deadSn=3)
  V=32 (bornSn=3, deadSn=0)
Multi Versions Management (MVCC)
• Index scan for Snapshot 2
  Visibility: Iterator (Sn=2)
  V=10 (bornSn=1, deadSn=0)
  V=20 (bornSn=1, deadSn=3)
  V=30 (bornSn=1, deadSn=0)
  V=15 (bornSn=2, deadSn=0)
  V=32 (bornSn=2, deadSn=3)
  V=32 (bornSn=3, deadSn=0)
Multi Versions Management (MVCC)
• Index scan for Snapshot 3
  Visibility: Iterator (Sn=3)
  V=10 (bornSn=1, deadSn=0)
  V=20 (bornSn=1, deadSn=3)
  V=30 (bornSn=1, deadSn=0)
  V=15 (bornSn=2, deadSn=0)
  V=32 (bornSn=2, deadSn=3)
  V=32 (bornSn=3, deadSn=0)
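The visibility check applied by a snapshot iterator can be sketched as below. The rule itself is an assumption inferred from the bornSn/deadSn values on these slides (an item is visible to snapshot Sn if it was born at or before Sn and has not died at or before Sn); the names Item, visible, and scan are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Item:
    value: int
    born_sn: int
    dead_sn: int = 0  # 0 means "not deleted yet", as on the slides

def visible(item: Item, sn: int) -> bool:
    """Assumed rule: born at or before sn, and not dead as of sn."""
    return item.born_sn <= sn and (item.dead_sn == 0 or item.dead_sn > sn)

def scan(items: List[Item], sn: int) -> List[int]:
    """Values an iterator pinned to snapshot sn would return, in key order."""
    return sorted(i.value for i in items if visible(i, sn))

# State after Snapshot 3 was created (values copied from the slides).
items = [
    Item(10, 1), Item(20, 1, 3), Item(30, 1),
    Item(15, 2), Item(32, 2, 3), Item(32, 3),
]

scan_1 = scan(items, 1)  # [10, 20, 30]
scan_2 = scan(items, 2)  # [10, 15, 20, 30, 32]
scan_3 = scan(items, 3)  # [10, 15, 30, 32]
```

Because old versions stay in place with a deadSn stamp, a scan on an older snapshot keeps seeing a stable view while writers continue.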
Nitro MVCC vs Copy-On-Write B+Tree MVCC
• A single-item update to a leaf node performs copy-on-write of the entire block (e.g., 4 KB)
• Since a B+Tree has a hierarchical structure, the update also triggers copy-on-write of every parent block up to the root, causing significant storage overhead (the wandering tree problem)
• Write-optimized storage engines try to amortize this cost by batching updates
• Large batch sizes lead to longer snapshot intervals
• Nitro has fixed storage overhead per item
• Snapshotting is a lightweight operation
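A rough worked example of the copy-on-write cost described above. The 4 KB block size comes from the slide; the tree height of 4 is an assumed value for illustration, and the function name is hypothetical.

```python
def cow_bytes_per_update(tree_height: int, block_size: int = 4096) -> int:
    """Bytes rewritten when one leaf item changes and every block on the
    leaf-to-root path is copied."""
    return tree_height * block_size

# Assumed example: a tree of height 4 with 4 KB blocks.
copied = cow_bytes_per_update(tree_height=4)  # 4 * 4096 = 16384 bytes
```

Under these assumptions a single small update rewrites 16 KB, which is why such engines batch updates and why batching stretches the snapshot interval.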
Garbage Collection
V=1 (bornSn=1, deadSn=2)
V=2 (bornSn=2, deadSn=0)
V=3 (bornSn=1, deadSn=0)
V=4 (bornSn=1, deadSn=2)
V=5 (bornSn=2, deadSn=3)
V=6 (bornSn=3, deadSn=0)
V=7 (bornSn=4, deadSn=0)
V=8 (bornSn=1, deadSn=0)
V=9 (bornSn=3, deadSn=0)
V=10 (bornSn=3, deadSn=4)
Snapshot list: Sn=1 (rfcnt=0), Sn=2 (rfcnt=1), Sn=3 (rfcnt=0), Sn=4 (rfcnt=2)
Garbage collection list: V=1
Concurrent GC, concurrent SMR
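One conservative way to decide which dead items are safe to collect, using the values from this slide. This is an assumed simplification, not Nitro's collector: it reclaims only items whose deadSn is at or below the oldest snapshot that still has a nonzero reference count. The function name is hypothetical.

```python
# (value, bornSn, deadSn) from the slide; deadSn == 0 means still live.
ITEMS = [
    (1, 1, 2), (2, 2, 0), (3, 1, 0), (4, 1, 2), (5, 2, 3),
    (6, 3, 0), (7, 4, 0), (8, 1, 0), (9, 3, 0), (10, 3, 4),
]
# Snapshot number -> reference count, from the snapshot list on the slide.
REFCOUNTS = {1: 0, 2: 1, 3: 0, 4: 2}

def reclaimable(items, refcounts):
    """Values of dead items that no still-referenced snapshot can see."""
    open_sns = [sn for sn, rc in refcounts.items() if rc > 0]
    oldest_open = min(open_sns) if open_sns else None
    return [
        value
        for value, born, dead in items
        if dead != 0 and (oldest_open is None or dead <= oldest_open)
    ]

garbage = reclaimable(ITEMS, REFCOUNTS)  # [1, 4]
```

Here Sn=2 is the oldest snapshot still in use, so V=5 and V=10 must wait even though they are already marked dead.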
Safe Memory Reclamation
• Accessors that started earlier and are still alive can potentially hold references to GCed items
• Freeing GCed items/nodes can cause dangling references
• The memory reclaimer has to make sure that no accessor is holding a reference to GCed items
• This problem does not occur in garbage-collected languages
• A lock-free SMR algorithm takes care of safely freeing resources
• Details of the SMR algorithm are available in the Nitro VLDB16 paper
Nitro Backup
[Figure: non-intrusive backup. Backup worker-1, Backup worker-2, and Backup worker-3 write File-1, File-2, and File-3; the GC and delta files are also shown.]
Nitro Recovery
• Concurrent bottom-up skiplist build
• Avoids unnecessary CAS conflicts during concurrent insert
• Snapshot number starts from Sn=1
• Once build is complete, additional items are inserted by replaying inserts from delta files concurrently
[Figure: recovery from backup files File-1, File-2, and File-3]
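The idea behind a bottom-up build is that sorted input never needs a search: each new node is appended to the tail of every level it belongs to. The sketch below is single-threaded and uses assumed names, an assumed level cap of 8, and an assumed 0.5 promotion probability; the concurrent variant used during recovery is not shown.

```python
import random

MAX_LEVEL = 8  # assumed cap on tower height for this sketch

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level

def build_bottom_up(sorted_keys, seed=0):
    """Link already-sorted keys into a skiplist in O(n), with no searches
    and therefore no compare-and-swap retries."""
    rng = random.Random(seed)
    head = Node(None, MAX_LEVEL)
    tails = [head] * MAX_LEVEL  # current last node on each level
    for key in sorted_keys:
        level = 1
        while level < MAX_LEVEL and rng.random() < 0.5:
            level += 1
        node = Node(key, level)
        for lvl in range(level):
            tails[lvl].next[lvl] = node  # append at the tail of this level
            tails[lvl] = node
    return head

def level_keys(head, lvl):
    out, node = [], head.next[lvl]
    while node is not None:
        out.append(node.key)
        node = node.next[lvl]
    return out

keys = [10, 15, 20, 30, 32]
head = build_bottom_up(keys)
bottom = level_keys(head, 0)
```

Compared with inserting items one by one, this skips the O(log n) search per item, which is where concurrent inserts would otherwise collide.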
Benefits of Nitro
• Lock-free operations allow the storage engine to scale seamlessly with multicore CPUs
• Single index performance can be scaled by assigning more update workers
• The Nitro MVCC model provides fixed storage overhead per update/insert operation
• Fast snapshotting capability allows very low indexing latency between Data service and Index service
• Nitro provides a scalable lock-free garbage collector and safe memory reclaimer
• Nitro features a scalable online concurrent backup and fast recovery mechanism
Nitro GSI Integration
GSI Data Structures
The storage engine needs to maintain two storage structures:
• Reverse map: used to look up and remove the previous index entry for the docid during an update
• Index store: maintains the ordered index entries used by index scans
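A minimal sketch of how the two structures cooperate on an update. A sorted Python list stands in for the ordered index store; the class and method names are hypothetical, the document IDs and city names are borrowed from the Storage Optimizations slide, and "SantaClara" is a made-up new value.

```python
import bisect
from typing import Dict, List, Tuple

class SecondaryIndex:
    def __init__(self):
        self.reverse_map: Dict[str, str] = {}     # docid -> currently indexed key
        self.entries: List[Tuple[str, str]] = []  # sorted (key, docid) pairs

    def upsert(self, docid: str, key: str) -> None:
        old = self.reverse_map.get(docid)
        if old == key:
            return  # key unchanged: nothing to do
        if old is not None:
            # The reverse map tells us exactly which entry to remove.
            del self.entries[bisect.bisect_left(self.entries, (old, docid))]
        bisect.insort(self.entries, (key, docid))
        self.reverse_map[docid] = key

    def scan(self, low: str, high: str) -> List[str]:
        """Docids whose key lies in [low, high], in key order."""
        start = bisect.bisect_left(self.entries, (low, ""))
        out = []
        for key, docid in self.entries[start:]:
            if key > high:
                break
            out.append(docid)
        return out

idx = SecondaryIndex()
idx.upsert("emp_005", "MountainView")
idx.upsert("emp_008", "Sunnyvale")
idx.upsert("emp_005", "SantaClara")  # hypothetical update to emp_005
hits = idx.scan("S", "T")
```

Without the reverse map, the indexer would have no way to find the stale MountainView entry, since the mutation itself carries only the new document.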
Memory Optimized Index update pipeline
Scalable write performance using multiple writers
Simple hash table used for reverse map instead of Nitro (Avoid concurrency overheads)
Periodic backup persists only (indexItem, docid)
The reverse map can be reconstructed on the fly during recovery
End-to-end Indexing latency ~20ms
[Figure: mutations are routed by hash(docid) % n to writer-1 … writer-n, each with its own hash table (HT); the writers update the Nitro index, which serves index scans.]
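The hash(docid) % n routing can be sketched as follows. CRC32 is an assumed choice of hash for routing, a plain Python set stands in for the shared Nitro index, and the Writer class and sample mutations are hypothetical.

```python
import zlib
from typing import Dict, Set, Tuple

class Writer:
    """One writer with its private reverse map (the "HT" in the figure)."""
    def __init__(self):
        self.reverse_map: Dict[str, str] = {}

    def apply(self, docid: str, key: str, index: Set[Tuple[str, str]]) -> None:
        old = self.reverse_map.get(docid)
        if old is not None:
            index.discard((old, docid))  # drop the previous entry
        index.add((key, docid))
        self.reverse_map[docid] = key

def writer_for(docid: str, n: int) -> int:
    """Route every mutation of a given docid to the same writer."""
    return zlib.crc32(docid.encode("utf-8")) % n

N = 4  # assumed number of writers
writers = [Writer() for _ in range(N)]
index: Set[Tuple[str, str]] = set()  # stand-in for the shared index

mutations = [
    ("emp_005", "MountainView"),
    ("emp_008", "Sunnyvale"),
    ("emp_005", "SantaClara"),  # hypothetical later update
]
for docid, key in mutations:
    writers[writer_for(docid, N)].apply(docid, key, index)
```

Because a docid always maps to the same writer, each reverse map has a single owner and needs no synchronization, which matches the slide's point about avoiding concurrency overheads.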
Storage Optimizations
Reverse map (HT):
DocID      Indexed Item
emp_005    MountainView
emp_008    Sunnyvale
Nitro index entries: MountainView:emp_005, Sunnyvale:emp_008
Hash table layout: CRC32 hash (hash1, hash2) → node pointers
Direct pointers from hash table to index entry
Storage needed for index maintenance reduced ~50%
Index item delete cost reduced from O(logn) to O(1)
Optimized multi-entry indexing from single document
Performance & Scalability
Let us see the numbers!
Nitro performance
• Almost linear scaling of throughput with number of cores
[Figures: insert benchmark; lookup benchmark]
Nitro performance
• Partitioning is not required to scale single index performance
[Figures: Get with background inserts; throughput scalability with partitions]
Memory Optimized Index vs Standard Index – End-to-End
• 4 Data nodes, 1 Index node, 32 cores CPU (Intel(R) Xeon(R) E5-2630 v3 @ 2.40GHz)
• Index service node keeps up with mutations from 4 Data service nodes
GSI index server throughput (items/sec)
Operation    Throughput
Insert       1,658,031
Update       822,680
Delete       1,578,316
Single Index benchmark
MOI Write Throughput = 1.6M/s 800k/s
Nitro recovery performance
Memory-Optimized Index
Operational perspective
Operational Aspects
• Memory-Optimized Index can be configured using a cluster-wide setting
• What happens when an index node runs out of memory?
• What happens to the indexes once Couchbase Server is restarted?
• What is the recommended DRAM/CPU configuration for using MOI?
Summary
• Couchbase GSI allows data services and index services to scale independently with workload isolation
• Couchbase 4.5 features Memory-Optimized Indexes, which can provide superior index performance by seamlessly scaling with many CPU cores and large amounts of DRAM
• Introduced the Nitro storage engine with the following features:
  • Multiple writers and lock-free operations
  • Fast snapshotting with lightweight MVCC and a concurrent garbage collector
  • Concurrent, non-intrusive, fast backup and restore
• Memory-Optimized Index leverages storage optimizations to reduce the index's memory consumption and to generate compact file backups
• Showcased Nitro and Memory-Optimized Index end-to-end performance
  • It takes only a few minutes to build large indexes!
• For more details on Nitro, refer to the Nitro VLDB16 paper (http://www.vldb.org/pvldb/vol9/p1413-lakshman.pdf)
Thank You!