USING LOCK-FREE AND WAIT-FREE IN-MEMORY ALGORITHMS TO TURBO-CHARGE HIGH VOLUME DATA MANAGEMENT
HENNING ANDERSEN, STIBO SYSTEMS A/S
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
BIO
• 20 years of professional career at Stibo Systems A/S
• Developed software for the last 30+ years
• Technical lead on many projects, including:
  • Migrating from C++ to Java platform (performance & scalability)
  • Establishing a component platform
  • In-Memory component
GET TO KNOW STIBO SYSTEMS
Travel/Hospitality, Distribution, Retail, Manufacturing
OUR GROWING FAMILY
2015 MQ MDM OF PRODUCT DOMAIN
COMPLETE, SEAMLESS MULTIDOMAIN MDM SOLUTION
INTEGRATING IN-MEMORY INTO STEP
[Diagram: STEP Server (J2EE) with its DB Server, shown without and with the off-heap In-Memory DB.]
BENCHMARK RESULTS
[Charts: Large Retailer Data, Large Distributor Data, Scalability Test Data]
REQUIREMENTS
• Great performance
• Compact memory layout: data, per-entry overhead
• Lookup by key
• Complex querying, indexing
• “Friendly” to our existing architecture
• Fast startup/initialization
PERFORMANCE BY SIMPLICITY
• MVCC/Immutability
• Wait-free index scans
• Code Generation
• Custom API/Direct Access
mov (%rdi,%r11,1),%r11
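The single mov instruction on the slide suggests that a field read on off-heap data can compile down to one load. A minimal sketch of that idea, assuming a hypothetical fixed-size entry layout (an 8-byte TSN followed by an 8-byte value) stored in a direct ByteBuffer; the class name, offsets and layout are illustrative and are not taken from STEP:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical off-heap entry store: fields are read in place, no object copies. */
final class OffHeapEntrySketch {
    static final int TSN_OFFSET = 0;    // assumed layout: 8-byte TSN
    static final int VALUE_OFFSET = 8;  // assumed layout: 8-byte value
    static final int ENTRY_SIZE = 16;

    private final ByteBuffer memory;

    OffHeapEntrySketch(int capacity) {
        // Direct buffers live outside the Java heap, so the GC never moves or scans the data.
        memory = ByteBuffer.allocateDirect(capacity * ENTRY_SIZE).order(ByteOrder.nativeOrder());
    }

    void write(int slot, long tsn, long value) {
        memory.putLong(slot * ENTRY_SIZE + TSN_OFFSET, tsn);
        memory.putLong(slot * ENTRY_SIZE + VALUE_OFFSET, value);
    }

    /** Reads one field directly at base + offset. */
    long tsn(int slot) {
        return memory.getLong(slot * ENTRY_SIZE + TSN_OFFSET);
    }

    long value(int slot) {
        return memory.getLong(slot * ENTRY_SIZE + VALUE_OFFSET);
    }
}
```

Whether the JIT really emits a single load for such an accessor depends on the JVM and on inlining, so treat this as an illustration of the access pattern rather than a performance claim.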
BASIC HASH TABLE CLOSED ADDRESSING
[Diagram: the Bucket Table slot at hash(key)%tablesize points to an entry with fields Next, Key, Value. Entries shown: Key=K1 Value=10, Key=K3 Value=20, Key=K4 Value=30.]
BASIC HASH TABLE COLLISION
[Diagram: two keys map to the same Bucket Table slot at hash(key)%tablesize; the entries Key=K3 Value=20 and Key=K1 Value=10 are chained through Next.]
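The two hash table slides can be sketched as a small chained table. This is a minimal illustration, not the STEP implementation; the class name and the String/int key and value types are assumptions made for the example:

```java
/** Minimal chained ("closed addressing") hash table, for illustration only. */
final class ChainedHashTableSketch {
    /** One chain entry with the Next, Key and Value fields from the slide. */
    private static final class Entry {
        final String key;
        int value;
        Entry next;

        Entry(String key, int value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry[] buckets;

    ChainedHashTableSketch(int tableSize) {
        buckets = new Entry[tableSize];
    }

    /** hash(key) % tablesize, kept non-negative. */
    private int bucketOf(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(String key, int value) {
        int b = bucketOf(key);
        for (Entry e = buckets[b]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                e.value = value; // overwrite in place (no versioning here)
                return;
            }
        }
        // Collision or empty bucket: the new entry becomes the head of the chain.
        buckets[b] = new Entry(key, value, buckets[b]);
    }

    Integer get(String key) {
        for (Entry e = buckets[bucketOf(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

With a table size of 1 every key collides, which makes the chain-walking path easy to exercise.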
MVCC HASH TABLE
TSN = Transaction Sequence Number
[Diagram: the Bucket Table slot at hash(key)%tablesize points to an entry with fields Next, Prev, TSN, Key=K1, Value=10. The Transaction Table has columns Tx ID and TSN. Published TSN = 2.]
TRANSACTION/COMMIT PHASES
Commit Phases: Prepare, Commit, Finish, Publish, Vacuum, Abort
[Diagram: four STEP nodes, each with its own In-Memory DB, and a Leader.]
TRANSACTION/COMMIT PHASES
Commit Phases: Prepare, Commit, Finish, Publish, Vacuum, Abort
[Diagram: four STEP nodes, each with its own In-Memory DB, and a Leader; the label TSN=3 appears three times.]
MVCC HASH TABLE UPDATE - PUT(K1,15)
Phases shown: Prepare, Finish, Publish
Starting state: the Bucket Table slot at hash(key)%tablesize points to the entry (TSN=2, Key=K1, Value=10). The Transaction Table holds Tx ID UUID=1234 with no TSN yet. Published TSN = 2.
Prepare: a new entry (TSN=Infinite, Key=K1, Value=15) is created.
Finish:
1. Pull new TSN (the Transaction Table row for UUID=1234 gets TSN=3)
2. Apply TSN (the new entry becomes TSN=3, Key=K1, Value=15)
3. Link to Prev
4. Update Bucket Table
MVCC HASH TABLE READER
Reader TSN=2, Lookup K1 (Published TSN = 2)
[Diagram: entries (TSN=3, Key=K1, Value=15) and (TSN=2, Key=K1, Value=10) linked through Prev; Transaction Table holds Tx ID UUID=1234 with TSN=3.]
A reader at TSN=2 skips the TSN=3 version and sees Value=10.
MVCC HASH TABLE UPDATE (PUBLISH)
Publish: Update Published TSN (2 becomes 3)
[Diagram: entries (TSN=2, Key=K1, Value=10) and (TSN=3, Key=K1, Value=15); Transaction Table holds Tx ID UUID=1234 with TSN=3.]
MVCC HASH TABLE READER
Reader TSN=3, Lookup K1 (Published TSN = 3)
[Diagram: entries (TSN=3, Key=K1, Value=15) and (TSN=2, Key=K1, Value=10) linked through Prev; Transaction Table holds Tx ID UUID=1234 with TSN=3.]
A reader at TSN=3 sees the new version, Value=15.
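The PUT(K1,15) walk-through and the two reader slides can be sketched as follows. This is a simplified illustration under stated assumptions, not the STEP code: a single writer thread, on-heap objects instead of off-heap memory, and hypothetical names (MvccMapSketch, prepare, finish, publish). The Transaction Table is reduced to a counter.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Illustrative MVCC map: single writer, concurrent readers (assumption). */
final class MvccMapSketch {
    static final long INFINITE = Long.MAX_VALUE;

    static final class Entry {
        final String key;
        final int value;
        volatile long tsn = INFINITE; // invisible to every reader until Finish
        volatile Entry prev;          // older version of the same key
        volatile Entry next;          // next key in the same bucket chain

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final AtomicReferenceArray<Entry> buckets;
    private long lastTsn;               // touched by the single writer only
    private volatile long publishedTsn; // readers take their snapshot TSN from here

    MvccMapSketch(int tableSize, long initialTsn) {
        buckets = new AtomicReferenceArray<>(tableSize);
        lastTsn = initialTsn;
        publishedTsn = initialTsn;
    }

    private int bucketOf(String key) {
        return Math.floorMod(key.hashCode(), buckets.length());
    }

    long publishedTsn() {
        return publishedTsn;
    }

    /** Prepare: build the new version; it is not reachable yet. */
    Entry prepare(String key, int value) {
        return new Entry(key, value);
    }

    /** Finish: steps 1 to 4 from the slides. */
    long finish(Entry fresh) {
        long tsn = ++lastTsn;                 // 1. pull new TSN
        fresh.tsn = tsn;                      // 2. apply TSN
        int b = bucketOf(fresh.key);
        Entry head = buckets.get(b);
        Entry pred = null;
        Entry cur = head;
        while (cur != null && !cur.key.equals(fresh.key)) {
            pred = cur;
            cur = cur.next;
        }
        if (cur == null) {                    // first version of this key
            fresh.next = head;
            buckets.set(b, fresh);
        } else {
            fresh.prev = cur;                 // 3. link to prev
            fresh.next = cur.next;
            if (pred == null) {
                buckets.set(b, fresh);        // 4. update bucket table
            } else {
                pred.next = fresh;
            }
        }
        return tsn;
    }

    /** Publish: new readers start seeing versions up to this TSN. */
    void publish(long tsn) {
        publishedTsn = tsn;
    }

    /** Returns the newest version of key with TSN <= readerTsn, or null. */
    Integer get(String key, long readerTsn) {
        for (Entry e = buckets.get(bucketOf(key)); e != null; e = e.next) {
            if (e.key.equals(key)) {
                for (Entry v = e; v != null; v = v.prev) {
                    if (v.tsn <= readerTsn) {
                        return v.value;
                    }
                }
                return null;
            }
        }
        return null;
    }
}
```

A reader never takes a lock here: it only follows Next and Prev pointers and compares TSNs, which is why a reader that started at TSN=2 keeps seeing Value=10 even after the writer has linked in the TSN=3 version.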
INDEXING USING SKIP LISTS
SKIP LISTS
[Diagram: skip list H, 10, 20, 30, 40, 50]
• 50% have height >= 2
• 25% have height >= 3
• Head height ~= log2(n)
• Find Value=30? At each step test: 30 >= next.value?
SKIP LISTS - INSERTION
[Diagram: inserting 15 into the skip list H, 10, 20, 30, 40, 50]
• Pick random height
SKIP LISTS – INSERTION RESULT
[Diagram: skip list H, 10, 15, 20, 30, 40, 50]
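The skip list slides can be sketched with a small single-threaded implementation. It is a minimal textbook version, assuming int values, a fixed maximum height of 16 and a coin flip per level (so about 50% of nodes have height >= 2 and 25% have height >= 3); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Textbook single-threaded skip list over int values, for illustration only. */
final class SkipListSketch {
    private static final int MAX_HEIGHT = 16;

    private static final class Node {
        final int value;
        final Node[] next; // next[level] is the successor at that level

        Node(int value, int height) {
            this.value = value;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT);
    private final Random random;

    SkipListSketch(long seed) {
        random = new Random(seed);
    }

    /** Pick random height: each extra level is kept with probability 1/2. */
    private int randomHeight() {
        int h = 1;
        while (h < MAX_HEIGHT && random.nextBoolean()) {
            h++;
        }
        return h;
    }

    /** Top-down search; records the last node before value on every level. */
    private Node[] predecessors(int value) {
        Node[] preds = new Node[MAX_HEIGHT];
        Node cur = head;
        for (int level = MAX_HEIGHT - 1; level >= 0; level--) {
            while (cur.next[level] != null && cur.next[level].value < value) {
                cur = cur.next[level];
            }
            preds[level] = cur;
        }
        return preds;
    }

    boolean contains(int value) {
        Node candidate = predecessors(value)[0].next[0];
        return candidate != null && candidate.value == value;
    }

    void insert(int value) {
        Node[] preds = predecessors(value);
        Node node = new Node(value, randomHeight());
        // Link bottom-up, so level 0 always contains every node.
        for (int level = 0; level < node.next.length; level++) {
            node.next[level] = preds[level].next[level];
            preds[level].next[level] = node;
        }
    }

    /** Walks level 0, which lists all values in sorted order. */
    List<Integer> values() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next[0]; n != null; n = n.next[0]) {
            out.add(n.value);
        }
        return out;
    }
}
```

Inserting 10, 20, 30, 40, 50 and then 15 yields the ordering from the insertion result slide regardless of the random heights, because the heights only affect how fast the search skips ahead.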
MVCC INDEXING USING SKIP LISTS
Phase: Finish, step 5. Update Index
[Diagram: Bucket Table slot at hash(key)%tablesize; entries (TSN=2, Key=K1, Value=10) and (TSN=3, Key=K1, Value=15), each with an Index part; Transaction Table holds Tx ID UUID=1234 with TSN=3; Published TSN = 2.]
SKIP LISTS – 5. UPDATE INDEX
[Diagram: a skip list over Value with head H and index levels L0, L1, L2. Each entry carries Next, Prev, TSN, Key, Value. The new version (Value=15, Key=K1, TSN=3) is linked in between 10 and 20.]
Value  Key  TSN
10     K1   2
15     K1   3
20     K3   2
30     K4   2
40     K5   2
50     K6   2
SKIP LISTS – INSERTION RESULT
[Diagram: the same skip list with the Value=15 entry linked in.]
SKIP LISTS – FIND [12-25], TSN=2
[Diagram: range scan over Values 12 to 25 by a reader at TSN=2, over the index entries 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2).]
SKIP LISTS – FIND [12-25], TSN=3
[Diagram: the same range scan by a reader at TSN=3.]
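The two FIND [12-25] slides can be sketched as a range scan that filters index entries by snapshot visibility. The sketch makes several assumptions that are not on the slides: it uses the JDK's ConcurrentSkipListMap as a stand-in for the custom index, it assumes index values are unique, and it records visibility with a hypothetical supersededTsn field rather than whatever mechanism STEP uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Illustrative MVCC range scan; the JDK skip list map stands in for the custom index. */
final class MvccRangeScanSketch {
    static final long INFINITE = Long.MAX_VALUE;

    static final class IndexEntry {
        final String key;
        final int value;
        final long tsn;                         // TSN that created this version
        volatile long supersededTsn = INFINITE; // hypothetical: TSN of the newer version

        IndexEntry(String key, int value, long tsn) {
            this.key = key;
            this.value = value;
            this.tsn = tsn;
        }
    }

    // Assumes unique index values, which holds for the slide's example data.
    private final ConcurrentSkipListMap<Integer, IndexEntry> index = new ConcurrentSkipListMap<>();

    /** Adds a version to the index; older is the version it replaces, or null. */
    IndexEntry add(String key, int value, long tsn, IndexEntry older) {
        IndexEntry fresh = new IndexEntry(key, value, tsn);
        index.put(value, fresh);
        if (older != null) {
            older.supersededTsn = tsn;
        }
        return fresh;
    }

    /** Returns the keys whose visible version has a value in [from, to]. */
    List<String> find(int from, int to, long readerTsn) {
        List<String> keys = new ArrayList<>();
        for (IndexEntry e : index.subMap(from, true, to, true).values()) {
            // Visible if created at or before the snapshot and not yet replaced in it.
            boolean visible = e.tsn <= readerTsn && readerTsn < e.supersededTsn;
            if (visible) {
                keys.add(e.key);
            }
        }
        return keys;
    }
}
```

With the slide's data, a reader at TSN=2 scanning [12-25] passes over the Value=15 entry because its TSN is 3, while a reader at TSN=3 includes it. Old and new versions coexist in the index, so neither reader needs a lock.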
LOCK-FREE INSERTIONS SUMMARY
• CAS (compare-and-swap) on previous entity: one winner
• Bottom-up preserves skip-list for every level, allowing wait-free readers
• Help vacuum ensures lock-freedom
[Diagram: 15 and 17 are both being inserted after 10 in the skip list H, 10, 20, 30, 40; result H, 10, 15, 17, 20, 30, 40.]
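The "CAS on previous entity, one winner" idea can be sketched on a single level of the list. This is a minimal insert-only illustration using java.util.concurrent.atomic.AtomicReference; it omits the upper skip-list levels, deletions and the help-vacuum step from the slide, and the class and method names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Insert-only sorted linked list with CAS on the predecessor's next pointer. */
final class CasInsertSketch {
    private static final class Node {
        final int value;
        final AtomicReference<Node> next = new AtomicReference<>();

        Node(int value) {
            this.value = value;
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE);

    void insert(int value) {
        Node fresh = new Node(value);
        while (true) {
            // Find the insertion point: pred.value < value <= succ.value.
            Node pred = head;
            Node succ = pred.next.get();
            while (succ != null && succ.value < value) {
                pred = succ;
                succ = pred.next.get();
            }
            fresh.next.set(succ);
            // Only one of several racing inserters after pred can win this CAS.
            if (pred.next.compareAndSet(succ, fresh)) {
                return;
            }
            // Lost the race: pred.next changed under us, so search again and retry.
        }
    }

    /** Readers just follow next pointers; they never wait for writers. */
    List<Integer> values() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next.get(); n != null; n = n.next.get()) {
            out.add(n.value);
        }
        return out;
    }

    /** Runs the slide's scenario: 15 and 17 inserted by two threads after 10. */
    static List<Integer> demo() {
        CasInsertSketch list = new CasInsertSketch();
        for (int v : new int[] {10, 20, 30, 40}) {
            list.insert(v);
        }
        Thread a = new Thread(() -> list.insert(15));
        Thread b = new Thread(() -> list.insert(17));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return list.values();
    }
}
```

Whichever thread loses the CAS simply retries against the updated list, so some thread always makes progress. Because a new node's next pointer is set before the node becomes reachable, a reader walking the list sees either the old or the new state, never a broken link.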
VACUUM, EPOCH BASED DEFERRED RECLAMATION
[Diagram: Bucket Table, Transaction Table (Tx ID UUID=1234, TSN=3), Published TSN 3 (previously 2), and the skip list index from H to T over entries 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2).]
Snapshot Registry columns: Reader, TSN, Epoch. The slides step through:
1. Vacuum wait: registry holds Thread=1234 (TSN=2, Epoch=17) and Thread=1235 (TSN=3, Epoch=17)
2. Vacuum wait: Thread=1234 has left; registry holds Thread=1235 (TSN=3, Epoch=17)
3. Vacuum phase 1: registry still holds Thread=1235 (TSN=3, Epoch=17)
4. Vacuum epoch wait: registry holds Thread=2345 (TSN=3, Epoch=18) and Thread=1235 (TSN=3, Epoch=17)
5. Vacuum epoch wait: Thread=1235 has left; registry holds Thread=2345 (TSN=3, Epoch=18)
6. Vacuum phase 2
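The registry states on the vacuum slides can be sketched as a small snapshot registry. This is an illustration under assumptions: the slides do not say what vacuum phases 1 and 2 do, so the sketch follows the usual epoch-based pattern (phase 1 unlinks an old version, phase 2 frees it once no earlier reader can still hold a pointer to it). All class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative snapshot registry for epoch-based deferred reclamation. */
final class SnapshotRegistrySketch {
    private static final class Snapshot {
        final long tsn;
        final long epoch;

        Snapshot(long tsn, long epoch) {
            this.tsn = tsn;
            this.epoch = epoch;
        }
    }

    private final ConcurrentHashMap<Long, Snapshot> readers = new ConcurrentHashMap<>();
    private final AtomicLong epoch;

    SnapshotRegistrySketch(long initialEpoch) {
        epoch = new AtomicLong(initialEpoch);
    }

    /** A reader registers its snapshot TSN together with the current epoch. */
    void enter(long threadId, long readerTsn) {
        readers.put(threadId, new Snapshot(readerTsn, epoch.get()));
    }

    void exit(long threadId) {
        readers.remove(threadId);
    }

    /** Vacuum wait: true once no registered reader needs versions older than tsn. */
    boolean noReaderBelow(long tsn) {
        return readers.values().stream().noneMatch(s -> s.tsn < tsn);
    }

    /**
     * Called after phase 1 (assumed: unlinking). Returns the epoch that must
     * drain before phase 2; readers entering from now on get the next epoch.
     */
    long advanceEpoch() {
        return epoch.getAndIncrement();
    }

    /** Epoch wait: true once no reader from retiredEpoch or earlier remains. */
    boolean epochDrained(long retiredEpoch) {
        return readers.values().stream().noneMatch(s -> s.epoch <= retiredEpoch);
    }
}
```

Replaying the slides: with readers 1234 (TSN=2) and 1235 (TSN=3) registered in epoch 17, noReaderBelow(3) is false until 1234 exits. After phase 1, advanceEpoch() returns 17; a new reader 2345 registers in epoch 18, and epochDrained(17) becomes true only after 1235 exits. The vacuum thread polls instead of blocking readers, so readers never wait on it.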
MVCC SUMMARY
• Map and indexes both under MVCC
• Index scans are wait-free (and simple/fast)
• Insert/update/delete are lock-free
• Automated reclamation of storage
EFFICIENT AND SAFE API
transactionManager.read((snapshot) -> {
    QueryIterator<ProductCO> products = snapshot.query(ProductCO._ID.range("IMC", "Stibo"));
    while (products.next()) {
        CacheEntry<ProductCO> entry = products.entry();
        long typeId = entry.longValue(ProductCO::getObjectType);
        CacheEntry<ObjectTypeCO> type = snapshot.get(typeId);
        // can do gets, queries etc. on the same snapshot safely for all kinds of objects
    }
});

public class ProductCO {
    long getObjectType(ValuePointer ptr) { … }
}

• No object copies, no GC, efficient access
• Often JVM can inline entire query to one native method
DIY USEFUL LEARNING
• Memory model (Java differs from C++) and CAS operations
• Assembly
• CPU memory architecture
• Wait-free and lock-free algorithms
• Enumerate all states
• Think about state transitions
• Try to formally prove it right
• Deletions are often the trickiest part
• Do not even think about “this will never happen”, because it will
IN-MEMORY VENDOR QUESTIONS
• Direct access to data or only access to copies of data? And direct access to individual fields in an entry?
• Index/Query engine MVCC consistent with map gets and/or additional queries?
• Will index scans/queries acquire locks?
• Will index inserts acquire locks?
• Will map get/put operations acquire locks?
• Memory overhead per entry?
• Memory overhead per index (per entry)?
• How do you avoid memory fragmentation?
• Do you lock pages in memory and use huge/large pages?