FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40...

28
FOEDUS: OLTP Engine for a Thousand Cores and NVRAM Hideaki Kimura HP Labs, Palo Alto, CA Slides By : Hideaki Kimura Presented By : Aman Preet Singh

Transcript of FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40...

Page 1: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

FOEDUS:OLTPEngineforaThousandCoresandNVRAM

HideakiKimuraHPLabs,PaloAlto,CA

SlidesBy:HideakiKimuraPresentedBy:AmanPreetSingh

Page 2: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 4/16

Next-Generation Server Hardware?

HPThe Machine

UC BerkeleyFirebox

IntelRack Scale

Architecture...

Differences Commonalities

• HP Photonics? Infiniband? QPI?• HP Memristor? PCM? Flash?• ...

• 1,000s of CPU Cores• Fast Interconnect• Huge Low-Latency NVRAM• Endurance will be an issue

Page 3: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

GAP ?• Traditional OLTP databases do not scale to large number of

cores.• Recent main-memory databases achieve better scalability but

they do not efficiently handle NVRAM.

Page 4: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

GAP?• Softwareshouldplacedatawisely.• Systemonachip.• Systemscapableofhandlingcachein-coherrent architectures.• Non-uniformMemoryaccess(NUMA)awaresystems.

• Aimshouldbetotrytomakethedatalocal.

Page 5: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

Solution ?• Toprovidethesecapabilities,FOEDUSwasbuilt.• Groundupdatabase• Opensource• FullyACID• Designedtoscaleuptothousandsofcores• MakesthebestuseofDRAMforOLTP,• AndNVRAMforOLAPqueries.

Page 6: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

FOEDUS Key Principles• Bringin-memoryspeedtoNVRAM• SILO-likeLightweightOptimisticConcurrencyControl

• Dual-Page:physicallyindependent,logicallyequivalent• MutableVolatilePagesinDRAM• ImmutableSnapshotPagesinNVRAM

• Master-Tree:SimpleandScalableOCCforNVRAM• Masstree +FosterB-Tree+Foster-Twin

Page 7: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 6/16

SILO: Lightweight/Decentralized OCC [Tu et al]

Read Version

Fence

Read Data

Fence

Read VersionWrite Epoch

Write Data

(consumeor acquire)

(acquire)

Same Version?

yes, commit

no, abort(retry).

Writer Side(After Commit)

Com

mit

Tim

e

EpochStatus-

bits Data (Payload)Version (64 bits)R

ecor

d

Fence

(==unlock)

Page 8: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 7/16

SILO: Decentralized Logger with Epoch [Tu et al]

core

SOC

Circular Log Buffer DiskEpoch

X~YLog

Durable Committed Tail Epoch Y~ZLog

Epoch File

- Global DurableEpoch

- Offsets etc

Log Writer

Log Manager

.........

For each epoch(10ms~50ms)

fsync

Page 9: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 8/16

SILO’s OCC Benefits

9 Extremely Lightweight and ScalableBlock-free (lock-free with retries)No Write for Reads (cf., read-lock)

9 In-page Lock MechanismNo centralized lock managerCache Friendly

But, how can we go beyond DRAM?

Page 10: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 9/16

In-Memory DBMS Beyond DRAM

DiskPages

All Key/All

Index

Tuple Data

a) Traditional Databases b) H-Store/Anti-Cache

Evict/Page-In

Evict/Retrieve

Bufferpool

DRAM

NVM

Xct

ColdStore

d) FOEDUSc) Hekaton/Siberia

BloomFilters

Volatilepages

Snapshotpages

Dual Pages

HotStore

[Pavlo et al.]

[Larson et al.]

Physically Dependent

No LogicalEquivalence

Page 11: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 10/16

Logically Equivalent, Physically Independent Dual�Volatile Page: Mutable, on DRAM�Snapshot Page: Immutable, on NVRAM

NULL NULL

X NULL

NULL Y

X Y

: nullptr

: Volatile-Page is just made. No Snapshot yet.

: Snapshot-Page is latest. No modification.

: X is the latest truth, but Y may be equivalent.

VolatilePointer

SnapshotPointer

Dro

p Vo

lati

le (i

f X≡Y

)

Mod

ify/

Add

Inst

all-

SP

Page 12: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 18/16

Log Gleaner

Transactions

On NVRAM

Epoch X~Y

Log FileLog W

riter

StratifiedSnapshots

In SOC-2

Dual Pages

Snapshot Cache

On DRAM

VolatilePages

0xABCD

SP1,PID.

null

SP2,PID.

SP1 SP2 SP3

Epoch Y~Z

Log File

Vola. Ptr.

Snap. Ptr.

SnapshotPages

SOC-

1

SOC-

2

SequentialDump

TID Tuple

TID Tuple

Read-Set

Write-Set

Durable CurrentCommitted

SILO-like DecentralizedLogging and OCC

MasterTree

Page 13: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 19/16

1. Assign Partitions

2. Run Mappers/Reducers

3. Combine Root/Metadata

4. Install/Drop Pointers

Part. 1

Part. 2

Log Files

Part. 1

Part. 2

Log Files

SortedRuns

Log Files

Part. 1

Part. 2

Buff.

New Snapshot

Mapper

Reducer 1 Reducer 2

Mapper

Part 1 Part 2

Batch-Apply

Constructing Stratified Snapshot from Logical Logs

Page 14: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 20/16

Benefits of Stratified Snapshots

¾Serializable transactions check only a single version of data ↔ LSM-Trees¾Drastically more scalable and efficient construction of NVRAM-resident data pages¾Separate from transactions/logging

¾Everything Batched, Processed in a tight loop¾Large Sequential Write only. No frequent flush.

Page 15: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 21/16

Master-Tree in a Nutshell

�Mass-tree: Cache Crafty B-tree and OCC with in-page Lock Mechanism.�Foster B-tree: Single incoming pointer via foster-child to ease page in/out.�Foster-Twins: Drastically Simplifies OCC and Reduces Aborts/Retries.

Page 16: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 22/16

OCC Problem 1: Unnecessary Retries/Aborts

Mass-tree

1 10 1 5 5 10

RCU

1 100

Further, SILO must take Page-Version set and abort if changed by pre-commit. Lots of Aborts!

Global Retry!

Local Retry!

TID 7 TID 7

Page 17: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 24/16

Foster Twins here to Save You!

• Strong Invariants to drastically simplify OCCand eliminate unnecessary aborts/retries

• Still everything in-page• No per-tuple GC or compaction/migration• No centralized data structure

Page 18: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 25/16

Mass-tree

1 10 1 5 5 10

RCU

Foster B-Tree

1015

5 10

Adopt

Bw-Tree

Master-Tree

1 101 5 5 10

Retired1 10

Adopt

1 100 1 100

1 100 1 100

PID Ptr

P

Q

Mapping Table

Foster-child

1 5 5 10

FosterTwins

Pre-commit Verify

TrackMoved

• All data/metadata placed in-page

• Immutable range• All information

always track-able• All invariants

recursively hold

P

10

Q1

55 10

Split Δ

Moved

Page 19: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 26/16

Foster-Twins Commit Protocol

Page 20: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 11/16

Dual Pages: Benefits

9 Snapshot Pages are ImmutableDrastically simplifies Caching/Replication/Traversal/Commit/etc

9 Physically IndependentTransactions never interfere w/ construction/retrieval/eviction of Snapshot Pages

9 Logically Equivalent� No Bloom Filter or global data needed for Serializability� Volatile Pages guaranteed to contain latest data� Snapshot Pages guaranteed to be complete as of snapshot-epoch

↔ LSM-Trees

Page 21: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 12/16

Experiments

� TPC-C, Serializable� vs. H-Store, SILO, Shore-MT� (Traditional DBs (e.g., MySQL) are too slow to compare with.)

� HP Superdome X (DragonHawk):16 sockets, 240 cores, 12TB DRAM.

� Emulated NVRAM

Page 22: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 13/16

Experiment 1: vs In-Memory DBMS

1.E+03

1.E+04

1.E+05

1.E+06

1.E+07

0 20 40 60

TPC-

C Th

roug

hput

[TPS

]

Remote Transactions [%]

Observations1. SILO-style OCC much more

resilient to contention2. 100x~ faster than H-Store.3. More Cores = More Speedup

FOEDUS

SILO

H-Store

CoresSpeed-up[/H-Store]

16 33x

60 66x

240 394x

Page 23: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 14/16

Experiment 2: DBMS on NVRAM (emulated)

1.E+04

1.E+05

1.E+06

1.E+07

1.E+02 1.E+03 1.E+04 1.E+05

TPC-

C Th

roug

hput

[TPS

]

Emulated NVRAM Latency [ns]

Environments- tmpfs-based NVRAM

emulation (varied latency)

- Data/xlog in NVRAM- H-Store w/ anti-cache

Observations1. FOEDUS 100x~ faster2. Performs best when

NVRAM read <10us

FOEDUS

FOEDUS (No Snapshot Cache)

H-Store + Anticache

Shore-MT

Page 24: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 15/16

Experiment 3: OLAP Workload

1.E+04

1.E+05

1.E+06

1.E+07

1.E+08

Thro

ughp

ut [T

PS]

MAX_OL_CNT (Size of DB)

Workload- 30x Larger Data- Cursor-Heavy- Read-Only- Volatile Pages Only vs

Snapshot Pages Only

Observations1. 100x~ faster than

H-Store.2. FOEDUS runs faster

when database is cold (!!)

In Paper: how Dual Pages achieve it15 100 500

FOEDUS-Volatile

FOEDUS-Snapshot

H-Store

Page 25: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

Ideas• ThekeyideaofFEODUSistousethemaster-treemodeltoreducecontention,makingOCClightweight.Canthisbeextendedtodistributedmemorysystemswithsharednothingtopology,especiallywherereadingremotememoryisnotverycostly,likeRDMA?

Page 26: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 17/16

Reserved Slides

Page 27: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 16/16

Ongoing Work

9Try on 1,000 Cores~9Further Resilience to ContentionsCombining Pessimistic Approach When Beneficial9SIMD, Xeon-PhiLog-Gleaner’s Sorting, Page Construction, etc9Non-C++ InterfaceSQL/JDBC/ODBC for OLAP. But, for OLTP???

Page 28: FOEDUS: OLTP Engine for a Thousand Cores and NVRAM · 2020-06-10 · 1.E+05 1.E+06 1.E+07 0 20 40 60 TPC-PS] Remote Transactions [%] Observations 1. SILO-style OCC much more resilient

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 27/16

Easy OCC with Foster Twins

�Simple, Robust, and Efficient Search:

• No Page-Version-set nor unnecessary aborts: All Records/TIDs always track-able via Foster-Twin!

Find a pointer that probably contains the key;Follow the page pointer;If the page’s key-range contain the key: go on;Else: locally retry;

No hand-over-hand verification/split counters.No unnecessary local retry. Absolutely no global retry.