Best Practices For Loading Data To Distributed Systems With Change Data Capture
Alexey Goncharuk
2019 © GridGain Systems
Change Data Capture In The Wild
Agenda
● What is CDC?
● What can I do with CDC?
● What is available in Ignite / GridGain?
What is Change Data Capture?
What is CDC?
● Have a data set of arbitrary size
● Determine which records changed since a given moment
● Many ways to achieve this...
What Is CDC?
Record Change Markers
● Timestamps
● Versions
● Statuses
● Attached to application data model
Record Change Markers

| ID | … | UPDATE_TS |
|----|---|-------------------------|
| 1  |   | 2019-10-10 00:01:02.000 |
| 2  |   | 2019-10-09 11:01:02.000 |
| 3  |   | 2018-10-09 18:36:13.000 |
| 4  |   | 2019-09-01 01:02:03.000 |
| …  |   |                         |
| 10 |   | 2019-06-13 11:12:04.000 |
Record Change Markers

| ID | … | UPDATE_TS |
|----|---|-------------------------|
| 1  |   | 2019-10-10 00:01:02.000 |
| 2  |   | 2019-10-09 11:01:02.000 |
| 3  |   | 2018-10-09 18:36:13.000 |
| 4  |   | 2019-11-01 23:59:59.000 |
| …  |   |                         |
| 10 |   | 2019-11-15 14:00:00.000 |
Record Change Markers

SELECT * FROM Table WHERE UPDATE_TS > '2019-11-01 00:00:00.000'

| ID | … | UPDATE_TS |
|----|---|-------------------------|
| 1  |   | 2019-10-10 00:01:02.000 |
| 2  |   | 2019-10-09 11:01:02.000 |
| 3  |   | 2018-10-09 18:36:13.000 |
| 4  |   | 2019-11-01 23:59:59.000 |
| …  |   |                         |
| 10 |   | 2019-11-15 14:00:00.000 |
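
The timestamp query on this slide can be tried out with a minimal sketch using Python's built-in `sqlite3` module. The table name `records`, the index name, and the helper `changed_since` are assumptions made for illustration (the slide's `Table` is a placeholder), and only the rows visible on the slide are loaded.

```python
import sqlite3


def changed_since(conn, watermark):
    """Return (ID, UPDATE_TS) rows changed strictly after the watermark."""
    cur = conn.execute(
        "SELECT ID, UPDATE_TS FROM records WHERE UPDATE_TS > ? ORDER BY UPDATE_TS",
        (watermark,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (ID INTEGER PRIMARY KEY, UPDATE_TS TEXT NOT NULL)")
# An index on the change marker is one way to avoid a full scan on every poll.
conn.execute("CREATE INDEX idx_update_ts ON records (UPDATE_TS)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        (1, "2019-10-10 00:01:02.000"),
        (2, "2019-10-09 11:01:02.000"),
        (3, "2018-10-09 18:36:13.000"),
        (4, "2019-11-01 23:59:59.000"),
        (10, "2019-11-15 14:00:00.000"),
    ],
)

# Timestamps in this fixed-width format compare correctly as text.
changes = changed_since(conn, "2019-11-01 00:00:00.000")
# A consumer would typically keep the largest timestamp seen as its next watermark.
next_watermark = max(ts for _, ts in changes)
```

Only records 4 and 10 match the watermark. Note that the query returns just the latest state of each record; intermediate values are not visible to the consumer.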
Record Change Markers
Cons
● Detecting changes is tricky
  ○ Full scan
  ○ Additional index for change markers
● No previous value (change coalescing)
Record Change Markers
Pros
● May be implemented in application layer
● Delayed change consumption
● Negligible storage overhead
What Is CDC?
Callbacks
● Triggers / interceptors / etc...
● User code is supplied to the storage system
Callbacks
[Diagram: Update → Callback → User-defined Action]
Callbacks
Cons
● Invoked synchronously
● Tricky failover in distributed systems
Callbacks
Pros
● No system storage/insert overhead
● Previous value is usually available
● May have an ability to modify updated value
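
The callback approach can be illustrated with a small, self-contained sketch. The class `CallbackStore` and the functions `normalize` and `audit` are hypothetical names, not part of any real storage system; the sketch only shows callbacks being invoked synchronously on the write path, seeing the previous value, and optionally altering the value being stored.

```python
class CallbackStore:
    """Toy key-value store that runs user callbacks synchronously on each put."""

    def __init__(self):
        self._data = {}
        self._callbacks = []

    def register(self, callback):
        # callback(key, old_value, new_value) -> value to store
        self._callbacks.append(callback)

    def put(self, key, value):
        old = self._data.get(key)
        for callback in self._callbacks:
            # Each callback runs before the write completes and may replace the value.
            value = callback(key, old, value)
        self._data[key] = value
        return old

    def get(self, key):
        return self._data.get(key)


audit_log = []


def normalize(key, old, new):
    # Callback that modifies the updated value.
    return new.strip().lower()


def audit(key, old, new):
    # Callback that captures the change together with the previous value.
    audit_log.append((key, old, new))
    return new


store = CallbackStore()
store.register(normalize)
store.register(audit)
store.put("k1", "  Hello ")
store.put("k1", "WORLD")
```

Because the callbacks run inside `put`, a slow callback slows down every write, which is the synchronous-invocation cost listed among the cons.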
What Is CDC?
Change Feed
● Changes are stored as events (Event Sourcing)
● Or changes produce events
● Consumers subscribe to a change feed
● Database WAL is an event source!
Change Feed
[Diagram: Update → Subscriber / Consumer]
Change Feed
Cons
● Need additional storage to keep changes
Change Feed
Pros
● Previous values are usually available
● Full change history is preserved
● Possibly an ability to re-read the history
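
A change feed can be sketched as an append-only log that consumers read by offset. The names `ChangeFeed` and `FeedBackedStore` are assumptions for illustration and do not correspond to a specific product API; the sketch shows why extra storage is needed (every change is kept) and how a consumer can re-read the history.

```python
class ChangeFeed:
    """Append-only log of (key, old_value, new_value) change events."""

    def __init__(self):
        self._log = []

    def append(self, key, old, new):
        self._log.append((key, old, new))

    def read(self, offset, max_events=100):
        """Return events starting at offset and the offset to resume from."""
        events = self._log[offset:offset + max_events]
        return events, offset + len(events)


class FeedBackedStore:
    """Toy store that records every update in the feed instead of calling user code."""

    def __init__(self, feed):
        self._data = {}
        self._feed = feed

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        self._feed.append(key, old, value)


feed = ChangeFeed()
store = FeedBackedStore(feed)
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)

# A consumer tracks its own offset and reads at its own pace.
events, offset = feed.read(0)
# Re-reading from offset 0 replays the full history, including previous values.
replayed, _ = feed.read(0)
```

Each consumer owns its offset, so a slow or restarted consumer does not affect writers; the cost is that the log keeps growing until old entries are trimmed.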
CDC Applications
CDC Applications
Continuous Data Integration
● “Active” database produces changes
● The changes are applied to a secondary system
Continuous Data Integration
[Diagram: Captured Changes]
CDC Applications
Continuous Data Integration
● Read offloading
● Audit Changelog
● Cross-system Replication
● High Availability
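
Applying captured changes to a secondary system can be sketched as replaying an ordered list of change events against a replica. The event shape `(operation, key, value)` and the function `apply_changes` are assumptions for illustration; real CDC tools define their own event formats.

```python
def apply_changes(replica, changes):
    """Apply ordered (operation, key, value) events to a dict-based replica."""
    for operation, key, value in changes:
        if operation == "delete":
            # Deleting a missing key is tolerated so that replays are harmless.
            replica.pop(key, None)
        else:
            # "insert" and "update" both set the latest value.
            replica[key] = value
    return replica


captured = [
    ("insert", "k1", "v1"),
    ("insert", "k2", "v2"),
    ("update", "k1", "v1b"),
    ("delete", "k2", None),
]
replica = apply_changes({}, captured)
```

Because every operation here is an upsert or a tolerant delete, applying the same ordered batch twice leaves the replica in the same state, which simplifies retry after a failure.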
CDC Applications
Running function calculation
● Computationally expensive function over a large set of items?
● Calculate once, then apply deltas
CDC Applications
Running function calculation
● AVG (ITEMS) = SUM (ITEMS) / COUNT (ITEMS)
  ○ O(N) Complexity
● On insert => SUM += New Value, COUNT += 1
● On delete => SUM -= Deleted Value, COUNT -= 1
● On update => SUM = SUM - Old Value + New Value
● Average is an O(1) operation
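
The delta rules above translate directly into a small sketch. The class name `RunningAverage` and its method names are chosen for illustration; the update rules themselves are the ones listed on the slide.

```python
class RunningAverage:
    """Maintains SUM and COUNT incrementally so the average is O(1) to read."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def on_insert(self, new_value):
        self.total += new_value
        self.count += 1

    def on_delete(self, deleted_value):
        self.total -= deleted_value
        self.count -= 1

    def on_update(self, old_value, new_value):
        # COUNT is unchanged; only the difference is applied to SUM.
        self.total = self.total - old_value + new_value

    def average(self):
        if self.count == 0:
            raise ValueError("average of an empty set is undefined")
        return self.total / self.count


avg = RunningAverage()
for value in (10, 20, 30):
    avg.on_insert(value)
after_inserts = avg.average()

avg.on_update(30, 60)
after_update = avg.average()

avg.on_delete(10)
after_delete = avg.average()
```

The update and delete rules need the old value, so this technique fits capture methods that expose previous values, such as callbacks or a change feed.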
CDC Applications
Cross-System Active-Active Replication
● Update feeds go both ways
● Conflicts need to be resolved
● Conflict-free Replicated Data Types (CRDTs) can help
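
One simple conflict-resolution policy for active-active replication is last-write-wins. The sketch below is a minimal illustration under stated assumptions: the class `LWWRegister`, the integer timestamps, and the use of a node identifier as a tie-breaker are all choices made for this example, and it assumes timestamps from different systems are comparable.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LWWRegister:
    """Single value tagged with the time and origin of its last write."""

    value: Any
    timestamp: int
    node_id: str

    def merge(self, other):
        # The write with the larger (timestamp, node_id) pair wins.
        # The node id breaks ties so both sides pick the same winner.
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            return other
        return self


a = LWWRegister("x", timestamp=5, node_id="A")
b = LWWRegister("y", timestamp=7, node_id="B")
c = LWWRegister("z", timestamp=7, node_id="A")

merged_ab = a.merge(b)
merged_ba = b.merge(a)
merged_tie = c.merge(b)
```

The merge gives the same result regardless of the order in which the two systems exchange updates, at the price of silently discarding the losing write.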
Basic CRDTs
• Grow-only counter
• Positive-negative counter
• Grow-only set
• Two-phase set
• Last-write-wins
• …
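
The first two entries in this list can be sketched in a few lines. This is a state-based formulation under stated assumptions: the class names `GCounter` and `PNCounter` and the per-node dictionary layout are illustrative and not taken from the talk.

```python
class GCounter:
    """Grow-only counter: one non-decreasing slot per node, merged with max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


class PNCounter:
    """Positive-negative counter built from two grow-only counters."""

    def __init__(self, node_id):
        self.increments = GCounter(node_id)
        self.decrements = GCounter(node_id)

    def increment(self, amount=1):
        self.increments.increment(amount)

    def decrement(self, amount=1):
        self.decrements.increment(amount)

    def value(self):
        return self.increments.value() - self.decrements.value()

    def merge(self, other):
        self.increments.merge(other.increments)
        self.decrements.merge(other.decrements)


a = PNCounter("A")
b = PNCounter("B")
a.increment(5)
b.increment(3)
b.decrement(1)

# Replicas exchange state in both directions and converge.
a.merge(b)
b.merge(a)
# Merging the same state again changes nothing.
a.merge(b)
```

Both replicas end at the same value without any coordination, because each node only ever writes to its own slot and merges never lose an increment.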
CDC In Apache Ignite
CDC In Ignite
Applying Changes To Ignite
● IgniteDataStreamer to optimally deliver changes to data nodes
● A user can use a custom stream receiver
● Out-of-the-box integrations
  ○ Kafka
  ○ MQTT
  ○ …
CDC In Ignite
Callbacks
● CacheInterceptor
  ○ Guarantees update order
  ○ May alter inserted value
  ○ Synchronous, may affect performance
CDC In Ignite
Callbacks
● Cache Events
  ○ Guarantee update order
  ○ Asynchronous
CDC In Ignite
Callbacks And Change Feed Combined
● ContinuousQuery
  ○ Client - server subscription
  ○ Remote filter acts as a synchronous callback
  ○ Local listener acts as a sink
CDC In Ignite
[Diagram: Update (K1) → Remote Filter → Consumer; Update (K2) → Remote Filter → Consumer]
CDC In Ignite
Callbacks And Change Feed Combined
● Automatic failover in case of primary node crash
● Single-key ordering guarantees
CDC In Ignite
● Ingestion
  ○ IgniteDataStreamer
● Capturing Changes
  ○ CacheInterceptor
  ○ Events
  ○ ContinuousQuery
Summary
● CDC is a powerful and well-known technique
● Many systems have built-in support for CDC
● May improve both development time and performance
Apache Ignite
Want To Contribute?
Q&A
Thank you for your attention!