Big Data Computing Architecture
by gang-tao
Non-Functional Requirements
• Latency
• Throughput
• Fault Tolerance
• Scalability
• Exactly-Once Semantics
Design Principles
human fault-tolerance – the system must withstand human error, because at scale data loss or data corruption could be irreparable.
data immutability – store data in its rawest form, immutable and in perpetuity (INSERT/SELECT/DELETE, but no UPDATE!)
recomputation – with the two principles above, it is always possible to (re)compute results by running a function over the raw data.
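The three principles can be sketched together in a toy in-memory log; all names here are hypothetical, not any particular system's API. Events are append-only (immutability), and every derived view is just a function over the raw events, so a view corrupted by a buggy deploy can be thrown away and recomputed (human fault-tolerance).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)  # frozen: a stored event can never be updated
class Event:
    key: str
    value: int

class EventLog:
    """Append-only store: INSERT and SELECT, but no UPDATE."""
    def __init__(self) -> None:
        self._events: List[Event] = []

    def insert(self, event: Event) -> None:
        self._events.append(event)

    def select(self) -> List[Event]:
        return list(self._events)  # copy, so callers cannot mutate the log

def sum_by_key(events: List[Event]) -> Dict[str, int]:
    out: Dict[str, int] = {}
    for e in events:
        out[e.key] = out.get(e.key, 0) + e.value
    return out

def recompute(log: EventLog,
              fn: Callable[[List[Event]], Dict[str, int]]) -> Dict[str, int]:
    # A derived view is just a function over the raw data, so it can be
    # rebuilt from scratch at any time (recomputation).
    return fn(log.select())

log = EventLog()
log.insert(Event("clicks", 3))
log.insert(Event("clicks", 2))
totals = recompute(log, sum_by_key)  # {'clicks': 5}
```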
Batch Processing
Using only batch processing always leaves you with a portion of unprocessed data.
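A toy illustration of that unprocessed tail, with hypothetical data: a batch job only sees events that arrived before it started, so anything newer must wait for the next run.

```python
# (arrival_time, payload) pairs; the batch job starts at t = 10
events = [(0, "a"), (5, "b"), (12, "c")]
batch_start = 10

batch_view = [p for (t, p) in events if t < batch_start]    # processed
unprocessed = [p for (t, p) in events if t >= batch_start]  # the tail
```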
Lambda: the Good and the Bad
What is good? Immutable data, reprocessing
What is bad? Keeping code written in two different systems perfectly in sync
Out of Control
Build Complexity with Simplicity
Stream Processing Model     One-at-a-time   Micro-batch
Low latency                 Y               N
High throughput             N               Y
At least once               Y               Y
Exactly once                Sometimes       Y
Simple programming model    Y               N
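The two rows of the table can be contrasted in a minimal sketch (hypothetical functions, not any engine's API): one-at-a-time emits a result after every event, so latency is low; micro-batch only emits a result per batch boundary, trading latency for throughput.

```python
from typing import Iterable, Iterator, List

def one_at_a_time(events: Iterable[int]) -> Iterator[int]:
    total = 0
    for e in events:
        total += e
        yield total  # a fresh result is visible after every single event

def micro_batch(events: List[int], batch_size: int) -> Iterator[int]:
    total = 0
    for i in range(0, len(events), batch_size):
        total += sum(events[i:i + batch_size])
        yield total  # results only appear once per batch boundary

list(one_at_a_time([1, 2, 3, 4]))   # [1, 3, 6, 10] — four visible updates
list(micro_batch([1, 2, 3, 4], 2))  # [3, 10]       — two visible updates
```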
Stream Computing: the Limitations
• Queries must be written before the data arrives
• There should be another way to query past data
• Queries cannot be run twice
• All results are lost when any error occurs; all data is gone by the time a bug is found
• Disordered events break results
• Recorded-time-based queries, or arrival-time-based queries?
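The last two points can be shown with one small sketch on hypothetical data: the same out-of-order stream answers a windowed query differently depending on whether it is keyed on recorded (event) time or arrival time.

```python
# Each event carries both timestamps; the second one arrived late.
events = [
    {"event_time": 1, "arrival_time": 2},
    {"event_time": 1, "arrival_time": 8},  # occurred early, arrived late
    {"event_time": 9, "arrival_time": 9},
]

def count_in_window(events, key, start, end):
    # Count events whose chosen timestamp falls in [start, end).
    return sum(1 for e in events if start <= e[key] < end)

by_recorded_time = count_in_window(events, "event_time", 0, 5)   # 2
by_arrival_time = count_in_window(events, "arrival_time", 0, 5)  # 1
```

The disordered event is counted by the recorded-time query but missed by the arrival-time one, so the two query styles genuinely disagree.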
Fault Tolerance in Streams
• At Least Once: ensure all operators see all events
  • Stream -> replay on failure
• Exactly Once:
  • Flink: distributed snapshots
  • Spark: micro-batches
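A minimal sketch of the difference, assuming a toy counter (hypothetical names, not Flink's or Spark's actual mechanism): replaying the stream after a failure gives at-least-once delivery, so events can be double-counted; snapshotting the last processed offset together with the state lets replayed duplicates be skipped, restoring exactly-once semantics.

```python
class Counter:
    def __init__(self):
        self.total = 0
        self.last_offset = -1  # snapshotted alongside the state

    def process(self, offset, value, dedup=False):
        if dedup and offset <= self.last_offset:
            return  # already applied: skip the replayed duplicate
        self.total += value
        self.last_offset = offset

stream = [(0, 5), (1, 7)]  # (offset, value) pairs

at_least_once = Counter()
for off, v in stream:
    at_least_once.process(off, v)
for off, v in stream:  # failure: the whole stream is replayed
    at_least_once.process(off, v)
# at_least_once.total is 24 — events were double-counted

exactly_once = Counter()
for off, v in stream:
    exactly_once.process(off, v, dedup=True)
for off, v in stream:  # same replay, but duplicates are skipped
    exactly_once.process(off, v, dedup=True)
# exactly_once.total is 12 — each event applied exactly once
```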