Intro to Big Data Processing
Frameworks
By: Shahab Safaee
Computer Software Engineering PhD
Email: [email protected]
cibtrc.ir
t.me/cibtrc
www.instagram.com/cibtrc/
Agenda
• Introduction
• Big Data Processing Architecture
• Type of Processing Systems
• Processing Types of Big Data Processing Frameworks
• Big Data Execution Engine
• Taxonomy of Programming Models
• Hadoop Execution Engine
Introduction
• Big Data Processing Terminology
▫ Big Data Processing Framework
Actual component responsible for operating on data
▫ Big Data Execution Engine
Provides the computational model: several independent sequential
computations, each a subcomponent of a larger computation, run in
parallel.
▫ Big Data Programming Models
Programming models are normally the core feature of big data
frameworks: they implicitly affect the execution model of big data
processing engines and also determine how users express and
construct big data applications and programs.
Big Data Processing Architecture
Type of Processing Systems (1)
• Batch Processing Systems
▫ Batch processing involves operating over a large, static dataset and
returning the result at a later time when the computation is complete.
• The datasets in batch processing are typically...
▫ Bounded: batch datasets represent a finite collection of data
▫ Persistent: data is almost always backed by some type of permanent
storage
▫ Large: batch operations are often the only option for processing
extremely large sets of data
▫ Batch processing is well-suited for calculations where access to a
complete set of records is required.
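To illustrate why some calculations need the complete set of records, here is a minimal sketch of a batch-style job. The function name `batch_median` and the sample response times are hypothetical; the point is only that a median cannot be reported until the whole bounded dataset has been read.

```python
from statistics import median


def batch_median(records):
    """Batch-style job: materialize the full, bounded dataset, then compute."""
    values = list(records)  # the whole dataset must be available first
    return median(values)


# Hypothetical response times (ms), standing in for data read from storage.
response_times = [120, 95, 310, 101, 98]
print(batch_median(response_times))  # prints 101
```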
Type of Processing Systems (2)
• Stream Processing Systems
▫ Stream processing systems compute over data as it enters the system.
▫ Instead of defining operations to apply to an entire dataset, stream
processors define operations that will be applied to each individual data
item as it passes through the system.
• The datasets in stream processing are considered "unbounded". This
has a few important implications:
▫ The total dataset is only defined as the amount of data that has entered
the system so far.
▫ The working dataset is perhaps more relevant, and is limited to a single
item at a time.
▫ Processing is event-based and does not "end" until explicitly stopped.
Results are immediately available and will be continually updated as
new data arrives.
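The per-item model described above can be sketched with a generator. This is an illustration only, not any framework's API: `running_mean` is a hypothetical operator that sees one item at a time, keeps a small amount of state, and emits an updated result after every item, whether or not the input ever ends.

```python
from itertools import count, islice


def running_mean(stream):
    """Stream-style operator: update and emit the result for each item."""
    seen = 0
    total = 0.0
    for item in stream:
        seen += 1
        total += item
        yield total / seen  # result is available immediately


# A short input: results are updated as each item arrives.
print(list(running_mean([10, 20, 30])))  # prints [10.0, 15.0, 20.0]

# An unbounded input: processing does not end until we stop reading.
first_three = list(islice(running_mean(count(1)), 3))
print(first_three)  # prints [1.0, 1.5, 2.0]
```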
Processing Types of Big Data Processing Frameworks
• Batch-only frameworks:
▫ Apache Hadoop
• Stream-only frameworks:
▫ Apache Storm
▫ Apache Samza
• Hybrid frameworks:
▫ Apache Spark
▫ Apache Flink
Batch-only frameworks
Apache Hadoop (1)
• Apache Hadoop is a processing framework that exclusively provides batch processing.
• Hadoop was the first big data framework to gain significant traction in the open-source community.
• Modern versions of Hadoop are composed of several components, or layers, that work together to process batch data:
▫ HDFS:
HDFS is the distributed file system layer that coordinates storage and replication across the cluster nodes.
HDFS ensures that data remains available in spite of inevitable host failures.
▫ YARN:
YARN, which stands for Yet Another Resource Negotiator, is the cluster-coordinating component of the Hadoop stack. It is responsible for coordinating and managing the underlying resources and scheduling jobs to be run.
▫ MapReduce:
MapReduce is Hadoop's native batch processing engine.
Apache Hadoop (2)
• Apache Hadoop and its MapReduce processing engine offer a well-tested
batch processing model that is best suited for handling very
large data sets where time is not a significant factor.
• The low cost of components necessary for a well-functioning
Hadoop cluster makes this processing inexpensive and effective for
many use cases.
• Compatibility and integration with other frameworks and engines
mean that Hadoop can often serve as the foundation for multiple
processing workloads using diverse technology.
Stream-only frameworks/Kappa Architecture
Apache Storm
• Apache Storm is a stream processing framework that focuses on
extremely low latency and is perhaps the best option for workloads
that require near real-time processing.
• It can handle very large quantities of data and deliver results
with less latency than other solutions.
• For pure stream processing workloads with very strict latency
requirements, Storm is probably the best mature option.
• It can guarantee message processing and can be used with a large
number of programming languages.
• Because Storm does not do batch processing, you will have to use
additional software if you require those capabilities.
• If you have a strong need for exactly-once processing guarantees,
Storm's Trident abstraction can provide that.
Apache Samza
• Apache Samza is a stream processing framework that is tightly tied
to the Apache Kafka messaging system.
• While Kafka can be used by many stream processing systems,
Samza is designed specifically to take advantage of Kafka's unique architecture and guarantees.
• It uses Kafka to provide fault tolerance, buffering, and state storage.
• Apache Samza is a good choice for streaming workloads where Hadoop and Kafka are either already available or sensible to implement.
• Samza itself is a good fit for organizations with multiple teams using (but not necessarily tightly coordinating around) data streams at various stages of processing.
• Samza greatly simplifies many parts of stream processing and offers low latency performance.
Hybrid Processing Systems: Batch and
Stream Processors/Lambda processing
• Some processing frameworks can handle both batch and stream
workloads.
▫ These frameworks simplify diverse processing requirements by allowing
the same or related components and APIs to be used for both types of data.
Apache Spark
• Apache Spark is a next-generation batch processing framework with
stream processing capabilities.
• Built on many of the same principles as Hadoop's MapReduce
engine, Spark focuses primarily on speeding up batch processing
workloads by offering full in-memory computation and processing
optimization.
• Spark is a great option for those with diverse processing workloads.
Spark batch processing offers large speed advantages at the cost of
high memory usage.
• Spark Streaming is a good stream processing solution for workloads
that value throughput over latency.
Apache Flink (1)
• Apache Flink is a stream processing framework that can also handle
batch tasks. It considers batches to simply be data streams with
finite boundaries, and thus treats batch processing as a subset of
stream processing. This stream-first approach to all processing has a
number of interesting side effects.
• This stream-first approach has been called the Kappa architecture,
in contrast to the more widely known Lambda architecture (where
batching is used as the primary processing method with streams
used to supplement and provide early but unrefined results).
• Kappa architecture, where streams are used for everything,
simplifies the model and has only recently become possible as
stream processing engines have grown more sophisticated.
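The idea that a batch is simply a stream with finite boundaries can be sketched in plain Python. This is not Flink's API; `count_events` and `endless_clicks` are hypothetical names. The same operator is applied unchanged to a bounded list and to an unbounded generator, and the only difference is whether the input ever runs out.

```python
from itertools import count, islice


def count_events(stream):
    """One operator for both cases: emit the running count per key."""
    counts = {}
    for key in stream:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


# Bounded input ("batch"): the stream ends, so a final result exists.
batch = ["a", "b", "a"]
final_counts = dict(count_events(batch))  # the last count per key wins
print(final_counts)  # prints {'a': 2, 'b': 1}


def endless_clicks():
    """Unbounded input: alternates two keys forever."""
    for i in count():
        yield "even" if i % 2 == 0 else "odd"


# Unbounded input ("stream"): we can only ever look at results so far.
first_five = list(islice(count_events(endless_clicks()), 5))
print(first_five)
```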
Apache Flink (2)
• Flink offers low-latency stream processing together with support for
traditional batch tasks.
• Flink is probably best suited for organizations that have heavy
stream processing requirements and some batch-oriented tasks.
• Its compatibility with native Storm and Hadoop programs and its
ability to run on a YARN-managed cluster can make it easy to
evaluate.
• Its rapid development makes it worth keeping an eye on.
Big Data Processing Frameworks Comparison
• There are plenty of options for processing within a big data system.
▫ For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is likely less expensive to implement than some other solutions.
▫ For stream-only workloads, Storm has wide language support and can deliver very low latency processing, but can deliver duplicates and cannot guarantee ordering in its default configuration.
▫ For mixed workloads, Spark provides high speed batch processing and micro-batch processing for streaming. It has wide support, integrated libraries and tooling, and flexible integrations. Flink provides true stream processing with batch processing support. It is heavily optimized, can run tasks written for other platforms, and provides low latency processing, but is still in the early days of adoption.
▫ The best fit for your situation will depend heavily upon the state of the data to process.
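The micro-batch approach mentioned for Spark's streaming support can be sketched as follows. This is a simplified illustration and not Spark's API: Spark Streaming groups records by time interval, whereas this sketch groups by a fixed record count (`batch_size`, an assumption made to keep the example deterministic). `micro_batches` and `process_batch` are hypothetical names.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a stream into small batches of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


def process_batch(batch):
    """Ordinary batch logic, applied to each small batch in turn."""
    return sum(batch)


events = [1, 2, 3, 4, 5, 6, 7]
results = [process_batch(b) for b in micro_batches(events, 3)]
print(results)  # prints [6, 15, 7]
```

Each result is only available once its batch is full, which is why this style tends to favor throughput over per-item latency.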
Big Data Execution Engine
• Features
▫ Scalable
▫ Fault Tolerant
▫ Parallelism
• Infrastructure Processing
▫ Commodity Clustered Machines
• Programming Models
▫ A programming model is the fundamental style and interfaces for
developers to write computing programs and applications.
• Examples of Execution Engines:
▫ MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
Distributed Data-Parallel Patterns and
Execution Engines
• There are many distributed data-parallel patterns.
• The main advantages of using these patterns:
▫ Support distributing data across multiple nodes/cores and processing
it in parallel.
▫ Provide a higher-level programming model that makes it easier to parallelize user programs.
▫ Follow the principle of "moving computation to the data", which reduces data movement overhead.
▫ Scale well and perform efficiently when executing on distributed resources.
▫ Support runtime features such as load balancing, fault tolerance, etc.
▫ Simplify parallel programming compared to traditional programming interfaces such as MPI and OpenMP.
Main Characteristics of Distributed
Execution Engines
• Distributed Execution Engines are designed around the following
fundamental characteristics:
▫ Task Scheduling
▫ Data Distribution
▫ Load Balancing
▫ Transparent Fault Tolerance
▫ Control Flow
▫ Tracking Dependencies
Taxonomy of Programming Models
MapReduce
• MapReduce is the current framework/paradigm for writing data-centric parallel applications in both industry and academia.
• MapReduce is inspired by the commonly used functions Map and Reduce, in combination with the divide-and-conquer parallel paradigm.
• For a single MapReduce job, users implement two basic procedure objects, Mapper and Reducer, for the different processing stages, as shown in the figure.
MapReduce Paradigm (1)
MapReduce Paradigm (2)
• The MapReduce program is then automatically interpreted by the execution engine and executed in parallel in a distributed environment.
• MapReduce is considered a programming model that is simple yet powerful enough to support a variety of data-intensive programs.
MapReduce Dataflow
MapReduce Features (1)
• Map and Reduce functions
▫ A MapReduce program contains a Map function that performs the parallel transformation and a Reduce function that performs the parallel aggregation and summary of the job.
▫ Between Map and Reduce, an implied Shuffle step groups and sorts the mapped results and then feeds them into the Reduce step.
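The three steps can be sketched as a tiny single-process engine. This is an illustration of the paradigm only, not Hadoop's implementation: `run_mapreduce`, `map_reading`, `reduce_max`, and the temperature readings are all hypothetical, and a real engine would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run Map, then Shuffle (group and sort by key), then Reduce."""
    # Map: transform each input record into zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))

    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: aggregate each key's values, visiting keys in sorted order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


def map_reading(record):
    city, temperature = record
    return [(city, temperature)]


def reduce_max(city, temperatures):
    return (city, max(temperatures))


readings = [("Oslo", 3), ("Cairo", 31), ("Oslo", 7), ("Cairo", 28)]
result = run_mapreduce(readings, map_reading, reduce_max)
print(result)  # prints [('Cairo', 31), ('Oslo', 7)]
```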
MapReduce Features (2)
• Simple paradigm
▫ In MapReduce programming, users only need to write the logic of the Mapper and Reducer, while shuffling, partitioning and sorting are handled automatically by the execution engine.
▫ Complex applications and algorithms can be implemented by connecting a sequence of MapReduce jobs.
▫ Due to this simple programming paradigm, it is much more convenient to write data-driven parallel applications, because users only need to consider the logic of processing data in each Mapper and Reducer without worrying about how to parallelize and coordinate the jobs.
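Connecting a sequence of jobs can be sketched by feeding one job's output into the next. The helper `run_job` below is a hypothetical single-process stand-in for the execution engine, and the visit log is made-up data. Job 1 counts visits per user; job 2 takes those counts as its input and groups users by visit count.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Minimal stand-in for one MapReduce job (map, group by key, reduce)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Job 1: count visits per user.
def emit_visit(user):
    return [(user, 1)]


def sum_visits(user, ones):
    return (user, sum(ones))


# Job 2: group users by how many visits they made.
def key_by_visits(pair):
    user, visits = pair
    return [(visits, user)]


def collect_users(visits, users):
    return (visits, sorted(users))


visit_log = ["alice", "bob", "alice", "carol"]
visits_per_user = run_job(visit_log, emit_visit, sum_visits)
users_by_visits = run_job(visits_per_user, key_by_visits, collect_users)
print(users_by_visits)  # prints [(1, ['bob', 'carol']), (2, ['alice'])]
```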
MapReduce Features (3)
• Key-Value based
▫ In MapReduce, both input and output data are treated as key-value pairs with different types.
▫ This design is driven by the requirements of parallelization and scalability.
▫ Key-value pairs can be easily partitioned and distributed for processing on distributed clusters.
• Parallelizable and Scalable
▫ Both Map and Reduce functions are designed to facilitate parallelization, so MapReduce applications generally scale linearly to thousands of nodes.
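One common way to partition key-value pairs is to hash the key and take the result modulo the number of partitions, so that every pair with the same key lands in the same place. The sketch below assumes that scheme; the choice of `zlib.crc32` as the hash and the names `partition_for` and `partition_pairs` are assumptions for illustration, not what any particular engine uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Distribute (key, value) pairs so equal keys share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


pairs = [("big", 1), ("data", 1), ("big", 1), ("stream", 1), ("data", 1)]
partitions = partition_pairs(pairs, 3)
for index, partition in enumerate(partitions):
    print(index, partition)
```

Because each partition holds complete key groups, each one can be handed to a separate worker and reduced independently.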
Hadoop Execution Engine (1)
• Hadoop MapReduce
▫ Hadoop is the open-source implementation of Google’s MapReduce paradigm.
▫ The native programming primitives in Hadoop are the Mapper and Reducer interfaces, which programmers implement with their actual logic for the map and reduce stages.
▫ To support more complicated applications, users may need to chain a sequence of MapReduce jobs, each of which is responsible for a processing module with well-defined functionality.
Hadoop Execution Engine (2)
• Hadoop MapReduce
▫ Hadoop is mainly implemented in Java, so the map and reduce functions are wrapped as two interfaces called Mapper and Reducer.
▫ The Mapper contains the logic of processing each key-value pair from the input.
▫ The Reducer contains the logic for processing a set of values for each key.
▫ Programmers build their MapReduce application by implementing those two interfaces and chaining them as an execution pipeline.
MapReduce Word Count Process
Word Count example in Hadoop
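The original slide's code is not preserved in this transcript. As a stand-in, the following is a Python sketch of the word count structure, assuming a Mapper that processes one input record at a time and a Reducer that processes all values for one key. It imitates the shape of the two interfaces but is not Hadoop's Java API; the class names, the `run_job` driver, and the sample lines are chosen for illustration.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Mapper(ABC):
    @abstractmethod
    def map(self, key, value):
        """Yield (key, value) pairs for one input record."""


class Reducer(ABC):
    @abstractmethod
    def reduce(self, key, values):
        """Yield (key, value) pairs for one key and all of its values."""


class TokenizerMapper(Mapper):
    def map(self, key, value):
        # key is the line offset (unused); value is the line of text.
        for word in value.split():
            yield word, 1


class IntSumReducer(Reducer):
    def reduce(self, key, values):
        yield key, sum(values)


def run_job(lines, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in mapper.map(offset, line):
            groups[word].append(one)
    output = {}
    for word in sorted(groups):
        for out_key, out_value in reducer.reduce(word, groups[word]):
            output[out_key] = out_value
    return output


lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
counts = run_job(lines, TokenizerMapper(), IntSumReducer())
print(counts)
```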