Big Data and Cloud Computing
Cloud and Big Data
Farzad Nozarian ([email protected])
Amirkabir University of Technology
With the help of Dr. Amir H. Payberah ([email protected])
Big Data Analytics Stack
Hadoop Big Data Analytics Stack
Spark Big Data Analytics Stack
Big Data - File systems
• Traditional file systems are not well designed for large-scale data
processing systems.
• Efficiency has a higher priority than other features, e.g., directory
service.
• The massive size of the data calls for storing it across multiple machines in a
distributed way.
• HDFS/GFS, Amazon S3, ...
Big Data - Database
• Relational Database Management Systems (RDBMS) were not designed to be
distributed.
• NoSQL databases relax one or more of the ACID properties, following the BASE model instead.
• Different data models: key/value, column-family, graph, document.
• Hbase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB, Voldemort, Riak,
Neo4J, ...
Big Data - Resource Management
• Different frameworks require different computing resources.
• Large organizations need the ability to share data and resources between
multiple frameworks.
• Resource managers share the resources of a cluster among multiple
frameworks while providing resource isolation.
• Mesos, YARN, Quincy, ...
Big Data - Execution Engine
• Scalable and fault-tolerant parallel data processing on clusters of unreliable
machines.
• Data-parallel programming model for clusters of commodity machines.
• MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
Big Data - Query/Scripting Language
• Low-level programming of execution engines, e.g., MapReduce, is not easy
for end users.
• A high-level language is needed to improve the query capabilities of execution
engines.
• It translates user-defined functions to the low-level API of the execution
engines.
• Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
Big Data - Stream Processing
• Providing users with fresh, low-latency results.
• Database Management Systems (DBMS) vs. Data Stream Management Systems (DSMS)
• Storm, S4, SEEP, D-Stream, Naiad, ...
Big Data - Graph Processing
• Many problems are naturally expressed as graphs: they have sparse computational
dependencies and require multiple iterations to converge.
• Data-parallel frameworks, such as MapReduce, are not ideal for these
problems: they are slow.
• Graph processing frameworks are optimized for graph-based problems.
• Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...
Big Data - Machine Learning
• Implementing and consuming machine learning techniques at scale are
difficult tasks for developers and end users.
• There exist platforms that address this by providing scalable machine learning
and data mining libraries.
• Mahout, MLBase, SystemML, Ricardo, Presto, ...
Big Data - Configuration and Synchronization Service
• A means to synchronize distributed applications’ access to shared resources.
• Allows distributed processes to coordinate with each other.
• Zookeeper, Chubby, ...
Hadoop Ecosystem
Hadoop Ecosystem-HDFS
• A foundational component of the Hadoop ecosystem is the Hadoop
Distributed File System (HDFS).
• HDFS is the mechanism by which a large amount of data can be distributed
over a cluster of computers, and data is written once, but read many times
for analytics.
• It provides the foundation for other tools, such as HBase.
Hadoop Ecosystem-MapReduce
• Hadoop’s main execution framework is MapReduce, a programming model
for distributed, parallel data processing that breaks jobs into map phases
and reduce phases.
• Developers write MapReduce jobs for Hadoop, using data stored in HDFS for
fast data access.
• Because of the way MapReduce works, Hadoop brings the
processing to the data in a parallel fashion, resulting in fast execution.
Hadoop Ecosystem-HBase
• A column-oriented NoSQL database built on top of HDFS, HBase is used
for fast read/write access to large amounts of data.
• HBase uses Zookeeper for its management to ensure that all of its
components are up and running.
Hadoop Ecosystem-Zookeeper
• Zookeeper is Hadoop’s distributed coordination service.
• Designed to run over a cluster of machines, it is a highly available service
used for the management of Hadoop operations, and many components of
Hadoop depend on it.
Hadoop Ecosystem-Oozie
• A scalable workflow system
• Oozie is integrated into the Hadoop stack, and is used to coordinate
execution of multiple MapReduce jobs.
• It is capable of managing a significant amount of complexity, basing
execution on external events that include timing and presence of required
data.
Hadoop Ecosystem-Pig
• An abstraction over the complexity of MapReduce programming
• The Pig platform includes an execution environment and a scripting language
(Pig Latin) used to analyze Hadoop data sets.
• Its compiler translates Pig Latin into sequences of MapReduce programs.
Hadoop Ecosystem-Hive
• An SQL-like, high-level language used to run queries on data stored in
Hadoop
• Hive enables developers not familiar with MapReduce to write data queries
that are translated into MapReduce jobs in Hadoop.
Hadoop Ecosystem-Sqoop
• A connectivity tool for moving data between relational databases or data
warehouses and Hadoop.
• Sqoop leverages the database to describe the schema of the imported/exported
data, and MapReduce for parallel operation and fault tolerance.
Hadoop Ecosystem-Flume
• A distributed, reliable, and highly available service for efficiently collecting,
aggregating, and moving large amounts of data from individual machines to
HDFS.
• It provides streaming data flows,
• allowing data to be moved from multiple machines within an enterprise into
Hadoop.
Beyond the core components
• Whirr — This is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• Mahout — This is a machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression testing, and statistical modeling.
• BigTop — This is a formal process and framework for packaging and interoperability testing of Hadoop’s sub-projects and related components.
• Ambari — This is a project aimed at simplifying Hadoop management by providing support for provisioning, managing, and monitoring Hadoop clusters.
Storing Data in Hadoop
HDFS - HBase
HDFS-Architecture
• The HDFS design is based on the design of the Google File System (GFS).
• To be able to store a very large amount of data (terabytes or petabytes)
• HDFS is designed to spread the data across a large number of machines, and to
support much larger file sizes compared to distributed filesystems such as NFS.
• HDFS uses data replication to tolerate machine failures.
• To better integrate with Hadoop’s MapReduce, HDFS allows data to be read and
processed locally.
HDFS-Architecture
HDFS is implemented as a block-structured file system.
HDFS-Using HDFS Files
• User applications access the HDFS file system using an HDFS client
• ACCESSING HDFS
• FileSystem (FS) shell
• HDFS Java APIs
HDFS-Using HDFS Files
HBase-Architecture
• HBase is a distributed, versioned, column-oriented, multidimensional storage system, designed for high performance and high availability.
• HBase is an open source implementation of Google’s BigTable architecture.
• Similar to traditional relational database management systems (RDBMSs), data in HBase is organized in tables.
• Unlike RDBMSs, however, HBase supports a very loose schema definition, and does not provide any joins, query language, or SQL.
• The main focus of HBase is on Create, Read, Update, and Delete (CRUD) operations on wide sparse tables.
• HBase leverages HDFS for its persistent data storage.
Processing Data with MapReduce
MapReduce-Roadmap
• Understanding MapReduce fundamentals
• Getting to know MapReduce application execution
• Understanding MapReduce application design
MAPREDUCE-GETTING TO KNOW
• MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge data sets using a large number of commodity computers.
• Inspired by the map and reduce functions of functional programming, it was introduced by Google in 2004.
• MapReduce was introduced to solve large-data computational problems, and is specifically designed to run on commodity hardware.
• It is based on the divide-and-conquer principle: the input data sets are split into independent chunks, which are processed by the mappers in parallel.
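As a hedged illustration of this divide-and-conquer flow (a toy single-process model, not Hadoop's actual Java API), the map, shuffle, and reduce steps can be sketched in Python:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process model of the MapReduce flow:
    map -> shuffle (group by key) -> reduce."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Shuffle is modeled by the grouping above; reduce merges each key's values.
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Illustrative use: total word length per first letter.
words = ["hadoop", "hdfs", "mapreduce", "mahout"]
result = map_reduce(words,
                    mapper=lambda w: [(w[0], len(w))],
                    reducer=lambda k, vs: sum(vs))
print(result)  # {'h': 10, 'm': 15}
```

In a real cluster, the mappers and reducers run on many machines and the shuffle moves data over the network; here both are collapsed into one dictionary.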
MAPREDUCE-GETTING TO KNOW
MAPREDUCE-Execution Pipeline
MAPREDUCE-Runtime Coordination and Task Management
word count implementation-Map Phase
word count implementation-Reduce Phase
word count implementation-Driver
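The map-phase, reduce-phase, and driver slides are shown only as images; a plain-Python sketch of the classic word count (a simplified stand-in for the actual Hadoop Java classes) could look like this:

```python
from collections import defaultdict

def wc_mapper(line):
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def wc_reducer(word, counts):
    # Reduce phase: sum the partial counts for one word.
    return word, sum(counts)

def wc_driver(lines):
    # Driver: run the mappers, shuffle (group by key), then run the reducers.
    groups = defaultdict(list)
    for line in lines:
        for word, one in wc_mapper(line):
            groups[word].append(one)
    return dict(wc_reducer(w, c) for w, c in groups.items())

print(wc_driver(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Hadoop the driver would instead configure a Job with mapper and reducer classes and input/output paths; the logic per phase is the same.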
DESIGNING MAPREDUCE IMPLEMENTATIONS
Necessary questions to reformulate the initial
problem in terms of MapReduce
• How do you break up a large problem into smaller tasks? More specifically,
how do you decompose the problem so that the smaller tasks can be
executed in parallel?
• Which key/value pairs can you use as inputs/outputs of every task?
• How do you bring together all the data required for calculation? More
specifically, how do you organize the processing so that all the data
necessary for a calculation is in memory at the same time?
Simple Data Processing with
MapReduce
Inverted Indexes Example
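The worked example is shown only as an image; a hedged sketch of the inverted-index computation (toy documents and a single-process shuffle, not the actual Hadoop job) follows:

```python
from collections import defaultdict

# Toy documents; in Hadoop the mapper would receive (doc_id, text) pairs.
docs = {1: "big data and cloud", 2: "cloud computing", 3: "big data analytics"}

def index_mapper(doc_id, text):
    # Map phase: emit (term, doc_id) for every term in the document.
    for term in text.split():
        yield term, doc_id

# Shuffle groups the pairs by term; the reduce step emits each term's
# sorted posting list (the list of documents containing the term).
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term, d in index_mapper(doc_id, text):
        postings[term].add(d)
inverted = {term: sorted(ids) for term, ids in postings.items()}

print(inverted["big"])    # [1, 3]
print(inverted["cloud"])  # [1, 2]
```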
Building Joins with MapReduce
• Two “standard” implementations exist for joining data in MapReduce:
• Reduce-side join
• Map-side join
• The most common implementation of a join is the reduce-side join.
• A map-side join works very well in the case of one-to-one joins, where at most
one record from every data set has the same key.
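A minimal single-process sketch of the reduce-side join idea (the `users`/`orders` data sets and their tags are illustrative, not from the slides): each mapper tags its records by source so that the shuffle brings all records with the same key to one reducer, which then crosses the two groups.

```python
from collections import defaultdict

# Each mapper would emit (join_key, (source_tag, record)).
users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

groups = defaultdict(lambda: {"user": [], "order": []})
for key, name in users:
    groups[key]["user"].append(name)       # tagged: from the users set
for key, item in orders:
    groups[key]["order"].append(item)      # tagged: from the orders set

# Reduce side: cross the two tagged lists per key to produce joined rows.
joined = sorted((u, o) for g in groups.values()
                for u in g["user"] for o in g["order"])
print(joined)  # [('alice', 'book'), ('alice', 'pen'), ('bob', 'lamp')]
```

The cost of this pattern is that both data sets flow through the shuffle, which is why the map-side join is preferred when its one-to-one precondition holds.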
Road Enrichment Example
A simplified road enrichment algorithm
1. Find all links connected to a given node. For example, as shown in Figure, node N1 has links L1, L2, L3, and L4, while node N2 has links L4, L5, and L6.
2. Based on the number of lanes for every link at the node, calculate the road width at the intersection.
3. Based on the road width, calculate the intersection geometry.
4. Based on the intersection geometry, move the road’s end point to tie it to the intersection geometry.
Algorithm assumptions
• A node is described with an object N with the key N1 … Nm. For example, node
N1 can be described as NN1 and N2 as NN2. All the nodes are stored in the nodes
input file.
• A link is described with an object L with the key L1 … Lm. For example, link L1
can be described as LL1, L2 as LL2, and so on. All the links are stored in the links
source file.
• Also introduced is an object of type link or node (LN), which can have any key.
• Finally, it is necessary to define two more types: intersection (S) and road (R).
Phase 1
Calculation of Intersection Geometry and Moving the Road’s End Points Job
Phase 2
Merge Roads Job
Links Elevation Example
• This problem can be defined as follows. Given a links graph and a terrain
model, convert two-dimensional (x, y) links into three-dimensional (x, y, z)
links. This process is called link elevation.
Simplified link elevation
algorithm
1. Split every link into fixed-length
fragments (for example, 10 meters).
2. For every piece, calculate heights (from
the terrain model) for both its start and
end points.
3. Combine the pieces back into the original
links.
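The three steps above can be sketched in plain Python; `terrain` here is a hypothetical stand-in for the terrain model, and the whole function stands in for what would be a distributed job:

```python
import math

def elevate_link(p1, p2, terrain, step=10.0):
    """Sketch of the algorithm: split a 2-D link into fixed-length
    pieces, look up a height for each piece's end points in a terrain
    model, and recombine the pieces into one 3-D link.
    `terrain` is an assumed callable (x, y) -> z."""
    (x1, y1), (x2, y2) = p1, p2
    length = math.hypot(x2 - x1, y2 - y1)
    n = max(1, int(math.ceil(length / step)))   # number of fragments
    points = []
    for i in range(n + 1):                      # fragment end points
        t = i / n
        x, y = x1 + t * (x2 - x1), y1 + t * (y2 - y1)
        points.append((x, y, terrain(x, y)))    # elevate each point
    return points                               # recombined 3-D link

# Flat-plane terrain model, purely for illustration.
link3d = elevate_link((0, 0), (30, 0), terrain=lambda x, y: 5.0)
print(len(link3d))   # 4 points -> 3 fragments of 10 m
print(link3d[0])     # (0.0, 0.0, 5.0)
```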
Phase 1
Split Links into Pieces and Elevate Each
Piece Job
Phase 2
Combine Link’s Pieces into Original Links Job