Big data
![Page 1: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/1.jpg)
RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD
Introduction to Big Data - Survival Guide!
Luan Cestari
February 28, 2014
![Page 2: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/2.jpg)
Please, let me ask ...
● Who has already tested a product or project related to Big Data?
● Who works with Big Data?
![Page 3: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/3.jpg)
What are we going to see here
● Demystifying the term "Big Data", and beyond!
● What people claim Big Data to be
● What is the relationship between Big Data and databases
● Some facts about database history
● Why are there so many databases available?
● How to glue all this stuff together?
● Some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues
![Page 4: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/4.jpg)
Why Big Data is important
● Many companies are already dealing with Big Data using Open Source tools
● There is demand for people to work with those tools as developers and analysts
● You can also work on the integration between those systems, on improving an already existing tool, or on building the next Big Data tool
![Page 5: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/5.jpg)
Why Big Data is important
● When a company uses Big Data tools, its environment can grow very fast and become complex:
● Many different clusters (due to multi-tenancy, geographic location, or different versions)
● Different technologies for closely related purposes (also due to different team skills or use cases)
● Many software integrations, layers to segregate the different aspects, and refactoring due to the fast pace
![Page 6: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/6.jpg)
Cool ... but what is Big Data after all?
● Just tons of information isn't enough; it also needs to have:
● Variety
● Velocity
● Value
● And Volume
![Page 7: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/7.jpg)
More about Volume: How big can it be?
● What is the size of a daily batch job at Facebook? 100 GB? 10,000 GB? 100,000 GB?
● Answer: 104,857,600 gigabytes of user logs
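The figure on the slide is easier to grasp in larger units. The following is a minimal sketch of the conversion, assuming binary multiples (1024 GB per TB, 1024 TB per PB); the slide itself does not say which convention it uses, and the function names are only illustrative.

```python
# Convert the slide's daily log volume into larger units.
# Assumption: binary multiples (1024 GB per TB, 1024 TB per PB).
DAILY_LOG_GB = 104_857_600


def gb_to_tb(gb: int) -> float:
    """Gigabytes to terabytes, using a factor of 1024."""
    return gb / 1024


def gb_to_pb(gb: int) -> float:
    """Gigabytes to petabytes, using a factor of 1024 * 1024."""
    return gb / 1024 ** 2


print(gb_to_tb(DAILY_LOG_GB))  # 102400.0
print(gb_to_pb(DAILY_LOG_GB))  # 100.0
```

Under that assumption the answer is exactly 100 PB per day, far beyond the three options offered in the question.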
![Page 8: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/8.jpg)
More about Variety: Where does the data come from?
● Customer-generated content
● M2M
● Sensors
● B2B
● B2C
● Social networks
● And other devices: mobile phones, set-top boxes, security cameras
![Page 9: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/9.jpg)
More about Value
● The value comes from processing the data in a reasonable period of time, so you can forecast something. For that you will need some data scientists, so they can do:
● Analyses (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing)
![Page 10: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/10.jpg)
More about Value
● The value comes from processing the data in a reasonable period of time, so you can forecast something. For that you will need some data scientists, so they can:
● Find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing)
![Page 11: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/11.jpg)
More about Value
● So the value is in the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over your competitors
● Open Source also helps here by delivering that value in a cost-effective way, instead of buying tons of expensive tools
![Page 12: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/12.jpg)
... and the Velocity
● This is a very interesting point, because different analyses may have different time requirements:
● A traffic system may need a streaming system to analyze and predict the current traffic and suggest better routes across the city
● The same traffic system may need to process several weeks of data to get a good prediction of the average traffic on a road, so that could be an offline batch job
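The two traffic scenarios above can be contrasted in a few lines. This is only a sketch in plain Python, not tied to any streaming or batch framework; the readings and the window size are made-up values for illustration.

```python
from collections import deque
from statistics import mean


def batch_average(history):
    """Offline batch: scan the whole stored history in one pass."""
    return mean(history)


class SlidingWindowAverage:
    """Streaming: keep only the most recent readings, update on every event."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # old readings fall out automatically

    def update(self, reading):
        self.window.append(reading)
        return mean(self.window)


# Hypothetical vehicle counts per minute on one road.
readings = [40, 42, 38, 90, 95, 100]

stream = SlidingWindowAverage(size=3)
latest = [stream.update(r) for r in readings][-1]

print(batch_average(readings))  # long-term average over all stored data
print(latest)                   # average of the last 3 readings only
```

The batch result describes the road in general, while the streaming result reacts to the jam that is forming right now; the same data answers two different questions at two different speeds.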
![Page 13: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/13.jpg)
... and the Velocity
● The main point is that there isn't a silver bullet for this: different storage systems may be required for the different services you aim to provide
![Page 14: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/14.jpg)
SQL History
● Hierarchical databases in the 1960s
● Then relational databases in the 1980s, which until a couple of years ago were the only solution used in most enterprises
● Big companies used to buy expensive, specialized data warehouse (DW) database systems to analyze their data
![Page 15: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/15.jpg)
... and now
![Page 16: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/16.jpg)
... and now
![Page 17: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/17.jpg)
Again the reason for that
● For example, Web analysis at Facebook:
● +1 billion users
● +240 billion photos
● +1 trillion connections
● 22% of references of the Internet
● Harvard Business Review:
● A change from a DW to a Big Data system made a 96-hour job run in just 4 hours
● In 2012, 2.5 exabytes of data were created each day
![Page 18: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/18.jpg)
We need to avoid the Golden hammer/Silver Bullet Anti-pattern
![Page 19: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/19.jpg)
Hadoop ecosystem saves the day
● Open Source projects that help you deal with Big Data
● No need for vertical scaling (big machines): you can use a cluster of commodity machines and achieve even better results
● Parallel processing
● Fault-tolerant jobs
● Redundant and distributed data (to survive disk failures and to avoid moving data around)
● Less complex programming model
● It has low-level native libraries for high performance
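The programming model at the core of Hadoop is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain Python on a single machine; it does not use the Hadoop API, and the function names are illustrative. A thread pool stands in for the cluster nodes that would run map tasks in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so the work can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, lines))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The developer writes only `map_phase` and `reduce_phase`; distribution, grouping, and retries are the framework's job, which is what keeps the model comparatively simple.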
![Page 20: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/20.jpg)
Hadoop ecosystem saves the day
● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(
● Well, there is no silver bullet; we need more tools
● That is why Hadoop is not alone: there are many different projects that integrate with it
![Page 21: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/21.jpg)
Hadoop ecosystem saves the day
● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(
● Well, there is no silver bullet; we need more tools
● That is why Hadoop is not alone: there are many different projects that integrate with it
● Several big companies offer Hadoop and other projects bundled as one product, and they help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and about how they integrate. Find more at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
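One commonly cited reason for the small-files problem is that the HDFS NameNode keeps metadata for every file and block in memory. The sketch below estimates that cost under two loudly stated assumptions that do not come from the slides: roughly 150 bytes per metadata object (a rule of thumb, not a measured value) and a 128 MB block size.

```python
import math

BYTES_PER_OBJECT = 150        # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 ** 2  # assumed block size of 128 MB


def namenode_objects(file_count, file_size):
    """Each file costs one file object plus one object per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


def namenode_bytes(file_count, file_size):
    """Estimated NameNode memory for file_count files of file_size bytes."""
    return namenode_objects(file_count, file_size) * BYTES_PER_OBJECT


ONE_TB = 1024 ** 4
# The same 1 TB of data stored as 1 MB files versus 1 GB files.
small = namenode_bytes(ONE_TB // 1024 ** 2, 1024 ** 2)
large = namenode_bytes(ONE_TB // 1024 ** 3, 1024 ** 3)
print(small, large, small // large)
```

With these assumptions the same terabyte costs a couple of hundred times more metadata memory when it is split into 1 MB files, which hints at why many small files strain the file system long before the disks fill up.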
![Page 22: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/22.jpg)
Hadoop ecosystem saves the day
● Cloudera: CDH
![Page 23: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/23.jpg)
Hadoop ecosystem saves the day
● Cloudera:
● How to create this whole stack with minimum effort: Cloudera Manager
![Page 24: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/24.jpg)
Hadoop ecosystem saves the day
● Hortonworks: HDP
![Page 25: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/25.jpg)
Hadoop ecosystem saves the day
● Hortonworks:
● They use Ambari to manage the cluster, like Cloudera Manager does
● They also have Tez to speed up the workloads
![Page 26: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/26.jpg)
Hadoop ecosystem saves the day
● And more tools:
● You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example across tenants or clouds)
● Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more
![Page 27: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/27.jpg)
Hadoop ecosystem saves the day
● There are more tools for specific cases, like low latency with the Spark ecosystem
![Page 28: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/28.jpg)
Hadoop ecosystem saves the day
● But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, and Google MillWheel
![Page 29: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/29.jpg)
The integration with other systems will be complex
● An overview:
![Page 30: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/30.jpg)
A different approach: Lambda Architecture
● An idea from people on the Twitter team (like Nathan Marz) about how to deal with Big Data systems
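The Lambda Architecture is usually described as a batch layer that recomputes views from an immutable master dataset, a speed layer that covers the most recent events, and a query step that merges both. The toy page-view counter below is a minimal sketch of that idea in plain Python; the class and method names are hypothetical, and a real system would use separate storage and processing engines for each layer.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with a batch layer, a speed layer, and a merged query."""

    def __init__(self):
        self.master_log = []            # immutable, append-only raw events
        self.batch_view = Counter()     # recomputed from the full log
        self.realtime_view = Counter()  # covers events since the last batch run

    def ingest(self, page):
        self.master_log.append(page)
        self.realtime_view[page] += 1   # speed layer: incremental update

    def run_batch(self):
        self.batch_view = Counter(self.master_log)  # batch layer: full recompute
        self.realtime_view.clear()                  # those events are now covered

    def query(self, page):
        # Serving step: merge the precomputed view with the recent delta.
        return self.batch_view[page] + self.realtime_view[page]


system = LambdaCounter()
for page in ["home", "home", "about"]:
    system.ingest(page)
system.run_batch()
system.ingest("home")          # arrives after the batch run
print(system.query("home"))    # 3
```

Because the batch view is always rebuilt from the raw log, a bug in the speed layer is corrected at the next batch run, which is the main appeal of the design.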
![Page 31: Big data](https://reader034.fdocuments.us/reader034/viewer/2022042700/557b2b6bd8b42a71798b54df/html5/thumbnails/31.jpg)
Questions?
Scalable, Portable, On-demand, Resource Management, Measurable
The difference in http://www.slideshare.net/CAinc/cloud-expo-session-from-virtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
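To make the DAG idea concrete, the sketch below orders a set of dependent actions with Python's standard `graphlib` module (Python 3.9 or later). It is not Oozie's workflow format, which is XML-based, and the action names are hypothetical; it only shows how a dependency graph determines a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical actions; each one maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

The "acyclic" requirement matters: if two actions depended on each other, no valid order would exist and the sorter would raise an error instead of returning a schedule.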
Apache Whirr is a set of libraries for running cloud services.
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Open MPI is an open source implementation of MPI, a standardized API typically used for parallel and/or distributed computing.