Big Data Technology Stack


Description

An overview of the Big Data technology stack in a nutshell. This deck covers the different layers of the Big Data world and summarizes the major technologies in vogue today.

Transcript of Big Data Technology Stack

Page 1: Big Data Technology Stack

By: Khalid Imran

Big Data: Technology Stack

Page 2: Big Data Technology Stack

Agenda

▪ Big Data Stack: In a Nutshell

▪ Data Layer

▪ Data Processing Layer

▪ Data Ingestion Layer

▪ Data Presentation Layer

▪ Operations & Scheduling Layer

▪ Security & Governance


Page 3: Big Data Technology Stack

Big Data Technology Stack: In a Nutshell


Page 4: Big Data Technology Stack

Data Layer


Hadoop Distributed File System (HDFS)

HDFS is a scalable, fault-tolerant, Java-based distributed file system used for storing large volumes of data on inexpensive commodity hardware.
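As a minimal sketch of what working with HDFS looks like from code, the snippet below uses the third-party `hdfs` Python package (a WebHDFS client). The NameNode URL, user and paths are assumptions for illustration.

```python
# Minimal HDFS interaction over WebHDFS via the `hdfs` package
# (pip install hdfs). Host, user and paths are hypothetical.
from hdfs import InsecureClient

# 9870 is the default WebHDFS port in Hadoop 3.x (50070 in 2.x).
client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

client.makedirs('/data/raw')                         # create a directory
client.upload('/data/raw/events.log', 'events.log')  # push a local file in
print(client.list('/data/raw'))                      # list directory contents

# Stream a file back out of HDFS.
with client.read('/data/raw/events.log') as reader:
    content = reader.read()
```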

Amazon Simple Storage Service (S3)

S3 is Amazon's scalable, distributed cloud storage service. Although it is an object store rather than a conventional file system, it can serve as the data layer in big data applications, coupled with other required components.
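A minimal sketch of using S3 as a data layer via boto3 follows; the bucket and key names are hypothetical, and credentials are assumed to be configured in the environment.

```python
# Upload a file to S3 and list objects under a prefix (pip install boto3).
import boto3

s3 = boto3.client('s3')

s3.upload_file('events.log', 'my-datalake-bucket', 'raw/events.log')
response = s3.list_objects_v2(Bucket='my-datalake-bucket', Prefix='raw/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])
```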

IBM General Parallel File System (GPFS) / Spectrum Scale

GPFS is a high-performance clustered file system developed by IBM, now marketed as IBM Spectrum Scale.

Page 5: Big Data Technology Stack

Data Processing Layer


Hadoop MapReduce

Hadoop MapReduce is a software framework for distributed processing of large datasets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system.
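The classic illustration of this model is word count. Below is a sketch in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading stdin and writing stdout; in practice they live in separate files, and the input/output paths in the run command are assumptions.

```python
# Run roughly as (jar location and paths are hypothetical):
#   hadoop jar hadoop-streaming.jar -input /in -output /out \
#       -mapper mapper.py -reducer reducer.py
import sys

# mapper.py: emit "word<TAB>1" for every word in the input split.
def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

# reducer.py: the framework sorts map output by key, so all counts for
# a given word arrive consecutively and can be summed with a running total.
def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
```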

Apache Pig

Pig is a high-level platform for creating MapReduce programs used with Hadoop. Apache Pig allows Apache Hadoop users to write complex MapReduce transformations in a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce jobs so that it can be executed on the data. Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from Pig Latin.
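As a hedged sketch of the UDF mechanism, here is a tiny Python (Jython) UDF. Pig injects the outputSchema decorator when the script is registered, so the file is not meant to run standalone; the function and field names are hypothetical.

```python
# udfs.py: a Python UDF for Pig. Registered and invoked from Pig Latin
# along these lines (names are hypothetical):
#   REGISTER 'udfs.py' USING jython AS udfs;
#   cleaned = FOREACH raw GENERATE udfs.normalize(word);

@outputSchema("word:chararray")  # decorator provided by Pig's Jython runtime
def normalize(word):
    """Lower-case and trim a token before grouping."""
    return word.strip().lower() if word else None
```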

Page 6: Big Data Technology Stack

Data Processing Layer


Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive is used to explore, structure and analyze data, then turn it into actionable business insight. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as Amazon S3. It provides an SQL-like language called HiveQL with schema-on-read and transparently converts queries into MapReduce, Apache Tez or Apache Spark jobs.
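A minimal sketch of querying Hive from Python with PyHive follows; the host, table and column names are hypothetical, and HiveServer2 is assumed to listen on its default port.

```python
# Query Hive over HiveServer2 (pip install 'pyhive[hive]').
from pyhive import hive

conn = hive.connect(host='hiveserver.example.com', port=10000, username='hadoop')
cursor = conn.cursor()

# HiveQL reads like SQL; Hive compiles it to MapReduce/Tez/Spark jobs.
cursor.execute("""
    SELECT country, COUNT(*) AS visits
    FROM web_logs
    GROUP BY country
    ORDER BY visits DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```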

Apache HBase

HBase is an open-source NoSQL database that provides real-time read/write access to large datasets with extremely low latency as well as fault tolerance. HBase runs on top of HDFS. It provides a strong consistency model and range-based partitioning. Reads, including range scans, tend to scale better on HBase, whereas writes do not scale as well as they do on Cassandra.
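The sketch below shows a write and a range scan through HBase's Thrift gateway using the happybase package; the table, column family and row keys are hypothetical, and a running HBase Thrift server is assumed.

```python
# HBase reads/writes via Thrift (pip install happybase).
import happybase

connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('metrics')

# Writes target a column family ('cf') qualified by a column name.
table.put(b'sensor1-20240101', {b'cf:temp': b'21.4'})

# Range-based reads: scan a contiguous slice of row keys.
for key, data in table.scan(row_start=b'sensor1-', row_stop=b'sensor2-'):
    print(key, data)
```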

Page 7: Big Data Technology Stack

Data Processing Layer


Apache Cassandra

Cassandra is another open-source distributed NoSQL database. It is highly scalable, fault tolerant and can be used to manage huge volumes of data. Cassandra's consistency model is based on Amazon's Dynamo: it provides eventual consistency. This is very appealing for some applications where you want to guarantee the availability of writes. Similarly, Cassandra tends to provide very good write scaling.
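To make the tunable, Dynamo-style consistency concrete, here is a minimal sketch using the DataStax Python driver; the keyspace, table and host are hypothetical.

```python
# Write to Cassandra with an explicit consistency level
# (pip install cassandra-driver).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['cassandra.example.com'])
session = cluster.connect('analytics')

# ConsistencyLevel.ONE favours write availability over strictness:
# the write succeeds once a single replica acknowledges it.
insert = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(insert, ("evt-1", "clicked"))
```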

Apache Storm

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
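Storm topologies are JVM-based, but bolts and spouts can also be written in Python; one way is the third-party streamparse library. Below is a hedged sketch of a running word-count bolt, assuming the upstream spout emits one word per tuple.

```python
# A streamparse bolt (pip install streamparse) that keeps a running
# word count over an unbounded stream. Stream layout is an assumption.
from collections import Counter
from streamparse import Bolt

class WordCountBolt(Bolt):
    outputs = ['word', 'count']

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])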

Apache Solr

Apache Solr is an open-source search platform for data stored in HDFS. Solr powers the search and navigation features of many of the world’s largest Internet sites, enabling powerful full-text search and near real-time indexing. Apache Solr can be used for rapidly finding tabular, text, geo-location or sensor data stored in Hadoop.
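A minimal sketch of indexing and searching with Solr via the pysolr package; the core name, URL and document fields are hypothetical.

```python
# Index documents and run a full-text search (pip install pysolr).
import pysolr

solr = pysolr.Solr('http://solr.example.com:8983/solr/logs', timeout=10)

# Index documents; once committed they become searchable near real-time.
solr.add([
    {'id': '1', 'message': 'disk failure on node 7', 'level': 'ERROR'},
    {'id': '2', 'message': 'job completed', 'level': 'INFO'},
])

# Full-text query with a filter query on the level field.
for result in solr.search('message:failure', fq='level:ERROR'):
    print(result['id'], result['message'])
```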

Page 8: Big Data Technology Stack

Data Processing Layer


Apache Spark

Apache Spark is an open-source cluster computing framework for large-scale data processing. Benchmarks have shown that Spark can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; its in-memory computation model is what gives it this speed advantage over MapReduce. Spark runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS), as well as process structured data from Hive and streaming data from HDFS, Flume, Kafka, Twitter and other sources.
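For comparison with the MapReduce example earlier, here is the same word count expressed as PySpark in-memory transformations; the input path is hypothetical.

```python
# Word count as a chain of Spark transformations (pyspark ships with Spark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/raw/events.log")
         .flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```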

Apache Mahout

Apache Mahout is a library of scalable machine-learning algorithms that can be implemented on top of Apache Hadoop using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed; it is commonly used to improve future performance based on previous outcomes. Mahout provides the tools and algorithms to automatically find meaningful patterns in big datasets stored in HDFS.

Page 9: Big Data Technology Stack

Data Ingestion Layer


Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (e.g. application logs, sensor and machine data, geo-location data and social media) into HDFS. It has a simple and flexible architecture based on streaming data flows, and is robust and fault tolerant, with configurable reliability mechanisms for failover and recovery.

Apache Kafka

Kafka is a high-throughput distributed messaging system. Kafka maintains feeds of messages in categories called topics. Producers are processes that publish messages to a Kafka topic. Consumers are processes that subscribe to topics and process the feed of published messages. Kafka runs as a cluster of one or more servers, each of which is called a broker.
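A minimal producer/consumer sketch using the kafka-python client follows; the broker address, topic name and group id are hypothetical.

```python
# Publish to and consume from a Kafka topic (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish messages to a topic.
producer = KafkaProducer(bootstrap_servers='broker.example.com:9092')
producer.send('clickstream', b'{"user": 42, "page": "/home"}')
producer.flush()

# Consumer: subscribe to the topic and process the feed.
consumer = KafkaConsumer(
    'clickstream',
    bootstrap_servers='broker.example.com:9092',
    auto_offset_reset='earliest',
    group_id='analytics',
)
for message in consumer:
    print(message.offset, message.value)
```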

Apache Sqoop

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from an RDBMS or a mainframe into HDFS, transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.
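Sqoop is driven from the command line; a typical import invocation is sketched below, wrapped in subprocess so it is runnable from Python. The JDBC URL, credentials and table are hypothetical.

```python
# Import an RDBMS table into HDFS with Sqoop.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl",
    "--password-file", "/user/etl/.password",
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",   # parallel map tasks performing the import
], check=True)
```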

Page 10: Big Data Technology Stack

Data Presentation Layer


Kibana

Kibana is an analytics and visualization plugin that works with Elasticsearch. It provides real-time summary and charting of streaming data. Its visualization capabilities allow users to build different charts, plots and maps over large volumes of data.
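Since Kibana visualizes data held in Elasticsearch, feeding it is a matter of indexing documents. A minimal sketch with a recent (8.x) official Python client follows; the index name and fields are hypothetical.

```python
# Index a log document into Elasticsearch for Kibana to chart
# (pip install elasticsearch).
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch('http://elastic.example.com:9200')

es.index(index='app-logs', document={
    '@timestamp': datetime.now(timezone.utc).isoformat(),
    'level': 'ERROR',
    'message': 'payment service timeout',
})
```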

Page 11: Big Data Technology Stack

Operations & Scheduling Layer


Ambari

Ambari is an open framework that helps in provisioning, managing and monitoring of Apache Hadoop clusters. It simplifies the deployment and maintenance of hosts. Ambari also includes an intuitive web interface that allows one to easily provision, configure and test all the Hadoop services and core components. It also comes with the powerful Ambari Blueprints API that can be utilized for automating cluster installations without any user intervention.
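Ambari is managed over REST, so a blueprint can be registered with a plain HTTP call. Below is a rough sketch using requests, assuming an Ambari server with default admin credentials; the blueprint body is heavily abbreviated and hypothetical.

```python
# Register a cluster blueprint with the Ambari Blueprints API.
import requests

ambari = 'http://ambari.example.com:8080/api/v1'
auth = ('admin', 'admin')
headers = {'X-Requested-By': 'ambari'}  # Ambari requires this header

blueprint = {
    'Blueprints': {'blueprint_name': 'single-node',
                   'stack_name': 'HDP', 'stack_version': '2.6'},
    'host_groups': [{'name': 'master', 'cardinality': '1',
                     'components': [{'name': 'NAMENODE'},
                                    {'name': 'DATANODE'}]}],
}
r = requests.post(f'{ambari}/blueprints/single-node',
                  json=blueprint, auth=auth, headers=headers)
r.raise_for_status()
```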

Apache Oozie

Apache Oozie provides operational service capabilities for a Hadoop cluster, specifically around job scheduling within the cluster. Oozie is a Java based web application that is primarily used to schedule Apache Hadoop jobs. Oozie can combine multiple jobs sequentially into one logical unit of work. It can be integrated with the Hadoop stack, and supports Hadoop jobs for various Apache tools such as MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.

Apache ZooKeeper

Apache ZooKeeper provides operational services for a Hadoop cluster. It provides a distributed configuration service, a synchronization service and a naming registry; distributed systems can use ZooKeeper to store and mediate updates to important configuration information.
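A minimal sketch of ZooKeeper as a configuration store, using the kazoo client; the host and znode paths are hypothetical.

```python
# Store and read configuration in ZooKeeper (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1.example.com:2181')
zk.start()

# Store a piece of configuration under a znode and read it back.
zk.ensure_path('/config/ingest')
zk.set('/config/ingest', b'batch_size=500')
value, stat = zk.get('/config/ingest')
print(value, stat.version)

zk.stop()
```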

Page 12: Big Data Technology Stack

About Me – Khalid Imran


A tester by passion, I’ve spent the past 16+ years testing disparate systems, learning new domains, developing innovative solutions, designing test strategies, challenging conventional methods, proving new techniques and embracing emerging tools & technologies. The breadth and depth of my experience cuts across functional testing, non-functional testing, manual, automation, test and project execution methodologies, licensed and open stack tools, platforms, devices, programming languages, custom-built test harnesses and utilities, delivery management, client engagement, on-site/off-shore team dynamics and more.

I am currently heading the 1400+ strong QA testing practice at Cybage as a QA Evangelist. I manage the Testing Centre of Excellence (TCoE), lead a team of architects and specialists, and assist in deliveries across the organization, pre-sales and business development, solutioning and consultancy, training and the process improvement group. I hold multiple certifications, namely CSQA, CSM and CPISI.

I welcome any questions or feedback you may have on this presentation.