Big data and tools


Page 1: Big data and tools

Big Data

Presented by: SHIVAM SHUKLA

Page 2: Big data and tools

Contents

What is Big Data?

History

Three V's

Why is Big Data important?

Technologies related to Big Data

Hadoop

Why Hadoop?

HBase

Why HBase?

Some features of HBase

Page 3: Big data and tools

Hive

About

Points to remember

Sqoop

Working

Difference

Page 4: Big data and tools

What is Big Data?

Big data is a term that describes the large volumes of data:

a) Structured

b) Unstructured

c) Semi-structured

that inundate a business on a day-to-day basis.

But it's not the amount of data that's important; it's what organizations do with the data that matters.

Page 5: Big data and tools

History

While the term "big data" is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old.

The concept gained momentum in the early 2000s, when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:

Volume

Velocity

Variety

Page 6: Big data and tools

Three V's:

Volume

The huge amount of data produced each day by organizations around the world.

Velocity

The speed at which data is generated, analyzed, and processed.

Variety

The diversity of data and data sources.

Page 7: Big data and tools
Page 8: Big data and tools

Additional V's

Over time, new V's of big data have been introduced.

Validity

Refers to the guarantee of data quality; closely related, Veracity is the authenticity and credibility of the data.

Value

Denotes the added value for companies. Many companies have recently established their own data platforms, filled their data pools, and invested a lot of money in infrastructure. It is now a question of generating business value from those investments.

Page 9: Big data and tools

Why is Big Data important?

The importance of big data doesn't revolve around how much data you have, but what you do with it.

You can take data from any source and analyze it to find answers that enable:

Cost reduction

Time reduction

Smart decision making

Page 10: Big data and tools

Some technologies related to Big Data

Hadoop framework

HBase

Hive

Sqoop

Page 11: Big data and tools

Hadoop

Hadoop was developed by Doug Cutting and Michael J. Cafarella.

Hadoop is an Apache open-source framework designed for:

Managing data

Processing data

Analyzing data

Storing data

Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for offline (batch) processing.

The logo for Hadoop is a YELLOW ELEPHANT.
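To make the "processing" part concrete, here is a minimal Java sketch of the classic WordCount MapReduce job (not part of the original slides; class and path names are illustrative). The mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Split each input line on whitespace and emit (word, 1).
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Sum all the 1s emitted for this word.
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }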

Page 12: Big data and tools

Why Hadoop?

Fast:

In HDFS, data is distributed over the cluster and mapped, which helps in faster retrieval.

Scalable:

A Hadoop cluster can be extended by simply adding nodes to the cluster.

Cost effective:

Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.

Resilient to failure:

HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop takes another copy of the data and uses it (sketched below).
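As an illustration of the resilience point, here is a small Java sketch (not from the slides; file names are placeholders) that copies a local file into HDFS and requests three replicas of each block, using the standard org.apache.hadoop.fs.FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplicationExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath, so this
            // runs against whatever cluster the client is configured for.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path local = new Path("sales-2024.csv");       // placeholder local file
            Path remote = new Path("/data/sales-2024.csv"); // placeholder HDFS path

            // Copy into HDFS; the file is split into blocks spread over the cluster.
            fs.copyFromLocalFile(local, remote);

            // Ask for 3 replicas of each block, so losing one node loses no data.
            fs.setReplication(remote, (short) 3);

            System.out.println("Stored " + remote + " with replication "
                + fs.getFileStatus(remote).getReplication());
        }
    }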

Page 13: Big data and tools

HBase

HBase is an open-source framework provided by Apache. It is a sorted map datastore built on Hadoop.

It is column-oriented and horizontally scalable.

It has a set of tables which keep data in key-value format.

It is a type of database designed mainly for managing unstructured data.

The logo for Apache HBase is an ORCA (killer whale).
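A minimal sketch of the key-value model using the standard HBase Java client API (not from the slides; the "users" table, row key, and values are made-up examples, and the table is assumed to already exist with an "info" column family):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValueExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write: row key -> (column family, qualifier, value), all byte arrays.
                Put put = new Put(Bytes.toBytes("user#1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Shivam"));
                table.put(put);

                // Read the value back by row key.
                Result result = table.get(new Get(Bytes.toBytes("user#1001")));
                byte[] name = result.getValue(Bytes.toBytes("info"),
                                              Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));
            }
        }
    }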

Page 14: Big data and tools

Why HBase?

An RDBMS gets exponentially slower as the data becomes large.

It expects data to be highly structured, i.e., able to fit in a well-defined schema.

Any change in schema might require downtime.

For sparse datasets, there is too much overhead in maintaining NULL values.

Page 15: Big data and tools

Some features of HBase

Horizontally scalable: capacity grows by adding nodes, and new columns can be added to a table at any time.

Often referred to as a key-value store, a column-family-oriented database, or a store of versioned maps of maps.

Fundamentally, it is a platform for storing and retrieving data with random access.

It doesn't care about datatypes (you can store an integer in one row and a string in another for the same column; see the snippet below).

There is only one data type: the byte array.

It doesn't enforce relationships within your data.

It is designed to run on a cluster of computers.
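To illustrate the byte-array typing, a short continuation of the earlier HBase sketch (same imports plus java.util.Arrays, same placeholder "users" table): the same column holds an int in one row and a string in another, and HBase never checks.

    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) {

        // Same "info:score" column, two different types; both are just bytes to HBase.
        Put p1 = new Put(Bytes.toBytes("row1"));
        p1.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"),
                     Bytes.toBytes(42));          // an int
        Put p2 = new Put(Bytes.toBytes("row2"));
        p2.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"),
                     Bytes.toBytes("forty-two")); // a String
        table.put(java.util.Arrays.asList(p1, p2));

        // The reader must know which decoding to apply to the raw bytes.
        byte[] raw = table.get(new Get(Bytes.toBytes("row1")))
                          .getValue(Bytes.toBytes("info"), Bytes.toBytes("score"));
        System.out.println(Bytes.toInt(raw)); // prints 42
    }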

Page 16: Big data and tools

Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop.

It runs SQL-like queries, written in HQL (Hive Query Language), which are internally converted into MapReduce jobs.

Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.

Hive supports a Data Definition Language (DDL), a Data Manipulation Language (DML), and user-defined functions.

The logo for Hive is a yellow and black BEE.
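A minimal sketch of querying Hive over JDBC (not from the slides; the server URL and the "sales" table are placeholders, and the Hive JDBC driver, org.apache.hive.jdbc.HiveDriver, must be on the classpath). Behind the scenes, Hive compiles the SELECT below into MapReduce jobs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 conventionally listens on port 10000.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = con.createStatement();
                 // Ordinary-looking HQL; executed as MapReduce on the cluster.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }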

Page 17: Big data and tools

Hive is not:

A relational database

Designed for online transaction processing (OLTP)

A language for real-time queries and row-level updates

Even with a small amount of data, its response time can't be compared to an RDBMS.

Page 18: Big data and tools

Points to remember about Hive

Hive Query Language is similar to SQL and is compiled down to MapReduce jobs in the backend.

Hive's default metastore database is Derby.

It is sometimes loosely grouped with NoSQL tools, although strictly it is a data warehouse layer on top of Hadoop rather than a database.

It provides an SQL-like language for querying called HiveQL or HQL.

It is designed for OLAP (online analytical processing).
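As a small illustration of Hive's DDL and bulk loading, continuing with the java.sql.Statement from the JDBC sketch above (the table name and HDFS path are made up for the example):

    // DDL: define a delimited text table in the warehouse.
    stmt.execute("CREATE TABLE IF NOT EXISTS sales ("
        + "region STRING, amount DOUBLE) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

    // Bulk load: move an existing HDFS file into the table's storage.
    stmt.execute("LOAD DATA INPATH '/data/sales.csv' INTO TABLE sales");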

Page 19: Big data and tools

Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational database servers.

It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.

It is provided by the Apache Software Foundation.

Sqoop: "SQL to Hadoop and Hadoop to SQL"
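Sqoop is normally driven from the command line; as a sketch that stays in Java, here is one way to shell out to the standard sqoop CLI (connection details, table, and paths are placeholders):

    import java.io.IOException;

    public class SqoopImportExample {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Equivalent shell command:
            //   sqoop import --connect jdbc:mysql://dbhost/sales --username report \
            //     --password-file /user/report/.pw --table customers --target-dir /data/customers
            ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales",  // source RDBMS
                "--username", "report",
                "--password-file", "/user/report/.pw",     // avoid plaintext passwords
                "--table", "customers",                    // one table per import
                "--target-dir", "/data/customers");        // HDFS destination
            pb.inheritIO();                                // stream sqoop's output to our console
            int exit = pb.start().waitFor();
            System.out.println("sqoop import finished with exit code " + exit);
        }
    }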

Page 20: Big data and tools

Working of Sqoop

Page 21: Big data and tools

Difference

Sqoop Import

The import tool imports individual tables from an RDBMS into HDFS.

Each row in a table is treated as a record in HDFS.

All records are stored as text data in text files or as binary data in Avro and SequenceFiles.

Sqoop Export

The export tool exports a set of files from HDFS back to an RDBMS.

The files given as input to Sqoop contain records, which become rows in the table.

They are read and parsed into a set of records, delimited by a user-specified delimiter (see the sketch below).
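And the mirror-image export, again a hedged sketch with placeholder names: --export-dir points at the HDFS files to read, and --input-fields-terminated-by supplies the user-specified delimiter mentioned above.

    // Equivalent shell command:
    //   sqoop export --connect jdbc:mysql://dbhost/sales --username report \
    //     --password-file /user/report/.pw --table customers_backup \
    //     --export-dir /data/customers --input-fields-terminated-by ','
    ProcessBuilder pb = new ProcessBuilder(
        "sqoop", "export",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "report",
        "--password-file", "/user/report/.pw",
        "--table", "customers_backup",           // target table must already exist
        "--export-dir", "/data/customers",       // HDFS files to read and parse
        "--input-fields-terminated-by", ",");    // the user-specified delimiter
    pb.inheritIO();
    System.exit(pb.start().waitFor());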

Page 22: Big data and tools

Thank you! Any queries?