Big data overview
Presented By Ladislav Urban
www.syoncloud.com
Ladislav Urban, CEO of Syoncloud.
Syoncloud is a consulting company specializing in Big Data analytics and integration of existing systems.
WWW.SYONCLOUD.COM E-MAIL : [email protected] MOBILE : 077 9664 6474
CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
Documents
Existing relational databases (CRM, ERP, Accounting, Billing)
E-mails and attachments
Imaging data (graphs, technical plans)
Sensor or device data
Internet search indexing
Log files
Social media
Telephone conversations
Videos
Pictures
Clickstreams (clicks from users on web pages)
SCALE OF THE DATA
WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
If relational databases do not scale to your traffic needs
If the normalized schema of your relational database has become too complex
If your business applications generate lots of supporting and temporary data
If the database schema is already denormalized in order to improve response times
If joins in relational databases slow the system down to a crawl
When we try to map complex hierarchical documents to database tables
When documents from different sources require a flexible schema
When more data beats clever algorithms
When flexibility is required for analytics
When we need queries for values at a specific time in history
When we need to utilize outputs from many existing systems
To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms
WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
The SQL language: well known, standardized, and based on a strong mathematical foundation
Database schemas that do not need to be modified in production
Well suited when scalability is not required
Mature security features: role-based security, encrypted communications, row- and field-level access control
Full support of ACID transactions (atomicity, consistency, isolation, durability)
Support for backup and rollback in case of data loss or corruption
Development, tuning, and monitoring tools with good GUIs
Batch vs Real-time Processing
Batch processing is used when real-time processing is not required, not possible, or too expensive. Typical batch workloads include:
Conversion of unstructured data such as text files and log files into more structured records
Transformation during ETL
Ad-hoc analysis of data
Data analytics applications and reporting
BATCH PROCESSING INFRASTRUCTURE
Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.
It is possible to develop a batch processing application in Java using only Hadoop, but we should mention other important systems and how they fit into the Hadoop infrastructure.
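As a concrete illustration, here is a minimal sketch of a word-count job written directly against the Hadoop Java API (the org.apache.hadoop.mapreduce API; the package name and input/output paths are illustrative assumptions, not part of the original deck):

package org.myorg;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word on the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this could be submitted with, for example, hadoop jar wordcount.jar org.myorg.WordCount <input> <output>.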
APACHE AVRO
In order to process data we need information about data types and data schemas. This information is used for serialization and deserialization in RPC communication, as well as for reading and writing files.
Avro is an RPC and serialization system that supports rich data structures
It uses JSON to define data types and protocols
It serializes data in a compact binary format
Avro supports schema evolution: it handles missing, extra, and modified fields
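A minimal sketch of these points using Avro's Java API; the "User" schema below is a made-up example, and the org.apache.avro library is assumed to be on the classpath:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Avro schemas are plain JSON documents.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize the record into Avro's compact binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}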
SCRIPT LANGUAGE FOR MAP/REDUCE
We need a quick and simple way to create Map/Reduce transformations, analyses, and applications.
We need a scripting language that can be used in scripts as well as interactively on the command line.
APACHE PIG
Pig is a high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to run over a distributed dataset.
Below is an example that filters a log file for warning messages only and will run in parallel on a large cluster.
The script is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.
messages = LOAD '/var/log/messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;
Relational operators that can be used in Pig:
FILTER - Select a set of tuples from a relation based on a condition.
FOREACH - Iterate over the tuples of a relation, generating a data transformation.
GROUP - Group the data in one or more relations.
JOIN - Join two or more relations (inner or outer join).
LOAD - Load data from the file system.
ORDER - Sort a relation based on one or more fields.
SPLIT - Partition a relation into two or more relations.
STORE - Store data in the file system.
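As a small sketch of how a few of these operators combine, the filtering example above can be extended to count the warning lines (the field alias and output path are assumptions):

messages = LOAD '/var/log/messages' AS (line:chararray);
warns = FILTER messages BY line MATCHES '.*WARN+.*';
grouped = GROUP warns ALL;
counts = FOREACH grouped GENERATE COUNT(warns);
STORE counts INTO '/tmp/warn_count';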
What if we want to use SQL to create Map/Reduce jobs?
Apache Hive is a data warehousing infrastructure built on Hadoop.
It provides a query language called HiveQL, which is based on SQL.
APACHE HIVE
Hive provides data summarization, querying, and analysis.
It uses a system catalog called the Hive Metastore.
Hive is not designed for OLTP or real-time queries; it is best used for batch jobs over large sets of append-only data.
The HiveQL language supports the ability to:
Filter rows from a table using a WHERE clause.
Select certain columns from a table using a SELECT clause.
Do equi-joins between two tables.
Evaluate aggregations on multiple "group by" columns for the data stored in a table.
Store the results of a query into another table.
Download the contents of a table to a local (NFS) directory.
Store the results of a query in an HDFS directory.
Manage tables and partitions (create, drop, and alter).
Plug in custom scripts in the language of choice for custom map/reduce jobs.
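A minimal HiveQL sketch of a few of these operations; the table names, columns, and HDFS path are made-up examples:

CREATE TABLE logs (ts STRING, level STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE level_counts (level STRING, cnt BIGINT);

-- Load append-only data from an HDFS directory into the table.
LOAD DATA INPATH '/data/logs/2013-01-01' INTO TABLE logs;

-- Aggregate on a "group by" column and store the results into
-- another table; Hive compiles this query into Map/Reduce jobs.
INSERT OVERWRITE TABLE level_counts
SELECT level, COUNT(1) FROM logs GROUP BY level;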
APACHE OOZIE
Map/Reduce jobs, Pig scripts, and Hive queries should be simple and single-purpose.
How can we create complex ETL or data analysis in Hadoop? We chain scripts so that the output of one script is the input of another.
Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.
Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop Map/Reduce jobs, Pig jobs, and others.
An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph).
This means that the second action cannot run until the first one has completed.
Oozie workflow definitions are written in hPDL (an XML process definition language similar to jPDL from JBoss jBPM).
Workflow actions start jobs in the Hadoop cluster. Upon completion of an action, Hadoop calls back to Oozie to notify it that the action has completed, at which point Oozie proceeds to the next action in the workflow.
Oozie workflows contain control flow nodes (start, end, fail, decision, fork, and join) and action nodes (the actual jobs).
Workflows can be parameterized using variables such as ${inputDir} within the workflow definition.
Example of OOZIE workflow definition
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>
APACHE Sqoop
Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.
It can be used to populate tables in Hive and HBase.
Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
Sqoop uses a connector-based architecture that supports plugins providing connectivity to external systems.
Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2, as well as a generic JDBC connector.
The transferred dataset is sliced into partitions, and a map-only job is launched in which individual mappers are responsible for transferring a slice of the dataset.
Sqoop uses the database metadata to infer data types.
Apache Sqoop – Import to HDFS
An example of using Sqoop to import data from the MySQL database table ORDERS into a Hive table running on Hadoop:
sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --hive-import
Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the commands necessary to load the table or partition.
Apache Sqoop – Export to Database
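For the export direction, a minimal sketch using the sqoop export tool against the same example database; the HDFS directory below is an assumed location of the table data:

sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/hive/warehouse/orders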
APACHE FLUME
Flume is a distributed system that reliably collects, aggregates, and moves large amounts of log data from many different sources to a centralized data store.
A Flume Source consumes events delivered to it by an external source such as a web server.
When a Flume Source receives an event, it stores it into one or more Channels.
A Channel is a passive store that keeps the event until it is consumed by a Flume Sink.
The Sink removes the event from the Channel and puts it into an external repository such as HDFS.
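A minimal sketch of how a Source, a Channel, and a Sink are wired together in a Flume agent configuration file; the agent name, port, and HDFS path are assumptions:

# Name the components of this agent
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: receive Avro events from clients on port 41414
agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 41414
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory until the sink consumes them
agent1.channels.ch1.type = memory

# Sink: remove events from the channel and write them to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1

Such an agent could be started with, for example, flume-ng agent --conf-file flume.conf --name agent1.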
APACHE FLUME FEATURES
Flume allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.
It also allows fan-in and fan-out flows, contextual routing, and backup routes (fail-over) for failed hops.
Flume uses a transactional approach to guarantee reliable delivery of events. Events are staged in the channel, which manages recovery from failure.
Flume supports log stream types such as Avro, Syslog, and Netcat.
DISTCP - DISTRIBUTED COPY
DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying.
It uses Map/Reduce for its distribution, error handling, recovery, and reporting.
It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
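Typical usage copies a directory tree from one cluster to another; a minimal example, with placeholder NameNode host names:

# Copy /foo/bar from the cluster at nn1 to /bar/foo on the cluster at nn2
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo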
REAL-TIME PROCESSING – NOSQL DATABASES
▪ Document stores: Apache CouchDB, MongoDB
▪ Graph stores: Neo4j
▪ Key-value stores: Apache Cassandra, Riak
▪ Tabular stores: Apache HBase
CAP THEOREM
HBASE ARCHITECTURE