Big data overview

Presented By Ladislav Urban www.syoncloud.com

Transcript of Big data overview

Page 1: Big data overview

Presented By Ladislav Urban

www.syoncloud.com

Page 2: Big data overview

Ladislav Urban, CEO of Syoncloud.

Syoncloud is a consulting company specializing in Big Data analytics and the integration of existing systems.

WWW.SYONCLOUD.COM E-MAIL : [email protected] MOBILE : 077 9664 6474

Page 3: Big data overview

CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED

Documents

Existing relational databases (CRM, ERP, Accounting, Billing)

E-mails and attachments

Imaging data (graphs, technical plans)

Sensor or device data

Internet search indexing

Log files

Social media

Page 4: Big data overview

CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED

Telephone conversations

Videos

Pictures

Clickstreams (clicks from users on web pages)

Page 5: Big data overview

SCALE OF THE DATA

Page 6: Big data overview

WHEN DO WE NEED NOSQL / BIG DATA SOLUTION?

If relational databases do not scale to your traffic needs

If the normalized schema of your relational database has become too complex

If your business applications generate lots of supporting and temporary data

If the database schema is already denormalized in order to improve response times

If joins in relational databases slow the system down to a crawl

Page 7: Big data overview

WHEN DO WE NEED NOSQL / BIG DATA SOLUTION?

When we try to map complex hierarchical documents to database tables

When documents from different sources require a flexible schema

When more data beats clever algorithms

When flexibility is required for analytics

When queries need values at a specific time in history

When we need to utilize outputs from many existing systems

Page 8: Big data overview

WHEN DO WE NEED NOSQL / BIG DATA SOLUTION?

To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms

Page 9: Big data overview

WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?

The SQL language: it is well known, standardized and based on strong mathematical theory.

Database schemas that do not need to be modified during production.

A good fit when scalability is not required.

Mature security features: role-based security, encrypted communications, row and field access control.

Full support of ACID transactions (atomicity, consistency, isolation, durability).

Page 10: Big data overview

WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?

Support for backup and rollback of data in case of data loss or corruption.

Relational databases have development, tuning and monitoring tools with good GUIs.

Page 11: Big data overview

Batch vs Real-time Processing

Batch processing is used when real-time processing is not required, not possible or too expensive.

Typical uses of batch processing:

Conversion of unstructured data such as text files and log files into more structured records

Transformation during ETL

Ad-hoc analysis of data

Data analytics applications and reporting
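As a rough illustration of converting log lines into structured records, here is a minimal Python sketch. The line layout (date, time, host, level, free text), the LogRecord type and the sample lines are assumptions made for this example; real log formats vary.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    host: str
    level: str
    message: str


# Assumed layout: "<date> <time> <host> <LEVEL> <free text>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[LogRecord]:
    """Return a structured record, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return LogRecord(**match.groupdict())


def parse_lines(lines):
    """Keep only the lines that could be parsed."""
    return [record for record in map(parse_line, lines) if record is not None]


sample = [
    "2013-05-01 10:15:02 web01 WARN disk usage at 91%",
    "2013-05-01 10:15:03 web01 INFO request served",
    "garbage line",
]
records = parse_lines(sample)
for record in records:
    print(record)
```

In a batch job the same parsing function would be applied to every line of every input file, with unparseable lines counted or written to a separate output.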

Page 12: Big data overview

BATCH PROCESSING INFRASTRUCTURE

Page 13: Big data overview

BATCH PROCESSING INFRASTRUCTURE

Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.

It is possible to develop a batch processing application in Java using only Hadoop, but we should mention other important systems and how they fit into the Hadoop infrastructure.
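The Map/Reduce model itself can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle and reduce phases for a word count, not Hadoop code; the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "big cluster"])
print(counts)
```

In Hadoop the map and reduce functions run on many machines and the shuffle moves data between them over the network; the structure of the computation is the same.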

Page 14: Big data overview

APACHE AVRO

In order to process data we need to have information about data types and data schemas.

This information is used for serialization and deserialization in RPC communications as well as for reading and writing files.

Page 15: Big data overview

APACHE AVRO

An RPC and serialization system that supports rich data structures

It uses JSON to define data types and protocols

It serializes data in a compact binary format

Avro supports schema evolution

Avro will handle missing, extra and modified fields
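To make the idea of JSON-defined schemas and schema evolution more concrete, here is a simplified Python sketch. It does not use the Avro library: the Order record, its fields and the resolve function are invented for illustration, and real Avro works on binary data with more detailed resolution rules.

```python
import json

# Schema used when the data was written (field names are hypothetical).
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "Order",
 "fields": [
   {"name": "id", "type": "long"},
   {"name": "customer", "type": "string"},
   {"name": "legacy_code", "type": "string"}
 ]}
""")

# Newer schema used by the reader: one field dropped, one added with a default.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "Order",
 "fields": [
   {"name": "id", "type": "long"},
   {"name": "customer", "type": "string"},
   {"name": "currency", "type": "string", "default": "GBP"}
 ]}
""")


def resolve(record, reader_schema):
    """Simplified resolution: drop extra fields, fill missing ones from defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


written = {"id": 7, "customer": "ACME", "legacy_code": "X1"}
read_back = resolve(written, READER_SCHEMA)
print(read_back)
```

The extra field legacy_code is ignored by the reader and the missing field currency is filled from its default, which is the behaviour the slide refers to as handling missing and extra fields.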

Page 16: Big data overview

SCRIPT LANGUAGE FOR MAP/REDUCE

We need a quick and simple way to create Map/Reduce transformations, analysis and applications.

We need a script language that can be used in scripts as well as interactively on the command line.

Page 17: Big data overview

APACHE PIG

Page 18: Big data overview

APACHE PIG

A high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform.

Pig simplifies the use of Hadoop by allowing SQL-like queries to run on a distributed dataset.

Page 19: Big data overview

APACHE PIG

An example of filtering a log file for only warning messages, which will run in parallel on a large cluster.

The given script is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.

messages = LOAD '/var/log/messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;

Page 20: Big data overview

APACHE PIG

Relational operators that can be used in Pig:

FILTER - Select a set of tuples from a relation based on a condition.

FOREACH - Iterate over the tuples of a relation, generating a data transformation.

GROUP - Group the data in one or more relations.

JOIN - Join two or more relations (inner or outer join).

LOAD - Load data from the file system.

ORDER - Sort a relation based on one or more fields.

SPLIT - Partition a relation into two or more relations.

STORE - Store data in the file system.
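The meaning of some of these operators can be illustrated on a small in-memory relation. The sketch below is plain Python, not Pig Latin; the relation, its fields (host, level, bytes sent) and the helper names are assumptions for this example.

```python
from collections import defaultdict

# A "relation" is modelled as a list of tuples: (host, level, bytes_sent).
relation = [
    ("web01", "WARN", 120),
    ("web02", "INFO", 300),
    ("web01", "INFO", 80),
    ("web02", "WARN", 50),
]


def pig_filter(rel, condition):
    """FILTER: keep the tuples that satisfy the condition."""
    return [t for t in rel if condition(t)]


def pig_foreach(rel, generate):
    """FOREACH: transform every tuple of the relation."""
    return [generate(t) for t in rel]


def pig_group(rel, key):
    """GROUP: collect tuples that share the same key."""
    groups = defaultdict(list)
    for t in rel:
        groups[key(t)].append(t)
    return sorted(groups.items())


def pig_order(rel, key):
    """ORDER: sort the relation by the given key."""
    return sorted(rel, key=key)


warns = pig_filter(relation, lambda t: t[1] == "WARN")
by_host = pig_group(relation, lambda t: t[0])
totals = pig_foreach(by_host, lambda g: (g[0], sum(t[2] for t in g[1])))
ranked = pig_order(totals, lambda t: -t[1])
print(warns)
print(ranked)
```

Pig applies the same kind of operations, but it compiles them into Map/Reduce jobs so that each step can run in parallel over data stored in the cluster.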

Page 21: Big data overview

What if we want to use SQL to create map/reduce jobs?

Apache Hive is a data warehousing infrastructure based on Hadoop.

It provides a query language called HiveQL, which is based on SQL.

Page 22: Big data overview

APACHE HIVE

Hive functions: data summarization, query and analysis.

It uses a system catalog called the Hive Metastore.

Hive is not designed for OLTP or real-time queries.

It is best used for batch jobs over large sets of append-only data.

Page 23: Big data overview

APACHE HIVE

Page 24: Big data overview

The HiveQL language supports the ability to:

Filter rows from a table using a where clause.

Select certain columns from the table using a select clause.

Do equi-joins between two tables.

Evaluate aggregations on multiple "group by" columns for the data stored in a table.

Store the results of a query into another table.

Download the contents of a table to a local (NFS) directory.

Page 25: Big data overview

The HiveQL language supports the ability to:

Store the results of a query in an HDFS directory.

Manage tables and partitions (create, drop and alter).

Plug in custom scripts in the language of choice for custom map/reduce jobs.
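Several of the query capabilities listed for HiveQL (where clause, column selection, equi-join, aggregation on multiple group by columns, storing results into another table) can be tried out locally. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: it is not Hive, HiveQL syntax differs in places, and the tables and data are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, region TEXT, amount REAL);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 'north', 20.0), (2, 10, 'south', 5.0),
                              (3, 11, 'north', 7.5);
    INSERT INTO customers VALUES (10, 'ACME'), (11, 'Globex');
""")

# Filter rows with a where clause and select only certain columns.
big_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE amount > 6 ORDER BY id"
).fetchall()

# Equi-join two tables, aggregate on two group by columns,
# and store the result of the query into another table.
conn.execute("""
    CREATE TABLE sales_summary AS
    SELECT c.name AS name, o.region AS region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name, o.region
""")
summary = conn.execute(
    "SELECT name, region, total FROM sales_summary ORDER BY name, region"
).fetchall()

print(big_orders)
print(summary)
conn.close()
```

The difference in Hive is where the work happens: each query is turned into batch jobs over files in HDFS, which is why it suits large append-only data sets rather than interactive lookups.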

Page 26: Big data overview

APACHE OOZIE

Map/Reduce jobs, Pig scripts and Hive queries should be simple and single-purpose.

How can we create complex ETL or data analysis in Hadoop?

We chain scripts so that the output of one script is the input for another.

Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.

Page 27: Big data overview

APACHE OOZIE

Oozie is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce jobs, Pig jobs and others.

An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph).

This means that an action cannot run until the actions it depends on have completed.

Oozie workflow definitions are written in hPDL (an XML Process Definition Language similar to JBoss jBPM jPDL).

Page 28: Big data overview

APACHE OOZIE

Workflow actions start jobs in the Hadoop cluster. Upon action completion, Hadoop calls back Oozie to notify it of the completion; at this point Oozie proceeds to the next action in the workflow.

Oozie workflows contain control flow nodes (start, end, kill, decision, fork and join) and action nodes (the actual jobs).

Workflows can be parameterized (using variables like ${inputDir} within the workflow definition).
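The two ideas above, ordering actions by their dependencies and filling in parameters such as ${inputDir}, can be sketched in plain Python. This is not Oozie code: the action names and the path are hypothetical, graphlib requires Python 3.9 or later, and string.Template only handles simple variables, not Oozie EL functions.

```python
from graphlib import TopologicalSorter
from string import Template

# Hypothetical workflow: each action maps to the set of actions it depends on.
dependencies = {
    "import_orders": set(),
    "clean_orders": {"import_orders"},
    "aggregate": {"clean_orders"},
    "export_report": {"aggregate"},
}


def execution_order(deps):
    """Return an order in which every action follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def render(template_text, params):
    """Substitute ${name} placeholders with values from params."""
    return Template(template_text).substitute(params)


order = execution_order(dependencies)
input_property = render("<value>${inputDir}</value>", {"inputDir": "/data/in"})

print(order)
print(input_property)
```

A cycle in the dependencies would make such an ordering impossible, which is why the workflow has to be acyclic.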

Page 29: Big data overview

Example of OOZIE workflow definition

Page 30: Big data overview

Example of OOZIE workflow definition

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>

Page 31: Big data overview

Example of OOZIE workflow definition (continued)

                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

Page 32: Big data overview

APACHE Sqoop

Page 33: Big data overview

APACHE Sqoop

Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.

It can be used to populate tables in Hive and HBase.

Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.

Sqoop uses a connector-based architecture which supports plugins that provide connectivity to external systems.

Page 34: Big data overview

APACHE Sqoop

Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2, as well as a generic JDBC connector.

The transferred dataset is sliced up into partitions, and a map-only job is launched with individual mappers each responsible for transferring a slice of the dataset.

Sqoop uses the database metadata to infer data types.
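The slicing of a dataset across mappers can be illustrated with a small Python sketch. It assumes the table has a numeric key column (here called id, a hypothetical name) whose minimum and maximum are known, and it splits that range evenly; this shows the idea only and is not Sqoop's implementation.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Split the inclusive range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` slices get one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def slice_query(table, column, bounds):
    """Build the query one mapper would run for its slice (illustrative only)."""
    low, high = bounds
    return f"SELECT * FROM {table} WHERE {column} >= {low} AND {column} <= {high}"


slices = split_ranges(1, 10, 4)
queries = [slice_query("ORDERS", "id", s) for s in slices]
for query in queries:
    print(query)
```

Each mapper then reads only its own slice, so the transfer runs in parallel and no row is copied twice.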

Page 35: Big data overview

Apache Sqoop – Import to HDFS

Page 36: Big data overview

APACHE Sqoop

Sqoop example to import data from the MySQL database table ORDERS into a Hive table running on Hadoop.

sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --hive-import

Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition.

Page 37: Big data overview

Apache Sqoop – Export to Database

Page 38: Big data overview

APACHE FLUME

Flume is a distributed system to reliably collect, aggregate and move large amounts of log data from many different sources to a centralized data store.

Page 39: Big data overview

APACHE FLUME

Page 40: Big data overview

APACHE FLUME

A Flume Source consumes events delivered to it by an external source such as a web server.

When a Flume Source receives an event, it stores it into one or more Channels.

The Channel is a passive store that keeps the event until it is consumed by a Flume Sink.

The Sink removes the event from the Channel and puts it into an external repository such as HDFS.
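The Source, Channel and Sink roles described above can be modelled in a few lines of Python. This is a toy, single-process sketch and not the Flume API: the class names mirror the Flume terms, a plain list stands in for HDFS, and the events are invented.

```python
from queue import Queue, Empty


class Channel:
    """Passive store: holds events until a sink takes them."""

    def __init__(self):
        self._queue = Queue()

    def put(self, event):
        self._queue.put(event)

    def take(self):
        """Return the next event, or None if the channel is empty."""
        try:
            return self._queue.get_nowait()
        except Empty:
            return None


class Source:
    """Receives events and stores each one into every attached channel."""

    def __init__(self, channels):
        self.channels = channels

    def receive(self, event):
        for channel in self.channels:
            channel.put(event)


class Sink:
    """Removes events from a channel and writes them to a repository."""

    def __init__(self, channel, repository):
        self.channel = channel
        self.repository = repository

    def drain(self):
        while (event := self.channel.take()) is not None:
            self.repository.append(event)


channel = Channel()
repository = []  # stands in for an external repository such as HDFS
source = Source([channel])
sink = Sink(channel, repository)

source.receive("GET /index.html 200")
source.receive("GET /missing 404")
sink.drain()
print(repository)
```

Because the channel sits between source and sink, the two sides can run at different speeds, and events wait in the channel until the sink has delivered them.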

Page 41: Big data overview

APACHE FLUME FEATURES

Flume allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.

It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.

Flume uses a transactional approach to guarantee reliable delivery of events.

Events are staged in the channel, which manages recovery from failure.

Flume supports log stream types such as Avro, Syslog and Netcat.

Page 42: Big data overview

DISTCP - DISTRIBUTED COPY

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying.

It uses Map/Reduce for its distribution, error handling and recovery, and reporting.

It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
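How a file list might be divided among map tasks can be sketched as follows. The greedy size-balancing strategy, the file paths and the sizes are assumptions chosen for illustration; this is not the algorithm DistCp itself is documented to use.

```python
import heapq


def partition_files(files, num_maps):
    """Assign (path, size) entries to map tasks, largest file first,
    always giving the next file to the task with the least data so far."""
    partitions = [[] for _ in range(num_maps)]
    heap = [(0, index) for index in range(num_maps)]  # (bytes assigned, task)
    heapq.heapify(heap)
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, index = heapq.heappop(heap)
        partitions[index].append(path)
        heapq.heappush(heap, (load + size, index))
    return partitions


files = [("/logs/a", 400), ("/logs/b", 300), ("/logs/c", 200), ("/logs/d", 100)]
parts = partition_files(files, 2)
print(parts)
```

Each inner list is the set of files one map task would copy, so the total copy time is roughly that of the most heavily loaded task.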

Page 43: Big data overview

REAL-TIME PROCESSING – NOSQL DATABASES

▪ 5.1 Document stores

Apache CouchDB, MongoDB

▪ 5.2 Graph Stores

Neo4j

▪ 5.3 Key-Value Stores

Apache Cassandra, Riak

▪ 5.4 Tabular Stores

Apache HBase

Page 44: Big data overview

CAP THEOREM

Page 45: Big data overview

HBASE ARCHITECTURE

Page 46: Big data overview

QUESTIONS & ANSWERS

www.syoncloud.com

[email protected]

Mobile : 077 9664 6474

LADISLAV URBAN