Big data overview
Presented By Ladislav Urban
www.syoncloud.com
Ladislav Urban, CEO of Syoncloud.
Syoncloud is a consulting company specializing in Big Data analytics and integration of existing systems.
WWW.SYONCLOUD.COM E-MAIL : [email protected] MOBILE : 077 9664 6474
CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
Documents
Existing relational databases (CRM, ERP, Accounting, Billing)
E-mails and attachments
Imaging data (graphs, technical plans)
Sensor or device data
Internet search indexing
Log files
Social media
Telephone conversations
Videos
Pictures
Clickstreams (clicks from users on web pages)
SCALE OF THE DATA
WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
If relational databases do not scale to your traffic needs
If the normalized schema of your relational database has become too complex
If your business applications generate lots of supporting and temporary data
If the database schema is already denormalized in order to improve response times
If joins in relational databases slow the system down to a crawl
When we try to map complex hierarchical documents to database tables
When documents from different sources require a flexible schema
When more data beats clever algorithms
When flexibility is required for analytics
When we need queries for values at a specific time in history
When we need to utilize outputs from many existing systems
To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms
WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
The SQL language: well known, standardized, and based on a strong mathematical foundation
Database schemas that do not need to be modified in production
Well suited when scalability is not required
Mature security features: role-based security, encrypted communications, row- and field-level access control
Full support of ACID transactions (atomicity, consistency, isolation, durability)
Support for backup and rollback in case of data loss or corruption
Development, tuning, and monitoring tools with good GUIs
Batch vs Real-time Processing
Batch processing is used when real-time processing is not required, not possible, or too expensive. Typical batch workloads include:
Conversion of unstructured data such as text files and log files into more structured records
Transformation during ETL
Ad-hoc analysis of data
Data analytics applications and reporting
BATCH PROCESSING INFRASTRUCTURE
Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.
It is possible to develop a batch processing application in Java using only Hadoop, but we should mention other important systems and how they fit into the Hadoop infrastructure.
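As a concrete illustration, here is a minimal sketch of a word-count job written directly against the Hadoop Java API (the org.apache.hadoop.mapreduce API; the package name and input/output paths are illustrative assumptions, not part of the original deck):

package org.myorg;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word on the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this could be submitted with, for example, hadoop jar wordcount.jar org.myorg.WordCount <input> <output>.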
APACHE AVRO
In order to process data we need information about data types and data schemas. This information is used for serialization and deserialization in RPC communication, as well as for reading and writing files.
Avro is an RPC and serialization system that supports rich data structures
It uses JSON to define data types and protocols
It serializes data in a compact binary format
Avro supports schema evolution: it handles missing, extra, and modified fields
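A minimal sketch of these points using Avro's Java API; the "User" schema below is a made-up example, and the org.apache.avro library is assumed to be on the classpath:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Avro schemas are plain JSON documents.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize the record into Avro's compact binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}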
SCRIPT LANGUAGE FOR MAP/REDUCE
We need a quick and simple way to create Map/Reduce transformations, analyses, and applications.
We need a scripting language that can be used in scripts as well as interactively on the command line.
APACHE PIG
Pig is a high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to run over a distributed dataset.
Below is an example that filters a log file for warning messages only and will run in parallel on a large cluster.
The script is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.
messages = LOAD '/var/log/messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;
Relational operators that can be used in Pig:
FILTER - Select a set of tuples from a relation based on a condition.
FOREACH - Iterate over the tuples of a relation, generating a data transformation.
GROUP - Group the data in one or more relations.
JOIN - Join two or more relations (inner or outer join).
LOAD - Load data from the file system.
ORDER - Sort a relation based on one or more fields.
SPLIT - Partition a relation into two or more relations.
STORE - Store data in the file system.
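As a small sketch of how a few of these operators combine, the filtering example above can be extended to count the warning lines (the field alias and output path are assumptions):

messages = LOAD '/var/log/messages' AS (line:chararray);
warns = FILTER messages BY line MATCHES '.*WARN+.*';
grouped = GROUP warns ALL;
counts = FOREACH grouped GENERATE COUNT(warns);
STORE counts INTO '/tmp/warn_count';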
What if we want to use SQL to create Map/Reduce jobs?
Apache Hive is a data warehousing infrastructure built on Hadoop.
It provides a query language called HiveQL, which is based on SQL.
APACHE HIVE
Hive provides data summarization, querying, and analysis.
It uses a system catalog called the Hive Metastore.
Hive is not designed for OLTP or real-time queries; it is best used for batch jobs over large sets of append-only data.
The HiveQL language supports the ability to:
Filter rows from a table using a WHERE clause.
Select certain columns from a table using a SELECT clause.
Do equi-joins between two tables.
Evaluate aggregations on multiple "group by" columns for the data stored in a table.
Store the results of a query into another table.
Download the contents of a table to a local (NFS) directory.
Store the results of a query in an HDFS directory.
Manage tables and partitions (create, drop, and alter).
Plug in custom scripts in the language of choice for custom map/reduce jobs.
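A minimal HiveQL sketch of a few of these operations; the table names, columns, and HDFS path are made-up examples:

CREATE TABLE logs (ts STRING, level STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE level_counts (level STRING, cnt BIGINT);

-- Load append-only data from an HDFS directory into the table.
LOAD DATA INPATH '/data/logs/2013-01-01' INTO TABLE logs;

-- Aggregate on a "group by" column and store the results into
-- another table; Hive compiles this query into Map/Reduce jobs.
INSERT OVERWRITE TABLE level_counts
SELECT level, COUNT(1) FROM logs GROUP BY level;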
APACHE OOZIE
Map/Reduce jobs, Pig scripts, and Hive queries should be simple and single-purpose.
How can we create complex ETL or data analysis in Hadoop? We chain scripts so that the output of one script is the input of another.
Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.
Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop Map/Reduce jobs, Pig jobs, and others.
An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph).
This means that the second action cannot run until the first one has completed.
Oozie workflow definitions are written in hPDL (an XML process definition language similar to jPDL from JBoss jBPM).
Workflow actions start jobs in the Hadoop cluster. Upon completion of an action, Hadoop calls back to Oozie to notify it that the action has completed, at which point Oozie proceeds to the next action in the workflow.
Oozie workflows contain control flow nodes (start, end, fail, decision, fork, and join) and action nodes (the actual jobs).
Workflows can be parameterized using variables such as ${inputDir} within the workflow definition.
Example of OOZIE workflow definition
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>
APACHE Sqoop
Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.
It can be used to populate tables in Hive and HBase.
Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
Sqoop uses a connector-based architecture that supports plugins providing connectivity to external systems.
Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2, as well as a generic JDBC connector.
The transferred dataset is sliced into partitions, and a map-only job is launched in which individual mappers are responsible for transferring a slice of the dataset.
Sqoop uses the database metadata to infer data types.
Apache Sqoop – Import to HDFS
An example of using Sqoop to import data from the MySQL database table ORDERS into a Hive table running on Hadoop:
sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --hive-import
Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the commands necessary to load the table or partition.
Apache Sqoop – Export to Database
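For the export direction, a minimal sketch using the sqoop export tool against the same example database; the HDFS directory below is an assumed location of the table data:

sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/hive/warehouse/orders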
APACHE FLUME
Flume is a distributed system that reliably collects, aggregates, and moves large amounts of log data from many different sources to a centralized data store.
A Flume Source consumes events delivered to it by an external source such as a web server.
When a Flume Source receives an event, it stores it into one or more Channels.
A Channel is a passive store that keeps the event until it is consumed by a Flume Sink.
The Sink removes the event from the Channel and puts it into an external repository such as HDFS.
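A minimal sketch of how a Source, a Channel, and a Sink are wired together in a Flume agent configuration file; the agent name, port, and HDFS path are assumptions:

# Name the components of this agent
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: receive Avro events from clients on port 41414
agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 41414
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory until the sink consumes them
agent1.channels.ch1.type = memory

# Sink: remove events from the channel and write them to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1

Such an agent could be started with, for example, flume-ng agent --conf-file flume.conf --name agent1.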
APACHE FLUME FEATURES
Flume allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.
It also allows fan-in and fan-out flows, contextual routing, and backup routes (fail-over) for failed hops.
Flume uses a transactional approach to guarantee reliable delivery of events. Events are staged in the channel, which manages recovery from failure.
Flume supports log stream types such as Avro, Syslog, and Netcat.
DISTCP - DISTRIBUTED COPY
DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying.
It uses Map/Reduce for its distribution, error handling, recovery, and reporting.
It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
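Typical usage copies a directory tree from one cluster to another; a minimal example, with placeholder NameNode host names:

# Copy /foo/bar from the cluster at nn1 to /bar/foo on the cluster at nn2
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo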
REAL-TIME PROCESSING – NOSQL DATABASES
▪ Document stores: Apache CouchDB, MongoDB
▪ Graph stores: Neo4j
▪ Key-value stores: Apache Cassandra, Riak
▪ Tabular stores: Apache HBase
CAP THEOREM
HBASE ARCHITECTURE