Big Data tools in practice

Post on 07-Jan-2017


Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com

Agenda
• Hadoop
• Spark
• Python

Hadoop

• Pros
  • Linear scalability.
  • Commodity hardware.
  • Pricing and licensing.
  • Any data types.
  • Analytical queries.
  • Integration with traditional systems.

• Cons
  • Implementation.
  • Map Reduce ease of use.
  • Intense calculations with little data.
  • In memory.
  • Real time analytics.

The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.

Hadoop
• Hadoop Common
• HDFS
• Map Reduce
• YARN
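The Map Reduce component listed above is easiest to grasp with the stock word-count example, sketched here in plain Python in the Hadoop Streaming style: a mapper emits key/value pairs, Hadoop sorts and shuffles them, and a reducer aggregates values per key. This is an illustrative local sketch, not code from the slides; on a real cluster the mapper and reducer run as separate processes over HDFS data.

```python
# Word count in the Hadoop Streaming style. Hadoop would run the mapper
# and reducer as separate processes and sort/shuffle between the phases;
# here we chain them locally to show the programming model.
from itertools import groupby

def mapper(lines):
    # Emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop delivers pairs grouped by key; sum the counts per key.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    lines = ["big data tools", "big data in practice"]
    print(dict(reducer(mapper(lines))))
    # {'big': 2, 'data': 2, 'in': 1, 'practice': 1, 'tools': 1}
```

The `sorted()` call stands in for Hadoop's shuffle phase, which guarantees that all pairs with the same key reach the same reducer.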

Hadoop HDFS

Apache Spark

• Pros
  • 100X faster than Map Reduce.
  • Ease of use.
  • Streaming, MLlib, Graph and SQL.
  • Pricing and licensing.
  • In memory.
  • Integration with Hadoop.

• Cons
  • Integration with traditional systems.
  • Limited memory per machine (GC).
  • Configuration.

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Spark

Spark stack

Resilient Distributed Datasets

A distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of a data flow model like MapReduce.*

• Immutability
• Lineage (reconstruct lost partitions)
• Fault tolerance through logging updates made to a dataset (a single operation applied to many records)
• Creation:
  • Reading a dataset from storage (HDFS or any other)
  • From other RDDs

*Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
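Lineage can be illustrated without a cluster: instead of replicating results, each derived dataset remembers its parent and the function that produced it, so lost data can simply be recomputed. The `MiniRDD` class below is a toy invented for illustration; real RDDs partition data across machines and track lineage per partition.

```python
# Toy illustration of RDD lineage: a dataset remembers how it was
# derived, so its contents can always be recomputed from the source.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute   # function producing the data (the lineage)
        self._cache = None        # materialized data, if any

    def map(self, f):
        # A transformation returns a new MiniRDD; nothing runs yet.
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def collect(self):
        # An action: materialize (or re-materialize after a "failure").
        if self._cache is None:
            self._cache = self._compute()
        return self._cache

    def lose_partition(self):
        # Simulate losing the materialized data.
        self._cache = None

source = MiniRDD(lambda: [1, 2, 3])
doubled = source.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]
doubled.lose_partition()   # drop the data...
print(doubled.collect())   # ...recomputed from lineage: [2, 4, 6]
```

This is why the paper cited above describes fault tolerance "through logging updates" (the transformations) rather than logging the data itself: the transformation log is tiny compared to the dataset.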

RDD operations

• Transformations
  • Lazy evaluated (executed by calling an action)
    • Reduces wait states
    • Better pipelining
  • map(f: T ⇒ U)
  • filter(f: T ⇒ Bool)
  • groupByKey()
  • join()

• Actions
  • Run immediately
  • Return a value to the application or export to a storage system
  • count()
  • collect()
  • reduce(f: (T, T) ⇒ T)
  • save(path: String)
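The transformation/action split mirrors lazy iterators in plain Python: `map` and `filter` build a pipeline without touching the data, and nothing runs until a terminal operation (the analogue of a Spark action) consumes it. The example below is a local analogy, not PySpark:

```python
# Python's built-in map/filter are lazy, like Spark transformations:
# building the pipeline does no work until an "action" consumes it.
evaluated = []

def traced(x):
    evaluated.append(x)   # record when an element is actually processed
    return x * 10

data = range(5)
pipeline = map(traced, filter(lambda x: x % 2 == 0, data))  # nothing runs yet
print(evaluated)          # [] - still lazy

result = list(pipeline)   # the "action": forces evaluation
print(result)             # [0, 20, 40]
print(evaluated)          # [0, 2, 4]
```

As in Spark, laziness lets the two stages be pipelined: each element flows through the filter and the map in one pass, with no intermediate collection materialized.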

Spark program lifecycle

1. Create RDD (external data or parallelize collection)
2. Transformation (lazy evaluated)
3. Cache RDD (for reuse)
4. Action (execute computation and return results)

Spark in a cluster mode

* http://spark.apache.org/docs/latest/img/cluster-overview.png

PySpark

• Python API for Spark
• Easy-to-use programming abstraction and parallel runtime:
  • “Here’s an operation, run it on all of the data”
• Dynamically typed (RDDs can hold objects of multiple types)
• Integrates with other Python libraries, such as NumPy, Pandas, Scikit-learn, Flask
• Run Spark from Jupyter notebooks

Spark DataFrames

The DataFrame is a common data science abstraction that goes across languages.

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

A Spark DataFrame is a distributed collection of data organized into named columns, and can be created:
• from structured data files
• from Hive tables
• from external databases
• from RDDs

Some supported operations:
• slice data
• sort data
• aggregate data
• join with other DataFrames

DataFrame benefits

• Lazy evaluation
• Domain specific language for distributed data manipulation
• Automatic parallelization and cluster distribution
• Integration with pipeline API for MLlib
• Query structured data with SQL (using SQLContext)
• Integration with Pandas DataFrames (and other Python data libraries)

# Create an SQLContext from an existing SparkContext (sc)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Load a JSON file as a DataFrame and display it
df = sqlContext.read.json("data.json")
df.show()

# Select a single column
df.select("id").show()

# Filter rows by a column predicate
df.filter(df["id"] > 10).show()

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Load the JSON file and register it as a temporary SQL table
df = sqlContext.read.json("data.json")
df.registerTempTable("data")

# Query the DataFrame with plain SQL
results = sqlContext.sql("SELECT * FROM data WHERE id > 10")

Pandas DF vs Spark DF

Pandas DataFrame | Spark DataFrame
Single machine tool (all data needs to fit in memory, except with HDF5) | Distributed (data > memory)
Better API | Good API
No parallelism | Parallel by default
Mutable | Immutable

Some function differences: reading data, counting, displaying, inferring types, statistics, creating new columns (https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
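Two of the rows in the comparison are easy to see locally with Pandas alone: operations execute eagerly (no lazy plan), and a DataFrame can be mutated in place, whereas a Spark DataFrame is immutable and every transformation returns a new one. A minimal sketch, assuming Pandas is installed:

```python
import pandas as pd

# Pandas is eager: this filter runs immediately and returns a result.
df = pd.DataFrame({"id": [5, 15, 25]})
big = df[df["id"] > 10]
print(len(big))            # 2

# Pandas is mutable: a column can be changed in place.
df["id"] = df["id"] * 2
print(df["id"].tolist())   # [10, 30, 50]
# A Spark DataFrame offers no in-place update; a transformation such as
# withColumn would return a brand-new (lazily evaluated) DataFrame.
```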

A very popular benchmark

* https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png
