Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig...

Power Pig with Spark

Kelly Zhang (liyun.zhang@intel.com)

Apache Big Data Europe 2016

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Background

Apache Pig

● Procedural scripting language

● Pig Latin: similar to sql

● Heavily used for ETL

● Schema / No schema data, Pig eats everything

● Faster

● Generality

● Easy of use

Agenda

● Background

● Benchmark

● Optimization

● Q&A

Why Pig on Spark

● Better Performance

○ No intermediate data between stages

○ In-memory caching abstraction

○ Executor JVM Reuse

● Support Pig users to experience Spark conveniently

Agenda

● Background

● Benchmark

● Optimization

● Q&A

Design Architecture

Pig Latin to RDD<Tuple> transformations

Operator Mapping

Pig Operator Spark Operator

Load newAPIHadoopFile

Store saveAsNewAPIHadoopFile

Filter filter

GroupBy groupby/reduceBy

Join CoGroupRDD

ForEach mapPartitions

Sort sortByKey

Agenda

● Background

● Benchmark

● Optimization

● Q&A

Benchmark Overview

Component Version

Pig Spark branch

Hadoop 2.6.0

Spark 1.6.2

PigMix Trunk

Basic Configuration

spark.master=yarn-client

spark.executor.memory=6553m

spark.yarn.executor.memoryOverhead=1638

spark.executor.cores=8

spark.dynamicAllocation.enabled=true

spark.network.timeout=1200000

Benchmark Overview (cont’d)

Agenda

● Background

● Benchmark

● Optimization

● Q&A

Optimize GroupBy/Join

Skewed Key Sort

Salted Key Solution

Skewed Key Sort Performance

There are significant performance Improvement in sort case(L10) and skewed key sort case(L9)

Agenda

● Background

● Benchmark

● Optimization

● Q&A

Current Status: Nearing end of Milestone 1

● Functional completeness: DONE

● All Unit Tests Pass: DONE

● Merge Spark Branch to Master: In Code Review

Ongoing Work towards Milestone 2● Implement Optimizations

○ Optimize Group by/Join - PIG-4797: DONE

○ FR Join - PIG-4771: DONE

○ Merge Join - PIG-4810: DONE

○ Skewed Join: UNDER REVIEW

● Enhance Test Infrastructure

○ Use “local-cluster” mode to run unit tests

● Spark Integration

○ Improved error, progress, stats reporting

○ YARN Cluster Mode

Future work: Milestone 3

● Implement More Optimizations

○ Split / MultiQuery using RDD.cache()

○ Compute optimal Shuffle Parallelism

○ Optimize/Redesign Spark Plan

● Code Stablization, Bug Fixes

Contribution welcomed● Git:

○ https://github.com/apache/pig/tree/spark

● Wiki :

○ https://cwiki.apache.org/confluence/display/PIG/Pig

+on+Spark

● Umbrella jira:

○ PIG-4059

Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig...

Documents

Transcript of Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig...

Apache Pig Example

EEDC Apache Pig Language

Data Flow Languages & Apache Pig Lecture BigData Analytics€¦ · Data Flow Languages & Apache Pig Lecture BigData Analytics Julian M. Kunkel julian.kunkel@googlemail.com University

Integrating Apache Sqoop And Apache Pig With Apache Hadoop · 2020-01-07 · Integrating Apache Sqoop And Apache Pig With Apache Hadoop 3 6789 SecondaryNameNode 7057 TaskTracker If

NoSQL HBase schema design and SQL with Apache Drill

Apache Pig -- PittsburghHug

Studio Schema Editor Apache Directorydirectory.apache.org/.../Apache_Directory_Studio... · Beside the integration in Apache Directory Studio the Apache Directory Studio Schema Editor

A Smarter Pig: Building a SQL interface to Pig using …...A Smarter Pig: Building a SQL interface to Pig using Apache Calcite Eli Levine & Julian Hyde Apache: Big Data, Miami 2017/05/16

Apache Pig for Data Science · 2017. 12. 14. · Apache Pig for Data Science Casey Stella April9,2014 Casey Stella (Hortonworks) Apache Pig for Data Science April 9, 2014

Introduction to Apache Hadoop & Pig - SALSAHPCsalsahpc.indiana.edu/CloudCom2010/slides/PDF/tutorials/Yahoo... · Hadoop & Pig Milind Bhandarkar ... (hadoop, pig) (apache, pig) (hadoop,

MODULE 2 - pace.edu.in · Apache Pig Apache Pig is high level language that enables programmer to write complex MapReduce transformation using simple scripting language. Pig often

Hortonworks Data Platform for HDInsight · • Apache Pig 0.16.0 • Apache Ranger 1.2.0 • Apache Spark 2.4.0 • Apache Sqoop 1.4.7 • Apache Storm 1.2.1 • Apache TEZ 0.9.1

Classes, Methods and Schema - Apache Isis · Table of Contents 1. Classes, Methods and Schema ...

High-level Programming Languages: Apache Pig and Pig Latin

How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations Thejas Nair pig team @ Yahoo! Apache pig.

COSC 416 Apache Pig - People | UBC's Okanagan campus · 2013. 1. 27. · Apache Pig Apache Pig is a high-level language for expressing Map-Reduce programs. Pig defines a language

Apache Pig on Amazon AWS - Swine Not?

Pig Udf Schema Example

Apache Pig

Faster ETL Workflows using Apache Pig & Spark€¦ · Faster ETL Workflows using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics @praveenr019. About me Apache Pig committer