MapReduce with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup

Scalding.io

$ whoami

Scalding.io

http://scalding.io

http://github.com/scalding-io

@chalkiopoulos

My recent achievement..

What are we gonna talk about..?

A Scala API on top of Cascading

But what is Cascading?

A few years ago I started on a fresh Big Data team…

Story!!

How do we efficiently develop MapReduce jobs for our new Hadoop cluster?

MapReduce Techs

[Diagram: Java MapReduce on top of Hadoop]

Java MapReduce Word count example

MapReduce Techs

[Diagram: Java MapReduce, Pig, Hive, Cascading and others, all on top of Hadoop]

The promise of Cascading

[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.

[2] Extensions to MANY platforms

Cascading extensions:

NoSQL Databases: ✓ MongoDB ✓ Cassandra ✓ HBase ✓ Accumulo …
SQL Databases
Hadoop Filesystem
Local Filesystem
In-memory systems: ✓ Redis ✓ Memcached …
Search Platforms: ✓ ElasticSearch ✓ Solr …

How does it work?

A pipeline architecture where tuples flow through pipes

[Diagram: data enters through a source tap and flows through pipes]

[Diagram: Log files and Customer Data are combined into Log & Customer, which produces the Final Results]
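
The pipeline idea can be sketched without Cascading at all. The snippet below is only an analogy in plain Scala: `PipelineSketch`, the record types and the sample data are assumptions made for illustration, and ordinary collections stand in for taps and pipes.

```scala
object PipelineSketch {
  // Hypothetical records standing in for tuples read from two source taps.
  final case class LogLine(userId: Int, url: String)
  final case class Customer(userId: Int, name: String)

  val logs: List[LogLine] =
    List(LogLine(1, "/home"), LogLine(2, "/cart"), LogLine(1, "/pay"))
  val customers: List[Customer] =
    List(Customer(1, "Ann"), Customer(2, "Bob"))

  // "Log & Customer": join the two streams on userId.
  def logAndCustomer(ls: List[LogLine], cs: List[Customer]): List[(String, String)] = {
    val nameById = cs.map(c => c.userId -> c.name).toMap
    ls.flatMap(l => nameById.get(l.userId).map(name => (name, l.url)))
  }

  // "Final Results": count page views per customer.
  def finalResults(joined: List[(String, String)]): Map[String, Int] =
    joined.groupBy(_._1).map { case (name, rows) => name -> rows.size }

  def main(args: Array[String]): Unit =
    println(finalResults(logAndCustomer(logs, customers)))
}
```

Each function plays the role of one pipe segment: it takes a stream of tuples and hands a new stream to the next stage.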

Cascading Example

Word count in Cascading

    // Package names as in Cascading 1.x
    import java.util.Properties;
    import cascading.flow.*;
    import cascading.operation.*;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.*;
    import cascading.scheme.*;
    import cascading.tap.*;
    import cascading.tuple.Fields;

    public class WordCount {
        public static void main(String[] args) {
            Properties properties = new Properties();
            FlowConnector.setApplicationJarClass(properties, WordCount.class);
            // Source reads lines of text; sink writes (word, count) pairs
            Scheme sourceScheme = new TextLine(new Fields("line"));
            Scheme sinkScheme = new TextLine(new Fields("word", "count"));
            Tap source = new Hfs(sourceScheme, args[0]);
            Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
            Pipe assembly = new Pipe("Word Count");
            // Split each line into words
            String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
            Function function = new RegexGenerator(new Fields("word"), regex);
            assembly = new Each(assembly, new Fields("line"), function);
            // Group by word and count each group
            assembly = new GroupBy(assembly, new Fields("word"));
            Aggregator count = new Count(new Fields("count"));
            assembly = new Every(assembly, count);
            FlowConnector flowConnector = new FlowConnector(properties);
            Flow flow = flowConnector.connect("word-count", source, sink, assembly);
            flow.complete();
        }
    }

70% less boilerplate code

But still some infrastructure code

✓ No boilerplate code at all
✓ Functional
✓ Robust & Scalable
✓ Runs on the JVM

Here it comes :)

[Diagram: the same stack (Java MapReduce, Pig, Hive, Cascading and others on top of Hadoop), now with Scalding on top of Cascading]

The power of Scala on top of Cascading

Scala fits naturally with data

Word count in Scalding

    import com.twitter.scalding._

    class WordCountJob(args : Args) extends Job(args) {
      TextLine("input.txt").read
        // Map phase
        .flatMap('line -> 'word) { line : String => line.split("\\s+") }
        // Reduce phase
        .groupBy('word) { _.size }
        .write( Tsv("results.tsv") )
    }

Code that developers enjoy writing :)
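
To see why the two calls correspond to the two phases, the same computation can be written against plain Scala collections. This is a sketch for intuition only: `WordCountSketch` and `countWords` are hypothetical names and no Scalding classes are involved.

```scala
object WordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      // Map phase: each line is expanded into zero or more words.
      .flatMap(line => line.split("\\s+").filter(_.nonEmpty).toSeq)
      // Reduce phase: words are grouped by key and each group is sized.
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The flatMap step looks at one line at a time, so it can run independently on every input split; the grouping step needs all occurrences of a word together, which is what the shuffle and reduce provide.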

Who is using it?

Many many others…

Scalding…
…was open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading

Core Concepts

Sources & Sinks

    Tsv("data.tsv", ('productID, 'price, 'quantity))
      .read
      .write(UnpackedAvroSource("data.avro"))

✓ Tsv ✓ Csv ✓ Osv ✓ Avro ✓ Parquet ✓ …

Map Operations

    pipe1.filter('age) { age: Int => age > 18 }
    pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
    pipe1.project('name, 'surname)

15 map operations translated into map phases
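
These operations work on one row at a time, which is why they fit into a map phase. A plain-Scala analogy follows; the `Person` record and the `MapOpsSketch` object are assumptions for illustration and are not part of Scalding.

```scala
object MapOpsSketch {
  // Hypothetical record; the field names mirror the symbols on the slide.
  final case class Person(name: String, surname: String, age: Int, price: Double)

  // filter('age): keep only rows whose age is above 18.
  def adults(rows: List[Person]): List[Person] = rows.filter(_.age > 18)

  // map('price -> 'withVAT): derive a new value from an existing field.
  def withVAT(rows: List[Person]): List[(Person, Double)] =
    rows.map(p => (p, p.price * 1.2))

  // project('name, 'surname): keep only the listed fields.
  def names(rows: List[Person]): List[(String, String)] =
    rows.map(p => (p.name, p.surname))
}
```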

Join operations

    pipe1.joinWithSmaller('productId -> 'productId, pipe2)
    pipe1.joinWithLarger('productId -> 'productId, pipe2)
    pipe1.joinWithTiny('productId -> 'productId, pipe2)

Optimize by hinting the relative sizes

Supports Left, Right, Inner, Outer Joins

    pipe1
      .joinWithSmaller('productId -> 'productId, pipe2,
        joiner = new LeftJoin)
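
One reason the size hint can matter is that the smaller side is the one that is cheap to hold in memory. The sketch below shows the difference between an inner and a left join in plain Scala; it is not Scalding's implementation, `JoinSketch` and its record types are hypothetical, and it assumes `productId` is unique on the product side.

```scala
object JoinSketch {
  final case class Sale(productId: Int, quantity: Int)
  final case class Product(productId: Int, title: String)

  // Inner join: sales without a matching product are dropped.
  def innerJoin(sales: List[Sale], products: List[Product]): List[(Sale, Product)] = {
    val byId = products.map(p => p.productId -> p).toMap // small side held in memory
    sales.flatMap(s => byId.get(s.productId).map(p => (s, p)))
  }

  // Left join: every sale is kept; a missing product shows up as None.
  def leftJoin(sales: List[Sale], products: List[Product]): List[(Sale, Option[Product])] = {
    val byId = products.map(p => p.productId -> p).toMap
    sales.map(s => (s, byId.get(s.productId)))
  }
}
```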

Group operations

    val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .write(Tsv("results.tsv"))

.groupBy   Group by particular fields

.groupAll  Group all data
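
The difference between the two groupings can be sketched with plain collections. `GroupSketch` and `Sale` are hypothetical names used only for this illustration.

```scala
object GroupSketch {
  final case class Sale(shopId: Int, itemId: Int, quantity: Long)

  // Like groupBy('shopId) { _.sum[Long]('quantity -> 'totalSoldItems) }:
  // one total per shop.
  def totalSoldPerShop(sales: List[Sale]): Map[Int, Long] =
    sales.groupBy(_.shopId).map { case (shop, rows) => shop -> rows.map(_.quantity).sum }

  // Like groupAll: a single group that contains every row.
  def totalSold(sales: List[Sale]): Long = sales.map(_.quantity).sum
}
```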

Pipe operations

    val p = (pipe1 ++ pipe2)          // Concatenate 2 pipes
      .debug                          // Print sample data to screen
      .addTrap(Tsv("bogus_lines"))    // Dirty data are recorded

Simple pipe operations
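
The idea behind a trap is that rows which cannot be processed are set aside instead of failing the whole job. The following plain-Scala sketch mirrors that idea only; `PipeOpsSketch`, the "name,quantity" line format and the parsing rule are assumptions, not Scalding behaviour.

```scala
import scala.util.Try

object PipeOpsSketch {
  // Parse "name,quantity" lines; Left carries the rejected input line.
  def parse(line: String): Either[String, (String, Int)] =
    line.split(",") match {
      case Array(name, qty) =>
        Try(qty.trim.toInt).toOption.map(q => (name, q)).toRight(line)
      case _ => Left(line)
    }

  // Returns (good rows, trapped lines).
  def run(a: List[String], b: List[String]): (List[(String, Int)], List[String]) = {
    val all = a ++ b // concatenate two inputs
    val parsed = all.map(parse)
    val good = parsed.collect { case Right(row) => row }
    val trapped = parsed.collect { case Left(bad) => bad }
    (good, trapped)
  }
}
```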

Connect with external systems

Scalding + Hive

    class HiveExample(args: Args) extends Job(args) {

      val USER_SCHEMA = List('userId, 'username, 'photo)

      HiveSource("myHiveTable", SinkMode.KEEP)
        .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
        .write(Tsv("outputFromHive"))
    }

Define the schema

Query HCatalog

Read directly from HDFS

Scalding + ElasticSearch

    val schema = List('number, 'product, 'description)

    val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
      .read
      .write(Tsv("data/es-out.tsv"))

    val writeES = Tsv("data.tsv")
      .read
      .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))

Read from ElasticSearch in one line! Also index new data in ES

Design patterns

✓ Dependency Injection
✓ Late bound
✓ External Operations

How about defining external operations?

    val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
      .read
      .ETLOmnitureData
      .calculateOmnitureUserStats
      .joinWithCustomerDB('userId -> 'userId, customerPipe)
      .write(Tsv("omniture-results.tsv"))

Custom operations:
✓ Re-usable modular code
✓ Single responsibility
✓ Testability

Full code: http://bit.ly/1pNSUKf
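
One common way to get this kind of chained, named operation in Scala is an implicit wrapper class that adds methods to an existing type. The sketch below applies that pattern to a plain `List` rather than a Cascading pipe; `ExternalOpsSketch`, `Hit` and both operations are hypothetical and are not taken from the linked code.

```scala
object ExternalOpsSketch {
  final case class Hit(userId: Int, url: String)

  // Extension methods bundle a reusable step behind a descriptive name.
  implicit class HitOps(hits: List[Hit]) {
    // Hypothetical cleaning step: drop rows with an empty URL.
    def dropEmptyUrls: List[Hit] = hits.filter(_.url.nonEmpty)

    // Hypothetical statistics step: number of hits per user.
    def hitsPerUser: Map[Int, Int] =
      hits.groupBy(_.userId).map { case (user, rows) => user -> rows.size }
  }

  // The call site reads as a chain of business-level operations.
  def stats(raw: List[Hit]): Map[Int, Int] = raw.dropEmptyUrls.hitsPerUser
}
```

Because each operation is an ordinary method with one job, it can be unit tested on a small in-memory list.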

Scalding Testing

Testing challenges in the context of MR

Acceptance Tests

Unit – Component Tests

System Tests

Integration Tests

Scalding enables testing in every layer

example

    import com.twitter.scalding._
    import org.scalatest.FlatSpec
    import org.scalatest.matchers.ShouldMatchers

    class TsvWordCountJobTest extends FlatSpec
      with ShouldMatchers with TupleConversions {

      "WordCountJob" should "count words" in {
        JobTest(new WordCountJob(_))
          .arg("input", "inFile")
          .arg("output", "outFile")
          // In-memory input: (offset, line) pairs
          .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
          .sink[(String, Int)](Tsv("outFile")) { out =>
            out.toList should contain ("cool" -> 2)
          }
          .run
          .finish
      }
    }

Replaces taps with in-memory collections and asserts the expected output
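
The mechanism can be sketched in plain Scala: if a job only talks to source and sink interfaces, a test can plug in collection-backed versions. Everything in this snippet (`InMemoryTapSketch`, `Source`, `Sink`, `ListSource`, `BufferSink`) is a hypothetical stand-in and not the JobTest implementation.

```scala
import scala.collection.mutable.ListBuffer

object InMemoryTapSketch {
  trait Source { def read(): Iterator[String] }
  trait Sink { def write(row: (String, Int)): Unit }

  // The job only sees the Source/Sink interfaces, never files directly.
  def wordCountJob(in: Source, out: Sink): Unit =
    in.read()
      .flatMap(_.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .toList
      .groupBy(identity)
      .foreach { case (word, ws) => out.write(word -> ws.size) }

  // In-memory stand-ins used only by tests.
  final class ListSource(lines: List[String]) extends Source {
    def read(): Iterator[String] = lines.iterator
  }

  final class BufferSink extends Sink {
    val rows: ListBuffer[(String, Int)] = ListBuffer.empty
    def write(row: (String, Int)): Unit = { rows += row; () }
  }
}
```

A test then builds a `ListSource` with a few lines, runs the job, and asserts on the rows collected by the `BufferSink`.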

Monitoring

“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”

http://driven.cascading.io

Collects telemetry data and exposes it through a web UI

Advanced Concepts

Scalding adds
▪ Typed API
▪ Matrix API
▪ Graphs
▪ Machine Learning Algorithms

What does the future look like?

So far…

Real Time, Batch, Hybrid

Summingbird

A unified API for everything

Storm, Tez, Spark

Enables the Lambda architecture

Questions?
