MapReduce with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup
Scalding.io
$ whoami
http://scalding.io
http://github.com/scalding-io
@chalkiopoulos
My recent achievement..
What are we gonna talk about..?
A Scala API on top of Cascading
But what is Cascading?
A few years ago I started on a fresh Big Data team…
Story!!
How do we efficiently develop MapReduce jobs for our new hadoop cluster ?
??
MapReduce Techs
Java MapReduce
Hadoop
abstraction
Java MapReduce Word count example
MapReduce Techs
Java MapReduce
Pig, Hive
Hadoop
Cascading, others
abstraction
The promise of Cascading
[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.
[2] Extensions to MANY platforms
Cascading
NoSQL databases: ✓ MongoDB ✓ Cassandra ✓ HBase ✓ Accumulo …
SQL databases
Hadoop filesystem
Local filesystem
In-memory systems: ✓ Redis ✓ Memcached …
Search platforms: ✓ ElasticSearch ✓ Solr …
How does it work?
A pipeline architecture
where tuples flow through pipes
Source tap → data → Sink tap
[Diagram labels: Log files, Customer Data, Log & Customer, Results, Final Results]
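The pipeline idea above (a source tap feeding tuples through pipes into a sink tap) can be illustrated without Cascading at all. The following is a minimal sketch using only plain Scala collections; it is not Cascading's API, and every name in it (`PipelineSketch`, `sourceTap`, `dropBlank`, `addUpper`, `sinkTap`) is hypothetical and chosen only for illustration.

```scala
object PipelineSketch {
  // Assumption: a "tuple" is modelled as a map from field name to value.
  type Tuple = Map[String, String]
  // Assumption: a "pipe" is modelled as a function over a stream of tuples.
  type Pipe = Iterator[Tuple] => Iterator[Tuple]

  // Source tap: turns raw lines into tuples with a single "line" field.
  def sourceTap(lines: Seq[String]): Iterator[Tuple] =
    lines.iterator.map(l => Map("line" -> l))

  // Two small pipes: drop blank lines, then add an upper-cased field.
  val dropBlank: Pipe = _.filter(t => t("line").trim.nonEmpty)
  val addUpper: Pipe = _.map(t => t + ("upper" -> t("line").toUpperCase))

  // Sink tap: collects whatever reaches the end of the pipeline.
  def sinkTap(tuples: Iterator[Tuple]): List[Tuple] = tuples.toList

  // Wire source -> pipes -> sink.
  def run(lines: Seq[String]): List[Tuple] =
    sinkTap((dropBlank andThen addUpper)(sourceTap(lines)))
}
```

Because each pipe is just a function, pipes compose with `andThen`, which is roughly the property that lets a framework plan a whole flow before running it.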
Cascading Example
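Word count in Cascading
    // Imports assume the Cascading 1.x package layout.
    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.Aggregator;
    import cascading.operation.Function;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.Scheme;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;
    import java.util.Properties;

    public class WordCount {
      public static void main(String[] args) {
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, WordCount.class);
        Scheme sourceScheme = new TextLine(new Fields("line"));
        Scheme sinkScheme = new TextLine(new Fields("word", "count"));
        Tap source = new Hfs(sourceScheme, args[0]);                 // input path
        Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);   // output path, overwritten
        Pipe assembly = new Pipe("Word Count");
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";    // matches whole words
        Function function = new RegexGenerator(new Fields("word"), regex);
        assembly = new Each(assembly, new Fields("line"), function); // line -> words
        assembly = new GroupBy(assembly, new Fields("word"));
        Aggregator count = new Count(new Fields("count"));
        assembly = new Every(assembly, count);                       // count per word
        FlowConnector flowConnector = new FlowConnector(properties);
        Flow flow = flowConnector.connect("word-count", source, sink, assembly);
        flow.complete();
      }
    }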
70% less boilerplate code
But still some infrastructure code
✓ No boilerplate code at all
✓ Functional
✓ Robust & scalable
✓ Runs on the JVM
Here it comes :)
Java MapReduce
Pig, Hive
Hadoop
Cascading, others
Scalding
abstraction
The power of Scala on top of Cascading
Scala fits naturally with data
Word count in Scalding
    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine("input.txt").read
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }  // Map phase
        .groupBy('word) { _.size }                                       // Reduce phase
        .write(Tsv("results.tsv"))
    }
Code that developers enjoy writing :)
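To make the map phase and reduce phase of the word count concrete, here is a minimal in-memory sketch using only Scala's standard collections. It is not the Scalding API and runs on a single machine; the names `WordCountSketch`, `mapPhase` and `reducePhase` are hypothetical.

```scala
object WordCountSketch {
  // Map phase: split every line into words (the role played by flatMap in the job).
  def mapPhase(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

  // Reduce phase: group identical words and count each group
  // (the role played by groupBy with a size aggregation in the job).
  def reducePhase(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(mapPhase(lines))
}
```

The shape is the same as in the job: a per-record transformation followed by a per-key aggregation. The framework's contribution is distributing those two steps across a cluster.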
Who is using it?
Many many others…
Scalding…
…was open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading
Core Concepts
Sources & Sinks
    Tsv("data.tsv", ('productID, 'price, 'quantity))
      .read
      .write(UnpackedAvroSource("data.avro"))
✓ Tsv ✓ Csv ✓ Osv ✓ Avro ✓ Parquet ✓ …
Map Operations
    pipe1.filter('age) { age: Int => age > 18 }
    pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
    pipe1.project('name, 'surname)
15 map operations, translated into map phases
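The three map operations above have direct analogues on ordinary Scala collections. The sketch below is an illustration only, not Scalding code; the `Person` case class and the method names are hypothetical, and rows are modelled as case classes or tuples rather than Cascading tuples.

```scala
object MapOpsSketch {
  final case class Person(name: String, surname: String, age: Int)

  // filter: keep only the rows that satisfy a predicate.
  def adults(people: Seq[Person]): Seq[Person] =
    people.filter(_.age > 18)

  // map: derive a new value from an existing one, keeping the original next to it.
  def withVat(prices: Seq[Double]): Seq[(Double, Double)] =
    prices.map(price => (price, price * 1.2))

  // project: keep only the selected fields of each row.
  def project(people: Seq[Person]): Seq[(String, String)] =
    people.map(p => (p.name, p.surname))
}
```

None of these operations needs to see more than one row at a time, which is why they can all run inside a map phase.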
Join operations
    pipe1.joinWithSmaller('productId -> 'productId, pipe2)
    pipe1.joinWithLarger('productId -> 'productId, pipe2)
    pipe1.joinWithTiny('productId -> 'productId, pipe2)
Optimize by hinting the relative sizes
Supports Left, Right, Inner, Outer Joins
    pipe1
      .joinWithSmaller('productId -> 'productId, pipe2, joiner = new LeftJoin)
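One reason the relative sizes matter: when one side is small enough to hold in memory, a join can be done as a simple lookup per row of the large side. The sketch below shows that idea for an inner and a left join on plain Scala collections. It is an in-memory illustration under that assumption, not how Scalding or Cascading implement joins, and `JoinSketch`, `innerJoin` and `leftJoin` are hypothetical names.

```scala
object JoinSketch {
  // Inner join: keep only rows whose key exists on both sides.
  def innerJoin[K, A, B](large: Seq[(K, A)], small: Seq[(K, B)]): Seq[(K, A, B)] = {
    val lookup: Map[K, Seq[(K, B)]] = small.groupBy(_._1) // small side held in memory
    large.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[(K, B)]).map { case (_, b) => (k, a, b) }
    }
  }

  // Left join: keep every row of the large side; a missing match becomes None.
  def leftJoin[K, A, B](large: Seq[(K, A)], small: Seq[(K, B)]): Seq[(K, A, Option[B])] = {
    val lookup: Map[K, Seq[(K, B)]] = small.groupBy(_._1)
    large.flatMap { case (k, a) =>
      lookup.get(k) match {
        case Some(rows) => rows.map { case (_, b) => (k, a, Option(b)) }
        case None       => Seq((k, a, Option.empty[B]))
      }
    }
  }
}
```

For example, joining orders against a small product table keeps unmatched orders only in the left-join variant, with `None` in place of the product.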
Group operations
    val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .write(Tsv("results.tsv"))
.groupBy: group by particular fields
.groupAll: group all data
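The difference between grouping by a field and grouping all data can be shown with a small in-memory analogue. This is a sketch on plain Scala collections, not the Scalding API; the `Sale` case class and the method names are hypothetical.

```scala
object GroupSketch {
  final case class Sale(shopId: String, itemId: String, quantity: Long)

  // Analogue of grouping by shopId and summing quantity: one total per shop.
  def totalSoldItemsPerShop(sales: Seq[Sale]): Map[String, Long] =
    sales.groupBy(_.shopId).map { case (shop, rows) => shop -> rows.map(_.quantity).sum }

  // Analogue of grouping all data: a single total over every row.
  def totalSoldItems(sales: Seq[Sale]): Long =
    sales.map(_.quantity).sum
}
```

Grouping by a key lets each key's rows be aggregated independently, while grouping everything funnels all rows into one aggregation, so the second form parallelizes less well on large inputs.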
Pipe operations
    val p = (pipe1 ++ pipe2)          // Concatenate 2 pipes
      .debug                          // Print sample data to screen
      .addTrap(Tsv("bogus_lines"))    // Dirty data are recorded
Simple pipe operations
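The trap idea (dirty records are set aside instead of failing the whole job) can be sketched in plain Scala as splitting the input into parsed rows and rejected rows. This is an illustration under stated assumptions, not the Scalding API: parsing a line to an `Int` stands in for whatever operation might fail, and `PipeOpsSketch`, `concat` and `parseWithTrap` are hypothetical names.

```scala
import scala.util.Try

object PipeOpsSketch {
  // Concatenation: rows of both inputs, one after the other.
  def concat[A](first: Seq[A], second: Seq[A]): Seq[A] = first ++ second

  // Rows that fail to parse go to the "trap"; the rest continue down the pipeline.
  // Returns (clean rows, trapped raw lines).
  def parseWithTrap(lines: Seq[String]): (Seq[Int], Seq[String]) = {
    val attempts = lines.map(line => line -> Try(line.trim.toInt).toOption)
    val clean = attempts.collect { case (_, Some(n)) => n }
    val trapped = attempts.collect { case (line, None) => line }
    (clean, trapped)
  }
}
```

Keeping the rejected lines in their raw form makes it possible to inspect or reprocess them later.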
Connect with external systems
Scalding + Hive
    class HiveExample(args: Args) extends Job(args) {
      val USER_SCHEMA = List('userId, 'username, 'photo)
      HiveSource("myHiveTable", SinkMode.KEEP)
        .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
        .write(Tsv("outputFromHive"))
    }
Define the schema
Query HCatalog
Read directly from HDFS
Scalding + ElasticSearch
    val schema = List('number, 'product, 'description)
    val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema).read.write(Tsv("data/es-out.tsv"))
    val writeES = Tsv("data.tsv").read.write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))
Read from ElasticSearch in one line!
Also index new data in ES
Design patterns
✓ Dependency Injection ✓ Late bound ✓ External Operations
How about defining external operations?
    val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
      .read
      .ETLOmnitureData
      .calculateOmnitureUserStats
      .joinWithCustomerDB('userId -> 'userId, customerPipe)
      .write(Tsv("omniture-results.tsv"))
Custom operations: ✓ Re-usable modular code ✓ Single responsibility ✓ Testability
Full-code http://bit.ly/1pNSUKf
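One way to get chainable custom operations of this kind in Scala is an implicit wrapper class that adds domain-specific methods to an existing type. The sketch below applies the pattern to a plain `Seq` rather than to a Scalding pipe, so it only illustrates the language mechanism; it is not the code behind the link, and `Event`, `EventOps`, `dropInternalPages` and `visitsPerUser` are hypothetical names.

```scala
object ExternalOpsSketch {
  final case class Event(userId: String, page: String)

  // The wrapper adds named, reusable operations to any Seq[Event].
  implicit class EventOps(val events: Seq[Event]) {
    // Single responsibility: cleaning only.
    def dropInternalPages: Seq[Event] =
      events.filterNot(_.page.startsWith("/internal"))

    // Single responsibility: aggregation only.
    def visitsPerUser: Map[String, Int] =
      events.groupBy(_.userId).map { case (user, visits) => user -> visits.size }
  }

  // The operations chain like built-in ones.
  def userStats(events: Seq[Event]): Map[String, Int] =
    events.dropInternalPages.visitsPerUser
}
```

Each operation can be unit tested on its own with a handful of in-memory rows, which is where the testability benefit comes from.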
Scalding Testing
Testing challenges in the context of MR
Acceptance Tests
Unit – Component Tests
System Tests
Integration Tests
Scalding enables testing in every layer & TDD
Example
    import com.twitter.scalding._
    import org.scalatest.FlatSpec
    import org.scalatest.matchers.ShouldMatchers

    class TsvWordCountJobTest extends FlatSpec
        with ShouldMatchers with TupleConversions {

      "WordCountJob" should "count words" in {
        JobTest(new WordCountJob(_))
          .arg("input", "inFile")
          .arg("output", "outFile")
          .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
          .sink[(String, Int)](Tsv("outFile")) { out =>
            out.toList should contain ("cool" -> 2)
          }
          .run
          .finish
      }
    }
Replaces taps with in-memory collections and asserts the expected output
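The principle behind this style of test, swapping the real input and output for in-memory collections, can be sketched without any framework. The following is an assumption-laden illustration in plain Scala, not how `JobTest` is implemented: the job logic takes its "taps" as function parameters, and `InMemoryTestSketch`, `wordCountJob` and `runInMemory` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

object InMemoryTestSketch {
  // The job only knows how to read lines and write rows; it does not know
  // whether those come from files or from memory.
  def wordCountJob(readLines: () => Seq[String], write: Seq[(String, Int)] => Unit): Unit = {
    val counts = readLines()
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy(_._1) // deterministic order makes assertions simpler
    write(counts)
  }

  // Test harness: feed an in-memory source, capture what reaches the sink.
  def runInMemory(input: Seq[String]): List[(String, Int)] = {
    val captured = ArrayBuffer.empty[(String, Int)]
    wordCountJob(() => input, rows => { captured ++= rows; () })
    captured.toList
  }
}
```

A test then reduces to calling `runInMemory` with a few lines and asserting on the returned rows, with no cluster or filesystem involved.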
Monitoring
“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”
http://driven.cascading.io
Collects telemetry data and exposes it through a web UI
Advanced Concepts
Scalding adds
• Typed API
• Matrix API
• Graphs
• Machine Learning Algorithms
What does the future look like?
So far…
Real Time, Batch, Hybrid
abstraction
Summingbird
A unified API for everything
Storm, Tez, Spark
Enables the Lambda architecture
Questions?