Small intro to Big Data

Small intro to Big DataMichał Matłoka @mmatloka

Outline1. What is Big Data?2. Storing3. Batch & Streams processing4. Resource Managers5. Machine Learning6. Analysis & Visualization7. Other

What is Big Data?● Volume● Velocity● Variety

How big is Big?

ThoughtWorks: Big Data envy

Storing

CAP theorem(Brewer’s theorem)

In distributed system you can only

have two of three guarantees:

● Consistency

● Availability

● Partition Tolerance

Relational scaling(horizontal)

Example limitations:

● Max 48 nodes

● Read-only nodes

● Cross-shard joins…

● Auto-increments

● Distributed transactions,

possible, but…

It can work!

You don’t always need ACID

BASE might be enough

NoSQL(Not only SQL)

● Key-value (Redis, Dynamo, ...)

● Column (Cassandra, HBase, ...)

● Document (MongoDB, … )

● Graph (Neo4J, …

● Multi-model (OrientDB, …)

Apple - 115k Cassandra nodes with

over 10PB of data!

Source: http://blog.nahurst.com/visual-guide-to-nosql-systems

Batch Processing● Processes data from 1 or more

sources from bigger period of

time (e.g. day, month)

● Source: db, Apache Parquet, ...

● Not real-time

● Can take hours or more

Apache Hadoop● Based on Google paper

● First release in 2006!

● Map -> (shuffle) -> Reduce

● Was the beginning of many

projectsMapReduce

public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

Hadoop Wordcount - part I

Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,

Context context

) throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}

}

Hadoop Wordcount - part II


public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount.class);

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);

job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);

}

Hadoop Wordcount - part III


Apache SparkRDD (Resilient Distributed

Dataset)

DAG (Directed acyclic graph)

● RDD - map, filter, count etc

● Spark SQL

● MLib

● GraphX

● Spark Streaming

● API: Scala, Java, Python, R*

val textFile = sc.textFile("hdfs://...")

val counts = textFile

.flatMap(line => line.split(" "))

.map(word => (word, 1))

.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

Spark Wordcount

Source: http://spark.apache.org/examples.html

Apache FlinkDAGs with iterations

● Batch & native streaming

● FlinkML

● Table API & SQL (Beta)

● Gelly - graph analytics

● FlinkCEP - detect patterns in

data streams

● Compatible with Apache

Hadoop and Apache Storm APIs

● API: Scala, Java, Python*

Stream processing

● Near real-time/real time

● Processing (usually) does not

end

● Source: files, Apache Kafka,

Socket, Akka Actors, Twitter,

RabbitMQ etc

● Event time vs processing time

● Windows - fixed, sliding, session

● Watermarks

● State

Stream processing

● Native/micro-batch

● Latency

● Throughput

● Delivery guarantees

● Resources managers

● API - compositional/declarative

● Maturity

Differences

Stream processing● Apache Storm

● Apache Storm Trident

● Alibaba JStorm

● Twitter Heron

● Apache Spark Streaming

● Apache Flink

● Apache Beam

● Apache Kafka Streams

● Apache Samza

● Apache Gearpump

● Apache Apex

● Apache Ignite Streaming

● Apache S4

● ...

Resource management

● Apache YARN

● Apache Mesos (1.0.1!)

● Apache Slider - deploy existing

apps on YARN

● Apache Myriad - YARN on

Mesos

● DC/OS

Source: https://docs.mesosphere.com/wp-content/uploads/2016/04/[email protected]

AnalysisSQL Engines & Querying

● Apache Hive

● Apache Pig

● Apache HAWQ

● Apache Impala

● Apache Phoenix

● Apache Spark SQL

● Apache Drill

● Facebook Presto

● ...

Machine Learning

● Apache Mahout

● Apache Samoa

● Spark MLib

● FlinkML

● H2O

● TensorFlow

Notebooks ● IPython

● Jupyter

● Apache Zeppelin

Source: https://zeppelin.apache.org/assets/themes/zeppelin/img/notebook.png

Hadoop-related

● Apache Sqoop

● Apache Flume

● Apache Oozie

● Hue

● Apache HDFS

● Apache Ambari

● Apache Knox

● Apache ZooKeeper

Awesome Big Data https://github.com/onurakpolat

/awesome-bigdata

Conclusions● There is a lot of it!

● https://pixelastic.github.io/pokem

onorbigdata/

● If you want to learn, start with

SMACK stack (Spark, Mesos,

Akka, Cassandra, Kafka)

https://pixelastic.github.io/pokemonorbigdata/



Articles & references● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-t

echnologies/

● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-

frameworks-part-1

● https://dcos.io/

● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn

● http://spark.apache.org/

● https://flink.apache.org/

● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-

big-data-analytics-frameworks

● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing

● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/



http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1



https://dcos.io/

https://dcos.io/

https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn

https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn

http://spark.apache.org/

http://spark.apache.org/

https://flink.apache.org/

https://flink.apache.org/

http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-big-data-analytics-frameworks



http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing

http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing

https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101




Thank you, Q&A?@mmatloka

http://www.slideshare.net/softwaremillhttps://softwaremill.com/blog/

http://www.slideshare.net/softwaremill

http://www.slideshare.net/softwaremill

https://softwaremill.com/blog/

https://softwaremill.com/blog/

Small intro to Big Data

Software

Transcript of Small intro to Big Data