MODELOS DE PROGRAMACIÓN BIG DATA: BATCH RESILIENT DISTRIBUTED DATASETS


Transcript of MODELOS DE PROGRAMACIÓN BIG DATA: BATCH RESILIENT DISTRIBUTED DATASETS

Page 1

DAVID MARTINEZ REGO

Asociación Española para la Inteligencia Artificial (AEPIA)

MASTER DE INVESTIGACIÓN EN INTELIGENCIA ARTIFICIAL

MODELOS DE PROGRAMACIÓN BIG DATA: BATCH RESILIENT DISTRIBUTED DATASETS

Page 2

Recall

• The Map/Reduce model, introduced by Google and implemented as open source by Apache Hadoop, makes it possible to distribute computation over large datasets.

• The computation must be expressed as a combination of map and reduce transformations.

• The programmer only implements the data transformations/groupings they want to perform; the framework takes care of executing and distributing the set of operations reliably across a cluster of machines (see the sketch below).
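To make the map/reduce contract concrete, here is a minimal word-count sketch in plain Python (a hypothetical mapper and reducer, not the actual Hadoop Java API):

# mapper: one input record -> zero or more (key, value) pairs
def mapper(line):
    for word in line.split():
        yield (word, 1)

# reducer: one key plus all of its values (grouped by the framework) -> output pairs
def reducer(word, counts):
    yield (word, sum(counts))

The framework runs mapper over the input in parallel, groups the emitted pairs by key, and then runs reducer once per key.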

Page 3

Limitations of Hadoop

• Hadoop was a step forward in the processing of large datasets

• Separation of business logic from execution logic

• Unified management of many general-purpose computers to build a cluster of machines

Page 4

Limitations of Hadoop

• Hadoop has certain limitations that stem from its original design:

• A very limited set of primitives (map/reduce).

• Job optimization is the programmer's responsibility.

• Composing several computation phases becomes very laborious for algorithms of medium/high complexity.

• Excessive use of disk to mitigate node failures.

Page 5

Hadoop: Example

• Looking at the K-means code in Mahout is enough to get an idea of how complex it is to build even a process of modest complexity in Hadoop

• https://github.com/apache/mahout/tree/master/mr/src/main/java/org/apache/mahout/clustering/kmeans

• https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/clustering/iterator/ClusterIterator.java

• https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/clustering/iterator/CIMapper.java

• https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/clustering/iterator/CIReducer.java

Page 6

Idea 1: adapt to the technology

• To understand Hadoop's design, it has to be placed in its historical context

• Memory and CPU were expensive

• Machine failures were very common

• Disk is used as the safety net for handling failures: all intermediate results go through disk

[Figure: "A Brief History: MapReduce" — storage cost and performance trends (Rich Freitas, IBM Research); meanwhile, spinning disks haven't changed all that much. Sources: pistoncloud.com/2013/04/storage-and-the-mobility-gap/ and storagenewsletter.com/rubriques/hard-disk-drives/hdd-technology-trends-ibm/]

Page 7

Idea 1: adapt to the technology

• With the price of memory and compute falling much faster than the effectiveness of disk access:

• It is much more effective to keep intermediate results in memory

• In case of failure, recompute instead of reading back from disk

• If the trade-off between investment in memory and CPU versus machine failures keeps evolving favourably, these design principles would lead to a much more effective system.


Page 8

Idea 2: think functional!

• Most Big Data computations can be designed without side effects.

• The reason Hadoop was running into limitations in practical use lies in its limited set of primitives and the difficulty of composing operations.

Page 9

Idea 2: think functional!

[Figure: MapReduce, as general batch processing, spawned a family of specialized systems for iterative, interactive, streaming and graph workloads: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4]

The State of Spark, and Where We're Going Next — Matei Zaharia, Spark Summit (2013), youtu.be/nU6vO2EJAb4

Page 10

Idea 2: think functional!

• It would be much better to build a functional framework with a richer set of primitives: flatMap, join, …

• These primitives can be composed into more complex programs forming a directed (acyclic) graph

• A composition of operations is only executed when strictly necessary (lazy evaluation), as the sketch below illustrates.

• Because the whole pipeline is defined by the time execution starts, the framework is in a much better position to apply optimizations that yield an equivalent pipeline.
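A minimal sketch of this lazy composition in PySpark (assuming a SparkContext named sc and a hypothetical input file events.txt):

lines = sc.textFile("events.txt")              # transformation: nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)  # transformation: still lazy
codes = errors.map(lambda l: l.split()[0])     # transformation: still lazy

first_codes = codes.take(10)  # only this action triggers execution of the pipeline,
                              # which the framework can first plan as a DAG of stages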

Page 11

Resilient Distributed Datasets

• This is a Resilient Distributed Dataset (RDD)!

• An immutable collection of data kept in the distributed memory of a cluster, with a rich set of functional operations

• Transformations are evaluated lazily, so the framework can apply optimizations.

• Fault tolerance is handled by keeping the lineage of every operation and recomputing in case of failure

• It is the main abstraction behind Apache Spark

A Brief History: Spark

[Timeline: 2002 MapReduce @ Google · 2004 MapReduce paper · 2006 Hadoop @ Yahoo! · 2008 Hadoop Summit · 2010 Spark paper · 2014 Apache Spark becomes a top-level Apache project]

Spark: Cluster Computing with Working Sets — Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. USENIX HotCloud (2010). people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing — Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI (2012). usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Page 12

Separation of logic and execution

[Figure: a Driver Program holding a SparkContext coordinates with a Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a cache that executes tasks]

spark.apache.org/docs/latest/cluster-overview.html

Spark Essentials: Master

Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., the local file system, Amazon S3, Hypertable, HBase, etc.

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /data/201404*)
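A quick, hypothetical sketch of this in PySpark (assuming a SparkContext named sc; the paths are made up):

logs = sc.textFile("hdfs:///data/logs/app.log")   # a single file on HDFS
april = sc.textFile("/data/201404*")              # a directory glob of files
nums = sc.parallelize([1, 2, 3, 4, 5])            # an in-memory collection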

Spark Essentials: RDD

[Figure: a chain of transformations produces new RDDs from existing ones; an action at the end returns a value to the driver]

Page 13

• Spark can create the initial RDD of a computation from any storage system in the Big Data ecosystem: files in HDFS, HBase, Cassandra, … and even from a simple in-memory collection

• There are two types of operations on an RDD: transformations (which produce another RDD) and actions (which produce a side effect)

• Transformations are only executed when an action of the pipeline needs to run.

Spark essentials

Page 14

Spark essentials

Example 3-5. Python textFile example
lines = sc.textFile("/path/to/README.md")

Example 3-6. Scala textFile example
val lines = sc.textFile("/path/to/README.md")

Example 3-7. Java textFile example
JavaRDD<String> lines = sc.textFile("/path/to/README.md");

RDD Operations
RDDs support two types of operations, transformations and actions. Transformations are operations on RDDs that return a new RDD, such as map and filter. Actions are operations that return a result back to the driver program or write it to storage, and kick off a computation, such as count and first. Spark treats transformations and actions very differently, so understanding which type of operation you are performing will be important. If you are ever confused whether a given function is a transformation or an action, you can look at its return type: transformations return RDDs whereas actions return some other data type.

Transformations
Transformations are operations on RDDs that return a new RDD. As discussed shortly in the lazy evaluation section, transformed RDDs are computed lazily, only when you use them in an action. Many transformations are element-wise, that is they work on one element at a time, but this is not true for all transformations.

As an example, suppose that we have a log file, log.txt, with a number of messages, and we want to select only the error messages. We can use the filter transformation seen before. This time though, we'll show a filter in all three of Spark's language APIs:

Example 3-8. Python filter example
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)

Example 3-9. Scala filter example
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))

Example 3-10. Java filter example
JavaRDD<String> inputRDD = sc.textFile("log.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(
  new Function<String, Boolean>() {
    public Boolean call(String x) { return x.contains("error"); }
  });

Note that the filter operation does not mutate the existing inputRDD. Instead, it returns a pointer to an entirely new RDD. inputRDD can still be re-used later in the program, for instance, to search for other words. In fact, let's use inputRDD again to search for lines with the word "warning" in them. Then, we'll use another transformation, union, to print out the number of lines that contained either "error" or "warning". We show Python here, but the union() function is identical in all three languages:

Example 3-11. Python union example
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)

union is a bit different than filter, in that it operates on two RDDs instead of one. Transformations can actually operate on any number of input RDDs.

A better way to accomplish the same result would be to simply filter the inputRDD once, looking for either "error" or "warning".

Finally, as you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost. Figure 3-1 shows a lineage graph for this example.


Figure 3-1. RDD lineage graph created during log analysis.

Actions
We've seen how to create RDDs from each other with transformations, but at some point, we'll want to actually do something with our dataset. Actions are the second type of RDD operation. They are the operations that return a final value to the driver program or write data to an external storage system. Actions force the evaluation of the transformations required for the RDD they are called on, since they are required to actually produce output.

Continuing the log example from the previous section, we might want to print out some information about the badLinesRDD. To do that, we'll use two actions, count(), which returns the count as a number, and take(), which collects a number of elements from the RDD.

Example 3-12. Python error count example using actions
print "Input had " + str(badLinesRDD.count()) + " concerning lines"
print "Here are 10 examples:"
for line in badLinesRDD.take(10):
    print line

Example 3-13. Scala error count example using actions
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)


Create an RDD

Transform an RDD

Act on an RDD

Page 15

Spark essentials


Logical lineage vs. execution

[Figure: the lineage graph records that inputRDD is filtered into errorsRDD and warningsRDD, whose union is badLinesRDD; only when the actions count and take are invoked is this pipeline actually executed]

Page 16

Set of operations — Spark Essentials: Transformations

transformation: description

map(func): return a new distributed dataset formed by passing each element of the source through a function func

filter(func): return a new dataset formed by selecting those elements of the source on which func returns true

flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)

sample(withReplacement, fraction, seed): sample a fraction of the data (given by fraction), with or without replacement, using a given random number generator seed

union(otherDataset): return a new dataset that contains the union of the elements in the source dataset and the argument

distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset
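A small PySpark illustration of a few of these transformations (assuming a SparkContext named sc; the data is made up):

rdd = sc.parallelize(["a b", "b c", "c d"])

words = rdd.flatMap(lambda s: s.split())     # "a", "b", "b", "c", "c", "d"
no_a = words.filter(lambda w: w != "a")      # drops "a"
unique = no_a.distinct()                     # "b", "c", "d"
both = unique.union(sc.parallelize(["e"]))   # "b", "c", "d", "e"

result = both.collect()   # action; the order of elements may vary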

Page 17

Grouping: Pair RDD

• In the introduction to map/reduce we saw that the system took care of running a parallel grouping step between the two phases

• That grouping is configurable and required a key to determine which elements belong to each group

• Spark uses the same principle internally through the Pair RDD abstraction

Creating Pair RDDs
There are a number of ways to get Pair RDDs in Spark. Many formats we explore loading from in Chapter 5 will directly return Pair RDDs for their key-value data. In other cases we have a regular RDD that we want to turn into a Pair RDD. To illustrate creating a Pair RDD we will key our data by the first word in each line of the input.

In Python, for the functions on keyed data to work we need to make sure our RDD consists of tuples.

Example 4-1. Python create Pair RDD using the first word as the key
input.map(lambda x: (x.split(" ")[0], x))

In Scala, for the functions on keyed data to be available, we simply need to return a tuple from our function. An implicit conversion on RDDs of tuples exists to provide the additional Key-Value functions.

Example 4-2. Scala create Pair RDD using the first word as the key
input.map(x => (x.split(" ")(0), x))

Java doesn't have a built-in tuple type, so Spark's Java API has users create tuples using the scala.Tuple2 class. This class is very simple: Java users can construct a new tuple by writing new Tuple2(elem1, elem2) and can then access the elements with the ._1() and ._2() methods.

Java users also need to call special versions of Spark's functions when creating Pair RDDs. For instance, the mapToPair function should be used in place of the basic map function. This is discussed in more detail in converting between RDD types, but let's look at a simple example below.

Example 4-3. Java create Pair RDD using the first word as the key
PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2(x.split(" ")[0], x);
    }
  };
JavaPairRDD<String, String> rdd = input.mapToPair(keyData);

When creating a Pair RDD from an in-memory collection in Scala and Python we only need to make sure the types of our data are correct, and call parallelize. To create a Pair RDD in Java from an in-memory collection we need to make sure our collection consists of tuples and also call SparkContext.parallelizePairs instead of SparkContext.parallelize.


Page 18

Groupings — Spark Essentials: Transformations

transformation: description

groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs

reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function

sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument

join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key

cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith

cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)

(In Hadoop, these groupings have to be implemented manually.)
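A brief PySpark illustration of two of these grouped operations (assuming a SparkContext named sc; the data is made up):

# word count with reduceByKey
lines = sc.parallelize(["to be or not to be"])
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))    # ("to", 2), ("be", 2), ...

# join by key
ages = sc.parallelize([("ana", 31), ("luis", 25)])
cities = sc.parallelize([("ana", "Valencia"), ("luis", "A Coruña")])
joined = ages.join(cities)    # ("ana", (31, "Valencia")), ("luis", (25, "A Coruña"))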

Page 19

Actions — Spark Essentials: Actions

action: description

reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel

collect(): return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data

count(): return the number of elements in the dataset

first(): return the first element of the dataset – similar to take(1)

take(n): return an array with the first n elements of the dataset – currently not executed in parallel, instead the driver program computes all the elements

takeSample(withReplacement, num, seed): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
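For example, a hypothetical PySpark sketch of reduce and takeSample (assuming a SparkContext named sc):

nums = sc.parallelize([1, 2, 3, 4, 5])

total = nums.reduce(lambda a, b: a + b)   # 15; the function must be commutative and associative
sample = nums.takeSample(False, 3, 42)    # 3 elements without replacement, fixed seed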

Page 20

Actions — Spark Essentials: Actions

action: description

saveAsTextFile(path): write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file

saveAsSequenceFile(path): write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)

countByKey(): only available on RDDs of type (K, V). Returns a Map of (K, Int) pairs with the count of each key

foreach(func): run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems
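And a hypothetical sketch of countByKey and saveAsTextFile (assuming a SparkContext named sc; the output path is made up):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

by_key = pairs.countByKey()           # {'a': 2, 'b': 1}, returned to the driver
pairs.saveAsTextFile("out/pairs")     # writes one text file per partition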

Page 21

Persistence — Spark Essentials: Persistence

storage level: description

MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY: store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
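A hypothetical PySpark sketch of choosing one of these storage levels (assuming a SparkContext named sc; the input path is made up):

from pyspark import StorageLevel

logs = sc.textFile("hdfs:///data/logs/*")
errors = logs.filter(lambda l: "ERROR" in l)

errors.persist(StorageLevel.MEMORY_AND_DISK)   # keep it around, spilling to disk if needed

n = errors.count()       # first action: computes and caches the partitions
sample = errors.take(5)  # later actions reuse the cached partitions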

Page 22

RDD: Examples

Page 23

RDD: Examples

Page 24

RDD: K-means

• At the beginning of this module we saw how the Hadoop API complicated the implementation of an algorithm as simple as K-means

• This is the native implementation in Spark, included in the MLlib module; a short usage sketch follows the link below

• https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
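Calling that implementation from user code takes only a few lines; a minimal PySpark sketch (the input file and parameters are made up, and a SparkContext named sc is assumed):

from numpy import array
from pyspark.mllib.clustering import KMeans

# each line of the (hypothetical) file holds space-separated numeric features
data = sc.textFile("data/points.txt")
points = data.map(lambda line: array([float(x) for x in line.split()]))

model = KMeans.train(points, 3, maxIterations=10)
centers = model.clusterCenters           # list of the k centroids
label = model.predict(points.first())    # cluster index assigned to one point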

Page 25

RDD: hardware

The State of Spark, and Where We're Going Next — Matei Zaharia, Spark Summit (2013), youtu.be/nU6vO2EJAb4