First steps with Spark - Spark Meetup Madrid 30-09-2014
Upload: stratio

Agenda
• Introduction
• Basic concepts
• Spark ecosystem
• Environment setup
• Common mistakes
VIEWER DISCRETION IS ADVISED
All elephants are innocent until proven guilty in a court of development
Opinions expressed are solely my own and do not express the views or opinions of my employer.

Introduction
o What is Spark?
o Parallel processing framework
o History
https://spark.apache.org/
Apache Software Foundation
Apache Spark Madrid Meetup

Map-reduce
o A functional programming concept
o Popularized by Google
(map 'list (lambda (x) (+ x 10)) '(1 2 3 4)) => (11 12 13 14)
(reduce #'+ '(1 2 3 4)) => 10
Jeff Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI (2004)
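The two Lisp expressions above translate almost directly to Python. This is only a minimal sketch using the standard library, not Spark code; the names `numbers`, `plus_ten` and `total` are chosen here for illustration.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]

# map: apply a function to every element independently
plus_ten = list(map(lambda x: x + 10, numbers))

# reduce: fold all elements into a single value with a binary function
total = reduce(add, numbers)

print(plus_ten)  # [11, 12, 13, 14]
print(total)     # 10
```

Because each element is mapped independently, the map step can be split across machines, which is the property the MapReduce paper builds on.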

Map-Reduce
[Diagram: the input data feeds four Map tasks, whose outputs feed three Reduce tasks that produce the result]
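The flow in the diagram can be imitated in a single process. The sketch below is plain Python, not Hadoop or Spark; the word-count task, the sample data and the helper names (`map_phase`, `shuffle`, `reduce_phase`) are assumptions made for illustration. It uses four input partitions and three reducers to mirror the diagram.

```python
from collections import defaultdict
from functools import reduce

def map_phase(partition):
    """Emit a (word, 1) pair for every word in one input partition."""
    return [(word, 1) for line in partition for word in line.split()]

def shuffle(mapped_partitions, num_reducers):
    """Route every pair to a reducer chosen by hashing its key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_partitions:
        for key, value in pairs:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    """Fold the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in bucket.items()}

# Four input partitions and three reducers, as in the diagram.
partitions = [["a b a"], ["b c"], ["a"], ["c c b"]]
mapped = [map_phase(p) for p in partitions]
reduced = [reduce_phase(b) for b in shuffle(mapped, num_reducers=3)]

# Merge the per-reducer outputs into the final result.
result = {}
for part in reduced:
    result.update(part)
print(sorted(result.items()))  # [('a', 3), ('b', 3), ('c', 3)]
```

Each key lands in exactly one bucket, so the per-reducer outputs never overlap and can simply be merged.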

Advantages of Spark
o More flexibility when defining transformations
o Less use of disk storage
o Better use of memory
o Fault tolerance
o Community traction

Basic concepts

RDD
o The basic abstraction in Spark
o Holds the transformations to be applied to a dataset
• Immutable
• Lazy evaluation
• State can be recovered after a failure
• Control over persistence and partitioning
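The RDD properties listed above (immutability, lazy evaluation, recovery by recomputation) can be illustrated with a toy class. This is a sketch in plain Python and not the Spark API: `LazyDataset`, `source` and `reads` are hypothetical names, and real RDDs are also partitioned and distributed, which is left out here.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded list of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that re-reads the input data
        self._lineage = tuple(lineage)   # transformations recorded so far

    def map(self, fn):
        # Immutable: return a new dataset instead of modifying this one.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded lineage actually executed.
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

reads = []

def source():
    reads.append(1)   # record every time the input is read
    return [1, 2, 3, 4, 5]

base = LazyDataset(source)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(len(reads))               # 0: defining transformations reads nothing
print(evens_squared.collect())  # [4, 16]
print(evens_squared.collect())  # [4, 16] again, recomputed from the lineage
print(base.collect())           # [1, 2, 3, 4, 5]: the base dataset is unchanged
```

Recomputing from the recorded lineage is the idea behind recovering state after a failure: a lost result can be rebuilt from the source and the list of transformations.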

SparkContext
o Represents the connection to a Spark cluster
o Lets you create several kinds of variables
• RDD
• Accumulators
• Broadcast
new SparkContext(master: String, appName: String, conf: SparkConf)

Spark ecosystem
(Ecosystem diagram, © databricks)

Spark core engine
o Provides the basic abstractions and handles scheduling
[Diagram labels: RDD DAG Scheduling, Cluster manager, Threads, Block manager, Task scheduling, Worker]

Spark Streaming
o Turns a streaming source into a series of mini-batches
• Definition of a time window
[Diagram: Window = 5 over the mini-batches batch0 to batch7 along a time axis]
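A window over mini-batches can be sketched in a few lines of plain Python. This is not the Spark Streaming API: the function `windowed_sums`, the sample events and the slide interval of one batch are assumptions (the slide only states a window of 5).

```python
def windowed_sums(batches, window, slide=1):
    """Sum each run of `window` consecutive mini-batches,
    advancing `slide` batches at a time."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        in_window = batches[end - window:end]
        results.append(sum(sum(batch) for batch in in_window))
    return results

# Eight mini-batches (batch0 .. batch7); each holds the events of one interval.
batches = [[1], [2, 2], [3], [], [5], [1, 1], [4], [2]]

print(windowed_sums(batches, window=5))           # [13, 14, 14, 13]
print(windowed_sums(batches, window=5, slide=5))  # [13]
```

With a slide of one batch, consecutive windows overlap in four batches; a larger slide trades freshness for less repeated work.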

MLlib
o Machine learning library
o Useful abstractions for computation
• Vectors, sparse matrices
o Implementations of well-known algorithms
• Classification, regression, collaborative filtering and clustering

SparkSQL
o SQL access layer for running operations on RDDs
o SchemaRDD
# PySpark (Spark 1.x); sc is an existing SparkContext
from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)
© databricks

SparkSQL (II)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teenagers = sqlContext
  .sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Others
o GraphX
• Graph support
o SparkR
• Connects R with Spark
o BlinkDB
• Database that offers approximate functions
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}

Common mistakes
o Master URL
o Not distributing the JARs to the workers
o Functions that use non-serializable classes
o Assuming "works locally" implies "works on a cluster"
o Memory leaks and GC efficiency in operators
o Confusing operators (reduce vs group-by)
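The last item in the list, reduce vs group-by, is easier to see with numbers. The sketch below is plain Python, not Spark: both functions and the sample data are hypothetical, and the "shuffle" is modeled only as a count of the pairs that would have to leave their partition.

```python
from collections import defaultdict

def group_by_key_then_sum(partitions):
    """group-by style: ship every (key, value) pair, then sum per key."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)   # all values of a key are held at once
    return {k: sum(v) for k, v in groups.items()}, len(shuffled)

def reduce_by_key(partitions):
    """reduce style: pre-sum inside each partition, then ship one pair per key."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
grouped, grouped_shuffled = group_by_key_then_sum(partitions)
reduced, reduced_shuffled = reduce_by_key(partitions)

print(grouped == reduced)                  # True: same totals
print(grouped_shuffled, reduced_shuffled)  # 6 4: fewer pairs are shipped
```

Both approaches give the same totals, but the reduce style moves less data and never needs all values of a key in memory together, which is why it is usually the better fit for aggregations.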

Certifications
o Certified distributions
o Developer certification
o Certified training centers