Introduction to Spark

Nantes - 08/07/2014
Ludwine Probst - @nivdul

Developer
Maths lover

machine learning & big data

Leader of Duchess France

@nivdul
nivdul.wordpress.com

Current state of play

But…

• analytics on large datasets, with data kept in memory

• Resilient Distributed Datasets (RDD)

• lineage principle

• compatible with Hadoop / InputFormats

• better performance than Hadoop MapReduce

• more flexibility in implementation

Querying Spark

Scala/Python shell
supports lambda expressions (Java 8)

compatible with NumPy
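The Java 8 lambda support mentioned above mostly removes anonymous-class boilerplate. The sketch below is an illustration only: it uses the JDK's `java.util.function.Function` as a stand-in for Spark's own function interfaces, and the class and field names are hypothetical.

```java
import java.util.function.Function;

// Illustration only: java.util.function.Function stands in for Spark's function interfaces.
final class LambdaSketch {
    // Pre-Java 8 style: an anonymous class implementing a one-method interface.
    static final Function<String, String[]> SPLIT_ANONYMOUS = new Function<String, String[]>() {
        @Override
        public String[] apply(String line) {
            return line.split(";");
        }
    };

    // Java 8 style: the same behavior written as a lambda expression.
    static final Function<String, String[]> SPLIT_LAMBDA = line -> line.split(";");

    public static void main(String[] args) {
        String row = "a;b;c";
        System.out.println(SPLIT_ANONYMOUS.apply(row).length); // prints 3
        System.out.println(SPLIT_LAMBDA.apply(row).length);    // prints 3
    }
}
```

Both fields behave identically; the lambda form is what the Java snippets in the rest of this deck rely on.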

Overview

TODO: diagram

SparkContext

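import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf()
    .setAppName("SimpleExample")
    .setMaster("local");
    // or, to target a cluster: .setMaster("spark://192.168.1.11:7077")

JavaSparkContext sc = new JavaSparkContext(sparkConf);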

Resilient Distributed Datasets (RDD)

• created at startup

• can be processed in parallel / partitioned across the nodes of the cluster

• operations on RDDs = transformations + actions

• control over persistence: MEMORY, DISK…

• fault tolerance (lineage principle, tracked through the DAG)

Definition: distributed collections that are fault-tolerant and immutable
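The properties listed above (immutability, lazy transformations triggered by actions, rebuilding from lineage) can be sketched with a toy class. This is a hedged illustration in plain Java, not Spark's RDD implementation: `ToyRdd`, `of`, and `lineage()` are hypothetical names, and everything runs on a single machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Toy model only, NOT Spark's RDD: each instance is immutable and remembers
// how to rebuild its data from its parent (its lineage).
final class ToyRdd<T> {
    private final String lineage;             // readable chain of operations
    private final Supplier<List<T>> compute;  // recipe that rebuilds the data on demand

    private ToyRdd(String lineage, Supplier<List<T>> compute) {
        this.lineage = lineage;
        this.compute = compute;
    }

    static <T> ToyRdd<T> of(List<T> source) {
        List<T> copy = Collections.unmodifiableList(new ArrayList<>(source));
        return new ToyRdd<>("source", () -> copy);
    }

    // Transformation: returns a new ToyRdd and runs nothing yet.
    <R> ToyRdd<R> map(Function<T, R> f) {
        return new ToyRdd<>(lineage + " -> map", () -> {
            List<R> out = new ArrayList<>();
            for (T item : compute.get()) {
                out.add(f.apply(item));
            }
            return out;
        });
    }

    // Transformation: returns a new ToyRdd and runs nothing yet.
    ToyRdd<T> filter(Predicate<T> keep) {
        return new ToyRdd<>(lineage + " -> filter", () -> {
            List<T> out = new ArrayList<>();
            for (T item : compute.get()) {
                if (keep.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    // Action: replays the whole lineage to produce a result.
    long count() {
        return compute.get().size();
    }

    String lineage() {
        return lineage;
    }
}

final class ToyRddDemo {
    public static void main(String[] args) {
        ToyRdd<Integer> base = ToyRdd.of(Arrays.asList(1, 2, 3, 4));
        ToyRdd<Integer> big = base.map(x -> x * 10).filter(x -> x > 20);
        System.out.println(big.lineage()); // prints "source -> map -> filter"
        System.out.println(big.count());   // prints 2
        System.out.println(base.count());  // prints 4: the parent is unchanged
    }
}
```

Because every action replays the recorded chain, a lost result can always be rebuilt from the source, which is the intuition behind lineage-based fault tolerance.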

Creating an RDD

import org.apache.spark.api.java.JavaRDD;

// sc is the SparkContext

// from a text file
JavaRDD<String> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");

// from a file coming from Hadoop
sc.hadoopFile(path, inputFormatClass, keyClass, valueClass);

Operations on RDDs

import scala.Tuple2;

JavaRDD<String[]> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv")
    .map(line -> line.split(";"))
    // drop the header row
    .filter(line -> !line[1].equals("ins_com"));

lines.count();

// count per equipment type, sorted alphabetically by type
lines.mapToPair(line -> new Tuple2<>(line[3], 1))
    .reduceByKey((x, y) -> x + y)
    .sortByKey()
    .foreach(t -> System.out.println(t._1 + " -> " + t._2));
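To see what the `mapToPair` / `reduceByKey` / `sortByKey` chain computes, here is a rough single-machine equivalent using only the JDK. It is a sketch under assumptions: the sample rows and their column layout are invented for illustration (the real CSV is not shown in the slides), and `EquipmentCountSketch` and `countByType` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Local, non-distributed analogue of the Spark pipeline; not Spark code.
final class EquipmentCountSketch {
    // Counts rows per value of column 3, with keys in alphabetical order.
    static Map<String, Integer> countByType(List<String> csvLines) {
        return csvLines.stream()
            .map(line -> line.split(";"))
            // drop the header row, as in the Spark version
            .filter(cols -> !cols[1].equals("ins_com"))
            // (type, 1) pairs merged by summing; TreeMap keeps keys sorted
            .collect(Collectors.toMap(cols -> cols[3], cols -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Hypothetical sample rows; the real file's columns may differ.
        List<String> sample = Arrays.asList(
            "id;ins_com;nom;type",
            "1;75056;Stade A;Terrain de football",
            "2;75056;Piscine B;Bassin de natation",
            "3;93066;Stade C;Terrain de football");
        System.out.println(countByType(sample));
        // prints {Bassin de natation=1, Terrain de football=2}
    }
}
```

The merge function `Integer::sum` plays the role of `(x, y) -> x + y` in `reduceByKey`, and the `TreeMap` stands in for `sortByKey`.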

RDD persistence

import org.apache.spark.storage.StorageLevel;

// lines is an RDD

// default persistence level: MEMORY_ONLY
lines.cache();

// explicit storage level (choose one; a level cannot be changed once assigned)
lines.persist(StorageLevel.DISK_ONLY());
lines.persist(StorageLevel.MEMORY_ONLY());
lines.persist(StorageLevel.MEMORY_AND_DISK());

// with replication
lines.persist(StorageLevel.apply(1, 3));

*Spark is fault-tolerant thanks to the execution graph, which records the sequence of operations applied to an RDD
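The effect of `cache()` can be pictured as memoization: without it, each use replays the computation; with it, the first result is kept and reused. The sketch below is an analogy in plain Java, not Spark's storage mechanism, and `CacheSketch` and `cached` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Analogy only: memoizing a Supplier, loosely comparable to keeping an RDD in memory.
final class CacheSketch {
    // Wraps a computation so its result is stored after the first call.
    static <T> Supplier<T> cached(Supplier<T> compute) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = compute.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger runs = new AtomicInteger();
        Supplier<Integer> expensive = () -> { runs.incrementAndGet(); return 42; };

        expensive.get();
        expensive.get();
        System.out.println(runs.get()); // prints 2: recomputed on every use

        runs.set(0);
        Supplier<Integer> kept = cached(expensive);
        kept.get();
        kept.get();
        System.out.println(runs.get()); // prints 1: computed once, then reused
    }
}
```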

Performance

Spark ecosystem

Streaming
