Post on 12-Apr-2017
@gamussa @hazelcast #oraclecode
Solutions Architect / Developer Advocate
@gamussa in internetz
Please follow me on Twitter, I'm very interesting ☺
Who am I?
Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
When to use Spark?
Data Science Tasks, when questions are unknown
Data Processing Tasks, when you have too much data
You're tired of Hadoop
Resilient Distributed Datasets (RDD)
are the primary abstraction in Spark –
a fault-tolerant collection of elements that can be
operated on in parallel
transformations are lazy (not computed immediately)
the transformed RDD gets recomputed each time an action is run on it (by default)
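A minimal sketch of this laziness, assuming Spark is on the classpath and a `local[2]` master (the class name and sample data are illustrative, not from the talk): the `map` below builds a lineage only; nothing runs until the `count` action.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class LazyDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lazy-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformation: lazy, nothing is computed yet
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);

        // Action: triggers the actual distributed computation
        long evens = doubled.filter(x -> x % 2 == 0).count();
        System.out.println(evens);

        sc.stop();
    }
}
```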
parallelized collections
take an existing Scala collection and run functions on it in parallel
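The same idea works from Java: a sketch of `parallelize` over an in-memory collection, with an explicit slice count (the names and data here are illustrative assumptions).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class ParallelizeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parallelize-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Distribute an existing collection across 2 partitions
        JavaRDD<String> words =
                sc.parallelize(Arrays.asList("spark", "hazelcast", "rdd", "grid"), 2);

        // The mapped function runs on each element, partition by partition
        List<Integer> lengths = words.map(String::length).collect();
        System.out.println(lengths);

        sc.stop();
    }
}
```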
Hadoop datasets
run functions on each record of a file in the Hadoop distributed file system, or in any other storage system supported by Hadoop
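A sketch of the file-backed variant: `textFile` accepts HDFS, S3, and plain local paths alike. For a self-contained run, this example writes a temporary local file first (the temp file and class name are illustrative, not part of the talk).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class TextFileDemo {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("textfile-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stand-in for an HDFS path: a small local file
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, Arrays.asList("line one", "line two", "line three"));

        // Each element of the RDD is one record (line) of the file
        JavaRDD<String> lines = sc.textFile(tmp.toString());
        long n = lines.count();
        System.out.println(n);

        sc.stop();
    }
}
```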
Hazelcast IMDG is an operational,
in-memory, distributed computing platform
that manages data using in-memory storage, and
performs parallel execution for breakthrough application speed
and scale
High-Density Caching
In-Memory Data Grid
Web Session Clustering
Microservices Infrastructure
What’s Hazelcast IMDG?
In-memory Data Grid
Apache v2 Licensed
Distributed Caches (IMap, JCache)
Java Collections (IList, ISet, IQueue)
Messaging (Topic, RingBuffer)
Computation (ExecutorService, M-R)
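As a quick taste of the IMap API listed above, a minimal embedded-member sketch, using the Hazelcast 3.x package layout current at the time of this talk (in Hazelcast 4+, `IMap` lives in `com.hazelcast.map`); the map name and data are illustrative:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class IMapDemo {
    public static void main(String[] args) {
        // Start an embedded Hazelcast member
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        // IMap is a distributed, partitioned java.util.Map
        IMap<String, String> movies = hz.getMap("movie");
        movies.put("1", "The Matrix");
        String title = movies.get("1");
        System.out.println(title);

        hz.shutdown();
    }
}
```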
final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");

final JavaSparkContext jsc =
        new JavaSparkContext("spark://localhost:7077", "app", sparkConf);

final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
CURSOR DOESN'T POINT TO THE CORRECT ENTRY ANYMORE;
DUPLICATE OR MISSING ENTRIES COULD OCCUR