Post on 12-Apr-2017
@gamussa @hazelcast #oraclecode
Solutions Architect / Developer Advocate
@gamussa in internetz
Please follow me on Twitter, I'm very interesting ☺
Who am I?
Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
When to use Spark?
Data Science Tasks, when questions are unknown
Data Processing Tasks, when you have too much data
You're tired of Hadoop
Resilient Distributed Datasets (RDD)
are the primary abstraction in Spark –
a fault-tolerant collection of elements that can be
operated on in parallel
transformations are lazy (not computed immediately)
the transformed RDD gets recomputed each time an action is run on it (by default)
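A minimal sketch of this laziness, assuming Spark is on the classpath and a `local[2]` master (the class name and sample data are illustrative, not from the talk): the `map` below builds a lineage only; nothing runs until the `count` action.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class LazyDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lazy-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformation: lazy, nothing is computed yet
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);

        // Action: triggers the actual distributed computation
        long evens = doubled.filter(x -> x % 2 == 0).count();
        System.out.println(evens);

        sc.stop();
    }
}
```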
parallelized collections
take an existing Scala collection and run functions on it in parallel
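The same idea works from Java: a sketch of `parallelize` over an in-memory collection, with an explicit slice count (the names and data here are illustrative assumptions).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class ParallelizeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parallelize-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Distribute an existing collection across 2 partitions
        JavaRDD<String> words =
                sc.parallelize(Arrays.asList("spark", "hazelcast", "rdd", "grid"), 2);

        // The mapped function runs on each element, partition by partition
        List<Integer> lengths = words.map(String::length).collect();
        System.out.println(lengths);

        sc.stop();
    }
}
```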
Hadoop datasets
run functions on each record of a file in the Hadoop distributed file system, or in any other storage system supported by Hadoop
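A sketch of the file-backed variant: `textFile` accepts HDFS, S3, and plain local paths alike. For a self-contained run, this example writes a temporary local file first (the temp file and class name are illustrative, not part of the talk).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class TextFileDemo {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("textfile-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stand-in for an HDFS path: a small local file
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, Arrays.asList("line one", "line two", "line three"));

        // Each element of the RDD is one record (line) of the file
        JavaRDD<String> lines = sc.textFile(tmp.toString());
        long n = lines.count();
        System.out.println(n);

        sc.stop();
    }
}
```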
Hazelcast IMDG is an operational,
in-memory, distributed computing platform
that manages data using in-memory storage, and
performs parallel execution for breakthrough application speed
and scale
High-Density Caching
In-Memory Data Grid
Web Session Clustering
Microservices Infrastructure
What’s Hazelcast IMDG?
In-memory Data Grid
Apache v2 Licensed
Distributed Caches (IMap, JCache)
Java Collections (IList, ISet, IQueue)
Messaging (Topic, RingBuffer)
Computation (ExecutorService, M-R)
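As a quick taste of the IMap API listed above, a minimal embedded-member sketch, using the Hazelcast 3.x package layout current at the time of this talk (in Hazelcast 4+, `IMap` lives in `com.hazelcast.map`); the map name and data are illustrative:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class IMapDemo {
    public static void main(String[] args) {
        // Start an embedded Hazelcast member
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        // IMap is a distributed, partitioned java.util.Map
        IMap<String, String> movies = hz.getMap("movie");
        movies.put("1", "The Matrix");
        String title = movies.get("1");
        System.out.println(title);

        hz.shutdown();
    }
}
```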
final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");

final JavaSparkContext jsc =
        new JavaSparkContext("spark://localhost:7077", "app", sparkConf);

final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
CURSOR DOESN'T POINT TO THE CORRECT ENTRY ANYMORE;
DUPLICATE OR MISSING ENTRIES COULD OCCUR