Spark Summit - Stratio Streaming
Stratio is the only Big Data platform able to combine, in one query, stored data with streaming data in real time (in less than 30 seconds).
We are polyglots as well: we use Spark over two NoSQL databases, Cassandra and MongoDB.
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. Internally, a DStream is represented as a sequence of RDDs, Spark’s abstraction of an immutable, distributed dataset.
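The micro-batch idea behind DStreams can be sketched in plain Python (illustrative only, not Spark’s API): a stream is a sequence of micro-batches, each micro-batch standing in for one RDD, and a transformation is applied batch by batch.

```python
# Plain-Python sketch of the DStream model (not Spark itself):
# each inner list stands in for one RDD in the sequence.
def map_stream(func, dstream):
    """Apply func to every record of every micro-batch, like DStream.map."""
    return [[func(x) for x in batch] for batch in dstream]

batches = [[1, 2, 3], [4, 5], [6]]      # three micro-batches
print(map_stream(lambda x: x * 2, batches))
# [[2, 4, 6], [8, 10], [12]]
```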
• Shark (SQL)
• Spark Streaming
• MLlib (machine learning)
• GraphX (graph)
• map(func), flatMap(func), filter(func), count()
• repartition(numPartitions)
• union(otherStream)
• reduce(func), countByValue(), reduceByKey(func, [numTasks])
• join(otherStream, [numTasks]), cogroup(otherStream, [numTasks])
• transform(func)
• updateStateByKey(func)
• window(windowLength, slideInterval)
• countByWindow(windowLength, slideInterval)
• reduceByWindow(func, windowLength, slideInterval)
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
• countByValueAndWindow(windowLength, slideInterval, [numTasks])
• print()
• foreachRDD(func)
• saveAsObjectFiles(prefix, [suffix])
• saveAsTextFiles(prefix, [suffix])
• saveAsHadoopFiles(prefix, [suffix])
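The windowed operations above all follow the same pattern: gather the micro-batches that fall inside the window, then apply the reduction. A minimal plain-Python sketch of countByWindow (the function name and signature are illustrative, not Spark’s API):

```python
def count_by_window(batches, window_length, slide_interval):
    """Sliding count over a list of micro-batches: at every slide point,
    count the records in the last `window_length` batches."""
    counts = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = [x for batch in batches[start:end] for x in batch]
        counts.append(len(window))
    return counts

batches = [[1, 2], [3], [4, 5, 6], [7]]
print(count_by_window(batches, window_length=2, slide_interval=1))
# [2, 3, 4, 4]
```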
Complex event processing (CEP) is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances.
As a technique, CEP helps discover complex events by analyzing and correlating other events.
A CEP engine should provide operators over streams, keeping in mind that events and streams in a CEP are first-class citizens. In CEP, we think in terms of event streams: event stream is a sequence of events that arrives over time.
Users provide queries to the CEP engine whose main mission is matching those queries against events coming through event streams.
A CEP engine thus has a notion of time, allowing temporal queries that reason in terms of concepts such as “time windows” or “before and after” event relationships, among others.
• Filter
• Join
• Aggregation (Avg, Sum, Min, Max, Custom)
• Group by
• Having
• Conditions and Expressions (and, or, not, true/false, ==, !=, >=, >, <=, <)
• Data types (boolean, string, int, long, float, double)
• Pattern processing
• Sequence processing (zero to many, one to many, and zero to one)
• You still have to integrate it into your code
• There is nothing like an interactive console
• If you want to do something with the streams, you guessed it, you have to code it!
• There is no way to remotely listen to a stream
• There are no ready-to-use solution patterns for the engine
• No statistics, no auditing
• Hard to integrate with other tools (dashboarding, log stream, batch processing)
With this solution you can use our API to send commands to the Stratio Streaming engine from your code.
You can also work with the interactive shell to test your queries or interact with the engine on demand.
Both tools, in fact, hide that you are sending messages to a complex engine, built with Zookeeper, Kafka, Spark Streaming and Siddhi CEP Engine.
[Architecture diagram: requests and events flow through Kafka, coordinated by Zookeeper, into the engine, with Cassandra as the persistence layer]
• create --stream testStream --definition "name.string,data.double"
• insert --stream testStream --values "name.Temperature, field.testValue, data.33"
• save cassandra start --stream testStream
• alter --stream testStream --definition "field.string"
CREATE --stream testStream --definition (name.string, data.double, data2.int, data3.float, data4.double, trueorfalse.boolean)
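The `--definition` argument above is just a comma-separated list of `field.type` pairs. A small plain-Python sketch of parsing it into a schema (an illustrative helper, not the actual Stratio shell parser):

```python
def parse_definition(defn):
    """Split a `field.type, field.type, ...` definition string
    into (field, type) pairs."""
    fields = []
    for part in defn.split(","):
        name, ftype = part.strip().split(".")
        fields.append((name, ftype))
    return fields

print(parse_definition("name.string, data.double, trueorfalse.boolean"))
# [('name', 'string'), ('data', 'double'), ('trueorfalse', 'boolean')]
```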
There are a lot of CEP operators that you can use in your queries:
• Filtering
• Projection
• In-built functions
• Windows (time and length)
• Join
• Event Sequences
• Event Patterns
• Output rate limiting
• Custom windows, custom functions
from sensor_grid#window.length(10) select name, ind, avg(data) as data group by name insert into sensor_grid_avg for current-events
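What the query above does can be sketched in plain Python (illustrative only, not Siddhi): keep a length-10 window of events and, on each arrival, emit the running average of `data` for the arriving event’s `name` inside the window.

```python
from collections import deque

def length_window_avg(events, length=10):
    """Sketch of `#window.length(10) ... avg(data) group by name`:
    keep the last `length` events; on each arrival emit the average
    of `data` for that event's `name` within the window."""
    window = deque(maxlen=length)
    out = []
    for name, data in events:
        window.append((name, data))
        values = [d for n, d in window if n == name]
        out.append((name, sum(values) / len(values)))
    return out

events = [("t1", 10.0), ("t2", 4.0), ("t1", 20.0)]
print(length_window_avg(events, length=10))
# [('t1', 10.0), ('t2', 4.0), ('t1', 15.0)]
```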
1. >, <, ==, >=, <=, !=
2. contains, instanceof
3. and, or, not
1. sum, avg, max, min, count: aggregations (used with group by, having)
2. Field Type Conversion
3. Coalesce: if a field is null, take another field
4. IsMatch: true or false depending on whether the field matches a regex
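The Coalesce and IsMatch helpers (3 and 4 above) are simple enough to sketch in plain Python (illustrative, not Siddhi’s implementation):

```python
import re

def coalesce(*fields):
    """Return the first non-null field, like CEP's coalesce."""
    for f in fields:
        if f is not None:
            return f
    return None

def is_match(pattern, value):
    """True if the regex matches anywhere in the value."""
    return re.search(pattern, value) is not None

print(coalesce(None, "backup", "ignored"))  # backup
print(is_match(r"^ERR", "ERR-42"))          # True
```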
from orders[price >= 20 and price < 100]…
from orders select * insert into ordersB…
from orders select client, price insert into ordersB…
1. Length window - a sliding window that keeps the last N events.
2. Time window - a sliding window that keeps events that have arrived within the last T time period.
3. Time and Length batch windows - same concept, but events are output only at the end of the given window.
4. Unique window - keeps only the latest events that are unique according to the given unique attribute.
5. First unique window - keeps the first events that are unique according to the given unique attribute.
6. External Time Window - a sliding window that processes according to timestamps defined externally
from payments[channel == 'Paypal']#window.time(1 min)
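A sliding time window (type 2 above) can be sketched in plain Python (illustrative only): keep the events whose timestamp falls within the last T seconds.

```python
def time_window(events, now, period_s=60):
    """Sketch of a sliding time window: keep (timestamp, payload)
    events that arrived within the last `period_s` seconds of `now`."""
    return [(ts, payload) for ts, payload in events if now - ts <= period_s]

events = [(0, "old"), (50, "recent"), (95, "fresh")]
print(time_window(events, now=100, period_s=60))
# [(50, 'recent'), (95, 'fresh')]
```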
• With “on <condition>”, joins only the events that match the condition
• With “within <time>”, joins only the events that are within said time of each other
from errorStream#window.length(1) as errorStream join allStream#window.length(1) as allStream on errorStream.numberOfErrors > allStream.totalNumberOfEvents*0.05 select * insert into alarmByThreshold;
from every (a1 = infoStock[action == "buy"] -> a2 = confirmOrder[command == "OK"]) -> b1 = StockExchangeStream[price > infoStock.price]
within 3000 select a1.action as action, b1.price as price insert into StockQuote
from every a1 = infoStock[action == "buy"]+, b1 = StockExchangeStream[price > 70]?,
b2 = StockExchangeStream[price >= 75] select a1[0].action as action, b1.price as priceA, b2.price as priceB insert into StockQuote
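The `->` (followed-by) relationship in the patterns above can be sketched in plain Python (illustrative, not Siddhi): remember pending “buy” events, and fire when a later price event exceeds a pending buy’s price.

```python
def match_followed_by(events):
    """Sketch of `a1 = infoStock[action == "buy"] -> b1 = Stream[price > a1.price]`:
    track pending buys and emit a match when a later price exceeds one."""
    pending = []   # buy prices still waiting for a higher price
    matches = []
    for ev in events:
        if ev.get("action") == "buy":
            pending.append(ev["price"])
        elif "price" in ev:
            for buy_price in pending:
                if ev["price"] > buy_price:
                    matches.append(("buy", ev["price"]))
            # drop buys that have now been exceeded
            pending = [p for p in pending if ev["price"] <= p]
    return matches

events = [{"action": "buy", "price": 50},
          {"price": 45},
          {"price": 60}]
print(match_followed_by(events))  # [('buy', 60)]
```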
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
Stratio Ingestion is an ETL for Big Data product, based on Flume.
Design your workflows (WYSIWYG) with useful and improved sources and sinks, and transform your data on the fly.
• Create the stream if it doesn’t exist
• It is possible to send only filtered event flows to the streaming engine
• Built on the StratioStreaming API.
• Call-center real-time monitoring
- Real-time detection of client churn risk
- Natural language processing analysis to detect incidents in real time
- Anomaly detection in the service based on patterns
• IT services monitoring
- DoS attack detection, hotlinking, etc. in real time
- Warnings when monitoring heterogeneous services
- Preventive detection of downtime based on patterns
• Sensor grid monitoring
- Alarms when thresholds are reached
- Complex alarms involving several sensors
- Real-time monitoring (landing support devices in an airport, for example)
Data Machine Intelligence
SELECT sum(order.quantity), company_data.country FROM streaming.order WITH WINDOW 15 minutes INNER JOIN batch.company_data ON order.company = company_data.company_name;
• With a powerful query planner
• Able to perform mixed queries over streaming and batch data
An SQL query example mixing real-time data (coming from the Stratio Streaming engine) and batch data (stored in a NoSQL database).
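The mixed query above can be sketched in plain Python (illustrative only, no SQL engine assumed): sum the streamed orders that fall inside the window, then inner-join the result against a batch table of companies.

```python
def mixed_query(stream_orders, batch_companies, window_s=900):
    """Sketch of joining a 15-minute streaming aggregate with batch data:
    sum order quantity per company inside the window, then look up each
    company's country in the batch table (inner join on company name)."""
    totals = {}
    for ts, company, quantity in stream_orders:
        if 0 <= ts < window_s:               # keep events inside the window
            totals[company] = totals.get(company, 0) + quantity
    return [(qty, batch_companies[c]) for c, qty in totals.items()
            if c in batch_companies]

orders = [(10, "acme", 2), (20, "acme", 3), (950, "acme", 7)]
companies = {"acme": "ES"}
print(mixed_query(orders, companies))  # [(5, 'ES')]
```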
We are first going to use the Shell to create streams and queries.