IntegratingData-ParallelAnalyticsintoStream-ProcessingUsinganIn-MemoryDataGrid
DR.WILLIAML.BAIN
SCALEOUT SOFTWARE,INC.
• Dr.WilliamBain,Founder&CEOofScaleOutSoftware:• Email:[email protected]
• Ph.D.inElectricalEngineering(RiceUniversity,1978)
• Careerfocusedonparallelcomputing– BellLabs,Intel,Microsoft
• 3priorstart-ups,lastacquiredbyMicrosoftandproductnowshipsasNetworkLoadBalancinginWindowsServer
• ScaleOutSoftwaredevelopsandmarketsIn-MemoryDataGrids,softwarefor:• Scalingapplicationperformancewithin-memorydatastorage
• Analyzinglivedatainrealtimewithin-memorycomputing
• Thirteen+yearsinthemarket;450+customers,12,000+servers
AbouttheSpeaker
2In-MemoryComputingSummitEurope2018
Agenda
HowIn-MemoryComputingCreatestheNextGenerationinStream-Processing• Goalsandchallengesforstream-processing• Addingcontext:statefulstream-processing• Overviewofin-memorydatagrids(IMDGs)• Digitaltwinmodelforstream-processing• WhyuseanIMDG:integratedeventprocessinganddata-parallelanalysis• Exampleusecases• Detailedcodesample:runnerswithsmartwatches• Performancebenefits
©ScaleOutSoftware,Inc.CompanyConfidential
3In-MemoryComputingSummit2017
GoalsforStream-Processing
• Goals:• Processincomingdatastreamsfrommany(1000s)ofsources.• Analyzeeventsforpatternsofinterest.• Providetimely(real-time)feedbackandalerts.• Providedata-parallelanalyticsforaggregatestatisticsandfeedback.
• Manyapplications:• InternetofThings(IoT)• Medicalmonitoring• Logistics• Financialtradingsystems• Ecommercerecommendations
4
EventSources
In-MemoryComputingSummitEurope2018
Example:EcommerceRecommendations
1000sofonlineshoppers:• Eachshoppergeneratesaclickstreamofproductssearched.• Stream-processingsystemmust:
• Correlateclicksforeachshopper.• Maintainahistoryofclicksduringashoppingsession.
• Analyzeclickstocreatenewrecommendationswithin100msec.
• Analysismust:• Takeintoaccounttheshopper’spreferencesanddemographics.
• Useaggregatefeedbackoncollaborativeshoppingbehavior.
5In-MemoryComputingSummitEurope2018
ProvidingRecommendationsinRealTime
• Requiresscalablestream-processingtoanalyzeeachclickandrespondin<100ms:• Acceptinputwitheacheventonshopper’spreferences.• Provideaggregatefeedbackonbest-sellingproducts.
6In-MemoryComputingSummitEurope2018
ProvidingAggregateMetrics
• Mustaggregatestatisticsforallshoppers:• Trackreal-timeshoppingbehavior.• Chartkeypurchasingtrends.• Enablemerchandizertocreatepromotionsdynamically.
• Aggregatestatisticscanbesharedwithshoppers:• Allowsshopperstoobtaincollaborativefeedback.
• Examplesincludemostviewedandbestsellingproducts.
7In-MemoryComputingSummitEurope2018
ChallengesforStream-ProcessingArchitectures• Basicstream-processingarchitecture:
• Challenges:• Howefficientlycorrelateeventsfromeachdatasource?• Howcombineeventswithrelevantstateinformationtocreatethenecessarycontextforanalysis?• Howembedapplication-specificanalysisalgorithmsinthepipeline?• Howgeneratefeedback/alertswithlowlatency?• Howperformdata-parallelanalyticstodetermineaggregatetrends?
8In-MemoryComputingSummitEurope2018
AddingContexttoStream-Processing
• Statefulstream-processingplatformsadd“unmanaged”datastoragetothepipeline:• Pipelinestagesperformtransformationsinasequenceofstagesfromdatasourcestosinks.• Datastorage(distributedcache,database)isaccessedfromthepipelinebyapplicationcodeinanunspecifiedmanner.
• Examples:Apama (CEP),ApacheFlink,Storm
• Problems:• Thereisnosoftwarearchitectureformanagingstateinformation.
• Thisaddscomplexitytotheapplication.
• Createsanetworkbottleneck.• Doesnotaddressneedfordata-parallelanalytics.
9In-MemoryComputingSummitEurope2018
LambdaArchitecture:BatchParallelAnalytics
• Lambdaarchitectureseparatesstream-processing(“speedlayer”)fromdata-parallelanalytics(“batchlayer”).• Createsqueryable state,but:
• Doesnotenhancecontextforstatefulstreamprocessing.
• Doesnotperformdata-parallelanalyticsonlineforimmediatefeedback.
• Doesnotleadtoa“HybridTransactionalandAnalyticsProcessing”(HTAP)architecture.
10
https://commons.wikimedia.org/w/index.php?curid=34963987
Howcombinestream-processingwithstatetosimplifydesign,maximizeperformance,andenablefastdata-parallelanalytics?
In-MemoryComputingSummitEurope2018
In-MemoryDataGrid(IMDG)
IMDGprovidesapowerfulplatformforstatefulstream-processing.WhatisanIMDG?• IMDGstoreslive,object-orienteddata:
• Usesakey/valuestoragemodelforlargeobjectcollections.
• Mapsobjectstoaclusterofcommodityserverswithlocationtransparency.
• Haspredictablyfast(<1msec.)dataaccessandupdates.• Designedfortransparent scalingandhighavailability
• IMDGintegratesin-memorycomputingwithdatastorage:• Usesobject-orientedexecutionmodel.• Leveragesthecluster’scomputingpower.• Computeswherethedatalivestoavoidnetworkbottlenecks.
Logicalview
Physicalview
11
IMDGStorageModel
In-MemoryComputingSummitEurope2018
HowanIMDGCanIntegrateComputation• Eachgridhostrunsaworkerprocesswhichexecutesapplication-definedmethodsonstoredobjects.• Thesetofworkerprocessesiscalledaninvocationgrid(IG).
• IGusuallyrunslanguage-specificruntimes(JVM,.NET).
• IMDGcanshipcodetotheIGworkers.
• KeyadvantagesforIGs:• Followsobject-orientedmodel.• Avoidsnetworkbottlenecksbymovingcomputingtothedata.
• LeveragesIMDG’scores&servers.
InvocationGrid
12
In-MemoryDataGrid
In-MemoryComputingSummitEurope2018
IMDGRunsEventHandlersforStream-Processing
Eventhandlersrunindependentlyforeachincomingevent:• IMDGdirectseventtoaspecificobjectusingReactiveX forlowlatency.• IMDGexecutesmultipleeventhandlersinparallelforhighthroughput.
13
Object
In-MemoryComputingSummitEurope2018
IMDGExecutesData-ParallelComputationsMethodexecutionimplementsabatchjobonanobjectcollection:• Clientrunsasinglemethodonallobjectsinacollection.• Executionrunsinparallelacrossthegrid.• Resultsaremergedandreturnedtotheclient.
14
Object
In-MemoryComputingSummitEurope2018
ABasicData-ParallelExecutionModel
Afundamentalmodelfromparallelsupercomputing:• Runonemethod(“eval”)inparallelacrossmanydataobjects.• Optionallymerge theresults.• Binarycombiningisaspecialcase,but…
• ItrunsinlogN timetoenablescalablespeedup
15In-MemoryComputingSummitEurope2018
MapReduceBuildsonThisModel
• Implements“group-by”computations.• Example:“DetermineaverageRPMforallwindmillsbyregion(NE,NW,SE,SW).”• Runsintwodata-parallelphases(map,reduce):• Map phaserepartitionsandoptionallycombinessourcedata.
• Reduce phaseanalyzeseachdatapartitioninparallel.
• Returnsresultsforeachpartition.
partitions
16In-MemoryComputingSummitEurope2018
DistributedForEach:AnotherData-ParallelModel
17In-MemoryComputingSummitEurope2018
ReducedGCTimewithDistributedForEach
PMI DistributedForEach
18In-MemoryComputingSummitEurope2018
Stream-ProcessingwiththeDigitalTwinModel
• CreatedbyMichaelGrieves;popularizedbyGartner• RepresentseachdatasourcewithanIMDGobjectthatholds:• Aneventcollection• Stateinformationaboutthedatasource• Logicforanalyzingevents,updatingstate,andgeneratingalerts
• Benefits:• Offersastructuredapproachtostatefulstream-processing.• Automaticallycorrelatesincomingeventsbydatasource.• Integratesallrelevantcontext(events&state).• Enableseasydeploymentofapplication-specificlogic(e.g.,ML,rulesengine,etc.)foranalysisandalerting.
• Providesdomainforaggregateanalysisandfeedback.
19In-MemoryComputingSummitEurope2018
Data-parallelanalysis
SomeApplicationsforDigitalTwins
Adigitaltwincorrelatesincomingeventswithcontextusingdomain-specificalgorithmstogeneratealerts:
20
Application Context Events Logic Alerts
IoT devices Devicestatus&history Devicetelemetry Analyzetopredictmaintenance.
Maintenancerequests
Medicalmonitoring
Patienthistory&medications
Heart-rate,blood-pressure,etc.
Evaluatemeasurementsovertimewindowswithrulesengine.
Alertstopatient&physician
CableTV Viewerpreferences&history,set-topboxstatus
Channelchangeevents,telemetry
Cleanse&mapchanneleventsforreco.engine;predictboxfailure.
Viewerrecom-mendations,repairalerts
Ecommerce Shopperpreferences&buyinghistory
Clickstreameventsfromwebsite
UseMLtomakeproductrecommendations.
Productlistforwebsite
Frauddetection
Customerstatus&history
Transactions Analyzepatternstoidentifyprobablefraud.
Alertstocustomer&bank
In-MemoryComputingSummitEurope2018
WhyUseanIMDGtoHostDigitalTwins?
IMDGprovidesanexcellentDTplaftorm:• Scalable,object-orienteddatastorage:
• Offersanaturalmodelforhostingdigitaltwins.• Cleanlyseparatesdomainlogicfromdata-parallelorchestration.
• Integrated,In-memorycomputing:• Automaticallycorrelatesincomingeventsforanalysis.
• Enablesbothstreamanddata-parallelprocessing.
• Highperformance:• Avoidsdatamotionandassociatednetworkbottlenecks.
• Fastandscalestohandlelargeworkloads.• Integratedhighavailability:
• Usesdatareplicationdesignedforlivesystems.• Canensurethatcomputationishighav.
21In-MemoryComputingSummitEurope2018
ScalingEventIngestionwithKafka
22
• IMDGpartitionsdigitaltwinobjectsacrossservers.• Kafkaofferspartitionstoscaleouthandlingofeventmessages.• Partitionsaredistributedacrossbrokers.• Brokersprocessmessagesinparallel.
• IMDGcanmapKafkapartitionstogridpartitions:• IMDGspecifiesevent-mappingalgorithmtoKafka.
• IMDGlistenstoappropriateKafkapartitions.
• Thisminimizeseventhandlinglatency.• Avoidsstore-and-forwardwithinIMDG.
In-MemoryComputingSummitSiliconValley2017
IntegratingEventandData-ParallelProcessingTheIMDG:• Postsincomingeventstoitsrespectivedigitaltwinobject.• Runsthetwin’seventhandlermethodwithlowlatency.• Eventhandlermanagestheeventcollectionandcanusetimewindowsforanalysis.
• Eventhandlerusesandupdatesin-memorystate.• Eventhandlercanuse/updateoff-linestate.• Eventhandleroptionallygeneratesalertsandfeedbacktoitsdigitaltwin.
• Runsdata-parallelmethodstoanalyzealldigitaltwinsinreal-time.• Resultscanbeusedforbothalertingandfeedback.
23
OfflineState Data-ParallelAnalysis
In-MemoryComputingSummitEurope2018
Example:EcommerceShoppingSite
Trackswebshoppersandprovidesreal-timerecommendations:• EachDTobjectholdsclickstreamofbrowsedproducts,preferences,anddemographics.• Eventhandleranalyzesthisdataandupdatesrecommendations.• Periodicdata-parallel,batchanalyticsacrossallshoppersdetermineaggregatetrends:• Examplesincludebestsellingproducts,averagebasketsize,etc.
• Usedforanalysisandreal-timefeedback
24In-MemoryComputingSummitEurope2018
Example:TrackingaFleetofVehicles
• Goal:Tracktelemetryfromafleetofcarsortrucks.• Eventsindicatespeed,position,andotherparameters.
• Digitaltwinobjectstoresinformationaboutvehicle,driver,anddestination.
• Eventhandleralertsonexceptionalconditions(speeding,lostvehicle).
• Periodicdata-parallelanalyticsdeterminesaggregatefleetperformance:• Computesoverallfuelefficiency,driverperformance,vehicleavailability,etc.
• Canprovidefeedbacktodriverstooptimizeoperations.
25In-MemoryComputingSummitEurope2018
UsingDigitalTwinsinaHierarchy
Trackscomplexsystemsashierarchyofdigitaltwinobjects:• Leafnodesreceivetelemetryfromphysicalendpoints.• Higherlevelnodesrepresentsubsystems:• Receivetelemetryfromlower-levelnodes.
• Supplytelemetrytohigher-levelnodesasalerts.
• Allowsuccessiverefinementofreal-timetelemetryintohigher-levelabstractions.
26
Example:HierarchyofDigitalTwinsforaWindmill
In-MemoryComputingSummitEurope2018
DetailedExample:Heart-RateWatchMonitoring
Goal:Trackheart-rateforalargepopulationofrunners.• Heart-rateeventsflowfromsmartwatchestotheirrespectivedigitaltwinobjectsforanalysis.• Theanalysisuseswearer’shistory,activity,andaggregatestatisticstodeterminefeedbackandalerts.
27In-MemoryComputingSummitEurope2018
DigitalTwinObject(Java)• Holdseventcollectionanduser’scontext(age,medicalhistory,currentstatus,etc.):
28
public class User implements Serializable {private int _id;private double _height;private double _bodyWeight;private Gender _gender;private int _age;private int _averageHr;private WorkoutProgress _status;private int _sessionAverageMax;private List<Medication> _medications;private List<Long> _heartIncidents;private List<HeartRate> _runningHeartRateTelemetry;private long _alertTime;private boolean _alerted;...}
In-MemoryComputingSummitEurope2018
Eventcollection
User’scontext
Events&Alerts
• EventholdsperiodictelemetrysentfromwatchtoIMDG:
• Alertholdsdatatobesentbacktowearerand/ortomedicalpersonnel:
29
public class HeartRateEvent {private int _userId;private int _heartRate;private long _timestamp;private WorkoutType _workoutType;private WorkoutProgress _workoutProgress;private Event _event;...}
public class HeartRateAlert {private int _userId; private String _alertType;private String _params;...}
In-MemoryComputingSummitEurope2018
SettingUpaReactiveX PipelineontheIMDG
• DefineaReactiveX observerthatrunsoneveryserverintheIMDG:
• CreateaninvocationgridthatInitializestheReactiveX observeratstartup:
30
public class HeartRateObserver implements Observer<Event>, Serializable {@Override public void onNext(Event event) {
HeartRateEvent hre = HeartRateEvent.fromBytes(event.getPayload());hre.setEvent(event);User.processRunningEvent(hre);} ...}
In-MemoryComputingSummitEurope2018
Pipeline pipeline = new Pipeline(“userCache”, “userGrid”);GridAction action = pipeline.createRemoteObserverAction("userObserver",
new HeartRateObserver());InvocationGrid grid = new InvocationGridBuilder(“userGrid”)
.addJar("./bin/appcode.jar")
.addStartupAction(action)
.load();
Callapplication
Initializeobserver
EventHandlerandEventPosting
• PostinganeventtotheReactiveX observer:• Thekeydetermineswhichserverreceivestheeventforposting.
• HandlinganeventpostedtotheReactiveX observeronDTtwin’sserver:
31
private static void processRunningEvent(HeartRateEvent hre) {CachedObjectId id = hre.getId();User u = (User)cache.retrieve(id, false);...executeRunningWorkoutAnalytics(hre, u);...cache.update(id, u);}
In-MemoryComputingSummitEurope2018
RetrieveDTobject
Analyzeevent
UpdateDTobject
pipeline.postEvent(makeKey(UserId),"heartRateEvent", HeartRateEvent.toBytes(new HeartRateEvent(last, System.nanoTime(),WorkoutType.Running, WorkoutProgress)));
EventAnalysis• Handlesaneventforanactiveuserdoingarunningworkout:
32
private static void executeRunningWorkoutAnalytics(HeartRateEvent hre, User u) {long start = twoWeeksAgo();long sessionTimeout = threeHours();SessionWindowCollection<HeartRate> swc = new
SessionWindowCollection<>(u.getRunningHeartRateTelemetry(),heartRate -> heartRate.getTimestamp(), start, sessionTimeout);
swc.add(new HeartRate(hre.getHeartRate(), hre.getTimestamp()));
int total = 0; int windowCount = 0;for(TimeWindow<HeartRate> window : swc) {
int avg = 0;for(HeartRate hr : window) {avg += hr.getHeartRate();}total += (avg/window.size());windowCount++;}
u.setAverageHr(total/windowCount);u.analyzeAndCheckForAlert(hre);}
In-MemoryComputingSummitEurope2018
Analyzeeventhistory
Analyzeuser’scontext
Createtimewindows
Addevent
AnalysisTechniquesEnabledbyDigitalTwin
Enabledetailedheart-ratemonitoringforahighintensityexerciseprogram:• Exampleofdatatobetracked:
• Exercisespecifics:typeofexercise,exercise-specificparameters(distance,strides,altitudechange,etc.)
• Participantbackground/history:age,height,weighthistory,heart-relatedmedicalconditionsandmedications,injuries,previousmedicalevents
• Exercisetracking:sessionhistory,average#sessionsperweek,averageandpeakheartrates,frequencyofexercisetypes
• Aggregate statistics:average/max/minexercisetrackingstatisticsforallparticipants
• Exampleoflogictobeperformed:• Notifyparticipantifsessionhistoryacrosstimewindowsindicatesneedtochangemix.• Notifyparticipantifheartratetrendsdeviatesignificantlyfromaggregatestatistics.• Alertparticipant/medicalpersonnelifheartrateanalysisacrosstimewindowsindicatesanimminentthreattohealth.
• Report aggregatestatisticstoanalystsand/orusers.
33In-MemoryComputingSummitEurope2018
DataParallelAnalysisAcrossallDigitalTwins
• UsesIMDG’sin-memorycomputeenginetocreateaggregatestatisticsinrealtime.
• Resultscanbereportedtoanalystsandupdatedeveryfewseconds.
• Resultscanbeusedasfeedbacktoeventanalysisindigitaltwinobjectsand/orreportedtousers.
34In-MemoryComputingSummitEurope2018
ComputingAggregateData• Performsadata-parallelcomputationusingtheIMDG’sEval andMergemethods:
35
public class AggregateStatsInvokable implements Invokable<User, Integer,AggregateStats> {@Overridepublic AggregateStats eval(User u, Integer numUsers) {
AggregateStats userStats = new AggregateStats(numUsers);userStats.merge(u);return userStats ;
}
@Overridepublic AggregateStats merge(AggregateStats mergedStats,
AggregateStats u) {mergedStats.merge(u);return mergedStats;
}}
In-MemoryComputingSummitEurope2018
Evalmethod
Binarymergemethod
ComputingAggregateData(2)• Computesrunningaverageofheart-ratebycategories:
36
public void merge(AggregateStats user) {numEvents += user.getNumEvents();totalHeartRate18to34 += user.getTotalHeartRate18to34();totalHeartRate35to50 += user.getTotalHeartRate35to50();totalHeartRateOver50 += user.getTotalHeartRateOver50();count18to34 += user.getCount18to34();count35to50 += user.getCount35to50();countOver50 += user.getCountOver50();
totalHeartRateBmiUnderWeight += user.getTotalHeartRateBmiUnderWeight();totalHeartRateBmiNormalWeight += user.getTotalHeartRateBmiNormalWeight();totalHeartRateBmiOverweight += user.getTotalHeartRateBmiOverweight();countUnderweight += user.getCountUnderweight();countNormalWeight += user.getCountNormalWeight();countOverWeight += user.getCountOverWeight();
}
In-MemoryComputingSummitEurope2018
CreatesGroups
RunningtheData-ParallelComputation
• Usesasinglemethodtorunadata-parallelcomputationandreturnresults.• PublishesmergedresultstoanIMDGobjectforaccessbyuserobjectsand/oranalysts.
37
public void run() { NamedCache usersCache = CacheFactory.getCache(“userCache”);NamedCache statsCache = CacheFactory.getCache(“statsCache”);AggregateStats stats;
InvokeResult<AggregateStats> result =usersCache.invoke(AggregateStatsInvokable.class, null, _numUsers,
TimeSpan.fromMilliseconds(10000));
stats = result.getResult();statsCache.put(“globalStats”, stats);
}
In-MemoryComputingSummitEurope2018
Invokedata-parallelop
StoreresultinIMDG
Data-ParallelExecutionSteps
• Eval phase:eachserverquerieslocalobjectsandrunseval andmergemethods:• Accessinglocalobjectsavoidsdatamotion.• Completeswithoneresultobjectperserver.
• Mergephase:allserversperformbinary,distributedmergetocreatefinalresult:• Mergerunsinparalleltominimizecompletiontime.
• Returnsfinalresultobjecttoclient.
38In-MemoryComputingSummitEurope2018
Predictable,ScalablePerformance
• DigitaltwinmodelenablestheIMDGtoscalebothevent-handlingandintegrateddata-parallelanalysis.• Correlatingeventstodigitaltwinobjectscreatesanautomaticbasisforperformancescaling:
• Foreventanalysis• Fordata-parallelanalysis
• Itenablesaccesstoeachevent’scontextwithoutrequiringanetworkaccess.• Italsoco-locatesandencapsulatesapplication-specificcodeusingo-otechniques.
39In-MemoryComputingSummitEurope2018
AvoidsNetworkBottlenecks
• DigitaltwinmodelavoidsnetworkbottlenecksassociatedwithusinganIMDGasanetworkedcacheinastream-processingpipeline.• Externaldatastoragerequiresnetworkaccesstoobtainanevent’scontext.• Networkbottleneckpreventsscalablethroughput.
40In-MemoryComputingSummitEurope2018
Wrap-Up
DigitalTwins:TheNextGenerationinStatefulStream-Processing• Challenge:Currenttechniquesforstatefulstream-processing:
• Lackacoherentsoftwarearchitectureformanagingcontext.• Cansufferfromperformanceissuesduetonetworkbottlenecks.
• Thedigitaltwinmodel:• Offersaflexible,powerful,scalablearchitectureforstatefulstream-processing:
• Associateseventswithcontextabouttheirphysicalsourcesfordeeperintrospection.• Enablesflexible,object-orientedencapsulationofanalysisalgorithms.
• Providesabasisforaggregateanalysisandfeedback.
• Scalable,data-parallelcomputingwithanIMDG:• Automaticallycorrelatesincomingeventsandprocessestheminparallel.• Implementsintegrated(real-time),aggregateanalysisforimmediatefeedback.
41In-MemoryComputingSummitEurope2018
www.scaleoutsoftware.com
In-MemoryComput ing for Operat ional Inte l l igence
Top Related