Spark and Online Analytics: Spark Summit East talk by Shubham Chopra
©2017 Bloomberg Finance L.P. All rights reserved.
February 9, 2017
Shubham Chopra, Software Engineer
Spark and Online Analytics
Spark Summit East 2017
Agenda
• Data and Analytics at Bloomberg
• The role of Spark
• The Bloomberg Spark Server
• Spark for online use cases
Data and Analytics are our Business
Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
• Handle increasingly sophisticated client analytic workflows
• Ad-hoc and cross-domain aggregations, filtering
• Heterogeneous data stores
• Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries
Spark for Bloomberg Analytics
• Distributed compute scales well for:
• Large security universes
• Multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present consistent interface for efficient data access
• Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries
Spark as a Service?
• Stand-alone Spark Apps on isolated clusters pose challenges:
• Redundancy in:
• Crafting and managing RDDs/DFs
• Coding of the same or similar types of transforms/actions
• Management of clusters, replication of data, etc.
• Analytics are confined to specific content sets, making cross-asset analytics much harder
• Need to handle real-time ingestion in each App
Bloomberg Spark Server
• A single long-running Spark application
• Analytics deployed as Request Processors and served via a REST API
• Can be deployed on YARN or Mesos or standalone
• Ingest time transforms to load data in Spark from a backing store
• Query time transforms to run analytics on the ingested data in Spark
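The split between ingest-time and query-time transforms can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions: it does not use Spark, and `SparkServerModel`, `ingest`, `register`, `handle` and the route name are hypothetical names chosen for illustration, not the Bloomberg Spark Server API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Processor = Callable[[List[Row], Dict[str, Any]], Any]


@dataclass
class SparkServerModel:
    """Toy stand-in for a single long-running server (hypothetical names)."""
    datasets: Dict[str, List[Row]] = field(default_factory=dict)
    processors: Dict[str, Processor] = field(default_factory=dict)

    def ingest(self, name: str, backing_store: List[Row],
               transform: Callable[[Row], Row]) -> None:
        # Ingest-time transform: applied once, when data is loaded
        # from the backing store and kept in memory.
        self.datasets[name] = [transform(row) for row in backing_store]

    def register(self, route: str, processor: Processor) -> None:
        # A request processor is an analytic bound to a route.
        self.processors[route] = processor

    def handle(self, route: str, dataset: str, params: Dict[str, Any]) -> Any:
        # Query-time transform: runs per request on already-ingested data.
        return self.processors[route](self.datasets[dataset], params)


def mean_price(rows: List[Row], params: Dict[str, Any]) -> float:
    prices = [r["px"] for r in rows if r["ticker"] == params["ticker"]]
    return sum(prices) / len(prices)


store = [
    {"ticker": "AAA", "px": "10.5"},
    {"ticker": "BBB", "px": "20.0"},
    {"ticker": "AAA", "px": "11.5"},
]
server = SparkServerModel()
# Parse prices once at ingest time so that queries do not repeat the work.
server.ingest("prices", store, lambda r: {**r, "px": float(r["px"])})
server.register("/analytics/mean_price", mean_price)
result = server.handle("/analytics/mean_price", "prices", {"ticker": "AAA"})
print(result)
```

In this model the expensive parsing happens once during ingestion, and each request only runs the registered analytic over the in-memory rows.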
Bloomberg Spark Server
Spark Server: Content Caching
• Data access has long tail characteristics
• High value data sub-setted within Spark
• Specified as a filter predicate at time of registration
• Seamless unification of data in Spark and backing store
• Reliability?
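One way to picture predicate-based content caching is a lookup that serves high-value rows from memory and falls through to the backing store for the long tail. The sketch below is a plain-Python model, not Spark code; `CachedDataset`, its `get` method and the `store_reads` counter are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]


class CachedDataset:
    """Keeps only rows matching a registration-time predicate in memory."""

    def __init__(self, backing_store: Dict[str, Row],
                 predicate: Callable[[Row], bool]) -> None:
        self._store = backing_store
        # The filter predicate is fixed at registration time and decides
        # which subset of the data is cached.
        self._cache = {k: v for k, v in backing_store.items() if predicate(v)}
        self.store_reads = 0

    def get(self, key: str) -> Optional[Row]:
        # Callers see one interface; the source of the row is hidden.
        if key in self._cache:
            return self._cache[key]
        self.store_reads += 1
        return self._store.get(key)


store = {
    "AAA": {"ticker": "AAA", "volume": 9000},
    "BBB": {"ticker": "BBB", "volume": 12},
}
dataset = CachedDataset(store, lambda row: row["volume"] >= 1000)
hot = dataset.get("AAA")    # served from the cached subset
cold = dataset.get("BBB")   # falls through to the backing store
print(hot, cold, dataset.store_reads)
```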
Spark HA: State of the World
• Execution lineage in Driver
• Recovery from lost RDDs
• RDD Replication
• Low latency, even with lost executors
• Support for “MEMORY_ONLY”, “MEMORY_ONLY_2”, “MEMORY_ONLY_SER”, “MEMORY_ONLY_SER_2” modes for in-memory persistence. Easily extensible to more replicas if needed.
• Speculative execution
• Minimizing performance hit from stragglers
• Off-heap data
• Minimizing GC stalls
Spark Architecture
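RDD Block Replication
[Sequence diagram with participants Driver, Executor-1 and Executor-2. Message labels, in extracted order: Compute RDD; Computation complete; Get Peers for replication; List of Peers; Replicate block to Peer; Block stored locally; Results of computation.]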
RDD Block Replication: Challenges
• Lost RDD partitions costly to recover
• Data replenished at query time
• RDD replicated to random executors
• On YARN, multiple executors can be brought up on the same node in different containers
• Hence multiple replicas possible on the same node/rack, susceptible to node/rack failure
• Lost block replicas not recovered proactively
Topology Aware Replication (SPARK-15352)
• Making Peer selection for replication pluggable
• Driver gets topology information for executors
• Executors informed about this topology information
• Executors use prioritization logic to order peers for block replication
• Pluggable TopologyMapper and BlockReplicationPrioritizer
• Default implementation replicates current Spark behavior
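The pluggable design can be sketched as two small interfaces: one that maps a host to its topology, and one that orders candidate peers. The following is a plain-Python model rather than Spark's Scala code. The method names `rack_for_host` and `prioritize`, and the classes `StaticTopologyMapper`, `RandomPrioritizer` and `OtherRackFirstPrioritizer`, are assumptions made for illustration; only the role names TopologyMapper and BlockReplicationPrioritizer come from the slide.

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Peer:
    executor_id: str
    host: str
    rack: Optional[str] = None


class TopologyMapper(ABC):
    """Driver-side hook: resolves topology (here, a rack) for a host."""

    @abstractmethod
    def rack_for_host(self, host: str) -> Optional[str]:
        ...


class StaticTopologyMapper(TopologyMapper):
    def __init__(self, host_to_rack: Dict[str, str]) -> None:
        self._host_to_rack = dict(host_to_rack)

    def rack_for_host(self, host: str) -> Optional[str]:
        return self._host_to_rack.get(host)


class ReplicationPrioritizer(ABC):
    """Executor-side hook: orders peers for block replication."""

    @abstractmethod
    def prioritize(self, me: Peer, peers: List[Peer]) -> List[Peer]:
        ...


class RandomPrioritizer(ReplicationPrioritizer):
    """Models the default: peers are tried in random order."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)

    def prioritize(self, me: Peer, peers: List[Peer]) -> List[Peer]:
        shuffled = list(peers)
        self._rng.shuffle(shuffled)
        return shuffled


class OtherRackFirstPrioritizer(ReplicationPrioritizer):
    """Prefers peers on another rack, then peers on another host."""

    def prioritize(self, me: Peer, peers: List[Peer]) -> List[Peer]:
        # False sorts before True, so "different rack" comes first.
        return sorted(peers, key=lambda p: (p.rack == me.rack, p.host == me.host))


mapper = StaticTopologyMapper({"hostA": "rack1", "hostB": "rack1", "hostC": "rack2"})


def make_peer(executor_id: str, host: str) -> Peer:
    return Peer(executor_id, host, mapper.rack_for_host(host))


me = make_peer("e1", "hostA")
peers = [make_peer("e2", "hostA"), make_peer("e3", "hostB"), make_peer("e4", "hostC")]
ordered = OtherRackFirstPrioritizer().prioritize(me, peers)
print([p.executor_id for p in ordered])
```

Swapping `OtherRackFirstPrioritizer` for `RandomPrioritizer` models the default behavior, where a second executor on the same node is as likely to be chosen as one on another rack.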
Topology Aware Replication (SPARK-15352)
• Customizable prioritization strategies to suit different deployments
• Variety of replication objectives: ReplicateToDifferentHost, ReplicateBlockWithinRack, ReplicateBlockOutsideRack
• Optimizer to find a minimum number of peers to meet the objectives
• Replicate to these peers with a higher priority
• Proactive replenishment of lost replicas
• BlockManagerMasterEndpoint triggered replenishment when an executor failure is detected.
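The objective-driven ordering can be approximated with a greedy cover: repeatedly pick the peer that satisfies the most unmet objectives, then place the picked peers ahead of the rest. This is a hedged sketch, not the optimizer from SPARK-15352. The greedy heuristic is an assumption and is not guaranteed to find a true minimum in general, and the objective definitions below are an interpretation of the names on the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Peer:
    executor_id: str
    host: str
    rack: str


# An objective answers: does replicating from `me` to `peer` satisfy it?
Objective = Callable[[Peer, Peer], bool]

# Assumed meanings for the objective names listed on the slide.
OBJECTIVES: Dict[str, Objective] = {
    "ReplicateToDifferentHost": lambda me, p: p.host != me.host,
    "ReplicateBlockWithinRack": lambda me, p: p.rack == me.rack,
    "ReplicateBlockOutsideRack": lambda me, p: p.rack != me.rack,
}


def prioritize(me: Peer, peers: List[Peer],
               objectives: Dict[str, Objective] = OBJECTIVES) -> List[Peer]:
    """Greedy cover: peers that meet the objectives come first."""
    unmet = set(objectives)
    remaining = list(peers)
    chosen: List[Peer] = []
    while unmet and remaining:
        # Pick the peer that satisfies the most still-unmet objectives.
        best = max(remaining,
                   key=lambda p: sum(objectives[o](me, p) for o in unmet))
        covered = {o for o in unmet if objectives[o](me, best)}
        if not covered:
            break  # no remaining peer helps with the unmet objectives
        chosen.append(best)
        remaining.remove(best)
        unmet -= covered
    # Chosen peers get higher priority; the rest keep their original order.
    return chosen + remaining


me = Peer("e1", "h1", "r1")
peers = [Peer("e2", "h1", "r1"), Peer("e3", "h2", "r1"), Peer("e4", "h3", "r2")]
ordered = prioritize(me, peers)
print([p.executor_id for p in ordered])
```

With these inputs two peers are enough to meet all three objectives, so the executor sharing a host with the source ends up last in the ordering.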
Spark HA: Challenges
• High Availability of Spark Driver
• High bootstrap cost of reconstructing cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query inconsistency
• High Availability and Low Tail Latency closely related
Spark HA: A Strawman
• Multiple Spark Servers in Leader-Standby configuration
• Each Spark Server backed by a different Spark Cluster
• Each Spark Server refreshed with up-to-date data
• Queries to standbys redirected to leader
• Only leader responds to queries (data consistency)
• RDD Partition loss in the leader still a concern
• Performance still gated by slowest executor in leader
• Resource usage amplified by the number of Spark Servers
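The leader-standby strawman can be modeled in a few lines: every server holds a full copy of the data, but only the leader answers, and queries that reach a standby are redirected. This is a plain-Python illustration under stated assumptions. `SparkServerNode` and `LeaderStandbyGroup` are hypothetical names, and the "first live node wins" election is a placeholder for a real leader-election mechanism.

```python
from typing import Any, Dict, List, Tuple


class SparkServerNode:
    """One server with its own full copy of the data (hypothetical model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}
        self.is_leader = False
        self.alive = True

    def refresh(self, updates: Dict[str, Any]) -> None:
        self.data.update(updates)


class LeaderStandbyGroup:
    def __init__(self, nodes: List[SparkServerNode]) -> None:
        self.nodes = list(nodes)
        self.elect()

    def elect(self) -> SparkServerNode:
        # Placeholder election: the first live node becomes leader.
        for node in self.nodes:
            node.is_leader = False
        live = [n for n in self.nodes if n.alive]
        if not live:
            raise RuntimeError("no live servers")
        live[0].is_leader = True
        return live[0]

    def refresh_all(self, updates: Dict[str, Any]) -> None:
        # Every live server is kept up to date, which multiplies resource usage.
        for node in self.nodes:
            if node.alive:
                node.refresh(updates)

    def query(self, contacted: SparkServerNode, key: str) -> Tuple[str, Any]:
        leader = next(n for n in self.nodes if n.is_leader)
        if not leader.alive:
            leader = self.elect()  # fail over to a standby
        # Standbys never answer themselves; they redirect to the leader.
        target = contacted if contacted is leader else leader
        return target.name, target.data[key]


a, b = SparkServerNode("server-a"), SparkServerNode("server-b")
group = LeaderStandbyGroup([a, b])
group.refresh_all({"px": 101.5})
before = group.query(b, "px")   # b is a standby, so the leader answers
a.alive = False
after = group.query(b, "px")    # the leader is gone, so b takes over
print(before, after)
```

The model also makes the costs visible: each server carries the whole data set, and a slow or degraded leader still gates every query.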
Spark Driver State
• Spark Driver is an arbitrary Java application
• Only a subset of the state is interesting or expensive to reconstruct
• For online use cases, only RDDs/DFs created during ingestion are of interest
• Expressing ingestion using DFs has better decoupling of data/state than RDDs
Spark Driver State*
• BlockManagerMasterEndpoint holds Block <-> Executor assignment
• CacheManager holds LogicalPlan and DataFrame references
• Used to short-circuit queries with pre-cached query plans, if possible
• JobScheduler
• Keeps track of various stages and tasks being scheduled
• Executor information
• Hostname and ports of live executors
*Illustrative, not exhaustive
Externalizing Driver State
Benefits:
• Quicker recoveries
• No need to restart executors
• State accessible from multiple Active-Active drivers
Solutions:
• Off-heap storage for RDDs
• Residual book-keeping driver state externalized to ZooKeeper
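The benefit of externalizing book-keeping state can be shown with a driver that writes its block-to-executor map to an external store on every change, so that a replacement driver starts from that state instead of rebuilding it. The sketch below is an assumption-laden model: `InMemoryStateStore` is a dictionary-backed stand-in for a coordination service such as ZooKeeper, and its `put`/`get` methods, the `Driver` class and the state path are hypothetical, not a real client API.

```python
import json
from typing import Dict, List, Optional


class InMemoryStateStore:
    """Dictionary-backed stand-in for an external store such as ZooKeeper."""

    def __init__(self) -> None:
        self._nodes: Dict[str, bytes] = {}

    def put(self, path: str, value: bytes) -> None:
        self._nodes[path] = value

    def get(self, path: str) -> Optional[bytes]:
        return self._nodes.get(path)


class Driver:
    STATE_PATH = "/spark-server/driver-state"  # hypothetical path

    def __init__(self, store: InMemoryStateStore) -> None:
        self.store = store
        raw = store.get(self.STATE_PATH)
        # A new driver recovers book-keeping state instead of rebuilding it.
        self.block_locations: Dict[str, List[str]] = json.loads(raw) if raw else {}

    def record_block(self, block_id: str, executor_id: str) -> None:
        holders = self.block_locations.setdefault(block_id, [])
        if executor_id not in holders:
            holders.append(executor_id)
        # Write through on every change so the external copy stays current.
        payload = json.dumps(self.block_locations).encode("utf-8")
        self.store.put(self.STATE_PATH, payload)


store = InMemoryStateStore()
first = Driver(store)
first.record_block("rdd_0_0", "exec-1")
first.record_block("rdd_0_0", "exec-2")
del first  # simulate losing the driver process

second = Driver(store)  # the replacement driver reads the externalized state
print(second.block_locations)
```

Because the executors and their blocks are untouched in this model, the replacement driver only needs the externalized map to resume serving, which is the "no need to restart executors" benefit listed above.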
Quorum of Drivers