Building large scale, job processing systems with Scala Akka Actor framework

Building massive scale, fault tolerant, job processing systems with Scala Akka framework Vignesh Sukumar SVCC 2012

description

The Akka Actor framework is designed to be a fast message processing system. In this talk, we will explain how, at Box, we have used this framework to develop a large scale job processing system that works on billions of data files and achieves a high degree of throughput and fault tolerance. Over the course of the talk, we will explore the usage of Akka framework’s Supervisor functionality to provide a more controllable fault-tolerance strategy, and how we can use Futures to manage asynchronous jobs.

Transcript of Building large scale, job processing systems with Scala Akka Actor framework

Page 1: Building large scale, job processing systems with Scala Akka Actor framework

Building massive scale, fault tolerant, job processing systems with Scala Akka framework

Vignesh Sukumar

SVCC 2012

Page 2

About me

• Storage group, Backend Engineering at Box

• Love enterprise software!

• Interested in Big Data and building distributed systems in the cloud

Page 3

About Box

• Leader in enterprise cloud collaboration and storage

• Cutting-edge work in backend, frontend, platform and engineering services

• A really fun place to work – we have a long slide!

Page 4

Talk outline

• Job processing requirements

• Traditional & new models for job processing

• Akka actors framework

• Achieving and controlling high IO throughput

• Fine-grained fault tolerance

Page 5

Typical architecture in a cloud storage environment

Page 6

Practical realities

• Storage nodes usually have varying configurations (OS, processing power, storage capacity, etc.), mainly because provisioning operations evolve rapidly

• Some nodes are more overworked than others (for example, those accepting live uploads)

• Billions of files; petabytes of data

Page 7

Job processing requirements

• Iterate over all files (billions, petabyte scale): for example, check the consistency of all files

• High throughput

• Fault tolerant

• Secure

Page 8

Traditional job processing model

Page 9

Why traditional models fail in cloud storage environments

• Not scalable: petabyte scale, billions of files

• Insecure: cannot move files out of storage nodes

• No performance control: easy to overwhelm any storage node

• No fine-grained fault tolerance

Page 10

Compute on Storage

• Move job computation directly to storage nodes

• Utilize abundant CPU on storage nodes

• Metadata store still stays in a highly available system like an RDBMS

• Results from operations on a file are completely independent

Page 11

Master – slave architecture

Page 12

Benefits

• High IO throughput: Direct access; no transfer of files over a network

• Secure: files do not leave storage nodes

• Better performance control: compute can easily monitor system load and back off

• Better fault tolerance handling: finer-grained handling of errors

Page 13

Master node

• Responsible for accepting job submissions and splitting them into tasks for slave nodes

• Stateful: keeps a durable copy of jobs and tasks in ZooKeeper

• Horizontally scalable: service can be run on multiple nodes
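The job-splitting step can be sketched as follows. This is only an illustration under assumptions: the talk does not show the master's data model, so `FileRef`, `Job`, `Task`, `JobSplitter` and the `maxFilesPerTask` parameter are all hypothetical names, and persistence to ZooKeeper is left out.

```scala
// Hypothetical data model: none of these names come from the talk.
final case class FileRef(node: String, path: String)
final case class Job(id: String, files: Seq[FileRef])
final case class Task(jobId: String, node: String, paths: Seq[String])

object JobSplitter {
  // Group files by the storage node that holds them, then cut each group
  // into batches so that no task carries more than maxFilesPerTask files.
  def split(job: Job, maxFilesPerTask: Int): Seq[Task] = {
    require(maxFilesPerTask > 0, "maxFilesPerTask must be positive")
    job.files
      .groupBy(_.node)
      .toSeq
      .sortBy(_._1) // sort by node name so the task order is deterministic
      .flatMap { case (node, refs) =>
        refs.map(_.path).grouped(maxFilesPerTask).map(paths => Task(job.id, node, paths))
      }
  }
}
```

Because every task names exactly one storage node, each agent only ever receives work on files it holds locally, which matches the compute-on-storage idea.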

Page 14

Agent

• Runs directly on the storage nodes in a machine-independent JVM container

• Stateless: no task state is maintained

• Monitors system load with back-off

• Reports results directly to master without synchronizing with other agents
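One way an agent might turn system load into a back-off delay is sketched below. The talk does not describe the actual policy, so the formula, the `LoadBackoff` object and all parameter names and default values are assumptions. The load reading uses the JVM's standard `OperatingSystemMXBean`, whose `getSystemLoadAverage` returns a negative value when the platform cannot report it.

```scala
import java.lang.management.ManagementFactory

object LoadBackoff {
  // Returns 0 while the per-core load is at or below the threshold (or unknown).
  // Above the threshold the delay grows in proportion to the overload,
  // capped at maxDelayMs. The policy itself is an assumption, not Box's.
  def delayMillis(loadAverage: Double, cores: Int, threshold: Double,
                  baseDelayMs: Long, maxDelayMs: Long): Long = {
    val perCore = loadAverage / cores
    if (loadAverage < 0 || perCore <= threshold) 0L
    else math.min(maxDelayMs, (baseDelayMs * (perCore / threshold)).toLong)
  }

  // Reads the current load of the machine the agent runs on.
  def currentDelayMillis(): Long = {
    val os = ManagementFactory.getOperatingSystemMXBean
    delayMillis(os.getSystemLoadAverage, os.getAvailableProcessors,
      threshold = 0.7, baseDelayMs = 100L, maxDelayMs = 5000L)
  }
}
```

A worker would call `LoadBackoff.currentDelayMillis()` before picking up the next file and sleep for that long when the result is non-zero.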

Page 15

Implementation with the Scala Akka Actor framework

Page 16

Actors

• Concurrent threads abstraction with no shared state

• Exchange messages

• Asynchronous, non-blocking

• Multiple actors can map to a single OS thread

• Parent-children hierarchical relationship

Page 17

Actors and messages

    import akka.actor.{Actor, ActorSystem, Props}

    case object MsgType1

    class MyActor extends Actor {
      def receive = {
        case MsgType1 => // do something
      }
    }

    // instantiation and sending messages
    val system = ActorSystem("example")
    val actorRef = system.actorOf(Props(new MyActor))
    actorRef ! MsgType1

Page 18

Agent Actor System

Page 19

Achieving high IO throughput

• Parallel, asynchronous IO through “Futures”

    val fileIOResult = Future {
      // issue high latency tasks like file IO
    }
    val networkIOResult = Future {
      // read from network
    }

    // wait for both futures to complete
    Futures.awaitAll(<wait time>, fileIOResult, networkIOResult)

    fileIOResult onSuccess { case result => /* do something */ }
    networkIOResult onFailure { case error => /* retry */ }
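The same pattern can be sketched with the `scala.concurrent` futures from the Scala standard library, which is an assumption made so the example runs without Akka; it is not the API used on the slide. `ParallelIo`, `checkFile` and the two function parameters are hypothetical stand-ins for the file and network reads, and the single retry is just one possible policy.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelIo {
  // readChecksum and fetchExpected are hypothetical stand-ins for the
  // high-latency file read and network read mentioned on the slide.
  def checkFile(path: String,
                readChecksum: String => String,
                fetchExpected: String => String,
                timeout: FiniteDuration = 5.seconds): Boolean = {
    // Both futures start running as soon as they are created,
    // so the file read and the network read overlap.
    val fileIoResult = Future(readChecksum(path))
    val networkIoResult = Future(fetchExpected(path))
      .recoverWith { case _: Exception => Future(fetchExpected(path)) } // retry once

    // Combine the two results and block only once, with an upper bound.
    val (actual, expected) = Await.result(fileIoResult.zip(networkIoResult), timeout)
    actual == expected
  }
}
```

Blocking with `Await.result` is kept here only to mirror the slide's explicit wait; inside an actor it is usually preferable to react to the combined future without blocking.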

Page 20

Controlling system throughput

• The problem: agents need to throttle themselves as storage nodes serve live traffic

• Adjust number of parallel workers dynamically through a monitoring service

Page 21

Controlling throughput: Examples

• Parallelism parameters can be fetched from a separate configuration service on a per-node basis

• Some machines can be sped up and others slowed down this way

• The configuration can be updated on a cron schedule to speed up during weekends
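A minimal sketch of a parallelism limit that can be changed while the agent is running is shown below. The class name `AdjustableLimiter` and its methods are hypothetical, and how the new limit reaches the agent (polling the configuration service, a push message, and so on) is not covered by the talk and is left out here.

```scala
// Hypothetical helper: caps the number of tasks in flight and lets the cap
// be changed at runtime, for example after polling a configuration service.
final class AdjustableLimiter(initialLimit: Int) {
  require(initialLimit >= 0, "limit must not be negative")
  private var limit = initialLimit
  private var inFlight = 0

  // Applied immediately. Lowering the limit never interrupts running tasks;
  // it only stops new ones from starting until enough have finished.
  def setLimit(newLimit: Int): Unit = synchronized {
    require(newLimit >= 0, "limit must not be negative")
    limit = newLimit
  }

  // Non-blocking: a worker that gets false should back off and try again later.
  def tryAcquire(): Boolean = synchronized {
    if (inFlight < limit) { inFlight += 1; true } else false
  }

  // Must be called once for every successful tryAcquire.
  def release(): Unit = synchronized {
    if (inFlight > 0) inFlight -= 1
  }

  def running: Int = synchronized(inFlight)
}
```

Setting the limit to 0 pauses a node entirely without restarting the agent, which is one way to protect a machine that is busy serving live traffic.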

Page 22

Fine grained fault tolerance with Supervisors

• Parents of child actors can define specific fault-handling strategies for each failure scenario in their children

• Components can fail gracefully without affecting the entire system

Page 23

Supervision strategy: Examples

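    import java.io.IOException
    import java.sql.SQLException
    import akka.actor.{Actor, OneForOneStrategy}
    import akka.actor.SupervisorStrategy.{Restart, Resume, Stop}

    // FileCorruptionException is an application-defined exception type
    class TaskActor extends Actor {
      // create child workers
      override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 3) {
        case _: SQLException            => Resume  // retry the same file
        case _: FileCorruptionException => Stop    // don’t clobber it!
        case _: IOException             => Restart // report and move on
      }
      def receive = { case task => /* hand the task to a child worker */ }
    }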
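The decision logic of such a strategy can be exercised on its own. The sketch below deliberately does not use Akka: the `Directive` type and its four values are local stand-ins that mirror the names of Akka's supervision directives, and `FileCorruptionException` is defined here only because the slide names it without showing it. Treat it as a model of the mapping, not as the framework's API.

```scala
import java.io.IOException
import java.sql.SQLException

// Local stand-ins mirroring the names of Akka's supervision directives.
sealed trait Directive
case object Resume extends Directive
case object Restart extends Directive
case object Stop extends Directive
case object Escalate extends Directive

// Hypothetical definition of the exception type named on the slide.
final class FileCorruptionException(msg: String) extends Exception(msg)

object TaskDecider {
  // Cases are tried top to bottom, so specific exception types must come
  // before their supertypes. Anything unmatched is passed up to the parent.
  val decide: Throwable => Directive = {
    case _: SQLException            => Resume
    case _: FileCorruptionException => Stop
    case _: IOException             => Restart
    case _                          => Escalate
  }
}
```

Note that `FileNotFoundException` is a subclass of `IOException`, so under this mapping a missing file leads to a restart of the failing child rather than a stop.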

Page 24

Unit testing

• Scalatra test framework: very easy to read!

    TaskActorTest.receive(BadFileMsg) must throw FileNotFoundException

• Mocks for network and database calls

    val mockHttp = mock[HttpExecutor]
    TaskActorTest ! doHttpPost
    there was atLeastOne(mockHttp).POST

• Extensive testing of failure injection scenarios

Page 25

Takeaways

• Keep your architecture simple by modeling actor message flow along the same paths as the parent-child actor hierarchy (i.e., no message exchange between peer child actors)

• Design and implement for component failures

• Write unit tests extensively: we did not have any fundamental functionality breakage

• Box Engineering is awesome!