Migratory Workloads Across Clouds with Nomad

Phil Watts DevOps Artificer @ REĀN Cloud

PROBLEM STATEMENT

“FLEXING BETWEEN THE CLOUDS”

▸ Goals of Virtualization seem universally applicable

▸ != Vendor Lock-in

▸ Not all workloads are valued equally

=> IT Magic Anywhere

SUCCESS CRITERIA

WIN CONDITIONS

‣ Availability of compute resources is independent of the cloud provider

‣ Batch jobs can be allocated based on point-in-time cost metrics

‣ Work segregation based on compliance qualifications

TOOLCHAIN

MY “FAVORITE” TOYS

Resources

Image Creation

Infrastructure Provisioning

Service Discovery

Scheduler

Driver

DEFINITIONS: RESOURCE CONTEXT

THE BANE OF TECHNICAL UNDERSTANDING (AKA WORDS):

▸ Region: The isolation boundary of a Nomad Cluster

▸ Datacenter: Low latency, high bandwidth, private network

▸ Resources: The available capacity provided by a node

Common / Comfortable Pattern

          Region        Datacenter
AWS       Continental   AWS_Region
GCE       Continental   GCE_Region
Azure     Location      Location

Ideal Pattern

          Region        Datacenter
AWS       Global        AWS_Region
GCE       Global        GCE_Region
Azure     Global        Sets of Locations

NOMAD ARCHITECTURE - SINGLE REGION VIEW

BDFL FOR WORKLOAD DECISIONS

‣ In Nomad, Datacenters can speak to Region Aware Servers

‣ Datacenters don’t need to be on the same platform

‣ Default Region is “global”
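
A minimal sketch of an agent configuration for one client node in this layout, assuming a single "global" region with one datacenter per cloud region. The datacenter name, data directory, and server address are hypothetical placeholders, not values from the talk.

```hcl
# Hypothetical agent configuration for a Nomad client running in AWS.
region     = "global"          # the default region name, shared across clouds
datacenter = "aws-us-east-1"   # hypothetical name: one datacenter per cloud region
data_dir   = "/var/lib/nomad"  # assumed path

client {
  enabled = true

  # Hypothetical address of a region-aware Nomad server
  servers = ["nomad-server-1.example.internal:4647"]
}
```

A client in GCE or Azure would keep the same region and change only the datacenter value.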

ARCHITECTURE OF SOLUTION

▸ Nomad Clients potentially provide Resources for Jobs

▸ Communication between Datacenters may need to be secured

▸ Nodes run a Consul Agent and Nomad Client

▸ Servers “Bin Pack” task groups onto nodes

THREE PICTURES OF THE SAME THING

Single Region / Multi DataCenter (different Clouds)

DEFINITIONS: TASK CONTEXT

WORDS: THE SEQUEL

▸ Job: Desired state declaration of a workload, composed of one or more Tasks

▸ Constraints: Rules limiting where a job can run

▸ Evaluations: Queued request to compare desired and present state of work over the region

    ▸ Caused by a state change event

        ▸ Job Complete/Failure

        ▸ Node Add/Failure

        ▸ Job Scheduled

▸ Allocations: Mapping of tasks to resources within constraints

JOB TYPES: SERVICE

KEEPING THE SITE UP

▸ Long running jobs that should always be available

▸ Scheduling decisions favor QoS

▸ Example: Ensuring a front end web service is always available
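
A service job for the front end example might look roughly like the sketch below. The job name, datacenter names, image tag, and resource figures are illustrative assumptions.

```hcl
# Sketch of a long-running service job; all names and sizes are hypothetical.
job "frontend" {
  # The same job may be placed in datacenters on different clouds
  datacenters = ["aws-us-east-1", "gce-us-central1"]
  type        = "service"

  group "web" {
    count = 3 # keep three instances running at all times

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.11"
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```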

JOB TYPES: BATCH

WHAT TO DO WITH ALL THIS DATA?

▸ A set of work spanning a few minutes to a few days

▸ Based on the Berkeley Sparrow Two Choices model

▸ http://people.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf

▸ Probes a set of nodes which meet constraints and sends work to the "least loaded" nodes

▸ Example: Tasks to manipulate a queue of data when present
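
A batch job differs from a service job mainly in its type. The sketch below assumes a hypothetical container image and command that drain a queue and then exit.

```hcl
# Sketch of a batch job; the image, command, and names are hypothetical.
job "queue_worker" {
  datacenters = ["aws-us-east-1", "gce-us-central1"]
  type        = "batch"

  group "workers" {
    count = 5

    task "drain" {
      driver = "docker"

      config {
        image   = "example/queue-worker:1.0"
        command = "/usr/local/bin/drain-queue"
        args    = ["--until-empty"]
      }

      resources {
        cpu    = 1000 # MHz
        memory = 512  # MB
      }
    }
  }
}
```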

JOB TYPES: SYSTEM

KEEPING THE LIGHTS ON

▸ A unique job type used to declare jobs which should run on every node which meets the job constraints

▸ Are re-evaluated whenever a node joins the cluster

▸ Example: distributing common tasks, which can benefit from rolling updates, job updates, service discovery, etc
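
A system job could be sketched as follows, assuming a hypothetical metrics collector that should run on every Linux node. The update block shows one way to stagger a rolling update; the image and names are placeholders.

```hcl
# Sketch of a system job; runs once on every node that meets the constraint.
job "node_metrics" {
  datacenters = ["aws-us-east-1", "gce-us-central1"]
  type        = "system"

  # Only nodes running a Linux kernel are eligible
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  # Roll out changes one node at a time, 30 seconds apart
  update {
    stagger      = "30s"
    max_parallel = 1
  }

  group "metrics" {
    task "collector" {
      driver = "docker"

      config {
        image = "example/metrics-collector:1.0"
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB
      }
    }
  }
}
```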

NOMAD SCHEDULING INTERNALS

GETTING FROM WORK AND RESOURCES TO ACCOMPLISHMENTS

▸ Evaluations read the job spec and find constraints

▸ Evaluation Brokers maintain the pending queue, priority, and at-least-once delivery

▸ Schedulers submit an Allocation Plan, evaluated for feasibility, followed by priority

▸ Allocations set jobs against resources

BIN PACKING

LIKE TETRIS FOR WORKLOADS

▸ Tasks require resources

▸ Nodes have “dimensions” of resources

▸ Allocation fits Tasks inside Nodes

TASK GROUPS

PREVENTING TASK SEPARATION ANXIETY

▸ Task Groups allow multiple Tasks to require that they are scheduled on the same node

▸ A task group is created implicitly for single tasks in isolation

▸ Can be used to enforce compliance elements required to run together

▸ Example: Requiring log shipping co-processes
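
The log shipping case can be sketched as two tasks in one group, which Nomad places together on a single node. Both images and all names are hypothetical.

```hcl
# Sketch of a task group that co-locates an application with its log shipper.
job "billing" {
  datacenters = ["aws-us-east-1"]
  type        = "service"

  group "api_with_logging" {
    # Every task in this group lands on the same node

    task "api" {
      driver = "docker"

      config {
        image = "example/billing-api:1.0"
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }

    task "log_shipper" {
      driver = "docker"

      config {
        image = "example/log-shipper:1.0"
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB
      }
    }
  }
}
```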

CONSTRAINTS

JUST BECAUSE YOU CAN, DOESN’T MEAN YOU SHOULD

▸ Job Constraints limit the resources available to a particular job or task group

▸ Constraints can map workloads directly to Customized Hardware such as AWS Placement Groups
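
As a rough illustration, constraints can be declared at the job level and again at the group level. The node class "placement-group-a" is a hypothetical label that an operator would assign to the matching clients through the node_class setting; it is not something Nomad detects on its own.

```hcl
# Sketch of job-level and group-level constraints; names are hypothetical.
job "hpc_simulation" {
  datacenters = ["aws-us-east-1"]
  type        = "batch"

  # Applies to every group in the job
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "solver" {
    # Applies only to this group: run on nodes the operator labeled
    # as belonging to the specialized hardware pool
    constraint {
      attribute = "${node.class}"
      value     = "placement-group-a"
    }

    task "solve" {
      driver = "docker"

      config {
        image = "example/solver:1.0"
      }

      resources {
        cpu    = 2000 # MHz
        memory = 4096 # MB
      }
    }
  }
}
```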

CONSTRAINTS AND COMPLIANCE

SATISFYING COMPLIANCE REQUIREMENTS

▸ Constraints on datacenter can be used for Data Isolation inside National Boundaries.

    ▸ Healthcare workload that must stay within the EU

▸ Metadata attributes can allow for custom declarations.

    ▸ Ex. PCI DSS Compliance:

        ▸ Maintain network firewall

        ▸ Run Anti-Malware/Anti-Virus protection

        ▸ Monitor and Log Access

        ▸ Regularly Test Security systems and procedures.

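    job "sample_service" {
      ...
      # Job-level metadata, passed through to the job's tasks
      meta {
        pci_dss = "true"
      }
      group "webservice" {
        # Place this group only on nodes whose client metadata sets pci_dss
        constraint {
          attribute = "${meta.pci_dss}"
          value     = "true"
        }
      }
    }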

Constraint Snippet
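
A constraint on node metadata only matches if the client declares that metadata. A minimal sketch of the client side, assuming the operator has verified the node against the PCI DSS checklist and uses a hypothetical EU datacenter name:

```hcl
# Hypothetical client configuration for a node vetted for PCI DSS workloads.
region     = "global"
datacenter = "aws-eu-west-1" # hypothetical name for an EU datacenter

client {
  enabled = true

  # Custom key/value pairs, matched by job constraints as ${meta.<key>}
  meta {
    pci_dss = "true"
  }
}
```

For the data isolation case, a job restricted to the EU would list only EU datacenters in its datacenters setting.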

CONSTRAINTS: SATISFYING SPECIAL NEEDS

DIFFERENT THINGS ARE DIFFERENT

▸ Not all platforms are created equal

▸ Platform attributes for specifying Cloud Platforms

▸ ${attr.platform} = aws may be relevant if your task needs a VPC-restricted Lambda

    job "sample_service" {
      ...
      # Only place this job on nodes that report the AWS platform
      constraint {
        attribute = "${attr.platform}"
        value     = "aws"
      }
    }

RAW EXECS

CHEKHOV’S TASK DRIVER

▸ Unconstrained, Un-isolated, Disabled by Default

▸ Runs as the user Nomad is running as

▸ Disabled by default; enabled through a client option:

    client {
      options = {
        "driver.raw_exec.enable" = "1"
      }
    }

“IT SEEMS TO BE A DEEP INSTINCT IN HUMAN BEINGS FOR MAKING EVERYTHING COMPULSORY THAT ISN'T FORBIDDEN”

~Robert A. Heinlein
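
Once the driver is enabled on a client, a task can use it as sketched below. The job name and command are hypothetical, and the command runs without isolation as the user the Nomad client runs as.

```hcl
# Sketch of a task using the raw_exec driver; names are hypothetical.
job "host_check" {
  datacenters = ["aws-us-east-1"]
  type        = "batch"

  group "checks" {
    task "uptime" {
      driver = "raw_exec"

      config {
        # Executed directly on the host, with no resource isolation
        command = "/usr/bin/uptime"
      }
    }
  }
}
```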

OPERATOR INTERACTION

RELIABLE MAGIC = OPERATIONS

    $ nomad run -address=$nomad_server jobfile.nomad

‣ Operators schedule jobs against a server

‣ Nomad figures out how/where/when to run tasks

‣ Complex solution through iteration

THANK YOU

Phil Watts

DevOps Artificer @ REĀN Cloud

@pwattstbd

github.com/marsupermammal

[email protected]

www.reancloud.com