Re:platforming the Datacenter with Apache Mesos

Post on 22-May-2020


Christos Kozyrakis

Re:platforming the Datacenter with Apache Mesos

Why your ASF project should run on Mesos

O(10K) commodity servers

High-speed networking

Distributed storage (HDD, Flash)

x10 MWatt

x100 M$

ops developers

automation

performance

automation

efficiency

①  Datacenter past

Static Partitioning

Hadoop, Cassandra, Rails, Jenkins, memcached

[Diagram: ops and developers; automation, performance, efficiency]

②  Datacenter present

Apache Mesos

The datacenter OS kernel

Aggregates all resources into a single shared pool

Dynamically allocates resources to distributed apps

Container management at scale (cgroups, docker, …)

Mesos Architecture

[Diagram: frameworks (Marathon, Jenkins, …); Mesos masters, each with allocator, AuthN, and AuthZ; servers running a slave with executors and tasks]

Scales to 10s of thousands of servers

Mesos Fault Tolerance

[Diagram: the frameworks, masters, and servers of the architecture slide, repeated for each failure case]

Tasks survive failures of the master

Tasks survive failures of the framework

Tasks survive failures of the slave process

Scheduling in Mesos

No single scheduler fits all needs

Long-running services need scale up/down, fault tolerance

Analytics services need fast task launching

2-level Scheduling

Mesos master (single API):

•  Resource allocation (offers)

•  Task health checks

•  Task isolation

Mesos frameworks (domain-specific APIs):

•  Scale up/down

•  Fault tolerance

•  Task grouping

•  Task dependencies

•  Queuing & priorities
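The split above can be sketched as a toy simulation: the master only hands out resource offers, and each framework decides for itself what to launch on them. This is a minimal illustration, not the Mesos API; the class names (Offer, Framework, Master), the on_offer and offer_round methods, and the fixed per-task sizes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    slave: str
    cpus: int
    mem_mb: int


@dataclass
class Framework:
    """Hypothetical framework with one fixed task shape and a backlog."""
    name: str
    task_cpus: int
    task_mem_mb: int
    pending: int
    launched: list = field(default_factory=list)

    def on_offer(self, offer):
        # Second level: the framework decides how much of the offer to use.
        tasks = []
        cpus, mem = offer.cpus, offer.mem_mb
        while self.pending > 0 and cpus >= self.task_cpus and mem >= self.task_mem_mb:
            cpus -= self.task_cpus
            mem -= self.task_mem_mb
            self.pending -= 1
            tasks.append((offer.slave, self.task_cpus, self.task_mem_mb))
        self.launched.extend(tasks)
        return tasks  # an empty list means the offer was declined


class Master:
    """Hypothetical master: tracks free resources and makes offers."""

    def __init__(self, slaves):
        self.free = dict(slaves)  # slave name -> (free cpus, free mem_mb)

    def offer_round(self, frameworks):
        # First level: offer each slave's free resources to each framework.
        for fw in frameworks:
            for slave, (cpus, mem) in list(self.free.items()):
                if cpus <= 0 or mem <= 0:
                    continue
                for _, c, m in fw.on_offer(Offer(slave, cpus, mem)):
                    cpus -= c
                    mem -= m
                self.free[slave] = (cpus, mem)
```

In this sketch the master never needs to know what a "service" or a "job" is; policies such as scaling, grouping, or priorities would live entirely inside each framework's on_offer logic.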

2-level Scheduling Benefits

Multiple APIs to Mesos through frameworks:

•  Marathon, Aurora, Singularity

•  Storm, Spark, Hadoop, Chronos

•  Cassandra, Elasticsearch

Multiple task scheduling approaches:

•  Spark fine-grain vs Spark coarse-grain

Simple, stable, and scalable Mesos master

Dynamic Resource Allocation

Resource allocation based on framework roles

Dominant resource fairness (DRF):

•  Weighted fair share calculated based on dominant resource

•  Frameworks do no worse than having a weight-sized cluster

Resource reservations:

•  Resources can be allocated to specific frameworks if needed
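To make the DRF bullet concrete, here is a minimal progressive-filling sketch: at each step, one more task goes to the framework whose weighted dominant share is currently lowest. It is an illustration under simplifying assumptions (fixed per-task demands, integer resource quantities, frameworks dropped once their next task no longer fits), not the Mesos allocator. The function name drf_allocate and its arguments are hypothetical; the example numbers are the two-user example from the DRF paper listed in the references.

```python
from fractions import Fraction


def drf_allocate(capacity, demands, weights=None):
    """Return the number of tasks each framework gets under DRF.

    capacity: resource name -> total amount (integers)
    demands:  framework name -> per-task demand per resource (integers)
    weights:  optional framework name -> weight (default 1)
    """
    weights = weights or {name: 1 for name in demands}
    used = {r: 0 for r in capacity}
    alloc = {name: {r: 0 for r in capacity} for name in demands}
    tasks = {name: 0 for name in demands}
    active = set(demands)

    def dominant_share(name):
        # Largest fraction of any single resource held, scaled by weight.
        share = max(Fraction(alloc[name][r], capacity[r]) for r in capacity)
        return share / weights[name]

    while active:
        # Pick the framework with the lowest weighted dominant share;
        # sorting first makes tie-breaking deterministic.
        name = min(sorted(active), key=dominant_share)
        demand = demands[name]
        if any(used[r] + demand[r] > capacity[r] for r in capacity):
            active.discard(name)  # next task no longer fits
            continue
        for r in capacity:
            used[r] += demand[r]
            alloc[name][r] += demand[r]
        tasks[name] += 1
    return tasks


capacity = {"cpus": 9, "mem_gb": 18}
demands = {
    "A": {"cpus": 1, "mem_gb": 4},  # memory is A's dominant resource
    "B": {"cpus": 3, "mem_gb": 1},  # CPU is B's dominant resource
}
print(drf_allocate(capacity, demands))  # {'A': 3, 'B': 2}
```

Both frameworks end with a dominant share of 2/3: A holds 12 of 18 GB and B holds 6 of 9 CPUs, which is more than either would get from a static half of the cluster.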

Dynamic Resource Allocation

Rails, Hadoop, memcached

Buy fewer machines or run more applications!

Service Discovery

[Diagram: Mesos DNS alongside the Mesos Master and its Slaves]

Mesos DNS

①  Watch ZK for master changes

②  Pull task state, generate DNS records

③  DNS & HTTP based discovery

nginx.marathon.mesos → 10.13.17.95

_nginx._tcp.marathon.mesos → 10.13.17.95:8181
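Step ② can be illustrated with a short sketch that turns task state into the two record names shown above. The naming pattern is read off the slide's examples; the function name dns_records, the shape of the task dictionaries, and the "ip:port" rendering of the SRV target (a simplification that mirrors the slide) are assumptions for illustration, not the Mesos-DNS implementation.

```python
def dns_records(tasks, domain="mesos", proto="tcp"):
    """Build A-style and SRV-style entries from a list of task states.

    Each task is a hypothetical dict with the keys
    "name", "framework", "ip", and "port".
    """
    records = []
    for task in tasks:
        suffix = f"{task['framework']}.{domain}"
        # A record: <task>.<framework>.<domain> -> IP address
        records.append(("A", f"{task['name']}.{suffix}", task["ip"]))
        # SRV record: _<task>._<proto>.<framework>.<domain> -> IP and port
        records.append(
            ("SRV", f"_{task['name']}._{proto}.{suffix}", f"{task['ip']}:{task['port']}")
        )
    return records


state = [{"name": "nginx", "framework": "marathon", "ip": "10.13.17.95", "port": 8181}]
for kind, name, target in dns_records(state):
    print(kind, name, "->", target)
```

The A-style entry is enough when the port is well known; the SRV-style entry also carries the port, which matters when tasks are placed on dynamically assigned ports.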

③  Datacenter future

Utilization Reality

Twitter (Mesos) Google (Borg)

[Barroso’09] [Delimitrou’14]

The Curse of Overprovisioning

[Delimitrou’14]

Bloated reservations to deal with diurnal load patterns, load spikes, software & platform changes

Oversubscription

[Diagram, repeated on each step: frameworks (Marathon, Jenkins, …); Mesos masters, each with allocator, AuthN, and AuthZ; servers running slaves with executors and tasks]

Regular offers:

①  offer<s1, 4cores, …>

②  offer<s1, 4cores, …>

③  launch<tasks, s1, 4cores, …>

④  launch<tasks, 4cores, …>

BE offers:

①  offer<s1, BE, 2cores, …>

②  offer<s1, BE, 2cores, …>

③  launch<tasks, BE, s1, 2cores, …>

④  launch<tasks, BE, 2cores, …>

Task killed:

①  Status<task, killed, …>

②  Status<task, killed, …>
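The message sequence above can be read as a simple policy, sketched here under stated assumptions: "BE" is taken to mean best-effort tasks running on cores that regular tasks have been allocated but are not currently using, and a BE task is killed once regular tasks need those cores back. The Task class, the two function names, and the whole-core rounding are hypothetical and only illustrate the idea; they are not how Mesos estimates or revokes resources.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cores_allocated: int
    cores_used: float          # measured usage
    best_effort: bool = False


def best_effort_slack(tasks):
    """Whole cores that could be offered to BE tasks right now."""
    regular_slack = sum(
        t.cores_allocated - t.cores_used for t in tasks if not t.best_effort
    )
    be_allocated = sum(t.cores_allocated for t in tasks if t.best_effort)
    return max(0, math.floor(regular_slack) - be_allocated)


def tasks_to_kill(total_cores, tasks):
    """BE tasks to kill so that regular usage plus BE allocations fit."""
    regular_used = sum(t.cores_used for t in tasks if not t.best_effort)
    be_tasks = [t for t in tasks if t.best_effort]
    be_allocated = sum(t.cores_allocated for t in be_tasks)
    victims = []
    for t in be_tasks:
        if regular_used + be_allocated <= total_cores:
            break
        victims.append(t.name)
        be_allocated -= t.cores_allocated
    return victims


web = Task("web", cores_allocated=4, cores_used=2.0)
print(best_effort_slack([web]))           # 2 cores can be offered as BE

batch = Task("batch", cores_allocated=2, cores_used=2.0, best_effort=True)
web.cores_used = 3.5                      # regular load rises
print(tasks_to_kill(4, [web, batch]))     # ['batch']
```

The point of the sketch is the asymmetry: regular tasks keep their full allocation, while BE tasks only ever run on slack and are the first to go when that slack disappears.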

Interference → Performance Loss

Impact of interference on websearch’s latency [Lo’15]

Load          10%    20%    30%    40%    50%    60%    70%    80%    90%

L3 Cache    >300%  >300%  >300%  >300%  >300%  >300%  >300%   264%   123%

DRAM        >300%  >300%  >300%  >300%  >300%  >300%  >300%   270%   122%

HyperThread  110%   107%   114%   115%   105%   117%   120%   136%  >300%

CPU power    124%   107%   116%   109%   115%   105%   101%   100%   100%

Network       36%    36%    37%    37%    39%    42%    48%    55%    64%

Color scale: 0%, 100%, 300% (OK to BAD)

Isolators

Mesos slave invokes isolators:

•  Modules that monitor & isolate resources for executors

Isolator modules:

•  CPU (cgroups cpushares, cpusets)

•  Memory (cgroups)

•  Disk

•  Network

•  Cache

•  Power

•  …
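As a small illustration of the CPU isolator entry, the sketch below shows how cgroups CPU shares behave under contention: when every container wants more CPU than is available, time is divided in proportion to each container's shares. The helper names and the 1024-shares-per-CPU scaling are assumptions for the example (1024 is the cgroups v1 default value for cpu.shares); this does not touch real cgroups and is not the Mesos isolator code.

```python
def shares_for_cpus(cpus, shares_per_cpu=1024):
    """Map a CPU allocation to a cgroups-style shares value (assumed scaling)."""
    return int(cpus * shares_per_cpu)


def cpu_fractions(shares):
    """Fraction of CPU time each container gets when all are fully busy."""
    total = sum(shares.values())
    return {name: s / total for name, s in shares.items()}


shares = {
    "web": shares_for_cpus(2),
    "batch": shares_for_cpus(1),
    "cache": shares_for_cpus(1),
}
print(cpu_fractions(shares))  # {'web': 0.5, 'batch': 0.25, 'cache': 0.25}
```

Shares are only a relative weight: an idle container's CPU time is available to the others, which is why shares alone do not prevent the interference effects shown on the previous slide and why cpusets, cache, and power isolation appear in the list as well.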

Isolators → Performance QoS

[Lo et al’15]

+ best-effort task: >90% HW utilization, no latency SLO problems

Hybrid Datacenters

[Diagram: Marathon, Mesos Master, Mesos Slaves]

? instance ++

Scheduling based on workload type, data locality, pricing, …

Future Directions

Oversubscription

Container & application right sizing

Hybrid datacenters

Power aware scheduling

Locality aware scheduling

Mesos Datacenter

[Diagram: ops and developers; automation, performance, efficiency]

Want to Learn More?

Mon 3pm – Cracking the Container Scale Problem with Apache Mesos

Tue 4.20pm – The Emergence of the Datacenter Developer

Wed 4.15pm – Mesos + Yarn = Myriad

Questions?

References

•  http://mesos.apache.org/

•  http://www.mesosphere.com/

•  https://github.com/mesosphere/mesos-dns

•  https://www.cs.berkeley.edu/~alig/papers/mesos.pdf

•  https://www.cs.berkeley.edu/~alig/papers/drf.pdf

•  http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024

•  http://web.stanford.edu/~cdel/2014.asplos.quasar.pdf

•  http://web.stanford.edu/~davidlo/resources/2014.heracles.isca.pdf