Nabla containers: a new approach to container isolation · 2019-05-15 · Making and running a...
Nabla containers: a new approach to container isolation
Brandon Lum, Ricardo Koller, Dan Williams, Sahil Suneja
IBM Research
https://nabla-containers.github.io
KubeCon China 2018
Containers are not securely isolated
- What does this exactly mean?
- Why are VMs considered secure but not containers?
- How do we improve container isolation?
Overview
• Threat model: isolation
• Isolation through surface reduction
• Our approach: Nabla
• Measuring isolation
• Nabla vs. VMs?
What does it mean to be isolated?
• Containers that are co-located should not be able to access one another's data
• Scenarios:
  • Horizontal attacks from vulnerable services
  • Container-native multi-tenant cloud
[Diagram: an attacker container and a Service A container holding a secret, both running on one shared kernel]
Container Isolation Reality
• Containers == namespaced processes → kernel exploits mostly work
  • Sep 2018: CVE-2018-14634
  • Dirty COW (CVE-2016-5195)
  • Many more (CVE database); 2018: code execution (3), memory corruption (8)
• Horizontal attack possible via the shared privileged component (the kernel)
[Diagram: the attacker container exploits the shared kernel via syscalls to reach Service A's secret]
Dirty COW
• Dirty COW exploit sketch:
  • mmap a page
  • Create a thread that invokes madvise
  • Create a thread that reads/writes via procfs
  • This triggers a race condition in the kernel's memory management code
// FROM: https://dirtycow.ninja/
map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);
printf("mmap %zx\n\n", (uintptr_t) map);
/* You have to do it on two threads. */
pthread_create(&pth1, NULL, madviseThread, argv[1]);      // madvise
pthread_create(&pth2, NULL, procselfmemThread, argv[2]);  // R/W procfs
/* You have to wait for the threads to finish. */
pthread_join(pth1, NULL);
pthread_join(pth2, NULL);
return 0;
Kernel Footprint
[Diagram: an application on top of the kernel (FS, disk), with >300 syscalls between them]
• Exploits target vulnerable parts of the kernel via syscalls.
• If we restrict the number of syscalls:
  • → Fewer reachable kernel functions
  • → Fewer potential vulnerabilities
  • → Fewer possible exploits
Docker Default Seccomp Policy
[Diagram: an application on top of the kernel (FS, disk) under seccomp (whitelisting policy): ~280 syscalls remain, 44 syscalls are blocked; greyed-out kernel functions are unreachable]
• Docker's default seccomp policy disables around 44 system calls out of 300+.
• Generic seccomp policies are hard to create such that they are secure.
• Syscall profiling is mostly heuristic-based.
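Nabla
[Diagram: an application on a LibOS behind seccomp; the LibOS presents the original 300+ syscall interface* to the application, while only 7 syscalls reach the kernel (FS, disk)]
• Deterministic and generic seccomp policy
• Only 7 syscalls!
• Uses LibOS techniques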
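To make the idea of a small, fixed whitelist concrete, here is a minimal sketch of a seccomp-BPF filter that allows seven syscalls and kills on everything else. It is not the Nabla tender's code: the function names `build_whitelist` and `install_whitelist` are hypothetical, the seven syscalls are taken from the hypercall table in the backup slides, and the sketch filters on syscall number only, whereas that table pairs each call with a specific resource such as a file descriptor.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Audit architecture token checked by the filter (sketch covers two ABIs). */
#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

/* Assumption: the seven syscalls named in the talk's hypercall table. */
static const int allowed_syscalls[] = {
    SYS_clock_gettime, SYS_write, SYS_ppoll, SYS_pwrite64,
    SYS_pread64, SYS_read, SYS_exit_group,
};
#define N_ALLOWED (sizeof(allowed_syscalls) / sizeof(allowed_syscalls[0]))
#define FILTER_LEN (4 + N_ALLOWED + 2)

/* Fill prog (capacity FILTER_LEN) with the whitelist program.
 * Returns the number of BPF instructions written. */
size_t build_whitelist(struct sock_filter *prog)
{
    size_t pc = 0, i;

    /* Kill on an unexpected architecture: syscall numbers differ per ABI. */
    prog[pc++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, arch));
    prog[pc++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                     SKETCH_AUDIT_ARCH, 1, 0);
    prog[pc++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);

    /* Load the syscall number and compare it with each allowed entry. */
    prog[pc++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr));
    for (i = 0; i < N_ALLOWED; i++) {
        /* On a match, jump over the remaining comparisons and the KILL. */
        prog[pc++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                         (unsigned int)allowed_syscalls[i],
                         (unsigned char)(N_ALLOWED - i), 0);
    }
    prog[pc++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);
    prog[pc++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    return pc;
}

/* Install the whitelist for the calling thread. Returns 0 on success, -1 on error. */
int install_whitelist(void)
{
    struct sock_filter prog[FILTER_LEN];
    struct sock_fprog fprog;

    fprog.len = (unsigned short)build_whitelist(prog);
    fprog.filter = prog;
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) == 0 ? 0 : -1;
}
```

An ordinary libc program that calls `install_whitelist()` would likely hit a blocked syscall very soon afterwards, which illustrates why the talk pairs such a narrow policy with a LibOS that implements the rest of the interface in user space.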
Nabla
• Taking unikernel ideas and putting them into containers
• Using tools/technologies from the rumprun and Solo5 communities
• Modifying the unikernel to work as a process
"Unikernels as Processes" (ACM SoCC '18)
(https://dl.acm.org/citation.cfm?id=3267845)
Making and running a Nabla
• Build the app with a custom build process*
• The Nabla runtime, runnc, loads the nabla binaries and sets up seccomp profiles
* Current limitation of the build process; we are investigating ways to remove the need for a custom build process.
[Diagram: an application (>300 syscalls) goes through the build process* to become a Nabla binary (application + LibOS behind seccomp, 7 syscalls). In the container runtime, runc runs ordinary applications and runnc runs Nabla binaries.]
Demo
strace/ftrace measurements (low is good)
[Diagram: an application on top of the kernel (FS, disk), with >300 syscalls between them; the kernel is drawn as boxes]
ftrace measures the number of boxes touched.
strace measures the syscalls invoked.
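As a rough illustration of the strace side of this measurement, the sketch below counts the distinct syscall names in strace-style text. It assumes strace's default line format, `name(args) = result`, and skips anything else (signal and exit lines, or the `[pid N]` prefixes that `strace -f` adds). The function name `count_unique_syscalls` and the size limits are hypothetical and are not taken from the nabla-measurements repository.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 512     /* assumed upper bound on distinct syscalls */
#define MAX_NAME_LEN 32   /* assumed upper bound on a syscall name    */

/* Count distinct syscall names in strace-style text.
 * A line is counted only if it starts with an identifier followed by '('. */
size_t count_unique_syscalls(const char *trace)
{
    char seen[MAX_NAMES][MAX_NAME_LEN];
    size_t n_seen = 0;
    const char *line = trace;

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        size_t k = 0, j;
        int dup = 0;

        /* Measure the leading identifier (the candidate syscall name). */
        while (k < len && (isalnum((unsigned char)line[k]) || line[k] == '_'))
            k++;

        if (k > 0 && k < len && line[k] == '(' && k < MAX_NAME_LEN) {
            for (j = 0; j < n_seen; j++) {
                if (strlen(seen[j]) == k && memcmp(seen[j], line, k) == 0) {
                    dup = 1;
                    break;
                }
            }
            if (!dup && n_seen < MAX_NAMES) {
                memcpy(seen[n_seen], line, k);
                seen[n_seen][k] = '\0';
                n_seen++;
            }
        }

        line += len;
        if (*line == '\n')
            line++;
    }
    return n_seen;
}
```

Feeding it the text captured from a run of `strace` on an application gives one number per workload, which is the kind of per-application figure the slide compares.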
ftrace measurements (lower is better)
[Chart: ftrace results for Kata Containers (VMs) and Nabla]
What does this say about our isolation vs. VMs?
Have we surpassed VM isolation?
• We explored and contested this idea in our paper:
"Say Goodbye to Virtualization for a Safer Cloud" (USENIX HotCloud 2018)
(https://www.usenix.org/conference/hotcloud18/presentation/williams)
• Maybe… but there are several questions:
  • Implementation-specific comparisons? KVM vs. other hypervisors
  • Hardware-inclusive threat model (Spectre/Meltdown, etc.)
  • Other metrics
What's Next?
• We want to engage the community:
  • Development work for runnc/nabla-base-build/nabla-demo-apps
  • Remove the need to rebuild nabla containers (support for dynamically linking the LibOS)
  • Create new images and more language support for applications
• Chime in on improving security analysis/metrics
  • https://github.com/nabla-containers/nabla-measurements
Thank You!
https://nabla-containers.github.io
Brandon Lum (@lumjjb) – [email protected]
#NablaContainers
Backup
ftrace measurements (lower is better)
[Diagram: an application on top of the kernel (FS, disk), with >300 syscalls between them; the kernel is drawn as boxes]
Measuring the number of boxes touched.
Throughput (higher is better)
Demo
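Container Runtime
[Diagram: Kubelet talks over CRI to containerd (cri-containerd), which pulls images from the image registry (OCI image spec), uses a CNI plugin over CNI, and runs containers (OCI runtime spec) through runc or runnc, along with other config from the podSpec, e.g. mounts, security, etc.]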
Inside a Nabla container
• Unmodified user code (e.g., Node.js, redis, nginx, etc.)
• Rumprun library OS
  • Unmodified NetBSD code + some glue
  • Runs on the thin Solo5 unikernel interface
• Nabla Tender
  • Setup of seccomp policy
  • Translates Solo5 calls to system calls
[Diagram: layers inside a Nabla container: Application, Libc, NetBSD components (FS, TCP/IP, …), Rumprun glue, Solo5, and the 𝛁 Tender, shown in place of the original container]
Backup: Containers vs. VMs
Overview
• Threat model: isolation
• What makes VMs isolated?
• Nabla: how do we get those isolation properties without overhead?
Disclaimer: In this talk, we are doing a 1:1 comparison. Defense in depth is a valid discussion with a different set of trade-offs.
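Containers vs. VMs
[Diagram: Containers: a compromised (☠) process on the host kernel. VMs: a compromised (☠) guest OS on the hypervisor (+ host kernel (root)).]
Containers, high level: syscalls (filesystem interface, socket interface, etc.)
VMs, low level: VT (block device interface, TAP interface, etc.)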
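Containers vs. VMs
[Diagram: Containers: a process reaches the infrastructure (FS, disk) through the interface. VMs: a guest application and guest OS (with its own FS) reach the infrastructure (disk) through the interface.]
A LOT more exploitable code in the infrastructure!!!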
Lower level interface
Less code
Fewer vulnerabilities
Stronger isolation
Kernel functions accessed by applications
• Compared to standard containers:
  • 5-6x fewer kernel functions accessed
  • 8-14x fewer syscalls
• About half the number of kernel functions accessed as VMs!
[Chart: unique kernel functions accessed (y-axis, 0 to 1600) for nginx, nginx-large, node-express, redis-get and redis-set; series: process (container), ukvm (VM), nabla]
Accessible kernel functions under Nabla policy
[Chart: unique kernel functions (y-axis, 0 to 700) over an x-axis running from 0 to 300, with a zoomed inset; series: accept, nabla, block]
• Trinity kernel fuzz tester used to try to access as much of the kernel as possible
• Nabla policy reduces the amount of accessible kernel functions by 98%
Unikernel isolation comes from the interface
• Direct mapping between 10 hypercalls and system call/resource pairs
  • 6 for I/O
    • Network: packet level
    • Storage: block level
  • vs. >350 syscalls

Hypercall   System call     Resource
walltime    clock_gettime
puts        write           stdout
poll        ppoll           net_fd
blkinfo
blkwrite    pwrite64        blk_fd
blkread     pread64         blk_fd
netinfo
netwrite    write           net_fd
netread     read            net_fd
halt        exit_group

Source: SoCC '18
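The mapping on this slide can be written down as a small lookup table, which makes it easy to check that the ten hypercalls collapse to seven distinct system calls. This is a sketch only: the row pairing is read from the flattened slide table and should be checked against the paper, and the type `hypercall_map` and the helper names are hypothetical rather than Solo5 or Nabla identifiers.

```c
#include <stddef.h>
#include <string.h>

/* One row of the slide's table; NULL marks an empty cell. */
struct hypercall_map {
    const char *hypercall;
    const char *syscall_name;
    const char *resource;
};

/* Assumption: rows paired as read from the flattened slide. */
static const struct hypercall_map solo5_map[] = {
    { "walltime", "clock_gettime", NULL     },
    { "puts",     "write",         "stdout" },
    { "poll",     "ppoll",         "net_fd" },
    { "blkinfo",  NULL,            NULL     },
    { "blkwrite", "pwrite64",      "blk_fd" },
    { "blkread",  "pread64",       "blk_fd" },
    { "netinfo",  NULL,            NULL     },
    { "netwrite", "write",         "net_fd" },
    { "netread",  "read",          "net_fd" },
    { "halt",     "exit_group",    NULL     },
};
#define N_HYPERCALLS (sizeof(solo5_map) / sizeof(solo5_map[0]))

/* Find the row for a hypercall name, or NULL if it is not in the table. */
const struct hypercall_map *lookup_hypercall(const char *name)
{
    size_t i;

    for (i = 0; i < N_HYPERCALLS; i++)
        if (strcmp(solo5_map[i].hypercall, name) == 0)
            return &solo5_map[i];
    return NULL;
}

/* Count the distinct system calls the table needs. */
size_t count_distinct_syscalls(void)
{
    size_t i, j, count = 0;

    for (i = 0; i < N_HYPERCALLS; i++) {
        int seen = 0;

        if (solo5_map[i].syscall_name == NULL)
            continue;               /* hypercall needs no system call */
        for (j = 0; j < i; j++) {
            if (solo5_map[j].syscall_name != NULL &&
                strcmp(solo5_map[j].syscall_name,
                       solo5_map[i].syscall_name) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

Under this reading, `count_distinct_syscalls()` yields 7, matching the "only 7 syscalls" figure from the main slides, because `write` serves both `puts` and `netwrite` and two hypercalls need no system call at all.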
Implementation: nabla 𝛁
• Extended the Solo5 unikernel ecosystem and ukvm
• Prototype supports:
  • MirageOS
  • IncludeOS
  • Rumprun
• https://github.com/solo5/solo5
Measuring isolation: common applications
[Chart: unique kernel functions accessed (y-axis, 0 to 1600) for nginx, nginx-large, node-express, redis-get and redis-set; series: process, ukvm, nabla]
• Code reachable through the interface is a metric for attack surface
• Used kernel ftrace
• Results (kernel functions accessed, relative to nabla):
  • Processes: 5-6x more
  • VMs: 2-3x more
Measuring isolation: fuzz testing
[Chart: unique kernel functions (y-axis, 0 to 700) over an x-axis running from 0 to 300, with a zoomed inset; series: accept, nabla, block]
• Used kernel ftrace
• Used the trinity system call fuzzer to try to access more of the kernel
• Results:
  • Nabla policy reduces accessible kernel functions by 98% compared with a "normal" process
Measuring performance: throughput
[Chart: normalized throughput (y-axis, 80% to 200%, with one bar labeled 245) for py_tornado, py_chameleon, node_fib, mirage_HTTP, py_2to3, node_express, nginx_large, redis_get, redis_set, includeos_TCP, nginx and includeos_UDP, grouped into "no I/O" and "with I/O"; series: ukvm, nabla, QEMU/KVM]
• Applications include:
  • Web servers
  • Python benchmarks
  • Redis
  • etc.
• Results:
  • 101%-245% higher throughput than ukvm
Measuring performance: CPU utilization
[Chart: three panels plotted against requests/sec (0 to 20000): (a) CPU %, (b) VM exits/ms, (c) IPC (ins/cycle); series: nabla, ukvm]
• VM exits have an effect on instructions per cycle
• Experiment with the MirageOS web server
• Results:
  • 12% reduction in CPU utilization over ukvm
Measuring performance: startup time
[Chart: startup time against number of cores (2 to 16) for two workloads, "Hello world" and "HTTP POST"; series: QEMU/KVM, ukvm, nabla, process]
• Startup time is important for serverless and NFV
• Results:
  • ukvm has 30-370% higher latency than nabla
  • Mostly due to avoiding KVM overheads