FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable Container Image Layout

48
Copyright©2017 NTT Corp. All Rights Reserved. Akihiro Suda ( @_AkihiroSuda_ ) NTT Software Innovation Center FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable Container Image Layout github.com/AkihiroSuda/filegrain Open Source Summit North America (September 11, 2017) Last update: September 11, 2017

Transcript of FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable Container Image Layout

Copyright©2017 NTT Corp. All Rights Reserved.

Akihiro Suda ( @_AkihiroSuda_ )NTT Software Innovation Center

FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable

Container Image Layout

github.com/AkihiroSuda/filegrain

Open Source Summit North America (September 11, 2017) Lastupdate:September11,2017

2Copyright©2017 NTT Corp. All Rights Reserved.

• Software Engineer at NTT• Largest telco in Japan

• Docker Moby Core maintainer

• BuildKit initial maintainer (github.com/moby/buildkit)• Next-generation `docker build`• Executes DAG vertexes of Dockerfile-equivalent concurrently

In April, Docker [ as a project ] transited into Moby.Now Docker [ as a product ] has been developed as one of downstream of Moby.

: ≒ :RHEL Fedora

Who am I

3Copyright©2017 NTT Corp. All Rights Reserved.

•New image format that allows doing `docker run` before `docker pull` 100% finishes

•More benefits: deduplication, etc.•OCI compatible

/bin/shMinimal JREMinimal JDKOptional stuff

You just need to pull 20% forrunning the official OpenJDK 8 image!

(The rest 80% can be pulled on demand)

672 MB (total)

137 MB

535 MB

Summary

4Copyright©2017 NTT Corp. All Rights Reserved.

The current status of Docker / OCI format

5Copyright©2017 NTT Corp. All Rights Reserved.

• In July 2017, Open Containers Initiative (OCI) announced the first release of its industry-standard container image spec• CoreOS's appc/ACI spec maintainers has also joined OCI

• The data structure is almost identical to Docker Image Manifest V2.2, but made separate from Docker Registry HTTP API

• Agnostic to distribution protocol; Any protocol with directory-like semantic will work. e.g. HTTP, NFS, IPFS• Some extra coordinator is needed though (e.g. lock for the image index)• I'm +1 for IPFS (P2P filesystem)

• Note: the spec is unrelated to Dockerfile syntax / instructions

OCI Image Spec

6Copyright©2017 NTT Corp. All Rights Reserved.

• Composed of TAR archives of AUFS layers• AUFS-style tar format is used as the lingua franca across different snapshot drivers

(e.g. OverlayFS, Devicemapper, BtrFS, ZFS)

Structure of Docker / OCI image format

TAR

TAR

FROM ubuntu:17.04RUN apt-get install foobarCOPY foobar.conf /etc

mount –t overlay –o lowerdir=0,upperdir=1 ..

mount –t overlay –o lowerdir=1,upperdir=2 ..

base layer • added & modified files• file deletion info ("whiteout")TAR

/bin/bash/bin/ls ...

/usr/bin/foobar/usr/lib/libfoobar.so .../var/lib/.wh.baz

/etc/foobar.conf

7Copyright©2017 NTT Corp. All Rights Reserved.

• Blobs such as TARs are stored with a Merkle tree structure•Generally, distributed via Docker Registry, with (e.g.) S3/Swift backend

Structure of Docker / OCI image format

{"schemaVersion": 2,"manifests": [

{"mediaType": "application/vnd.oci.image.manifest.v1+json","size": 7143,"digest":

"sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f"}

}

/index.json

/blobs/sha256/e692418e...... (Next slide)

JSON

JSON

8Copyright©2017 NTT Corp. All Rights Reserved.

/blobs/sha256/e692418e...{"schemaVersion": 2,"config": {"mediaType": "application/vnd.oci.image.config.v1+json","size": 7023,"digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7"

},"layers": [{"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip","size": 33554432,"digest": "sha256:61be55a8e2f6b4e172338bddf184d6dbee29c98853e0a0485ecee7f27b9af0b4"

},{"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip","size": 1073741824,"digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b"

},{"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip","size": 73109,"digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736"

}]

}

/blobs/sha256/b5b2b2c5...Environment variables and so on

/blobs/sha256/61be55a8...Base layer

/blobs/sha256/3c3a4604...Diff layer

/blobs/sha256/ec4b8955...Diff layer

TAR

TAR

TAR

JSON

JSON

9Copyright©2017 NTT Corp. All Rights Reserved.

•Merkle tree ensures reproducibility of an environment(`docker pull foobar@sha256:e692418e..`)

Structure of Docker / OCI image format

/index.json

/blobs/sha256/e692418e...

/blobs/sha256/b5b2b2c5.../blobs/sha256/61be55a8.../blobs/sha256/3c3a4604.../blobs/sha256/ec4b8955...

Index

Manifest JSON

Config JSON(env vars and so on) AUFS layer TARs

Merkle tree

10Copyright©2017 NTT Corp. All Rights Reserved.

• TAR = Tape ARchiver, appeared in circa. 1979 (UNIX 7th Edition)• TAR was originally designed for magnetic tapesà Not suitable for Docker/OCI workloads in 2017

Problems of Docker / OCI image format

PDP-11, the target architecture of UNIX in 1970s,with TU56 DECtape drives

https://en.wikipedia.org/wiki/PDP-11

11Copyright©2017 NTT Corp. All Rights Reserved.

Without scanning the whole "tape"...

• Files cannot be listed upàCan't be mounted as a filesystem

• File offsets cannot be detectedà Files can't be accessed• We could create a separate index, but it is useless

when a TAR is gzipped, as it can't be seeked

Problem 1: TAR requires scanning the whole "tape"

Metadata 0File 0

Metadata 1File 1

Metadata {n-1}File {n-1}

Terminal zero bytes

...File name, permission, ...

Content

12Copyright©2017 NTT Corp. All Rights Reserved.

Problem 1: TAR requires scanning the whole "tape"

• If we could solve this problem, we can mount(2) an image and start a container without pulling the whole image• Only metadata are needed for mount(2)• Files are lazily pulled on demand

•à Shorter start-up time & Less network traffic

New containers can start immediately on newly added hosts

Faster testing anddeployment cycle

13Copyright©2017 NTT Corp. All Rights Reserved.

Problem 1: TAR requires scanning the whole "tape"

Detailed usecases:•Web apps with huge number of HTML files and graphic files

• Jupyter Notebook with various big data samples• Academic papers will be immediately reproducible by just running

`docker run some-single-huge-image@sha256:deadbeef..`

• Full GNOME/KDE/Windows(potentially) Desktop

• Java / dotNET runtimes

• Integration testing environment...

14Copyright©2017 NTT Corp. All Rights Reserved.

• A registry might contain very similar images• Different versions• Different architectures• Different configurations

• TARs of these images are likely to contain identical files, but waste storage without any data deduplication

Problem 2: No deduplication

FROM ubuntu:17.04RUN apt-get install foo

FROM ubuntu:17.04RUN apt-get install foo bar

FROM debian:9RUN apt-get install foo

FROM ubuntu:17.04RUN echo … > /etc/apt/source.listRUN apt-get install foo

15Copyright©2017 NTT Corp. All Rights Reserved.

• It might be good to pull a large TAR layer with multiple connections • Especially when multiple servers are available

• But not all protocol allows that• RFC7233 says HTTP/1.1 Range Requests are OPTIONAL

Problem 3: No concurrency

Range0-1023 Range1024-2047

16Copyright©2017 NTT Corp. All Rights Reserved.

1. TAR requires scanning the whole "tape"

2. No deduplication

3. No concurrency

Problems of Docker / OCI image format

Image: ht tps:/ /en.wik ipedia.org/wik i /Magnet ic_tape

17Copyright©2017 NTT Corp. All Rights Reserved.

New image format: FILEgrainhttps://github.com/AkihiroSuda/filegrain

18Copyright©2017 NTT Corp. All Rights Reserved.

• Single large TAR blob à Many small blobs•Metadata blob with content-addressability•Only metadata blob is needed for mounting the image• File blobs can be lazily pulled on demand

FILEgrain overview

Metadata 0File 0

Metadata 1File 1

Metadata {n-1}File {n-1}

Terminal zero bytes

...

Metadata 0 File 0Metadata 1

File 1

Metadata {n-1} File {n-1}

...

TARcontinuity File name, permission bits,

SHA256 digest...

content-addressable

19Copyright©2017 NTT Corp. All Rights Reserved.

• continuity: filesystem metadata manifest system used in containerd / Moby community (github.com/containerd/continuity)• Serializes file names, permission bits, XAttrs, and digest values as a

ProtocolBuffers message structure

• Similar format: mtree(8)

Metadata format: continuity

message Resource {repeated string path;int64 uid;int64 gid;uint32 mode;uint64 size;repeated string digest;repeated XAttr xattr;...

}

20Copyright©2017 NTT Corp. All Rights Reserved.

• A single OCI image can contain both FILEgrain manifests and traditional OCI manifests

Highly compatible with traditional OCI spec

/index.json

/blobs/sha256/e692...

/blobs/sha256/b5b2... /blobs/sha256/61be.../blobs/sha256/3c3a.../blobs/sha256/ec4b...

IndexTraditional OCI Manifest

Common config, such as env vars

AUFS layer TARs

/blobs/sha256/a8e3...

FILEgrain Manifest

/blobs/sha256/de81...continuity

/blobs/sha256/583f.../blobs/sha256/3af1.../blobs/sha256/5c2a.../blobs/sha256/39c1.../blobs/sha256/12ea... Bunch of files

`docker pull foo:v1-filegrain` `docker pull foo:v1`

{"v1-filegrain": "sha256:a8e3..", "v1":"sha256:e692.."}

21Copyright©2017 NTT Corp. All Rights Reserved.

• FILEgrain layers can be put on top of traditional OCI layers• Traditional OCI layers might be still suitable for frequently used files or large

number of small files

Highly compatible with traditional OCI spec

{..."layers": [{"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip","size": 33554432,"digest": "sha256:e692..."

},{"mediaType": "application/vnd.continuity.layer.v0+pb","size": 16724,"digest": "sha256:a18b..."

}]

}

TAR

continuity

Traditional OCI layer(Needs to be completely

pulled before starting a container)

FILEgrain layer(Can be lazy-pulled on demand)

22Copyright©2017 NTT Corp. All Rights Reserved.

1. TAR requires scanning the whole "tape"à Only metadata (continuity) blob is needed for mounting.

Other stuff can be lazily pulled on demand.

2. No deduplicationà Deduplication in the granularity of files

3. No concurrencyà Concurrency in the granularity of files

The problems are solved!

23Copyright©2017 NTT Corp. All Rights Reserved.

1. Larger number of blobs• /blobs/sha256 directory will contain huge number of files when laid out on

filesystem, and may make readdir(3) slow• Generally, this issue could be solved by "sharding" /blobs/sha256/deadbeef.. like

/blobs/sha256/de/deadbeef.., but not in this case, for compatibility with OCI.

2. More RPC overhead• Single traditional TAR is still suitable for small files so as to reduce number of RPCs

3. Less compression rate• Single traditional TAR is still suitable when the layer contains similar files

But anyway, these cons can be easily mitigated by composing FILEgrain layers on top of traditional OCI layers

Cons

24Copyright©2017 NTT Corp. All Rights Reserved.

Alternative 1: Use some external blob store?à No, because it depends on certain protocols. Also, it is unlikely to work with OCI Merkle tree.

Alternative ideas?

Related work:

Harter, Tyler, et al. "Slacker: Fast Distribution with Lazy Docker Containers." FAST. 2016.

• Loopback-mount an ext4 image located on NFS

• Support lazy-pulling and deduplication in granularity of blocks

25Copyright©2017 NTT Corp. All Rights Reserved.

Alternative 1: Use some external blob store?à No, because it depends on certain protocols. Also, it is unlikely to work with OCI Merkle tree.

Alternative ideas?

Related work:Lestaris, George. "Alternatives to layer-based image distribution: using CERN filesystem for images." Container Camp UK. 2016.

Blomer, Jakob, et al. "A Novel Approach for Distributing and Managing Container Images: Integrating CernVM File System and Mesos." MesosCon NA. 2016.

• CernVM FS• CernVM FS has its own Merkle tree

• Support lazy-pulling and deduplication in granularity of files (as in FILEgrain)

26Copyright©2017 NTT Corp. All Rights Reserved.

Alternative 1: Use some external blob store?à No, because it depends on certain protocols. Also, it is unlikely to work with OCI Merkle tree.

Alternative ideas?

• IPFS is also attractive protocol• P2P, content-addressable

• FILEgrain is made agnostic to transportation protocol as in OCI image spec;it works with HTTP / NFS / CernVM FS / IPFS / whatever

27Copyright©2017 NTT Corp. All Rights Reserved.

Alternative 2: Seek TAR?à No• Compressed TAR is not seekable• Even uncompressed TAR is sometimes not seekable, depending on transportation

protocol• Larger number of request for fetching the whole metadata

Alternative ideas?

Metadata 0File 0

Metadata 1File 1

Metadata {n-1}File {n-1}

Terminal zero bytes

...

Metadata can be pre-loadedwithout reading file contents? (No)

TAR

28Copyright©2017 NTT Corp. All Rights Reserved.

Alternative 3: Use (e.g.) ZIP instead of TAR?à No• still unseekable depending on transporation protocol• poor metadata support

Alternative ideas?

ZIP

Extra metadata 0Compressed file 0Extra metadata 1Compressed file 1

Extra metadata {n-1}Compressed file {n-1}

...

Footer

Metadata 0Metadata 1

...

Metadata {n-1}Metadata can be pulled at once firstly? (No)

29Copyright©2017 NTT Corp. All Rights Reserved.

Evaluation

30Copyright©2017 NTT Corp. All Rights Reserved.

• Implemented as a read-only FUSE filesystem

• No support for write operations• "Cattle" containers should be immutable; it is anti-pattern to do any write operation

against rootfs• /tmp and /run should be mounted as tmpfs• persistent data should be written to bind-mounted Ext4/XFS

• Actually, even Docker built-in storage drivers (overlayfs, AUFS) don't fully support write operations (github.com/AkihiroSuda/issues-docker)• e.g. Yum is known not to work on overlayfs, without installing yum-plugin-ovl

(neither a bug of overlayfs nor yum!)

Implementation

31Copyright©2017 NTT Corp. All Rights Reserved.

• Docker doesn't support running a container with a rootfs that is not managed by the Docker daemon

• So current FILEgrain POC is evaluated using runc

• TODO: reimplement as a containerd plugin

Implementation

32Copyright©2017 NTT Corp. All Rights Reserved.

See https://github.com/AkihiroSuda/filegrain/issues/17 for detailed information

Images for evaluation

Image Description rootfs size(after tar+gzip expansion)

openjdk:8

sha256:5da842d59f76009fa27ffde06888ebd560c7ad17607d7cd1e52fc0757c9a45fb

Debian 9.1, OpenJDK8u141

671.7MB

kdeneon/all

sha256:e3e7f216a5f8f1fdcff4eab8807d7afcd291c050099ab3e8a8355b7b28a19247

Ubuntu 16.04, KDE Plasma 5.10, Firefox 54..

4.8GB

kaggle/python

sha256:335103c998aea22a5608c2eeca7dcf109e0828ed233b75f5098182c5b058fe98

Debian 8.5, Variousmachine learning frameworks, NLTK (natural language toolkit) dataset..

8.3GB

33Copyright©2017 NTT Corp. All Rights Reserved.

• openjdk:8 (blobs in total = 671.7MB + meta 5.4MB)• mount: 5.4MB ( 2 blobs)

• only metadata are needed (manifest and continuity)• `sh`: 7.3MB ( 8 blobs) in total• `java –version`: 88.2MB (30 blobs) in total• `javac HelloWorld.java`: 137.3MB (50 blobs) in total

• kdeneon/all (4.8GB + 34.5MB)• mount: 34.5MB ( 2 blobs)• `sh`: 36.7MB ( 8 blobs)• `startkde`: 742.7MB (4,267 blobs)• `firefox`: 866.6MB (4,506 blobs)

note: commands are executed sequentially

Evaluation: blobs needed to start a container (uncompressed)

1/5

lessthan1/5

The evaluat ion data in the abstract text (633MB Java image) is an old data

34Copyright©2017 NTT Corp. All Rights Reserved.

• kaggle/python (8.3GB + 38.2MB)• mount: 38.2MB ( 2 blobs)• `sh`: 40.1MB ( 8 blobs)• `ipython –c ‘echo(“hello”)’`: 75.4MB (1033 blobs)• `ipython –c ‘import nltk’`: 352.0MB (2799 blobs)

Evaluation: blobs needed to start a container (uncompressed)

lessthan1/24

35Copyright©2017 NTT Corp. All Rights Reserved.

Evaluation: compression

Image rootfs TAR at once + gzip -9

FILEgrain + gzip -9 against each of blobs

openjdk:8 671.7MB 261.3MB 260.7MB(-645,604B)

kdeneon/all 4.8GB 2.1GB 2.1GB(-489,361B)

kaggle/python 8.3GB 3.6GB 3.6GB(+4,345,701B)

No more than a rounding error in these cases(If an image contained similar files, its compression rate would get worse)

36Copyright©2017 NTT Corp. All Rights Reserved.

Evaluation: deduplication

kdeneon/all(4.8GB)

kaggle/python(8.3GB)

75.4MB deduplication(KDE and Kaggle are mutually unrelated,

but have some common Debian files)

37Copyright©2017 NTT Corp. All Rights Reserved.

Evaluation: FUSE overhead

0.1

1

10

100

1 2 3 4 5 6 7 8 9 10

Time required for archiving /usr of openjdk(X: n-th experiment, Y:seconds)

Docker (overlay2) FILEgrain FUSEDocker 17.06 / Fedora 26 / 2 vCPUs, 2GB RAM, 2GB swap (VMware Fusion, MacBook Pro 2016)

38Copyright©2017 NTT Corp. All Rights Reserved.

Evaluation: FUSE overhead

0.1

1

10

100

1 2 3 4 5 6 7 8 9 10

Time required for archiving /usr of openjdk(X: n-th experiment, Y:seconds)

Docker (overlay2) FILEgrain FUSE

For FILEgrain, the first data contains the time for pulling some blobsneeded to start a container

(But no network overhead, as localhost is the remote in this evaluation)

But the Docker data doesn't contain the time for pulling,as a container can be started only after pulling the whole image.

39Copyright©2017 NTT Corp. All Rights Reserved.

0.1

1

10

100

1 2 3 4 5 6 7 8 9 10

Time required for archiving /usr of openjdk(X: n-th experiment, Y:seconds)

Docker (overlay2) FILEgrain FUSE

Evaluation: FUSE overhead

Getting faster due to in-kernel caching

For some implementation reason,in-kernel cache doesn't seem working..

But this cache should be easily enabled,as the filesystem is always read-only by design

40Copyright©2017 NTT Corp. All Rights Reserved.

•Overhead of larger number of RPC for pulling blobs• Depends on the transportation protocol• TODO: implement Docker Registry API client and evaluate the overhead

• Current POC just uses a local directory as a mock registry

Evaluation: others

41Copyright©2017 NTT Corp. All Rights Reserved.

Future work

42Copyright©2017 NTT Corp. All Rights Reserved.

• Even finer granularity (CHUNKgrain?)• For large files, it might be good compute SHA256 digests against each of chunks,

and serialize the digests in some format• Probably, this serialization would be separate from continuity manifest itself.

continuity#85

• Use traditional OCI TAR for files that are very likely to be needed immediately to start a container• We can easily detect such files by starting a container and capture FUSE calls

before pushing the image

Future work

43Copyright©2017 NTT Corp. All Rights Reserved.

• containerd is the next standard runtime in the container industry

Integration to the ecosystem

Docker

(Kubernetes)

containerd v0.2

runc

Moby Core(successor to Docker daemon)

Kubernetes(via CRI-containerd)

containerd v1.0

runc

44Copyright©2017 NTT Corp. All Rights Reserved.

• containerd plugin system allows new feature to be added without modifying the containerd upstream

Integration to the ecosystem

containerd v1.0

runtime service snapshot service differ service content serviceoverlayfs plugin

btrfs plugin

runc plugin generic plugin generic plugin

45Copyright©2017 NTT Corp. All Rights Reserved.

• FILEgrain will be reimplemented as set of containerd plugins• Can be easily integrated to the higher-level engines without

modification (Moby/Docker, Kubernetes, ..)

Integration to the ecosystem

containerd v1.0

runtime service snapshot service differ service content serviceoverlayfs plugin

btrfs plugin

runc plugin generic plugin generic plugin

FILEgrain plugins

46Copyright©2017 NTT Corp. All Rights Reserved.

Recap

47Copyright©2017 NTT Corp. All Rights Reserved.

•New image format that allows doing `docker run` before `docker pull` 100% finishes

•More benefits: deduplication, etc.•OCI compatible

/bin/shMinimal JREMinimal JDKOptional stuff

You just need to pull 20% forrunning the official OpenJDK 8 image!

(The rest 80% can be pulled on demand)

672 MB (total)

137 MB

535 MB

Recap

48Copyright©2017 NTT Corp. All Rights Reserved.

Demo