Files
container.training/slides/containers/Copy_On_Write.md
Jerome Petazzoni 018282f392 slides: rename directories
This was discussed and agreed in #246. It will probably break a few
outstanding PRs as well as a few external links but it's for the
better good long term.
2018-08-21 04:03:38 -05:00

340 lines
7.7 KiB
Markdown

# Copy-on-write filesystems
Container engines rely on copy-on-write to be able
to start containers quickly, regardless of their size.
We will explain how that works, and review some of
the copy-on-write storage systems available on Linux.
---
## What is copy-on-write?
- Copy-on-write is a mechanism allowing to share data.
- The data appears to be a copy, but is only
a link (or reference) to the original data.
- The actual copy happens only when someone
tries to change the shared data.
- Whoever changes the shared data ends up
using their own copy instead of the shared data.
---
## A few metaphors
--
- First metaphor:
<br/>white board and tracing paper
--
- Second metaphor:
<br/>magic books with shadowy pages
--
- Third metaphor:
<br/>just-in-time house building
---
## Copy-on-write is *everywhere*
- Process creation with `fork()`.
- Consistent disk snapshots.
- Efficient VM provisioning.
- And, of course, containers.
---
## Copy-on-write and containers
Copy-on-write is essential to give us "convenient" containers.
- Creating a new container (from an existing image) is "free".
(Otherwise, we would have to copy the image first.)
- Customizing a container (by tweaking a few files) is cheap.
(Adding a 1 KB configuration file to a 1 GB container takes 1 KB, not 1 GB.)
- We can take snapshots, i.e. have "checkpoints" or "save points"
when building images.
---
## AUFS overview
- The original (legacy) copy-on-write filesystem used by first versions of Docker.
- Combine multiple *branches* in a specific order.
- Each branch is just a normal directory.
- You generally have:
- at least one read-only branch (at the bottom),
- exactly one read-write branch (at the top).
(But other fun combinations are possible too!)
---
## AUFS operations: opening a file
- With `O_RDONLY` - read-only access:
- look it up in each branch, starting from the top
- open the first one we find
- With `O_WRONLY` or `O_RDWR` - write access:
- if the file exists on the top branch: open it
- if the file exists on another branch: "copy up"
<br/>
(i.e. copy the file to the top branch and open the copy)
- if the file doesn't exist on any branch: create it on the top branch
That "copy-up" operation can take a while if the file is big!
---
## AUFS operations: deleting a file
- A *whiteout* file is created.
- This is similar to the concept of "tombstones" used in some data systems.
```
# docker run ubuntu rm /etc/shadow
# ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc
total 8
drwxr-xr-x 2 root root 4096 Jan 27 15:36 .
drwxr-xr-x 5 root root 4096 Jan 27 15:36 ..
-r--r--r-- 2 root root 0 Jan 27 15:36 .wh.shadow
```
---
## AUFS performance
- AUFS `mount()` is fast, so creation of containers is quick.
- Read/write access has native speeds.
- But initial `open()` is expensive in two scenarios:
- when writing big files (log files, databases ...),
- when searching many directories (PATH, classpath, etc.) over many layers.
- Protip: when we built dotCloud, we ended up putting
all important data on *volumes*.
- When starting the same container multiple times:
- the data is loaded only once from disk, and cached only once in memory;
- but `dentries` will be duplicated.
---
## Device Mapper
Device Mapper is a rich subsystem with many features.
It can be used for: RAID, encrypted devices, snapshots, and more.
In the context of containers (and Docker in particular), "Device Mapper"
means:
"the Device Mapper system + its *thin provisioning target*"
If you see the abbreviation "thinp" it stands for "thin provisioning".
---
## Device Mapper principles
- Copy-on-write happens on the *block* level
(instead of the *file* level).
- Each container and each image get their own block device.
- At any given time, it is possible to take a snapshot:
- of an existing container (to create a frozen image),
- of an existing image (to create a container from it).
- If a block has never been written to:
- it's assumed to be all zeros,
- it's not allocated on disk.
(That last property is the reason for the name "thin" provisioning.)
---
## Device Mapper operational details
- Two storage areas are needed:
one for *data*, another for *metadata*.
- "data" is also called the "pool"; it's just a big pool of blocks.
(Docker uses the smallest possible block size, 64 KB.)
- "metadata" contains the mappings between virtual offsets (in the
snapshots) and physical offsets (in the pool).
- Each time a new block (or a copy-on-write block) is written,
a block is allocated from the pool.
- When there are no more blocks in the pool, attempts to write
will stall until the pool is increased (or the write operation
aborted).
- In other words: when running out of space, containers are
frozen, but operations will resume as soon as space is available.
---
## Device Mapper performance
- By default, Docker puts data and metadata on a loop device
backed by a sparse file.
- This is great from a usability point of view,
since zero configuration is needed.
- But it is terrible from a performance point of view:
- each time a container writes to a new block,
- a block has to be allocated from the pool,
- and when it's written to,
- a block has to be allocated from the sparse file,
- and sparse file performance isn't great anyway.
- If you use Device Mapper, make sure to put data (and metadata)
on devices!
---
## BTRFS principles
- BTRFS is a filesystem (like EXT4, XFS, NTFS...) with built-in snapshots.
- The "copy-on-write" happens at the filesystem level.
- BTRFS integrates the snapshot and block pool management features
at the filesystem level.
(Instead of the block level for Device Mapper.)
- In practice, we create a "subvolume" and
later take a "snapshot" of that subvolume.
Imagine: `mkdir` with Super Powers and `cp -a` with Super Powers.
- These operations can be executed with the `btrfs` CLI tool.
---
## BTRFS in practice with Docker
- Docker can use BTRFS and its snapshotting features to store container images.
- The only requirement is that `/var/lib/docker` is on a BTRFS filesystem.
(Or, the directory specified with the `--data-root` flag when starting the engine.)
---
class: extra-details
## BTRFS quirks
- BTRFS works by dividing its storage in *chunks*.
- A chunk can contain data or metadata.
- You can run out of chunks (and get `No space left on device`)
even though `df` shows space available.
(Because chunks are only partially allocated.)
- Quick fix:
```
# btrfs filesys balance start -dusage=1 /var/lib/docker
```
---
## Overlay2
- Overlay2 is very similar to AUFS.
- However, it has been merged in "upstream" kernel.
- It is therefore available on all modern kernels.
(AUFS was available on Debian and Ubuntu, but required custom kernels on other distros.)
- It is simpler than AUFS (it can only have two branches, called "layers").
- The container engine abstracts this detail, so this is not a concern.
- Overlay2 storage drivers generally use hard links between layers.
- This improves `stat()` and `open()` performance, at the expense of inode usage.
---
## ZFS
- ZFS is similar to BTRFS (at least from a container user's perspective).
- Pros:
- high performance
- high reliability (with e.g. data checksums)
- optional data compression and deduplication
- Cons:
- high memory usage
- not in upstream kernel
- It is available as a kernel module or through FUSE.
---
## Which one is the best?
- Eventually, overlay2 should be the best option.
- It is available on all modern systems.
- Its memory usage is better than Device Mapper, BTRFS, or ZFS.
- The remarks about *write performance* shouldn't bother you:
<br/>
data should always be stored in volumes anyway!