mirror of
https://github.com/jpetazzo/container.training.git
synced 2026-03-05 10:50:33 +00:00
<!DOCTYPE html>
<html>
<head>
  <base target="_blank">
  <title>Docker Orchestration Workshop</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    @import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
    @import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
    @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);

    body { font-family: 'Droid Serif'; }

    h1, h2, h3 {
      font-family: 'Yanone Kaffeesatz';
      font-weight: normal;
      margin-top: 0.5em;
    }
    a {
      text-decoration: none;
      color: blue;
    }
    .remark-slide-content { padding: 1em 2.5em 1em 2.5em; }

    .remark-slide-content { font-size: 25px; }
    .remark-slide-content h1 { font-size: 50px; }
    .remark-slide-content h2 { font-size: 50px; }
    .remark-slide-content h3 { font-size: 25px; }
    .remark-code { font-size: 25px; }
    .small .remark-code { font-size: 16px; }

    .remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
    .red { color: #fa0000; }
    .gray { color: #ccc; }
    .small { font-size: 70%; }
    .big { font-size: 140%; }
    .underline { text-decoration: underline; }
    .pic {
      vertical-align: middle;
      text-align: center;
      padding: 0 0 0 0 !important;
    }
    img {
      max-width: 100%;
      max-height: 550px;
    }
    .title {
      vertical-align: middle;
      text-align: center;
    }
    .title h1 { font-size: 100px; }
    .title p { font-size: 100px; }
    .quote {
      background: #eee;
      border-left: 10px solid #ccc;
      margin: 1.5em 10px;
      padding: 0.5em 10px;
      quotes: "\201C""\201D""\2018""\2019";
      font-style: italic;
    }
    .quote:before {
      color: #ccc;
      content: open-quote;
      font-size: 4em;
      line-height: 0.1em;
      margin-right: 0.25em;
      vertical-align: -0.4em;
    }
    .quote p {
      display: inline;
    }
    .warning {
      background-image: url("warning.png");
      background-size: 1.5em;
      background-repeat: no-repeat;
      padding-left: 2em;
    }
    .exercise {
      background-color: #eee;
      background-image: url("keyboard.png");
      background-size: 1.4em;
      background-repeat: no-repeat;
      background-position: 0.2em 0.2em;
      border: 2px dotted black;
    }
    .exercise::before {
      content: "Exercise";
      margin-left: 1.8em;
    }
    li p { line-height: 1.25em; }
  </style>
</head>
<body>
<textarea id="source">

class: title

Docker <br/> Orchestration <br/> Workshop

---

## Logistics

- Hello! We're `jerome at docker dot com` and `aj at soulshake dot net`

<!--
Reminder, when updating the agenda: when people are told to show
up at 9am, they usually trickle in until 9:30am (except for paid
training sessions). If you're not sure that people will be there
on time, it's a good idea to have a breakfast with the attendees
at e.g. 9am, and start at 9:30.

- Agenda:

.small[
- 08:00-09:00 hello and breakfast
- 09:00-10:25 part 1
- 10:25-10:35 coffee break
- 10:35-12:00 part 2
- 12:00-13:00 lunch break
- 13:00-14:25 part 3
- 14:25-14:35 coffee break
- 14:35-16:00 part 4
]

-->

- The tutorial will run from 1:20pm to 4:40pm

- There will be a break from 3:00pm to 3:15pm

- This will be FAST PACED, but DON'T PANIC!

- All the content is publicly available (slides, code samples, scripts)

<!--
Remember to change:
- the Gitter link below
- the "tweet my speed" hashtag in DockerCoins HTML
-->

- Live feedback, questions, help on
  [Gitter](http://container.training/chat)

---

<!--
grep '^# ' index.html | grep -v '<br' | tr '#' '-'
-->

## Chapter 1: getting started

- Pre-requirements
- VM environment
- Our sample application
- Running the application
- Identifying bottlenecks
- Scaling out
- Connecting to containers on other hosts
- Abstracting remote services with ambassadors

---

## Chapter 2: Swarm setup and deployment

- Dynamic orchestration
- Deploying Swarm
- Picking a key/value store
- Running containers on Swarm
- Resource allocation
- Multi-host networking
- Building images with Swarm
- Deploying a local registry
- Scaling web services with Compose on Swarm

---

## Chapter 3: Docker for Ops

- Logs
- Setting up ELK to store container logs
- Network traffic analysis
- Backups
- Controlling Docker from a container
- Docker events stream
- Security upgrades

---

## Chapter 4: high availability (additional content)

- Distributing Machine credentials
- Highly available Swarm managers
- Highly available containers
- Conclusions

---

# Pre-requirements

- Computer with network connection and SSH client

  - on Linux, OS X, FreeBSD... you are probably all set

  - on Windows, get [putty](http://www.putty.org/),
    [Git BASH](https://msysgit.github.io/), or
    [MobaXterm](http://mobaxterm.mobatek.net/)

- Basic Docker knowledge
  <br/>(but that's OK if you're not a Docker expert!)

---

## Nice-to-haves

- [GitHub](https://github.com/join) account
  <br/>(if you want to fork the repo; also used to join Gitter)

- [Gitter](https://gitter.im/) account
  <br/>(to join the conversation during the workshop)

- [Docker Hub](https://hub.docker.com) account
  <br/>(it's one way to distribute images on your Swarm cluster)

---

## Hands-on sections

- The whole workshop is hands-on

- I will show Docker in action

- I invite you to reproduce what I do

- All hands-on sections are clearly identified, like the gray rectangle below

.exercise[

- This is the stuff you're supposed to do!
- Go to [container.training](http://container.training/) to view these slides
- Join the chat room on
  [Gitter](http://container.training/chat)

]

---

# VM environment

- Each person gets 5 private VMs (not shared with anybody else)
- They'll be up until tonight
- You have a little card with login+password+IP addresses
- You can automatically SSH from one VM to another

.exercise[

<!--
```bash
for N in $(seq 1 5); do
  ssh -o StrictHostKeyChecking=no node$N true
done
for N in $(seq 1 5); do
  (
    docker-machine rm -f node$N
    ssh node$N "docker ps -aq | xargs -r docker rm -f"
    ssh node$N sudo rm -f /etc/systemd/system/docker.service
    ssh node$N sudo systemctl daemon-reload
    echo Restarting node$N.
    ssh node$N sudo systemctl restart docker
    echo Restarted node$N.
  ) &
done
wait
```
-->

- Log into the first VM (`node1`)
- Check that you can SSH (without password) to `node2`:
  ```bash
  ssh node2
  ```
- Type `exit` or `^D` to come back to node1

<!--
```meta
^D
```
-->

]

---

## We will (mostly) interact with node1 only

- Unless instructed, **all commands must be run from the first VM, `node1`**

- We will only checkout/copy the code on `node1`

- When we use the other nodes, we will do it mostly through the Docker API

- We will use SSH only for a few "out of band" operations (mass-removing containers...)

---

## Terminals

Once in a while, the instructions will say:
<br/>"Open a new terminal."

There are multiple ways to do this:

- create a new window or tab on your machine, and SSH into the VM;

- use screen or tmux on the VM and open a new window from there.

You are welcome to use whichever method you feel most comfortable with.

---

## Tmux cheatsheet

- Ctrl-b c → create a new window
- Ctrl-b n → go to next window
- Ctrl-b p → go to previous window
- Ctrl-b " → split window top/bottom
- Ctrl-b % → split window left/right
- Ctrl-b Alt-1 → rearrange windows in columns
- Ctrl-b Alt-2 → rearrange windows in rows
- Ctrl-b arrows → navigate to other windows
- Ctrl-b d → detach session
- tmux attach → reattach to session

---

## Brand new versions!

- Engine 1.11
- Compose 1.7
- Swarm 1.2
- Machine 0.6

.exercise[

- Check all installed versions:
  ```bash
  docker version
  docker-compose -v
  docker run --rm swarm -version
  docker-machine -v
  ```

]

---

## Why are we not using the latest version of Machine?

- The latest version of Machine is 0.7

- The way it deploys Swarm is different from 0.6

- This causes a regression in the strategy that we will use later

- More details later!

---

# Our sample application

- Visit the GitHub repository with all the materials of this workshop:
  <br/>https://github.com/jpetazzo/orchestration-workshop

- The application is in the [dockercoins](
  https://github.com/jpetazzo/orchestration-workshop/tree/master/dockercoins)
  subdirectory

- Let's look at the general layout of the source code:

  there is a Compose file [docker-compose.yml](
  https://github.com/jpetazzo/orchestration-workshop/blob/master/dockercoins/docker-compose.yml) ...

  ... and 4 other services, each in its own directory:

  - `rng` = web service generating random bytes
  - `hasher` = web service computing hash of POSTed data
  - `worker` = background process using `rng` and `hasher`
  - `webui` = web interface to watch progress

---

## Compose file format version

*Particularly relevant if you have used Compose before...*

- Compose 1.6 introduced support for a new Compose file format (aka "v2")

- Services are no longer at the top level, but under a `services` section

- There has to be a `version` key at the top level, with value `"2"` (as a string, not an integer)

- Containers are placed on a dedicated network, making links unnecessary

- There are other minor differences, but upgrade is easy and straightforward
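
Here is what a minimal "v2" file can look like (an illustrative sketch with made-up service names, not the actual DockerCoins file):

```yaml
version: "2"

services:
  redis:
    image: redis
  worker:
    build: worker
```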

---

## Links, naming, and service discovery

- Containers can have network aliases (resolvable through DNS)

- Compose file version 2 makes each container reachable through its service name

- Compose file version 1 requires "links" sections

- Our code can connect to services using their short name

  (instead of e.g. IP address or FQDN)

---

## Example in `worker/worker.py`



---

## What's this application?

---

class: pic



(DockerCoins 2016 logo courtesy of @XtlCnslt and @ndeloof. Thanks!)

---

## What's this application?

- It is a DockerCoin miner! 💰🐳📦🚢

- No, you can't buy coffee with DockerCoins

- How DockerCoins works:

  - `worker` asks `rng` to give it random bytes
  - `worker` feeds those random bytes into `hasher`
  - each hash starting with `0` is a DockerCoin
  - DockerCoins are stored in `redis`
  - `redis` is also updated every second to track speed
  - you can see the progress with the `webui`
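
The mining loop described above can be sketched in a few lines of Python (a simplified stand-in: the real `worker` calls the `rng` and `hasher` services over HTTP, and stores coins in `redis`):

```python
import hashlib
import os

def mine_once():
    """One iteration of the mining loop; returns a coin (a hash) or None."""
    random_bytes = os.urandom(32)                      # stand-in for the rng service
    digest = hashlib.sha256(random_bytes).hexdigest()  # stand-in for the hasher service
    # Each hash starting with "0" is a DockerCoin
    return digest if digest.startswith("0") else None

coins = [c for c in (mine_once() for _ in range(1000)) if c is not None]
print("Found", len(coins), "coins out of 1000 attempts")
```

On average, 1 hexadecimal digest out of 16 starts with `0`, so the worker only finds a coin every few attempts.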

---

## Getting the application source code

- We will clone the GitHub repository

- The repository also contains scripts and tools that we will use throughout the workshop

.exercise[

<!--
```bash
[ -d orchestration-workshop ] && mv orchestration-workshop orchestration-workshop.$$
```
-->

- Clone the repository on `node1`:
  ```bash
  git clone git://github.com/jpetazzo/orchestration-workshop
  ```

]

(You can also fork the repository on GitHub and clone your fork if you prefer that.)

---

# Running the application

Without further ado, let's start our application.

.exercise[

- Go to the `dockercoins` directory, in the cloned repo:
  ```bash
  cd ~/orchestration-workshop/dockercoins
  ```

- Use Compose to build and run all containers:
  ```bash
  docker-compose up
  ```

]

Compose tells Docker to build all container images (pulling
the corresponding base images), then starts all containers,
and displays aggregated logs.

---

## Lots of logs

- The application continuously generates logs

- We can see the `worker` service making requests to `rng` and `hasher`

- Let's put that in the background

.exercise[

- Stop the application by hitting `^C`

<!--
```meta
^C
```
-->

]

- `^C` stops all containers by sending them the `TERM` signal

- Some containers exit immediately, others take longer
  <br/>(because they don't handle `SIGTERM` and end up being killed after a 10s timeout)

---

## Restarting in the background

- Many flags and commands of Compose are modeled after those of `docker`

.exercise[

- Start the app in the background with the `-d` option:
  ```bash
  docker-compose up -d
  ```

- Check that our app is running with the `ps` command:
  ```bash
  docker-compose ps
  ```

]

`docker-compose ps` also shows the ports exposed by the application.

---

## Viewing logs

- The `docker-compose logs` command works like `docker logs`

.exercise[

- View all logs since container creation and exit when done:
  ```bash
  docker-compose logs
  ```

- Stream container logs, starting at the last 10 lines for each container:
  ```bash
  docker-compose logs --tail 10 --follow
  ```

<!--
```meta
^C
```
-->

]

Tip: use `^S` and `^Q` to pause/resume log output.

???

## Upgrading from Compose 1.6

.warning[The `logs` command has changed between Compose 1.6 and 1.7!]

- Up to 1.6

  - `docker-compose logs` is the equivalent of `logs --follow`

  - `docker-compose logs` must be restarted if containers are added

- Since 1.7

  - `--follow` must be specified explicitly

  - new containers are automatically picked up by `docker-compose logs`

---

## Connecting to the web UI

- The `webui` container exposes a web dashboard; let's view it

.exercise[

- Open http://[yourVMaddr]:8000/ (from a browser)

]

- The app actually has a constant, steady speed (3.33 coins/second)

- The speed seems not-so-steady because:

  - the worker doesn't update the counter after every loop, but up to once per second

  - the speed is computed by the browser, checking the counter about once per second

  - between two consecutive updates, the counter will increase either by 4, or by 0

---

## Scaling up the application

- Our goal is to make that performance graph go up (without changing a line of code!)

- Before trying to scale the application, we'll figure out if we need more resources

  (CPU, RAM...)

- For that, we will use good old UNIX tools on our Docker node

<!-- FIXME add reference to cadvisor, snap, ...? -->

---

## Looking at resource usage

- Let's look at CPU, memory, and I/O usage

.exercise[

- run `top` to see CPU and memory usage (you should see idle cycles)

- run `vmstat 3` to see I/O usage (si/so/bi/bo)
  <br/>(the 4 numbers should be almost zero, except `bo` for logging)

]

We have available resources.

- Why?
- How can we use them?

---

## Scaling workers on a single node

- Docker Compose supports scaling
- Let's scale `worker` and see what happens!

.exercise[

- Start one more `worker` container:
  ```bash
  docker-compose scale worker=2
  ```

- Look at the performance graph (it should show a 2x improvement)

- Look at the aggregated logs of our containers (`worker_2` should show up)

- Look at the impact on CPU load with e.g. top (it should be negligible)

]

---

## Adding more workers

- Great, let's add more workers and call it a day, then!

.exercise[

- Start eight more `worker` containers:
  ```bash
  docker-compose scale worker=10
  ```

- Look at the performance graph: does it show a 10x improvement?

- Look at the aggregated logs of our containers

- Look at the impact on CPU load and memory usage

<!--
```bash
sleep 5
killall docker-compose
```
-->

]

---

# Identifying bottlenecks

- You should have seen a 3x speed bump (not 10x)

- Adding workers didn't result in linear improvement

- *Something else* is slowing us down

--

- ... But what?

--

- The code doesn't have instrumentation

- Let's use state-of-the-art HTTP performance analysis!
  <br/>(i.e. good old tools like `ab`, `httping`...)

---

## Measuring latency under load

We will use `httping`.

.exercise[

- Check the latency of `rng`:
  ```bash
  httping -c 10 localhost:8001
  ```

- Check the latency of `hasher`:
  ```bash
  httping -c 10 localhost:8002
  ```

]

`rng` has a much higher latency than `hasher`.

---

## Let's draw hasty conclusions

- The bottleneck seems to be `rng`

- *What if* we don't have enough entropy and can't generate enough random numbers?

- We need to scale out the `rng` service on multiple machines!

Note: this is a fiction! We have enough entropy. But we need a pretext to scale out.
<br/>(In fact, the code of `rng` uses `/dev/urandom`, which doesn't need entropy.)

---

class: title

# Scaling out

---

# Connecting to containers on other hosts

- So far, our whole stack is on a single machine

- We want to scale out (across multiple nodes)

- We will deploy the same stack multiple times

- But we want every stack to use the same Redis
  <br/>(in other words: Redis is our only *stateful* service here)

--

- And remember: we're not allowed to change the code!

  - the code connects to host `redis`
  - `redis` must resolve to the address of our Redis service
  - the Redis service must listen on the default port (6379)

???

## Using custom DNS mapping

- We could set up a Redis server on its default port

- And add a DNS entry mapping `redis` to this server

.exercise[

- See what happens if we run:
  ```bash
  docker run --add-host redis:1.2.3.4 alpine ping redis
  ```

<!--
```meta
^C
```
-->

]

There is a Compose file option for that: `extra_hosts`.

---

# Abstracting remote services with ambassadors

<!--

- What if we can't/won't run Redis on its default port?

- What if we want to be able to move it easily?

-->

- We will use an ambassador

- Redis will be started independently of our stack

- It will run at an arbitrary location (host+port)

- In our stack, we replace `redis` with an ambassador

- The ambassador will connect to Redis

- The ambassador will "act as" Redis in the stack

---

class: pic



---

class: pic



---

class: pic



---

class: pic



---

class: pic



---

class: pic



---

class: pic



---

## Start redis

- Start a standalone Redis container

- Let Docker expose it on a random port

.exercise[

- Run redis with a random public port:
  <br/>`docker run -d -P --name myredis redis`

- Check which port was allocated:
  <br/>`docker port myredis 6379`

]

- Note the IP address of the machine, and this port

---

## Introduction to `jpetazzo/hamba`

- General purpose load balancer and traffic director

- [Source code is available on GitHub](
  https://github.com/jpetazzo/hamba)

- [Public image is available on the Docker Hub](
  https://hub.docker.com/r/jpetazzo/hamba/)

- Generates a configuration file for HAProxy, then starts HAProxy

- Parameters are provided on the command line; for instance:
  ```bash
  docker run -d -p 80 jpetazzo/hamba 80 www1:1234 www2:2345
  docker run -d -p 80 jpetazzo/hamba 80 www1 1234 www2 2345
  ```
  Those two commands do the same thing: they start a load balancer
  listening on port 80, balancing traffic across www1:1234 and www2:2345

---

## Update `docker-compose.yml`

.exercise[

- Replace `redis` with an ambassador using `jpetazzo/hamba`:
  ```yaml
  redis:
    image: jpetazzo/hamba
    command: 6379 `AA.BB.CC.DD:EEEEE`
  ```

<!--
```edit
cat docker-compose.yml-ambassador | sed "s/AA.BB.CC.DD/$(curl myip.enix.org/REMOTE_ADDR)/" | sed "s/EEEEE/$(docker port myredis 6379 | cut -d: -f2)/" > docker-compose.yml
```
-->

]

Shortcut: `docker-compose.yml-ambassador`
<br/>(But you still have to update `AA.BB.CC.DD:EEEEE`!)

---

## Start the stack on the first machine

- Compose will detect the change in the `redis` service

- It will replace `redis` with a `jpetazzo/hamba` instance

.exercise[

- Just tell Compose to do its thing:
  <br/>`docker-compose up -d`

- Check that the stack is up and running:
  <br/>`docker-compose ps`

- Look at the web UI to make sure that it works fine

]

---

## Controlling other Docker Engines

- Many tools in the ecosystem will honor the `DOCKER_HOST` environment variable

- Those tools include (obviously!) the Docker CLI and Docker Compose

- Our training VMs have been set up to accept API requests on port 55555
  <br/>(without authentication - this is very insecure, by the way!)

- We will see later how to set up mutual authentication with certificates

---

## Setting the `DOCKER_HOST` environment variable

.exercise[

- Check how many containers are running on `node1`:
  ```bash
  docker ps
  ```

- Set the `DOCKER_HOST` variable to control `node2`, and compare:
  ```bash
  export DOCKER_HOST=tcp://node2:55555
  docker ps
  ```

]

You shouldn't see any container running on `node2` at this point.

---

## Start the stack on another machine

- We will tell Compose to bring up our stack on the other node

- It will use the local code (we don't need to check out the code on `node2`)

.exercise[

- Start the stack:
  ```bash
  docker-compose up -d
  ```

]

Note: this will build the container images on `node2`, resulting
in potentially different results from `node1`. We will see later
how to use the same images across the whole cluster.

---

## Run the application on every node

- We will repeat the previous step with a little shell loop

  ... but introduce parallelism to save some time

.exercise[

- Deploy one instance of the stack on each node:

  ```bash
  for N in 3 4 5; do
    DOCKER_HOST=tcp://node$N:55555 docker-compose up -d &
  done
  wait
  ```

]

Note: again, this will rebuild the container images on each node.

---

## Scale!

- The app is built (and running!) everywhere

- Scaling can be done very quickly

.exercise[

- Add a bunch of workers all over the place:

  ```bash
  for N in 1 2 3 4 5; do
    DOCKER_HOST=tcp://node$N:55555 docker-compose scale worker=10
  done
  ```

- Admire the result in the web UI!

]

---

## A few words about development volumes

- Try to access the web UI on another node

--

- It doesn't work! Why?

--

- Static assets are masked by an empty volume

--

- We need to comment out the `volumes` section

---

## Why must we comment out the `volumes` section?

- Volumes have multiple uses:

  - storing persistent stuff (database files...)

  - sharing files between containers (logs, configuration...)

  - sharing files between host and containers (source...)

- The `volumes` directive expands to a host path:

  `/home/docker/orchestration-workshop/dockercoins/webui/files`

- This host path exists on the local machine (not on the others)

- This specific volume is used in development (not in production)
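
As an illustration, the development-time declaration looks something like this (a sketch; refer to the actual `docker-compose.yml` for the exact paths):

```yaml
webui:
  build: webui
  ports:
    - "8000:80"
  volumes:
    - "./webui/files/:/files/"
```

The relative path `./webui/files/` resolves to a directory that only exists on the machine where the repository was cloned.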
|
|
|
|
---
|
|
|
|
## Stop the app
|
|
|
|
- Let's use `docker-compose down`
|
|
|
|
- It will stop and remove the DockerCoins app (but leave other containers running)
|
|
|
|
.exercise[
|
|
|
|
- We can do another simple parallel shell loop:
|
|
```bash
|
|
for N in $(seq 1 5); do
|
|
export DOCKER_HOST=tcp://node$N:55555
|
|
docker-compose down &
|
|
done
|
|
wait
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Clean up the redis container
|
|
|
|
- `docker-compose down` only removes containers defined with Compose
|
|
|
|
.exercise[
|
|
|
|
- Check that `myredis` is still there:
|
|
```bash
|
|
unset DOCKER_HOST
|
|
docker ps
|
|
```
|
|
|
|
- Remove it:
|
|
```bash
|
|
docker rm -f myredis
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Considerations about ambassadors
|
|
|
|
"Ambassador" is a design pattern.
|
|
|
|
There are many ways to implement it.
|
|
|
|
Others implementations include:
|
|
|
|
- [interlock](https://github.com/ehazlett/interlock);
|
|
- [registrator](http://gliderlabs.com/registrator/latest/);
|
|
- [smartstack](http://nerds.airbnb.com/smartstack-service-discovery-cloud/);
|
|
- [zuul](https://github.com/Netflix/zuul/wiki);
|
|
- and more!
|
|
|
|
<!--
|
|
|
|
We will present three increasingly complex (but also powerful)
|
|
ways to deploy ambassadors.
|
|
|
|
-->
|
|
|
|
???
|
|
|
|
## Single-tier ambassador deployment
|
|
|
|
- One-shot configuration process
|
|
|
|
- Must be executed manually after each scaling operation
|
|
|
|
- Scans current state, updates load balancer configuration
|
|
|
|
- Pros:
|
|
<br/>- simple, robust, no extra moving part
|
|
<br/>- easy to customize (thanks to simple design)
|
|
<br/>- can deal efficiently with large changes
|
|
|
|
- Cons:
|
|
<br/>- must be executed after each scaling operation
|
|
<br/>- harder to compose different strategies
|
|
|
|
- Example: this workshop
|
|
|
|
???
|
|
|
|
## Two-tier ambassador deployment
|
|
|
|
- Daemon listens to Docker events API
|
|
|
|
- Reacts to container start/stop events
|
|
|
|
- Adds/removes back-ends to load balancers configuration
|
|
|
|
- Pros:
|
|
<br/>- no extra step required when scaling up/down
|
|
|
|
- Cons:
|
|
<br/>- extra process to run and maintain
|
|
<br/>- deals with one event at a time (ordering matters)
|
|
|
|
- Hidden gotcha: load balancer creation
|
|
|
|
- Example: interlock
|
|
|
|
???
|
|
|
|
## Three-tier ambassador deployment
|
|
|
|
|
|
- Daemon listens to Docker events API
|
|
|
|
- Reacts to container start/stop events
|
|
|
|
- Adds/removes scaled services in distributed config DB (Zookeeper, etcd, Consul…)
|
|
|
|
- Another daemon listens to config DB events,
|
|
<br/>adds/removes backends to load balancers configuration
|
|
|
|
- Pros:
|
|
<br/>- more flexibility
|
|
|
|
- Cons:
|
|
<br/>- three extra services to run and maintain
|
|
|
|
- Example: registrator
|
|
|
|
---
|
|
|
|
## Ambassadors and overlay networks
|
|
|
|
- Overlay networks allow direct multi-host communication
|
|
|
|
- Ambassadors are still useful to implement other tasks:
|
|
|
|
- load balancing;
|
|
|
|
- credentials injection;
|
|
|
|
- instrumentation;
|
|
|
|
- fail-over;
|
|
|
|
- etc.
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# Dynamic orchestration
|
|
|
|
---
|
|
|
|
## Static vs Dynamic
|
|
|
|
- Static
|
|
|
|
- you decide what goes where
|
|
|
|
- simple to describe and implement
|
|
|
|
- seems easy at first but doesn't scale efficiently
|
|
|
|
- Dynamic
|
|
|
|
- the system decides what goes where
|
|
|
|
- requires extra components (HA KV...)
|
|
|
|
- scaling can be finer-grained, more efficient
|
|
|
|
---
|
|
|
|
class: pic
|
|
|
|
## Hands-on Swarm
|
|
|
|

|
|
|
|
---
|
|
|
|
## Swarm (in theory)
|
|
|
|
- Consolidates multiple Docker hosts into a single one
|
|
|
|
- You talk to Swarm using the Docker API
|
|
|
|
→ you can use all existing tools: Docker CLI, Docker Compose, etc.
|
|
|
|
- Swarm talks to your Docker Engines using the Docker API too
|
|
|
|
→ you can use existing Engines without modification
|
|
|
|
- Dispatches (schedules) your containers across the cluster, transparently
|
|
|
|
- Open source and written in Go (like the Docker Engine)
|
|
|
|
- Initial design and implementation by [@aluzzardi](https://twitter.com/aluzzardi) and [@vieux](https://twitter.com/vieux),
|
|
who were also the authors of the first versions of the Docker Engine
|
|
|
|
---
|
|
|
|
## Swarm (in practice)
|
|
|
|
- Stable since November 2015
|
|
|
|
- Easy to setup (compared to other orchestrators)
|
|
|
|
- Tested with 1000 nodes + 50000 containers
|
|
<br/>.small[(without particular tuning; see DockerCon EU opening keynotes!)]
|
|
|
|
- Requires a key/value store for advanced features
|
|
|
|
- Can use Consul, etcd, or Zookeeper
|
|
|
|
---
|
|
|
|
# Deploying Swarm
|
|
|
|
- Components involved:
|
|
|
|
- cluster discovery mechanism
|
|
<br/>(so that the manager can learn about the nodes)
|
|
|
|
- Swarm manager
|
|
<br/>(your frontend to the cluster)
|
|
|
|
- Swarm agent
|
|
<br/>(runs on each node, registers it with service discovery)
|
|
|
|
---
|
|
|
|
## Cluster discovery
|
|
|
|
- Possible backends:
|
|
|
|
- dynamic, self-hosted
|
|
<br/>(requires to run a Consul/etcd/Zookeeper cluster)
|
|
|
|
- static, through command-line or file
|
|
<br/>(great for testing, or for private subnets, see [this article](
|
|
https://medium.com/on-docker/docker-swarm-flat-file-engine-discovery-2b23516c71d4#.6vp94h5wn)
|
|
|
|
- external, token-based
|
|
<br/>(dynamic; nothing to operate; relies on external service operated by Docker Inc.)
|
|
|
|
---
|
|
|
|
## Swarm agent
|
|
|
|
- Used only for dynamic discovery (ZK, etcd, Consul, token)
|
|
|
|
- Must run on each node
|
|
|
|
- Every 20s (by default), tells to the discovery system:
|
|
|
|
*"Hello, there is a Swarm node at A.B.C.D:EFGH"*
|
|
|
|
- Must know the node's IP address
|
|
|
|
(It cannot figure it out by itself, because it doesn't know whether to use public or private addresses)
|
|
|
|
- The node continues to work even if the agent dies
|
|
|
|
---
|
|
|
|
## Swarm manager
|
|
|
|
- Accepts Docker API requests
|
|
|
|
- Communicates with the cluster nodes
|
|
|
|
- Performs healthchecks, scheduling...
|
|
|
|
---
|
|
|
|
# Picking a key/value store
|
|
|
|
- We are going to use a key/value store, and use it for:
|
|
|
|
- cluster membership discovery
|
|
|
|
- overlay networks backend
|
|
|
|
- resilient storage of important credentials
|
|
|
|
- Swarm leader election
|
|
|
|
- We are going to use Consul, and run one Consul instance on each node
|
|
|
|
(That way, we can always access Consul over localhost)
|
|
|
|
---
|
|
|
|
## Do we really need a key/value store?
|
|
|
|
- Cluster membership discovery doesn't *require* a key/value store
|
|
|
|
(We could use the token mechanism instead)
|
|
|
|
- Network overlays don't *require* a key/value store
|
|
|
|
(We could use a plugin like Weave instead)
|
|
|
|
- Credentials can be distributed through other mechanisms
|
|
|
|
(E.g. copying them to a private S3 bucket)
|
|
|
|
- Swarm leader election, however, requires a key/value store
|
|
|
|
---
|
|
|
|
## Why are we using a key/value store, then?
|
|
|
|
- Each aforementioned mechanism requires some reliable, distributed storage
|
|
|
|
- If we don't use our own key/value store, we end up using *something else*:
|
|
|
|
- Docker Inc.'s centralized token discovery service
|
|
|
|
- [Weave's CRDT protocol](https://github.com/weaveworks/weave/wiki/IP-allocation-design)
|
|
|
|
- AWS S3 (or your cloud provider's equivalent, or some other file storage system)
|
|
|
|
- Each of those is one extra potential point of failure
|
|
|
|
- See for instance [Kyle Kingsbury's analysis of Chronos](https://aphyr.com/posts/326-jepsen-chronos) for an illustration of this problem
|
|
|
|
- By operating our own key/value store, we have 1 extra service instead of 3 (or more)
|
|
|
|
---
|
|
|
|
## Should we always use a key/value store?
|
|
|
|
--
|
|
|
|
- No!
|
|
|
|
--
|
|
|
|
- If you don't want to operate your own key/value store, don't do it
|
|
|
|
- You might be more comfortable using tokens + Weave + S3, for instance
|
|
|
|
- You can also use static discovery
|
|
|
|
- Maybe you don't even need overlay networks
|
|
|
|
---
|
|
|
|
## Why Consul?
|
|
|
|
- Consul is not the "official" or best way to do this
|
|
|
|
- This is an arbitrary decision made by Yours Truly
|
|
|
|
- I *personally* find Consul easier to set up for a workshop like this
|
|
|
|
- ... But etcd and Zookeeper will work too!
|
|
|
|
---
|
|
|
|
## Setting up our Swarm cluster
|
|
|
|
We need to:
|
|
|
|
- create certificates,
|
|
|
|
- distribute them on our nodes,
|
|
|
|
- run the Swarm agent on every node,
|
|
|
|
- run the Swarm manager on `node1`,
|
|
|
|
- reconfigure the Engine on each node to add extra flags (for overlay networks).
|
|
|
|
That's a lot of work, so we'll use Docker Machine to automate this.
|
|
|
|
---
|
|
|
|
## Using Docker Machine to setup a Swarm cluster
|
|
|
|
- Docker Machine has two primary uses:
|
|
|
|
- provisioning cloud instances running the Docker Engine
|
|
|
|
- managing local Docker VMs within e.g. VirtualBox
|
|
|
|
- It can also create Swarm clusters, and will:
|
|
|
|
- create and manage certificates
|
|
|
|
- automatically start swarm agent and manager containers
|
|
|
|
- It comes with a special driver, `generic`, to (re)configure existing machines
|
|
|
|
---
|
|
|
|
## Setting up Docker Machine
|
|
|
|
- Install `docker-machine` (single binary download)
|
|
|
|
(This is already done on your VMs!)
|
|
|
|
- Set a few environment variables (cloud credentials)
|
|
```bash
|
|
export AWS_ACCESS_KEY_ID=AKI...
|
|
export AWS_SECRET_ACCESS_KEY=...
|
|
export AWS_DEFAULT_REGION=eu-west-2
|
|
export DIGITALOCEAN_ACCESS_TOKEN=...
|
|
export DIGITALOCEAN_SIZE=2gb
|
|
export AZURE_SUBSCRIPTION_ID=...
|
|
```
|
|
|
|
(We already have 5 nodes, so we don't need to do this!)
|
|
|
|
---
|
|
|
|
## Creating nodes with Docker Machine
|
|
|
|
- The only two mandatory parameters are the driver to use, and the machine name:
|
|
```bash
|
|
docker-machine create -d digitalocean node42
|
|
```
|
|
|
|
- *Tons* of parameters can be specified; see [Docker Machine driver documentation](https://docs.docker.com/machine/drivers/)
|
|
|
|
- To list machines and their status:
|
|
```bash
|
|
docker-machine ls
|
|
```
|
|
|
|
- To destroy a machine:
|
|
```bash
|
|
docker-machine rm node42
|
|
```
|
|
|
|
---
|
|
|
|
## Communicating with nodes managed by Docker Machine
|
|
|
|
- Select a machine for use:
|
|
```bash
|
|
eval $(docker-machine env node42)
|
|
```
|
|
This will set a few environment variables (at least `DOCKER_HOST`).
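For reference, the command being eval'ed prints a few `export` statements along these lines (the values below are made up):

```shell
# Typical output of `docker-machine env node42` (values are examples)
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://203.0.113.42:2376"
export DOCKER_CERT_PATH="$HOME/.docker/machine/machines/node42"
export DOCKER_MACHINE_NAME="node42"
# Once eval'ed, the Docker CLI sends its API calls to that host
echo "$DOCKER_HOST"
```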
|
|
|
|
- Execute regular commands with Docker, Compose, etc.
|
|
|
|
(They will pick up remote host address from environment)
|
|
|
|
- If you need to go under the hood, you can get SSH access:
|
|
```bash
|
|
docker-machine ssh node42
|
|
```
|
|
|
|
---
|
|
|
|
## Docker Machine `generic` driver
|
|
|
|
- Most drivers work the same way:
|
|
|
|
- use cloud API to create instance
|
|
|
|
- connect to instance over SSH
|
|
|
|
- install Docker
|
|
|
|
- The `generic` driver skips the first step
|
|
|
|
- It can install Docker on any machine, as long as you have SSH access
|
|
|
|
- We will use that!
|
|
|
|
---
|
|
|
|
## Setting up Swarm with Docker Machine
|
|
|
|
When invoking Machine, we will provide three sets of parameters:
|
|
|
|
- the machine driver to use (`generic`) and the SSH connection information
|
|
|
|
- Swarm-specific options indicating the cluster membership discovery mechanism
|
|
|
|
- Extra flags to be passed to the Engine, to enable overlay networks
|
|
|
|
---
|
|
|
|
## Provisioning the first node
|
|
|
|
.exercise[
|
|
|
|
- Use the following command to provision the manager node:
|
|
|
|
<!--
|
|
```placeholder
|
|
AA.BB.CC.DD $(getent hosts node1 | awk '{print $1}')
|
|
```
|
|
-->
|
|
|
|
```bash
|
|
docker-machine create --driver generic \
|
|
--engine-opt cluster-store=consul://localhost:8500 \
|
|
--engine-opt cluster-advertise=eth0:2376 \
|
|
--swarm --swarm-master --swarm-discovery consul://localhost:8500 \
|
|
--generic-ssh-user docker --generic-ip-address `AA.BB.CC.DD` node1
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Provisioning the other nodes
|
|
|
|
- The command is almost the same, but without the `--swarm-master` flag
|
|
|
|
- We will use a shell snippet for convenience
|
|
|
|
.exercise[
|
|
|
|
```bash
|
|
grep node[2345] /etc/hosts | grep -v ^127 |
|
|
while read IPADDR NODENAME
|
|
do docker-machine create --driver generic \
|
|
--engine-opt cluster-store=consul://localhost:8500 \
|
|
--engine-opt cluster-advertise=eth0:2376 \
|
|
--swarm --swarm-discovery consul://localhost:8500 \
|
|
--generic-ssh-user docker \
|
|
--generic-ip-address $IPADDR $NODENAME
|
|
done
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Check what we did
|
|
|
|
Let's connect to the first node *individually*.
|
|
|
|
.exercise[
|
|
|
|
- Select the node with Machine
|
|
|
|
```bash
|
|
eval $(docker-machine env node1)
|
|
```
|
|
|
|
- Execute some Docker commands
|
|
|
|
```bash
|
|
docker version
|
|
docker info
|
|
```
|
|
|
|
]
|
|
|
|
In the output of `docker info`, we should see `Cluster store` and `Cluster advertise`.
|
|
|
|
---
|
|
|
|
## Interact with the node
|
|
|
|
Let's try a few basic Docker commands on this node.
|
|
|
|
.exercise[
|
|
|
|
- Run a simple container:
|
|
```bash
|
|
docker run --rm busybox echo hello world
|
|
```
|
|
|
|
- See running containers:
|
|
```bash
|
|
docker ps
|
|
```
|
|
|
|
]
|
|
|
|
Two containers should show up: the agent and the manager.
|
|
|
|
---
|
|
|
|
## Connect to the Swarm cluster
|
|
|
|
Now, let's try the same operations, but when talking to the Swarm manager.
|
|
|
|
.exercise[
|
|
|
|
- Select the Swarm manager with Machine:
|
|
|
|
```bash
|
|
eval $(docker-machine env node1 --swarm)
|
|
```
|
|
|
|
- Execute some Docker commands
|
|
|
|
```bash
|
|
docker version
|
|
docker info
|
|
docker ps
|
|
```
|
|
|
|
]
|
|
|
|
The output is different! Let's review this.
|
|
|
|
---
|
|
|
|
## `docker version`
|
|
|
|
Swarm identifies itself clearly:
|
|
|
|
```
|
|
Client:
|
|
Version: 1.11.1
|
|
API version: 1.23
|
|
Go version: go1.5.4
|
|
Git commit: 5604cbe
|
|
Built: Tue Apr 26 23:38:55 2016
|
|
OS/Arch: linux/amd64
|
|
|
|
Server:
|
|
Version: swarm/1.2.2
|
|
API version: 1.22
|
|
Go version: go1.5.4
|
|
Git commit: 34e3da3
|
|
Built: Mon May 9 17:03:22 UTC 2016
|
|
OS/Arch: linux/amd64
|
|
```
|
|
|
|
---
|
|
|
|
## `docker info`
|
|
|
|
The output of `docker info` on Swarm shows a number of differences from
|
|
the output on a single Engine:
|
|
|
|
.small[
|
|
```
|
|
Containers: 0
|
|
Running: 0
|
|
Paused: 0
|
|
Stopped: 0
|
|
Images: 0
|
|
Server Version: swarm/1.2.2
|
|
Role: primary
|
|
Strategy: spread
|
|
Filters: health, port, containerslots, dependency, affinity, constraint
|
|
Nodes: 0
|
|
Plugins:
|
|
Volume:
|
|
Network:
|
|
Kernel Version: 4.2.0-36-generic
|
|
Operating System: linux
|
|
Architecture: amd64
|
|
CPUs: 0
|
|
Total Memory: 0 B
|
|
Name: node1
|
|
Docker Root Dir:
|
|
Debug mode (client): false
|
|
Debug mode (server): false
|
|
WARNING: No kernel memory limit support
|
|
```
|
|
]
|
|
---
|
|
|
|
## Why zero nodes?
|
|
|
|
- We haven't started Consul yet
|
|
|
|
- Swarm discovery is not operational
|
|
|
|
- Swarm can't discover the nodes
|
|
|
|
Note: Docker will start (and be functional) without a K/V store.
|
|
|
|
This lets us run Consul itself in a container.
|
|
|
|
---
|
|
|
|
## Adding Consul
|
|
|
|
- We will run Consul in containers
|
|
|
|
- We will use the [Consul official image](
|
|
https://hub.docker.com/_/consul/) that was released *very recently*
|
|
|
|
- We will tell Docker to automatically restart it on reboots
|
|
|
|
- To simplify network setup, we will use `host` networking
|
|
|
|
---
|
|
|
|
## A few words about `host` networking
|
|
|
|
- Consul needs to be aware of its actual IP address (seen by other nodes)
|
|
|
|
- It also binds a bunch of different ports
|
|
|
|
- It makes sense (from a security point of view) to have Consul listening on localhost only
|
|
|
|
(and have "users", i.e. Engine, Swarm, etc. connect over localhost)
|
|
|
|
- Therefore, we will use `host` networking!
|
|
|
|
- Also: Docker Machine 0.6 starts the Swarm containers in `host` networking ...
|
|
|
|
- ... but Docker Machine 0.7 doesn't (which is why we stick to 0.6 for now)
|
|
|
|
---
|
|
|
|
## Consul fundamentals (if I must give you just one slide...)
|
|
|
|
- Consul nodes can be "just an agent" or "server"
|
|
|
|
- From the client's perspective, they behave the same
|
|
|
|
- Only servers are members in the Raft consensus / leader election / etc
|
|
|
|
(non-server agents forward requests to a server)
|
|
|
|
- All nodes must be told the address of at least one other node to join
|
|
|
|
(except for the first node, where this is optional)
|
|
|
|
- At least the first nodes must be told how many servers to expect, in order to establish quorum
|
|
|
|
- Consul can have only one "truth" at a time (hence the importance of quorum)
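Quorum here means a strict majority of the server nodes; a quick sketch of the arithmetic (this is why 3- or 5-server clusters are typical):

```shell
# Quorum for N Consul servers is floor(N/2) + 1
for N in 1 2 3 4 5; do
  echo "servers=$N quorum=$(( N / 2 + 1 ))"
done
```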
|
|
|
|
---
|
|
|
|
## Starting our Consul cluster
|
|
|
|
.exercise[
|
|
|
|
- Make sure you're logged into `node1`, and:
|
|
|
|
```bash
|
|
IPADDR=$(ip a ls dev eth0 | sed -n 's,.*inet \(.*\)/.*,\1,p')
|
|
for N in 1 2 3 4 5; do
|
|
ssh node$N -- docker run -d --restart=always --name consul_node$N \
|
|
-e CONSUL_BIND_INTERFACE=eth0 --net host consul \
|
|
agent -server -retry-join $IPADDR -bootstrap-expect 5 \
|
|
-ui -client 0.0.0.0
|
|
done
|
|
```
|
|
|
|
]
|
|
|
|
Note: in production, you probably want to remove `-client 0.0.0.0` since it
|
|
gives public access to your cluster! Also adapt `-bootstrap-expect` to your number of server nodes.
|
|
|
|
---
|
|
|
|
## Check that our Consul cluster is up
|
|
|
|
- With your browser, navigate to any instance on port 8500
|
|
<br/>(in "NODES" you should see the five nodes)
|
|
|
|
- Let's run a couple of useful Consul commands
|
|
|
|
.exercise[
|
|
|
|
- Ask Consul the list of members it knows:
|
|
```bash
|
|
docker run --net host --rm consul members
|
|
```
|
|
|
|
- Ask Consul which node is the current leader:
|
|
```bash
|
|
curl localhost:8500/v1/status/leader
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Check that our Swarm cluster is up
|
|
|
|
.exercise[
|
|
|
|
- Try again the `docker info` from earlier:
|
|
|
|
```bash
|
|
eval $(docker-machine env --swarm node1)
|
|
docker info
|
|
docker ps
|
|
```
|
|
|
|
]
|
|
|
|
All nodes should be visible. (If not, give them a minute or two to register.)
|
|
|
|
The Consul containers should be visible.
|
|
|
|
The Swarm containers, however, are hidden by Swarm (unless you use `docker ps -a`).
|
|
|
|
---
|
|
|
|
# Running containers on Swarm
|
|
|
|
Try to run a few `busybox` containers.
|
|
|
|
Then, let's get serious:
|
|
|
|
.exercise[
|
|
|
|
- Start a Redis service:
|
|
<br/>`docker run -dP redis`
|
|
|
|
- See the service address:
|
|
<br/>`docker port $(docker ps -lq) 6379`
|
|
|
|
]
|
|
|
|
This can be any of your five nodes.
|
|
|
|
---
|
|
|
|
## Scheduling strategies
|
|
|
|
- Random: pick a node at random
|
|
<br/>(but honor resource constraints)
|
|
|
|
- Spread: pick the node with the least containers
|
|
<br/>(including stopped containers)
|
|
|
|
- Binpack: try to maximize resource usage
|
|
<br/>(in other words: use as few hosts as possible)
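To illustrate the difference, here is a toy version of the *spread* decision, picking the node with the fewest containers (the counts are made up):

```shell
# "spread" in one pipeline: sort nodes by container count, take the lowest
printf '%s\n' "node1 3" "node2 1" "node3 2" |
  sort -k2 -n | head -1 | cut -d' ' -f1   # prints: node2
```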
|
|
|
|
---
|
|
|
|
# Resource allocation
|
|
|
|
- Swarm can honor resource reservations
|
|
|
|
- This requires containers to be started with resource limits
|
|
|
|
- Swarm refuses to schedule a container if it cannot honor a reservation
|
|
|
|
.exercise[
|
|
|
|
- Start Redis containers with 1 GB of RAM until Swarm refuses to start more:
|
|
```bash
|
|
docker run -d -m 1G redis
|
|
```
|
|
|
|
]
|
|
|
|
On a cluster of 5 nodes with ~3.8 GB of RAM per node, Swarm will refuse to start the 16th container.
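A rough back-of-the-envelope check, assuming Swarm's default 5% overcommit: each node can schedule about 3.8 × 1.05 ≈ 3.99 GB, i.e. three 1 GB reservations, so:

```shell
# 3 one-gigabyte containers per node, times 5 nodes
echo $(( 3 * 5 ))   # prints: 15  (the 16th container cannot be placed)
```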
|
|
|
|
---
|
|
|
|
## Removing our Redis containers
|
|
|
|
- Let's use a little bit of shell scripting
|
|
|
|
.exercise[
|
|
|
|
- Remove all containers using the redis image:
|
|
```bash
|
|
docker ps | awk '/redis/ {print $1}' | xargs docker rm -f
|
|
```
|
|
|
|
]
|
|
|
|
???
|
|
|
|
## Things to know about resource allocation
|
|
|
|
- `docker info` shows resource allocation for each node
|
|
|
|
- Swarm allows a 5% resource overcommit (tunable)
|
|
|
|
- Containers without resource reservation can always be started
|
|
|
|
- Resources of stopped containers are still counted as being reserved
|
|
|
|
- this guarantees that it will be possible to restart a stopped container
|
|
|
|
- containers have to be deleted to free up their resources
|
|
|
|
- `docker update` can be used to change resource allocation on the fly
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# Setting up overlay networks
|
|
|
|
---
|
|
|
|
# Multi-host networking
|
|
|
|
- Docker 1.9 has the concept of *networks*
|
|
|
|
- By default, containers are on the default "bridge" network
|
|
|
|
- You can create additional networks
|
|
|
|
- Containers can be on multiple networks
|
|
|
|
- Containers can dynamically join/leave networks
|
|
|
|
- The "overlay" driver lets networks span multiple hosts
|
|
|
|
- Containers can have "network aliases" resolvable through DNS
|
|
|
|
---
|
|
|
|
## Manipulating networks, names, and aliases
|
|
|
|
- The preferred method is to let Compose do the heavy lifting for us
|
|
|
|
(YAML-defined networking!)
|
|
|
|
- But if we really need to, we can use the Docker CLI, with:
|
|
|
|
`docker network ...`
|
|
|
|
`docker run --net ... --net-alias ...`
|
|
|
|
- The following slides illustrate those commands
|
|
|
|
---
|
|
|
|
## Create a few networks and containers
|
|
|
|
.exercise[
|
|
|
|
- Create two networks, *blue* and *green*:
|
|
```bash
|
|
docker network create blue
|
|
docker network create green
|
|
docker network ls
|
|
```
|
|
|
|
- Create containers with names of blue and green
|
|
things, on their respective networks:
|
|
```bash
|
|
docker run -d --net-alias things --name sky --net blue -m 3G redis
|
|
docker run -d --net-alias things --name navy --net blue -m 3G redis
|
|
docker run -d --net-alias things --name grass --net green -m 3G redis
|
|
docker run -d --net-alias things --name forest --net green -m 3G redis
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Check connectivity within networks
|
|
|
|
.exercise[
|
|
|
|
- Check that our containers are on different nodes:
|
|
|
|
```bash
|
|
docker ps
|
|
```
|
|
|
|
- This will work:
|
|
|
|
```bash
|
|
docker run --rm --net blue alpine ping -c 3 navy
|
|
```
|
|
|
|
- This will not:
|
|
|
|
```bash
|
|
docker run --rm --net blue alpine ping -c 3 grass
|
|
```
|
|
|
|
]
|
|
|
|
???
|
|
|
|
## Containers connected to multiple networks
|
|
|
|
- Some colors aren't *quite* blue *nor* green
|
|
|
|
.exercise[
|
|
|
|
- Create a container that we want to be on both networks:
|
|
```bash
|
|
docker run -d --net-alias things --net blue --name turquoise redis
|
|
```
|
|
|
|
- Check connectivity:
|
|
```bash
|
|
docker exec -ti turquoise ping -c 3 navy
|
|
docker exec -ti turquoise ping -c 3 grass
|
|
```
|
|
(First works; second doesn't)
|
|
|
|
]
|
|
|
|
???
|
|
|
|
## Dynamically connecting containers
|
|
|
|
- This is achieved with the command:
|
|
<br/>`docker network connect NETNAME CONTAINER`
|
|
|
|
.exercise[
|
|
|
|
- Dynamically connect to the green network:
|
|
```bash
|
|
docker network connect green turquoise
|
|
```
|
|
|
|
- Check connectivity:
|
|
```bash
|
|
docker exec -ti turquoise ping -c 3 navy
|
|
docker exec -ti turquoise ping -c 3 grass
|
|
```
|
|
(Both commands work now)
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Network aliases
|
|
|
|
- Each container was created with the network alias `things`
|
|
|
|
- Network aliases are scoped by network
|
|
|
|
.exercise[
|
|
|
|
- Resolve the `things` alias from both networks:
|
|
```bash
|
|
docker run --rm --net blue alpine nslookup things
|
|
docker run --rm --net green alpine nslookup things
|
|
```
|
|
|
|
]
|
|
|
|
???
|
|
|
|
## Under the hood
|
|
|
|
- Each network has an interface in the container
|
|
|
|
- There is also an interface for the default gateway
|
|
|
|
.exercise[
|
|
|
|
- View interfaces in our `turquoise` container:
|
|
```bash
|
|
docker exec -ti turquoise ip addr ls
|
|
```
|
|
|
|
]
|
|
|
|
???
|
|
|
|
## Dynamically disconnecting containers
|
|
|
|
- There is a mirror command to `docker network connect`
|
|
|
|
.exercise[
|
|
|
|
- Disconnect the *turquoise* container from *blue*
|
|
(its original network):
|
|
```bash
|
|
docker network disconnect blue turquoise
|
|
```
|
|
|
|
- Check connectivity:
|
|
```bash
|
|
docker exec -ti turquoise ping -c 3 navy
|
|
docker exec -ti turquoise ping -c 3 grass
|
|
```
|
|
(First command fails, second one works)
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Cleaning up
|
|
|
|
.exercise[
|
|
|
|
- Destroy containers:
|
|
|
|
<!--
|
|
```bash
|
|
docker rm -f sky navy grass forest turquoise
|
|
```
|
|
-->
|
|
|
|
```bash
|
|
docker rm -f sky navy grass forest
|
|
```
|
|
|
|
- Destroy networks:
|
|
|
|
```bash
|
|
docker network rm blue
|
|
docker network rm green
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Cleaning up after an outage or a crash
|
|
|
|
- You cannot remove a network if it still has containers
|
|
|
|
- There is no `"rm -f"` for networks
|
|
|
|
- If a network still has stale endpoints, you can use `"disconnect -f"`
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# Building images with Swarm
|
|
|
|
---
|
|
|
|
## Building images with Swarm
|
|
|
|
- Special care must be taken when building and running images
|
|
|
|
- We *can* build images on Swarm (with `docker build` or `docker-compose build`)
|
|
|
|
- One node will be picked at random, and the build will happen there
|
|
|
|
- At the end of the build, the image will be present *only on that node*
|
|
|
|
---
|
|
|
|
## Building on Swarm can yield inconsistent results
|
|
|
|
- Builds are scheduled on random nodes
|
|
|
|
- Multiple builds and rebuilds can happen on different nodes
|
|
|
|
- If a build happens on a different node, the cache of the previous build cannot be used
|
|
|
|
- Worse: you can have two different images with the same name on your cluster
|
|
|
|
---
|
|
|
|
## Scaling won't work as expected
|
|
|
|
Consider the following scenario:
|
|
|
|
- `docker-compose up`
|
|
<br/>
|
|
→ each service is built on a node, and runs there
|
|
|
|
- `docker-compose scale`
|
|
<br/>
|
|
→ additional containers for this service can only be spawned where the image was built
|
|
|
|
- `docker-compose up` (again)
|
|
<br/>
|
|
→ services might be built (and started) on different nodes
|
|
|
|
- `docker-compose scale`
|
|
<br/>
|
|
→ containers can be spawned with both the new and old images
|
|
|
|
---
|
|
|
|
## Scaling correctly with Swarm
|
|
|
|
- After building an image, it should be distributed to the cluster
|
|
|
|
(Or made available through a registry, so that nodes can download it automatically)
|
|
|
|
- Instead of referencing images with the `:latest` tag, unique tags should be used
|
|
|
|
(Using e.g. timestamps, version numbers, or VCS hashes)
|
|
|
|
---
|
|
|
|
## Why can't Swarm do this automatically for us?
|
|
|
|
- Let's step back and think for a minute ...
|
|
|
|
- What should `docker build` do on Swarm?
|
|
|
|
- build on one machine
|
|
|
|
- build everywhere ($$$)
|
|
|
|
- After the build, what should `docker run` do?
|
|
|
|
- run where we built (how do we know where it is?)
|
|
|
|
- run on any machine that has the image
|
|
|
|
- Could Compose+Swarm solve this automatically?
|
|
|
|
---
|
|
|
|
## A few words about "sane defaults"
|
|
|
|
- *It would be nice if Swarm could pick a node, and build there!*
|
|
|
|
- but which node should it pick?
|
|
- what if the build is very expensive?
|
|
- what if we want to distribute the build across nodes?
|
|
- what if we want to tag some builder nodes?
|
|
- ok but what if no node has been tagged?
|
|
|
|
- *It would be nice if Swarm could automatically push images!*
|
|
|
|
- using the Docker Hub is an easy choice
|
|
<br/>(you just need an account)
|
|
- but some of us can't/won't use Docker Hub
|
|
<br/>(for compliance reasons, or because of restricted network access)
|
|
|
|
.small[("Sane" defaults are nice only if we agree on the definition of "sane")]
|
|
|
|
---
|
|
|
|
## The plan
|
|
|
|
- Build on a single node (`node1`)
|
|
|
|
- Tag images with the current UNIX timestamp (for simplicity)
|
|
|
|
- Upload them to a registry
|
|
|
|
- Update the Compose file to use those images
|
|
|
|
This is all automated with the [`build-tag-push.py` script](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/build-tag-push.py).
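The tagging step itself is simple; a minimal sketch (the registry address and image name are examples):

```shell
# Build a unique image reference from the registry prefix and a timestamp
DOCKER_REGISTRY=localhost:5000
TAG=$(date +%s)                          # e.g. 1463000000
echo "$DOCKER_REGISTRY/worker:$TAG"      # e.g. localhost:5000/worker:1463000000
```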
|
|
|
|
---
|
|
|
|
## Which registry do we want to use?
|
|
|
|
.small[
|
|
|
|
- **Docker Hub**
|
|
|
|
- hosted by Docker Inc.
|
|
- requires an account (free, no credit card needed)
|
|
- images will be public (unless you pay)
|
|
- located in AWS EC2 us-east-1
|
|
|
|
- **Docker Trusted Registry**
|
|
|
|
- self-hosted commercial product
|
|
- requires a subscription (free 30-day trial available)
|
|
- images can be public or private
|
|
- located wherever you want
|
|
|
|
- **Docker open source registry**
|
|
|
|
- self-hosted barebones repository hosting
|
|
- doesn't require anything
|
|
- doesn't come with anything either
|
|
- located wherever you want
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Using Docker Hub
|
|
|
|
- Set the `DOCKER_REGISTRY` environment variable to your Docker Hub user name
|
|
<br/>(the `build-tag-push.py` script prefixes each image name with that variable)
|
|
|
|
- We will also see how to run the open source registry
|
|
<br/>(so use whatever option you want!)
|
|
|
|
.exercise[
|
|
|
|
<!--
|
|
```meta
|
|
^{
|
|
```
|
|
-->
|
|
|
|
- Set the following environment variable:
|
|
<br/>`export DOCKER_REGISTRY=jpetazzo`
|
|
|
|
- (Use *your* Docker Hub login, of course!)
|
|
|
|
- Log into the Docker Hub:
|
|
<br/>`docker login`
|
|
|
|
<!--
|
|
```meta
|
|
^}
|
|
```
|
|
-->
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Using Docker Trusted Registry
|
|
|
|
If we wanted to use DTR, we would:
|
|
|
|
- make sure we have a Docker Hub account
|
|
- [activate a Docker Datacenter subscription](
|
|
https://hub.docker.com/enterprise/trial/)
|
|
- install DTR on our machines
|
|
- set `DOCKER_REGISTRY` to `dtraddress:port/user`
|
|
|
|
*This is out of the scope of this workshop!*
|
|
|
|
---
|
|
|
|
## Using open source registry
|
|
|
|
- We need to run a `registry:2` container
|
|
<br/>(make sure you specify tag `:2` to run the new version!)
|
|
|
|
- It will store images and layers to the local filesystem
|
|
<br/>(but you can add a config file to use S3, Swift, etc.)
|
|
|
|
- Docker *requires* TLS when communicating with the registry,
|
|
except for registries on `localhost`, or when the Engine
|
|
is started with the flag `--insecure-registry`
|
|
|
|
- Our strategy: run a reverse proxy on `localhost:5000` on each node
|
|
|
|
---
|
|
|
|
## Registry frontends and backend
|
|
|
|

|
|
|
|
---
|
|
|
|
# Deploying a local registry
|
|
|
|
- There is a Compose file for that
|
|
|
|
.exercise[
|
|
|
|
- Go to the `registry` directory in the repository:
|
|
```bash
|
|
cd ~/orchestration-workshop/registry
|
|
```
|
|
|
|
]
|
|
|
|
Let's examine the `docker-compose.yml` file.
|
|
|
|
---
|
|
|
|
## Running a local registry with Compose
|
|
|
|
```yaml
|
|
version: "2"
|
|
|
|
services:
|
|
backend:
|
|
image: registry:2
|
|
frontend:
|
|
image: jpetazzo/hamba
|
|
command: 5000 backend:5000
|
|
ports:
|
|
- "127.0.0.1:5000:5000"
|
|
depends_on:
|
|
- backend
|
|
```
|
|
|
|
- *Backend* is the actual registry.
|
|
- *Frontend* is the ambassador that we deployed earlier.
|
|
<br/>
|
|
It communicates with *backend* using an internal network
|
|
and network aliases.
|
|
|
|
---
|
|
|
|
## Starting a local registry with Compose
|
|
|
|
- We will bring up the registry
|
|
|
|
- Then we will ensure that one *frontend* is running
|
|
on each node by scaling it to our number of nodes
|
|
|
|
.exercise[
|
|
|
|
- Start the registry:
|
|
```bash
|
|
docker-compose up -d
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## "Scaling" the local registry
|
|
|
|
- This is a particular kind of scaling
|
|
|
|
- We just want to ensure that one *frontend*
|
|
is running on every single node of the cluster
|
|
|
|
.exercise[
|
|
|
|
- Scale the registry:
|
|
```bash
|
|
for N in $(seq 1 5); do
|
|
docker-compose scale frontend=$N
|
|
done
|
|
```
|
|
|
|
]
|
|
|
|
Note: Swarm might do that automatically for us in the future.
|
|
|
|
---
|
|
|
|
## Testing our local registry
|
|
|
|
- We can retag a small image, and push it to the registry
|
|
|
|
.exercise[
|
|
|
|
- Make sure we have the busybox image, and retag it:
|
|
```bash
|
|
docker pull busybox
|
|
docker tag busybox localhost:5000/busybox
|
|
```
|
|
|
|
- Push it:
|
|
```bash
|
|
docker push localhost:5000/busybox
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Checking what's on our local registry
|
|
|
|
- The registry API has endpoints to query what's there
|
|
|
|
.exercise[
|
|
|
|
- Ensure that our busybox image is now in the local registry:
|
|
```bash
|
|
curl http://localhost:5000/v2/_catalog
|
|
```
|
|
|
|
]
|
|
|
|
The curl command should output:
|
|
```json
|
|
{"repositories":["busybox"]}
|
|
```
|
|
|
|
---
|
|
|
|
## Adapting our Compose file to run on Swarm
|
|
|
|
- We can get rid of all the `ports` section, except for the web UI
|
|
|
|
.exercise[
|
|
|
|
- Go back to the dockercoins directory:
|
|
```bash
|
|
cd ~/orchestration-workshop/dockercoins
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Our new Compose file
|
|
|
|
.small[
|
|
```yaml
|
|
version: '2'
|
|
|
|
services:
|
|
rng:
|
|
build: rng
|
|
|
|
hasher:
|
|
build: hasher
|
|
|
|
webui:
|
|
build: webui
|
|
ports:
|
|
- "8000:80"
|
|
|
|
redis:
|
|
image: redis
|
|
|
|
worker:
|
|
build: worker
|
|
```
|
|
]
|
|
|
|
Copy-paste this into `docker-compose.yml`
|
|
<br/>(or you can `cp docker-compose.yml-v2 docker-compose.yml`)
|
|
|
|
---
|
|
|
|
## Use images, not builds
|
|
|
|
- We need to replace each `build` with an `image`
|
|
|
|
- We will use the `build-tag-push.py` script for that
|
|
|
|
.exercise[
|
|
|
|
- Set `DOCKER_REGISTRY` to use our local registry
|
|
|
|
- Make sure that you are building on `node1`
|
|
|
|
- Then run the script
|
|
|
|
```bash
|
|
export DOCKER_REGISTRY=localhost:5000
|
|
eval $(docker-machine env node1)
|
|
../bin/build-tag-push.py
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Run the application
|
|
|
|
- At this point, our app is ready to run
|
|
|
|
.exercise[
|
|
|
|
- Start the application:
|
|
```bash
|
|
export COMPOSE_FILE=docker-compose.yml-`NNN`
|
|
eval $(docker-machine env node1 --swarm)
|
|
docker-compose up -d
|
|
```
|
|
|
|
- Observe that it's running on multiple nodes:
|
|
<br/>(each container name is prefixed with the node it's running on)
|
|
```bash
|
|
docker ps
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## View the performance graph
|
|
|
|
- Load up the graph in the browser
|
|
|
|
.exercise[
|
|
|
|
- Check the `webui` service address and port:
|
|
```bash
|
|
docker-compose port webui 80
|
|
```
|
|
|
|
- Open it in your browser
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Scaling workers
|
|
|
|
- Scaling the `worker` service works out of the box
|
|
(like before)
|
|
|
|
.exercise[
|
|
|
|
- Scale `worker`:
|
|
```bash
|
|
docker-compose scale worker=10
|
|
```
|
|
|
|
]
|
|
|
|
Check that workers are on different nodes.
|
|
|
|
However, we hit the same bottleneck as before.
|
|
|
|
How can we address that?
|
|
|
|
---
|
|
|
|
## Finding the real cause of the bottleneck
|
|
|
|
- If time permits, we can benchmark `rng` and `hasher` to find out more
|
|
|
|
- Otherwise, we'll fast-forward a bit
|
|
|
|
---
|
|
|
|
## Benchmarking in isolation
|
|
|
|
- If we want the benchmark to be accurate, we need to make sure that `rng` and `hasher` are not receiving traffic
|
|
|
|
.exercise[
|
|
|
|
- Stop the `worker` containers:
|
|
```bash
|
|
docker-compose kill worker
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## A better benchmarking tool
|
|
|
|
- Instead of `httping`, we will now use `ab` (Apache Bench)
|
|
|
|
- We will install it in an `alpine` container placed on the network used by our application
|
|
|
|
.exercise[
|
|
|
|
- Start an interactive `alpine` container on the `dockercoins_default` network:
|
|
```bash
|
|
docker run -ti --net dockercoins_default alpine sh
|
|
```
|
|
|
|
- Install `ab` with the `apache2-utils` package:
|
|
```bash
|
|
apk add --update apache2-utils
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Benchmarking `rng`
|
|
|
|
We will send 50 requests, but with various levels of concurrency.
|
|
|
|
.exercise[
|
|
|
|
- Send 50 requests, with a single sequential client:
|
|
```bash
|
|
ab -c 1 -n 50 http://rng/10
|
|
```
|
|
|
|
- Send 50 requests, with ten parallel clients:
|
|
```bash
|
|
ab -c 10 -n 50 http://rng/10
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Benchmark results for `rng`
|
|
|
|
- In both cases, the benchmark takes ~5 seconds to complete
|
|
|
|
- When serving requests sequentially, they each take 100ms
|
|
|
|
- In the parallel scenario, the latency increased dramatically:
|
|
|
|
- one request is served in 100ms
|
|
- another is served in 200ms
|
|
- another is served in 300ms
|
|
- ...
|
|
- another is served in 1000ms
|
|
|
|
- What about `hasher`?
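Before that, a quick sanity check on the `rng` numbers: a server that handles one 100 ms request at a time needs the same total time for 50 requests, no matter how many clients are waiting:

```shell
# 50 requests x 100 ms, served strictly one after the other
echo "$(( 50 * 100 )) ms"   # prints: 5000 ms (~5 seconds)
```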
|
|
|
|
---
|
|
|
|
## Benchmarking `hasher`
|
|
|
|
We will do the same tests for `hasher`.
|
|
|
|
The command is slightly more complex, since we need to post random data.
|
|
|
|
First, we need to put the POST payload in a temporary file.
|
|
|
|
.exercise[
|
|
|
|
- Install curl in the container, and generate 10 bytes of random data:
|
|
```bash
|
|
apk add curl
|
|
curl http://rng/10 >/tmp/random
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Benchmarking `hasher`
|
|
|
|
Once again, we will send 50 requests, with different levels of concurrency.
|
|
|
|
.exercise[
|
|
|
|
- Send 50 requests with a sequential client:
|
|
```bash
|
|
ab -c 1 -n 50 -T application/octet-stream \
|
|
-p /tmp/random http://hasher/
|
|
```
|
|
|
|
- Send 50 requests with 10 parallel clients:
|
|
```bash
|
|
ab -c 10 -n 50 -T application/octet-stream \
|
|
-p /tmp/random http://hasher/
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Benchmark results for `hasher`
|
|
|
|
- The sequential benchmark takes ~5 seconds to complete
|
|
|
|
- The parallel benchmark takes less than 1 second to complete
|
|
|
|
- In both cases, each request takes a bit more than 100ms to complete
|
|
|
|
- Requests are a bit slower in the parallel benchmark
|
|
|
|
- It looks like `hasher` is better equipped to deal with concurrency than `rng`
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
Why?
|
|
|
|
---
|
|
|
|
## Why does everything take (at least) 100ms?
|
|
|
|
--
|
|
|
|
`rng` code:
|
|
|
|

|
|
|
|
--
|
|
|
|
`hasher` code:
|
|
|
|

|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
But ...
|
|
|
|
WHY?!?
|
|
|
|
---
|
|
|
|
## Why did we sprinkle this sample app with sleeps?
|
|
|
|
- Deterministic performance
|
|
<br/>(regardless of instance speed, CPUs, I/O...)
|
|
|
|
--
|
|
|
|
- Actual code sleeps all the time anyway
|
|
|
|
--
|
|
|
|
- When your code makes a remote API call:
|
|
|
|
- it sends a request;
|
|
|
|
- it sleeps until it gets the response;
|
|
|
|
- it processes the response.
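This blocking pattern can be simulated in plain shell (a sketch; the `sleep` stands in for network latency):

```shell
# Simulate a blocking remote API call: send, wait, process.
remote_call() {
  sleep 0.1           # "waiting for the response" (the caller is idle)
  echo "response"     # the reply arrives
}
RESULT=$(remote_call) # the caller is blocked for the whole duration
echo "processed: $RESULT"
```

Ten such calls made sequentially take ~1 second, no matter how fast the CPU is.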
|
|
|
|
---
|
|
|
|
## Why do `rng` and `hasher` behave differently?
|
|
|
|

|
|
|
|
--
|
|
|
|
(Synchronous vs. asynchronous event processing)
|
|
|
|
---
|
|
|
|
## How to make `rng` go faster
|
|
|
|
- Obvious solution: comment out the `sleep` instruction
|
|
|
|
--
|
|
|
|
- Unfortunately, in the real world, network latency exists
|
|
|
|
--
|
|
|
|
- More realistic solution: use an asynchronous framework
|
|
<br/>(e.g. use gunicorn with gevent)
|
|
|
|
--
|
|
|
|
- Reminder: we can't change the code!
|
|
|
|
--
|
|
|
|
- Solution: scale out `rng`
|
|
<br/>(dispatch `rng` requests on multiple instances)
|
|
|
|
---
|
|
|
|
# Scaling web services with Compose on Swarm
|
|
|
|
- We *can* scale network services with Compose
|
|
|
|
- The result may or may not be satisfactory, though!
|
|
|
|
.exercise[
|
|
|
|
- Restart the `worker` service:
|
|
```bash
|
|
docker-compose start worker
|
|
```
|
|
|
|
- Scale the `rng` service:
|
|
```bash
|
|
docker-compose scale rng=5
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Results
|
|
|
|
- In the web UI, you might see a performance increase ... or maybe not
|
|
|
|
--
|
|
|
|
- Since Engine 1.11, we get round-robin DNS records
|
|
|
|
(i.e. resolving `rng` will yield the IP addresses of all 5 containers)
|
|
|
|
- Docker randomizes the records it sends
|
|
|
|
- But many resolvers will sort them in unexpected ways
|
|
|
|
- Depending on various factors, you could get:
|
|
|
|
- all traffic on a single container
|
|
- traffic perfectly balanced on all containers
|
|
- traffic unevenly balanced across containers
|
|
|
|
---
|
|
|
|
## Assessing DNS randomness
|
|
|
|
- Let's see how our containers resolve DNS requests
|
|
|
|
.exercise[
|
|
|
|
- On each of our 10 scaled workers, execute 5 ping requests:
|
|
```bash
|
|
for N in $(seq 1 10); do
|
|
echo PING__________$N
|
|
for I in $(seq 1 5); do
|
|
docker exec -ti dockercoins_worker_$N ping -c1 rng
|
|
done
|
|
done | grep PING
|
|
```
|
|
|
|
]
|
|
|
|
(The 7th Might Surprise You!)
|
|
|
|
---
|
|
|
|
## DNS randomness
|
|
|
|
- Other programs can yield different results
|
|
|
|
- Same program on another distro can yield different results
|
|
|
|
- Same source code with another libc or resolver can yield different results
|
|
|
|
- Running the same test at different times can yield different results
|
|
|
|
- Did I mention that Your Results May Vary?
|
|
|
|
---
|
|
|
|
## Implementing fair load balancing
|
|
|
|
- Instead of relying on DNS round robin, let's use a proper load balancer
|
|
|
|
- Use Compose to create multiple copies of the `rng` service
|
|
|
|
- Put a load balancer in front of them
|
|
|
|
- Point other services to the load balancer
|
|
|
|
---
|
|
|
|
## Naming problem
|
|
|
|
- The service is called `rng`
|
|
|
|
- Therefore, it is reachable with the network name `rng`
|
|
|
|
- Our application code (the `worker` service) connects to `rng`
|
|
|
|
- So the name `rng` should resolve to the load balancer
|
|
|
|
- What do‽
|
|
|
|
---
|
|
|
|
## Naming is *per-network*
|
|
|
|
- Solution: put `rng` on its own network
|
|
|
|
- That way, it doesn't take the network name `rng`
|
|
<br/>(at least not on the default network)
|
|
|
|
- Have the load balancer sit on both networks
|
|
|
|
- Add the name `rng` to the load balancer
|
|
|
|
---
|
|
|
|
class: pic
|
|
|
|
Original DockerCoins
|
|
|
|

|
|
|
|
---
|
|
|
|
class: pic
|
|
|
|
Load-balanced DockerCoins
|
|
|
|

|
|
|
|
---
|
|
|
|
## Declaring networks
|
|
|
|
- Networks (other than the default one)
|
|
*must* be declared
|
|
in a top-level `networks` section,
|
|
placed anywhere in the file
|
|
|
|
.exercise[
|
|
|
|
- Add the `rng` network to the Compose file, `docker-compose.yml-NNN`:
|
|
```yaml
|
|
version: '2'
|
|
|
|
networks:
|
|
rng:
|
|
|
|
services:
|
|
rng:
|
|
image: ...
|
|
...
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Putting the `rng` service in its network
|
|
|
|
- Services can have a `networks` section
|
|
|
|
- If they don't: they are placed in the default network
|
|
|
|
- If they do: they are placed only in the mentioned networks
|
|
|
|
.exercise[
|
|
|
|
- Change the `rng` service to put it in its network:
|
|
```yaml
|
|
rng:
|
|
image: localhost:5000/dockercoins_rng:…
|
|
networks:
|
|
rng:
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Adding the load balancer
|
|
|
|
- The load balancer has to be in both networks: `rng` and `default`
|
|
- In the `default` network, it must have the `rng` alias
|
|
- We will use the `jpetazzo/hamba` image
|
|
|
|
.exercise[
|
|
|
|
- Add the `rng-lb` service to the Compose file:
|
|
```yaml
|
|
rng-lb:
|
|
image: jpetazzo/hamba
|
|
command: run
|
|
networks:
|
|
rng:
|
|
default:
|
|
aliases: [ rng ]
|
|
```
|
|
]
|
|
|
|
---
|
|
|
|
## Load balancer initial configuration
|
|
|
|
- We specified `run` as the initial command
|
|
|
|
- This tells `hamba` to wait for an initial configuration
|
|
|
|
- The load balancer will not be operational (until we feed it its configuration)
|
|
|
|
---
|
|
|
|
## Start the application
|
|
|
|
.exercise[
|
|
|
|
- Bring up DockerCoins:
|
|
```bash
|
|
docker-compose up -d
|
|
```
|
|
|
|
- See that `worker` is complaining:
|
|
```bash
|
|
docker-compose logs --tail 100 --follow worker
|
|
```
|
|
]
|
|
|
|
---
|
|
|
|
## Add one backend to the load balancer
|
|
|
|
- Multiple solutions:
|
|
|
|
- lookup the IP address of the `rng` backend
|
|
- use the backend's network name
|
|
- use the backend's container name (easiest!)
|
|
|
|
.exercise[
|
|
|
|
- Configure the load balancer:
|
|
```bash
|
|
docker run --rm --volumes-from dockercoins_rng-lb_1 \
|
|
--net container:dockercoins_rng-lb_1 \
|
|
jpetazzo/hamba reconfigure 80 dockercoins_rng_1 80
|
|
```
|
|
|
|
]
|
|
|
|
The application should now be working correctly.
|
|
|
|
---
|
|
|
|
## Add all backends to the load balancer
|
|
|
|
- The command is similar to the one before
|
|
|
|
- We need to pass the list of all backends
|
|
|
|
.exercise[
|
|
|
|
- Reconfigure the load balancer:
|
|
```bash
|
|
docker run --rm \
|
|
--volumes-from dockercoins_rng-lb_1 \
|
|
--net container:dockercoins_rng-lb_1 \
|
|
jpetazzo/hamba reconfigure 80 \
|
|
$(for N in $(seq 1 5); do
|
|
echo dockercoins_rng_$N:80
|
|
done)
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Automating the process
|
|
|
|
- Nobody loves hand-crafting artisanal YAML
|
|
|
|
- This can be scripted very easily
|
|
|
|
- But can it be fully automated?
|
|
|
|
---
|
|
|
|
## Use DNS to discover the addresses of all the backends
|
|
|
|
- When multiple containers have the same network alias:
|
|
|
|
- Engine 1.10 returns only one of them (the same one across the whole network)
|
|
|
|
- Engine 1.11 returns all of them (in a random order)
|
|
|
|
- A "smart" client can use all records to implement load balancing
|
|
|
|
- We can compose `jpetazzo/hamba` with a special-purpose container,
|
|
which will dynamically generate HAProxy's configuration when
|
|
the DNS records are updated
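To see all the records such a "smart" client would receive, one option is `getent ahosts` from inside a container on the network (a sketch; the container name follows Compose's default naming and is an assumption):

```shell
# Print every address returned by Docker's DNS for the "rng" alias
# (with Engine 1.11, one line per container sharing the alias):
docker exec dockercoins_worker_1 getent ahosts rng
```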
|
|
|
|
---
|
|
|
|
## Introducing `jpetazzo/watchdns`
|
|
|
|
- [100 lines of pure POSIX scriptery](
|
|
https://github.com/jpetazzo/watchdns/blob/master/watchdns)
|
|
|
|
- Resolves a given DNS name every second
|
|
|
|
- Each time the result changes, a new HAProxy configuration is generated
|
|
|
|
- When used together with `--volumes-from` and `jpetazzo/hamba`, it
|
|
updates the configuration of an existing load balancer
|
|
|
|
- Comes with a companion script, [`add-load-balancer-v2.py`](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/add-load-balancer-v2.py), to update your Compose files
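The core of the pattern can be sketched in a few lines of portable shell (a simplified, hypothetical version; the real script does more bookkeeping):

```shell
# Re-resolve a name and report whether the backend set changed.
# "resolve" is a stub; the real script queries Docker's DNS.
resolve() { getent hosts "$1" 2>/dev/null | awk '{print $1}' | sort; }

check_backends() {
  NAME=$1; STATE=$2
  CURRENT=$(resolve "$NAME")
  PREVIOUS=$(cat "$STATE" 2>/dev/null)
  if [ "$CURRENT" != "$PREVIOUS" ]; then
    printf '%s\n' "$CURRENT" >"$STATE"
    return 0  # changed: time to regenerate the HAProxy configuration
  fi
  return 1    # unchanged: nothing to do
}

# The real loop, roughly:
#   while true; do check_backends rng /state && regenerate; sleep 1; done
```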
|
|
|
|
---
|
|
|
|
## Using `jpetazzo/watchdns`
|
|
|
|
.exercise[
|
|
|
|
- First, revert the Compose file to remove the load balancer
|
|
|
|
- Then, run `add-load-balancer-v2.py`:
|
|
```bash
|
|
../bin/add-load-balancer-v2.py rng
|
|
```
|
|
|
|
- Inspect the resulting Compose file
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Scaling with `watchdns`
|
|
|
|
.exercise[
|
|
|
|
- Start the application with the new sidekick containers:
|
|
```bash
|
|
docker-compose up -d
|
|
```
|
|
|
|
- Scale `rng`:
|
|
```bash
|
|
docker-compose scale rng=10
|
|
```
|
|
|
|
- Check logs:
|
|
```bash
|
|
docker-compose logs rng-wd
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Comments
|
|
|
|
- This is a very crude implementation of the pattern
|
|
|
|
- A Go version would only be a bit longer, but use much less resources
|
|
|
|
- When there are many backends, reacting quickly to change is less important
|
|
|
|
(i.e. it's not necessary to re-resolve records every second!)
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# All things ops <br/> (logs, backups, and more)
|
|
|
|
---
|
|
|
|
# Logs
|
|
|
|
- Two strategies:
|
|
|
|
- log to plain files on volumes
|
|
|
|
- log to stdout
|
|
<br/>(and use a logging driver)
|
|
|
|
---
|
|
|
|
## Logging to plain files on volumes
|
|
|
|
(Sorry, that part won't be hands-on!)
|
|
|
|
- Start a container with `-v /logs`
|
|
|
|
- Make sure that all log files are in `/logs`
|
|
|
|
- To check logs, run e.g.
|
|
|
|
```bash
|
|
docker run --volumes-from ... ubuntu sh -c "grep WARN /logs/*.log"
|
|
```
|
|
|
|
- Or just go interactive:
|
|
|
|
```bash
|
|
docker run --volumes-from ... -ti ubuntu
|
|
```
|
|
|
|
- You can (should) start a log shipper that way
|
|
|
|
---
|
|
|
|
## Logging to stdout
|
|
|
|
- All containers should write to stdout/stderr
|
|
|
|
- Docker will collect logs and pass them to a logging driver
|
|
|
|
- The logging driver can be specified globally and per container
|
|
<br/>(changing it for a container overrides the global setting)
|
|
|
|
- To change the global logging driver, pass extra flags to the daemon
|
|
<br/>(requires a daemon restart)
|
|
|
|
- To override the logging driver for a container, pass extra flags to `docker run`
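For instance, to send a single container's output to a remote syslog server (the server address is a placeholder):

```shell
# Override the logging driver for this container only:
docker run --log-driver syslog \
       --log-opt syslog-address=udp://logs.example.com:514 \
       alpine echo hello
```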
|
|
|
|
---
|
|
|
|
## Specifying logging flags
|
|
|
|
- `--log-driver`
|
|
|
|
*selects the driver*
|
|
|
|
- `--log-opt key=val`
|
|
|
|
*adds driver-specific options*
|
|
<br/>*(can be repeated multiple times)*
|
|
|
|
- The flags are identical for `docker daemon` and `docker run`
|
|
|
|
---
|
|
|
|
## Logging flags in practice
|
|
|
|
- If you provision your nodes with Docker Machine,
|
|
you can set global logging flags (which will apply to all
|
|
containers started by a given Engine) like this:
|
|
|
|
```bash
|
|
docker-machine create ... --engine-opt log-driver=...
|
|
```
|
|
|
|
- Otherwise, use your favorite method to edit or manage configuration files
|
|
|
|
- You can set per-container logging options in Compose files
|
|
|
|
---
|
|
|
|
## Available drivers
|
|
|
|
- json-file (default)
|
|
|
|
- syslog (can send to UDP, TCP, TCP+TLS, UNIX sockets)
|
|
|
|
- awslogs (AWS CloudWatch)
|
|
|
|
- journald
|
|
|
|
- gelf
|
|
|
|
- fluentd
|
|
|
|
- splunk
|
|
|
|
---
|
|
|
|
## About json-file ...
|
|
|
|
- It doesn't rotate logs by default, so your disks will fill up
|
|
|
|
(Unless you set the `max-size` *and* `max-file` log options.)
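For example, to cap a container's logs at three files of 10 MB each:

```shell
# json-file rotation options (per container; can also be set daemon-wide):
docker run --log-driver json-file \
       --log-opt max-size=10m --log-opt max-file=3 \
       alpine echo hello
```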
|
|
|
|
- It's the only one supporting logs retrieval
|
|
|
|
(If you want to use `docker logs`, `docker-compose logs`,
|
|
or fetch logs from the Docker API, you need json-file!)
|
|
|
|
- This might change in the future
|
|
|
|
(But it's complex since there is no standard protocol
|
|
to *retrieve* log entries.)
|
|
|
|
All about logging in the documentation:
|
|
https://docs.docker.com/reference/logging/overview/
|
|
|
|
---
|
|
|
|
# Setting up ELK to store container logs
|
|
|
|
*Important foreword: this is not an "official" or "recommended"
|
|
setup; it is just an example. We do not endorse ELK, GELF,
|
|
or the other elements of this stack over any alternatives!*
|
|
|
|
What we will do:
|
|
|
|
- Spin up an ELK stack, with Compose
|
|
|
|
- Gaze at the spiffy Kibana web UI
|
|
|
|
- Manually send a few log entries over GELF
|
|
|
|
- Reconfigure our DockerCoins app to send logs to ELK
|
|
|
|
---
|
|
|
|
## What's in an ELK stack?
|
|
|
|
- ELK is three components:
|
|
|
|
- ElasticSearch (to store and index log entries)
|
|
|
|
- Logstash (to receive log entries from various
|
|
sources, process them, and forward them to various
|
|
destinations)
|
|
|
|
- Kibana (to view/search log entries with a nice UI)
|
|
|
|
- The only component that we will configure is Logstash
|
|
|
|
- We will accept log entries using the GELF protocol
|
|
|
|
- Log entries will be stored in ElasticSearch,
|
|
<br/>and displayed on Logstash's stdout for debugging
|
|
|
|
---
|
|
|
|
## Starting our ELK stack
|
|
|
|
- We will use a *separate* Compose file
|
|
|
|
- The Compose file is in the `elk` directory
|
|
|
|
.exercise[
|
|
|
|
- Go to the `elk` directory:
|
|
```bash
|
|
cd ~/orchestration-workshop/elk
|
|
```
|
|
|
|
- Start the ELK stack:
|
|
```bash
|
|
unset COMPOSE_FILE
|
|
docker-compose up -d
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Making sure that each node has a local logstash
|
|
|
|
- We will configure each container to send logs to `localhost:12201`
|
|
|
|
- We need to make sure that each node has a logstash container listening on port 12201
|
|
|
|
.exercise[
|
|
|
|
- Scale the `logstash` service to 5 instances (one per node):
|
|
```bash
|
|
for N in $(seq 1 5); do
|
|
docker-compose scale logstash=$N
|
|
done
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Checking that our ELK stack works
|
|
|
|
- Our default Logstash configuration sends a test
|
|
message every minute
|
|
|
|
- All messages are stored into ElasticSearch,
|
|
but also shown on Logstash stdout
|
|
|
|
.exercise[
|
|
|
|
- Look at Logstash stdout:
|
|
```bash
|
|
docker-compose logs logstash
|
|
```
|
|
|
|
]
|
|
|
|
After less than one minute, you should see a `"message" => "ok"`
|
|
in the output.
|
|
|
|
---
|
|
|
|
## Connect to Kibana
|
|
|
|
- Our ELK stack exposes two public services:
|
|
<br/>the Kibana web server, and the GELF UDP socket
|
|
|
|
- They are both exposed on their default port numbers
|
|
<br/>(5601 for Kibana, 12201 for GELF)
|
|
|
|
.exercise[
|
|
|
|
- Check the address of the node running kibana:
|
|
```bash
|
|
docker-compose ps
|
|
```
|
|
|
|
- Open the UI in your browser: http://instance-address:5601/
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## "Configuring" Kibana
|
|
|
|
- If you see a status page with a yellow item, wait a minute and reload
|
|
(Kibana is probably still initializing)
|
|
|
|
- Kibana should offer you to "Configure an index pattern",
|
|
just click the "Create" button
|
|
|
|
- Then:
|
|
|
|
- click "Discover" (in the top-left corner)
|
|
- click "Last 15 minutes" (in the top-right corner)
|
|
- click "Last 1 hour" (in the list in the middle)
|
|
- click "Auto-refresh" (top-right corner)
|
|
- click "5 seconds" (top-left of the list)
|
|
|
|
- You should see a series of green bars (with one new green bar every minute)
|
|
|
|
---
|
|
|
|

|
|
|
|
---
|
|
|
|
## Sending container output to Kibana
|
|
|
|
- We will create a simple container displaying "hello world"
|
|
|
|
- We will override the container logging driver
|
|
|
|
- The GELF address is `127.0.0.1:12201`, because the Compose file
|
|
explicitly exposes the GELF socket on port 12201
|
|
|
|
.exercise[
|
|
|
|
- Start our one-off container:
|
|
|
|
```bash
|
|
docker run --rm --log-driver gelf \
|
|
--log-opt gelf-address=udp://127.0.0.1:12201 \
|
|
alpine echo hello world
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Visualizing container logs in Kibana
|
|
|
|
- Less than 5 seconds later (the refresh rate of the UI),
|
|
the log line should be visible in the web UI
|
|
|
|
- We can customize the web UI to be more readable
|
|
|
|
.exercise[
|
|
|
|
- In the left column, move the mouse over the following
|
|
columns, and click the "Add" button that appears:
|
|
|
|
- host
|
|
- container_name
|
|
- message
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Switching back to the DockerCoins application
|
|
|
|
.exercise[
|
|
|
|
- Go back to the dockercoins directory:
|
|
```bash
|
|
cd ~/orchestration-workshop/dockercoins
|
|
```
|
|
|
|
- Set the `COMPOSE_FILE` variable:
|
|
```bash
|
|
export COMPOSE_FILE=docker-compose.yml-`NNN`
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
## Add the logging driver to the Compose file
|
|
|
|
- We need to add the logging section to each container
|
|
|
|
.exercise[
|
|
|
|
- Edit the `docker-compose.yml-NNN` file, adding the following lines **to each container**:
|
|
|
|
```yaml
|
|
logging:
|
|
driver: gelf
|
|
options:
|
|
gelf-address: "udp://127.0.0.1:12201"
|
|
```
|
|
|
|
]
|
|
|
|
There is also a script, [`../bin/add-logging.py`](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/add-logging.py), to do that automatically.
|
|
|
|
---
|
|
|
|
## Update the DockerCoins app
|
|
|
|
.exercise[
|
|
|
|
- Use Compose normally:
|
|
```bash
|
|
docker-compose up -d
|
|
```
|
|
|
|
]
|
|
|
|
If you look in the Kibana web UI, you will see log lines
|
|
refreshed every 5 seconds.
|
|
|
|
Note: to do interesting things (graphs, searches...) we
|
|
would need to create indexes. This is beyond the scope
|
|
of this workshop.
|
|
|
|
---
|
|
|
|
## Logging in production
|
|
|
|
- If we were using an ELK stack:
|
|
|
|
- scale ElasticSearch
|
|
- interpose a Redis or Kafka queue to deal with bursts
|
|
|
|
- Configure your Engines to send all logs to ELK by default
|
|
|
|
- Start the logging containers with a different logging system
|
|
<br/>(to avoid a logging loop)
|
|
|
|
- Make sure you don't end up writing *all logs* on the nodes running Logstash!
|
|
|
|
---
|
|
|
|
# Network traffic analysis
|
|
|
|
- We want to inspect the network traffic entering/leaving `dockercoins_redis_1`
|
|
|
|
- We will use *shared network namespaces* to perform network analysis
|
|
|
|
- Two containers sharing the same network namespace...
|
|
|
|
- have the same IP addresses
|
|
|
|
- have the same network interfaces
|
|
|
|
- `eth0` is therefore the same in both containers
|
|
|
|
---
|
|
|
|
## Install and start `ngrep`
|
|
|
|
Ngrep uses libpcap (like tcpdump) to sniff network traffic.
|
|
|
|
.exercise[
|
|
|
|
<!--
|
|
```meta
|
|
^{
|
|
```
|
|
-->
|
|
|
|
- Start a container with the same network namespace:
|
|
<br/>`docker run --net container:dockercoins_redis_1 -ti alpine sh`
|
|
|
|
- Install ngrep:
|
|
<br/>`apk update && apk add ngrep`
|
|
|
|
- Run ngrep:
|
|
<br/>`ngrep -tpd eth0 -Wbyline . tcp`
|
|
|
|
<!--
|
|
```meta
|
|
^}
|
|
```
|
|
-->
|
|
|
|
]
|
|
|
|
You should see a stream of Redis requests and responses.
|
|
|
|
---
|
|
|
|
# Backups
|
|
|
|
- We want to enable backups for `dockercoins_redis_1`
|
|
|
|
- We don't want to install extra software in this container
|
|
|
|
- We will use a special backup container:
|
|
|
|
- sharing the same volumes
|
|
|
|
- using the same network stack (to connect to it easily)
|
|
|
|
- possibly containing our backup tools
|
|
|
|
- This works because the `redis` container image stores its data on a volume
|
|
|
|
---
|
|
|
|
## Starting the backup container
|
|
|
|
- We will use the `--net container:` option to be able to connect locally
|
|
|
|
- We will use the `--volumes-from` option to access the container's persistent data
|
|
|
|
.exercise[
|
|
|
|
<!--
|
|
```meta
|
|
^{
|
|
```
|
|
-->
|
|
|
|
- Start the container:
|
|
|
|
```bash
|
|
docker run --net container:dockercoins_redis_1 \
|
|
--volumes-from dockercoins_redis_1:ro \
|
|
-v /tmp/myredis:/output \
|
|
-ti alpine sh
|
|
```
|
|
|
|
- Look in `/data` in the container (that's where Redis puts its data dumps)
|
|
]
|
|
|
|
---
|
|
|
|
## Connecting to Redis
|
|
|
|
- We need to tell Redis to perform a data dump *now*
|
|
|
|
.exercise[
|
|
|
|
- Connect to Redis:
|
|
```bash
|
|
telnet localhost 6379
|
|
```
|
|
|
|
- Issue commands `SAVE` then `QUIT`
|
|
|
|
- Look at `/data` again (notice the time stamps)
|
|
|
|
]
|
|
|
|
- There should be a recent dump file now!
|
|
|
|
---
|
|
|
|
## Getting the dump out of the container
|
|
|
|
- We could use many things:
|
|
|
|
- s3cmd to copy to S3
|
|
- SSH to copy to a remote host
|
|
- gzip/bzip/etc before copying
|
|
|
|
- We'll just copy it to the Docker host
|
|
|
|
.exercise[
|
|
|
|
- Copy the file from `/data` to `/output`
|
|
|
|
- Exit the container
|
|
|
|
- Look into `/tmp/myredis` (on the host)
|
|
|
|
<!--
|
|
```meta
|
|
^}
|
|
```
|
|
-->
|
|
|
|
]
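The copy step can be wrapped in a tiny helper (a sketch; `dump.rdb` is Redis' default dump file name, and the directories match the volumes mounted earlier):

```shell
# Copy the Redis dump to the output volume, with a timestamped name.
backup_dump() {
  SRC=${1:-/data}
  DST=${2:-/output}
  cp "$SRC/dump.rdb" "$DST/dump-$(date +%Y%m%d-%H%M%S).rdb"
}
# Inside the backup container, just run: backup_dump
```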
|
|
|
|
---
|
|
|
|
## Scheduling backups
|
|
|
|
In the "old world," we (generally) use cron.
|
|
|
|
With containers, what are our options?
|
|
|
|
--
|
|
|
|
- run `cron` on the Docker host, and put `docker run` in the crontab
|
|
|
|
--
|
|
|
|
- run `cron` in the backup container, and make sure it keeps running
|
|
<br/>(e.g. with `docker run --restart=…`)
|
|
|
|
--
|
|
|
|
- run `cron` in a container, and start backup containers from there
|
|
|
|
--
|
|
|
|
- listen to the Docker events stream, automatically scheduling backups
|
|
<br/>when database containers are started
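With the first option, the host's crontab would contain something like this (a sketch; the backup image name is hypothetical):

```shell
# /etc/crontab entry on the Docker host: nightly Redis backup at 03:00.
# m h dom mon dow user command
0 3 * * * root docker run --rm --volumes-from dockercoins_redis_1:ro \
  -v /backups:/output myorg/redis-backup
```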
|
|
|
|
---
|
|
|
|
# Controlling Docker from a container
|
|
|
|
- In a local environment, just bind-mount the Docker control socket:
|
|
```bash
|
|
docker run -ti -v /var/run/docker.sock:/var/run/docker.sock docker
|
|
```
|
|
|
|
- Otherwise, you have to:
|
|
|
|
- set `DOCKER_HOST`,
|
|
- set `DOCKER_TLS_VERIFY` and `DOCKER_CERT_PATH` (if you use TLS),
|
|
- copy certificates to the container that will need API access.
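Putting it together, granting a container remote API access could look like this (a sketch; the node name and certificate path are assumptions based on Machine's defaults):

```shell
# Give the `docker` CLI in a container access to a remote, TLS-protected Engine:
docker run -ti \
       -e DOCKER_HOST=tcp://node1:2376 \
       -e DOCKER_TLS_VERIFY=1 \
       -e DOCKER_CERT_PATH=/certs \
       -v ~/.docker/machine/machines/node1:/certs \
       docker
```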
|
|
|
|
More resources on this topic:
|
|
|
|
- [Do not use Docker-in-Docker for CI](
|
|
http://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/)
|
|
- [One container to rule them all](
|
|
http://jpetazzo.github.io/2016/04/03/one-container-to-rule-them-all/)
|
|
|
|
---
|
|
|
|
# Docker events stream
|
|
|
|
- Using the Docker API, we can get real-time
|
|
notifications of everything happening in the Engine:
|
|
|
|
- container creation/destruction
|
|
- container start/stop
|
|
- container exit/signal/out of memory
|
|
- container attach/detach
|
|
- volume creation/destruction
|
|
- network creation/destruction
|
|
- connection/disconnection of containers
|
|
|
|
---
|
|
|
|
## Subscribing to the events stream
|
|
|
|
- This is done with `docker events`
|
|
|
|
.exercise[
|
|
|
|
- Get a stream of events:
|
|
```bash
|
|
docker events
|
|
```
|
|
|
|
<!--
|
|
```meta
|
|
^Z
|
|
```
|
|
-->
|
|
|
|
- In a new terminal, do *anything*:
|
|
```bash
|
|
docker run --rm alpine sleep 10
|
|
```
|
|
|
|
]
|
|
|
|
You should see events for the lifecycle of the
|
|
container, as well as its connection/disconnection
|
|
to the default `bridge` network.
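The stream can be narrowed down with filters, which helps when scripting against it:

```shell
# Only show container start/stop ("die") events:
docker events --filter type=container \
              --filter event=start --filter event=die
```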
|
|
|
|
---
|
|
|
|
## A few tools to use the events stream
|
|
|
|
- [docker-spotter](https://github.com/discordianfish/docker-spotter)
|
|
|
|
Written in Go; simple building block to use directly in Shell scripts
|
|
|
|
- [ahab](https://github.com/instacart/ahab)
|
|
|
|
Written in Python; available as a library; ships with a CLI tool
|
|
|
|
---
|
|
|
|
# Security upgrades
|
|
|
|
- This section is not hands-on
|
|
|
|
- Public Service Announcement
|
|
|
|
- We'll discuss:
|
|
|
|
- how to upgrade the Docker daemon
|
|
|
|
- how to upgrade container images
|
|
|
|
---
|
|
|
|
## Upgrading the Docker daemon
|
|
|
|
- Stop all containers cleanly
|
|
|
|
- Stop the Docker daemon
|
|
|
|
- Upgrade the Docker daemon
|
|
|
|
- Start the Docker daemon
|
|
|
|
- Start all containers
|
|
|
|
- This is like upgrading your Linux kernel, but it will get better
|
|
|
|
(Docker Engine 1.11 is using containerd, which will ultimately allow seamless upgrades.)
|
|
|
|
???
|
|
|
|
## In practice
|
|
|
|
- Keep track of running containers before stopping the Engine:
|
|
```bash
|
|
docker ps --no-trunc -q |
|
|
tee /tmp/running |
|
|
xargs -n1 -P10 docker stop
|
|
```
|
|
|
|
- Restart those containers after the Engine is running again:
|
|
```bash
|
|
xargs docker start < /tmp/running
|
|
```
|
|
<br/>(Run this multiple times if you have linked containers!)
|
|
|
|
---
|
|
|
|
## Upgrading container images
|
|
|
|
- When a vulnerability is announced:
|
|
|
|
- if it affects your base images: make sure they are fixed first
|
|
|
|
- if it affects downloaded packages: make sure they are fixed first
|
|
|
|
- re-pull base images
|
|
|
|
- rebuild
|
|
|
|
- restart containers
|
|
|
|
---
|
|
|
|
## How do we know when to upgrade?
|
|
|
|
- Subscribe to CVE notifications
|
|
|
|
- https://cve.mitre.org/
|
|
|
|
- your distros' security announcements
|
|
|
|
- Check CVE status in official images
|
|
<br/>(tag [cve-tracker](
|
|
https://github.com/docker-library/official-images/labels/cve-tracker)
|
|
in [docker-library/official-images](
|
|
https://github.com/docker-library/official-images/labels/cve-tracker)
|
|
repo)
|
|
|
|
- Use a container vulnerability scanner
|
|
<br/>(e.g. [Docker Security Scanning](https://blog.docker.com/2016/05/docker-security-scanning/))
|
|
|
|
---
|
|
|
|
## Upgrading with Compose
|
|
|
|
Compose makes this particularly easy:
|
|
```bash
|
|
docker-compose build --pull --no-cache
|
|
docker-compose up -d
|
|
```
|
|
|
|
This will automatically:
|
|
|
|
- pull base images;
|
|
- rebuild all container images;
|
|
- bring up the new containers.
|
|
|
|
Remember: Compose will automatically move our
|
|
volumes to the new containers, so data is preserved.
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# Resiliency <br/> and <br/> high availability
|
|
|
|
---
|
|
|
|
## What are our single points of failure?
|
|
|
|
- The TLS certificates created by Machine are on `node1`
|
|
|
|
- We have only one Swarm manager
|
|
|
|
- If a node (running containers) is down or unreachable,
|
|
our application will be affected
|
|
|
|
---
|
|
|
|
# Distributing Machine credentials
|
|
|
|
- All the credentials (TLS keys and certs) are on node1
|
|
<br/>(the node on which we ran `docker-machine create`)
|
|
|
|
- If we lose node1, we're toast
|
|
|
|
- We need to move (or copy) the credentials somewhere safe
|
|
|
|
- Credentials are regular files, and relatively small
|
|
|
|
- Ah, if only we had a highly available, hierarchical store ...
|
|
|
|
--
|
|
|
|
- Wait a minute, we have one!
|
|
|
|
--
|
|
|
|
(That's Consul, if you were wondering)
|
|
|
|
---
|
|
|
|
## Storing files in Consul
|
|
|
|
- We will use [Benjamin Wester's consulfs](
|
|
https://github.com/bwester/consulfs)
|
|
|
|
- It mounts a Consul key/value store as a local filesystem
|
|
|
|
- Performance will be horrible
|
|
<br/>(don't run a database on top of that!)
|
|
|
|
- But to store files of a few KB, nobody will notice
|
|
|
|
- We will copy/link/sync... `~/.docker/machine` to Consul
|
|
|
|
---
|
|
|
|
## Installing consulfs
|
|
|
|
- Option 1: install Go, git clone, go build ...
|
|
|
|
- Option 2: be lazy and use [jpetazzo/consulfs](
|
|
https://hub.docker.com/r/jpetazzo/consulfs/)
|
|
|
|
.exercise[
|
|
|
|
- Be lazy and use the Docker image:
|
|
```bash
|
|
eval $(docker-machine env node1)
|
|
docker run --rm -v /usr/local/bin:/target jpetazzo/consulfs
|
|
```
|
|
]
|
|
|
|
Note: the `jpetazzo/consulfs` image contains the
|
|
`consulfs` binary.
|
|
|
|
It copies it to `/target` (if `/target` is a volume).
|
|
|
|
---
|
|
|
|
## Can't we run consulfs in a container?
|
|
|
|
- Yes we can!
|
|
|
|
- The filesystem will be mounted in the container
|
|
|
|
- It won't be visible outside of the container (from the host)
|
|
|
|
- We can use *shared mounts* to propagate mounts from containers to Docker
|
|
|
|
- But propagating from Docker to the host requires particular systemd flags
|
|
|
|
- ... So we'll run it on the host for now
|
|
|
|
---
|
|
|
|
## Running consulfs
|
|
|
|
- The `consulfs` binary takes two arguments:
|
|
|
|
- the Consul server address
|
|
- a mount point (that has to be created first)
|
|
|
|
.exercise[
|
|
|
|
- Create a mount point and mount Consul as a local filesystem:
|
|
```bash
|
|
mkdir ~/consul
|
|
consulfs localhost:8500 ~/consul
|
|
```
|
|
|
|
]
|
|
|
|
Leave this running in the foreground.
|
|
|
|
---
|
|
|
|
## Checking our consulfs mount point
|
|
|
|
- All key/values will be visible:
|
|
|
|
- Swarm discovery
|
|
|
|
- overlay networks
|
|
|
|
- ... anything you put in Consul!
|
|
|
|
.exercise[
|
|
|
|
- Check that Consul key/values are visible:
|
|
```bash
|
|
ls -l ~/consul/
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Copying our credentials to Consul
|
|
|
|
- Use standard UNIX commands
|
|
|
|
- Don't try to preserve permissions, though (`consulfs` doesn't store permissions)
|
|
|
|
.exercise[
|
|
|
|
- Copy Machine credentials into Consul:
|
|
```bash
|
|
cp -r ~/.docker/machine/. ~/consul/machine/
|
|
```
|
|
|
|
]
|
|
|
|
(This command can be re-executed to update the copy.)
|
|
|
|
---
|
|
|
|
## Install consulfs on another node
|
|
|
|
- We will repeat the previous steps to install consulfs
|
|
|
|
.exercise[
|
|
|
|
- Connect to node2:
|
|
```bash
|
|
ssh node2
|
|
```
|
|
|
|
- Install `consulfs`:
|
|
```bash
|
|
docker run --rm -v /usr/local/bin:/target jpetazzo/consulfs
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Mount Consul
|
|
|
|
- The procedure is still the same as on the first node
|
|
|
|
.exercise[
|
|
|
|
- Create the mount point:
|
|
```bash
|
|
mkdir ~/consul
|
|
```
|
|
|
|
- Mount the filesystem:
|
|
```bash
|
|
consulfs localhost:8500 ~/consul &
|
|
```
|
|
|
|
]
|
|
|
|
At this point, `ls -l ~/consul` should show `docker` and
|
|
`machine` directories.
|
|
|
|
---
|
|
|
|
## Access the credentials from the other node
|
|
|
|
- We will create a symlink
|
|
|
|
- We could also copy the credentials
|
|
|
|
.exercise[
|
|
|
|
- Create the symlink:
|
|
```bash
|
|
mkdir -p ~/.docker/
|
|
ln -s ~/consul/machine ~/.docker/
|
|
```
|
|
|
|
- Check that all nodes are visible:
|
|
```bash
|
|
docker-machine ls
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## A few words on this strategy
|
|
|
|
- Anyone accessing Consul can control your Docker cluster
|
|
<br/>(to be fair: anyone accessing Consul can wreak
|
|
serious havoc on your cluster anyway)
|
|
|
|
- ConsulFS doesn't support *all* POSIX operations,
|
|
so a few things (like `mv`) will not work
|
|
|
|
- As a consequence, with Machine 0.6, you cannot
|
|
run `docker-machine create` directly on top of ConsulFS
|
|
|
|
---
|
|
|
|
## What if Consul becomes unavailable?
|
|
|
|
- If Consul becomes unavailable (e.g. loses quorum),
|
|
<br/>you won't be able to access your credentials
|
|
|
|
- If Consul becomes unavailable ...
|
|
<br/>your cluster will be in a bad state anyway
|
|
|
|
- You can still access each Docker Engine over the
|
|
local UNIX socket
|
|
<br/>(and repair Consul that way)
|
|
|
|
|
|
---
|
|
|
|
# Highly available Swarm managers
|
|
|
|
- Until now, the Swarm manager was a SPOF
|
|
<br/>(Single Point Of Failure)
|
|
|
|
- Swarm has support for replication
|
|
|
|
- When replication is enabled, you deploy multiple (identical) managers
|
|
|
|
- one will be "primary"
|
|
- the other(s) will be "secondary"
|
|
- this is determined automatically
|
|
<br/>(through *leader election*)
|
|
|
|
---
|
|
|
|
## Swarm leader election
|
|
|
|
- The leader election mechanism relies on a key/value store
|
|
<br/>(Consul, etcd, Zookeeper)
|
|
|
|
- There is no requirement on the number of replicas
|
|
<br/>(the quorum is achieved through the key/value store)
|
|
|
|
- When the leader (or "primary") is unavailable,
|
|
<br/>a new election happens automatically
|
|
|
|
- You can issue API requests to any manager:
|
|
<br/>if you talk to a secondary, it forwards to the primary
|
|
|
|
.warning[There is currently a bug when
|
|
the Consul cluster itself has a leader election;
|
|
<br/>see [docker/swarm#1782](https://github.com/docker/swarm/issues/1782).]
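
The idea can be sketched with a toy example: each candidate attempts an atomic "create if absent" on a well-known key, and the one that succeeds becomes primary. Below, a local directory and `ln -s` (whose creation is atomic) stand in for the key/value store; the real Swarm manager does an atomic check-and-set against Consul/etcd/Zookeeper.

```bash
# Toy leader election: the first candidate to create the
# "leader" key wins; later candidates see the existing key
# and become secondaries.
store=$(mktemp -d)

elect() {  # usage: elect <candidate-name>
  if ln -s "$1" "$store/leader" 2>/dev/null; then
    echo "$1 is now primary"
  else
    echo "$1 is secondary (primary is $(readlink "$store/leader"))"
  fi
}

elect node1   # → node1 is now primary
elect node2   # → node2 is secondary (primary is node1)
```

Deleting `$store/leader` simulates the primary disappearing: the next `elect` call then wins a fresh "election."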
|
|
|
|
---
|
|
|
|
## Swarm replication in practice
|
|
|
|
- We need to give two extra flags to the Swarm manager:
|
|
|
|
- `--replication`
|
|
|
|
*enables replication (duh!)*
|
|
|
|
- `--advertise ip.ad.dr.ess:port`
|
|
|
|
*address and port where this Swarm manager is reachable*
|
|
|
|
- Do you deploy with Docker Machine?
|
|
<br/>Then you can use `--swarm-opt`
|
|
to automatically pass flags to the Swarm manager
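
For reference, with those options the manager container ends up running a command along these lines (the address and TLS paths below are placeholders; Machine fills in the real ones):

```bash
swarm manage \
    --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem \
    -H tcp://0.0.0.0:3376 \
    --replication --advertise 172.16.0.11:3376 \
    consul://localhost:8500
```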
|
|
|
|
---
|
|
|
|
## Cleaning up our current Swarm containers
|
|
|
|
- We will use Docker Machine to re-provision Swarm
|
|
|
|
- We need to:
|
|
|
|
- remove the nodes from the Machine registry
|
|
- remove the Swarm containers
|
|
|
|
.exercise[
|
|
|
|
- Remove the current configuration (remember to go back to node1!):
|
|
```bash
|
|
for N in 1 2 3 4 5; do
|
|
ssh node$N docker rm -f swarm-agent swarm-agent-master
|
|
docker-machine rm -f node$N
|
|
done
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Re-deploy with the new configuration
|
|
|
|
- This time, all nodes can be deployed identically
|
|
<br/>(instead of 1 manager + 4 non-managers)
|
|
|
|
.exercise[
|
|
|
|
```bash
|
|
grep 'node[12345]' /etc/hosts | grep -v ^127 |
|
|
while read IPADDR NODENAME; do
|
|
docker-machine create --driver generic \
|
|
--engine-opt cluster-store=consul://localhost:8500 \
|
|
--engine-opt cluster-advertise=eth0:2376 \
|
|
--swarm --swarm-master \
|
|
--swarm-discovery consul://localhost:8500 \
|
|
--swarm-opt replication --swarm-opt advertise=$IPADDR:3376 \
|
|
--generic-ssh-user docker --generic-ip-address $IPADDR $NODENAME
|
|
done
|
|
```
|
|
|
|
]
|
|
|
|
.small[
|
|
Note: Consul is still running thanks to the `--restart=always` policy.
|
|
Other containers are now stopped, because the engines have been
|
|
reconfigured and restarted.
|
|
]
|
|
|
|
---
|
|
|
|
## Assess our new cluster health
|
|
|
|
- The output of `docker info` will tell us the status
|
|
of the node that we are talking to (primary or replica)
|
|
|
|
- If we talk to a replica, it will tell us who is the primary
|
|
|
|
.exercise[
|
|
|
|
- Talk to a random node, and ask its view of the cluster:
|
|
```bash
|
|
eval $(docker-machine env node3 --swarm)
|
|
docker info | grep -e ^Name -e ^Role -e ^Primary
|
|
```
|
|
|
|
]
|
|
|
|
Note: `docker info` is one of the few commands that


work even when there is no elected primary. This helps


with debugging.
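
To make the output easier to interpret, here is a simulated `docker info` excerpt with that same `grep` applied (the values are made up, and the real output has many more lines):

```bash
# Stand-in for `docker info` as seen on a secondary manager
docker_info() {
  printf '%s\n' \
    'Name: node3' \
    'Role: replica' \
    'Primary: 172.16.0.11:3376' \
    'Strategy: spread' \
    'Nodes: 5'
}

docker_info | grep -e ^Name -e ^Role -e ^Primary
```

Only the `Name`, `Role`, and `Primary` lines survive the filter, which is exactly what we need to locate the primary manager.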
|
|
|
|
---
|
|
|
|
## Test Swarm manager failover
|
|
|
|
- The previous command told us which node was the primary manager
|
|
|
|
- if `Role` is `primary`,
|
|
<br/>then the primary is indicated by `Name`
|
|
|
|
- if `Role` is `replica`,
|
|
<br/>then the primary is indicated by `Primary`
|
|
|
|
.exercise[
|
|
|
|
- Kill the primary manager:
|
|
```bash
|
|
ssh node`N` docker kill swarm-agent-master
|
|
```
|
|
|
|
]
|
|
|
|
Look at the output of `docker info` every few seconds.
|
|
|
|
---
|
|
|
|
# Highly available containers
|
|
|
|
- Swarm has support for *rescheduling* on node failure
|
|
|
|
- It has to be explicitly enabled on a per-container basis
|
|
|
|
- When the primary manager detects that a node goes down,


<br/>the containers that had rescheduling enabled are rescheduled elsewhere
|
|
|
|
- If the containers can't be rescheduled (constraints issue),
|
|
<br/>they are lost (there is no reconciliation loop yet)
|
|
|
|
- In Swarm 1.1, this is an *experimental* feature
|
|
<br/>(To enable it, you must pass the `--experimental` flag when you start Swarm itself!)
|
|
|
|
- In Swarm 1.2, you don't need the `--experimental` flag anymore
|
|
|
|
---
|
|
|
|
## About Swarm generic flags
|
|
|
|
- Some flags like `--experimental` and `--debug` must be *before* the Swarm command
|
|
<br/>(i.e. `docker run swarm --debug manage ...`)
|
|
|
|
- We cannot use Docker Machine to pass that flag ☹
|
|
<br/>(Machine adds flags *after* the Swarm command)
|
|
|
|
- Instead, we can use a custom Swarm image:
|
|
```dockerfile
|
|
FROM swarm
|
|
ENTRYPOINT ["/swarm", "--debug"]
|
|
```
|
|
|
|
- We can tell Machine to use this with `--swarm-image`
|
|
|
|
---
|
|
|
|
## Start a resilient container
|
|
|
|
- By default, containers will not be restarted when their node goes down
|
|
|
|
- You must pass an explicit *rescheduling policy* to make that happen
|
|
|
|
- For now, the only policy is "on-node-failure"
|
|
|
|
.exercise[
|
|
|
|
- Start a container with a rescheduling policy:
|
|
|
|
```bash
|
|
docker run --name highlander -d -e reschedule:on-node-failure nginx
|
|
```
|
|
|
|
]
|
|
|
|
Check that the container is up and running.
|
|
|
|
---
|
|
|
|
## Simulate a node failure
|
|
|
|
- We will reboot the node running this container
|
|
|
|
- Swarm will reschedule it
|
|
|
|
.exercise[
|
|
|
|
- Check on which node the container is running:
|
|
<br/>`NODE=$(docker inspect --format '{{.Node.Name}}' highlander)`
|
|
|
|
- Reboot that node:
|
|
<br/>`ssh $NODE sudo reboot`
|
|
|
|
- Check that the container has been rescheduled:
|
|
<br/>`docker ps -a`
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## Reboots
|
|
|
|
- When rebooting a node, Docker is stopped cleanly, and containers are stopped
|
|
|
|
- Our container is rescheduled, but not started
|
|
|
|
- To simulate a "proper" failure, we can use the Chaos Monkey script instead
|
|
|
|
```bash
|
|
~/orchestration-workshop/bin/chaosmonkey $NODE <connect|disconnect|reboot>
|
|
```
|
|
|
|
---
|
|
|
|
## Cluster reconciliation
|
|
|
|
- After the rebooted node rejoins the cluster, we can end up with duplicate containers
|
|
|
|
.exercise[
|
|
|
|
- Once the node is back, remove one of the extraneous containers:
|
|
```bash
|
|
docker rm -f node`N`/highlander
|
|
```
|
|
|
|
]
|
|
|
|
---
|
|
|
|
## .warning[Caveats]
|
|
|
|
- There are some corner cases when the node is also
|
|
the Swarm leader or the Consul leader; this is being improved
|
|
right now!
|
|
|
|
- The safest way to address this for now is to run the Consul
|
|
servers, the Swarm managers, and your containers on
|
|
different nodes.
|
|
|
|
- Swarm doesn't handle gracefully the fact that after the
|
|
reboot, you have *two* containers named `highlander`,
|
|
and attempts to manipulate the container with its name
|
|
will not work. This will be improved too.
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# Conclusions
|
|
|
|
---
|
|
|
|
## Swarm cluster deployment
|
|
|
|
- We saw how to use Machine with the `generic` driver to turn
|
|
any set of machines into a Swarm cluster
|
|
|
|
- This can trivially be adapted to provision cloud instances
|
|
on the fly (using "normal" drivers of Docker Machine)
|
|
|
|
- For auto-scaling, you can use e.g.:
|
|
|
|
- private admin-only network
|
|
|
|
- no TLS
|
|
|
|
- static discovery on a /24 to /20 network (depending on your needs)
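
With static discovery, Swarm's `nodes://` backend accepts address ranges, so the whole fleet can be described without a key/value store (the subnet below is hypothetical):

```bash
swarm manage nodes://10.1.2.[1:254]:2375
```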
|
|
|
|
---
|
|
|
|
## Key/value store
|
|
|
|
- We saw an easy deployment method for Consul
|
|
|
|
- This is good for 3 to 9 nodes
|
|
|
|
- Remember: raft write performance *degrades* as you add nodes!
|
|
|
|
- For bigger clusters:
|
|
|
|
- have e.g. 5 "static" server nodes
|
|
|
|
- put them in a round-robin DNS record set (or behind an ELB)
|
|
|
|
- run a normal agent on the other nodes
|
|
|
|
---
|
|
|
|
## App deployment
|
|
|
|
- We saw how to transform a Compose file into a series of build artefacts
|
|
|
|
- using S3 or another object store is trivial
|
|
|
|
- We saw how to programmatically add load balancing, logging
|
|
|
|
- This can be improved further by using variable interpolation for the image tags
|
|
|
|
- Rolling deploys are relatively straightforward, but:
|
|
|
|
- I recommend aiming directly for blue/green (or canary) deploys
|
|
|
|
- In the production stack, abstract stateful services with ambassadors
|
|
|
|
---
|
|
|
|
## Operations
|
|
|
|
- We saw how to set up an ELK stack and send logs to it in record time
|
|
|
|
*Important: this doesn't mean that operating ELK suddenly became an easy thing!*
|
|
|
|
- We saw how to translate a few basic tasks to containerized environments
|
|
|
|
(Backups, network traffic analysis)
|
|
|
|
- Debugging is surprisingly similar to what it used to be:
|
|
|
|
- remember that containerized processes are normal processes running on the host
|
|
|
|
- `docker exec` is your friend
|
|
|
|
- also: `docker run --net host --pid host -v /:/hostfs alpine chroot /hostfs`
|
|
|
|
---
|
|
|
|
## Things we haven't covered
|
|
|
|
- Per-container system metrics (look at cAdvisor, Snap, Prometheus...)
|
|
|
|
- Application metrics (continue to use whatever you were using before)
|
|
|
|
- Supervision (whatever you were using before still works exactly the same way)
|
|
|
|
- Tracking access to credentials and sensitive information (see Vault, Keywhiz...)
|
|
|
|
- ... (tell me what I should cover in future workshops!) ...
|
|
|
|
---
|
|
|
|
## Resilience
|
|
|
|
- We saw how to store important data (credentials) in Consul
|
|
|
|
- We saw how to achieve H/A for Swarm itself
|
|
|
|
- Rescheduling policies give us basic H/A for containers
|
|
|
|
- This will be improved in future releases
|
|
|
|
- Docker in general, and Swarm in particular, move *fast*
|
|
|
|
- Current high availability features are not Chaos-Monkey proof (yet)
|
|
|
|
- We (well, the Swarm team) are working to change that
|
|
|
|
---
|
|
|
|
## What's next?
|
|
|
|
- November 2015: Compose 1.5 + Engine 1.9 =
|
|
<br/>first release with multi-host networking
|
|
|
|
- January 2016: Compose 1.6 + Engine 1.10 =
|
|
<br/>embedded DNS server, experimental high availability
|
|
|
|
- April 2016: Compose 1.7 + Engine 1.11 =
|
|
<br/>round-robin DNS records, huge improvements in HA
|
|
|
|
- Next release: another truckload of features
|
|
|
|
- I will deliver this workshop about twice a month
|
|
|
|
- Check out the GitHub repo for updated content!
|
|
<br/>(there is a tag for each big round of updates)
|
|
|
|
---
|
|
|
|
## Overall complexity
|
|
|
|
- The scripts used here are pretty simple (each is under 100 lines of code)
|
|
|
|
- You can easily rewrite them in your favorite language,
|
|
<br/>and adapt and customize them, in a few hours
|
|
|
|
- FYI: those scripts are smaller and simpler than the
|
|
scripts (cloud-init, etc.) used to deploy the VMs for this
|
|
workshop!
|
|
|
|
- Docker Inc. has commercial products to wrap all this:
|
|
|
|
- Docker Cloud
|
|
<br/>(manage your Docker nodes from a SAAS portal)
|
|
|
|
- Docker Datacenter
|
|
<br/>(buzzword-compliant management solution:
|
|
<br/>turnkey, enterprise-class, on-premise, etc.)
|
|
|
|
---
|
|
|
|
class: title
|
|
|
|
# Thanks! <br/> Questions?
|
|
|
|
## [@jpetazzo](https://twitter.com/jpetazzo) <br/> [@docker](https://twitter.com/docker)
|
|
|
|
</textarea>
|
|
<script src="https://gnab.github.io/remark/downloads/remark-0.13.min.js" type="text/javascript">
|
|
</script>
|
|
<script type="text/javascript">
|
|
var slideshow = remark.create({
|
|
ratio: '16:9',
|
|
highlightSpans: true
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|