<!DOCTYPE html>
<html>
<head>
<base target="_blank">
<title>Docker Orchestration Workshop</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'Droid Serif'; }
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: normal;
margin-top: 0.5em;
}
a {
text-decoration: none;
color: blue;
}
.remark-slide-content { padding: 1em 2.5em 1em 2.5em; }
.remark-slide-content { font-size: 25px; }
.remark-slide-content h1 { font-size: 50px; }
.remark-slide-content h2 { font-size: 50px; }
.remark-slide-content h3 { font-size: 25px; }
.remark-code { font-size: 25px; }
.small .remark-code { font-size: 16px; }
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
.red { color: #fa0000; }
.gray { color: #ccc; }
.small { font-size: 70%; }
.big { font-size: 140%; }
.underline { text-decoration: underline; }
.pic {
vertical-align: middle;
text-align: center;
padding: 0 0 0 0 !important;
}
img {
max-width: 100%;
max-height: 550px;
}
.title {
vertical-align: middle;
text-align: center;
}
.title h1 { font-size: 100px; }
.title p { font-size: 100px; }
.quote {
background: #eee;
border-left: 10px solid #ccc;
margin: 1.5em 10px;
padding: 0.5em 10px;
quotes: "\201C""\201D""\2018""\2019";
font-style: italic;
}
.quote:before {
color: #ccc;
content: open-quote;
font-size: 4em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
.quote p {
display: inline;
}
.warning {
background-image: url("warning.png");
background-size: 1.5em;
background-repeat: no-repeat;
padding-left: 2em;
}
.exercise {
background-color: #eee;
background-image: url("keyboard.png");
background-size: 1.4em;
background-repeat: no-repeat;
background-position: 0.2em 0.2em;
border: 2px dotted black;
}
.exercise::before {
content: "Exercise";
margin-left: 1.8em;
}
li p { line-height: 1.25em; }
</style>
</head>
<body>
<textarea id="source">
class: title
Docker <br/> Orchestration <br/> Workshop
---
## Logistics
- Hello! We're `jerome at docker dot com` and `aj at soulshake dot net`
<!--
Reminder, when updating the agenda: when people are told to show
up at 9am, they usually trickle in until 9:30am (except for paid
training sessions). If you're not sure that people will be there
on time, it's a good idea to have a breakfast with the attendees
at e.g. 9am, and start at 9:30.
- Agenda:
.small[
- 08:00-09:00 hello and breakfast
- 09:00-10:25 part 1
- 10:25-10:35 coffee break
- 10:35-12:00 part 2
- 12:00-13:00 lunch break
- 13:00-14:25 part 3
- 14:25-14:35 coffee break
- 14:35-16:00 part 4
]
-->
- The tutorial will run from 1:20pm to 4:40pm
- There will be a break from 3:00pm to 3:15pm
- This will be FAST PACED, but DON'T PANIC!
- All the content is publicly available (slides, code samples, scripts)
<!--
Remember to change:
- the Gitter link below
- the "tweet my speed" hashtag in DockerCoins HTML
-->
- Live feedback, questions, help on
[Gitter](http://container.training/chat)
---
<!--
grep '^# ' index.html | grep -v '<br' | tr '#' '-'
-->
## Chapter 1: getting started
- Pre-requirements
- VM environment
- Our sample application
- Running the application
- Identifying bottlenecks
- Scaling out
- Connecting to containers on other hosts
- Abstracting remote services with ambassadors
---
## Chapter 2: Swarm setup and deployment
- Dynamic orchestration
- Deploying Swarm
- Picking a key/value store
- Running containers on Swarm
- Resource allocation
- Multi-host networking
- Building images with Swarm
- Deploying a local registry
- Scaling web services with Compose on Swarm
---
## Chapter 3: Docker for Ops
- Logs
- Setting up ELK to store container logs
- Network traffic analysis
- Backups
- Controlling Docker from a container
- Docker events stream
- Security upgrades
---
## Chapter 4: high availability (additional content)
- Distributing Machine credentials
- Highly available Swarm managers
- Highly available containers
- Conclusions
---
# Pre-requirements
- Computer with network connection and SSH client
- on Linux, OS X, FreeBSD... you are probably all set
- on Windows, get [putty](http://www.putty.org/),
[Git BASH](https://git-for-windows.github.io/), or
[MobaXterm](http://mobaxterm.mobatek.net/)
- Basic Docker knowledge
<br/>(but that's OK if you're not a Docker expert!)
---
## Nice-to-haves
- [GitHub](https://github.com/join) account
<br/>(if you want to fork the repo; also used to join Gitter)
- [Gitter](https://gitter.im/) account
<br/>(to join the conversation during the workshop)
- [Docker Hub](https://hub.docker.com) account
<br/>(it's one way to distribute images on your Swarm cluster)
---
## Hands-on sections
- The whole workshop is hands-on
- I will show Docker in action
- I invite you to reproduce what I do
- All hands-on sections are clearly identified, like the gray rectangle below
.exercise[
- This is the stuff you're supposed to do!
- Go to [container.training](http://container.training/) to view these slides
- Join the chat room on
[Gitter](http://container.training/chat)
]
---
# VM environment
- Each person gets 5 private VMs (not shared with anybody else)
- They'll be up until tonight
- You have a little card with login+password+IP addresses
- You can automatically SSH from one VM to another
.exercise[
<!--
```bash
for N in $(seq 1 5); do
ssh -o StrictHostKeyChecking=no node$N true
done
for N in $(seq 1 5); do
(
docker-machine rm -f node$N
ssh node$N "docker ps -aq | xargs -r docker rm -f"
ssh node$N sudo rm -f /etc/systemd/system/docker.service
ssh node$N sudo systemctl daemon-reload
echo Restarting node$N.
ssh node$N sudo systemctl restart docker
echo Restarted node$N.
) &
done
wait
```
-->
- Log into the first VM (`node1`)
- Check that you can SSH (without password) to `node2`:
```bash
ssh node2
```
- Type `exit` or `^D` to come back to node1
<!--
```meta
^D
```
-->
]
---
## We will (mostly) interact with node1 only
- Unless instructed, **all commands must be run from the first VM, `node1`**
- We will only check out/copy the code on `node1`
- When we use the other nodes, we will do it mostly through the Docker API
- We will use SSH only for a few "out of band" operations (mass-removing containers...)
---
## Terminals
Once in a while, the instructions will say:
<br/>"Open a new terminal."
There are multiple ways to do this:
- create a new window or tab on your machine, and SSH into the VM;
- use screen or tmux on the VM and open a new window from there.
You are welcome to use whichever method you feel most comfortable with.
---
## Tmux cheatsheet
- Ctrl-b c → creates a new window
- Ctrl-b n → go to next window
- Ctrl-b p → go to previous window
- Ctrl-b " → split window top/bottom
- Ctrl-b % → split window left/right
- Ctrl-b Alt-1 → rearrange windows in columns
- Ctrl-b Alt-2 → rearrange windows in rows
- Ctrl-b arrows → navigate to other windows
- Ctrl-b d → detach session
- tmux attach → reattach to session
---
## Brand new versions!
- Engine 1.11
- Compose 1.7
- Swarm 1.2
- Machine 0.6
.exercise[
- Check all installed versions:
```bash
docker version
docker-compose -v
docker run --rm swarm -version
docker-machine -v
```
]
---
## Why are we not using the latest version of Machine?
- The latest version of Machine is 0.7
- The way it deploys Swarm is different from 0.6
- This causes a regression in the strategy that we will use later
- More details later!
---
# Our sample application
- Visit the GitHub repository with all the materials of this workshop:
<br/>https://github.com/jpetazzo/orchestration-workshop
- The application is in the [dockercoins](
https://github.com/jpetazzo/orchestration-workshop/tree/master/dockercoins)
subdirectory
- Let's look at the general layout of the source code:
there is a Compose file [docker-compose.yml](
https://github.com/jpetazzo/orchestration-workshop/blob/master/dockercoins/docker-compose.yml) ...
... and 4 other services, each in its own directory:
- `rng` = web service generating random bytes
- `hasher` = web service computing hash of POSTed data
- `worker` = background process using `rng` and `hasher`
- `webui` = web interface to watch progress
---
## Compose file format version
*Particularly relevant if you have used Compose before...*
- Compose 1.6 introduced support for a new Compose file format (aka "v2")
- Services are no longer at the top level, but under a `services` section
- There has to be a `version` key at the top level, with value `"2"` (as a string, not an integer)
- Containers are placed on a dedicated network, making links unnecessary
- There are other minor differences, but upgrade is easy and straightforward
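To illustrate, here is a minimal sketch of the v2 layout (service names and images are just placeholders):

```yaml
version: "2"

services:
  redis:
    image: redis
  worker:
    build: worker
```

Since both services end up on the default network created by Compose, `worker` can reach `redis` by name, with no `links` section.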
---
## Links, naming, and service discovery
- Containers can have network aliases (resolvable through DNS)
- Compose file version 2 makes each container reachable through its service name
- Compose file version 1 requires "links" sections
- Our code can connect to services using their short name
(instead of e.g. IP address or FQDN)
---
## Example in `worker/worker.py`
![Service discovery](service-discovery.png)
---
## What's this application?
---
class: pic
![DockerCoins logo](dockercoins.png)
(DockerCoins 2016 logo courtesy of @XtlCnslt and @ndeloof. Thanks!)
---
## What's this application?
- It is a DockerCoin miner! 💰🐳📦🚢
- No, you can't buy coffee with DockerCoins
- How DockerCoins works:
- `worker` asks `rng` to give it random bytes
- `worker` feeds those random bytes into `hasher`
- each hash starting with `0` is a DockerCoin
- DockerCoins are stored in `redis`
- `redis` is also updated every second to track speed
- you can see the progress with the `webui`
---
## Getting the application source code
- We will clone the GitHub repository
- The repository also contains scripts and tools that we will use through the workshop
.exercise[
<!--
```bash
[ -d orchestration-workshop ] && mv orchestration-workshop orchestration-workshop.$$
```
-->
- Clone the repository on `node1`:
```bash
git clone git://github.com/jpetazzo/orchestration-workshop
```
]
(You can also fork the repository on GitHub and clone your fork if you prefer that.)
---
# Running the application
Without further ado, let's start our application.
.exercise[
- Go to the `dockercoins` directory, in the cloned repo:
```bash
cd ~/orchestration-workshop/dockercoins
```
- Use Compose to build and run all containers:
```bash
docker-compose up
```
]
Compose tells Docker to build all container images (pulling
the corresponding base images), then starts all containers,
and displays aggregated logs.
---
## Lots of logs
- The application continuously generates logs
- We can see the `worker` service making requests to `rng` and `hasher`
- Let's put that in the background
.exercise[
- Stop the application by hitting `^C`
<!--
```meta
^C
```
-->
]
- `^C` stops all containers by sending them the `TERM` signal
- Some containers exit immediately, others take longer
<br/>(because they don't handle `SIGTERM` and end up being killed after a 10s timeout)
---
## Restarting in the background
- Many flags and commands of Compose are modeled after those of `docker`
.exercise[
- Start the app in the background with the `-d` option:
```bash
docker-compose up -d
```
- Check that our app is running with the `ps` command:
```bash
docker-compose ps
```
]
`docker-compose ps` also shows the ports exposed by the application.
---
## Viewing logs
- The `docker-compose logs` command works like `docker logs`
.exercise[
- View all logs since container creation and exit when done:
```bash
docker-compose logs
```
- Stream container logs, starting at the last 10 lines for each container:
```bash
docker-compose logs --tail 10 --follow
```
<!--
```meta
^C
```
-->
]
Tip: use `^S` and `^Q` to pause/resume log output.
???
## Upgrading from Compose 1.6
.warning[The `logs` command has changed between Compose 1.6 and 1.7!]
- Up to 1.6
- `docker-compose logs` is the equivalent of `logs --follow`
- `docker-compose logs` must be restarted if containers are added
- Since 1.7
- `--follow` must be specified explicitly
- new containers are automatically picked up by `docker-compose logs`
---
## Connecting to the web UI
- The `webui` container exposes a web dashboard; let's view it
.exercise[
- Open http://[yourVMaddr]:8000/ (from a browser)
]
- The app actually has a constant, steady speed (3.33 coins/second)
- The speed seems not-so-steady because:
- the worker doesn't update the counter after every loop, but up to once per second
- the speed is computed by the browser, checking the counter about once per second
- between two consecutive updates, the counter will increase either by 4, or by 0
---
## Scaling up the application
- Our goal is to make that performance graph go up (without changing a line of code!)
- Before trying to scale the application, we'll figure out if we need more resources
(CPU, RAM...)
- For that, we will use good old UNIX tools on our Docker node
<!-- FIXME add reference to cadvisor, snap, ...? -->
---
## Looking at resource usage
- Let's look at CPU, memory, and I/O usage
.exercise[
- run `top` to see CPU and memory usage (you should see idle cycles)
- run `vmstat 3` to see I/O usage (si/so/bi/bo)
<br/>(the 4 numbers should be almost zero, except `bo` for logging)
]
We have resources available.
- Why?
- How can we use them?
---
## Scaling workers on a single node
- Docker Compose supports scaling
- Let's scale `worker` and see what happens!
.exercise[
- Start one more `worker` container:
```bash
docker-compose scale worker=2
```
- Look at the performance graph (it should show a 2x improvement)
- Look at the aggregated logs of our containers (`worker_2` should show up)
- Look at the impact on CPU load with e.g. top (it should be negligible)
]
---
## Adding more workers
- Great, let's add more workers and call it a day, then!
.exercise[
- Start eight more `worker` containers:
```bash
docker-compose scale worker=10
```
- Look at the performance graph: does it show a 10x improvement?
- Look at the aggregated logs of our containers
- Look at the impact on CPU load and memory usage
<!--
```bash
sleep 5
killall docker-compose
```
-->
]
---
# Identifying bottlenecks
- You should have seen a 3x speed bump (not 10x)
- Adding workers didn't result in linear improvement
- *Something else* is slowing us down
--
- ... But what?
--
- The code doesn't have instrumentation
- Let's use state-of-the-art HTTP performance analysis!
<br/>(i.e. good old tools like `ab`, `httping`...)
---
## Measuring latency under load
We will use `httping`.
.exercise[
- Check the latency of `rng`:
```bash
httping -c 10 localhost:8001
```
- Check the latency of `hasher`:
```bash
httping -c 10 localhost:8002
```
]
`rng` has a much higher latency than `hasher`.
---
## Let's draw hasty conclusions
- The bottleneck seems to be `rng`
- *What if* we don't have enough entropy and can't generate enough random numbers?
- We need to scale out the `rng` service on multiple machines!
Note: this is a fiction! We have enough entropy. But we need a pretext to scale out.
<br/>(In fact, the code of `rng` uses `/dev/urandom`, which doesn't need entropy.)
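If you're curious, you can see how much entropy a Linux node thinks it has (purely informative here, since `/dev/urandom` won't block regardless):

```bash
# Show the kernel's available entropy estimate, in bits (Linux-specific)
cat /proc/sys/kernel/random/entropy_avail
```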
---
class: title
# Scaling out
---
# Connecting to containers on other hosts
- So far, our whole stack is on a single machine
- We want to scale out (across multiple nodes)
- We will deploy the same stack multiple times
- But we want every stack to use the same Redis
<br/>(in other words: Redis is our only *stateful* service here)
--
- And remember: we're not allowed to change the code!
- the code connects to host `redis`
- `redis` must resolve to the address of our Redis service
- the Redis service must listen on the default port (6379)
???
## Using custom DNS mapping
- We could set up a Redis server on its default port
- And add a DNS entry mapping `redis` to this server
.exercise[
- See what happens if we run:
```bash
docker run --add-host redis:1.2.3.4 alpine ping redis
```
<!--
```meta
^C
```
-->
]
There is a Compose file option for that: `extra_hosts`.
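In a version 2 Compose file, the equivalent of `--add-host` would look something like this (the address is a placeholder):

```yaml
worker:
  build: worker
  extra_hosts:
    - "redis:1.2.3.4"
```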
---
# Abstracting remote services with ambassadors
<!--
- What if we can't/won't run Redis on its default port?
- What if we want to be able to move it easily?
-->
- We will use an ambassador
- Redis will be started independently of our stack
- It will run at an arbitrary location (host+port)
- In our stack, we replace `redis` with an ambassador
- The ambassador will connect to Redis
- The ambassador will "act as" Redis in the stack
---
class: pic
![Ambassador principle](static-orchestration-1-node-a.png)
---
class: pic
![Ambassador principle](static-orchestration-1-node-b.png)
---
class: pic
![Ambassador principle](static-orchestration-1-node-c.png)
---
class: pic
![Ambassador principle](static-orchestration-2-nodes.png)
---
class: pic
![Ambassador principle](static-orchestration-3-nodes.png)
---
class: pic
![Ambassador principle](static-orchestration-4-nodes.png)
---
class: pic
![Ambassador principle](static-orchestration-5-nodes.png)
---
## Start redis
- Start a standalone Redis container
- Let Docker expose it on a random port
.exercise[
- Run redis with a random public port:
<br/>`docker run -d -P --name myredis redis`
- Check which port was allocated:
<br/>`docker port myredis 6379`
]
- Note the IP address of the machine, and this port
---
## Introduction to `jpetazzo/hamba`
- General purpose load balancer and traffic director
- [Source code is available on GitHub](
https://github.com/jpetazzo/hamba)
- [Public image is available on the Docker Hub](
https://hub.docker.com/r/jpetazzo/hamba/)
- Generates a configuration file for HAProxy, then starts HAProxy
- Parameters are provided on the command line; for instance:
```bash
docker run -d -p 80 jpetazzo/hamba 80 www1:1234 www2:2345
docker run -d -p 80 jpetazzo/hamba 80 www1 1234 www2 2345
```
Those two commands do the same thing: they start a load balancer
listening on port 80 and balancing traffic across `www1:1234` and `www2:2345`.
---
## Update `docker-compose.yml`
.exercise[
- Replace `redis` with an ambassador using `jpetazzo/hamba`:
```yaml
redis:
image: jpetazzo/hamba
command: 6379 `AA.BB.CC.DD:EEEEE`
```
<!--
```edit
cat docker-compose.yml-ambassador | sed "s/AA.BB.CC.DD/$(curl myip.enix.org/REMOTE_ADDR)/" | sed "s/EEEEE/$(docker port myredis 6379 | cut -d: -f2)/" > docker-compose.yml
```
-->
]
Shortcut: `docker-compose.yml-ambassador`
<br/>(But you still have to update `AA.BB.CC.DD:EEEEE`!)
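If you'd rather not edit the file by hand, the substitution can be scripted; this sketch pipes a sample line through `sed` (in practice, you would run `sed -i` on `docker-compose.yml`, using your node's real address and the port reported by `docker port myredis 6379`):

```bash
# Replace the placeholder with an actual address:port pair
# (10.0.0.1:32768 are illustrative values)
echo 'command: 6379 AA.BB.CC.DD:EEEEE' |
sed 's/AA\.BB\.CC\.DD:EEEEE/10.0.0.1:32768/'
# → command: 6379 10.0.0.1:32768
```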
---
## Start the stack on the first machine
- Compose will detect the change in the `redis` service
- It will replace `redis` with a `jpetazzo/hamba` instance
.exercise[
- Just tell Compose to do its thing:
<br/>`docker-compose up -d`
- Check that the stack is up and running:
<br/>`docker-compose ps`
- Look at the web UI to make sure that it works fine
]
---
## Controlling other Docker Engines
- Many tools in the ecosystem will honor the `DOCKER_HOST` environment variable
- Those tools include (obviously!) the Docker CLI and Docker Compose
- Our training VMs have been set up to accept API requests on port 55555
<br/>(without authentication - this is very insecure, by the way!)
- We will see later how to set up mutual authentication with certificates
---
## Setting the `DOCKER_HOST` environment variable
.exercise[
- Check how many containers are running on `node1`:
```bash
docker ps
```
- Set the `DOCKER_HOST` variable to control `node2`, and compare:
```bash
export DOCKER_HOST=tcp://node2:55555
docker ps
```
]
You shouldn't see any container running on `node2` at this point.
---
## Start the stack on another machine
- We will tell Compose to bring up our stack on the other node
- It will use the local code (we don't need to check out the code on `node2`)
.exercise[
- Start the stack:
```bash
docker-compose up -d
```
]
Note: this will build the container images on `node2`, resulting
in potentially different results from `node1`. We will see later
how to use the same images across the whole cluster.
---
## Run the application on every node
- We will repeat the previous step with a little shell loop
... but introduce parallelism to save some time
.exercise[
- Deploy one instance of the stack on each node:
```bash
for N in 3 4 5; do
DOCKER_HOST=tcp://node$N:55555 docker-compose up -d &
done
wait
```
]
Note: again, this will rebuild the container images on each node.
---
## Scale!
- The app is built (and running!) everywhere
- Scaling can be done very quickly
.exercise[
- Add a bunch of workers all over the place:
```bash
for N in 1 2 3 4 5; do
DOCKER_HOST=tcp://node$N:55555 docker-compose scale worker=10
done
```
- Admire the result in the web UI!
]
---
## A few words about development volumes
- Try to access the web UI on another node
--
- It doesn't work! Why?
--
- Static assets are masked by an empty volume
--
- We need to comment out the `volumes` section
---
## Why must we comment out the `volumes` section?
- Volumes have multiple uses:
- storing persistent stuff (database files...)
- sharing files between containers (logs, configuration...)
- sharing files between host and containers (source...)
- The `volumes` directive expands to a host path:
`/home/docker/orchestration-workshop/dockercoins/webui/files`
- This host path exists on the local machine (not on the others)
- This specific volume is used in development (not in production)
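For reference, the relevant part of the `webui` service definition looks roughly like this (paths are approximate); commenting out the last two lines makes the files baked into the image visible again:

```yaml
webui:
  build: webui
  ports:
    - "8000:80"
  # volumes:
  #   - "./webui/files/:/files/"
```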
---
## Stop the app
- Let's use `docker-compose down`
- It will stop and remove the DockerCoins app (but leave other containers running)
.exercise[
- We can do another simple parallel shell loop:
```bash
for N in $(seq 1 5); do
export DOCKER_HOST=tcp://node$N:55555
docker-compose down &
done
wait
```
]
---
## Clean up the redis container
- `docker-compose down` only removes containers defined with Compose
.exercise[
- Check that `myredis` is still there:
```bash
unset DOCKER_HOST
docker ps
```
- Remove it:
```bash
docker rm -f myredis
```
]
---
## Considerations about ambassadors
"Ambassador" is a design pattern.
There are many ways to implement it.
Other implementations include:
- [interlock](https://github.com/ehazlett/interlock);
- [registrator](http://gliderlabs.com/registrator/latest/);
- [smartstack](http://nerds.airbnb.com/smartstack-service-discovery-cloud/);
- [zuul](https://github.com/Netflix/zuul/wiki);
- and more!
<!--
We will present three increasingly complex (but also powerful)
ways to deploy ambassadors.
-->
???
## Single-tier ambassador deployment
- One-shot configuration process
- Must be executed manually after each scaling operation
- Scans current state, updates load balancer configuration
- Pros:
<br/>- simple, robust, no extra moving part
<br/>- easy to customize (thanks to simple design)
<br/>- can deal efficiently with large changes
- Cons:
<br/>- must be executed after each scaling operation
<br/>- harder to compose different strategies
- Example: this workshop
???
## Two-tier ambassador deployment
- Daemon listens to Docker events API
- Reacts to container start/stop events
- Adds/removes back-ends to load balancers configuration
- Pros:
<br/>- no extra step required when scaling up/down
- Cons:
<br/>- extra process to run and maintain
<br/>- deals with one event at a time (ordering matters)
- Hidden gotcha: load balancer creation
- Example: interlock
???
## Three-tier ambassador deployment
- Daemon listens to Docker events API
- Reacts to container start/stop events
- Adds/removes scaled services in distributed config DB (Zookeeper, etcd, Consul…)
- Another daemon listens to config DB events,
<br/>adds/removes backends to load balancers configuration
- Pros:
<br/>- more flexibility
- Cons:
<br/>- three extra services to run and maintain
- Example: registrator
---
## Ambassadors and overlay networks
- Overlay networks allow direct multi-host communication
- Ambassadors are still useful to implement other tasks:
- load balancing;
- credentials injection;
- instrumentation;
- fail-over;
- etc.
---
class: title
# Dynamic orchestration
---
## Static vs Dynamic
- Static
- you decide what goes where
- simple to describe and implement
- seems easy at first but doesn't scale efficiently
- Dynamic
- the system decides what goes where
- requires extra components (HA KV...)
- scaling can be finer-grained, more efficient
---
class: pic
## Hands-on Swarm
![Swarm Logo](swarm.png)
---
## Swarm (in theory)
- Consolidates multiple Docker hosts into a single one
- You talk to Swarm using the Docker API
→ you can use all existing tools: Docker CLI, Docker Compose, etc.
- Swarm talks to your Docker Engines using the Docker API too
→ you can use existing Engines without modification
- Dispatches (schedules) your containers across the cluster, transparently
- Open source and written in Go (like the Docker Engine)
- Initial design and implementation by [@aluzzardi](https://twitter.com/aluzzardi) and [@vieux](https://twitter.com/vieux),
who were also the authors of the first versions of the Docker Engine
---
## Swarm (in practice)
- Stable since November 2015
- Easy to set up (compared to other orchestrators)
- Tested with 1000 nodes + 50000 containers
<br/>.small[(without particular tuning; see DockerCon EU opening keynotes!)]
- Requires a key/value store for advanced features
- Can use Consul, etcd, or Zookeeper
---
# Deploying Swarm
- Components involved:
- cluster discovery mechanism
<br/>(so that the manager can learn about the nodes)
- Swarm manager
<br/>(your frontend to the cluster)
- Swarm agent
<br/>(runs on each node, registers it with service discovery)
---
## Cluster discovery
- Possible backends:
- dynamic, self-hosted
<br/>(requires running a Consul/etcd/Zookeeper cluster)
- static, through command-line or file
<br/>(great for testing, or for private subnets, see [this article](
https://medium.com/on-docker/docker-swarm-flat-file-engine-discovery-2b23516c71d4#.6vp94h5wn))
- external, token-based
<br/>(dynamic; nothing to operate; relies on external service operated by Docker Inc.)
---
## Swarm agent
- Used only for dynamic discovery (ZK, etcd, Consul, token)
- Must run on each node
- Every 20s (by default), tells the discovery system:
*"Hello, there is a Swarm node at A.B.C.D:EFGH"*
- Must know the node's IP address
(It cannot figure it out by itself, because it doesn't know whether to use public or private addresses)
- The node continues to work even if the agent dies
---
## Swarm manager
- Accepts Docker API requests
- Communicates with the cluster nodes
- Performs healthchecks, scheduling...
---
# Picking a key/value store
- We are going to use a key/value store, and use it for:
- cluster membership discovery
- overlay networks backend
- resilient storage of important credentials
- Swarm leader election
- We are going to use Consul, and run one Consul instance on each node
(That way, we can always access Consul over localhost)
---
## Do we really need a key/value store?
- Cluster membership discovery doesn't *require* a key/value store
(We could use the token mechanism instead)
- Network overlays don't *require* a key/value store
(We could use a plugin like Weave instead)
- Credentials can be distributed through other mechanisms
(E.g. copying them to a private S3 bucket)
- Swarm leader election, however, requires a key/value store
---
## Why are we using a key/value store, then?
- Each aforementioned mechanism requires some reliable, distributed storage
- If we don't use our own key/value store, we end up using *something else*:
- Docker Inc.'s centralized token discovery service
- [Weave's CRDT protocol](https://github.com/weaveworks/weave/wiki/IP-allocation-design)
- AWS S3 (or your cloud provider's equivalent, or some other file storage system)
- Each of those is one extra potential point of failure
- See for instance [Kyle Kingsbury's analysis of Chronos](https://aphyr.com/posts/326-jepsen-chronos) for an illustration of this problem
- By operating our own key/value store, we have 1 extra service instead of 3 (or more)
---
## Should we always use a key/value store?
--
- No!
--
- If you don't want to operate your own key/value store, don't do it
- You might be more comfortable using tokens + Weave + S3, for instance
- You can also use static discovery
- Maybe you don't even need overlay networks
---
## Why Consul?
- Consul is not the "official" or best way to do this
- This is an arbitrary decision made by Truly Yours
- I *personally* find Consul easier to set up for a workshop like this
- ... But etcd and Zookeeper will work too!
---
## Setting up our Swarm cluster
We need to:
- create certificates,
- distribute them on our nodes,
- run the Swarm agent on every node,
- run the Swarm manager on `node1`,
- reconfigure the Engine on each node to add extra flags (for overlay networks).
That's a lot of work, so we'll use Docker Machine to automate this.
---
## Using Docker Machine to setup a Swarm cluster
- Docker Machine has two primary uses:
- provisioning cloud instances running the Docker Engine
- managing local Docker VMs within e.g. VirtualBox
- It can also create Swarm clusters, and will:
- create and manage certificates
- automatically start swarm agent and manager containers
- It comes with a special driver, `generic`, to (re)configure existing machines
---
## Setting up Docker Machine
- Install `docker-machine` (single binary download)
(This is already done on your VMs!)
- Set a few environment variables (cloud credentials)
```bash
export AWS_ACCESS_KEY_ID=AKI...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=eu-west-2
export DIGITALOCEAN_ACCESS_TOKEN=...
export DIGITALOCEAN_SIZE=2gb
export AZURE_SUBSCRIPTION_ID=...
```
(We already have 5 nodes, so we don't need to do this!)
---
## Creating nodes with Docker Machine
- The only two mandatory parameters are the driver to use, and the machine name:
```bash
docker-machine create -d digitalocean node42
```
- *Tons* of parameters can be specified; see [Docker Machine driver documentation](https://docs.docker.com/machine/drivers/)
- To list machines and their status:
```bash
docker-machine ls
```
- To destroy a machine:
```bash
docker-machine rm node42
```
---
## Communicating with nodes managed by Docker Machine
- Select a machine for use:
```bash
eval $(docker-machine env node42)
```
This will set a few environment variables (at least `DOCKER_HOST`).
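The output of `docker-machine env` is just a series of `export` statements, roughly like the following (values here are illustrative):

```bash
# This is what "eval $(docker-machine env node42)" ends up executing:
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://10.0.0.42:2376"
export DOCKER_CERT_PATH="$HOME/.docker/machine/machines/node42"
export DOCKER_MACHINE_NAME="node42"
```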
- Execute regular commands with Docker, Compose, etc.
(They will pick up remote host address from environment)
- If you need to go under the hood, you can get SSH access:
```bash
docker-machine ssh node42
```
---
## Docker Machine `generic` driver
- Most drivers work the same way:
- use cloud API to create instance
- connect to instance over SSH
- install Docker
- The `generic` driver skips the first step
- It can install Docker on any machine, as long as you have SSH access
- We will use that!
---
## Setting up Swarm with Docker Machine
When invoking Machine, we will provide three sets of parameters:
- the machine driver to use (`generic`) and the SSH connection information
- Swarm-specific options indicating the cluster membership discovery mechanism
- Extra flags to be passed to the Engine, to enable overlay networks
---
## Provisioning the first node
.exercise[
- Use the following command to provision the manager node:
<!--
```placeholder
AA.BB.CC.DD $(getent hosts node1 | awk '{print $1}')
```
-->
```bash
docker-machine create --driver generic \
--engine-opt cluster-store=consul://localhost:8500 \
--engine-opt cluster-advertise=eth0:2376 \
--swarm --swarm-master --swarm-discovery consul://localhost:8500 \
--generic-ssh-user docker --generic-ip-address `AA.BB.CC.DD` node1
```
]
---
## Provisioning the other nodes
- The command is almost the same, but without the `--swarm-master` flag
- We will use a shell snippet for convenience
.exercise[
```bash
grep 'node[2345]' /etc/hosts | grep -v '^127' |
while read IPADDR NODENAME
do docker-machine create --driver generic \
--engine-opt cluster-store=consul://localhost:8500 \
--engine-opt cluster-advertise=eth0:2376 \
--swarm --swarm-discovery consul://localhost:8500 \
--generic-ssh-user docker \
--generic-ip-address $IPADDR $NODENAME
done
```
]
---
## Check what we did
Let's connect to the first node *individually*.
.exercise[
- Select the node with Machine
```bash
eval $(docker-machine env node1)
```
- Execute some Docker commands
```bash
docker version
docker info
```
]
In the output of `docker info`, we should see `Cluster store` and `Cluster advertise`.
---
## Interact with the node
Let's try a few basic Docker commands on this node.
.exercise[
- Run a simple container:
```bash
docker run --rm busybox echo hello world
```
- See running containers:
```bash
docker ps
```
]
Two containers should show up: the agent and the manager.
---
## Connect to the Swarm cluster
Now, let's try the same operations, but when talking to the Swarm manager.
.exercise[
- Select the Swarm manager with Machine:
```bash
eval $(docker-machine env node1 --swarm)
```
- Execute some Docker commands
```bash
docker version
docker info
docker ps
```
]
The output is different! Let's review this.
---
## `docker version`
Swarm identifies itself clearly:
```
Client:
Version: 1.11.1
API version: 1.23
Go version: go1.5.4
Git commit: 5604cbe
Built: Tue Apr 26 23:38:55 2016
OS/Arch: linux/amd64
Server:
Version: swarm/1.2.2
API version: 1.22
Go version: go1.5.4
Git commit: 34e3da3
Built: Mon May 9 17:03:22 UTC 2016
OS/Arch: linux/amd64
```
---
## `docker info`
The output of `docker info` on Swarm shows a number of differences from
the output on a single Engine:
.small[
```
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: swarm/1.2.2
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 0
Plugins:
Volume:
Network:
Kernel Version: 4.2.0-36-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: node1
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support
```
]
---
## Why zero nodes?
- We haven't started Consul yet
- Swarm discovery is not operational
- Swarm can't discover the nodes
Note: Docker will start (and be functional) without a K/V store.
This lets us run Consul itself in a container.
---
## Adding Consul
- We will run Consul in containers
- We will use the [Consul official image](
https://hub.docker.com/_/consul/) that was released *very recently*
- We will tell Docker to automatically restart it on reboots
- To simplify network setup, we will use `host` networking
---
## A few words about `host` networking
- Consul needs to be aware of its actual IP address (seen by other nodes)
- It also binds a bunch of different ports
- It makes sense (from a security point of view) to have Consul listening on localhost only
(and have "users", i.e. Engine, Swarm, etc. connect over localhost)
- Therefore, we will use `host` networking!
- Also: Docker Machine 0.6 starts the Swarm containers in `host` networking ...
- ... but Docker Machine 0.7 doesn't (which is why we stick to 0.6 for now)
---
## Consul fundamentals (if I must give you just one slide...)
- Consul nodes can be "just an agent" or "server"
- From the client's perspective, they behave the same
- Only servers are members in the Raft consensus / leader election / etc
(non-server agents forward requests to a server)
- All nodes must be told the address of at least one other node to join
(except for the first node, where this is optional)
- At least the first node must know how many nodes to expect, in order to establish quorum
- Consul can have only one "truth" at a time (hence the importance of quorum)
---
## Starting our Consul cluster
.exercise[
- Make sure you're logged into `node1`, and:
```bash
IPADDR=$(ip a ls dev eth0 | sed -n 's,.*inet \(.*\)/.*,\1,p')
for N in 1 2 3 4 5; do
ssh node$N -- docker run -d --restart=always --name consul_node$N \
-e CONSUL_BIND_INTERFACE=eth0 --net host consul \
agent -server -retry-join $IPADDR -bootstrap-expect 5 \
-ui -client 0.0.0.0
done
```
]
Note: in production, you probably want to remove `-client 0.0.0.0` since it
gives public access to your cluster! Also adapt `-bootstrap-expect` to your quorum.
---
## Check that our Consul cluster is up
- With your browser, navigate to any instance on port 8500
<br/>(in "NODES" you should see the five nodes)
- Let's run a couple of useful Consul commands
.exercise[
- Ask Consul the list of members it knows:
```bash
docker run --net host --rm consul members
```
- Ask Consul which node is the current leader:
```bash
curl localhost:8500/v1/status/leader
```
]
---
## Check that our Swarm cluster is up
.exercise[
- Try again the `docker info` from earlier:
```bash
eval $(docker-machine env --swarm node1)
docker info
docker ps
```
]
All nodes should be visible. (If not, give them a minute or two to register.)
The Consul containers should be visible.
The Swarm containers, however, are hidden by Swarm (unless you use `docker ps -a`).
---
# Running containers on Swarm
Try to run a few `busybox` containers.
Then, let's get serious:
.exercise[
- Start a Redis service:
<br/>`docker run -dP redis`
- See the service address:
<br/>`docker port $(docker ps -lq) 6379`
]
This can be any of your five nodes.
---
## Scheduling strategies
- Random: pick a node at random
<br/>(but honor resource constraints)
- Spread: pick the node with the least containers
<br/>(including stopped containers)
- Binpack: try to maximize resource usage
<br/>(in other words: use as few hosts as possible)
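With a standalone Swarm provisioned through Docker Machine, the strategy is selected when the manager is created (a sketch, reusing the `AA.BB.CC.DD` placeholder from earlier; `spread` is the default if the flag is omitted):

```bash
docker-machine create --driver generic \
  --swarm --swarm-master --swarm-strategy binpack \
  --swarm-discovery consul://localhost:8500 \
  --generic-ssh-user docker --generic-ip-address AA.BB.CC.DD node1
```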
---
# Resource allocation
- Swarm can honor resource reservations
- This requires containers to be started with resource limits
- Swarm refuses to schedule a container if it cannot honor a reservation
.exercise[
- Start Redis containers with 1 GB of RAM until Swarm refuses to start more:
```bash
docker run -d -m 1G redis
```
]
On a cluster of 5 nodes with ~3.8 GB of RAM per node, Swarm will refuse to start the 16th container.
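The arithmetic behind that number, as a back-of-the-envelope check:

```bash
# 5 nodes with ~3.8 GB each: three 1 GB reservations fit per node
# (a fourth would bring the total to 4 GB, which exceeds 3.8 GB)
NODES=5
PER_NODE=3
echo $(( NODES * PER_NODE ))   # 15 containers fit, so the 16th is refused
```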
---
## Removing our Redis containers
- Let's use a little bit of shell scripting
.exercise[
- Remove all containers using the redis image:
```bash
docker ps | awk '/redis/ {print $1}' | xargs docker rm -f
```
]
???
## Things to know about resource allocation
- `docker info` shows resource allocation for each node
- Swarm allows a 5% resource overcommit (tunable)
- Containers without resource reservation can always be started
- Resources of stopped containers are still counted as being reserved
- this guarantees that it will be possible to restart a stopped container
- containers have to be deleted to free up their resources
- `docker update` can be used to change resource allocation on the fly
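For instance, to raise the memory limit of a running container (a sketch; `dockercoins_worker_1` stands for any container name):

```bash
docker update --memory 2G --memory-swap 2G dockercoins_worker_1
```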
---
class: title
# Setting up overlay networks
---
# Multi-host networking
- Docker 1.9 has the concept of *networks*
- By default, containers are on the default "bridge" network
- You can create additional networks
- Containers can be on multiple networks
- Containers can dynamically join/leave networks
- The "overlay" driver lets networks span multiple hosts
- Containers can have "network aliases" resolvable through DNS
---
## Manipulating networks, names, and aliases
- The preferred method is to let Compose do the heavy lifting for us
(YAML-defined networking!)
- But if we really need to, we can use the Docker CLI, with:
`docker network ...`
`docker run --net ... --net-alias ...`
- The following slides illustrate those commands
---
## Create a few networks and containers
.exercise[
- Create two networks, *blue* and *green*:
```bash
docker network create blue
docker network create green
docker network ls
```
- Create containers with names of blue and green
things, on their respective networks:
```bash
docker run -d --net-alias things --name sky --net blue -m 3G redis
docker run -d --net-alias things --name navy --net blue -m 3G redis
docker run -d --net-alias things --name grass --net green -m 3G redis
docker run -d --net-alias things --name forest --net green -m 3G redis
```
]
---
## Check connectivity within networks
.exercise[
- Check that our containers are on different nodes:
```bash
docker ps
```
- This will work:
```bash
docker run --rm --net blue alpine ping -c 3 navy
```
- This will not:
```bash
docker run --rm --net blue alpine ping -c 3 grass
```
]
???
## Containers connected to multiple networks
- Some colors are neither *quite* blue *nor* green
.exercise[
- Create a container that we want to be on both networks:
```bash
docker run -d --net-alias things --net blue --name turquoise redis
```
- Check connectivity:
```bash
docker exec -ti turquoise ping -c 3 navy
docker exec -ti turquoise ping -c 3 grass
```
(First works; second doesn't)
]
???
## Dynamically connecting containers
- This is achieved with the command:
<br/>`docker network connect NETNAME CONTAINER`
.exercise[
- Dynamically connect to the green network:
```bash
docker network connect green turquoise
```
- Check connectivity:
```bash
docker exec -ti turquoise ping -c 3 navy
docker exec -ti turquoise ping -c 3 grass
```
(Both commands work now)
]
---
## Network aliases
- Each container was created with the network alias `things`
- Network aliases are scoped by network
.exercise[
- Resolve the `things` alias from both networks:
```bash
docker run --rm --net blue alpine nslookup things
docker run --rm --net green alpine nslookup things
```
]
???
## Under the hood
- Each network has an interface in the container
- There is also an interface for the default gateway
.exercise[
- View interfaces in our `turquoise` container:
```bash
docker exec -ti turquoise ip addr ls
```
]
???
## Dynamically disconnecting containers
- There is a mirror command to `docker network connect`
.exercise[
- Disconnect the *turquoise* container from *blue*
(its original network):
```bash
docker network disconnect blue turquoise
```
- Check connectivity:
```bash
docker exec -ti turquoise ping -c 3 navy
docker exec -ti turquoise ping -c 3 grass
```
(First command fails, second one works)
]
---
## Cleaning up
.exercise[
- Destroy containers:
<!--
```bash
docker rm -f sky navy grass forest turquoise
```
-->
```bash
docker rm -f sky navy grass forest
```
- Destroy networks:
```bash
docker network rm blue
docker network rm green
```
]
---
## Cleaning up after an outage or a crash
- You cannot remove a network if it still has containers
- There is no `"rm -f"` for networks
- If a network still has stale endpoints, you can use `"disconnect -f"`
---
class: title
# Building images with Swarm
---
## Building images with Swarm
- Special care must be taken when building and running images
- We *can* build images on Swarm (with `docker build` or `docker-compose build`)
- One node will be picked at random, and the build will happen there
- At the end of the build, the image will be present *only on that node*
---
## Building on Swarm can yield inconsistent results
- Builds are scheduled on random nodes
- Multiple builds and rebuilds can happen on different nodes
- If a build happens on a different node, the cache of the previous build cannot be used
- Worse: you can have two different images with the same name on your cluster
---
## Scaling won't work as expected
Consider the following scenario:
- `docker-compose up`
<br/>
→ each service is built on a node, and runs there
- `docker-compose scale`
<br/>
→ additional containers for this service can only be spawned where the image was built
- `docker-compose up` (again)
<br/>
→ services might be built (and started) on different nodes
- `docker-compose scale`
<br/>
→ containers can be spawned with both the new and old images
---
## Scaling correctly with Swarm
- After building an image, it should be distributed to the cluster
(Or made available through a registry, so that nodes can download it automatically)
- Instead of referencing images with the `:latest` tag, unique tags should be used
(Using e.g. timestamps, version numbers, or VCS hashes)
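A minimal sketch of timestamp-based tagging (the image name is illustrative; we will automate this step shortly):

```bash
# Generate a unique tag from the current UNIX timestamp
TAG=$(date +%s)
IMAGE=localhost:5000/dockercoins_rng:$TAG
echo "$IMAGE"
# We would then run:
#   docker build -t "$IMAGE" rng
#   docker push "$IMAGE"
```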
---
## Why can't Swarm do this automatically for us?
- Let's step back and think for a minute ...
- What should `docker build` do on Swarm?
- build on one machine
- build everywhere ($$$)
- After the build, what should `docker run` do?
- run where we built (how do we know where it is?)
- run on any machine that has the image
- Could Compose+Swarm solve this automatically?
---
## A few words about "sane defaults"
- *It would be nice if Swarm could pick a node, and build there!*
- but which node should it pick?
- what if the build is very expensive?
- what if we want to distribute the build across nodes?
- what if we want to tag some builder nodes?
- ok but what if no node has been tagged?
- *It would be nice if Swarm could automatically push images!*
- using the Docker Hub is an easy choice
<br/>(you just need an account)
- but some of us can't/won't use Docker Hub
<br/>(for compliance reasons or because no network access)
.small[("Sane" defaults are nice only if we agree on the definition of "sane")]
---
## The plan
- Build on a single node (`node1`)
- Tag images with the current UNIX timestamp (for simplicity)
- Upload them to a registry
- Update the Compose file to use those images
This is all automated with the [`build-tag-push.py` script](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/build-tag-push.py).
---
## Which registry do we want to use?
.small[
- **Docker Hub**
- hosted by Docker Inc.
- requires an account (free, no credit card needed)
- images will be public (unless you pay)
- located in AWS EC2 us-east-1
- **Docker Trusted Registry**
- self-hosted commercial product
- requires a subscription (free 30-day trial available)
- images can be public or private
- located wherever you want
- **Docker open source registry**
- self-hosted barebones repository hosting
- doesn't require anything
- doesn't come with anything either
- located wherever you want
]
---
## Using Docker Hub
- Set the `DOCKER_REGISTRY` environment variable to your Docker Hub user name
<br/>(the `build-tag-push.py` script prefixes each image name with that variable)
- We will also see how to run the open source registry
<br/>(so use whatever option you want!)
.exercise[
<!--
```meta
^{
```
-->
- Set the following environment variable:
<br/>`export DOCKER_REGISTRY=jpetazzo`
- (Use *your* Docker Hub login, of course!)
- Log into the Docker Hub:
<br/>`docker login`
<!--
```meta
^}
```
-->
]
---
## Using Docker Trusted Registry
If we wanted to use DTR, we would:
- make sure we have a Docker Hub account
- [activate a Docker Datacenter subscription](
https://hub.docker.com/enterprise/trial/)
- install DTR on our machines
- set `DOCKER_REGISTRY` to `dtraddress:port/user`
*This is out of the scope of this workshop!*
---
## Using open source registry
- We need to run a `registry:2` container
<br/>(make sure you specify tag `:2` to run the new version!)
- It will store images and layers to the local filesystem
<br/>(but you can add a config file to use S3, Swift, etc.)
- Docker *requires* TLS when communicating with the registry,
except for registries on `localhost`, or those listed
in the Engine flag `--insecure-registry`
- Our strategy: run a reverse proxy on `localhost:5000` on each node
---
## Registry frontends and backend
![Registry frontends](registry-frontends.png)
---
# Deploying a local registry
- There is a Compose file for that
.exercise[
- Go to the `registry` directory in the repository:
```bash
cd ~/orchestration-workshop/registry
```
]
Let's examine the `docker-compose.yml` file.
---
## Running a local registry with Compose
```yaml
version: "2"
services:
backend:
image: registry:2
frontend:
image: jpetazzo/hamba
command: 5000 backend:5000
ports:
- "127.0.0.1:5000:5000"
depends_on:
- backend
```
- *Backend* is the actual registry.
- *Frontend* is the ambassador that we deployed earlier.
<br/>
It communicates with *backend* using an internal network
and network aliases.
---
## Starting a local registry with Compose
- We will bring up the registry
- Then we will ensure that one *frontend* is running
on each node by scaling it to our number of nodes
.exercise[
- Start the registry:
```bash
docker-compose up -d
```
]
---
## "Scaling" the local registry
- This is a particular kind of scaling
- We just want to ensure that one *frontend*
is running on every single node of the cluster
.exercise[
- Scale the registry:
```bash
for N in $(seq 1 5); do
docker-compose scale frontend=$N
done
```
]
Note: Swarm might do that automatically for us in the future.
---
## Testing our local registry
- We can retag a small image, and push it to the registry
.exercise[
- Make sure we have the busybox image, and retag it:
```bash
docker pull busybox
docker tag busybox localhost:5000/busybox
```
- Push it:
```bash
docker push localhost:5000/busybox
```
]
---
## Checking what's on our local registry
- The registry API has endpoints to query what's there
.exercise[
- Ensure that our busybox image is now in the local registry:
```bash
curl http://localhost:5000/v2/_catalog
```
]
The curl command should output:
```json
{"repositories":["busybox"]}
```
---
## Adapting our Compose file to run on Swarm
- We can get rid of all the `ports` sections, except for the web UI
.exercise[
- Go back to the dockercoins directory:
```bash
cd ~/orchestration-workshop/dockercoins
```
]
---
## Our new Compose file
.small[
```yaml
version: '2'
services:
rng:
build: rng
hasher:
build: hasher
webui:
build: webui
ports:
- "8000:80"
redis:
image: redis
worker:
build: worker
```
]
Copy-paste this into `docker-compose.yml`
<br/>(or you can `cp docker-compose.yml-v2 docker-compose.yml`)
---
## Use images, not builds
- We need to replace each `build` with an `image`
- We will use the `build-tag-push.py` script for that
.exercise[
- Set `DOCKER_REGISTRY` to use our local registry
- Make sure that you are building on `node1`
- Then run the script
```bash
export DOCKER_REGISTRY=localhost:5000
eval $(docker-machine env node1)
../bin/build-tag-push.py
```
]
---
## Run the application
- At this point, our app is ready to run
.exercise[
- Start the application:
```bash
export COMPOSE_FILE=docker-compose.yml-`NNN`
eval $(docker-machine env node1 --swarm)
docker-compose up -d
```
- Observe that it's running on multiple nodes:
<br/>(each container name is prefixed with the node it's running on)
```bash
docker ps
```
]
---
## View the performance graph
- Load up the graph in the browser
.exercise[
- Check the `webui` service address and port:
```bash
docker-compose port webui 80
```
- Open it in your browser
]
---
## Scaling workers
- Scaling the `worker` service works out of the box
(like before)
.exercise[
- Scale `worker`:
```bash
docker-compose scale worker=10
```
]
Check that workers are on different nodes.
However, we hit the same bottleneck as before.
How can we address that?
---
## Finding the real cause of the bottleneck
- If time permits, we can benchmark `rng` and `hasher` to find out more
- Otherwise, we'll fast-forward a bit
---
## Benchmarking in isolation
- If we want the benchmark to be accurate, we need to make sure that `rng` and `hasher` are not receiving traffic
.exercise[
- Stop the `worker` containers:
```bash
docker-compose kill worker
```
]
---
## A better benchmarking tool
- Instead of `httping`, we will now use `ab` (Apache Bench)
- We will install it in an `alpine` container placed on the network used by our application
.exercise[
- Start an interactive `alpine` container on the `dockercoins_default` network:
```bash
docker run -ti --net dockercoins_default alpine sh
```
- Install `ab` with the `apache2-utils` package:
```bash
apk add --update apache2-utils
```
]
---
## Benchmarking `rng`
We will send 50 requests, but with various levels of concurrency.
.exercise[
- Send 50 requests, with a single sequential client:
```bash
ab -c 1 -n 50 http://rng/10
```
- Send 50 requests, with ten parallel clients:
```bash
ab -c 10 -n 50 http://rng/10
```
]
---
## Benchmark results for `rng`
- In both cases, the benchmark takes ~5 seconds to complete
- When serving requests sequentially, they each take 100ms
- In the parallel scenario, the latency increased dramatically:
- one request is served in 100ms
- another is served in 200ms
- another is served in 300ms
- ...
- another is served in 1000ms
- What about `hasher`?
---
## Benchmarking `hasher`
We will do the same tests for `hasher`.
The command is slightly more complex, since we need to post random data.
First, we need to put the POST payload in a temporary file.
.exercise[
- Install curl in the container, and generate 10 bytes of random data:
```bash
apk add curl
curl http://rng/10 >/tmp/random
```
]
---
## Benchmarking `hasher`
Once again, we will send 50 requests, with different levels of concurrency.
.exercise[
- Send 50 requests with a sequential client:
```bash
ab -c 1 -n 50 -T application/octet-stream \
-p /tmp/random http://hasher/
```
- Send 50 requests with 10 parallel clients:
```bash
ab -c 10 -n 50 -T application/octet-stream \
-p /tmp/random http://hasher/
```
]
---
## Benchmark results for `hasher`
- The sequential benchmark takes ~5 seconds to complete
- The parallel benchmark takes less than 1 second to complete
- In both cases, each request takes a bit more than 100ms to complete
- Requests are a bit slower in the parallel benchmark
- It looks like `hasher` is better equipped to deal with concurrency than `rng`
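A back-of-the-envelope model matches these observations (assuming `rng` serves one request at a time while `hasher` serves them concurrently, and that every request takes 100 ms):

```bash
REQUESTS=50
SERVICE_MS=100
CONCURRENCY=10
# rng: requests queue up behind each other, so the total time is the
# same at any concurrency, but latency grows with the queue length
echo "rng total time:        $(( REQUESTS * SERVICE_MS )) ms"
echo "rng worst latency:     $(( CONCURRENCY * SERVICE_MS )) ms"
# hasher: 10 parallel clients are served 10 times faster overall
echo "hasher total (c=10):   $(( REQUESTS / CONCURRENCY * SERVICE_MS )) ms"
```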
---
class: title
Why?
---
## Why does everything take (at least) 100ms?
--
`rng` code:
![RNG code screenshot](delay-rng.png)
--
`hasher` code:
![HASHER code screenshot](delay-hasher.png)
---
class: title
But ...
WHY?!?
---
## Why did we sprinkle this sample app with sleeps?
- Deterministic performance
<br/>(regardless of instance speed, CPUs, I/O...)
--
- Actual code sleeps all the time anyway
--
- When your code makes a remote API call:
- it sends a request;
- it sleeps until it gets the response;
- it processes the response.
---
## Why do `rng` and `hasher` behave differently?
![Equations on a blackboard](equations.png)
--
(Synchronous vs. asynchronous event processing)
---
## How to make `rng` go faster
- Obvious solution: comment out the `sleep` instruction
--
- Unfortunately, in the real world, network latency exists
--
- More realistic solution: use an asynchronous framework
<br/>(e.g. use gunicorn with gevent)
--
- Reminder: we can't change the code!
--
- Solution: scale out `rng`
<br/>(dispatch `rng` requests on multiple instances)
---
# Scaling web services with Compose on Swarm
- We *can* scale network services with Compose
- The result may or may not be satisfactory, though!
.exercise[
- Restart the `worker` service:
```bash
docker-compose start worker
```
- Scale the `rng` service:
```bash
docker-compose scale rng=5
```
]
---
## Results
- In the web UI, you might see a performance increase ... or maybe not
--
- Since Engine 1.11, we get round-robin DNS records
(i.e. resolving `rng` will yield the IP addresses of all 5 containers)
- Docker randomizes the records it sends
- But many resolvers will sort them in unexpected ways
- Depending on various factors, you could get:
- all traffic on a single container
- traffic perfectly balanced on all containers
- traffic unevenly balanced across containers
---
## Assessing DNS randomness
- Let's see how our containers resolve DNS requests
.exercise[
- On each of our 10 scaled workers, execute 5 ping requests:
```bash
for N in $(seq 1 10); do
echo PING__________$N
for I in $(seq 1 5); do
docker exec -ti dockercoins_worker_$N ping -c1 rng
done
done | grep PING
```
]
(The 7th Might Surprise You!)
---
## DNS randomness
- Other programs can yield different results
- Same program on another distro can yield different results
- Same source code with another libc or resolver can yield different results
- Running the same test at different times can yield different results
- Did I mention that Your Results May Vary?
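A toy model of the effect, in pure shell (no DNS involved; the numbers 1 to 5 stand for the five A records of `rng`):

```bash
# A client that uses the first record of a *shuffled* answer
# spreads 100 connections across all backends:
for i in $(seq 1 100); do seq 1 5 | shuf -n 1; done | sort | uniq -c
# A resolver that *sorts* the records first sends every
# single connection to the same backend:
for i in $(seq 1 100); do seq 1 5 | shuf | sort -n | head -n 1; done | sort -u
```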
---
## Implementing fair load balancing
- Instead of relying on DNS round robin, let's use a proper load balancer
- Use Compose to create multiple copies of the `rng` service
- Put a load balancer in front of them
- Point other services to the load balancer
---
## Naming problem
- The service is called `rng`
- Therefore, it is reachable with the network name `rng`
- Our application code (the `worker` service) connects to `rng`
- So the name `rng` should resolve to the load balancer
- What should we do‽
---
## Naming is *per-network*
- Solution: put `rng` on its own network
- That way, it doesn't take the network name `rng`
<br/>(at least not on the default network)
- Have the load balancer sit on both networks
- Add the name `rng` to the load balancer
---
class: pic
Original DockerCoins
![](dockercoins-single-node.png)
---
class: pic
Load-balanced DockerCoins
![](dockercoins-multi-node.png)
---
## Declaring networks
- Networks (other than the default one)
*must* be declared
in a top-level `networks` section,
placed anywhere in the file
.exercise[
- Add the `rng` network to the Compose file, `docker-compose.yml-NNN`:
```yaml
version: '2'
networks:
rng:
services:
rng:
image: ...
...
```
]
---
## Putting the `rng` service in its network
- Services can have a `networks` section
- If they don't: they are placed in the default network
- If they do: they are placed only in the mentioned networks
.exercise[
- Change the `rng` service to put it in its network:
```yaml
rng:
image: localhost:5000/dockercoins_rng:…
networks:
rng:
```
]
---
## Adding the load balancer
- The load balancer has to be in both networks: `rng` and `default`
- In the `default` network, it must have the `rng` alias
- We will use the `jpetazzo/hamba` image
.exercise[
- Add the `rng-lb` service to the Compose file:
```yaml
rng-lb:
image: jpetazzo/hamba
command: run
networks:
rng:
default:
aliases: [ rng ]
```
]
---
## Load balancer initial configuration
- We specified `run` as the initial command
- This tells `hamba` to wait for an initial configuration
- The load balancer will not be operational (until we feed it its configuration)
---
## Start the application
.exercise[
- Bring up DockerCoins:
```bash
docker-compose up -d
```
- See that `worker` is complaining:
```bash
docker-compose logs --tail 100 --follow worker
```
]
---
## Add one backend to the load balancer
- Multiple solutions:
- lookup the IP address of the `rng` backend
- use the backend's network name
- use the backend's container name (easiest!)
.exercise[
- Configure the load balancer:
```bash
docker run --rm --volumes-from dockercoins_rng-lb_1 \
--net container:dockercoins_rng-lb_1 \
jpetazzo/hamba reconfigure 80 dockercoins_rng_1 80
```
]
The application should now be working correctly.
---
## Add all backends to the load balancer
- The command is similar to the one before
- We need to pass the list of all backends
.exercise[
- Reconfigure the load balancer:
```bash
docker run --rm \
--volumes-from dockercoins_rng-lb_1 \
--net container:dockercoins_rng-lb_1 \
jpetazzo/hamba reconfigure 80 \
$(for N in $(seq 1 5); do
echo dockercoins_rng_$N:80
done)
```
]
---
## Automating the process
- Nobody loves artisan YAML handicraft
- This can be scripted very easily
- But can it be fully automated?
---
## Use DNS to discover the addresses of all the backends
- When multiple containers have the same network alias:
- Engine 1.10 returns only one of them (the same one across the whole network)
- Engine 1.11 returns all of them (in a random order)
- A "smart" client can use all records to implement load balancing
- We can compose `jpetazzo/hamba` with a special-purpose container,
which will dynamically generate HAProxy's configuration when
the DNS records are updated
---
## Introducing `jpetazzo/watchdns`
- [100 lines of pure POSIX scriptery](
https://github.com/jpetazzo/watchdns/blob/master/watchdns)
- Resolves a given DNS name every second
- Each time the result changes, a new HAProxy configuration is generated
- When used together with `--volumes-from` and `jpetazzo/hamba`, it
updates the configuration of an existing load balancer
- Comes with a companion script, [`add-load-balancer-v2.py`](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/add-load-balancer-v2.py), to update your Compose files
---
## Using `jpetazzo/watchdns`
.exercise[
- First, revert the Compose file to remove the load balancer
- Then, run `add-load-balancer-v2.py`:
```bash
../bin/add-load-balancer-v2.py rng
```
- Inspect the resulting Compose file
]
---
## Scaling with `watchdns`
.exercise[
- Start the application with the new sidekick containers:
```bash
docker-compose up -d
```
- Scale `rng`:
```bash
docker-compose scale rng=10
```
- Check logs:
```bash
docker-compose logs rng-wd
```
]
---
## Comments
- This is a very crude implementation of the pattern
- A Go version would only be a bit longer, but use much less resources
- When there are many backends, reacting quickly to change is less important
(i.e. it's not necessary to re-resolve records every second!)
---
class: title
# All things ops <br/> (logs, backups, and more)
---
# Logs
- Two strategies:
- log to plain files on volumes
- log to stdout
<br/>(and use a logging driver)
---
## Logging to plain files on volumes
(Sorry, that part won't be hands-on!)
- Start a container with `-v /logs`
- Make sure that all log files are in `/logs`
- To check logs, run e.g.
```bash
docker run --volumes-from ... ubuntu sh -c "grep WARN /logs/*.log"
```
- Or just go interactive:
```bash
docker run --volumes-from ... -ti ubuntu
```
- You can (should) start a log shipper that way
---
## Logging to stdout
- All containers should write to stdout/stderr
- Docker will collect logs and pass them to a logging driver
- The logging driver can be specified globally, and per container
<br/>(changing it for a container overrides the global setting)
- To change the global logging driver, pass extra flags to the daemon
<br/>(requires a daemon restart)
- To override the logging driver for a container, pass extra flags to `docker run`
---
## Specifying logging flags
- `--log-driver`
*selects the driver*
- `--log-opt key=val`
*adds driver-specific options*
<br/>*(can be repeated multiple times)*
- The flags are identical for `docker daemon` and `docker run`
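For example (a sketch; the syslog address is a placeholder):

```bash
# Daemon-wide default: send all container logs to a remote syslog server
docker daemon --log-driver=syslog --log-opt syslog-address=udp://AA.BB.CC.DD:514
# Per-container override: keep local json-file logs for this one
docker run --log-driver=json-file redis
```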
---
## Logging flags in practice
- If you provision your nodes with Docker Machine,
you can set global logging flags (which will apply to all
containers started by a given Engine) like this:
```bash
docker-machine create ... --engine-opt log-driver=...
```
- Otherwise, use your favorite method to edit or manage configuration files
- You can set per-container logging options in Compose files
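In a version 2 Compose file, per-container logging options look like this (a sketch; the GELF endpoint matches the ELK setup covered later):

```yaml
version: "2"
services:
  worker:
    image: localhost:5000/dockercoins_worker:…
    logging:
      driver: gelf
      options:
        gelf-address: udp://localhost:12201
```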
---
## Available drivers
- json-file (default)
- syslog (can send to UDP, TCP, TCP+TLS, UNIX sockets)
- awslogs (AWS CloudWatch)
- journald
- gelf
- fluentd
- splunk
---
## About json-file ...
- It doesn't rotate logs by default, so your disks will fill up
(Unless you set the `max-size` *and* `max-file` log options.)
- It's the only one supporting logs retrieval
(If you want to use `docker logs`, `docker-compose logs`,
or fetch logs from the Docker API, you need json-file!)
- This might change in the future
(But it's complex since there is no standard protocol
to *retrieve* log entries.)
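For example, to cap a container at five log files of 10 MB each (a sketch):

```bash
docker run -d --log-opt max-size=10m --log-opt max-file=5 redis
```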
All about logging in the documentation:
https://docs.docker.com/reference/logging/overview/
---
# Setting up ELK to store container logs
*Important foreword: this is not an "official" or "recommended"
setup; it is just an example. We do not endorse ELK, GELF,
or the other elements of the stack more than others!*
What we will do:
- Spin up an ELK stack, with Compose
- Gaze at the spiffy Kibana web UI
- Manually send a few log entries over GELF
- Reconfigure our DockerCoins app to send logs to ELK
---
## What's in an ELK stack?
- ELK is three components:
- ElasticSearch (to store and index log entries)
- Logstash (to receive log entries from various
sources, process them, and forward them to various
destinations)
- Kibana (to view/search log entries with a nice UI)
- The only component that we will configure is Logstash
- We will accept log entries using the GELF protocol
- Log entries will be stored in ElasticSearch,
<br/>and displayed on Logstash's stdout for debugging
---
## Starting our ELK stack
- We will use a *separate* Compose file
- The Compose file is in the `elk` directory
.exercise[
- Go to the `elk` directory:
```bash
cd ~/orchestration-workshop/elk
```
- Start the ELK stack:
```bash
unset COMPOSE_FILE
docker-compose up -d
```
]
---
## Making sure that each node has a local logstash
- We will configure each container to send logs to `localhost:12201`
- We need to make sure that each node has a logstash container listening on port 12201
.exercise[
- Scale the `logstash` service to 5 instances (one per node):
```bash
for N in $(seq 1 5); do
docker-compose scale logstash=$N
done
```
]
---
## Checking that our ELK stack works
- Our default Logstash configuration sends a test
message every minute
- All messages are stored into ElasticSearch,
but also shown on Logstash stdout
.exercise[
- Look at Logstash stdout:
```bash
docker-compose logs logstash
```
]
After less than one minute, you should see a `"message" => "ok"`
in the output.
---
## Connect to Kibana
- Our ELK stack exposes two public services:
<br/>the Kibana web server, and the GELF UDP socket
- They are both exposed on their default port numbers
<br/>(5601 for Kibana, 12201 for GELF)
.exercise[
- Check the address of the node running kibana:
```bash
docker-compose ps
```
- Open the UI in your browser: http://instance-address:5601/
]
---
## "Configuring" Kibana
- If you see a status page with a yellow item, wait a minute and reload
(Kibana is probably still initializing)
- Kibana should prompt you to "Configure an index pattern";
just click the "Create" button
- Then:
- click "Discover" (in the top-left corner)
- click "Last 15 minutes" (in the top-right corner)
- click "Last 1 hour" (in the list in the middle)
- click "Auto-refresh" (top-right corner)
- click "5 seconds" (top-left of the list)
- You should see a series of green bars (with one new green bar every minute)
---
![Screenshot of Kibana](kibana.png)
---
## Sending container output to Kibana
- We will create a simple container displaying "hello world"
- We will override the container logging driver
- The GELF address is `127.0.0.1:12201`, because the Compose file
explicitly exposes the GELF socket on port 12201
.exercise[
- Start our one-off container:
```bash
docker run --rm --log-driver gelf \
--log-opt gelf-address=udp://127.0.0.1:12201 \
alpine echo hello world
```
]
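Under the hood, a GELF/UDP datagram is simply a gzip-compressed JSON document (with at least `version`, `host`, and `short_message` fields). As a sketch, we can hand-craft one without Docker; the message content here is made up:

```bash
# Build a minimal GELF 1.1 message (plain JSON)...
MSG='{"version":"1.1","host":"manual","short_message":"hello GELF","level":6}'
# ...then gzip it: that is the classic wire format for GELF over UDP
PAYLOAD=$(mktemp)
printf '%s' "$MSG" | gzip -c > "$PAYLOAD"
# Send it to the local logstash GELF socket; bash's /dev/udp
# pseudo-device avoids needing netcat (harmless if nothing listens)
bash -c 'cat "$1" > /dev/udp/127.0.0.1/12201' _ "$PAYLOAD" 2>/dev/null || true
# Sanity check: the payload decompresses back to the original JSON
gzip -dc "$PAYLOAD"
```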
---
## Visualizing container logs in Kibana
- Less than 5 seconds later (the refresh rate of the UI),
the log line should be visible in the web UI
- We can customize the web UI to be more readable
.exercise[
- In the left column, move the mouse over the following
columns, and click the "Add" button that appears:
- host
- container_name
- message
]
---
## Switching back to the DockerCoins application
.exercise[
- Go back to the dockercoins directory:
```bash
cd ~/orchestration-workshop/dockercoins
```
- Set the `COMPOSE_FILE` variable:
```bash
export COMPOSE_FILE=docker-compose.yml-`NNN`
```
]
---
## Add the logging driver to the Compose file
- We need to add the logging section to each container
.exercise[
- Edit the `docker-compose.yml-NNN` file, adding the following lines **to each container**:
```yaml
logging:
driver: gelf
options:
gelf-address: "udp://127.0.0.1:12201"
```
]
There is also a script, [`../bin/add-logging.py`](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/add-logging.py), to do that automatically.
---
## Update the DockerCoins app
.exercise[
- Use Compose normally:
```bash
docker-compose up -d
```
]
If you look in the Kibana web UI, you will see log lines
refreshed every 5 seconds.
Note: to do interesting things (graphs, searches...) we
would need to create indexes. This is beyond the scope
of this workshop.
---
## Logging in production
- If we were using an ELK stack:
- scale ElasticSearch
- interpose a Redis or Kafka queue to deal with bursts
- Configure your Engines to send all logs to ELK by default
- Start the logging containers with a different logging system
<br/>(to avoid a logging loop)
- Make sure you don't end up writing *all logs* on the nodes running Logstash!
---
# Network traffic analysis
- We want to inspect the network traffic entering/leaving `dockercoins_redis_1`
- We will use *shared network namespaces* to perform network analysis
- Two containers sharing the same network namespace...
- have the same IP addresses
- have the same network interfaces
- `eth0` is therefore the same in both containers
---
## Install and start `ngrep`
Ngrep uses libpcap (like tcpdump) to sniff network traffic.
.exercise[
<!--
```meta
^{
```
-->
- Start a container with the same network namespace:
<br/>`docker run --net container:dockercoins_redis_1 -ti alpine sh`
- Install ngrep:
<br/>`apk update && apk add ngrep`
- Run ngrep:
<br/>`ngrep -tpd eth0 -Wbyline . tcp`
<!--
```meta
^}
```
-->
]
You should see a stream of Redis requests and responses.
---
# Backups
- We want to enable backups for `dockercoins_redis_1`
- We don't want to install extra software in this container
- We will use a special backup container:
- sharing the same volumes
- using the same network stack (to connect to it easily)
- possibly containing our backup tools
- This works because the `redis` container image stores its data on a volume
---
## Starting the backup container
- We will use the `--net container:` option to be able to connect locally
- We will use the `--volumes-from` option to access the container's persistent data
.exercise[
<!--
```meta
^{
```
-->
- Start the container:
```bash
docker run --net container:dockercoins_redis_1 \
--volumes-from dockercoins_redis_1:ro \
-v /tmp/myredis:/output \
-ti alpine sh
```
- Look in `/data` in the container (that's where Redis puts its data dumps)
]
---
## Connecting to Redis
- We need to tell Redis to perform a data dump *now*
.exercise[
- Connect to Redis:
```bash
telnet localhost 6379
```
- Issue commands `SAVE` then `QUIT`
- Look at `/data` again (notice the time stamps)
]
- There should be a recent dump file now!
---
## Getting the dump out of the container
- We could use many things:
- s3cmd to copy to S3
- SSH to copy to a remote host
- gzip/bzip/etc before copying
- We'll just copy it to the Docker host
.exercise[
- Copy the file from `/data` to `/output`
- Exit the container
- Look into `/tmp/myredis` (on the host)
<!--
```meta
^}
```
-->
]
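The copy step above can also be scripted. Here is a minimal sketch, runnable locally, with temporary directories standing in for the container's `/data` and `/output`; it timestamps each dump so successive backups don't overwrite each other:

```bash
# $DATA stands in for the Redis /data volume, $OUT for the /output bind mount
DATA=$(mktemp -d)
OUT=$(mktemp -d)
echo "fake RDB payload" > "$DATA/dump.rdb"
# Timestamp each copy to keep a history of dumps
STAMP=$(date +%Y%m%d-%H%M%S)
cp "$DATA/dump.rdb" "$OUT/dump-$STAMP.rdb"
ls "$OUT"
```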
---
## Scheduling backups
In the "old world," we (generally) use cron.
With containers, what are our options?
--
- run `cron` on the Docker host, and put `docker run` in the crontab
--
- run `cron` in the backup container, and make sure it keeps running
<br/>(e.g. with `docker run --restart=…`)
--
- run `cron` in a container, and start backup containers from there
--
- listen to the Docker events stream, automatically scheduling backups
<br/>when database containers are started
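The first option (cron on the Docker host) could look like this in root's crontab; the schedule and the backup image name are hypothetical:

```
# Run a backup container against the Redis data every night at 02:00
0 2 * * * docker run --rm --volumes-from dockercoins_redis_1:ro -v /tmp/myredis:/output myorg/redis-backup
```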
---
# Controlling Docker from a container
- In a local environment, just bind-mount the Docker control socket:
```bash
docker run -ti -v /var/run/docker.sock:/var/run/docker.sock docker
```
- Otherwise, you have to:
- set `DOCKER_HOST`,
- set `DOCKER_TLS_VERIFY` and `DOCKER_CERT_PATH` (if you use TLS),
- copy certificates to the container that will need API access.
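As a sketch of the second case, the environment could be set like this (the endpoint address and certificate path are hypothetical; adjust them to your setup):

```bash
# Hypothetical manager endpoint and client certificate location
export DOCKER_HOST=tcp://node1:2376
export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH=~/.docker/machine/machines/node1
```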
More resources on this topic:
- [Do not use Docker-in-Docker for CI](
http://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/)
- [One container to rule them all](
http://jpetazzo.github.io/2016/04/03/one-container-to-rule-them-all/)
---
# Docker events stream
- Using the Docker API, we can get real-time
notifications of everything happening in the Engine:
- container creation/destruction
- container start/stop
- container exit/signal/out of memory
- container attach/detach
- volume creation/destruction
- network creation/destruction
- connection/disconnection of containers
---
## Subscribing to the events stream
- This is done with `docker events`
.exercise[
- Get a stream of events:
```bash
docker events
```
<!--
```meta
^Z
```
-->
- In a new terminal, do *anything*:
```bash
docker run --rm alpine sleep 10
```
]
You should see events for the lifecycle of the
container, as well as its connection/disconnection
to the default `bridge` network.
---
## A few tools to use the events stream
- [docker-spotter](https://github.com/discordianfish/docker-spotter)
Written in Go; simple building block to use directly in Shell scripts
- [ahab](https://github.com/instacart/ahab)
Written in Python; available as a library; ships with a CLI tool
---
# Security upgrades
- This section is not hands-on
- Public Service Announcement
- We'll discuss:
- how to upgrade the Docker daemon
- how to upgrade container images
---
## Upgrading the Docker daemon
- Stop all containers cleanly
- Stop the Docker daemon
- Upgrade the Docker daemon
- Start the Docker daemon
- Start all containers
- This is like upgrading your Linux kernel, but it will get better
(Docker Engine 1.11 is using containerd, which will ultimately allow seamless upgrades.)
???
## In practice
- Keep track of running containers before stopping the Engine:
```bash
docker ps --no-trunc -q |
tee /tmp/running |
xargs -n1 -P10 docker stop
```
- Restart those containers after the Engine is running again:
```bash
xargs docker start < /tmp/running
```
<br/>(Run this multiple times if you have linked containers!)
---
## Upgrading container images
- When a vulnerability is announced:
- if it affects your base images: make sure they are fixed first
- if it affects downloaded packages: make sure they are fixed first
- re-pull base images
- rebuild
- restart containers
---
## How do we know when to upgrade?
- Subscribe to CVE notifications
- https://cve.mitre.org/
- your distros' security announcements
- Check CVE status in official images
<br/>(tag [cve-tracker](
https://github.com/docker-library/official-images/labels/cve-tracker)
in [docker-library/official-images](
https://github.com/docker-library/official-images)
repo)
- Use a container vulnerability scanner
<br/>(e.g. [Docker Security Scanning](https://blog.docker.com/2016/05/docker-security-scanning/))
---
## Upgrading with Compose
Compose makes this particularly easy:
```bash
docker-compose build --pull --no-cache
docker-compose up -d
```
This will automatically:
- pull base images;
- rebuild all container images;
- bring up the new containers.
Remember: Compose will automatically move our
volumes to the new containers, so data is preserved.
---
class: title
# Resiliency <br/> and <br/> high availability
---
## What are our single points of failure?
- The TLS certificates created by Machine are on `node1`
- We have only one Swarm manager
- If a node (running containers) is down or unreachable,
our application will be affected
---
# Distributing Machine credentials
- All the credentials (TLS keys and certs) are on node1
<br/>(the node on which we ran `docker-machine create`)
- If we lose node1, we're toast
- We need to move (or copy) the credentials somewhere safe
- Credentials are regular files, and relatively small
- Ah, if only we had a highly available, hierarchical store ...
--
- Wait a minute, we have one!
--
(That's Consul, if you were wondering)
---
## Storing files in Consul
- We will use [Benjamin Wester's consulfs](
https://github.com/bwester/consulfs)
- It mounts a Consul key/value store as a local filesystem
- Performance will be horrible
<br/>(don't run a database on top of that!)
- But to store files of a few KB, nobody will notice
- We will copy/link/sync... `~/.docker/machine` to Consul
---
## Installing consulfs
- Option 1: install Go, git clone, go build ...
- Option 2: be lazy and use [jpetazzo/consulfs](
https://hub.docker.com/r/jpetazzo/consulfs/)
.exercise[
- Be lazy and use the Docker image:
```bash
eval $(docker-machine env node1)
docker run --rm -v /usr/local/bin:/target jpetazzo/consulfs
```
]
Note: the `jpetazzo/consulfs` image contains the
`consulfs` binary.
It copies it to `/target` (if `/target` is a volume).
---
## Can't we run consulfs in a container?
- Yes we can!
- The filesystem will be mounted in the container
- It won't be visible outside of the container (from the host)
- We can use *shared mounts* to propagate mounts from containers to Docker
- But propagating from Docker to the host requires particular systemd flags
- ... So we'll run it on the host for now
---
## Running consulfs
- The `consulfs` binary takes two arguments:
- the Consul server address
- a mount point (that has to be created first)
.exercise[
- Create a mount point and mount Consul as a local filesystem:
```bash
mkdir ~/consul
consulfs localhost:8500 ~/consul
```
]
Leave this running in the foreground.
---
## Checking our consulfs mount point
- All key/values will be visible:
- Swarm discovery
- overlay networks
- ... anything you put in Consul!
.exercise[
- Check that Consul key/values are visible:
```bash
ls -l ~/consul/
```
]
---
## Copying our credentials to Consul
- Use standard UNIX commands
- Don't try to preserve permissions, though (`consulfs` doesn't store permissions)
.exercise[
- Copy Machine credentials into Consul:
```bash
cp -r ~/.docker/machine/. ~/consul/machine/
```
]
(This command can be re-executed to update the copy.)
---
## Install consulfs on another node
- We will repeat the previous steps to install consulfs
.exercise[
- Connect to node2:
```bash
ssh node2
```
- Install `consulfs`:
```bash
docker run --rm -v /usr/local/bin:/target jpetazzo/consulfs
```
]
---
## Mount Consul
- The procedure is still the same as on the first node
.exercise[
- Create the mount point:
```bash
mkdir ~/consul
```
- Mount the filesystem:
```bash
consulfs localhost:8500 ~/consul &
```
]
At this point, `ls -l ~/consul` should show `docker` and
`machine` directories.
---
## Access the credentials from the other node
- We will create a symlink
- We could also copy the credentials
.exercise[
- Create the symlink:
```bash
mkdir -p ~/.docker/
ln -s ~/consul/machine ~/.docker/
```
- Check that all nodes are visible:
```bash
docker-machine ls
```
]
---
## A few words on this strategy
- Anyone accessing Consul can control your Docker cluster
<br/>(to be fair: anyone accessing Consul can wreak
serious havoc on your cluster anyway)
- ConsulFS doesn't support *all* POSIX operations,
so a few things (like `mv`) will not work
- As a consequence, with Machine 0.6, you cannot
run `docker-machine create` directly on top of ConsulFS
---
## What if Consul becomes unavailable?
- If Consul becomes unavailable (e.g. loses quorum),
<br/>you won't be able to access your credentials
- If Consul becomes unavailable ...
<br/>your cluster will be in a bad state anyway
- You can still access each Docker Engine over the
local UNIX socket
<br/>(and repair Consul that way)
---
# Highly available Swarm managers
- Until now, the Swarm manager was a SPOF
<br/>(Single Point Of Failure)
- Swarm has support for replication
- When replication is enabled, you deploy multiple (identical) managers
- one will be "primary"
- the other(s) will be "secondary"
- this is determined automatically
<br/>(through *leader election*)
---
## Swarm leader election
- The leader election mechanism relies on a key/value store
<br/>(Consul, etcd, Zookeeper)
- There is no requirement on the number of replicas
<br/>(the quorum is achieved through the key/value store)
- When the leader (or "primary") is unavailable,
<br/>a new election happens automatically
- You can issue API requests to any manager:
<br/>if you talk to a secondary, it forwards to the primary
.warning[Until recently there was a bug when
the Consul cluster itself had a leader election;
<br/>see [docker/swarm#1782](https://github.com/docker/swarm/issues/1782).]
---
## Swarm replication in practice
- We need to give two extra flags to the Swarm manager:
- `--replication`
*enables replication (duh!)*
- `--advertise ip.ad.dr.ess:port`
*address and port where this Swarm manager is reachable*
- Do you deploy with Docker Machine?
<br/>Then you can use `--swarm-opt`
to automatically pass flags to the Swarm manager
---
## Cleaning up our current Swarm containers
- We will use Docker Machine to re-provision Swarm
- We need to:
- remove the nodes from the Machine registry
- remove the Swarm containers
.exercise[
- Remove the current configuration (remember to go back to node1!):
```bash
for N in 1 2 3 4 5; do
ssh node$N docker rm -f swarm-agent swarm-agent-master
docker-machine rm -f node$N
done
```
]
---
## Re-deploy with the new configuration
- This time, all nodes can be deployed identically
<br/>(instead of 1 manager + 4 non-managers)
.exercise[
```bash
grep node[12345] /etc/hosts | grep -v ^127 |
while read IPADDR NODENAME; do
docker-machine create --driver generic \
--engine-opt cluster-store=consul://localhost:8500 \
--engine-opt cluster-advertise=eth0:2376 \
--swarm --swarm-master \
--swarm-discovery consul://localhost:8500 \
--swarm-opt replication --swarm-opt advertise=$IPADDR:3376 \
--generic-ssh-user docker --generic-ip-address $IPADDR $NODENAME
done
```
]
.small[
Note: Consul is still running thanks to the `--restart=always` policy.
Other containers are now stopped, because the engines have been
reconfigured and restarted.
]
---
## Assess our new cluster health
- The output of `docker info` will tell us the status
of the node that we are talking to (primary or replica)
- If we talk to a replica, it will tell us who is the primary
.exercise[
- Talk to a random node, and ask its view of the cluster:
```bash
eval $(docker-machine env node3 --swarm)
docker info | grep -e ^Name -e ^Role -e ^Primary
```
]
Note: `docker info` is one of the few commands that will
work even when there is no elected primary. This helps
with debugging.
---
## Test Swarm manager failover
- The previous command told us which node was the primary manager
- if `Role` is `primary`,
<br/>then the primary is indicated by `Name`
- if `Role` is `replica`,
<br/>then the primary is indicated by `Primary`
.exercise[
- Kill the primary manager:
```bash
ssh node`N` docker kill swarm-agent-master
```
]
Look at the output of `docker info` every few seconds.
---
# Highly available containers
- Swarm has support for *rescheduling* on node failure
- It has to be explicitly enabled on a per-container basis
- When the primary manager detects that a node goes down,
<br/>those containers are rescheduled elsewhere
- If the containers can't be rescheduled (constraints issue),
<br/>they are lost (there is no reconciliation loop yet)
- In Swarm 1.1, this is an *experimental* feature
<br/>(To enable it, you must pass the `--experimental` flag when you start Swarm itself!)
- In Swarm 1.2, you don't need the `--experimental` flag anymore
---
## About Swarm generic flags
- Some flags like `--experimental` and `--debug` must be *before* the Swarm command
<br/>(i.e. `docker run swarm --debug manage ...`)
- We cannot use Docker Machine to pass that flag ☹
<br/>(Machine adds flags *after* the Swarm command)
- Instead, we can use a custom Swarm image:
```dockerfile
FROM swarm
ENTRYPOINT ["/swarm", "--debug"]
```
- We can tell Machine to use this with `--swarm-image`
---
## Start a resilient container
- By default, containers will not be restarted when their node goes down
- You must pass an explicit *rescheduling policy* to make that happen
- For now, the only policy is "on-node-failure"
.exercise[
- Start a container with a rescheduling policy:
```bash
docker run --name highlander -d -e reschedule:on-node-failure nginx
```
]
Check that the container is up and running.
---
## Simulate a node failure
- We will reboot the node running this container
- Swarm will reschedule it
.exercise[
- Check on which node the container is running:
<br/>`NODE=$(docker inspect --format '{{.Node.Name}}' highlander)`
- Reboot that node:
<br/>`ssh $NODE sudo reboot`
- Check that the container has been rescheduled:
<br/>`docker ps -a`
]
---
## Reboots
- When rebooting a node, Docker is stopped cleanly, and containers are stopped
- Our container is rescheduled, but not started
- To simulate a "proper" failure, we can use the Chaos Monkey script instead
```bash
~/orchestration-workshop/bin/chaosmonkey $NODE <connect|disconnect|reboot>
```
---
## Cluster reconciliation
- After the cluster rejoins, we can end up with duplicate containers
.exercise[
- Once the node is back, remove one of the extraneous containers:
```bash
docker rm -f node`N`/highlander
```
]
---
## .warning[Caveats]
- There are some corner cases when the node is also
the Swarm leader or the Consul leader; this is being improved
right now!
- The safest way to address this for now is to run the Consul
servers, the Swarm managers, and your containers on
different nodes.
- Swarm doesn't gracefully handle the fact that, after the
reboot, you have *two* containers named `highlander`;
attempts to manipulate the container by name
will not work. This will be improved too.
---
class: title
# Conclusions
---
## Swarm cluster deployment
- We saw how to use Machine with the `generic` driver to turn
any set of machines into a Swarm cluster
- This can trivially be adapted to provision cloud instances
on the fly (using "normal" drivers of Docker Machine)
- For auto-scaling, you can use e.g.:
- private admin-only network
- no TLS
- static discovery on a /24 to /20 network (depending on your needs)
---
## Key/value store
- We saw an easy deployment method for Consul
- This is good for 3 to 9 nodes
- Remember: raft write performance *degrades* as you add nodes!
- For bigger clusters:
- have e.g. 5 "static" server nodes
- put them in round robin DNS record set (or behind an ELB)
- run a normal agent on the other nodes
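For example, a round robin record set for the "static" server nodes could look like this in a zone file (names and addresses hypothetical):

```
consul.internal.  300 IN A 10.0.0.11
consul.internal.  300 IN A 10.0.0.12
consul.internal.  300 IN A 10.0.0.13
```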
---
## App deployment
- We saw how to transform a Compose file into a series of build artifacts
- using S3 or another object store is trivial
- We saw how to programmatically add load balancing, logging
- This can be improved further by using variable interpolation for the image tags
- Rolling deploys are relatively straightforward, but:
- I recommend aiming directly for blue/green (or canary) deploys
- In the production stack, abstract stateful services with ambassadors
---
## Operations
- We saw how to set up an ELK stack and send logs to it in record time
*Important: this doesn't mean that operating ELK suddenly became an easy thing!*
- We saw how to translate a few basic tasks to containerized environments
(Backups, network traffic analysis)
- Debugging is surprisingly similar to what it used to be:
- remember that containerized processes are normal processes running on the host
- `docker exec` is your friend
- also: `docker run --net host --pid host -v /:/hostfs alpine chroot /hostfs`
---
## Things we haven't covered
- Per-container system metrics (look at cAdvisor, Snap, Prometheus...)
- Application metrics (continue to use whatever you were using before)
- Supervision (whatever you were using before still works exactly the same way)
- Tracking access to credentials and sensitive information (see Vault, Keywhiz...)
- ... (tell me what I should cover in future workshops!) ...
---
## Resilience
- We saw how to store important data (credentials) in Consul
- We saw how to achieve H/A for Swarm itself
- Rescheduling policies give us basic H/A for containers
- This will be improved in future releases
- Docker in general, and Swarm in particular, move *fast*
- Current high availability features are not Chaos-Monkey proof (yet)
- We (well, the Swarm team) are working to change that
---
## What's next?
- November 2015: Compose 1.5 + Engine 1.9 =
<br/>first release with multi-host networking
- January 2016: Compose 1.6 + Engine 1.10 =
<br/>embedded DNS server, experimental high availability
- April 2016: Compose 1.7 + Engine 1.11 =
<br/>round robin DNS records, huge improvements in HA
- Next release: another truckload of features
- I will deliver this workshop about twice a month
- Check out the GitHub repo for updated content!
<br/>(there is a tag for each big round of updates)
---
## Overall complexity
- The scripts used here are pretty simple (each is less than 100 LOCs)
- You can easily rewrite them in your favorite language,
<br/>adapt and customize them, in a few hours of time
- FYI: those scripts are smaller and simpler than the
scripts (cloud init etc) used to deploy the VMs for this
workshop!
- Docker Inc. has commercial products to wrap all this:
- Docker Cloud
<br/>(manage your Docker nodes from a SAAS portal)
- Docker Datacenter
<br/>(buzzword-compliant management solution:
<br/>turnkey, enterprise-class, on-premise, etc.)
---
class: title
# Thanks! <br/> Questions?
## [@jpetazzo](https://twitter.com/jpetazzo) <br/> [@docker](https://twitter.com/docker)
</textarea>
<script src="https://gnab.github.io/remark/downloads/remark-0.13.min.js" type="text/javascript">
</script>
<script type="text/javascript">
var slideshow = remark.create({
ratio: '16:9',
highlightSpans: true
});
</script>
</body>
</html>