class: title Docker
Orchestration
Workshop --- ## Logistics - Hello! We're `jerome at docker dot com` and `aj at soulshake dot net` - The tutorial will run from 1:20pm to 4:40pm - There will be a break from 3:00pm to 3:15pm - This will be FAST PACED, but DON'T PANIC! - All the content is publicly available (slides, code samples, scripts) - Live feedback, questions, help on [Gitter](http://container.training/chat) --- ## Chapter 1: getting started - Pre-requirements - VM environment - Our sample application - Running the application - Identifying bottlenecks - Scaling out - Connecting to containers on other hosts - Abstracting remote services with ambassadors --- ## Chapter 2: Swarm setup and deployment - Dynamic orchestration - Deploying Swarm - Picking a key/value store - Running containers on Swarm - Resource allocation - Multi-host networking - Building images with Swarm - Deploying a local registry - Scaling web services with Compose on Swarm --- ## Chapter 3: Docker for Ops - Logs - Setting up ELK to store container logs - Network traffic analysis - Backups - Controlling Docker from a container - Docker events stream - Security upgrades --- ## Chapter 4: high availability (additional content) - Distributing Machine credentials - Highly available Swarm managers - Highly available containers - Conclusions --- # Pre-requirements - Computer with network connection and SSH client - on Linux, OS X, FreeBSD... you are probably all set - on Windows, get [putty](http://www.putty.org/), [Git BASH](https://git-for-windows.github.io/), or [MobaXterm](http://mobaxterm.mobatek.net/) - Basic Docker knowledge
(but that's OK if you're not a Docker expert!) --- ## Nice-to-haves - [GitHub](https://github.com/join) account
(if you want to fork the repo; also used to join Gitter) - [Gitter](https://gitter.im/) account
(to join the conversation during the workshop) - [Docker Hub](https://hub.docker.com) account
(it's one way to distribute images on your Swarm cluster) --- ## Hands-on sections - The whole workshop is hands-on - I will show Docker in action - I invite you to reproduce what I do - All hands-on sections are clearly identified, like the gray rectangle below .exercise[ - This is the stuff you're supposed to do! - Go to [container.training](http://container.training/) to view these slides - Join the chat room on [Gitter](http://container.training/chat) ] --- # VM environment - Each person gets 5 private VMs (not shared with anybody else) - They'll be up until tonight - You have a little card with login+password+IP addresses - You can automatically SSH from one VM to another .exercise[ - Log into the first VM (`node1`) - Check that you can SSH (without password) to `node2`: ```bash ssh node2 ``` - Type `exit` or `^D` to come back to node1 ] --- ## We will (mostly) interact with node1 only - Unless instructed, **all commands must be run from the first VM, `node1`** - We will only checkout/copy the code on `node1` - When we will use the other nodes, we will do it mostly through the Docker API - We will use SSH only for a few "out of band" operations (mass-removing containers...) --- ## Terminals Once in a while, the instructions will say:
"Open a new terminal." There are multiple ways to do this: - create a new window or tab on your machine, and SSH into the VM; - use screen or tmux on the VM and open a new window from there. You are welcome to use the method that you feel the most comfortable with. --- ## Tmux cheatsheet - Ctrl-b c → creates a new window - Ctrl-b n → go to next window - Ctrl-b p → go to previous window - Ctrl-b " → split window top/bottom - Ctrl-b % → split window left/right - Ctrl-b Alt-1 → rearrange windows in columns - Ctrl-b Alt-2 → rearrange windows in rows - Ctrl-b arrows → navigate to other windows - Ctrl-b d → detach session - tmux attach → reattach to session --- ## Brand new versions! - Engine 1.11 - Compose 1.7 - Swarm 1.2 - Machine 0.6 .exercise[ - Check all installed versions: ```bash docker version docker-compose -v docker run --rm swarm -version docker-machine -v ``` ] --- ## Why are we not using the latest version of Machine? - The latest version of Machine is 0.7 - The way it deploys Swarm is different from 0.6 - This causes a regression in the strategy that we will use later - More details later! --- # Our sample application - Visit the GitHub repository with all the materials of this workshop:
https://github.com/jpetazzo/orchestration-workshop - The application is in the [dockercoins]( https://github.com/jpetazzo/orchestration-workshop/tree/master/dockercoins) subdirectory - Let's look at the general layout of the source code: there is a Compose file [docker-compose.yml]( https://github.com/jpetazzo/orchestration-workshop/blob/master/dockercoins/docker-compose.yml) ... ... and 4 other services, each in its own directory: - `rng` = web service generating random bytes - `hasher` = web service computing hash of POSTed data - `worker` = background process using `rng` and `hasher` - `webui` = web interface to watch progress --- ## Compose file format version *Particularly relevant if you have used Compose before...* - Compose 1.6 introduced support for a new Compose file format (aka "v2") - Services are no longer at the top level, but under a `services` section - There has to be a `version` key at the top level, with value `"2"` (as a string, not an integer) - Containers are placed on a dedicated network, making links unnecessary - There are other minor differences, but upgrade is easy and straightforward --- ## Links, naming, and service discovery - Containers can have network aliases (resolvable through DNS) - Compose file version 2 makes each container reachable through its service name - Compose file version 1 requires "links" sections - Our code can connect to services using their short name (instead of e.g. IP address or FQDN) --- ## Example in `worker/worker.py`  --- ## What's this application? --- class: pic  (DockerCoins 2016 logo courtesy of @XtlCnslt and @ndeloof. Thanks!) --- ## What's this application? - It is a DockerCoin miner! 
💰🐳📦🚢

- No, you can't buy coffee with DockerCoins

- How DockerCoins works:

  - `worker` asks `rng` to give it random bytes
  - `worker` feeds those random bytes into `hasher`
  - each hash starting with `0` is a DockerCoin
  - DockerCoins are stored in `redis`
  - `redis` is also updated every second to track speed
  - you can see the progress with the `webui`

---

## Getting the application source code

- We will clone the GitHub repository

- The repository also contains scripts and tools that we will use throughout the workshop

.exercise[

- Clone the repository on `node1`:
  ```bash
  git clone https://github.com/jpetazzo/orchestration-workshop
  ```

]

(You can also fork the repository on GitHub and clone your fork if you prefer that.)

---

# Running the application

Without further ado, let's start our application.

.exercise[

- Go to the `dockercoins` directory, in the cloned repo:
  ```bash
  cd ~/orchestration-workshop/dockercoins
  ```

- Use Compose to build and run all containers:
  ```bash
  docker-compose up
  ```

]

Compose tells Docker to build all container images (pulling the corresponding base images), then starts all containers, and displays aggregated logs.

---

## Lots of logs

- The application continuously generates logs

- We can see the `worker` service making requests to `rng` and `hasher`

- Let's put that in the background

.exercise[

- Stop the application by hitting `^C`

]

- `^C` stops all containers by sending them the `TERM` signal

- Some containers exit immediately, others take longer
(because they don't handle `SIGTERM` and end up being killed after a 10s timeout) --- ## Restarting in the background - Many flags and commands of Compose are modeled after those of `docker` .exercise[ - Start the app in the background with the `-d` option: ```bash docker-compose up -d ``` - Check that our app is running with the `ps` command: ```bash docker-compose ps ``` ] `docker-compose ps` also shows the ports exposed by the application. --- ## Viewing logs - The `docker-compose logs` command works like `docker logs` .exercise[ - View all logs since container creation and exit when done: ```bash docker-compose logs ``` - Stream container logs, starting at the last 10 lines for each container: ```bash docker-compose logs --tail 10 --follow ``` ] Tip: use `^S` and `^Q` to pause/resume log output. ??? ## Upgrading from Compose 1.6 .warning[The `logs` command has changed between Compose 1.6 and 1.7!] - Up to 1.6 - `docker-compose logs` is the equivalent of `logs --follow` - `docker-compose logs` must be restarted if containers are added - Since 1.7 - `--follow` must be specified explicitly - new containers are automatically picked up by `docker-compose logs` --- ## Connecting to the web UI - The `webui` container exposes a web dashboard; let's view it .exercise[ - Open http://[yourVMaddr]:8000/ (from a browser) ] - The app actually has a constant, steady speed (3.33 coins/second) - The speed seems not-so-steady because: - the worker doesn't update the counter after every loop, but up to once per second - the speed is computed by the browser, checking the counter about once per second - between two consecutive updates, the counter will increase either by 4, or by 0 --- ## Scaling up the application - Our goal is to make that performance graph go up (without changing a line of code!) - Before trying to scale the application, we'll figure out if we need more resources (CPU, RAM...) 
- For that, we will use good old UNIX tools on our Docker node --- ## Looking at resource usage - Let's look at CPU, memory, and I/O usage .exercise[ - run `top` to see CPU and memory usage (you should see idle cycles) - run `vmstat 3` to see I/O usage (si/so/bi/bo)
(the 4 numbers should be almost zero, except `bo` for logging) ] We have available resources. - Why? - How can we use them? --- ## Scaling workers on a single node - Docker Compose supports scaling - Let's scale `worker` and see what happens! .exercise[ - Start one more `worker` container: ```bash docker-compose scale worker=2 ``` - Look at the performance graph (it should show a x2 improvement) - Look at the aggregated logs of our containers (`worker_2` should show up) - Look at the impact on CPU load with e.g. top (it should be negligible) ] --- ## Adding more workers - Great, let's add more workers and call it a day, then! .exercise[ - Start eight more `worker` containers: ```bash docker-compose scale worker=10 ``` - Look at the performance graph: does it show a x10 improvement? - Look at the aggregated logs of our containers - Look at the impact on CPU load and memory usage ] --- # Identifying bottlenecks - You should have seen a 3x speed bump (not 10x) - Adding workers didn't result in linear improvement - *Something else* is slowing us down -- - ... But what? -- - The code doesn't have instrumentation - Let's use state-of-the-art HTTP performance analysis!
(i.e. good old tools like `ab`, `httping`...) --- ## Measuring latency under load We will use `httping`. .exercise[ - Check the latency of `rng`: ```bash httping -c 10 localhost:8001 ``` - Check the latency of `hasher`: ```bash httping -c 10 localhost:8002 ``` ] `rng` has a much higher latency than `hasher`. --- ## Let's draw hasty conclusions - The bottleneck seems to be `rng` - *What if* we don't have enough entropy and can't generate enough random numbers? - We need to scale out the `rng` service on multiple machines! Note: this is a fiction! We have enough entropy. But we need a pretext to scale out.
(In fact, the code of `rng` uses `/dev/urandom`, which doesn't need entropy.) --- class: title # Scaling out --- # Connecting to containers on other hosts - So far, our whole stack is on a single machine - We want to scale out (across multiple nodes) - We will deploy the same stack multiple times - But we want every stack to use the same Redis
(in other words: Redis is our only *stateful* service here)

--

- And remember: we're not allowed to change the code!

  - the code connects to host `redis`
  - `redis` must resolve to the address of our Redis service
  - the Redis service must listen on the default port (6379)

???

## Using custom DNS mapping

- We could set up a Redis server on its default port

- And add a DNS entry mapping `redis` to this server

.exercise[

- See what happens if we run:
  ```bash
  docker run --add-host redis:1.2.3.4 alpine ping redis
  ```

]

There is a Compose file option for that: `extra_hosts`.

---

# Abstracting remote services with ambassadors

- We will use an ambassador

- Redis will be started independently of our stack

- It will run at an arbitrary location (host+port)

- In our stack, we replace `redis` with an ambassador

- The ambassador will connect to Redis

- The ambassador will "act as" Redis in the stack

---

## Start redis

- Start a standalone Redis container

- Let Docker expose it on a random port

.exercise[

- Run redis with a random public port:
`docker run -d -P --name myredis redis` - Check which port was allocated:
`docker port myredis 6379` ] - Note the IP address of the machine, and this port --- ## Introduction to `jpetazzo/hamba` - General purpose load balancer and traffic director - [Source code is available on GitHub]( https://github.com/jpetazzo/hamba) - [Public image is available on the Docker Hub]( https://hub.docker.com/r/jpetazzo/hamba/) - Generates a configuration file for HAProxy, then starts HAProxy - Parameters are provided on the command line; for instance: ```bash docker run -d -p 80 jpetazzo/hamba 80 www1:1234 www2:2345 docker run -d -p 80 jpetazzo/hamba 80 www1 1234 www2 2345 ``` Those two commands do the same thing: they start a load balancer listening on port 80, and balancing traffic across www1:1234 and www2:2345 --- ## Update `docker-compose.yml` .exercise[ - Replace `redis` with an ambassador using `jpetazzo/hamba`: ```yaml redis: image: jpetazzo/hamba command: 6379 `AA.BB.CC.DD:EEEEE` ``` ] Shortcut: `docker-compose.yml-ambassador`
(But you still have to update `AA.BB.CC.DD:EEEEE`!) --- ## Start the stack on the first machine - Compose will detect the change in the `redis` service - It will replace `redis` with a `jpetazzo/hamba` instance .exercise[ - Just tell Compose to do its thing:
`docker-compose up -d` - Check that the stack is up and running:
`docker-compose ps`

- Look at the web UI to make sure that it works fine

]

---

## Controlling other Docker Engines

- Many tools in the ecosystem will honor the `DOCKER_HOST` environment variable

- Those tools include (obviously!) the Docker CLI and Docker Compose

- Our training VMs have been set up to accept API requests on port 55555
(without authentication - this is very insecure, by the way!) - We will see later how to setup mutual authentication with certificates --- ## Setting the `DOCKER_HOST` environment variable .exercise[ - Check how many containers are running on `node1`: ```bash docker ps ``` - Set the `DOCKER_HOST` variable to control `node2`, and compare: ```bash export DOCKER_HOST=tcp://node2:55555 docker ps ``` ] You shouldn't see any container running on `node2` at this point. --- ## Start the stack on another machine - We will tell Compose to bring up our stack on the other node - It will use the local code (we don't need to checkout the code on `node2`) .exercise[ - Start the stack: ```bash docker-compose up -d ``` ] Note: this will build the container images on `node2`, resulting in potentially different results from `node1`. We will see later how to use the same images across the whole cluster. --- ## Run the application on every node - We will repeat the previous step with a little shell loop ... but introduce parallelism to save some time .exercise[ - Deploy one instance of the stack on each node: ```bash for N in 3 4 5; do DOCKER_HOST=tcp://node$N:55555 docker-compose up -d & done wait ``` ] Note: again, this will rebuild the container images on each node. --- ## Scale! - The app is built (and running!) everywhere - Scaling can be done very quickly .exercise[ - Add a bunch of workers all over the place: ```bash for N in 1 2 3 4 5; do DOCKER_HOST=tcp://node$N:55555 docker-compose scale worker=10 done ``` - Admire the result in the web UI! ] --- ## A few words about development volumes - Try to access the web UI on another node -- - It doesn't work! Why? -- - Static assets are masked by an empty volume -- - We need to comment out the `volumes` section --- ## Why must we comment out the `volumes` section? - Volumes have multiple uses: - storing persistent stuff (database files...) - sharing files between containers (logs, configuration...) 
  - sharing files between host and containers (source...)

- The `volumes` directive expands to a host path:

  `/home/docker/orchestration-workshop/dockercoins/webui/files`

- This host path exists on the local machine (not on the others)

- This specific volume is used in development (not in production)

---

## Stop the app

- Let's use `docker-compose down`

- It will stop and remove the DockerCoins app (but leave other containers running)

.exercise[

- We can do another simple parallel shell loop:
  ```bash
  for N in $(seq 1 5); do
    export DOCKER_HOST=tcp://node$N:55555
    docker-compose down &
  done
  wait
  ```

]

---

## Clean up the redis container

- `docker-compose down` only removes containers defined with Compose

.exercise[

- Check that `myredis` is still there:
  ```bash
  unset DOCKER_HOST
  docker ps
  ```

- Remove it:
  ```bash
  docker rm -f myredis
  ```

]

---

## Considerations about ambassadors

"Ambassador" is a design pattern.

There are many ways to implement it.

Other implementations include:

- [interlock](https://github.com/ehazlett/interlock);
- [registrator](http://gliderlabs.com/registrator/latest/);
- [smartstack](http://nerds.airbnb.com/smartstack-service-discovery-cloud/);
- [zuul](https://github.com/Netflix/zuul/wiki);
- and more!

???

## Single-tier ambassador deployment

- One-shot configuration process

- Must be executed manually after each scaling operation

- Scans current state, updates load balancer configuration

- Pros:
- simple, robust, no extra moving part
- easy to customize (thanks to simple design)
- can deal efficiently with large changes - Cons:
- must be executed after each scaling operation
- harder to compose different strategies - Example: this workshop ??? ## Two-tier ambassador deployment - Daemon listens to Docker events API - Reacts to container start/stop events - Adds/removes back-ends to load balancers configuration - Pros:
- no extra step required when scaling up/down - Cons:
- extra process to run and maintain
- deals with one event at a time (ordering matters) - Hidden gotcha: load balancer creation - Example: interlock ??? ## Three-tier ambassador deployment - Daemon listens to Docker events API - Reacts to container start/stop events - Adds/removes scaled services in distributed config DB (Zookeeper, etcd, Consul…) - Another daemon listens to config DB events,
adds/removes backends to load balancers configuration - Pros:
- more flexibility - Cons:
  - three extra services to run and maintain

- Example: registrator

---

## Ambassadors and overlay networks

- Overlay networks allow direct multi-host communication

- Ambassadors are still useful to implement other tasks:

  - load balancing;
  - credentials injection;
  - instrumentation;
  - fail-over;
  - etc.

---

class: title

# Dynamic orchestration

---

## Static vs Dynamic

- Static

  - you decide what goes where
  - simple to describe and implement
  - seems easy at first but doesn't scale efficiently

- Dynamic

  - the system decides what goes where
  - requires extra components (HA KV...)
  - scaling can be finer-grained, more efficient

---

class: pic

## Hands-on Swarm

---

## Swarm (in theory)

- Consolidates multiple Docker hosts into a single one

- You talk to Swarm using the Docker API
  → you can use all existing tools: Docker CLI, Docker Compose, etc.

- Swarm talks to your Docker Engines using the Docker API too
  → you can use existing Engines without modification

- Dispatches (schedules) your containers across the cluster, transparently

- Open source and written in Go (like the Docker Engine)

- Initial design and implementation by [@aluzzardi](https://twitter.com/aluzzardi) and [@vieux](https://twitter.com/vieux), who were also the authors of the first versions of the Docker Engine

---

## Swarm (in practice)

- Stable since November 2015

- Easy to set up (compared to other orchestrators)

- Tested with 1000 nodes + 50000 containers
.small[(without particular tuning; see DockerCon EU opening keynotes!)] - Requires a key/value store for advanced features - Can use Consul, etcd, or Zookeeper --- # Deploying Swarm - Components involved: - cluster discovery mechanism
(so that the manager can learn about the nodes) - Swarm manager
(your frontend to the cluster) - Swarm agent
(runs on each node, registers it with service discovery) --- ## Cluster discovery - Possible backends: - dynamic, self-hosted
(requires running a Consul/etcd/Zookeeper cluster)

  - static, through command-line or file
(great for testing, or for private subnets, see [this article]( https://medium.com/on-docker/docker-swarm-flat-file-engine-discovery-2b23516c71d4#.6vp94h5wn)) - external, token-based
(dynamic; nothing to operate; relies on an external service operated by Docker Inc.)

---

## Swarm agent

- Used only for dynamic discovery (ZK, etcd, Consul, token)

- Must run on each node

- Every 20s (by default), tells the discovery system:
  *"Hello, there is a Swarm node at A.B.C.D:EFGH"*

- Must know the node's IP address
  (It cannot figure it out by itself, because it doesn't know whether to use public or private addresses)

- The node continues to work even if the agent dies

---

## Swarm manager

- Accepts Docker API requests

- Communicates with the cluster nodes

- Performs healthchecks, scheduling...

---

# Picking a key/value store

- We are going to use a key/value store, and use it for:

  - cluster membership discovery
  - overlay networks backend
  - resilient storage of important credentials
  - Swarm leader election

- We are going to use Consul, and run one Consul instance on each node
  (That way, we can always access Consul over localhost)

---

## Do we really need a key/value store?

- Cluster membership discovery doesn't *require* a key/value store
  (We could use the token mechanism instead)

- Network overlays don't *require* a key/value store
  (We could use a plugin like Weave instead)

- Credentials can be distributed through other mechanisms
  (E.g. copying them to a private S3 bucket)

- Swarm leader election, however, requires a key/value store

---

## Why are we using a key/value store, then?

- Each aforementioned mechanism requires some reliable, distributed storage

- If we don't use our own key/value store, we end up using *something else*:

  - Docker Inc.'s centralized token discovery service
  - [Weave's CRDT protocol](https://github.com/weaveworks/weave/wiki/IP-allocation-design)
  - AWS S3 (or your cloud provider's equivalent, or some other file storage system)

- Each of those is one extra potential point of failure

- See for instance [Kyle Kingsbury's analysis of Chronos](https://aphyr.com/posts/326-jepsen-chronos) for an illustration of this problem

- By operating our own key/value store, we have 1 extra service instead of 3 (or more)

---

## Should we always use a key/value store?

--

- No!

--

- If you don't want to operate your own key/value store, don't do it

- You might be more comfortable using tokens + Weave + S3, for instance

- You can also use static discovery

- Maybe you don't even need overlay networks

---

## Why Consul?

- Consul is not the "official" or best way to do this

- This is an arbitrary decision made by Truly Yours

- I *personally* find Consul easier to set up for a workshop like this

- ... But etcd and Zookeeper will work too!

---

## Setting up our Swarm cluster

We need to:

- create certificates,
- distribute them on our nodes,
- run the Swarm agent on every node,
- run the Swarm manager on `node1`,
- reconfigure the Engine on each node to add extra flags (for overlay networks).

That's a lot of work, so we'll use Docker Machine to automate this.

---

## Using Docker Machine to set up a Swarm cluster

- Docker Machine has two primary uses:

  - provisioning cloud instances running the Docker Engine
  - managing local Docker VMs within e.g.
VirtualBox - It can also create Swarm clusters, and will: - create and manage certificates - automatically start swarm agent and manager containers - It comes with a special driver, `generic`, to (re)configure existing machines --- ## Setting up Docker Machine - Install `docker-machine` (single binary download) (This is already done on your VMs!) - Set a few environment variables (cloud credentials) ```bash export AWS_ACCESS_KEY_ID=AKI... export AWS_SECRET_ACCESS_KEY=... export AWS_DEFAULT_REGION=eu-west-2 export DIGITALOCEAN_ACCESS_TOKEN=... export DIGITALOCEAN_SIZE=2gb export AZURE_SUBSCRIPTION_ID=... ``` (We already have 5 nodes, so we don't need to do this!) --- ## Creating nodes with Docker Machine - The only two mandatory parameters are the driver to use, and the machine name: ```bash docker-machine create -d digitalocean node42 ``` - *Tons* of parameters can be specified; see [Docker Machine driver documentation](https://docs.docker.com/machine/drivers/) - To list machines and their status: ```bash docker-machine ls ``` - To destroy a machine: ```bash docker-machine rm node42 ``` --- ## Communicating with nodes managed by Docker Machine - Select a machine for use: ```bash eval $(docker-machine env node42) ``` This will set a few environment variables (at least `DOCKER_HOST`). - Execute regular commands with Docker, Compose, etc. (They will pick up remote host address from environment) - If you need to go under the hood, you can get SSH access: ```bash docker-machine ssh node42 ``` --- ## Docker Machine `generic` driver - Most drivers work the same way: - use cloud API to create instance - connect to instance over SSH - install Docker - The `generic` driver skips the first step - It can install Docker on any machine, as long as you have SSH access - We will use that! 
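
---

## A dry run of the `generic` driver invocation

To make the inputs of the `generic` driver concrete, here is a minimal sketch that only *prints* the `docker-machine create` commands instead of running them. The function name `print_generic_create` and the sample addresses are hypothetical, introduced purely for illustration; the flags `--driver generic`, `--generic-ssh-user` and `--generic-ip-address` are the ones this workshop relies on.

```bash
# Dry-run helper (hypothetical name): reads "IP NAME" pairs on stdin and
# prints the docker-machine command that would register each existing
# machine through the generic driver. Nothing is executed.
print_generic_create() {
  # First argument: SSH user on the target machines (assumed default: docker)
  ssh_user="${1:-docker}"
  while read -r ipaddr nodename; do
    # Skip blank lines and loopback entries
    [ -z "$ipaddr" ] && continue
    case "$ipaddr" in 127.*) continue ;; esac
    printf 'docker-machine create --driver generic --generic-ssh-user %s --generic-ip-address %s %s\n' \
      "$ssh_user" "$ipaddr" "$nodename"
  done
}

# Example with made-up addresses (192.0.2.0/24 is reserved for documentation)
printf '192.0.2.11 node1\n127.0.0.1 localhost\n192.0.2.12 node2\n' \
  | print_generic_create docker
```

Printing the commands first is a cheap way to review what Machine will be asked to do before it touches the nodes over SSH; Swarm and Engine options are deliberately left out of this sketch.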
--- ## Setting up Swarm with Docker Machine When invoking Machine, we will provide three sets of parameters: - the machine driver to use (`generic`) and the SSH connection information - Swarm-specific options indicating the cluster membership discovery mechanism - Extra flags to be passed to the Engine, to enable overlay networks --- ## Provisioning the first node .exercise[ - Use the following command to provision the manager node: ```bash docker-machine create --driver generic \ --engine-opt cluster-store=consul://localhost:8500 \ --engine-opt cluster-advertise=eth0:2376 \ --swarm --swarm-master --swarm-discovery consul://localhost:8500 \ --generic-ssh-user docker --generic-ip-address `AA.BB.CC.DD` node1 ``` ] --- ## Provisioning the other nodes - The command is almost the same, but without the `--swarm-master` flag - We will use a shell snippet for convenience .exercise[ ```bash grep node[2345] /etc/hosts | grep -v ^127 | while read IPADDR NODENAME do docker-machine create --driver generic \ --engine-opt cluster-store=consul://localhost:8500 \ --engine-opt cluster-advertise=eth0:2376 \ --swarm --swarm-discovery consul://localhost:8500 \ --generic-ssh-user docker \ --generic-ip-address $IPADDR $NODENAME done ``` ] --- ## Check what we did Let's connect to the first node *individually*. .exercise[ - Select the node with Machine ```bash eval $(docker-machine env node1) ``` - Execute some Docker commands ```bash docker version docker info ``` ] In the output of `docker info`, we should see `Cluster store` and `Cluster advertise`. --- ## Interact with the node Let's try a few basic Docker commands on this node. .exercise[ - Run a simple container: ```bash docker run --rm busybox echo hello world ``` - See running containers: ```bash docker ps ``` ] Two containers should show up: the agent and the manager. --- ## Connect to the Swarm cluster Now, let's try the same operations, but when talking to the Swarm manager. 
.exercise[ - Select the Swarm manager with Machine: ```bash eval $(docker-machine env node1 --swarm) ``` - Execute some Docker commands ```bash docker version docker info docker ps ``` ] The output is different! Let's review this. --- ## `docker version` Swarm identifies itself clearly: ``` Client: Version: 1.11.1 API version: 1.23 Go version: go1.5.4 Git commit: 5604cbe Built: Tue Apr 26 23:38:55 2016 OS/Arch: linux/amd64 Server: Version: swarm/1.2.2 API version: 1.22 Go version: go1.5.4 Git commit: 34e3da3 Built: Mon May 9 17:03:22 UTC 2016 OS/Arch: linux/amd64 ``` --- ## `docker info` The output of `docker info` on Swarm shows a number of differences from the output on a single Engine: .small[ ``` Containers: 0 Running: 0 Paused: 0 Stopped: 0 Images: 0 Server Version: swarm/1.2.2 Role: primary Strategy: spread Filters: health, port, containerslots, dependency, affinity, constraint Nodes: 0 Plugins: Volume: Network: Kernel Version: 4.2.0-36-generic Operating System: linux Architecture: amd64 CPUs: 0 Total Memory: 0 B Name: node1 Docker Root Dir: Debug mode (client): false Debug mode (server): false WARNING: No kernel memory limit support ``` ] --- ## Why zero nodes? - We haven't started Consul yet - Swarm discovery is not operational - Swarm can't discover the nodes Note: Docker will start (and be functional) without a K/V store. This lets us run Consul itself in a container. --- ## Adding Consul - We will run Consul in containers - We will use the [Consul official image]( https://hub.docker.com/_/consul/) that was released *very recently* - We will tell Docker to automatically restart it on reboots - To simplify network setup, we will use `host` networking --- ## A few words about `host` networking - Consul needs to be aware of its actual IP address (seen by other nodes) - It also binds a bunch of different ports - It makes sense (from a security point of view) to have Consul listening on localhost only (and have "users", i.e. Engine, Swarm, etc. 
connect over localhost) - Therefore, we will use `host` networking! - Also: Docker Machine 0.6 starts the Swarm containers in `host` networking ... - ... but Docker Machine 0.7 doesn't (which is why we stick to 0.6 for now) --- ## Consul fundamentals (if I must give you just one slide...) - Consul nodes can be "just an agent" or "server" - From the client's perspective, they behave the same - Only servers are members in the Raft consensus / leader election / etc (non-server agents forward requests to a server) - All nodes must be told the address of at least another node to join (except for the first node, where this is optional) - At least the first nodes must know how many nodes to expect to have quorum - Consul can have only one "truth" at a time (hence the importance of quorum) --- ## Starting our Consul cluster .exercise[ - Make sure you're logged into `node1`, and: ```bash IPADDR=$(ip a ls dev eth0 | sed -n 's,.*inet \(.*\)/.*,\1,p') for N in 1 2 3 4 5; do ssh node$N -- docker run -d --restart=always --name consul_node$N \ -e CONSUL_BIND_INTERFACE=eth0 --net host consul \ agent -server -retry-join $IPADDR -bootstrap-expect 5 \ -ui -client 0.0.0.0 done ``` ] Note: in production, you probably want to remove `-client 0.0.0.0` since it gives public access to your cluster! Also adapt `-bootstrap-expect` to your quorum. --- ## Check that our Consul cluster is up - With your browser, navigate to any instance on port 8500
(in "NODES" you should see the five nodes)

- Let's run a couple of useful Consul commands

.exercise[

- Ask Consul for the list of members it knows:
  ```bash
  docker run --net host --rm consul members
  ```

- Ask Consul which node is the current leader:
  ```bash
  curl localhost:8500/v1/status/leader
  ```

]

---

## Check that our Swarm cluster is up

.exercise[

- Run `docker info` again, as we did earlier:
  ```bash
  eval $(docker-machine env --swarm node1)
  docker info
  docker ps
  ```

]

All nodes should be visible. (If not, give them a minute or two to register.)

The Consul containers should be visible.

The Swarm containers, however, are hidden by Swarm (unless you use `docker ps -a`).

---

# Running containers on Swarm

Try to run a few `busybox` containers.

Then, let's get serious:

.exercise[

- Start a Redis service:
`docker run -dP redis` - See the service address:
`docker port $(docker ps -lq) 6379` ] This can be any of your five nodes. --- ## Scheduling strategies - Random: pick a node at random
(but honor resource constraints)
- Spread: pick the node with the fewest containers
(including stopped containers) - Binpack: try to maximize resource usage
(in other words: use as few hosts as possible) --- # Resource allocation - Swarm can honor resource reservations - This requires containers to be started with resource limits - Swarm refuses to schedule a container if it cannot honor a reservation .exercise[ - Start Redis containers with 1 GB of RAM until Swarm refuses to start more: ```bash docker run -d -m 1G redis ``` ] On a cluster of 5 nodes with ~3.8 GB of RAM per node, Swarm will refuse to start the 16th container. --- ## Removing our Redis containers - Let's use a little bit of shell scripting .exercise[ - Remove all containers using the redis image: ```bash docker ps | awk '/redis/ {print $1}' | xargs docker rm -f ``` ] ??? ## Things to know about resource allocation - `docker info` shows resource allocation for each node - Swarm allows a 5% resource overcommit (tunable) - Containers without resource reservation can always be started - Resources of stopped containers are still counted as being reserved - this guarantees that it will be possible to restart a stopped container - containers have to be deleted to free up their resources - `docker update` can be used to change resource allocation on the fly --- class: title # Setting up overlay networks --- # Multi-host networking - Docker 1.9 has the concept of *networks* - By default, containers are on the default "bridge" network - You can create additional networks - Containers can be on multiple networks - Containers can dynamically join/leave networks - The "overlay" driver lets networks span multiple hosts - Containers can have "network aliases" resolvable through DNS --- ## Manipulating networks, names, and aliases - The preferred method is to let Compose do the heavy lifting for us (YAML-defined networking!) - But if we really need to, we can use the Docker CLI, with: `docker network ...` `docker run --net ... 
--net-alias ...` - The following slides illustrate those commands --- ## Create a few networks and containers .exercise[ - Create two networks, *blue* and *green*: ```bash docker network create blue docker network create green docker network ls ``` - Create containers with names of blue and green things, on their respective networks: ```bash docker run -d --net-alias things --name sky --net blue -m 3G redis docker run -d --net-alias things --name navy --net blue -m 3G redis docker run -d --net-alias things --name grass --net green -m 3G redis docker run -d --net-alias things --name forest --net green -m 3G redis ``` ] --- ## Check connectivity within networks .exercise[ - Check that our containers are on different nodes: ```bash docker ps ``` - This will work: ```bash docker run --rm --net blue alpine ping -c 3 navy ``` - This will not: ```bash docker run --rm --net blue alpine ping -c 3 grass ``` ] ??? ## Containers connected to multiple networks - Some colors aren't *quite* blue *nor* green .exercise[ - Create a container that we want to be on both networks: ```bash docker run -d --net-alias things --net blue --name turquoise redis ``` - Check connectivity: ```bash docker exec -ti turquoise ping -c 3 navy docker exec -ti turquoise ping -c 3 grass ``` (First works; second doesn't) ] ??? ## Dynamically connecting containers - This is achieved with the command:
`docker network connect NETNAME CONTAINER` .exercise[ - Dynamically connect to the green network: ```bash docker network connect green turquoise ``` - Check connectivity: ```bash docker exec -ti turquoise ping -c 3 navy docker exec -ti turquoise ping -c 3 grass ``` (Both commands work now) ] --- ## Network aliases - Each container was created with the network alias `things` - Network aliases are scoped by network .exercise[ - Resolve the `things` alias from both networks: ```bash docker run --rm --net blue alpine nslookup things docker run --rm --net green alpine nslookup things ``` ] ??? ## Under the hood - Each network has an interface in the container - There is also an interface for the default gateway .exercise[ - View interfaces in our `turquoise` container: ```bash docker exec -ti turquoise ip addr ls ``` ] ??? ## Dynamically disconnecting containers - There is a mirror command to `docker network connect` .exercise[ - Disconnect the *turquoise* container from *blue* (its original network): ```bash docker network disconnect blue turquoise ``` - Check connectivity: ```bash docker exec -ti turquoise ping -c 3 navy docker exec -ti turquoise ping -c 3 grass ``` (First command fails, second one works) ] --- ## Cleaning up .exercise[ - Destroy containers: ```bash docker rm -f sky navy grass forest ``` - Destroy networks: ```bash docker network rm blue docker network rm green ``` ] --- ## Cleaning up after an outage or a crash - You cannot remove a network if it still has containers - There is no `"rm -f"` for network - If a network still has stale endpoints, you can use `"disconnect -f"` --- class: title # Building images with Swarm --- ## Building images with Swarm - Special care must be taken when building and running images - We *can* build images on Swarm (with `docker build` or `docker-compose build`) - One node will be picked at random, and the build will happen there - At the end of the build, the image will be present *only on that node* --- ## Building on 
Swarm can yield inconsistent results - Builds are scheduled on random nodes - Multiple builds and rebuilds can happen on different nodes - If a build happens on a different node, the cache of the previous build cannot be used - Worse: you can have two different images with the same name on your cluster --- ## Scaling won't work as expected Consider the following scenario: - `docker-compose up`
→ each service is built on a node, and runs there - `docker-compose scale`
→ additional containers for this service can only be spawned where the image was built - `docker-compose up` (again)
→ services might be built (and started) on different nodes - `docker-compose scale`
→ containers can be spawned with both the new and old images --- ## Scaling correctly with Swarm - After building an image, it should be distributed to the cluster (Or made available through a registry, so that nodes can download it automatically) - Instead of referencing images with the `:latest` tag, unique tags should be used (Using e.g. timestamps, version numbers, or VCS hashes) --- ## Why can't Swarm do this automatically for us? - Let's step back and think for a minute ... - What should `docker build` do on Swarm? - build on one machine - build everywhere ($$$) - After the build, what should `docker run` do? - run where we built (how do we know where it is?) - run on any machine that has the image - Could Compose+Swarm solve this automatically? --- ## A few words about "sane defaults" - *It would be nice if Swarm could pick a node, and build there!* - but which node should it pick? - what if the build is very expensive? - what if we want to distribute the build across nodes? - what if we want to tag some builder nodes? - ok but what if no node has been tagged? - *It would be nice if Swarm could automatically push images!* - using the Docker Hub is an easy choice
(you just need an account) - but some of us can't/won't use Docker Hub
(for compliance reasons or because no network access) .small[("Sane" defaults are nice only if we agree on the definition of "sane")] --- ## The plan - Build on a single node (`node1`) - Tag images with the current UNIX timestamp (for simplicity) - Upload them to a registry - Update the Compose file to use those images This is all automated with the [`build-tag-push.py` script](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/build-tag-push.py). --- ## Which registry do we want to use? .small[ - **Docker Hub** - hosted by Docker Inc. - requires an account (free, no credit card needed) - images will be public (unless you pay) - located in AWS EC2 us-east-1 - **Docker Trusted Registry** - self-hosted commercial product - requires a subscription (free 30-day trial available) - images can be public or private - located wherever you want - **Docker open source registry** - self-hosted barebones repository hosting - doesn't require anything - doesn't come with anything either - located wherever you want ] --- ## Using Docker Hub - Set the `DOCKER_REGISTRY` environment variable to your Docker Hub user name
(the `build-tag-push.py` script prefixes each image name with that variable) - We will also see how to run the open source registry
(so use whatever option you want!) .exercise[ - Set the following environment variable:
`export DOCKER_REGISTRY=jpetazzo` - (Use *your* Docker Hub login, of course!) - Log into the Docker Hub:
`docker login` ] --- ## Using Docker Trusted Registry If we wanted to use DTR, we would: - make sure we have a Docker Hub account - [activate a Docker Datacenter subscription]( https://hub.docker.com/enterprise/trial/) - install DTR on our machines - set `DOCKER_REGISTRY` to `dtraddress:port/user` *This is out of the scope of this workshop!* --- ## Using open source registry - We need to run a `registry:2` container
(make sure you specify tag `:2` to run the new version!) - It will store images and layers to the local filesystem
(but you can add a config file to use S3, Swift, etc.)

- Docker *requires* TLS when communicating with the registry,
  except for registries on `localhost`, or when the Engine
  is started with the `--insecure-registry` flag

- Our strategy: run a reverse proxy on `localhost:5000` on each node

---

## Registry frontends and backend

---

# Deploying a local registry

- There is a Compose file for that

.exercise[

- Go to the `registry` directory in the repository:
  ```bash
  cd ~/orchestration-workshop/registry
  ```

]

Let's examine the `docker-compose.yml` file.

---

## Running a local registry with Compose

```yaml
version: "2"

services:
  backend:
    image: registry:2
  frontend:
    image: jpetazzo/hamba
    command: 5000 backend:5000
    ports:
      - "127.0.0.1:5000:5000"
    depends_on:
      - backend
```

- *Backend* is the actual registry.

- *Frontend* is the ambassador that we deployed earlier.
It communicates with *backend* using an internal network and network aliases. --- ## Starting a local registry with Compose - We will bring up the registry - Then we will ensure that one *frontend* is running on each node by scaling it to our number of nodes .exercise[ - Start the registry: ```bash docker-compose up -d ``` ] --- ## "Scaling" the local registry - This is a particular kind of scaling - We just want to ensure that one *frontend* is running on every single node of the cluster .exercise[ - Scale the registry: ```bash for N in $(seq 1 5); do docker-compose scale frontend=$N done ``` ] Note: Swarm might do that automatically for us in the future. --- ## Testing our local registry - We can retag a small image, and push it to the registry .exercise[ - Make sure we have the busybox image, and retag it: ```bash docker pull busybox docker tag busybox localhost:5000/busybox ``` - Push it: ```bash docker push localhost:5000/busybox ``` ] --- ## Checking what's on our local registry - The registry API has endpoints to query what's there .exercise[ - Ensure that our busybox image is now in the local registry: ```bash curl http://localhost:5000/v2/_catalog ``` ] The curl command should output: ```json {"repositories":["busybox"]} ``` --- ## Adapting our Compose file to run on Swarm - We can get rid of all the `ports` section, except for the web UI .exercise[ - Go back to the dockercoins directory: ```bash cd ~/orchestration-workshop/dockercoins ``` ] --- ## Our new Compose file .small[ ```yaml version: '2' services: rng: build: rng hasher: build: hasher webui: build: webui ports: - "8000:80" redis: image: redis worker: build: worker ``` ] Copy-paste this into `docker-compose.yml`
(or you can `cp docker-compose.yml-v2 docker-compose.yml`) --- ## Use images, not builds - We need to replace each `build` with an `image` - We will use the `build-tag-push.py` script for that .exercise[ - Set `DOCKER_REGISTRY` to use our local registry - Make sure that you are building on `node1` - Then run the script ```bash export DOCKER_REGISTRY=localhost:5000 eval $(docker-machine env node1) ../bin/build-tag-push.py ``` ] --- ## Run the application - At this point, our app is ready to run .exercise[ - Start the application: ```bash export COMPOSE_FILE=docker-compose.yml-`NNN` eval $(docker-machine env node1 --swarm) docker-compose up -d ``` - Observe that it's running on multiple nodes:
(each container name is prefixed with the node it's running on)
  ```bash
  docker ps
  ```

]

---

## View the performance graph

- Load up the graph in the browser

.exercise[

- Check the `webui` service address and port:
  ```bash
  docker-compose port webui 80
  ```

- Open it in your browser

]

---

## Scaling workers

- Scaling the `worker` service works out of the box
  (like before)

.exercise[

- Scale `worker`:
  ```bash
  docker-compose scale worker=10
  ```

]

Check that workers are on different nodes.

However, we hit the same bottleneck as before.

How can we address that?

---

## Finding the real cause of the bottleneck

- If time permits, we can benchmark `rng` and `hasher` to find out more

- Otherwise, we'll fast-forward a bit

---

## Benchmarking in isolation

- If we want the benchmark to be accurate,
  we need to make sure that `rng` and `hasher`
  are not receiving traffic

.exercise[

- Stop the `worker` containers:
  ```bash
  docker-compose kill worker
  ```

]

---

## A better benchmarking tool

- Instead of `httping`, we will now use `ab` (Apache Bench)

- We will install it in an `alpine` container placed on
  the network used by our application

.exercise[

- Start an interactive `alpine` container on the `dockercoins_default` network:
  ```bash
  docker run -ti --net dockercoins_default alpine sh
  ```

- Install `ab` with the `apache2-utils` package:
  ```bash
  apk add --update apache2-utils
  ```

]

---

## Benchmarking `rng`

We will send 50 requests, but with various levels of concurrency.

.exercise[

- Send 50 requests, with a single sequential client:
  ```bash
  ab -c 1 -n 50 http://rng/10
  ```

- Send 50 requests, with ten parallel clients:
  ```bash
  ab -c 10 -n 50 http://rng/10
  ```

]

---

## Benchmark results for `rng`

- In both cases, the benchmark takes ~5 seconds to complete

- When serving requests sequentially, they each take 100ms

- In the parallel scenario, the latency increased dramatically:

  - one request is served in 100ms
  - another is served in 200ms
  - another is served in 300ms
  - ...
  - another is served in 1000ms

- What about `hasher`?

---

## Benchmarking `hasher`

We will do the same tests for `hasher`.

The command is slightly more complex, since we need to post random data.

First, we need to put the POST payload in a temporary file.

.exercise[

- Install curl in the container, and generate 10 bytes of random data:
  ```bash
  apk add curl
  curl http://rng/10 >/tmp/random
  ```

]

---

## Benchmarking `hasher`

Once again, we will send 50 requests, with different levels of concurrency.

.exercise[

- Send 50 requests with a sequential client:
  ```bash
  ab -c 1 -n 50 -T application/octet-stream \
     -p /tmp/random http://hasher/
  ```

- Send 50 requests with 10 parallel clients:
  ```bash
  ab -c 10 -n 50 -T application/octet-stream \
     -p /tmp/random http://hasher/
  ```

]

---

## Benchmark results for `hasher`

- The sequential benchmark takes ~5 seconds to complete

- The parallel benchmark takes less than 1 second to complete

- In both cases, each request takes a bit more than 100ms to complete

- Requests are a bit slower in the parallel benchmark

- It looks like `hasher` is better equipped to deal with concurrency than `rng`

---

class: title

Why?

---

## Why does everything take (at least) 100ms?

--

`rng` code:

--

`hasher` code:

---

class: title

But ... WHY?!?

---

## Why did we sprinkle this sample app with sleeps?

- Deterministic performance
(regardless of instance speed, CPUs, I/O...) -- - Actual code sleeps all the time anyway -- - When your code makes a remote API call: - it sends a request; - it sleeps until it gets the response; - it processes the response. --- ## Why do `rng` and `hasher` behave differently?  -- (Synchronous vs. asynchronous event processing) --- ## How to make `rng` go faster - Obvious solution: comment out the `sleep` instruction -- - Unfortunately, in the real world, network latency exists -- - More realistic solution: use an asynchronous framework
(e.g. use gunicorn with gevent) -- - Reminder: we can't change the code! -- - Solution: scale out `rng`
(dispatch `rng` requests on multiple instances)

---

# Scaling web services with Compose on Swarm

- We *can* scale network services with Compose

- The result may or may not be satisfactory, though!

.exercise[

- Restart the `worker` service:
  ```bash
  docker-compose start worker
  ```

- Scale the `rng` service:
  ```bash
  docker-compose scale rng=5
  ```

]

---

## Results

- In the web UI, you might see a performance increase ... or maybe not

--

- Since Engine 1.11, we get round-robin DNS records
  (i.e. resolving `rng` will yield the IP addresses of all 5 containers)

- Docker randomizes the records it sends

- But many resolvers will sort them in unexpected ways

- Depending on various factors, you could get:

  - all traffic on a single container
  - traffic perfectly balanced on all containers
  - traffic unevenly balanced across containers

---

## Assessing DNS randomness

- Let's see how our containers resolve DNS requests

.exercise[

- On each of our 10 scaled workers, execute 5 ping requests:
  ```bash
  for N in $(seq 1 10); do
    echo PING__________$N
    for I in $(seq 1 5); do
      docker exec -ti dockercoins_worker_$N ping -c1 rng
    done
  done | grep PING
  ```

]

(The 7th Might Surprise You!)

---

## DNS randomness

- Other programs can yield different results

- Same program on another distro can yield different results

- Same source code with another libc or resolver can yield different results

- Running the same test at different times can yield different results

- Did I mention that Your Results May Vary?
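
If we'd rather count than eyeball the `ping` output, here is a minimal sketch (not part of the workshop repository) that tallies which address each `ping` resolved to. The helper name `tally_ping_targets` is made up for this example, and it assumes that `ping` prints a header of the form `PING name (address)...`, which is what the busybox and iputils versions do.

```bash
# Hypothetical helper: count how many times each address appears
# in the "PING name (address)" header lines read from stdin.
tally_ping_targets() {
  # Keep only real ping headers (the "PING_____N" markers do not match),
  # extract the address between parentheses, then count and sort.
  sed -n 's/^PING [^ ]* (\([0-9.]*\)).*/\1/p' | sort | uniq -c | sort -rn
}

# Usage sketch (needs the running DockerCoins workers, so it is commented out):
#   for N in $(seq 1 10); do
#     for I in $(seq 1 5); do
#       docker exec dockercoins_worker_$N ping -c1 rng
#     done
#   done | tally_ping_targets
```

If one address gets all the hits, a single `rng` container is receiving all the traffic; roughly equal counts mean that DNS round robin happens to work well with this particular resolver.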
--- ## Implementing fair load balancing - Instead of relying on DNS round robin, let's use a proper load balancer - Use Compose to create multiple copies of the `rng` service - Put a load balancer in front of them - Point other services to the load balancer --- ## Naming problem - The service is called `rng` - Therefore, it is reachable with the network name `rng` - Our application code (the `worker` service) connects to `rng` - So the name `rng` should resolve to the load balancer - What should we do‽ --- ## Naming is *per-network* - Solution: put `rng` on its own network - That way, it doesn't take the network name `rng`
(at least not on the default network) - Have the load balancer sit on both networks - Add the name `rng` to the load balancer --- class: pic Original DockerCoins  --- class: pic Load-balanced DockerCoins  --- ## Declaring networks - Networks (other than the default one) *must* be declared in a top-level `networks` section, placed anywhere in the file .exercise[ - Add the `rng` network to the Compose file, `docker-compose.yml-NNN`: ```yaml version: '2' networks: rng: services: rng: image: ... ... ``` ] --- ## Putting the `rng` service in its network - Services can have a `networks` section - If they don't: they are placed in the default network - If they do: they are placed only in the mentioned networks .exercise[ - Change the `rng` service to put it in its network: ```yaml rng: image: localhost:5000/dockercoins_rng:… networks: rng: ``` ] --- ## Adding the load balancer - The load balancer has to be in both networks: `rng` and `default` - In the `default` network, it must have the `rng` alias - We will use the `jpetazzo/hamba` image .exercise[ - Add the `rng-lb` service to the Compose file: ```yaml rng-lb: image: jpetazzo/hamba command: run networks: rng: default: aliases: [ rng ] ``` ] --- ## Load balancer initial configuration - We specified `run` as the initial command - This tells `hamba` to wait for an initial configuration - The load balancer will not be operational (until we feed it its configuration) --- ## Start the application .exercise[ - Bring up DockerCoins: ```bash docker-compose up -d ``` - See that `worker` is complaining: ```bash docker-compose logs --tail 100 --follow worker ``` ] --- ## Add one backend to the load balancer - Multiple solutions: - lookup the IP address of the `rng` backend - use the backend's network name - use the backend's container name (easiest!) 
.exercise[ - Configure the load balancer: ```bash docker run --rm --volumes-from dockercoins_rng-lb_1 \ --net container:dockercoins_rng-lb_1 \ jpetazzo/hamba reconfigure 80 dockercoins_rng_1 80 ``` ] The application should now be working correctly. --- ## Add all backends to the load balancer - The command is similar to the one before - We need to pass the list of all backends .exercise[ - Reconfigure the load balancer: ```bash docker run --rm \ --volumes-from dockercoins_rng-lb_1 \ --net container:dockercoins_rng-lb_1 \ jpetazzo/hamba reconfigure 80 \ $(for N in $(seq 1 5); do echo dockercoins_rng_$N:80 done) ``` ] --- ## Automating the process - Nobody loves artisan YAML handicraft - This can be scripted very easily - But can it be fully automated? --- ## Use DNS to discover the addresses of all the backends - When multiple containers have the same network alias: - Engine 1.10 returns only one of them (the same one across the whole network) - Engine 1.11 returns all of them (in a random order) - A "smart" client can use all records to implement load balancing - We can compose `jpetazzo/hamba` with a special-purpose container, which will dynamically generate HAProxy's configuration when the DNS records are updated --- ## Introducing `jpetazzo/watchdns` - [100 lines of pure POSIX scriptery]( https://github.com/jpetazzo/watchdns/blob/master/watchdns) - Resolves a given DNS name every second - Each time the result changes, a new HAProxy configuration is generated - When used together with `--volumes-from` and `jpetazzo/hamba`, it updates the configuration of an existing load balancer - Comes with a companion script, [`add-load-balancer-v2.py`](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/add-load-balancer-v2.py), to update your Compose files --- ## Using `jpetazzo/watchdns` .exercise[ - First, revert the Compose file to remove the load balancer - Then, run `add-load-balancer-v2.py`: ```bash ../bin/add-load-balancer-v2.py rng ``` - Inspect the 
resulting Compose file

]

---

## Scaling with `watchdns`

.exercise[

- Start the application with the new sidekick containers:
  ```bash
  docker-compose up -d
  ```

- Scale `rng`:
  ```bash
  docker-compose scale rng=10
  ```

- Check logs:
  ```bash
  docker-compose logs rng-wd
  ```

]

---

## Comments

- This is a very crude implementation of the pattern

- A Go version would only be a bit longer, but would use far fewer resources

- When there are many backends, reacting quickly to change is less important
  (i.e. it's not necessary to re-resolve records every second!)

---

class: title

# All things ops
(logs, backups, and more) --- # Logs - Two strategies: - log to plain files on volumes - log to stdout
(and use a logging driver)

---

## Logging to plain files on volumes

(Sorry, that part won't be hands-on!)

- Start a container with `-v /logs`

- Make sure that all log files are in `/logs`

- To check logs, run e.g.
  ```bash
  docker run --volumes-from ... ubuntu sh -c "grep WARN /logs/*.log"
  ```

- Or just go interactive:
  ```bash
  docker run --volumes-from ... -ti ubuntu
  ```

- You can (should) start a log shipper that way

---

## Logging to stdout

- All containers should write to stdout/stderr

- Docker will collect logs and pass them to a logging driver

- The logging driver can be specified globally, and per container
(changing it for a container overrides the global setting) - To change the global logging driver, pass extra flags to the daemon
(requires a daemon restart) - To override the logging driver for a container, pass extra flags to `docker run` --- ## Specifying logging flags - `--log-driver` *selects the driver* - `--log-opt key=val` *adds driver-specific options*
*(can be repeated multiple times)*

- The flags are identical for `docker daemon` and `docker run`

---

## Logging flags in practice

- If you provision your nodes with Docker Machine,
  you can set global logging flags (which will apply to all
  containers started by a given Engine) like this:
  ```bash
  docker-machine create ... --engine-opt log-driver=...
  ```

- Otherwise, use your favorite method to edit or manage configuration files

- You can set per-container logging options in Compose files

---

## Available drivers

- json-file (default)
- syslog (can send to UDP, TCP, TCP+TLS, UNIX sockets)
- awslogs (AWS CloudWatch)
- journald
- gelf
- fluentd
- splunk

---

## About json-file ...

- It doesn't rotate logs by default, so your disks will fill up
  (Unless you set the `max-size` *and* `max-file` log options.)

- It's the only one supporting log retrieval
  (If you want to use `docker logs`, `docker-compose logs`,
  or fetch logs from the Docker API, you need json-file!)

- This might change in the future
  (But it's complex since there is no standard protocol
  to *retrieve* log entries.)

All about logging in the documentation:
https://docs.docker.com/reference/logging/overview/

---

# Setting up ELK to store container logs

*Important foreword: this is not an "official" or "recommended"
setup; it is just an example. We do not endorse ELK, GELF,
or the other elements of the stack more than others!*

What we will do:

- Spin up an ELK stack, with Compose

- Gaze at the spiffy Kibana web UI

- Manually send a few log entries over GELF

- Reconfigure our DockerCoins app to send logs to ELK

---

## What's in an ELK stack?
- ELK is three components: - ElasticSearch (to store and index log entries) - Logstash (to receive log entries from various sources, process them, and forward them to various destinations) - Kibana (to view/search log entries with a nice UI) - The only component that we will configure is Logstash - We will accept log entries using the GELF protocol - Log entries will be stored in ElasticSearch,
and displayed on Logstash's stdout for debugging --- ## Starting our ELK stack - We will use a *separate* Compose file - The Compose file is in the `elk` directory .exercise[ - Go to the `elk` directory: ```bash cd ~/orchestration-workshop/elk ``` - Start the ELK stack: ```bash unset COMPOSE_FILE docker-compose up -d ``` ] --- ## Making sure that each node has a local logstash - We will configure each container to send logs to `localhost:12201` - We need to make sure that each node has a logstash container listening on port 12201 .exercise[ - Scale the `logstash` service to 5 instances (one per node): ```bash for N in $(seq 1 5); do docker-compose scale logstash=$N done ``` ] --- ## Checking that our ELK stack works - Our default Logstash configuration sends a test message every minute - All messages are stored into ElasticSearch, but also shown on Logstash stdout .exercise[ - Look at Logstash stdout: ```bash docker-compose logs logstash ``` ] After less than one minute, you should see a `"message" => "ok"` in the output. --- ## Connect to Kibana - Our ELK stack exposes two public services:
the Kibana web server, and the GELF UDP socket - They are both exposed on their default port numbers
(5601 for Kibana, 12201 for GELF) .exercise[ - Check the address of the node running kibana: ```bash docker-compose ps ``` - Open the UI in your browser: http://instance-address:5601/ ] --- ## "Configuring" Kibana - If you see a status page with a yellow item, wait a minute and reload (Kibana is probably still initializing) - Kibana should offer you to "Configure an index pattern", just click the "Create" button - Then: - click "Discover" (in the top-left corner) - click "Last 15 minutes" (in the top-right corner) - click "Last 1 hour" (in the list in the middle) - click "Auto-refresh" (top-right corner) - click "5 seconds" (top-left of the list) - You should see a series of green bars (with one new green bar every minute) ---  --- ## Sending container output to Kibana - We will create a simple container displaying "hello world" - We will override the container logging driver - The GELF address is `127.0.0.1:12201`, because the Compose file explicitly exposes the GELF socket on port 12201 .exercise[ - Start our one-off container: ```bash docker run --rm --log-driver gelf \ --log-opt gelf-address=udp://127.0.0.1:12201 \ alpine echo hello world ``` ] --- ## Visualizing container logs in Kibana - Less than 5 seconds later (the refresh rate of the UI), the log line should be visible in the web UI - We can customize the web UI to be more readable .exercise[ - In the left column, move the mouse over the following columns, and click the "Add" button that appears: - host - container_name - message ] --- ## Switching back to the DockerCoins application .exercise[ - Go back to the dockercoins directory: ```bash cd ~/orchestration-workshop/dockercoins ``` - Set the `COMPOSE_FILE` variable: ```bash export COMPOSE_FILE=docker-compose.yml-`NNN` ``` ] --- ## Add the logging driver to the Compose file - We need to add the logging section to each container .exercise[ - Edit the `docker-compose.yml-NNN` file, adding the following lines **to each container**: ```yaml logging: driver: 
gelf options: gelf-address: "udp://127.0.0.1:12201" ``` ] There is also a script, [`../bin/add-logging.py`](https://github.com/jpetazzo/orchestration-workshop/blob/master/bin/add-logging.py), to do that automatically. --- ## Update the DockerCoins app .exercise[ - Use Compose normally: ```bash docker-compose up -d ``` ] If you look in the Kibana web UI, you will see log lines refreshed every 5 seconds. Note: to do interesting things (graphs, searches...) we would need to create indexes. This is beyond the scope of this workshop. --- ## Logging in production - If we were using an ELK stack: - scale ElasticSearch - interpose a Redis or Kafka queue to deal with bursts - Configure your Engines to send all logs to ELK by default - Start the logging containers with a different logging system
(to avoid a logging loop) - Make sure you don't end up writing *all logs* on the nodes running Logstash! --- # Network traffic analysis - We want to inspect the network traffic entering/leaving `dockercoins_redis_1` - We will use *shared network namespaces* to perform network analysis - Two containers sharing the same network namespace... - have the same IP addresses - have the same network interfaces - `eth0` is therefore the same in both containers --- ## Install and start `ngrep` Ngrep uses libpcap (like tcpdump) to sniff network traffic. .exercise[ - Start a container with the same network namespace:
`docker run --net container:dockercoins_redis_1 -ti alpine sh` - Install ngrep:
`apk update && apk add ngrep` - Run ngrep:
`ngrep -tpd eth0 -Wbyline . tcp` ] You should see a stream of Redis requests and responses. --- # Backups - We want to enable backups for `dockercoins_redis_1` - We don't want to install extra software in this container - We will use a special backup container: - sharing the same volumes - using the same network stack (to connect to it easily) - possibly containing our backup tools - This works because the `redis` container image stores its data on a volume --- ## Starting the backup container - We will use the `--net container:` option to be able to connect locally - We will use the `--volumes-from` option to access the container's persistent data .exercise[ - Start the container: ```bash docker run --net container:dockercoins_redis_1 \ --volumes-from dockercoins_redis_1:ro \ -v /tmp/myredis:/output \ -ti alpine sh ``` - Look in `/data` in the container (that's where Redis puts its data dumps) ] --- ## Connecting to Redis - We need to tell Redis to perform a data dump *now* .exercise[ - Connect to Redis: ```bash telnet localhost 6379 ``` - Issue commands `SAVE` then `QUIT` - Look at `/data` again (notice the time stamps) ] - There should be a recent dump file now! --- ## Getting the dump out of the container - We could use many things: - s3cmd to copy to S3 - SSH to copy to a remote host - gzip/bzip/etc before copying - We'll just copy it to the Docker host .exercise[ - Copy the file from `/data` to `/output` - Exit the container - Look into `/tmp/myredis` (on the host) ] --- ## Scheduling backups In the "old world," we (generally) use cron. With containers, what are our options? -- - run `cron` on the Docker host, and put `docker run` in the crontab -- - run `cron` in the backup container, and make sure it keeps running
(e.g. with `docker run --restart=…`) -- - run `cron` in a container, and start backup containers from there -- - listen to the Docker events stream, automatically scheduling backups
when database containers are started --- # Controlling Docker from a container - In a local environment, just bind-mount the Docker control socket: ```bash docker run -ti -v /var/run/docker.sock:/var/run/docker.sock docker ``` - Otherwise, you have to: - set `DOCKER_HOST`, - set `DOCKER_TLS_VERIFY` and `DOCKER_CERT_PATH` (if you use TLS), - copy certificates to the container that will need API access. More resources on this topic: - [Do not use Docker-in-Docker for CI]( http://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/) - [One container to rule them all]( http://jpetazzo.github.io/2016/04/03/one-container-to-rule-them-all/) --- # Docker events stream - Using the Docker API, we can get real-time notifications of everything happening in the Engine: - container creation/destruction - container start/stop - container exit/signal/out of memory - container attach/detach - volume creation/destruction - network creation/destruction - connection/disconnection of containers --- ## Subscribing to the events stream - This is done with `docker events` .exercise[ - Get a stream of events: ```bash docker events ``` - In a new terminal, do *anything*: ```bash docker run --rm alpine sleep 10 ``` ] You should see events for the lifecycle of the container, as well as its connection/disconnection to the default `bridge` network. 
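The stream above can also be consumed by a script. Here is a minimal sketch, assuming a Docker version whose `docker events` supports `--filter` and `--format`. The `watch_starts` function name and the three-field line layout are our own choices for illustration, not something the workshop scripts define.

```bash
# Sketch: react to "container start" events read from stdin.
# Each input line is assumed to look like "<type> <action> <name>",
# a layout we pick ourselves through --format (see usage below).
watch_starts() {
  local type action name
  while read -r type action name; do
    if [ "$type" = "container" ] && [ "$action" = "start" ]; then
      # Placeholder action: a real script might start a backup container here.
      echo "started: $name"
    fi
  done
}

# Hypothetical usage against a live Engine (not executed in this sketch):
#   docker events --filter type=container --filter event=start \
#     --format '{{.Type}} {{.Action}} {{.Actor.Attributes.name}}' | watch_starts
```

Keeping the parsing in a function that reads stdin means it can be tried with a few hand-written lines before plugging it into a real events stream.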
--- ## A few tools to use the events stream - [docker-spotter](https://github.com/discordianfish/docker-spotter) Written in Go; simple building block to use directly in Shell scripts - [ahab](https://github.com/instacart/ahab) Written in Python; available as a library; ships with a CLI tool --- # Security upgrades - This section is not hands-on - Public Service Announcement - We'll discuss: - how to upgrade the Docker daemon - how to upgrade container images --- ## Upgrading the Docker daemon - Stop all containers cleanly - Stop the Docker daemon - Upgrade the Docker daemon - Start the Docker daemon - Start all containers - This is like upgrading your Linux kernel, but it will get better (Docker Engine 1.11 is using containerd, which will ultimately allow seamless upgrades.) ??? ## In practice - Keep track of running containers before stopping the Engine: ```bash docker ps --no-trunc -q | tee /tmp/running | xargs -n1 -P10 docker stop ``` - Restart those containers after the Engine is running again: ```bash xargs docker start < /tmp/running ```
(Run this multiple times if you have linked containers!) --- ## Upgrading container images - When a vulnerability is announced: - if it affects your base images: make sure they are fixed first - if it affects downloaded packages: make sure they are fixed first - re-pull base images - rebuild - restart containers --- ## How do we know when to upgrade? - Subscribe to CVE notifications - https://cve.mitre.org/ - your distros' security announcements - Check CVE status in official images
(tag [cve-tracker]( https://github.com/docker-library/official-images/labels/cve-tracker) in [docker-library/official-images]( https://github.com/docker-library/official-images/labels/cve-tracker) repo) - Use a container vulnerability scanner
(e.g. [Docker Security Scanning](https://blog.docker.com/2016/05/docker-security-scanning/)) --- ## Upgrading with Compose Compose makes this particularly easy: ```bash docker-compose build --pull --no-cache docker-compose up -d ``` This will automatically: - pull base images; - rebuild all container images; - bring up the new containers. Remember: Compose will automatically move our volumes to the new containers, so data is preserved. --- class: title # Resiliency
and
high availability --- ## What are our single points of failure? - The TLS certificates created by Machine are on `node1` - We have only one Swarm manager - If a node (running containers) is down or unreachable, our application will be affected --- # Distributing Machine credentials - All the credentials (TLS keys and certs) are on node1
(the node on which we ran `docker-machine create`) - If we lose node1, we're toast - We need to move (or copy) the credentials somewhere safe - Credentials are regular files, and relatively small - Ah, if only we had a highly available, hierarchic store ... -- - Wait a minute, we have one! -- (That's Consul, if you were wondering) --- ## Storing files in Consul - We will use [Benjamin Wester's consulfs]( https://github.com/bwester/consulfs) - It mounts a Consul key/value store as a local filesystem - Performance will be horrible
(don't run a database on top of that!) - But to store files of a few KB, nobody will notice - We will copy/link/sync... `~/.docker/machine` to Consul --- ## Installing consulfs - Option 1: install Go, git clone, go build ... - Option 2: be lazy and use [jpetazzo/consulfs]( https://hub.docker.com/r/jpetazzo/consulfs/) .exercise[ - Be lazy and use the Docker image: ```bash eval $(docker-machine env node1) docker run --rm -v /usr/local/bin:/target jpetazzo/consulfs ``` ] Note: the `jpetazzo/consulfs` image contains the `consulfs` binary. It copies it to `/target` (if `/target` is a volume). --- ## Can't we run consulfs in a container? - Yes we can! - The filesystem will be mounted in the container - It won't be visible outside of the container (from the host) - We can use *shared mounts* to propagate mounts from containers to Docker - But propagating from Docker to the host requires particular systemd flags - ... So we'll run it on the host for now --- ## Running consulfs - The `consulfs` binary takes two arguments: - the Consul server address - a mount point (that has to be created first) .exercise[ - Create a mount point and mount Consul as a local filesystem: ```bash mkdir ~/consul consulfs localhost:8500 ~/consul ``` ] Leave this running in the foreground. --- ## Checking our consulfs mount point - All key/values will be visible: - Swarm discovery - overlay networks - ... anything you put in Consul! .exercise[ - Check that Consul key/values are visible: ```bash ls -l ~/consul/ ``` ] --- ## Copying our credentials to Consul - Use standard UNIX commands - Don't try to preserve permissions, though (`consulfs` doesn't store permissions) .exercise[ - Copy Machine credentials into Consul: ```bash cp -r ~/.docker/machine/. ~/consul/machine/ ``` ] (This command can be re-executed to update the copy.) 
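To see why the command can safely be re-executed, here is a small sketch of the same copy pattern between two scratch directories. The temporary directories and dummy files are hypothetical stand-ins for `~/.docker/machine` and `~/consul/machine`; no Consul mount is involved.

```bash
# Sketch: the "cp -r SRC/. DST/" pattern, on scratch directories
# (hypothetical stand-ins for ~/.docker/machine and ~/consul/machine).
SRC=$(mktemp -d)
DST=$(mktemp -d)
mkdir -p "$SRC/machines/node1"
echo "dummy-cert" > "$SRC/machines/node1/cert.pem"

# First copy: "SRC/." copies the contents of SRC, not the directory itself.
mkdir -p "$DST/machine"
cp -r "$SRC/." "$DST/machine/"

# A new file appears later; running the same command again updates the copy.
echo "dummy-key" > "$SRC/machines/node1/key.pem"
cp -r "$SRC/." "$DST/machine/"

# diff -r exits with status 0 when both trees have identical contents.
diff -r "$SRC" "$DST/machine" && echo "copy is up to date"
```

Note that `cp -r` only adds or overwrites files: anything deleted from the source stays in the copy until it is removed by hand.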
--- ## Install consulfs on another node - We will repeat the previous steps to install consulfs .exercise[ - Connect to node2: ```bash ssh node2 ``` - Install `consulfs`: ```bash docker run --rm -v /usr/local/bin:/target jpetazzo/consulfs ``` ] --- ## Mount Consul - The procedure is still the same as on the first node .exercise[ - Create the mount point: ```bash mkdir ~/consul ``` - Mount the filesystem: ```bash consulfs localhost:8500 ~/consul & ``` ] At this point, `ls -l ~/consul` should show `docker` and `machine` directories. --- ## Access the credentials from the other node - We will create a symlink - We could also copy the credentials .exercise[ - Create the symlink: ```bash mkdir -p ~/.docker/ ln -s ~/consul/machine ~/.docker/ ``` - Check that all nodes are visible: ```bash docker-machine ls ``` ] --- ## A few words on this strategy - Anyone accessing Consul can control your Docker cluster
(to be fair: anyone accessing Consul can wreak serious havoc on your cluster anyway) - ConsulFS doesn't support *all* POSIX operations, so a few things (like `mv`) will not work - As a consequence, with Machine 0.6, you cannot run `docker-machine create` directly on top of ConsulFS --- ## What if Consul becomes unavailable? - If Consul becomes unavailable (e.g. loses quorum),
you won't be able to access your credentials - If Consul becomes unavailable ...
your cluster will be in a bad state anyway - You can still access each Docker Engine over the local UNIX socket
(and repair Consul that way) --- # Highly available Swarm managers - Until now, the Swarm manager was a SPOF
(Single Point Of Failure) - Swarm has support for replication - When replication is enabled, you deploy multiple (identical) managers - one will be "primary" - the other(s) will be "secondary" - this is determined automatically
(through *leader election*) --- ## Swarm leader election - The leader election mechanism relies on a key/value store
(Consul, etcd, Zookeeper) - There is no requirement on the number of replicas
(the quorum is achieved through the key/value store) - When the leader (or "primary") is unavailable,
a new election happens automatically - You can issue API requests to any manager:
if you talk to a secondary, it forwards to the primary .warning[Until recently there was a bug when the Consul cluster itself had a leader election;
see [docker/swarm#1782](https://github.com/docker/swarm/issues/1782).] --- ## Swarm replication in practice - We need to give two extra flags to the Swarm manager: - `--replication` *enables replication (duh!)* - `--advertise ip.ad.dr.ess:port` *address and port where this Swarm manager is reachable* - Do you deploy with Docker Machine?
Then you can use `--swarm-opt` to automatically pass flags to the Swarm manager --- ## Cleaning up our current Swarm containers - We will use Docker Machine to re-provision Swarm - We need to: - remove the nodes from the Machine registry - remove the Swarm containers .exercise[ - Remove the current configuration (remember to go back to node1!): ```bash for N in 1 2 3 4 5; do ssh node$N docker rm -f swarm-agent swarm-agent-master docker-machine rm -f node$N done ``` ] --- ## Re-deploy with the new configuration - This time, all nodes can be deployed identically
(instead of 1 manager + 4 non-managers) .exercise[ ```bash grep node[12345] /etc/hosts | grep -v ^127 | while read IPADDR NODENAME; do docker-machine create --driver generic \ --engine-opt cluster-store=consul://localhost:8500 \ --engine-opt cluster-advertise=eth0:2376 \ --swarm --swarm-master \ --swarm-discovery consul://localhost:8500 \ --swarm-opt replication --swarm-opt advertise=$IPADDR:3376 \ --generic-ssh-user docker --generic-ip-address $IPADDR $NODENAME done ``` ] .small[ Note: Consul is still running thanks to the `--restart=always` policy. Other containers are now stopped, because the engines have been reconfigured and restarted. ] --- ## Assess our new cluster health - The output of `docker info` will tell us the status of the node that we are talking to (primary or replica) - If we talk to a replica, it will tell us who is the primary .exercise[ - Talk to a random node, and ask its view of the cluster: ```bash eval $(docker-machine env node3 --swarm) docker info | grep -e ^Name -e ^Role -e ^Primary ``` ] Note: `docker info` is one of the only commands that will work even when there is no elected primary. This helps debugging. --- ## Test Swarm manager failover - The previous command told us which node was the primary manager - if `Role` is `primary`,
then the primary is indicated by `Name` - if `Role` is `replica`,
then the primary is indicated by `Primary` .exercise[ - Kill the primary manager: ```bash ssh node`N` docker kill swarm-agent-master ``` ] Look at the output of `docker info` every few seconds. --- # Highly available containers - Swarm has support for *rescheduling* on node failure - It has to be explicitly enabled on a per-container basis - When the primary manager detects that a node goes down,
those containers are rescheduled elsewhere - If the containers can't be rescheduled (constraints issue),
they are lost (there is no reconciliation loop yet) - In Swarm 1.1, this is an *experimental* feature
(To enable it, you must pass the `--experimental` flag when you start Swarm itself!) - In Swarm 1.2, you don't need the `--experimental` flag anymore --- ## About Swarm generic flags - Some flags like `--experimental` and `--debug` must be *before* the Swarm command
(i.e. `docker run swarm --debug manage ...`) - We cannot use Docker Machine to pass that flag ☹
(Machine adds flags *after* the Swarm command) - Instead, we can use a custom Swarm image: ```dockerfile FROM swarm ENTRYPOINT ["/swarm", "--debug"] ``` - We can tell Machine to use this with `--swarm-image` --- ## Start a resilient container - By default, containers will not be restarted when their node goes down - You must pass an explicit *rescheduling policy* to make that happen - For now, the only policy is "on-node-failure" .exercise[ - Start a container with a rescheduling policy: ```bash docker run --name highlander -d -e reschedule:on-node-failure nginx ``` ] Check that the container is up and running. --- ## Simulate a node failure - We will reboot the node running this container - Swarm will reschedule it .exercise[ - Check on which node the container is running: `NODE=$(docker inspect --format '{{.Node.Name}}' highlander)` - Reboot that node:
`ssh $NODE sudo reboot` - Check that the container has been rescheduled:
`docker ps -a` ] --- ## Reboots - When rebooting a node, Docker is stopped cleanly, and containers are stopped - Our container is rescheduled, but not started - To simulate a "proper" failure, we can use the Chaos Monkey script instead ```bash ~/orchestration-workshop/bin/chaosmonkey $NODE
``` --- ## Cluster reconciliation - After the node rejoins the cluster, we can end up with duplicate containers .exercise[ - Once the node is back, remove one of the extraneous containers: ```bash docker rm -f node`N`/highlander ``` ] --- ## .warning[Caveats] - There are some corner cases when the node is also the Swarm leader or the Consul leader; this is being improved right now! - For now, the safest way to address this is to run the Consul servers, the Swarm managers, and your containers on different nodes. - Swarm doesn't gracefully handle the fact that, after the reboot, you have *two* containers named `highlander`, and attempts to manipulate the container by its name will not work. This will be improved too. --- class: title # Conclusions --- ## Swarm cluster deployment - We saw how to use Machine with the `generic` driver to turn any set of machines into a Swarm cluster - This can trivially be adapted to provision cloud instances on the fly (using "normal" drivers of Docker Machine) - For auto-scaling, you can use e.g.: - private admin-only network - no TLS - static discovery on a /24 to /20 network (depending on your needs) --- ## Key/value store - We saw an easy deployment method for Consul - This is good for 3 to 9 nodes - Remember: Raft write performance *degrades* as you add nodes! - For bigger clusters: - have e.g.
5 "static" server nodes - put them in a round robin DNS record set (or behind an ELB) - run a normal agent on the other nodes --- ## App deployment - We saw how to transform a Compose file into a series of build artifacts - using S3 or another object store is trivial - We saw how to programmatically add load balancing, logging - This can be improved further by using variable interpolation for the image tags - Rolling deploys are relatively straightforward, but: - I recommend aiming directly for blue/green (or canary) deploys - In the production stack, abstract stateful services with ambassadors --- ## Operations - We saw how to set up an ELK stack and send logs to it in record time *Important: this doesn't mean that operating ELK suddenly became an easy thing!* - We saw how to translate a few basic tasks to containerized environments (backups, network traffic analysis) - Debugging is surprisingly similar to what it used to be: - remember that containerized processes are normal processes running on the host - `docker exec` is your friend - also: `docker run --net host --pid host -v /:/hostfs alpine chroot /hostfs` --- ## Things we haven't covered - Per-container system metrics (look at cAdvisor, Snap, Prometheus...) - Application metrics (continue to use whatever you were using before) - Supervision (whatever you were using before still works exactly the same way) - Tracking access to credentials and sensitive information (see Vault, Keywhiz...) - ... (tell me what I should cover in future workshops!) ... --- ## Resilience - We saw how to store important data (credentials) in Consul - We saw how to achieve H/A for Swarm itself - Rescheduling policies give us basic H/A for containers - This will be improved in future releases - Docker in general, and Swarm in particular, move *fast* - Current high availability features are not Chaos-Monkey proof (yet) - We (well, the Swarm team) are working to change that --- ## What's next? - November 2015: Compose 1.5 + Engine 1.9 =
first release with multi-host networking - January 2016: Compose 1.6 + Engine 1.10 =
embedded DNS server, experimental high availability - April 2016: Compose 1.7 + Engine 1.11 =
round robin DNS records, huge improvements in HA - Next release: another truckload of features - I will deliver this workshop about twice a month - Check out the GitHub repo for updated content!
(there is a tag for each big round of updates) --- ## Overall complexity - The scripts used here are pretty simple (each is less than 100 LOCs) - You can easily rewrite them in your favorite language,
adapt and customize them, in a few hours - FYI: those scripts are smaller and simpler than the scripts (cloud-init etc.) used to deploy the VMs for this workshop! - Docker Inc. has commercial products to wrap all this: - Docker Cloud
(manage your Docker nodes from a SaaS portal) - Docker Datacenter
(buzzword-compliant management solution:
turnkey, enterprise-class, on-premise, etc.) --- class: title # Thanks!
Questions? ## [@jpetazzo](https://twitter.com/jpetazzo)
[@docker](https://twitter.com/docker)