# Metrics collection

- We want to gather metrics in a central place

- We will gather node metrics and container metrics

- We want a nice interface to view them (graphs)

---

## Node metrics

- CPU, RAM, disk usage on the whole node

- Total number of processes running, and their states

- Number of open files, sockets, and their states

- I/O activity (disk, network), per operation or volume

- Physical/hardware (when applicable): temperature, fan speed ...

- ... and much more!
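
Before we automate anything, most of these can be eyeballed with standard Linux tools; the commands below are purely illustrative (run them on any node):

```bash
# CPU, RAM, and process states at a glance
top -b -n 1 | head -n 15

# Disk usage per filesystem
df -h

# Summary of sockets and their states
ss -s
```
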

---

## Container metrics

- Similar to node metrics, but not totally identical

- RAM breakdown will be different

  - active vs inactive memory
  - some memory is *shared* between containers, and accounted specially

- I/O activity is also harder to track

  - async writes can cause deferred "charges"
  - some page-ins are also shared between containers

For details about container metrics, see:
<br/>
http://jpetazzo.github.io/2013/10/08/docker-containers-metrics/
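
If you are curious where these numbers come from, you can peek directly at the cgroup pseudo-files. This is a minimal sketch, assuming a cgroup v1 hierarchy mounted under /sys/fs/cgroup and containers placed in the `docker` cgroup (paths vary with distro and Docker version):

```bash
# Pick the ID of any running container
CID=$(docker ps -q | head -n 1)

# Memory breakdown: look for active_anon, inactive_anon, active_file, inactive_file
cat /sys/fs/cgroup/memory/docker/$CID*/memory.stat

# Block I/O charged to that container
cat /sys/fs/cgroup/blkio/docker/$CID*/blkio.throttle.io_service_bytes
```
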

---

class: snap, prom

## Tools

We will build *two* different metrics pipelines:

- One based on Intel Snap,

- Another based on Prometheus.

If you're using Play-With-Docker, skip the exercises
relevant to Intel Snap (we rely on an SSH server to deploy,
and PWD doesn't have that yet).

---

class: snap

## First metrics pipeline

We will use three open source Go projects for our first metrics pipeline:

- Intel Snap

  Collects, processes, and publishes metrics

- InfluxDB

  Stores metrics

- Grafana

  Displays metrics visually

---

class: snap

## Snap

- [github.com/intelsdi-x/snap](https://github.com/intelsdi-x/snap)

- Can collect, process, and publish metric data

- Doesn't store metrics

- Works as a daemon (snapd) controlled by a CLI (snapctl)

- Offloads collecting, processing, and publishing to plugins

- Does nothing out of the box; configuration required!

- Docs: https://github.com/intelsdi-x/snap/blob/master/docs/

---

class: snap

## InfluxDB

- Snap doesn't store metrics data

- InfluxDB is specifically designed for time-series data

  - CRud vs. CRUD (you rarely if ever update/delete data)

  - orthogonal read and write patterns

  - storage format optimization is key (for disk usage and performance)

- Snap has a plugin that can *publish* to InfluxDB

---

class: snap

## Grafana

- Snap cannot show graphs

- InfluxDB cannot show graphs

- Grafana will take care of that

- Grafana can read data from InfluxDB and display it as graphs

---

class: snap

## Getting and setting up Snap

- We will install Snap directly on the nodes

- Release tarballs are available from GitHub

- We will use a *global service*
  <br/>(started on all nodes, including nodes added later)

- This service will download and unpack Snap in /opt and /usr/local

- /opt and /usr/local will be bind-mounted from the host

- This service will effectively install Snap on the hosts

---

class: snap

## The Snap installer service

- This will get Snap on all nodes

.exercise[

```bash
docker service create --restart-condition=none --mode global \
       --mount type=bind,source=/usr/local/bin,target=/usr/local/bin \
       --mount type=bind,source=/opt,target=/opt centos sh -c '
SNAPVER=v0.16.1-beta
RELEASEURL=https://github.com/intelsdi-x/snap/releases/download/$SNAPVER
curl -sSL $RELEASEURL/snap-$SNAPVER-linux-amd64.tar.gz |
     tar -C /opt -zxf-
curl -sSL $RELEASEURL/snap-plugins-$SNAPVER-linux-amd64.tar.gz |
     tar -C /opt -zxf-
ln -s snap-$SNAPVER /opt/snap
for BIN in snapd snapctl; do ln -s /opt/snap/bin/$BIN /usr/local/bin/$BIN; done
' # If you copy-paste that block, do not forget that final quote ☺
```

]

---

class: snap

## First contact with `snapd`

- The core of Snap is `snapd`, the Snap daemon

- Application made up of a REST API, control module, and scheduler module

.exercise[

- Start `snapd` with plugin trust disabled and log level set to debug:
  ```bash
  snapd -t 0 -l 1
  ```

]

- More resources:

  https://github.com/intelsdi-x/snap/blob/master/docs/SNAPD.md

  https://github.com/intelsdi-x/snap/blob/master/docs/SNAPD_CONFIGURATION.md

---

class: snap

## Using `snapctl` to interact with `snapd`

- Let's load a *collector* plugin and a *publisher* plugin

.exercise[

- Open a new terminal

- Load the psutil collector plugin:
  ```bash
  snapctl plugin load /opt/snap/plugin/snap-plugin-collector-psutil
  ```

- Load the file publisher plugin:
  ```bash
  snapctl plugin load /opt/snap/plugin/snap-plugin-publisher-mock-file
  ```

]

---

class: snap

## Checking what we've done

- Good to know: Docker CLI uses `ls`, Snap CLI uses `list`

.exercise[

- See your loaded plugins:
  ```bash
  snapctl plugin list
  ```

- See the metrics you can collect:
  ```bash
  snapctl metric list
  ```

]

---

class: snap

## Actually collecting metrics: introducing *tasks*

- To start collecting/processing/publishing metric data, you need to create a *task*

- A *task* indicates:

  - *what* to collect (which metrics)
  - *when* to collect it (e.g. how often)
  - *how* to process it (e.g. use it directly, or compute moving averages)
  - *where* to publish it

- Tasks can be defined with manifests written in JSON or YAML

- Some plugins, such as the Docker collector, allow for wildcards (\*) in the metrics "path"
  <br/>(see snap/docker-influxdb.json)

- More resources:

  https://github.com/intelsdi-x/snap/blob/master/docs/TASKS.md

---

class: snap

## Our first task manifest

```yaml
version: 1
schedule:
  type: "simple" # collect on a set interval
  interval: "1s" # of every 1s
max-failures: 10
workflow:
  collect: # first collect
    metrics: # metrics to collect
      /intel/psutil/load/load1: {}
    config: # there is no configuration
    publish: # after collecting, publish
      -
        plugin_name: "file" # use the file publisher
        config:
          file: "/tmp/snap-psutil-file.log" # write to this file
```

---

class: snap

## Creating our first task

- The task manifest shown on the previous slide is stored in `snap/psutil-file.yml`.

.exercise[

- Create a task using the manifest:
  ```bash
  cd ~/orchestration-workshop/snap
  snapctl task create -t psutil-file.yml
  ```

]

The output should look like the following:
```
Using task manifest to create task
Task created
ID: 240435e8-a250-4782-80d0-6fff541facba
Name: Task-240435e8-a250-4782-80d0-6fff541facba
State: Running
```

---

class: snap

## Checking existing tasks

.exercise[

- This will confirm that our task is running correctly, and remind us of its task ID:
  ```bash
  snapctl task list
  ```

]

The output should look like the following:
```
ID             NAME                STATE     HIT   MISS  FAIL  CREATED
24043...acba   Task-24043...acba   Running   4     0     0     2:34PM 8-13-2016
```

---

class: snap

## Viewing our task dollars at work

- The task is using a very simple publisher, `mock-file`

- That publisher just writes text lines in a file (one line per data point)

.exercise[

- Check that the data is indeed flowing:
  ```bash
  tail -f /tmp/snap-psutil-file.log
  ```

]

To exit, hit `^C`

---

class: snap

## Debugging tasks

- When a task is not directly writing to a local file, use `snapctl task watch`

- `snapctl task watch` will stream the metrics you are collecting to STDOUT

.exercise[

```bash
snapctl task watch <ID>
```

]

To exit, hit `^C`

---

class: snap

## Stopping Snap

- Our Snap deployment has a few flaws:

  - snapd was started manually

  - it is running on a single node

  - the configuration is purely local

--

class: snap

- We want to change that!

--

class: snap

- But first, go back to the terminal where `snapd` is running, and hit `^C`

- All tasks will be stopped; all plugins will be unloaded; Snap will exit

---

class: snap

## Snap Tribe Mode

- Tribe is Snap's clustering mechanism

- When tribe mode is enabled, nodes can join *agreements*

- When a node in an *agreement* does something (e.g. load a plugin or run a task),
  <br/>other nodes of that agreement do the same thing

- We will use it to load the Docker collector and InfluxDB publisher on all nodes,
  <br/>and run a task to use them

- Without tribe mode, we would have to load plugins and run tasks manually on every node

- More resources:

  https://github.com/intelsdi-x/snap/blob/master/docs/TRIBE.md

---

class: snap

## Running Snap itself on every node

- Snap runs in the foreground, so you need to use `&` or start it in tmux

.exercise[

- Run the following command *on every node:*
  ```bash
  snapd -t 0 -l 1 --tribe --tribe-seed node1:6000
  ```

]

If you're *not* using Play-With-Docker, there is another way to start Snap!

---

class: snap

## Starting a daemon through SSH

.warning[Hackety hack ahead!]

- We will create a *global service*

- That global service will install an SSH client

- With that SSH client, the service will connect back to its local node
  <br/>(i.e. "break out" of the container, using the SSH key that we provide)

- Once logged into the node, the service starts snapd with Tribe Mode enabled

---

class: snap

## Running Snap itself on every node

- I might go to hell for showing you this, but here it goes ...

.exercise[

- Start Snap all over the place:
  ```bash
  docker service create --name snapd --mode global \
         --mount type=bind,source=$HOME/.ssh/id_rsa,target=/sshkey \
         alpine sh -c "
           apk add --no-cache openssh-client &&
           ssh -o StrictHostKeyChecking=no -i /sshkey docker@172.17.0.1 \
               sudo snapd -t 0 -l 1 --tribe --tribe-seed node1:6000
         " # If you copy-paste that block, don't forget that final quote :-)
  ```

]

Remember: this *does not work* with Play-With-Docker (which doesn't have SSH).

---

class: snap

## Viewing the members of our tribe

- If everything went fine, Snap is now running in tribe mode

.exercise[

- View the members of our tribe:
  ```bash
  snapctl member list
  ```

]

This should show the 5 nodes with their hostnames.

---

class: snap

## Create an agreement

- We can now create an *agreement* for our plugins and tasks

.exercise[

- Create an agreement; make sure to use the same name all along:
  ```bash
  snapctl agreement create docker-influxdb
  ```

]

The output should look like the following:

```
Name               Number of Members   plugins   tasks
docker-influxdb    0                   0         0
```

---

class: snap

## Instruct all nodes to join the agreement

- We don't need another fancy global service!

- We can join nodes from any existing node of the cluster

.exercise[

- Add all nodes to the agreement:
  ```bash
  snapctl member list | tail -n +2 |
      xargs -n1 snapctl agreement join docker-influxdb
  ```

]

The last bit of output should look like the following:
```
Name               Number of Members   plugins   tasks
docker-influxdb    5                   0         0
```

---

class: snap

## Start a container on every node

- The Docker plugin requires at least one container to be started

- Normally, at this point, you will have at least one container on each node

- But just in case you did things differently, let's create a dummy global service

.exercise[

- Create an alpine container on the whole cluster:
  ```bash
  docker service create --name ping --mode global alpine ping 8.8.8.8
  ```

]

---

class: snap

## Running InfluxDB

- We will create a service for InfluxDB

- We will use the official image

- InfluxDB uses multiple ports:

  - 8086 (HTTP API; we need this)

  - 8083 (admin interface; we need this)

  - 8088 (cluster communication; not needed here)

  - more ports for other protocols (graphite, collectd...)

- We will just publish the first two

---

class: snap

## Creating the InfluxDB service

.exercise[

- Start an InfluxDB service, publishing ports 8083 and 8086:
  ```bash
  docker service create --name influxdb \
         --publish 8083:8083 \
         --publish 8086:8086 \
         influxdb:0.13
  ```

]

Note: this will allow any node to publish metrics data to `localhost:8086`,
and it will allow us to access the admin interface by connecting to any node
on port 8083.

.warning[Make sure to use InfluxDB 0.13; a few things changed in 1.0
(like, the name of the default retention policy is now "autogen") and
this breaks a few things.]

---

class: snap

## Setting up InfluxDB

- We need to create the "snap" database

.exercise[

- Open port 8083 with your browser

- Enter the following query in the query box:
  ```
  CREATE DATABASE "snap"
  ```

- In the top-right corner, select "Database: snap"

]

Note: the InfluxDB query language *looks like* SQL but it's not.
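
If you prefer the command line, the same database can also be created through InfluxDB's HTTP API; this is a sketch, assuming port 8086 is published on the node you are logged into:

```bash
# Ask InfluxDB to create the "snap" database over its HTTP API
curl -G "http://localhost:8086/query" --data-urlencode 'q=CREATE DATABASE "snap"'
```
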

???

## Setting a retention policy

- When graduating to 1.0, InfluxDB changed the name of the default policy

- It used to be "default" and it is now "autogen"

- Snap still uses "default" and this results in errors

.exercise[

- Create a "default" retention policy by entering the following query in the box:
  ```
  CREATE RETENTION POLICY "default" ON "snap" DURATION 1w REPLICATION 1
  ```

]

---

class: snap

## Load Docker collector and InfluxDB publisher

- We will load plugins on the local node

- Since our local node is a member of the agreement, all other
  nodes in the agreement will also load these plugins

.exercise[

- Load the Docker collector:
  ```bash
  snapctl plugin load /opt/snap/plugin/snap-plugin-collector-docker
  ```

- Load the InfluxDB publisher:
  ```bash
  snapctl plugin load /opt/snap/plugin/snap-plugin-publisher-influxdb
  ```

]

---

class: snap

## Start a simple collection task

- Again, we will create a task on the local node

- The task will be replicated on the other nodes that are members of the same agreement

.exercise[

- Load a task manifest file collecting a couple of metrics on all containers,
  <br/>and sending them to InfluxDB:
  ```bash
  cd ~/orchestration-workshop/snap
  snapctl task create -t docker-influxdb.json
  ```

]

Note: the task description sends metrics to the InfluxDB API endpoint
located at 127.0.0.1:8086. Since the InfluxDB container is published
on port 8086, 127.0.0.1:8086 always routes traffic to the InfluxDB
container.

---

class: snap

## If things go wrong...

Note: if a task runs into a problem (e.g. it's trying to publish
to a metrics database, but the database is unreachable), the task
will be stopped.

You will have to restart it manually by running:

```bash
snapctl task enable <ID>
snapctl task start <ID>
```

This must be done *per node*. Alternatively, you can delete and re-create
the task (it will then be deleted and re-created on all nodes).

---

class: snap

## Check that metric data shows up in InfluxDB

- Let's check existing data with a few manual queries in the InfluxDB admin interface

.exercise[

- List "measurements":
  ```
  SHOW MEASUREMENTS
  ```

  (This should show two generic entries corresponding to the two collected metrics.)

- View time series data for one of the metrics:
  ```
  SELECT * FROM "intel/docker/stats/cgroups/cpu_stats/cpu_usage/total_usage"
  ```

  (This should show a list of data points with **time**, **docker_id**, **source**, and **value**.)

]

---

class: snap

## Deploy Grafana

- We will use an almost-official image, `grafana/grafana`

- We will publish Grafana's web interface on its default port (3000)

.exercise[

- Create the Grafana service:
  ```bash
  docker service create --name grafana --publish 3000:3000 grafana/grafana:3.1.1
  ```

]

---

class: snap

## Set up Grafana

.exercise[

- Open port 3000 with your browser

- Log in with "admin" as both the username and the password

- Click on the Grafana logo (the orange spiral in the top left corner)

- Click on "Data Sources"

- Click on "Add data source" (green button on the right)

]

---

class: snap

## Add InfluxDB as a data source for Grafana

.small[

Fill the form exactly as follows:
- Name = "snap"
- Type = "InfluxDB"

In HTTP settings, fill as follows:
- Url = "http://(IP.address.of.any.node):8086"
- Access = "direct"
- Leave HTTP Auth untouched

In InfluxDB details, fill as follows:
- Database = "snap"
- Leave user and password blank

Finally, click on "add": you should see a green message saying "Success - Data source is working".
If you see an orange box (sometimes without a message), it means that you got something wrong. Triple check everything again.

]

---

class: snap



---

class: snap

## Create a dashboard in Grafana

.exercise[

- Click on the Grafana logo again (the orange spiral in the top left corner)

- Hover over "Dashboards"

- Click "+ New"

- Click on the little green rectangle that appeared in the top left

- Hover over "Add Panel"

- Click on "Graph"

]

At this point, you should see a sample graph showing up.

---

class: snap

## Setting up a graph in Grafana

.exercise[

- Panel data source: select snap
- Click on the SELECT metrics query to expand it
- Click on "select measurement" and pick CPU usage
- Click on the "+" right next to "WHERE"
- Select "docker_id"
- Select the ID of a container of your choice (e.g. the one running InfluxDB)
- Click on the "+" on the right of the "SELECT" line
- Add "derivative"
- In the "derivative" option, select "1s"
- In the top right corner, click on the clock, and pick "last 5 minutes"

]

Congratulations, you are viewing the CPU usage of a single container!

---

class: snap



---

class: snap, prom

## Before moving on ...

- Leave that tab open!

- We are going to set up *another* metrics system

- ... And then compare both graphs side by side

---

class: snap, prom

## Prometheus vs. Snap

- Prometheus is another metrics collection system

- Snap *pushes* metrics; Prometheus *pulls* them

---

class: prom

## Prometheus components

- The *Prometheus server* pulls, stores, and displays metrics

- Its configuration defines a list of *exporter* endpoints
  <br/>(that list can be dynamic, using e.g. Consul, DNS, Etcd...)

- The exporters expose metrics over HTTP using a simple line-oriented format

  (An optimized format using protobuf is also possible)

---

class: prom

## It's all about the `/metrics`

- This is what the *node exporter* looks like:

  http://demo.robustperception.io:9100/metrics

- Prometheus itself exposes its own internal metrics, too:

  http://demo.robustperception.io:9090/metrics

- A *Prometheus server* will *scrape* URLs like these

  (It can also use protobuf to avoid the overhead of parsing line-oriented formats!)
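
For reference, here are a few lines of the kind of output such an endpoint returns (the metric names are typical node exporter metrics; the values are made up):

```
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7
node_cpu{cpu="cpu0",mode="user"} 2391.3
```
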

---

class: prom-manual

## Collecting metrics with Prometheus on Swarm

- We will run two *global services* (i.e. scheduled on all our nodes):

  - the Prometheus *node exporter* to get node metrics

  - Google's cAdvisor to get container metrics

- We will run a Prometheus server to scrape these exporters

- The Prometheus server will be configured to use DNS service discovery

- We will use `tasks.<servicename>` for service discovery

- All these services will be placed on a private internal network

---

class: prom-manual

## Creating an overlay network for Prometheus

- This is the easiest step ☺

.exercise[

- Create an overlay network:
  ```bash
  docker network create --driver overlay prom
  ```

]

---

class: prom-manual

## Running the node exporter

- The node exporter *should* run directly on the hosts

- However, it can run from a container, if configured properly
  <br/>
  (it needs to access the host's filesystems, in particular /proc and /sys)

.exercise[

- Start the node exporter:
  ```bash
  docker service create --name node --mode global --network prom \
      --mount type=bind,source=/proc,target=/host/proc \
      --mount type=bind,source=/sys,target=/host/sys \
      --mount type=bind,source=/,target=/rootfs \
      prom/node-exporter \
      -collector.procfs /host/proc \
      -collector.sysfs /host/sys \
      -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
  ```

]

---

class: prom-manual

## Running cAdvisor

- Likewise, cAdvisor *should* run directly on the hosts

- But it can run in containers, if configured properly

.exercise[

- Start the cAdvisor collector:
  ```bash
  docker service create --name cadvisor --network prom --mode global \
      --mount type=bind,source=/,target=/rootfs \
      --mount type=bind,source=/var/run,target=/var/run \
      --mount type=bind,source=/sys,target=/sys \
      --mount type=bind,source=/var/lib/docker,target=/var/lib/docker \
      google/cadvisor:latest
  ```

]

---

class: prom-manual

## Configuring the Prometheus server

This will be our configuration file for Prometheus:

```yaml
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    dns_sd_configs:
      - names: ['tasks.node']
        type: 'A'
        port: 9100
  - job_name: 'cadvisor'
    dns_sd_configs:
      - names: ['tasks.cadvisor']
        type: 'A'
        port: 8080
```

---

class: prom-manual

## Passing the configuration to the Prometheus server

- We need to provide our custom configuration to the Prometheus server

- The easiest solution is to create a custom image bundling this configuration

- We will use a very simple Dockerfile:
  ```dockerfile
  FROM prom/prometheus:v1.4.1
  COPY prometheus.yml /etc/prometheus/prometheus.yml
  ```

  (The configuration file and the Dockerfile are in the `prom` subdirectory)

- We will build this image, and push it to our local registry

- Then we will create a service using this image

Note: it is also possible to use a `config` to inject that configuration file
without having to create this ad-hoc image.

---

class: prom-manual

## Building our custom Prometheus image

- We will use the local registry started previously on 127.0.0.1:5000

.exercise[

- Build the image using the provided Dockerfile:
  ```bash
  docker build -t 127.0.0.1:5000/prometheus ~/orchestration-workshop/prom
  ```

- Push the image to our local registry:
  ```bash
  docker push 127.0.0.1:5000/prometheus
  ```

]

---

class: prom-manual

## Running our custom Prometheus image

- That's the only service that needs to be published

  (If we want to access Prometheus from outside!)

.exercise[

- Start the Prometheus server:
  ```bash
  docker service create --network prom --name prom \
         --publish 9090:9090 127.0.0.1:5000/prometheus
  ```

]

---

class: prom-auto

## Deploying Prometheus on our cluster

- We will use a stack definition (once again)

.exercise[

- Make sure we are in the stacks directory:
  ```bash
  cd ~/orchestration-workshop/stacks
  ```

- Build, ship, and run the Prometheus stack:
  ```bash
  docker-compose -f prometheus.yml build
  docker-compose -f prometheus.yml push
  docker stack deploy -c prometheus.yml prometheus
  ```

]

---

class: prom

## Checking our Prometheus server

- First, let's make sure that Prometheus is correctly scraping all metrics

.exercise[

- Open port 9090 with your browser

- Click on "status", then "targets"

]

You should see 11 endpoints (5 cadvisor, 5 node, 1 prometheus).

Their state should be "UP".
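
You can also check this from the command line by querying the `up` metric (one series per scrape target; 1 means the last scrape succeeded) through the Prometheus HTTP API. This is a sketch, assuming port 9090 is reachable on the node you are logged into:

```bash
# One result per target; the value is 1 if the target is up, 0 otherwise
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up'
```
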

---

class: prom-auto, config

## Injecting a configuration file

(New in Docker Engine 17.06)

- We are creating a custom image *just to inject a configuration*

- Instead, we could use the base Prometheus image + a `config`

- A `config` is a blob (usually, a configuration file) that:

  - is created and managed through the Docker API (and CLI)

  - gets persisted into the Raft log (i.e. safely)

  - can be associated with a service
    <br/>
    (this injects the blob as a plain file in the service's containers)

---

class: prom-auto, config

## Differences between `config` and `secret`

The two are very similar, but ... (see the CLI example below)

- `configs`:

  - can be injected into any filesystem location

  - can be viewed and extracted using the Docker API or CLI

- `secrets`:

  - can only be injected into `/run/secrets`

  - are never stored in clear text on disk

  - cannot be viewed or extracted with the Docker API or CLI
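
To make the difference concrete, here is a minimal sketch using the CLI (the file and object names are placeholders; requires Docker 17.06+ for configs):

```bash
# Create a config and a secret from two throwaway files
echo "some configuration" > demo.conf
echo "some credentials"   > demo.key
docker config create demo.conf demo.conf
docker secret create demo.key demo.key

# The config's payload can be retrieved (BASE64-encoded in the "Spec.Data" field) ...
docker config inspect demo.conf

# ... but the secret's payload cannot: inspect only shows metadata
docker secret inspect demo.key
```
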

---

class: prom-auto, config

## Deploying Prometheus with a `config`

- The `config` can be created manually or declared in the Compose file

- This is what our new Compose file looks like:

.small[
```yaml
version: "3.3"

services:

  prometheus:
    image: prom/prometheus:v1.4.1
    ports:
      - "9090:9090"
    configs:
      - source: prometheus
        target: /etc/prometheus/prometheus.yml

...

configs:
  prometheus:
    file: ../prom/prometheus.yml
```
]

(This is from `prometheus+config.yml`)

---

class: prom-auto, config

## Specifying a `config` in a Compose file

- In each service, an optional `configs` section can list as many configs as you want

- Each config can specify:

  - an optional `target` (path to inject the configuration; by default: root of the container)

  - ownership and permissions (by default, the file will be owned by UID 0, i.e. `root`)

- These configs reference top-level `configs` elements

- The top-level configs can be declared as (see the snippet below):

  - *external*, meaning that the config is supposed to be created before you deploy the stack

  - referencing a file, whose content is used to initialize the config
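
For instance, a top-level config declared as *external* would look like this (a sketch; the `prometheus` config would then have to be created beforehand, e.g. with `docker config create`):

```yaml
configs:
  prometheus:
    external: true
```
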

---

class: prom-auto, config

## Re-deploying Prometheus with a config

- We will update the existing stack using `prometheus+config.yml`

.exercise[

- Redeploy the `prometheus` stack:
  ```bash
  docker stack deploy -c prometheus+config.yml prometheus
  ```

- Check that Prometheus still works as intended

  (By connecting to any node of the cluster, on port 9090)

]

---

class: prom-auto, config

## Accessing the config object from the Docker CLI

- Config objects can be viewed from the CLI (or API)

.exercise[

- List existing config objects:
  ```bash
  docker config ls
  ```

- View details about our config object:
  ```bash
  docker config inspect prometheus_prometheus
  ```

]

Note: the content of the config blob is shown with BASE64 encoding.
<br/>
(It doesn't have to be text; it could be an image or any kind of binary content!)

---

class: prom-auto, config

## Extracting a config blob

- Let's retrieve that Prometheus configuration!

.exercise[

- Extract the BASE64 payload with `jq`:
  ```bash
  docker config inspect prometheus_prometheus | jq -r .[0].Spec.Data
  ```

- Decode it with `base64 -d`:
  ```bash
  docker config inspect prometheus_prometheus | jq -r .[0].Spec.Data | base64 -d
  ```

]

---

class: prom

## Displaying metrics directly from Prometheus

- This is easy ... if you are familiar with PromQL

.exercise[

- Click on "Graph", and in "expression", paste the following:
  ```
  sum by (container_label_com_docker_swarm_node_id) (
    irate(
      container_cpu_usage_seconds_total{
        container_label_com_docker_swarm_service_name="dockercoins_worker"
        }[1m]
    )
  )
  ```

- Click on the blue "Execute" button and on the "Graph" tab just below

]

---

class: prom

## Building the query from scratch

- We are going to build the same query from scratch

- This is not meant to be a detailed PromQL course

- This is merely so that you (I) can pretend to know how the previous query works
  <br/>so that your coworkers (you) can be suitably impressed (or not)

(Or, so that we can build other queries if necessary, or adapt if cAdvisor,
Prometheus, or anything else changes and requires editing the query!)

---

class: prom

## Displaying a raw metric for *all* containers

- Click on the "Graph" tab on top

  *This takes us to a blank dashboard*

- Click on the "Insert metric at cursor" drop down, and select `container_cpu_usage_seconds_total`

  *This puts the metric name in the query box*

- Click on "Execute"

  *This fills a table of measurements below*

- Click on "Graph" (next to "Console")

  *This replaces the table of measurements with a series of graphs (after a few seconds)*

---

class: prom

## Selecting metrics for a specific service

- Hover over the lines in the graph

  (Look for the ones that have labels like `container_label_com_docker_...`)

- Edit the query, adding a condition between curly braces:

  .small[`container_cpu_usage_seconds_total{container_label_com_docker_swarm_service_name="dockercoins_worker"}`]

- Click on "Execute"

  *Now we should see one line per CPU per container*

- If you want to select by container ID, you can use a regex match: `id=~"/docker/c4bf.*"`

- You can also specify multiple conditions by separating them with commas

---

class: prom

## Turn counters into rates

- What we see is the total amount of CPU used (in seconds)

- We want to see a *rate* (CPU time used / real time)

- To get a moving average over 1 minute periods, enclose the current expression within:

  ```
  rate ( ... { ... } [1m] )
  ```

  *This should turn our steadily-increasing CPU counter into a wavy graph*

- To get an instantaneous rate, use `irate` instead of `rate`

  (The time window is then used to limit how far behind to look for data if data points
  are missing in case of scrape failure; see [here](https://www.robustperception.io/irate-graphs-are-better-graphs/) for more details!)

  *This should show spikes that were previously invisible because they were smoothed out*
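
For instance, applied to the metric we selected earlier, the expression becomes:

```
rate(
  container_cpu_usage_seconds_total{
    container_label_com_docker_swarm_service_name="dockercoins_worker"
  }[1m]
)
```
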

---

class: prom

## Aggregate multiple data series

- We have one graph per CPU per container; we want to sum them

- Enclose the whole expression within:

  ```
  sum ( ... )
  ```

  *We now see a single graph*

---

class: prom

## Collapse dimensions

- If we have multiple containers we can also collapse just the CPU dimension:

  ```
  sum without (cpu) ( ... )
  ```

  *This shows the same graph, but preserves the other labels*

- Congratulations, you wrote your first PromQL expression from scratch!

(I'd like to thank [Johannes Ziemke](https://twitter.com/discordianfish) and
[Julius Volz](https://twitter.com/juliusvolz) for their help with Prometheus!)

---

class: prom, snap

## Comparing Snap and Prometheus data

- If you haven't set up Snap, InfluxDB, and Grafana, skip this section

- If you have closed the Grafana tab, you might have to set up a new dashboard again

  (Unless you saved it before navigating away)

- To redo the setup, just follow the instructions from the previous chapter again

---

class: prom, snap

## Add Prometheus as a data source in Grafana

.exercise[

- In a new tab, connect to Grafana (port 3000)

- Click on the Grafana logo (the orange spiral in the top-left corner)

- Click on "Data Sources"

- Click on the green "Add data source" button

]

We see the same input form that we filled earlier to connect to InfluxDB.

---

class: prom, snap

## Connecting to Prometheus from Grafana

.exercise[

- Enter "prom" in the name field

- Select "Prometheus" as the source type

- Enter http://(IP.address.of.any.node):9090 in the Url field

- Select "direct" as the access method

- Click on "Save and test"

]

Again, we should see a green box telling us "Data source is working."

Otherwise, double-check every field and try again!

---

class: prom, snap

## Adding the Prometheus data to our dashboard

.exercise[

- Go back to the tab where we had our first Grafana dashboard

- Click on the blue "Add row" button in the lower right corner

- Click on the green tab on the left; select "Add panel" and "Graph"

]

This takes us to the graph editor that we used earlier.

---

class: prom, snap

## Querying Prometheus data from Grafana

The editor is a bit less friendly than the one we used for InfluxDB.

.exercise[

- Select "prom" as the Panel data source

- Paste the following query in the query field:
  ```
  sum without (cpu, id) ( irate (
    container_cpu_usage_seconds_total{
      container_label_com_docker_swarm_service_name="influxdb"}[1m] ) )
  ```

- Click outside of the query field to confirm

- Close the row editor by clicking the "X" in the top right area

]

---

class: prom, snap

## Interpreting results

- The two graphs *should* be similar

- Protip: align the time references!

.exercise[

- Click on the clock in the top right corner

- Select "last 30 minutes"

- Click on "Zoom out"

- Now press the right arrow key (hold it down and watch the CPU usage increase!)

]

*Adjusting units is left as an exercise for the reader.*

---

## More resources on container metrics

- [Prometheus, a Whirlwind Tour](https://speakerdeck.com/copyconstructor/prometheus-a-whirlwind-tour),
  an original overview of Prometheus

- [Docker Swarm & Container Overview](https://grafana.net/dashboards/609),
  a custom dashboard for Grafana

- [Gathering Container Metrics](http://jpetazzo.github.io/2013/10/08/docker-containers-metrics/),
  a blog post about cgroups

- [The Prometheus Time Series Database](https://www.youtube.com/watch?v=HbnGSNEjhUc),
  a talk explaining why custom data storage is necessary for metrics

.blackbelt[[Monitoring, the Prometheus Way](https://www.youtube.com/watch?v=PDxcEzu62jk&list=PLkA60AVN3hh-biQ6SCtBJ-WVTyBmmYho8&index=5) by Julius Volz (DC17US)]

.blackbelt[How and Why Prometheus' New Storage Engine Pushes the Limits of Time Series Databases
by Goutham Veeramachaneni (Tuesday 17:10)]