12 KiB
Collecting metrics with Prometheus
-
Prometheus is an open-source monitoring system including:
-
multiple service discovery backends to figure out which metrics to collect
-
a scraper to collect these metrics
-
an efficient time series database to store these metrics
-
a specific query language (PromQL) to query these time series
-
an alert manager to notify us according to metrics values or trends
-
-
We are going to deploy it on our Kubernetes cluster and see how to query it
Why Prometheus?
-
We don't endorse Prometheus more or less than any other system
-
It's relatively well integrated within the Cloud Native ecosystem
-
It can be self-hosted (this is useful for tutorials like this)
-
It can be used for deployments of varying complexity:
-
one binary and 10 lines of configuration to get started
-
all the way to thousands of nodes and millions of metrics
-
Exposing metrics to Prometheus
-
Prometheus obtains metrics and their values by querying exporters
-
An exporter serves metrics over HTTP, in plain text
-
This is what the node exporter looks like:
-
Prometheus itself exposes its own internal metrics, too:
-
If you want to expose custom metrics to Prometheus:
-
serve a text page like these, and you're good to go
-
libraries are available in various languages to help with quantiles etc.
-
How Prometheus gets these metrics
-
The Prometheus server will scrape URLs like these at regular intervals
(by default: every minute; can be more/less frequent)
-
If you're worried about parsing overhead: exporters can also use protobuf
-
The list of URLs to scrape (the scrape targets) is defined in configuration
Defining scrape targets
This is maybe the simplest configuration file for Prometheus:
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
-
In this configuration, Prometheus collects its own internal metrics
-
A typical configuration file will have multiple
scrape_configs -
In this configuration, the list of targets is fixed
-
A typical configuration file will use dynamic service discovery
Service discovery
This configuration file will leverage existing DNS A records:
scrape_configs:
- ...
- job_name: 'node'
dns_sd_configs:
- names: ['api-backends.dc-paris-2.enix.io']
type: 'A'
port: 9100
-
In this configuration, Prometheus resolves the provided name(s)
(here,
api-backends.dc-paris-2.enix.io) -
Each resulting IP address is added as a target on port 9100
Dynamic service discovery
-
In the DNS example, the names are re-resolved at regular intervals
-
As DNS records are created/updated/removed, scrape targets change as well
-
Existing data (previously collected metrics) is not deleted
-
Other service discovery backends work in a similar fashion
Other service discovery mechanisms
-
Prometheus can connect to e.g. a cloud API to list instances
-
Or to the Kubernetes API to list nodes, pods, services ...
-
Or a service like Consul, Zookeeper, etcd, to list applications
-
The resulting configurations files are way more complex
(but don't worry, we won't need to write them ourselves)
Time series database
-
We could wonder, "why do we need a specialized database?"
-
One metrics data point = metrics ID + timestamp + value
-
With a classic SQL or noSQL data store, that's at least 160 bits of data + indexes
-
Prometheus is way more efficient, without sacrificing performance
(it will even be gentler on the I/O subsystem since it needs to write less)
Storage in Prometheus 2.0 by Goutham V at DC17EU
Running Prometheus on our cluster
We need to:
-
Run the Prometheus server in a pod
(using e.g. a Deployment to ensure that it keeps running)
-
Expose the Prometheus server web UI (e.g. with a NodePort)
-
Run the node exporter on each node (with a Daemon Set)
-
Setup a Service Account so that Prometheus can query the Kubernetes API
-
Configure the Prometheus server
(storing the configuration in a Config Map for easy updates)
Helm Charts to the rescue
-
To make our lives easier, we are going to use a Helm Chart
-
The Helm Chart will take care of all the steps explained above
(including some extra features that we don't need, but won't hurt)
Step 1: install Helm
- If we already installed Helm earlier, these commands won't break anything
.exercice[
-
Install Tiller (Helm's server-side component) on our cluster:
helm init -
Give Tiller permission to deploy things on our cluster:
kubectl create clusterrolebinding add-on-cluster-admin \ --clusterrole=cluster-admin --serviceaccount=kube-system:default
]
Step 2: install Prometheus
-
Skip this if we already installed Prometheus earlier
(in doubt, check with
helm list)
.exercice[
- Install Prometheus on our cluster:
helm install stable/prometheus \ --set server.service.type=NodePort \ --set server.persistentVolume.enabled=false
]
The provided flags:
-
expose the server web UI (and API) on a NodePort
-
use an ephemeral volume for metrics storage
(instead of requesting a Persistent Volume through a Persistent Volume Claim)
Connecting to the Prometheus web UI
- Let's connect to the web UI and see what we can do
.exercise[
-
Figure out the NodePort that was allocated to the Prometheus server:
kubectl get svc | grep prometheus-server -
With your browser, connect to that port
]
Querying some metrics
- This is easy ... if you are familiar with PromQL
.exercise[
- Click on "Graph", and in "expression", paste the following:
sum by (instance) ( irate( container_cpu_usage_seconds_total{ pod_name=~"worker.*" }[5m] ) )
]
-
Click on the blue "Execute" button and on the "Graph" tab just below
-
We see the cumulated CPU usage of worker pods for each node
(if we just deployed Prometheus, there won't be much data to see, though)
Getting started with PromQL
-
We can't learn PromQL in just 5 minutes
-
But we can cover the basics to get an idea of what is possible
(and have some keywords and pointers)
-
We are going to break down the query above
(building it one step at a time)
Graphing one metric across all tags
This query will show us CPU usage across all containers:
container_cpu_usage_seconds_total
-
The suffix of the metrics name tells us:
-
the unit (seconds of CPU)
-
that it's the total used since the container creation
-
-
Since it's a "total", it is an increasing quantity
(we need to compute the derivative if we want e.g. CPU % over time)
-
We see that the metrics retrieved have tags attached to them
Selecting metrics with tags
This query will show us only metrics for worker containers:
container_cpu_usage_seconds_total{pod_name=~"worker.*"}
-
The
=~operator allows regex matching -
We select all the pods with a name starting with
worker(it would be better to use labels to select pods; more on that later)
-
The result is a smaller set of containers
Transforming counters in rates
This query will show us CPU usage % instead of total seconds used:
100*irate(container_cpu_usage_seconds_total{pod_name=~"worker.*"}[5m])
-
The
irateoperator computes the "per-second instant rate of increase"-
rateis similar but allows decreasing counters and negative values -
with
irate, if a counter goes back to zero, we don't get a negative spike
-
-
The
[5m]tells how far to look back if there is a gap in the data -
And we multiply with
100*to get CPU % usage
Aggregation operators
This query sums the CPU usage per node:
sum by (instance) (
irate(container_cpu_usage_seconds_total{pod_name=~"worker.*"}[5m])
)
-
instancecorresponds to the node on which the container is running -
sum by (instance) (...)computes the sum for each instance -
Note: all the other tags are collapsed
(in other words, the resulting graph only shows the
instancetag) -
PromQL supports many more aggregation operators
What kind of metrics can we collect?
-
Node metrics (related to physical or virtual machines)
-
Container metrics (resource usage per container)
-
Databases, message queues, load balancers, ...
(check out this list of exporters!)
-
Instrumentation (=deluxe
printffor our code) -
Business metrics (customers served, revenue, ...)
class: extra-details
Node metrics
-
CPU, RAM, disk usage on the whole node
-
Total number of processes running, and their states
-
Number of open files, sockets, and their states
-
I/O activity (disk, network), per operation or volume
-
Physical/hardware (when applicable): temperature, fan speed ...
-
... and much more!
class: extra-details
Container metrics
-
Similar to node metrics, but not totally identical
-
RAM breakdown will be different
- active vs inactive memory
- some memory is shared between containers, and accounted specially
-
I/O activity is also harder to track
- async writes can cause deferred "charges"
- some page-ins are also shared between containers
For details about container metrics, see:
http://jpetazzo.github.io/2013/10/08/docker-containers-metrics/
class: extra-details
Application metrics
-
Arbitrary metrics related to your application and business
-
System performance: request latency, error rate ...
-
Volume information: number of rows in database, message queue size ...
-
Business data: inventory, items sold, revenue ...
class: extra-details
Detecting scrape targets
-
Prometheus can leverage Kubernetes service discovery
(with proper configuration)
-
Services or pods can be annotated with:
prometheus.io/scrape: trueto enable scrapingprometheus.io/port: 9090to indicate the port numberprometheus.io/path: /metricsto indicate the URI (/metricsby default)
-
Prometheus will detect and scrape these (without needing a restart or reload)
Querying labels
-
What if we want to get metrics for containers belong to pod tagged
worker? -
The cAdvisor exporter does not give us Kubernetes labels
-
Kubernetes labels are exposed through another exporter
-
We can see Kubernetes labels through metrics
kube_pod_labels(each container appears as a time series with constant value of
1) -
Prometheus kind of supports "joins" between time series
-
But only if the names of the tags match exactly
Unfortunately ...
-
The cAdvisor exporter uses tag
pod_namefor the name of a pod -
The Kubernetes service endpoints exporter uses tag
podinstead -
See this blog post or this other one to see how to perform "joins"
-
Alas, Prometheus cannot "join" time series with different labels
(see Prometheus issue #2204 for the rationale)
-
There is a workaround involving relabeling, but it's "not cheap"
-
see this comment for an overview
-
or this blog post for a complete description of the process
-