# Canary analysis with Prometheus Operator

This guide shows you how to use
[Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) for canary analysis.

## Prerequisites

Flagger requires a Kubernetes cluster **v1.16** or newer and Prometheus Operator **v0.40** or newer.

Install Prometheus Operator with Helm v3:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

kubectl create ns monitoring
helm upgrade -i prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set fullnameOverride=prometheus
```

Setting `prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false`
makes Prometheus select all ServiceMonitors, not only those labeled with the Helm release,
so it can pick up ServiceMonitors created outside of its own namespace.

Install Flagger by setting the metrics server to Prometheus:

```bash
helm repo add flagger https://flagger.app

kubectl create ns flagger-system
helm upgrade -i flagger flagger/flagger \
--namespace flagger-system \
--set metricsServer=http://prometheus-prometheus.monitoring:9090 \
--set meshProvider=kubernetes
```

Install Flagger's load tester:

```bash
helm upgrade -i loadtester flagger/loadtester \
--namespace flagger-system
```

Install the [podinfo](https://github.com/stefanprodan/podinfo) demo app:

```bash
helm repo add podinfo https://stefanprodan.github.io/podinfo

kubectl create ns test
helm upgrade -i podinfo podinfo/podinfo \
--namespace test \
--set service.enabled=false
```

## Service monitors

The demo app is instrumented with Prometheus,
so you can create `ServiceMonitor` objects to scrape podinfo's metrics endpoint:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: podinfo-canary
  namespace: test
spec:
  endpoints:
  - path: /metrics
    port: http
    interval: 5s
  selector:
    matchLabels:
      app: podinfo-canary
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: podinfo-primary
  namespace: test
spec:
  endpoints:
    - path: /metrics
      port: http
      interval: 5s
  selector:
    matchLabels:
      app: podinfo
```

Setting `interval: 5s` makes Prometheus scrape podinfo more frequently than the default.
If you omit it, use a longer metric interval in the Canary object so that each query window contains enough samples.
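
The scrape interval and the metric interval have to be chosen together because `rate()` needs at least two samples inside its range window to return a value. As a rough sanity check, the sketch below estimates how many scrapes fall inside a query window. The helper name `samples_in_window` and the second set of values are illustrative and not part of Flagger or Prometheus.

```bash
# Rough estimate of how many scrape samples fall inside a query window.
# Both arguments are in seconds. `samples_in_window` is a hypothetical helper.
samples_in_window() {
  local scrape_interval=$1 window=$2
  echo $(( window / scrape_interval ))
}

# Scrape every 5s, query over a 30s window (the values used in this guide).
samples_in_window 5 30

# Scrape every 30s with the same 30s window: too few samples for rate().
samples_in_window 30 30
```

The first call prints `6` and the second prints `1`, which suggests that a 30s scrape interval would need a metric interval of at least one minute.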

## Metric templates

Create a metric template to measure the HTTP request error rate as a percentage:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: test
spec:
  provider:
    address: http://prometheus-prometheus.monitoring:9090
    type: prometheus
  query: |
    100 - sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          job="{{ target }}-canary",
          status!~"5.*"
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          job="{{ target }}-canary"
        }[{{ interval }}]
      )
    ) * 100
```
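
To make the arithmetic concrete, here is a small sketch that applies the same formula to made-up request rates. The function name `error_rate_percent` and the numbers are illustrative assumptions; Flagger evaluates the PromQL query in Prometheus, not in a shell.

```bash
# Mirror of the query's arithmetic: 100 - (non-5xx rate / total rate) * 100.
# Inputs are per-second request rates; the values below are invented.
error_rate_percent() {
  local ok_rate=$1 total_rate=$2
  awk -v ok="$ok_rate" -v total="$total_rate" \
    'BEGIN { printf "%.2f\n", 100 - (ok / total) * 100 }'
}

# 19.9 of 20 requests per second are non-5xx: 0.5% errors.
error_rate_percent 19.9 20

# 19 of 20 requests per second are non-5xx: 5% errors.
error_rate_percent 19 20
```

The Canary later in this guide sets `max: 1` for this metric, so the first case would pass the check and the second would count as a failed check.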

Create a metric template to measure the 99th percentile (P99) of the HTTP request duration, in seconds:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency
  namespace: test
spec:
  provider:
    address: http://prometheus-prometheus.monitoring:9090
    type: prometheus
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          http_request_duration_seconds_bucket{
            namespace="{{ namespace }}",
            job="{{ target }}-canary"
          }[{{ interval }}]
        )
      ) by (le)
    )
```
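
`histogram_quantile` does not see individual request durations. It estimates the quantile from cumulative bucket counts, interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea on invented bucket data. The helper `estimate_quantile` and the bucket values are hypothetical, and the sketch assumes a non-empty histogram and ignores the `+Inf` bucket and other edge cases that Prometheus handles.

```bash
# Estimate a quantile from cumulative histogram buckets.
# Arguments: the quantile, then "upper_bound:cumulative_count" pairs in ascending order.
# `estimate_quantile` is a hypothetical helper, not a Prometheus or Flagger command.
estimate_quantile() {
  local q=$1
  shift
  printf '%s\n' "$@" | awk -F: -v q="$q" '
    { le[NR] = $1; count[NR] = $2 }
    END {
      rank = q * count[NR]                # position of the target observation
      for (i = 1; i <= NR; i++) {
        if (count[i] >= rank) {           # first bucket that reaches the rank
          lower = (i > 1) ? le[i-1] : 0
          below = (i > 1) ? count[i-1] : 0
          printf "%.3f\n", lower + (le[i] - lower) * (rank - below) / (count[i] - below)
          exit
        }
      }
    }'
}

# 1000 requests: 900 under 0.1s, 980 under 0.25s, all under 0.5s.
estimate_quantile 0.99 0.1:900 0.25:980 0.5:1000
```

With these buckets the estimate is `0.375` seconds, which would stay below a `max: 0.5` threshold.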

## Canary analysis

Using the metric templates, you can configure the canary analysis with HTTP error rate and latency checks:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: kubernetes
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: http
    name: podinfo
  analysis:
    interval: 30s
    iterations: 10
    threshold: 2
    metrics:
    - name: error-rate
      templateRef:
        name: error-rate
      thresholdRange:
        max: 1
      interval: 30s
    - name: latency
      templateRef:
        name: latency
      thresholdRange:
        max: 0.5
      interval: 30s
    webhooks:
      - name: load-test
        type: rollout
        url: "http://loadtester.flagger-system/"
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test/"
```
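
As a rough guide to how long this analysis takes, the arithmetic below uses the values from the spec above. It assumes that promotion requires all iterations to complete and that a rollback starts once `threshold` checks have failed, so treat the results as lower bounds, not guarantees.

```bash
# Values copied from the Canary spec above.
interval=30       # seconds between analysis runs
iterations=10     # number of analysis runs before promotion
threshold=2       # failed checks tolerated before rollback

# Assumed lower bounds, see the note above.
echo "minimum time to promotion: $(( interval * iterations ))s"
echo "minimum time to rollback: $(( interval * threshold ))s"

# hey flags used in the webhook: -q is the rate limit per worker, -c the number of workers.
qps_per_worker=10
workers=2
echo "peak load: $(( qps_per_worker * workers )) req/s"
```

This prints 300s for promotion, 60s for rollback, and a peak load of 20 requests per second against the canary service.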

Based on the above specification, Flagger creates the primary and canary Kubernetes ClusterIP services.

During the canary analysis, Prometheus will scrape the canary service and Flagger will use the HTTP error rate
and latency queries to determine if the release should be promoted or rolled back.
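
The promote-or-rollback decision can be pictured with a simplified model. The sketch below is not Flagger's code: the `decide` function is hypothetical, and it assumes that a rollback starts as soon as the number of failed checks reaches `threshold`, and that the release is promoted if all runs finish before that happens.

```bash
# Simplified model of the analysis loop; not Flagger's actual implementation.
# Arguments: the threshold, then one "pass" or "fail" per analysis run.
decide() {
  local threshold=$1 failed=0 result
  shift
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      failed=$(( failed + 1 ))
    fi
    # Assumption: roll back once failed checks reach the threshold.
    if [ "$failed" -ge "$threshold" ]; then
      echo "rollback"
      return 0
    fi
  done
  echo "promote"
}

# One failed check stays below the threshold of 2.
decide 2 pass fail pass pass pass pass pass pass pass pass

# A second failed check reaches the threshold.
decide 2 pass fail fail
```

The first call prints `promote` and the second prints `rollback`.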