Configure your monitoring tool
This chapter describes how to configure the following monitoring tools:
- New Relic
- Datadog
- Prometheus Alertmanager
Before you configure your tool, contact your Rackspace Support team. For the AMR service to function properly, you must consult Rackspace and ensure that appropriate documentation for handling events is created.
Configure New Relic
This section provides simplified steps to configure New Relic to integrate with Rackspace Watchman. For more detailed information, see the New Relic documentation.
- Log in to New Relic by using your username and password.
- Create a notification channel for the severity level you want for the alert. Enter the following information:
- Select a channel type - Select “Webhook”
- Channel name - Enter a descriptive name for the channel
- Base URL - This should match the cloud account and desired severity. See Getting the webhooks for more information.
Basic auth, custom headers, and custom payload are optional.
Note: Be sure that you configure your tool for the right cloud account, using the right values for secret and severity. If you misconfigure your tool, alerts won’t be translated into Rackspace tickets, which might result in downtime.
- Configure alerts to use the notification channels that you created.
Each time you set up a new cloud account, you should contact Rackspace to test the end-to-end workflow. If you have any issues, see Troubleshooting.
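As an optional sanity check before the end-to-end test with Rackspace, you can confirm that the webhook URL is reachable from your network. The following is only a sketch with a placeholder URL; it does not send a valid alert payload (New Relic generates that when an alert fires), and it does not replace the Rackspace end-to-end test:
curl -i -X POST "https://<your-watchman-webhook-url>" -H "Content-Type: application/json" -d "{}"
Review the response with your Rackspace Support team if you are unsure how to interpret it.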
Configure Datadog
This section provides simplified steps to configure Datadog to integrate with Rackspace Watchman. For more detailed information, see Datadog webhooks.
- Log in to Datadog by using your username and password.
- Navigate to Integrations and search for Webhook.
- Install the Webhook integration.
- Scroll down to the Webhooks section and click New.
- Enter the following information:
- Name - Enter a descriptive name for the webhook
- URL - This should match the cloud account and desired severity. See Getting the webhooks for more information.
Payload, Custom Headers, and Encode as form are optional.
Note: Be sure that you configure your tool for the right cloud account, using the right values for secret and severity. If you misconfigure your tool, alerts won’t be translated into Rackspace tickets, which might result in downtime.
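If you do supply a custom payload, Datadog replaces webhook template variables such as $EVENT_TITLE, $EVENT_MSG, and $ALERT_TRANSITION when the alert fires. The following is only an illustrative sketch with hypothetical field names; keep the default payload unless Rackspace advises a specific format:
{
  "title": "$EVENT_TITLE",
  "body": "$EVENT_MSG",
  "transition": "$ALERT_TRANSITION",
  "host": "$HOSTNAME",
  "link": "$LINK",
  "date": "$DATE"
}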
- Click Save to save the integration settings.
- Configure Datadog alerts to use the notification channels. For example:
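In Datadog, a monitor triggers a webhook when the webhook is mentioned by name in the monitor's notification message by using the @webhook-<name> syntax. Assuming a webhook named watchman-low (a hypothetical name), the monitor message might look like the following sketch:
{{#is_alert}}
Memory usage is above the threshold on {{host.name}}.
@webhook-watchman-low
{{/is_alert}}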
Each time you set up a new cloud account, you should contact Rackspace to test the end-to-end workflow. If you have any issues, see Troubleshooting.
Configure Prometheus Alertmanager
Prometheus is popular for cluster-level monitoring in the Kubernetes ecosystem and includes an application called Alertmanager, which handles alerts sent by client applications such as the Prometheus server.
This section provides simplified steps to configure Prometheus Alertmanager to integrate with Rackspace Watchman. For more detailed information, see the Alertmanager documentation.
- Create a Kubernetes cluster using a non-end-of-life version.
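For example, you can confirm the server and node versions with the following commands (a quick sanity check; consult your Kubernetes provider for the list of supported versions):
kubectl version
kubectl get nodes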
- Set up a metrics server by using the official documentation.
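For example, one common way to install the metrics server, per its official documentation, is to apply the released manifest and confirm that the deployment is available (verify the exact release and any required flags for your cluster):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n kube-system get deployment metrics-server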
- Install Prometheus and ensure that the pods and services are running.
The actual steps might vary depending on your setup and deployment model. The following example assumes that you have kubectl and helm installed.
kubectl create namespace prometheus
helm install prometheus-operator prometheus-community/prometheus-operator --namespace prometheus
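If the prometheus-community chart repository is not already configured on the workstation where you run helm, you might need to add it before running the preceding install command:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update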
- Ensure that the setup completed successfully by using the following commands:
kubectl -n prometheus get pods
kubectl -n prometheus get services
The output should look similar to the following example:
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-alertmanager-0           2/2     Running   0          61s
prometheus-operator-grafana-6f8fbdcddc-78g4t              2/2     Running   0          66s
prometheus-operator-kube-state-metrics-66b4c95cd9-2sn6g   1/1     Running   0          66s
prometheus-operator-operator-67d8cb455-d66pr              2/2     Running   0          66s
prometheus-operator-prometheus-node-exporter-58nbf        1/1     Running   0          66s
prometheus-operator-prometheus-node-exporter-5rbmx        1/1     Running   0          66s
prometheus-prometheus-operator-prometheus-0               3/3     Running   1          51s
- Update the Alertmanager configuration section to use the Watchman webhooks. The following example has two webhooks, for low and high severities, but you can add all supported severities. The inline comments describe the purpose of each line of the configuration file. Be sure that the YAML formatting is consistent with the rest of your configuration.
## Alertmanager configuration directives
## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
##      https://prometheus.io/webtools/alerting/routing-tree-editor/
##
config:
  global:
    resolve_timeout: 5m
  route:
    # How long to wait to buffer alerts of the same group before sending initially.
    group_wait: 10s
    # How long to wait before sending an alert that has been added to a group which contains already fired alerts.
    group_interval: 5m
    # How long to wait before re-sending a given alert that has already been sent.
    repeat_interval: 3h
    # The default receiver. This should not be set to Rackspace webhooks unless it has been agreed as part of onboarding.
    receiver: 'null'
    # In this case Alertmanager is configured to group alerts by *groupLabels*: `region`, `cluster`, `alertname` and `severity`.
    # This results in a de-noising of alerts since all of the *KubePodNotReady critical* alerts are sent as a group.
    group_by: [alertname, severity, cluster, region]
    # Alert routing rules.
    routes:
    - match:
        severity: normal
      receiver: 'null'
    - match:
        severity: rs-low
      receiver: 'webhook-k8s-alertmanager-low'
    - match:
        severity: rs-high
      receiver: 'webhook-k8s-alertmanager-high'
  # Rackspace Watchman webhooks will be configured in this section. This is in addition to any other receivers such as email addresses or other systems.
  receivers:
  - name: 'null'
  - name: webhook-k8s-alertmanager-low
    webhook_configs:
    - url: <replace with the Watchman URL>
  - name: webhook-k8s-alertmanager-high
    webhook_configs:
    - url: <replace with the Watchman URL>
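If you maintain the Alertmanager configuration (the contents of the config key) as a standalone file, you can optionally validate its syntax with amtool before applying it through Helm. This sketch assumes a local file named alertmanager.yaml that contains only the Alertmanager configuration:
amtool check-config alertmanager.yaml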
- Add a section called externalLabels to the prometheusSpec section of the configuration file to support the preceding group_by labels. This is required for proper identification of the source cluster and can aid in troubleshooting.
prometheusSpec:
  ## If true, pass --storage.tsdb.max-block-duration=2h to prometheus. This is already done if using Thanos
  ##
  disableCompaction: false

  ## APIServerConfig
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#apiserverconfig
  ##
  apiserverConfig: {}

  ## Interval between consecutive scrapes.
  ##
  scrapeInterval: ""

  ## Interval between consecutive evaluations.
  ##
  evaluationInterval: ""

  ## ListenLocal makes the Prometheus server listen on loopback, so that it does not bind against the Pod IP.
  ##
  listenLocal: false

  ## EnableAdminAPI enables Prometheus the administrative HTTP API which includes functionality such as deleting time series.
  ## This is disabled by default.
  ## ref: https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-admin-apis
  ##
  enableAdminAPI: false

  ## Image of Prometheus.
  ##
  image:
    repository: quay.io/prometheus/prometheus
    tag: v2.18.2
    sha: ""

  ## Tolerations for use with node taints
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  ##
  tolerations: []
  # - key: "key"
  #   operator: "Equal"
  #   value: "value"
  #   effect: "NoSchedule"

  ## Alertmanagers to which alerts will be sent
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#alertmanagerendpoints
  ##
  ## Default configuration will connect to the alertmanager deployed as part of this release
  ##
  alertingEndpoints: []
  # - name: ""
  #   namespace: ""
  #   port: http
  #   scheme: http
  #   pathPrefix: ""
  #   tlsConfig: {}
  #   bearerTokenFile: ""
  #   apiVersion: v2

  ## External labels to add to any time series or alerts when communicating with external systems
  ##
  externalLabels:
    cluster: amr-test-cluster
    region: us-east-2
- If you are installing Prometheus from scratch or want a complete values.yaml file, you can use this sample file: https://github.com/RSS-Engineering/amr-public/blob/main/prometheus-alertmanager-values.yaml
Note: If you use the provided file, ensure that the values in the previous steps are set correctly as per your account configuration.
- After you update the values.yaml file, use the following command to apply the configuration.
Note: The actual steps might vary depending on your setup and deployment model.
helm upgrade -f values.yaml prometheus-operator prometheus-community/prometheus-operator --namespace prometheus
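To confirm that the routes and receivers were applied, you can optionally inspect the running Alertmanager. The following sketch assumes the operator created the default alertmanager-operated service in the prometheus namespace:
kubectl -n prometheus port-forward svc/alertmanager-operated 9093 &
curl -s http://127.0.0.1:9093/api/v2/status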
To test the configuration, use the following steps:
- Save the following as rs-alert-test-pod.yaml:
cat <<'EOF' > rs-alert-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: rs-alert-test
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "120M", "--vm-hang", "1"]
EOF
- Save the following as rs-alert-test-rules.yaml:
cat <<'EOF' > rs-alert-test-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus-operator
    meta.helm.sh/release-namespace: prometheus
    prometheus-operator-validated: "true"
  labels:
    app: prometheus-operator
    chart: prometheus-operator-9.3.2
  name: prometheus-operator-general.rules
spec:
  groups:
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.'
      expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10
      for: 10m
      labels:
        severity: warning
    - alert: Watchdog
      annotations:
        message: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
      expr: vector(1)
      labels:
        severity: none
  - name: rs.rules
    rules:
    - alert: example-memory-alert
      annotations:
        message: Pod memory alert
        description: High memory on test pod
        runbook_url: http://example.rackspace.com/runbook/value
      expr: sum(container_memory_usage_bytes{namespace="rs-alert-test",pod="memory-demo",container="memory-demo-ctr"}) > 150000000
      for: 1m
      labels:
        severity: rs-low
EOF
- Use the following commands to deploy the pods and resources:
kubectl create namespace rs-alert-test
kubectl -n rs-alert-test create -f rs-alert-test-pod.yaml
kubectl -n prometheus create -f rs-alert-test-rules.yaml
kubectl -n prometheus get prometheusrules
kubectl -n rs-alert-test get pods
- Open three terminal windows and run the following commands:
# Window 1 - This is going to increase the memory usage by 30M on the pod
kubectl exec -ti -n rs-alert-test memory-demo -- stress --vm 1 --vm-bytes 30M --vm-hang 1 -t 500

# Window 2 - If the pod is using more than 150Mi, the alert should be firing as per the rules.
kubectl top pods -n rs-alert-test

# Window 3 - You can also check the current alerts firing
kubectl exec -ti $(kubectl get pods -l app=alertmanager -n prometheus -o jsonpath='{.items[0].metadata.name}') -n prometheus -c alertmanager -- amtool alert query --alertmanager.url=http://127.0.0.1:9093
The amtool output lists the currently firing alerts, which should include example-memory-alert after the pod's memory usage crosses the 150 MB threshold. Depending on the severity set, the corresponding webhook is triggered and a Rackspace alert ticket is generated.
- Finally, clean up the test artifacts by using the following commands:
kubectl -n prometheus delete -f rs-alert-test-rules.yaml
kubectl -n rs-alert-test delete -f rs-alert-test-pod.yaml
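If you created the rs-alert-test namespace only for this test, you can also remove the namespace itself:
kubectl delete namespace rs-alert-test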
If you have any issues, see Troubleshooting.