This section describes alerting, including an explanation of the alert flow, the alarm language, the policies that you can create using alarms, and example best practices. In short, Rackspace Cloud Monitoring uses alarms to evaluate the metrics of a check and decide whether a notification plan should be executed. Alarms are the primary way to describe exactly what you want to be alerted on.
Let's take an example user workflow, creating a monitor for a particular resource, and follow it through the system to understand how the alerting system works:
Using the Create Check call, create a check with one or more monitoring zones (per the monitoring_zone_poll attribute). When you apply the check via the API, the check is provisioned on the collectors. If the check is successfully applied (as indicated by the HTTP response code), the monitor starts executing the check.
Using the Create Alarm call, create a new alarm on the entity that matches this particular check. Note that alarms are created to match when a specific condition occurs. On this alarm let's assume you've specified the alarm policy as QUORUM. This parameter describes a deterministic way to represent mixed results in a "multi-datacenter" monitoring environment. To learn more about this concept, see Alert Policies. You can also read Best Practices on Alerting for more pattern applications of the alarm language.
If the monitored resource fails, a state change event is generated (since all the collectors agree on the status per the QUORUM alert policy) and an alert is triggered. Based on the logic you created in the associated notification plan, an error notification is sent (the call itself is a webhook).
Note: If a check fails to execute, by default the alarm associated with the check returns a CRITICAL state. This may change in future versions of the product; however, this is currently the only behavior allowed. This represents a subclass of failures, similar to a "Connection Timeout" or other errors, where the result wasn't simply a failure result but one where the check could not be run at all.
The alarm language is one of the most powerful parts of Rackspace Cloud Monitoring. It describes the mechanism to trigger an event. Upon triggering an event a notification plan is executed that describes how to send different notifications.
As mentioned above the default evaluation of a check depends upon whether the
check is able to run successfully. We can illustrate this concept using the HTTP
check as an example. If the alarm checks the status of a 404 response, but the check
is actually getting a Connection Refused message, the result of that check is
ERROR. The availability of the check is determined by the ability
to run the check.
An alarm query is broken down into the following main parts:
- Comments
Comments are either line-by-line comments that begin with a # or C-style comments /* */.
# This is a comment
/* This is a comment */
// This is NOT a comment
- String Literals
String literals are surrounded with either ' or ". String literals support the following escape sequences:
| Sequence | Value |
|---|---|
| \" | Double quote |
| \' | Single quote |
| \\ | Backslash |
| \b | Backspace |
| \f | Formfeed |
| \n | Newline |
| \r | Carriage return |
| \t | Tab |
| \uXXXX | Unicode character where XXXX is the hex unicode character code |
Some example string literals:
"Foo" /* A double quoted string */
'Foo' /* A single quoted string */
'"Foo\'s bar\"' /* Single quoted strings may contain unescaped double quotes */
/* as well as escaped single or double quotes */
"'Bar's foo\'" /* Double quoted strings may contain unescaped single quotes */
/* as well as escaped single or double quotes */
'\u0027abc' /* A string containing an escaped unicode character */
- Numeric Literals
Numeric literals are written without quotation marks. Below are some examples:
2773.2 /* Numeric literal */
200 /* Numeric literal */
-200 /* Numeric literal */
1.2e-7 /* Numeric literal with exponential notation */
- Declarations
This part of the alarm language is the setting declarations, which globally control the evaluation of the query. The syntax is shown below:
:set <name>=<value>
The current version of the product supports two settings. The first setting specifies the consistency level.
:set consistencyLevel=<value>
This is an important setting that is typically left as QUORUM (the default) unless there is a specific need to change it. For more information about alerting policies and consistency levels, see Alert Policies.
The second setting is the consecutive alert count. It determines how many consecutive evaluations of a state occur before issuing a state change. The default for this setting is 1.
:set consecutiveCount=<value>
- Conditionals
The second part of the query is the conditional statement. The conditional statements determine which criteria cause an alert to be sent on behalf of the user. This is by far the most configurable part of the alarm language. There are two types of comparisons: numeric comparisons and text comparisons.
Numeric comparisons have a number of different operators, which are listed below:
== /* Equality when compared with a literal numeric */
!= /* Inequality */
>= /* Greater than or equal to */
<= /* Less than or equal to */
< /* Less than */
> /* Greater than */
Text comparisons work as follows. If both the left-hand side and the right-hand side of the conditional are metric statements, equality and inequality are evaluated by lexicographical comparison.
If either the left-hand or the right-hand side is a literal, then the following operators are available for use:
== /* String comparison */
!= /* String comparison */
regex /* Regular expression match */
nregex /* Regular expression inverse match */
On top of the single conditional operators, you can also use boolean logic to evaluate multiple conditionals in a single alarm. When trying to determine if a resource is functioning correctly, this built-in flexibility provides you with a powerful tool that lets you examine multiple aspects of a single resource.
The operators available are:
&& /* And */
|| /* Or */
- Return Statements
The third part of the query is the return statements. The return statements determine the notification or notifications to execute on the notification plan as well as the state of the alarm. There are two separate methods to represent a return query:
Returning the status:
return new AlarmStatus(<OK|WARNING|CRITICAL>);
Returning the status and message:
return new AlarmStatus(<OK|WARNING|CRITICAL>, <String Status Message>);
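The consecutiveCount setting described under Declarations can be pictured with a small Python sketch. This is only an illustration of "how many consecutive evaluations of a state occur before issuing a state change"; the class name, the starting state of OK, and the reset behavior are assumptions, not the service's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveCounter:
    """Hypothetical helper: report a state change only after N repeats."""

    consecutive_count: int = 1  # default value stated in the text
    current_state: str = "OK"   # assumed starting state
    _candidate: str = ""
    _streak: int = 0

    def observe(self, state: str) -> bool:
        """Record one evaluation; return True if the alarm state changed."""
        if state == self.current_state:
            # Back to the reported state: drop any pending candidate.
            self._candidate, self._streak = "", 0
            return False
        if state == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = state, 1
        if self._streak >= self.consecutive_count:
            self.current_state = state
            self._candidate, self._streak = "", 0
            return True
        return False


counter = ConsecutiveCounter(consecutive_count=3)
evaluations = ["CRITICAL", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]
changes = [counter.observe(state) for state in evaluations]
print(changes)
```

With a count of 3, the two early CRITICAL results are discarded once an OK arrives, and the state only changes on the third consecutive CRITICAL.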
Alarms have limits in their constructs. For instance, there is a limited set of conditionals you can use in a single alarm query.
The following list describes the limits and defaults for alarms:
Minimum conditionals in a single query: 0
Maximum conditionals in a single query: 10
Criteria: Optional
Note that if criteria are not specified, the availability of the check determines the alarm state.
Default consistency level of the alert policy: QUORUM
Default consecutive alert count: 1
Maximum length of a metric name string (in characters): 128
Maximum length of a string literal representing a metric value (in characters): 80
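If you generate alarm criteria programmatically, it can help to check them against these limits before calling the API. The following Python sketch is one way to do that; the function and constant names are hypothetical, and it assumes you have already counted the conditionals and collected the metric names and string literals yourself.

```python
# Limits copied from the list above; the names are illustrative only.
MAX_CONDITIONALS = 10
MAX_METRIC_NAME_LENGTH = 128
MAX_STRING_LITERAL_LENGTH = 80


def check_alarm_limits(conditional_count, metric_names, string_literals):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if conditional_count > MAX_CONDITIONALS:
        problems.append(
            f"{conditional_count} conditionals exceeds the maximum of {MAX_CONDITIONALS}"
        )
    for name in metric_names:
        if len(name) > MAX_METRIC_NAME_LENGTH:
            problems.append(f"metric name is {len(name)} characters long")
    for literal in string_literals:
        if len(literal) > MAX_STRING_LITERAL_LENGTH:
            problems.append(f"string literal is {len(literal)} characters long")
    return problems


ok = check_alarm_limits(2, ["code"], ["404"])
too_many = check_alarm_limits(11, ["code"], ["x" * 81])
print(ok, too_many)
```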
Checks and alarms have status strings, and a resolution policy determines the final message that is displayed to a user in an email, alarm change log, or webhook. This message is a human-readable string describing the status of the alarm. Status messages may be up to 128 characters long.
The resolution policy is as follows:
Status string interpolation substitutes a metric name written in a special format with the point-in-time value of that metric. It looks like this:
return new AlarmStatus(WARNING, 'The check took #{duration}s to execute');
Note: String Interpolation will substitute a
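To make the #{...} substitution concrete, here is a minimal Python sketch of the idea. It is not the service's implementation: the function name is hypothetical, and leaving unknown metric names untouched is an assumption, since the text does not say what happens in that case.

```python
import re

# Matches placeholders of the form #{metric_name}.
_PLACEHOLDER = re.compile(r"#\{([^}]+)\}")


def interpolate_status(template: str, metrics: dict) -> str:
    """Replace each #{name} with the current value of metrics[name].

    Assumption: names that are not present in metrics are left as-is.
    """

    def substitute(match):
        name = match.group(1)
        if name in metrics:
            return str(metrics[name])
        return match.group(0)

    return _PLACEHOLDER.sub(substitute, template)


message = interpolate_status(
    "The check took #{duration}s to execute", {"duration": 42}
)
print(message)  # The check took 42s to execute
```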
Alert policies define a system for interpreting mixed results from a check. Mixed
results can occur during failure scenarios if you are monitoring a resource from multiple
monitoring zones. For instance, if you're monitoring a website from three different
monitoring zones and the website goes down, under the QUORUM policy two of the three
monitoring zones would need to agree before an alert is sent.
There are three different alert policies for interpreting mixed results. Each policy has trade-offs that should be considered when determining which one to use. The policies and their trade-offs are described below.
ONE: A single monitoring zone reports the failure. For example, an alert is triggered if one of three, or one of five, monitoring zones reports the failure.
The ONE policy optimizes speed of alerting at the expense of
correctness. For instance, any network blip between Rackspace Cloud Monitoring and the monitored
resource could potentially generate an alert. This is mitigated in the
QUORUM policy.
QUORUM: A failure is detected in a majority of the monitoring zones. For example, two of three, or three of five, monitoring zones report the failure. The calculation is TOTAL / 2 + 1, using integer division.
The QUORUM policy is the recommended solution. It offers the best
speed to correctness trade-off. Only a majority of the infrastructure monitoring
your resource has to agree that the resource is in fact down before sending an
alert. This is the best approach to maintain speed and low time-to-alert.
ALL: All monitoring zones agree that the resource is down. For example, three out of three monitoring zones report the failure.
The ALL policy is the most accurate, but it is also prone to failure in
significant failure scenarios. If a network partition occurs between our internal
datacenters, the alert could be delayed by the election process. In this
case a machine has to be marked down, and then the checks are re-evaluated as a
group. If they come to a consensus (with the downed collector), then an alert is
generated.
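The three policies differ only in how many monitoring zones must report a failure. The following Python sketch works through that arithmetic, including the TOTAL / 2 + 1 quorum calculation with integer division. The function names are hypothetical, and the sketch ignores the election process described above.

```python
from typing import Sequence


def required_failures(policy: str, total_zones: int) -> int:
    """Number of zones that must report failure before an alert is sent."""
    if total_zones < 1:
        raise ValueError("at least one monitoring zone is required")
    if policy == "ONE":
        return 1
    if policy == "QUORUM":
        return total_zones // 2 + 1  # TOTAL / 2 + 1 with integer division
    if policy == "ALL":
        return total_zones
    raise ValueError(f"unknown policy: {policy}")


def should_alert(policy: str, zone_failed: Sequence[bool]) -> bool:
    """True if enough zones report failure under the given policy."""
    return sum(zone_failed) >= required_failures(policy, len(zone_failed))


# Two of three monitoring zones see the resource as down.
results = [True, True, False]
for policy in ("ONE", "QUORUM", "ALL"):
    print(policy, should_alert(policy, results))
```

With two of three zones failing, ONE and QUORUM would alert, while ALL would not.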
Function modifiers serve to alter the interpretation of a metric. The format of a modifier is as follows:
ex: <funcname>(metric['response_time'])
Function name: previous
This function looks back at the same metric in the previous time period from the same datacenter. It is useful for making sure that a value is always incrementing. Another use is detecting string changes and sending an alert when one occurs.
if (previous(metric['fingerprint']) != metric['fingerprint']) {
return new AlarmStatus(CRITICAL, 'Fingerprint has changed to: #{fingerprint}');
}
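One way to think about previous() is as a per-datacenter memory of the last value seen for each metric. The Python sketch below illustrates that idea only; the class is hypothetical, and returning None on the first observation is an assumption about a case the text does not cover.

```python
class PreviousTracker:
    """Hypothetical store of the last value seen per (zone, metric)."""

    def __init__(self):
        self._last = {}

    def previous(self, zone, name, value):
        """Record value and return the prior one from the same zone.

        Returns None the first time a (zone, metric) pair is seen.
        """
        key = (zone, name)
        prior = self._last.get(key)
        self._last[key] = value
        return prior


tracker = PreviousTracker()
first = tracker.previous("mzord", "fingerprint", "aa11")
second = tracker.previous("mzord", "fingerprint", "bb22")
other_zone = tracker.previous("mzdfw", "fingerprint", "bb22")

# The fingerprint changed between two periods in the same zone.
changed = second is not None and second != "bb22"
print(first, second, other_zone, changed)
```

Keeping the history per zone mirrors the statement that the comparison is made against the previous period "from the same datacenter".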
Function name: rate
This is best used for counters. For instance, if you are tracking a counter such as bytes_in on a network interface, this gives you the rate as defined by the following formula, where V=value and T=time.
(V1 - V0) / (T1 - T0)
if (rate(metric['rx_bytes']) > 5242880) {
return new AlarmStatus(CRITICAL, 'Received greater than 5 MBps.');
}
if (rate(metric['rx_bytes']) > 1048576) {
return new AlarmStatus(WARNING, 'Received greater than 1 MBps.');
}
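The rate formula and the two thresholds above can be worked through in Python. This is a sketch under stated assumptions: the function names are hypothetical, the two samples are passed in explicitly, and returning OK when no threshold matches is an assumption rather than documented behavior.

```python
def rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Change in value per second between two samples: (V1 - V0) / (T1 - T0)."""
    if t1 <= t0:
        raise ValueError("the second sample must be later than the first")
    return (v1 - v0) / (t1 - t0)


def classify_rx(bytes_per_second: float) -> str:
    """Apply the thresholds from the alarm above, most severe first."""
    if bytes_per_second > 5242880:  # 5 MB per second
        return "CRITICAL"
    if bytes_per_second > 1048576:  # 1 MB per second
        return "WARNING"
    return "OK"  # assumed result when nothing matches


# rx_bytes grew by 180,000,000 over a 60 second period.
bytes_per_second = rate(v0=0, t0=0, v1=180_000_000, t1=60)
status = classify_rx(bytes_per_second)
print(bytes_per_second, status)
```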
Function name: percent
This function is used to calculate a percentage, useful in situations like the example below.
Note: Notice the order of the two statements below. Because statements execute sequentially, it is important to put the most specific condition first, as the first matched condition wins. This is true for all conditions, but it is most commonly exposed in statements like these.
if (percent(metric['used'], metric['total']) > 90) {
return new AlarmStatus(CRITICAL, 'Less than 10% free space left.');
}
if (percent(metric['used'], metric['total']) > 80) {
return new AlarmStatus(WARNING, 'Less than 20% free space left.');
}
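The effect of statement order can be demonstrated with a short Python sketch of first-match-wins evaluation. The percent helper and the rule list are illustrative stand-ins for the alarm above, and the OK fallback is an assumption.

```python
def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# (threshold, status) pairs, checked in order; the first match wins.
RULES = [(90, "CRITICAL"), (80, "WARNING")]


def evaluate(used: float, total: float, rules=RULES) -> str:
    value = percent(used, total)
    for threshold, status in rules:
        if value > threshold:
            return status
    return "OK"  # assumed result when no rule matches


correct = evaluate(95, 100)
# Reversing the order lets the broader 80% rule shadow the 90% rule.
shadowed = evaluate(95, 100, rules=list(reversed(RULES)))
print(correct, shadowed)
```

With the broader condition first, a disk that is 95% full is only ever reported as WARNING, which is why the most specific condition belongs at the top.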
This section covers common solution patterns for creating useful alerts. It focuses on alarms and how you can use the alarm language to best achieve these patterns.
Critical on 404 or Connection Refused
This example assumes a provisioned Remote HTTP check with standard settings. It checks whether the return code (which is a metric of type string) is the string equivalent of a 404. HTTP response codes are numeric, but since they hold no numeric value, we interpret them as strings.
if (metric['code'] == "404") {
return new AlarmStatus(CRITICAL, "Page not found!");
}
Check for the existence of a body match and error out if present
This example assumes a provisioned Remote HTTP check with a metric
called body_match added to the response. You can use this string
metric to check for the existence of the text, and error out if it is found.
Using the HTTPS prefix automatically defaults the port to the
standard 443 instead of port 80. This particular
example looks for the word "forbidden" in the body match and, if found, returns
CRITICAL with the error message: "Forbidden found,
returning CRITICAL."
if (metric['body_match'] regex ".*forbidden.*") {
return new AlarmStatus(CRITICAL, "Forbidden found, returning CRITICAL.");
}
Check the cert_end_in metric; warning if less than a week away, critical if less than two days away
This example assumes a provisioned Remote HTTP check against an HTTPS server, which adds a set of SSL-specific metrics to the hash of metrics.
This example checks the time until the certificate expires, in seconds, available as the
cert_end_in metric. The more specific two-day condition comes first because the first matched condition wins:
/* 2 days = 172 800 seconds */
if (metric['cert_end_in'] < 172800) {
return new AlarmStatus(CRITICAL, "Cert expiring in less than 2 days.");
}
/* 1 week = 604 800 seconds */
if (metric['cert_end_in'] < 604800) {
return new AlarmStatus(WARNING, "Cert expiring in less than 1 week.");
}
This example assumes a provisioned Remote TCP check. It also specifies a
banner_match of 'OpenSSH.*', which matches content based on the text sent
from the server upon connection. For a complete description, see Remote TCP.
If a banner matches, then a metric called banner_match is added to the result.
One common solution is to check for the existence of that metric and return
CRITICAL otherwise.
/* Have the check match at the edge */
if (metric['banner_match'] != "") {
return new AlarmStatus(OK);
}
/* Or use the regex parser in the
language to build multiple matches
in a single alarm. */
if (metric['banner'] regex "OpenSSH.*") {
return new AlarmStatus(OK);
}
return new AlarmStatus(CRITICAL, "Match not found.");
This example assumes a provisioned Remote DNS check against a
working nameserver. In this example, the alarm matches against the
answer metric. As with all alarms, if the check is marked
available=false (which in this case means the nameserver fails to respond), then
the alarm is CRITICAL.
# Match if the 127... address was in the resolution
# if it wasn't then default to CRITICAL
if (metric["answer"] regex ".*127.8.2.1.*") {
return new AlarmStatus(OK, "Resolved the correct address!");
}
return new AlarmStatus(CRITICAL);
The following example uses the Rackspace Cloud Monitoring command line interface (CLI). For information on downloading and installing the CLI, see https://github.com/racker/rackspace-monitoring-cli.
One of the most widely used remote checks is the SSH check. This check not only verifies that something is listening on the expected port, but establishes an SSH session and returns the host key fingerprint as a metric, further verifying that the SSH server is operating as expected.
The following example assumes the existence of an entity with an IP address aliased as eth0 and the ID enk8YUv0Cd, and a notification plan with the ID nplU9hLUgc. This check connects to an SSH server using port 22 by default:
raxmon-checks-create \
--entity-id=enk8YUv0Cd \
--label=ssh \
--type=remote.ssh \
--target-alias=eth0 \
--monitoring-zones=mzord,mzdfw,mzlon
Alarm for this check:
If the monitoring service is unable to connect to the SSH server for the check, any alarms using the check will automatically fail. However, we can additionally verify that the server returns the expected host key fingerprint, which could reveal an unexpected change on the server or a man-in-the-middle attack.
raxmon-alarms-create \
--entity-id=enk8YUv0Cd \
--notification-plan-id=nplU9hLUgc \
--check-id=chTFHxHn0p \
--criteria="if (metric['fingerprint'] != '13dd6c5df600f9a15c67ea5db491ac9a') { return new AlarmStatus(CRITICAL, 'Incorrect SSH Host Fingerprint'); }"
