Monitor Business Services
COMMERCIAL FEATURE: Access business service monitoring (BSM) in the packaged Sensu Go distribution. For more information, read Get started with commercial features.
NOTE: Business service monitoring (BSM) is in public preview and is subject to change.
Sensu’s business service monitoring (BSM) provides high-level visibility into the current health of any number of your business services. Use BSM to monitor every component in your system with a top-down approach that produces meaningful alerts, prevents alert fatigue, and helps you focus on your core business services.
BSM requires two resources that work together to achieve top-down monitoring: service components and rule templates. Service components are the elements that make up your business services. Rule templates define the monitoring rules that produce events for service components based on customized evaluation expressions.
An example of a business service might be a company website. The website itself might have three service components: the primary webserver that publishes website pages, a backup webserver in case the primary webserver fails, and an inventory database for the shop section of the website. At least one webserver and the database must be in an OK state for the website to be fully available.
In this scenario, you could use BSM to create a current status page for this company website that displays the website’s high-level status at a glance. As long as one webserver and the database have an OK status, the website status is OK. Otherwise, the website status is not OK. Most people probably just want to know whether the website is currently available — it won’t matter to them whether the website is functioning with one or both webservers.
At the same time, the company does want to make sure the right person addresses any webserver failures, even if the website is technically still OK. BSM allows you to customize rule templates that define, for each service component, the threshold for taking action and the action to take.
To continue the company website example, if the primary webserver fails but the backup webserver does not, you might use a rule template that creates a service ticket to be addressed the next workday (in addition to the rule template that emits “OK” events for the current status page). Another monitoring rule might trigger an alert to the on-call operator if both webservers or the inventory database fail.
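The logic of this example can be sketched in a few lines of plain JavaScript. This is only an illustration of the scenario, not a BSM rule template: the `evaluateWebsite` function, its field names, and the action labels are hypothetical. Statuses follow the convention used in Sensu events, where 0 means OK. The text does not say what should happen when only the backup webserver fails, so the sketch assumes it is handled like a primary-only failure.

```javascript
// Hypothetical sketch of the company website example (not a BSM rule template).
// Each status is an event status code: 0 is OK, anything else is a failure.
function evaluateWebsite({ primaryWebserver, backupWebserver, inventoryDatabase }) {
  const failed = (status) => status !== 0;
  const webserversDown = [primaryWebserver, backupWebserver].filter(failed).length;

  // The website is OK while at least one webserver and the database are OK.
  const websiteOK = webserversDown < 2 && !failed(inventoryDatabase);

  let action = "none";
  if (!websiteOK) {
    action = "alert-on-call"; // both webservers or the database failed
  } else if (webserversDown === 1) {
    action = "create-ticket"; // assumption: applies to either single webserver failure
  }
  return { status: websiteOK ? 0 : 2, action };
}

console.log(evaluateWebsite({ primaryWebserver: 2, backupWebserver: 0, inventoryDatabase: 0 }));
// { status: 0, action: 'create-ticket' }
```

The status page only needs `status`, while `action` captures the follow-up work that should happen even when the website is still available.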
NOTE: BSM requires high event throughput. Configure a PostgreSQL datastore to achieve the required throughput and use the BSM feature.
Service component example
Here is an example service component definition that includes the website-services service and applies the built-in aggregate rule template for events generated by checks with the webserver subscription:
---
type: ServiceComponent
api_version: bsm/v1
metadata:
  name: webservers
spec:
  services:
  - website-services
  interval: 60
  query:
  - type: fieldSelector
    value: webserver in event.check.subscriptions
  rules:
  - template: aggregate
    name: webservers_50-70
    arguments:
      critical_threshold: 70
      warning_threshold: 50
  handlers:
  - slack
{
  "type": "ServiceComponent",
  "api_version": "bsm/v1",
  "metadata": {
    "name": "webservers"
  },
  "spec": {
    "services": [
      "website-services"
    ],
    "interval": 60,
    "query": [
      {
        "type": "fieldSelector",
        "value": "webserver in event.check.subscriptions"
      }
    ],
    "rules": [
      {
        "template": "aggregate",
        "name": "webservers_50-70",
        "arguments": {
          "critical_threshold": 70,
          "warning_threshold": 50
        }
      }
    ],
    "handlers": [
      "slack"
    ]
  }
}
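The `query` in this definition determines which events feed the component's rules. As a rough mental model only (Sensu evaluates field selectors itself, and this is not its implementation), the expression `webserver in event.check.subscriptions` behaves like a membership test on each event's check subscriptions. The `selectEvents` helper and the sample events below are hypothetical.

```javascript
// Hypothetical model of the field selector "webserver in event.check.subscriptions":
// keep only the events whose check lists the given subscription.
function selectEvents(events, subscription) {
  return events.filter((event) => event.check.subscriptions.includes(subscription));
}

// Sample events, reduced to the fields this sketch needs.
const sampleEvents = [
  { entity: { name: "web-01" }, check: { name: "check-http", subscriptions: ["webserver"], status: 0 } },
  { entity: { name: "web-02" }, check: { name: "check-http", subscriptions: ["webserver", "linux"], status: 2 } },
  { entity: { name: "db-01" }, check: { name: "check-db", subscriptions: ["database"], status: 0 } },
];

const selected = selectEvents(sampleEvents, "webserver");
console.log(selected.map((event) => event.entity.name)); // [ 'web-01', 'web-02' ]
```

Only the selected events are passed to the rule template, so the database event in this sample would not count toward the webserver aggregate.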
Rule template example
This example lists the definition for the built-in aggregate rule template:
---
type: RuleTemplate
api_version: bsm/v1
metadata:
  name: aggregate
  namespace: default
spec:
  arguments:
    properties:
      critical_count:
        description: create an event with a critical status if there the number of
          critical events is equal to or greater than this count
        type: number
      critical_threshold:
        description: create an event with a critical status if the percentage of non-zero
          events is equal to or greater than this threshold
        type: number
      metric_handlers:
        default: {}
        description: metric handlers to use for produced metrics
        items:
          type: string
        type: array
      produce_metrics:
        default: {}
        description: produce metrics from aggregate data and include them in the produced
          event
        type: boolean
      set_metric_annotations:
        default: {}
        description: annotate the produced event with metric annotations
        type: boolean
      warning_count:
        description: create an event with a warning status if there the number of
          critical events is equal to or greater than this count
        type: number
      warning_threshold:
        description: create an event with a warning status if the percentage of non-zero
          events is equal to or greater than this threshold
        type: number
    required:
  description: Monitor a distributed service - aggregate one or more events into a
    single event. This BSM rule template allows you to treat the results of multiple
    disparate check executions – executed across multiple disparate systems – as a
    single event. This template is extremely useful in dynamic environments and/or
    environments that have a reasonable tolerance for failure. Use this template when
    a service can be considered healthy as long as a minimum threshold is satisfied
    (e.g. at least 5 healthy web servers? at least 70% of N processes healthy?).
  eval: |2
    if (events && events.length == 0) {
        event.check.output = "WARNING: No events selected for aggregate\n";
        event.check.status = 1;
        return event;
    }
    event.annotations["io.sensu.bsm.selected_event_count"] = events.length;
    percentOK = sensu.PercentageBySeverity("ok");
    if (!!args["produce_metrics"]) {
        var ts = Math.floor(new Date().getTime() / 1000);
        event.timestamp = ts;
        var tags = [
            {
                name: "service",
                value: event.entity.name
            },
            {
                name: "entity",
                value: event.entity.name
            },
            {
                name: "check",
                value: event.check.name
            }
        ];
        event.metrics = sensu.NewMetrics({
            points: [
                {
                    name: "percent_non_zero",
                    timestamp: ts,
                    value: sensu.PercentageBySeverity("non-zero"),
                    tags: tags
                },
                {
                    name: "percent_ok",
                    timestamp: ts,
                    value: percentOK,
                    tags: tags
                },
                {
                    name: "percent_warning",
                    timestamp: ts,
                    value: sensu.PercentageBySeverity("warning"),
                    tags: tags
                },
                {
                    name: "percent_critical",
                    timestamp: ts,
                    value: sensu.PercentageBySeverity("critical"),
                    tags: tags
                },
                {
                    name: "percent_unknown",
                    timestamp: ts,
                    value: sensu.PercentageBySeverity("unknown"),
                    tags: tags
                },
                {
                    name: "count_non_zero",
                    timestamp: ts,
                    value: sensu.CountBySeverity("non-zero"),
                    tags: tags
                },
                {
                    name: "count_ok",
                    timestamp: ts,
                    value: sensu.CountBySeverity("ok"),
                    tags: tags
                },
                {
                    name: "count_warning",
                    timestamp: ts,
                    value: sensu.CountBySeverity("warning"),
                    tags: tags
                },
                {
                    name: "count_critical",
                    timestamp: ts,
                    value: sensu.CountBySeverity("critical"),
                    tags: tags
                },
                {
                    name: "count_unknown",
                    timestamp: ts,
                    value: sensu.CountBySeverity("unknown"),
                    tags: tags
                }
            ]
        });
        if (!!args["metric_handlers"]) {
            event.metrics.handlers = args["metric_handlers"].slice();
        }
        if (!!args["set_metric_annotations"]) {
            var i = 0;
            while(i < event.metrics.points.length) {
                event.annotations["io.sensu.bsm.selected_event_" + event.metrics.points[i].name] = event.metrics.points[i].value.toString();
                i++;
            }
        }
    }
    if (!!args["critical_threshold"] && percentOK <= args["critical_threshold"]) {
        event.check.output = "CRITICAL: Less than " + args["critical_threshold"].toString() + "% of selected events are OK (" + percentOK.toString() + "%)\n";
        event.check.status = 2;
        return event;
    }
    if (!!args["warning_threshold"] && percentOK <= args["warning_threshold"]) {
        event.check.output = "WARNING: Less than " + args["warning_threshold"].toString() + "% of selected events are OK (" + percentOK.toString() + "%)\n";
        event.check.status = 1;
        return event;
    }
    if (!!args["critical_count"]) {
        crit = sensu.CountBySeverity("critical");
        if (crit >= args["critical_count"]) {
            event.check.output = "CRITICAL: " + args["critical_count"].toString() + " or more selected events are in a critical state (" + crit.toString() + ")\n";
            event.check.status = 2;
            return event;
        }
    }
    if (!!args["warning_count"]) {
        warn = sensu.CountBySeverity("warning");
        if (warn >= args["warning_count"]) {
            event.check.output = "WARNING: " + args["warning_count"].toString() + " or more selected events are in a warning state (" + warn.toString() + ")\n";
            event.check.status = 1;
            return event;
        }
    }
    event.check.output = "Everything looks good (" + percentOK.toString() + "% OK)";
    event.check.status = 0;
    return event;
{
  "type": "RuleTemplate",
  "api_version": "bsm/v1",
  "metadata": {
    "name": "aggregate",
    "namespace": "default"
  },
  "spec": {
    "arguments": {
      "properties": {
        "critical_count": {
          "description": "create an event with a critical status if there the number of critical events is equal to or greater than this count",
          "type": "number"
        },
        "critical_threshold": {
          "description": "create an event with a critical status if the percentage of non-zero events is equal to or greater than this threshold",
          "type": "number"
        },
        "metric_handlers": {
          "default": {},
          "description": "metric handlers to use for produced metrics",
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        "produce_metrics": {
          "default": {},
          "description": "produce metrics from aggregate data and include them in the produced event",
          "type": "boolean"
        },
        "set_metric_annotations": {
          "default": {},
          "description": "annotate the produced event with metric annotations",
          "type": "boolean"
        },
        "warning_count": {
          "description": "create an event with a warning status if there the number of critical events is equal to or greater than this count",
          "type": "number"
        },
        "warning_threshold": {
          "description": "create an event with a warning status if the percentage of non-zero events is equal to or greater than this threshold",
          "type": "number"
        }
      },
      "required": null
    },
    "description": "Monitor a distributed service - aggregate one or more events into a single event. This BSM rule template allows you to treat the results of multiple disparate check executions – executed across multiple disparate systems – as a single event. This template is extremely useful in dynamic environments and/or environments that have a reasonable tolerance for failure. Use this template when a service can be considered healthy as long as a minimum threshold is satisfied (e.g. at least 5 healthy web servers? at least 70% of N processes healthy?).",
    "eval": "\nif (events \u0026\u0026 events.length == 0) {\n event.check.output = \"WARNING: No events selected for aggregate\\n\";\n event.check.status = 1;\n return event;\n}\n\nevent.annotations[\"io.sensu.bsm.selected_event_count\"] = events.length;\n\npercentOK = sensu.PercentageBySeverity(\"ok\");\n\nif (!!args[\"produce_metrics\"]) {\n var ts = Math.floor(new Date().getTime() / 1000);\n\n event.timestamp = ts;\n\n var tags = [\n {\n name: \"service\",\n value: event.entity.name\n },\n {\n name: \"entity\",\n value: event.entity.name\n },\n {\n name: \"check\",\n value: event.check.name\n }\n ];\n\n event.metrics = sensu.NewMetrics({\n points: [\n {\n name: \"percent_non_zero\",\n timestamp: ts,\n value: sensu.PercentageBySeverity(\"non-zero\"),\n tags: tags\n },\n {\n name: \"percent_ok\",\n timestamp: ts,\n value: percentOK,\n tags: tags\n },\n {\n name: \"percent_warning\",\n timestamp: ts,\n value: sensu.PercentageBySeverity(\"warning\"),\n tags: tags\n },\n {\n name: \"percent_critical\",\n timestamp: ts,\n value: sensu.PercentageBySeverity(\"critical\"),\n tags: tags\n },\n {\n name: \"percent_unknown\",\n timestamp: ts,\n value: sensu.PercentageBySeverity(\"unknown\"),\n tags: tags\n },\n {\n name: \"count_non_zero\",\n timestamp: ts,\n value: sensu.CountBySeverity(\"non-zero\"),\n tags: tags\n },\n {\n name: \"count_ok\",\n timestamp: ts,\n value: sensu.CountBySeverity(\"ok\"),\n tags: tags\n },\n {\n name: \"count_warning\",\n timestamp: ts,\n value: sensu.CountBySeverity(\"warning\"),\n tags: tags\n },\n {\n name: \"count_critical\",\n timestamp: ts,\n value: sensu.CountBySeverity(\"critical\"),\n tags: tags\n },\n {\n name: \"count_unknown\",\n timestamp: ts,\n value: sensu.CountBySeverity(\"unknown\"),\n tags: tags\n }\n ]\n });\n\n if (!!args[\"metric_handlers\"]) {\n event.metrics.handlers = args[\"metric_handlers\"].slice();\n }\n\n if (!!args[\"set_metric_annotations\"]) {\n var i = 0;\n\n while(i \u003c event.metrics.points.length) {\n event.annotations[\"io.sensu.bsm.selected_event_\" + event.metrics.points[i].name] = event.metrics.points[i].value.toString();\n i++;\n }\n }\n}\n\nif (!!args[\"critical_threshold\"] \u0026\u0026 percentOK \u003c= args[\"critical_threshold\"]) {\n event.check.output = \"CRITICAL: Less than \" + args[\"critical_threshold\"].toString() + \"% of selected events are OK (\" + percentOK.toString() + \"%)\\n\";\n event.check.status = 2;\n return event;\n}\n\nif (!!args[\"warning_threshold\"] \u0026\u0026 percentOK \u003c= args[\"warning_threshold\"]) {\n event.check.output = \"WARNING: Less than \" + args[\"warning_threshold\"].toString() + \"% of selected events are OK (\" + percentOK.toString() + \"%)\\n\";\n event.check.status = 1;\n return event;\n}\n\nif (!!args[\"critical_count\"]) {\n crit = sensu.CountBySeverity(\"critical\");\n\n if (crit \u003e= args[\"critical_count\"]) {\n event.check.output = \"CRITICAL: \" + args[\"critical_count\"].toString() + \" or more selected events are in a critical state (\" + crit.toString() + \")\\n\";\n event.check.status = 2;\n return event;\n }\n}\n\nif (!!args[\"warning_count\"]) {\n warn = sensu.CountBySeverity(\"warning\");\n\n if (warn \u003e= args[\"warning_count\"]) {\n event.check.output = \"WARNING: \" + args[\"warning_count\"].toString() + \" or more selected events are in a warning state (\" + warn.toString() + \")\\n\";\n event.check.status = 1;\n return event;\n }\n}\n\nevent.check.output = \"Everything looks good (\" + percentOK.toString() + \"% OK)\";\nevent.check.status = 0;\n\nreturn event;\n"
  }
}
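To see how the threshold branches of the `eval` expression behave, the following standalone sketch reimplements only those branches in plain JavaScript. It does not run inside Sensu: `percentageOK` and `evaluateThresholds` are hypothetical stand-ins for `sensu.PercentageBySeverity("ok")` and the template logic, the input is a plain array of status codes, and the sketch assumes the percentage is expressed on a 0 to 100 scale.

```javascript
// Hypothetical stand-in for sensu.PercentageBySeverity("ok"):
// the share of selected events with status 0, on a 0-100 scale (assumption).
function percentageOK(statuses) {
  const okCount = statuses.filter((status) => status === 0).length;
  return (okCount / statuses.length) * 100;
}

// Mirrors the order of the threshold branches in the eval expression:
// empty selection first, then the critical threshold, then the warning threshold.
function evaluateThresholds(statuses, args) {
  if (statuses.length === 0) {
    return { status: 1, percentOK: null }; // no events selected
  }
  const percentOK = percentageOK(statuses);
  if (!!args.critical_threshold && percentOK <= args.critical_threshold) {
    return { status: 2, percentOK };
  }
  if (!!args.warning_threshold && percentOK <= args.warning_threshold) {
    return { status: 1, percentOK };
  }
  return { status: 0, percentOK };
}

// Two of three events are OK (about 66.7%): above 50, at or below 70.
console.log(evaluateThresholds([0, 0, 2], { critical_threshold: 50, warning_threshold: 70 }).status); // 1

// Three of five events are OK (60%), evaluated with critical 70 and warning 50.
console.log(evaluateThresholds([0, 0, 0, 2, 2], { critical_threshold: 70, warning_threshold: 50 }).status); // 2
```

Because the critical branch is checked first and both branches compare the OK percentage with `<=`, the warning branch in this sketch can only be reached when `warning_threshold` is higher than `critical_threshold`.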
Configure BSM via the web UI
The Sensu web UI BSM module allows you to create, edit, and delete service components and rule templates inside the web UI.
Configure BSM via APIs and sensuctl
BSM service components and rule templates are Sensu resources with complete definitions, so you can use Sensu’s service component and rule template APIs to create, retrieve, update, and delete service components and rule templates.
You can also use sensuctl to create and manage service components and rule templates via the APIs from the command line.
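As a minimal sketch of what retrieving service components through the API could look like, the helper below only builds a request description and does not contact a backend. The endpoint path and the `Key` authorization scheme are assumptions based on the usual pattern for Sensu enterprise APIs, so confirm them against the service component API reference for your Sensu version. The function name, backend URL, and API key value are hypothetical.

```javascript
// Hypothetical helper: describes a request that lists service components.
// The path and the "Key" authorization scheme are assumptions to verify
// against the service component API reference for your Sensu version.
function listServiceComponentsRequest(backendUrl, namespace, apiKey) {
  const path = `/api/enterprise/bsm/v1/namespaces/${encodeURIComponent(namespace)}/service-components`;
  return {
    url: new URL(path, backendUrl).toString(),
    options: {
      method: "GET",
      headers: { Authorization: `Key ${apiKey}` },
    },
  };
}

const request = listServiceComponentsRequest("http://127.0.0.1:8080", "default", "EXAMPLE_API_KEY");
console.log(request.url);
// http://127.0.0.1:8080/api/enterprise/bsm/v1/namespaces/default/service-components
```

Passing `request.url` and `request.options` to `fetch` would send the request. The sketch stops short of that so it can run without a Sensu backend.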