Event Engine

Introduction

The Event Engine is the backend process used by NAV to process the event queue. Whenever a NAV subsystem posts an event to the queue, the Event Engine will pick it up and decide what to do with it.

Typically, the Event Engine will generate an alert from the event, or it may ignore the event entirely, depending on the circumstances. In some cases, it will delay the alert for a grace period, while waiting for another corresponding event to resolve the pending problem.

Plugins

Most of the work of the Event Engine is done by event handler plugins from the nav.eventengine.plugins namespace. Each event picked from the queue is offered to each of the plugins in turn, until one of them decides to handle it. If no plugin wants to handle the event, the Event Engine performs a very simple default routine that translates the event directly into an alert (possibly using alert hints given in the event itself).
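The offer-until-handled dispatch described above can be sketched in Python roughly as follows. This is an illustration only: the class names, the can_handle/handle methods, the dictionary-based event and the alert_hint key are all hypothetical, and are not NAV's actual plugin API.

```python
class EventHandler:
    """Hypothetical plugin base class (not NAV's real API)."""

    handled_types = ()

    @classmethod
    def can_handle(cls, event):
        # A plugin volunteers only for the event types it knows about
        return event["event_type"] in cls.handled_types

    def __init__(self, event):
        self.event = event

    def handle(self):
        raise NotImplementedError


class BoxStateHandler(EventHandler):
    """Hypothetical plugin that only cares about boxState events."""

    handled_types = ("boxState",)

    def handle(self):
        return f"boxState handled for {self.event['subject']}"


def default_routine(event):
    # Fallback: translate the event directly into an alert, preferring a
    # hint carried by the event itself ("alert_hint" is an assumed key)
    hint = event.get("alert_hint", event["event_type"])
    return f"default alert: {hint}"


def dispatch(event, plugins):
    # Offer the event to each plugin in turn; the first taker wins
    for plugin in plugins:
        if plugin.can_handle(event):
            return plugin(event).handle()
    return default_routine(event)


print(dispatch({"event_type": "boxState", "subject": "sw1"}, [BoxStateHandler]))
print(dispatch({"event_type": "somethingElse"}, [BoxStateHandler]))
```

The first call is claimed by the plugin, while the second falls through to the default routine.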

Configuration

The operation of the Event Engine can be customized using configuration options in eventengine.conf. Most of the options set the grace periods (timeouts) for various types of alerts. The default configuration looks somewhat like this:

# NAV eventengine configuration
[export]
# If set, the script option will point to a program that will receive a
# continuous stream of JSON serialized alert objects on its STDIN.
;script = /path/to/event/receiver/script

[timeouts]
#
# This section configures timeout values for alert quarantines. A quarantine
# is when one or multiple alerts are held back for a period of time, while
# waiting for the problem to resolve itself. It is a protection against a
# torrent of alerts for things that are rapidly flapping. 
#
# All options are commented out with default values. Uncomment to change the
# defaults. Valid units are s=seconds, m=minutes, h=hours, d=days.
#

# When a boxDown event is received, how long to wait for resolve before
# sending out a boxDownWarning.
;boxDown.warning = 1m

# When a boxDown event is received, how long to wait for resolve before
# finally declaring the IP device as down.
;boxDown.alert = 4m

# When a moduleDown event is received, how long to wait for resolve before
# sending out a moduleDownWarning.
;moduleDown.warning = 1m

# When a moduleDown event is received, how long to wait for resolve before
# finally declaring the module as down.
;moduleDown.alert = 4m

# When a linkDown event is received, how long to wait for resolve before
# finally declaring the link as down.
;linkDown.alert = 4m

# When an snmpAgentDown event is received, how long to wait for resolve before
# finally declaring the SNMP agent as down.
;snmpAgentDown.alert = 4m

# When a bgpDown event is received, how long to wait for resolve before
# finally declaring the BGP session to be down.
;bgpDown.alert = 1m

[linkdown]
# This section contains options to control which link down events to
# send alerts about. Also see settings in ipdevpoll.conf about which links to
# generate events for in the first place.

# When enabled, only link loss on redundant links cause alerts to be sent.
# The rationale is that on a non-redundant link, you will get boxDown alerts
# for the devices behind that link, which are now unreachable.
;only_redundant = yes

# If a linkDown event is posted for a switch port that doesn't carry any of
# these vlans (tagged or untagged), no alert is sent. This is a
# space-separated list of VLAN tag numbers. An empty value means no filtering
# based on VLAN.
;limit_to_vlans =
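As an illustration of how the timeout values in the [timeouts] section might be interpreted, here is a minimal Python sketch that converts values such as 4m into seconds and falls back to the documented defaults when an option is left commented out. The helper names parse_timeout and get_timeout are hypothetical, and the strict "integer followed by one unit letter" syntax is an assumption; the Event Engine's own parser may accept other forms.

```python
import configparser
import re

# Units as documented: s=seconds, m=minutes, h=hours, d=days
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Default values as listed in the commented-out options above
DEFAULTS = {
    "boxDown.warning": "1m",
    "boxDown.alert": "4m",
    "moduleDown.warning": "1m",
    "moduleDown.alert": "4m",
    "linkDown.alert": "4m",
    "snmpAgentDown.alert": "4m",
    "bgpDown.alert": "1m",
}


def parse_timeout(value):
    """Convert a value like '4m' to seconds (assumed syntax: <int><unit>)."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhd])\s*", value)
    if not match:
        raise ValueError(f"invalid timeout value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]


def get_timeout(config, option):
    """Return the configured timeout in seconds, or the documented default."""
    raw = config.get("timeouts", option, fallback=DEFAULTS[option])
    return parse_timeout(raw)


config = configparser.ConfigParser()
# Only boxDown.alert is uncommented and changed in this sample
config.read_string("[timeouts]\nboxDown.alert = 10m\n")

print(get_timeout(config, "boxDown.alert"))    # 600
print(get_timeout(config, "boxDown.warning"))  # 60
```

Note that configparser treats lines starting with ; or # as comments by default, which is why the commented-out options simply fall back to their defaults here.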

Alert severity

All NAV alerts (as generated by Event Engine) are assigned a severity value, in the interval 1 through 5. These values can be used as part of your users’ Alert Profile filters, and should be interpreted roughly like this:

5 = Information
4 = Low
3 = Moderate
2 = High
1 = Critical

Severity values are normally chosen by the NAV program that generates the event that an alert is based on. However, NAV cannot know how severe any given alert actually is for your NOC. Therefore, the Event Engine lets you configure your own severity rules, using YAML syntax, in the configuration file severity.yml. Any rules present in this file will be processed to set or modify the existing severity of any matching alert that is generated.

Configuring `severity.yml`

Here is an example severity configuration:

severity.yml

---
default-severity: 3
rules:
    - alert_type: boxDown
      severity: 2
      rules:
          - netbox.category.id: GSW
            severity: 1
          - netbox.category.id: GW
            severity: 1

    - alert_type: boxDownWarning
      severity: 5

    - netbox.organization.id: foobar
      severity: '+2'

This configuration starts off by assigning a default severity level of 3 to every alert that Event Engine generates, regardless of what the original severity value of the event was.

Then follows a list of rules that will be processed in the order they appear in the file. Each rule consists of:

One or more alert attribute match expressions.
One severity value modification expression to be applied to an alert that matches the attribute expressions.
Optionally, a sub-list of more rules to further apply to any alert that matched the expressions of this rule.

The first example rule will match any alert whose alert_type value equals boxDown (NAV’s alert type for a lasting “box is unreachable” incident). Any such alert will be assigned a severity level of 2. Furthermore, the rule lists two additional sub-rules to ensure that if the boxDown alert was issued for any netbox (IP Device) whose category is a router (a category id of either GSW or GW), the severity is set to the most critical level of 1.

The second top-level example rule will match any alert whose type is boxDownWarning, and set its severity to the least critical level of 5. This is the stateless early warning the Event Engine issues a few minutes before declaring a stateful boxDown. It is safe to consider this type of alert as only informational.

The final top-level example rule will match any alert whose associated netbox (IP Device) is owned by the organizational id foobar. This rule uses a severity modifier expression of +2, which will add 2 to the current alert’s existing severity value.

In summary, if a boxDown alert is dispatched for a router in your network, this rule set will ensure its severity is set to 1. However, if the router belongs to your less important foobar department, the +2 modifier raises the severity value by two, making the alert two levels less critical, and it comes out with a severity of 3.
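To make the walk-through concrete, the following Python sketch evaluates the example rule set against alerts represented as plain nested dictionaries. It is a simplified model and not the Event Engine's implementation: the function names are hypothetical, attributes are looked up in dictionaries rather than on NAV model objects, and clamping to the 1-5 range after every step (rather than once at the end) is an assumption.

```python
DEFAULT_SEVERITY = 3

# The example severity.yml, written out as the Python structure it describes
RULES = [
    {"alert_type": "boxDown", "severity": 2, "rules": [
        {"netbox.category.id": "GSW", "severity": 1},
        {"netbox.category.id": "GW", "severity": 1},
    ]},
    {"alert_type": "boxDownWarning", "severity": 5},
    {"netbox.organization.id": "foobar", "severity": "+2"},
]


def lookup(alert, dotted_name):
    """Resolve a dotted attribute name such as 'netbox.category.id'."""
    value = alert
    for part in dotted_name.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def modify(current, expression):
    """Apply an absolute (2) or relative ('+2') modifier, clamped to 1-5."""
    if isinstance(expression, str) and expression.startswith(("+", "-")):
        current += int(expression)
    else:
        current = int(expression)
    return max(1, min(5, current))


def apply_rules(alert, rules, severity):
    """Process rules in file order, descending into sub-rules on a match."""
    for rule in rules:
        matchers = {k: v for k, v in rule.items() if k not in ("severity", "rules")}
        if all(lookup(alert, name) == wanted for name, wanted in matchers.items()):
            if "severity" in rule:
                severity = modify(severity, rule["severity"])
            severity = apply_rules(alert, rule.get("rules", []), severity)
    return severity


def router_down(organization):
    return {"alert_type": "boxDown",
            "netbox": {"category": {"id": "GW"}, "organization": {"id": organization}}}


print(apply_rules(router_down("noc"), RULES, DEFAULT_SEVERITY))     # 1
print(apply_rules(router_down("foobar"), RULES, DEFAULT_SEVERITY))  # 3
```

The two printed values correspond to the two cases in the summary above: 1 for an ordinary router, and 3 for a router owned by foobar.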

Modifier expressions

There are two types of supported severity modifier expressions for use in rules:

Absolute values: An absolute integer will replace a matching alert’s current severity level.
Relative values: Prefixing an integer with + (or -) will increase (or decrease) the existing severity value by the given amount.

Event Engine will silently ensure that no assigned or calculated severity value ever falls outside the valid range of 1-5.

Important

Please note that relative values must be enclosed in quotes, to avoid confusion with absolute values. YAML interprets +2 as the absolute value of 2, while '+2' is a relative value.

A good practice would be to always quote your values, as that will work as intended in all cases.
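The difference between quoted and unquoted values can be illustrated with a small Python sketch of what a rule processor sees after the YAML file has been parsed: an unquoted +2 arrives as the integer 2, whereas a quoted '+2' arrives as the string "+2". The function apply_modifier is hypothetical and only models the behavior described in this section; it is not NAV's code.

```python
def apply_modifier(current, expression):
    """Return the new severity for a parsed modifier expression."""
    if isinstance(expression, int):
        # Unquoted YAML integers (including +2) have already lost their
        # sign prefix, so they can only be treated as absolute values
        new = expression
    else:
        text = str(expression).strip()
        if text.startswith(("+", "-")):
            new = current + int(text)  # relative: '+2' or '-1'
        else:
            new = int(text)            # quoted absolute value: '2'
    # Keep the result inside the valid range of 1-5
    return max(1, min(5, new))


print(apply_modifier(4, 2))     # unquoted +2 parsed as int 2: absolute, gives 2
print(apply_modifier(4, "+2"))  # quoted '+2': relative, 6 is clamped to 5
print(apply_modifier(2, "-3"))  # quoted '-3': relative, -1 is clamped to 1
print(apply_modifier(4, "2"))   # quoted absolute value, gives 2
```

Because quoted absolute values and quoted relative values are both handled correctly in this model, always quoting works in every case, which is the point of the advice above.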

Available `event_type` and `alert_type` values

Two of the alert attributes that can be matched against in severity rules are event_type and alert_type. However, event_type is a Python object: to match against an event type id/name, you must match against the object’s id attribute, i.e. event_type.id. See the event- and alert-type reference documentation for a detailed list of available type names to match.

Other matchable attributes

Most alerts generated by the Event Engine are associated with a specific IP device registered in NAV (known as netbox internally). Severity rules can be used to match against attributes of IP devices, or even sub-attributes thereof. As with the examples above, the ID (or name) of the organizational unit that is responsible for an IP device can be read from netbox.organization.id. The ID of the wiring closet this device is located in (as organized by you, the admin, in SeedDB), can be had from netbox.room.id.

See the reference documentation for the Netbox model to see all the available attributes of an IP device.

Exporting alerts from NAV into other systems

The Event Engine can be made to export a stream of alerts. By setting the script option in the [export] section of eventengine.conf to the path of an executable program or script, the Event Engine will start that program and feed a continuous stream of JSON blobs, describing the alerts it generates, to that program’s STDIN.

Alert JSON format

The Event Engine will export each alert as a discrete JSON structure. The receiving script will therefore need to be able to parse the beginning and end of each such object as it arrives. Each object will be separated by a newline, but no guarantees are made that the JSON blobs themselves will not also contain newlines.

Tip

A Stack Overflow comment describes how Python’s existing JSON library can be used to decode arbitrarily big strings of “stacked” JSON, as is the case with the alert export stream.
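One way to decode such a stream is sketched below, using json.JSONDecoder.raw_decode from Python's standard library, which parses one JSON value from a given offset and reports where it ended. The function names decode_stacked and process_stream are hypothetical. The sketch assumes that a decoding error means "the object is not complete yet", so a genuinely malformed object would stall the stream; a production receiver would need a way to detect and skip such input.

```python
import json

_decoder = json.JSONDecoder()


def decode_stacked(buffer):
    """Decode all complete JSON values in buffer.

    Returns a tuple of (decoded objects, undecoded remainder).
    """
    objects = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so step over the
        # newlines that separate the objects
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos >= len(buffer):
            break
        try:
            obj, pos = _decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            break  # assumed to be an incomplete object: wait for more input
        objects.append(obj)
    return objects, buffer[pos:]


def process_stream(stream, handle):
    """Read a text stream line by line and call handle() for each alert."""
    buffer = ""
    for line in stream:
        buffer += line
        alerts, buffer = decode_stacked(buffer)
        for alert in alerts:
            handle(alert)
```

A receiver script could then call, for example, process_stream(sys.stdin, print) after importing sys. Since input is buffered until a whole object has arrived, objects that span several lines are handled as well.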

An exported alert may look like this as JSON:

{
   "id" : 212310,
   "history" : 196179,
   "time" : "2019-11-05T10:03:10.235877",
   "message" : "box down example-sw.example.org 10.0.1.42",
   "source" : "pping",
   "state" : "s",
   "on_maintenance" : false,
   "netbox" : 138,
   "device_groups" : null,
   "device" : null,
   "subid" : "",
   "subject_type" : "Netbox",
   "subject" : "example-sw.example.org",
   "subject_url" : "/ipdevinfo/example-sw.example.org/",
   "alert_details_url" : "/api/alert/196179/",
   "netbox_history_url" : "/devicehistory/history/%3Fnetbox=138",
   "event_history_url" : "/devicehistory/history/?eventtype=e_boxState",
   "event_type" : {
      "description" : "Tells us whether a network-unit is down or up.",
      "id" : "boxState"
   },
   "alert_type" : {
      "description" : "Box declared down.",
      "name" : "boxDown"
   },
   "severity" : 3,
   "value" : 100
}

Attributes explained

These are the attributes present in the JSON blob describing an alert:

id

The internal integer ID of this alert in NAV. This number is volatile, as the alert object disappears from NAV as soon as the Alert Engine has completed its processing of the alert.

history

The internal integer ID of NAV’s corresponding alert history entry. I.e., if this alert created a new problem state in NAV, this will be a new ID. If this alert resolves or otherwise concerns an existing state in NAV, this will refer to the pre-existing history ID.

E.g. if a boxDown alert is issued for an IP device, and later, a boxUp alert is issued for the same IP device, both of these alerts will refer to the same alert history entry.

time

This is the timestamp of the alert, in ISO8601 format. Usually, this corresponds to the timestamp of the originating event. E.g., for boxState type alerts, this corresponds to the exact timestamp the pping program reported it could no longer receive ICMP echo replies from a device.
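As a small illustration, a timestamp in this form can be parsed with Python's datetime.fromisoformat. The helper names below are hypothetical. Note that the sample timestamp carries no UTC offset, so the result is a naive datetime; which time zone it should be interpreted in is not stated here, and the sketch simply assumes the caller compares it with another naive datetime from the same clock.

```python
from datetime import datetime


def parse_alert_time(alert):
    """Parse the alert's ISO 8601 timestamp into a datetime object."""
    return datetime.fromisoformat(alert["time"])


def alert_age_seconds(alert, now):
    """Seconds elapsed between the alert timestamp and 'now'."""
    return (now - parse_alert_time(alert)).total_seconds()


alert = {"time": "2019-11-05T10:03:10.235877"}
print(parse_alert_time(alert))
print(alert_age_seconds(alert, now=datetime(2019, 11, 5, 10, 4, 10, 235877)))  # 60.0
```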

message

This is a short, human-readable description of what the alert is all about.

source

This is a reference to the NAV subsystem that posted the original event that caused this alert.

state

This is NAV’s internal moniker for the state represented by this alert:

x: This is a stateless alert (e.g. a generic warning or point-in-time event)
s: This alert starts a new state in the alert history table.
e: This alert ends (resolves) an existing state in the alert history table.
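A receiving script could use the state attribute together with the history attribute to keep track of which problems are currently open, since a start alert and its resolving alert refer to the same alert history entry. The sketch below is one hypothetical way of doing so; the function name and its return labels are made up for illustration.

```python
def track(alert, open_states):
    """Update open_states (keyed by history ID) and describe what happened."""
    state = alert["state"]
    if state == "s":
        # A new state begins: remember it under its alert history ID
        open_states[alert["history"]] = alert
        return "opened"
    if state == "e":
        # A state ends: the history ID points back at the start alert
        started = open_states.pop(alert["history"], None)
        return "resolved" if started is not None else "resolved-unknown"
    # 'x': stateless alerts neither open nor close anything
    return "stateless"


open_states = {}
down = {"state": "s", "history": 196179, "alert_type": {"name": "boxDown"}}
up = {"state": "e", "history": 196179, "alert_type": {"name": "boxUp"}}

print(track(down, open_states), len(open_states))  # opened 1
print(track(up, open_states), len(open_states))    # resolved 0
```

The resolved-unknown label covers the case where the script was started after the problem began, and therefore never saw the corresponding start alert.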

on_maintenance

A boolean that tells you whether the subject of this alert is currently on active maintenance, according to NAV’s schedule. This would typically be used to withhold notifications about alerts that occur during a known maintenance period for a device.
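For example, a receiving script might decide whether to pass an alert on to a notification system along these lines. The function should_notify and its least_critical threshold are hypothetical policy choices for illustration, not something NAV prescribes; the severity comparison relies on 1 being the most critical level.

```python
def should_notify(alert, least_critical=3):
    """Decide whether an exported alert deserves a notification."""
    # Withhold anything that happens during a known maintenance period
    if alert.get("on_maintenance"):
        return False
    # Lower numbers are more critical; treat a missing severity as 5
    return alert.get("severity", 5) <= least_critical


print(should_notify({"on_maintenance": False, "severity": 2}))  # True
print(should_notify({"on_maintenance": True, "severity": 1}))   # False
print(should_notify({"on_maintenance": False, "severity": 5}))  # False
```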

netbox

A database primary key to the IP device this alert is associated with.

device_groups

A list of NAV device groups that the associated IP device is a member of.

device

A database primary key to the physical device this alert is associated with.

subid

If this alert’s subject is a sub-component of the IP device referenced in the netbox attribute, this will be some internal sub-ID of this component. This reference ID can be interpreted differently, depending on the alert type, which is what NAV does when the subject attribute described below is composed.

subject

An object that describes the alert’s actual subject (or object, if you will, since NAV’s terminology is grammatically challenged).

subject_type

NAV’s internal model name of the subject’s data type. This would typically be things like Netbox, Interface, Module, GatewayPeerSession etc.

A subject_type value combined with the subid value can be used as a unique identifier of a NAV component by a 3rd party tool.

subject_url

A relative canonical URI to a NAV web page (meant for human consumption) describing the alert’s subject.

alert_details_url

A relative canonical URI to NAV’s REST API, where the details of the alert state entry can be retrieved.

netbox_history_url

A relative canonical URI to a NAV web page (meant for human consumption) detailing the recent alert history of this alert’s associated IP device.

event_history_url

A relative canonical URI to a NAV web page (meant for human consumption) detailing the recent history of alerts of the same event type (e.g. all the recent alerts of the boxState category, if this is a boxDown alert).

event_type

A sub-structure describing the event category of this alert:

id: The event category id of this alert.
description: A description of said event category.

alert_type

A sub-structure describing the alert type of this alert.

name: The alert type name of this alert.
description: A description of said alert type.

severity

The severity of this alert. This is usually an integer in the interval 1 through 5, where 1 is the most critical level.

value

The alert value. This is usually an integer in the range 0-100, but at the moment, this carries no specific meaning in NAV.