Operations: Alerting, monitoring and dashboards

I've spent some time in the trenches, working in Operations: supporting services and applications, and the underlying VMs, OS, storage and hardware. This is a 24x7, on-call/standby job, with KPIs/SLAs of 99.9xx% uptime - five 9s (99.999%) means only about 5 minutes of downtime per year. That means you need eyes and ears on every layer of that stack, constantly monitoring and alerting you when things go wrong.
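
To make the numbers concrete, here is the simple downtime-budget arithmetic behind those "number of nines" figures (nothing vendor-specific, just minutes per year):

```python
# Downtime budget per year for common uptime SLAs.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, uptime in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    downtime_minutes = MINUTES_PER_YEAR * (1 - uptime)
    print(f"{label} ({uptime:.3%} uptime): ~{downtime_minutes:.1f} minutes of downtime per year")
```

Five 9s works out to roughly 5.3 minutes per year - which is why every minute of detection and response time matters.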

So when you start building your alerting strategy, first read My Philosophy on Alerting, which recommends things like:

  • keep alerting simple,
  • alert on symptoms,
  • have good consoles to allow pinpointing causes,
  • and avoid having pages where there is nothing to do.

Golden Rule: Alerts must be actionable - which means every alert must result in someone doing something. If nothing is required, kill that alert: rather under-alert than over-alert.

Alerts should not go to email. Email is too slow and out-of-band. Send alerts to the standby engineer's IM (Telegram, Slack) via bots, or use PagerDuty.
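
As a minimal sketch of the bot route, pushing an alert into a Telegram group via the standard Bot API sendMessage call can look like this; the bot token and chat ID are placeholders you would get from @BotFather and your own on-call group:

```python
import json
import urllib.request

# Placeholders - create a bot via @BotFather and use the ID of your on-call group chat.
BOT_TOKEN = "123456:ABC-your-bot-token"
CHAT_ID = "-1001234567890"

def send_alert(text: str) -> None:
    """Push an alert message to the on-call chat via the Telegram Bot API."""
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    payload = json.dumps({"chat_id": CHAT_ID, "text": text}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

send_alert("Disk is 80% full on prod-web-loadbalancer-af5462ce - see https://example.com/runbook/disk")
```

A Slack incoming webhook works the same way: one HTTP POST per alert, straight into the channel the standby engineer is already watching.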

For an engineer on call/standby, this is crucial. They need to know what has to be done, and where to look for more info, quickly (they have to respond to pages within 5 minutes). Alerts that don't need action, or too many alerts, will lead to burn-out and to real alerts being ignored.

We have a set of guidelines for the content of alerts, which all our alerts should follow (a sketch of an alert built to this template comes after the examples):
Make the title/summary descriptive and concise.
✘ ALERT: Something went wrong.
✓ Disk is 80% full on prod-web-loadbalancer-af5462ce.

Make sure to include the metric which triggered the alert somewhere in the body.
✘ Diskspace on a disk is filling.
✓ avg(last_1h):max:system.disk.in_use{env:prod-web-loadbalancer} by {host} > 0.8

The body should also include a description of what the actual problem is, and why it's an issue.
✘ Disk is full.
✓ The disk on this host is at 80% capacity. If it becomes too full it could cause system instability as new files will not be able to be created and current files will not be written to.

Provide clear steps to resolve the problem, or link to a run book. Alerts with neither of these things are useless.
✘ Fix it by deleting stuff.
✓ Follow the run book here for identifying and resolving disk space issues: https://example.com/runbook/disk. Additionally, you should investigate whether log rotation thresholds are sufficient to prevent this happening again; the following run book has the necessary steps: https://example.com/runbook/log-rotate
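
Tying the four guidelines together, here is a sketch of an alert payload built to that template. The field names are purely illustrative, not any particular monitoring tool's schema:

```python
# Illustrative alert payload following the guidelines above.
# Field names are placeholders, not a specific tool's schema.
alert = {
    "title": "Disk is 80% full on prod-web-loadbalancer-af5462ce",
    "metric": "avg(last_1h):max:system.disk.in_use{env:prod-web-loadbalancer} by {host} > 0.8",
    "description": (
        "The disk on this host is at 80% capacity. If it becomes too full it could "
        "cause system instability as new files will not be able to be created."
    ),
    "runbook": "https://example.com/runbook/disk",
}
```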

Read this for what to do on standby, and this, and another one

Site Reliability Engineers (SREs) have made alerting and monitoring into a science.

In summary, here is what I believe the complete monitoring landscape should include, feeding into a consolidated dashboard:

  • Real-time monitoring of response times and queue lengths for the systems you integrate with
  • Historical trending of response times, queue lengths, number of requests, etc.
  • Probing – sending test queries every x minutes and posting the results to a dashboard. This includes End to End (E2E) service monitoring, where you test the service from the perspective of a user, and not just individual servers (a minimal probe sketch follows this list).
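
As a minimal sketch of such a probe (the health-check URL and threshold below are placeholders), you time one test query the way a user would experience it and record the result every few minutes, e.g. from cron:

```python
import time
import urllib.request

# Hypothetical end-to-end check: the URL and threshold are placeholders.
PROBE_URL = "https://example.com/api/health"
MAX_RESPONSE_SECONDS = 2.0

def probe() -> dict:
    """Send one test query and measure the response time, as a user would see it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    elapsed = time.monotonic() - start
    return {"ok": ok and elapsed <= MAX_RESPONSE_SECONDS, "seconds": round(elapsed, 3)}

if __name__ == "__main__":
    # Run every x minutes (e.g. from cron) and push the result to your dashboard.
    print(probe())
```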

For network and server/box monitoring - essentially up/down checks per host - we use LibreNMS, because it comes with auto-discovery and pre-built checks, unlike Nagios, where you need to write everything by hand and can still get it horribly wrong. Nagios has no back-end DB (everything is in flat files) and no API. Icinga, a fork of Nagios, may fix some of these issues.
Microsoft SCOM can monitor Unix boxes as well.
Monit is also a good idea as a process watchdog.

In terms of a dashboard for the probes and real-time checks, we use Dashing, which now lives on as the Smashing fork.
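
Dashing/Smashing exposes a simple HTTP push API for widgets, so a probe can feed its latest result straight onto the board. A sketch, assuming the default port 3030 and using a widget id and auth_token from your own dashboard setup:

```python
import json
import urllib.request

# Placeholders: Dashing's default port is 3030; the widget id and auth_token
# come from your own dashboard configuration.
DASHBOARD = "http://localhost:3030"
AUTH_TOKEN = "YOUR_AUTH_TOKEN"

def push_widget(widget_id: str, data: dict) -> None:
    """Push a data point to a Dashing/Smashing widget via its HTTP push API."""
    payload = dict(data, auth_token=AUTH_TOKEN)
    req = urllib.request.Request(f"{DASHBOARD}/widgets/{widget_id}",
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)

# e.g. push the latest probe response time (in ms) to a "response_time" widget
push_widget("response_time", {"current": 230})
```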

For trending we use ELK/Kibana, and for Application Performance Monitoring (APM) we use Grafana.
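
For the trending/APM side, a common pattern (a sketch, assuming a StatsD-compatible collector feeding Graphite, which Grafana can then graph, listening on the default UDP port 8125) is to emit timing metrics directly from the application:

```python
import socket

# Assumes a StatsD-compatible collector (e.g. feeding Graphite, graphed in
# Grafana) listening on the default UDP port 8125.
STATSD_ADDR = ("127.0.0.1", 8125)

def timing(metric: str, milliseconds: float) -> None:
    """Emit a StatsD timer metric over UDP (fire-and-forget)."""
    packet = f"{metric}:{milliseconds:.0f}|ms".encode()
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(packet, STATSD_ADDR)

# e.g. record how long a checkout request took
timing("webapp.checkout.response_time", 230)
```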

Use automation, like Puppet/Ansible, to roll out the above checks/agents to all systems.

Another good read