Note: This post will be updated as I move through the project.
I’m rebuilding a monitoring and alerting infrastructure and I thought I would bring you along. I don’t consider myself an expert in any of this, but I do (sometimes) use (some of) it effectively for real-world stuff.
The objectives of this project are:
- Be able to visualize the current operational state of all of the devices and supporting services.
- Receive alerts when any device or supporting service is not operating properly.
- Produce reports about the historical operation of the devices and supporting services.
- Allow a “doesn’t care about how any of this works” user to access all of the above through a single login/interface.
- Be able to quickly and easily redeploy the entire stack if the hardware it runs on fails. I’ll be using containerized applications (Docker) where possible, so this will probably involve some Docker Compose files and config file backups.
I’ve been dabbling with this for a while, so I am already doing some of this monitoring and alerting with an earlier iteration of this setup. In this project I will be streamlining my existing setup, adding some more stuff, and hopefully “doing it right”.
Here is the tech I will be using (subject to change as I get into it):
- InfluxDB for storing time series and event data. I’ve used this before for storing time series, but not really for events. We’ll see how it goes.
- Grafana for visualization. I’ve been using Grafana very casually for a while.
- Telegraf for data collection into InfluxDB where possible.
- Node-RED for data transformation and event/alert processing. Maybe. I haven’t used Node-RED before, but the model seems very intuitive.
- Grafana OnCall for alert management. I’ve never used this before and I’m not 100% sure it will work for me, but it has some features I would really like. We’ll see how it goes.
- Headless Chromium and some web scraping library for web scraping. I’ve had problems in the past with headless Chromium running in Docker filling up the disk it runs on. I haven’t figured out how to resolve this yet (although I didn’t try very hard), so we’ll see. I might find something better/easier.
- I’ll also likely have to find or build some glue tech to make some of this work together.
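To give a feel for what “glue tech” feeding InfluxDB might look like, here is a minimal sketch that formats one reading as InfluxDB line protocol, the text format Telegraf and other clients use to write points. The measurement, tag, and field names (`compressor`, `site`, `pressure_psi`, and so on) are made up for illustration, and the helper `to_line_protocol` is hypothetical rather than part of any library. It also ignores edge cases such as NaN or infinite floats, which line protocol does not accept.

```python
from typing import Mapping, Optional, Union

FieldValue = Union[float, int, bool, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    """Render a field value using line protocol's type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; escape backslashes first, then quotes.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: Optional[int] = None,
) -> str:
    """Build a single line-protocol record (hypothetical helper)."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    line = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended for write performance
        line += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    line += " " + ",".join(
        _escape(k, ",= ") + "=" + _format_field(v) for k, v in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


example = to_line_protocol(
    "compressor",
    {"site": "shop 1", "unit": "A"},  # illustrative tags, not real devices
    {"pressure_psi": 118.5, "running": True, "starts": 42},
    1700000000000000000,
)
print(example)
```

In practice Telegraf or an InfluxDB client library would handle this formatting; the sketch is only meant to show the shape of the data that ends up in the database.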
My inputs (the things I am monitoring) are:
- Some air compressors and related equipment with which I can communicate using Modbus.
- Some UPSs via SNMP and also USB or RS-232.
- A few devices and sensors that are not network-enabled, which I will probably attach to some MQTT something-or-other.
- A few Wi-Fi access points (SNMP and/or web API and/or web scraping).
- Some cellular network extenders (web API/scraping).
- A few other monitoring and alerting systems that belong to others but that are inside my area of concern. Probably via some combination of web scraping and APIs, and also via their generated notification events (mostly email).
- Email notifications from several places. The goal is to aggregate and normalize these notifications so that I can manage them through Grafana OnCall.
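As a rough idea of what normalizing email notifications could mean, here is a small sketch using Python's standard `email` package to turn a raw message into a uniform event dictionary. Everything about the output shape is an assumption: the keys (`source`, `title`, `severity`, `time`, `message`), the keyword-based severity guess, and the function name `normalize_notification` are all hypothetical, and a real version would need per-sender rules.

```python
from email import message_from_string, policy
from email.utils import parseaddr, parsedate_to_datetime
from typing import Optional

# Hypothetical keyword heuristic; real senders will need their own rules.
SEVERITY_KEYWORDS = ("critical", "warning", "resolved")


def guess_severity(subject: str) -> str:
    """Return the first known keyword found in the subject, else 'info'."""
    lowered = subject.lower()
    for keyword in SEVERITY_KEYWORDS:
        if keyword in lowered:
            return keyword
    return "info"


def normalize_notification(raw: str) -> dict:
    """Convert a raw RFC 5322 message into a uniform event dict (assumed shape)."""
    msg = message_from_string(raw, policy=policy.default)
    subject = str(msg.get("Subject", ""))
    _, sender = parseaddr(str(msg.get("From", "")))

    # The Date header may be missing or malformed, so parse defensively.
    time: Optional[str] = None
    try:
        time = parsedate_to_datetime(str(msg.get("Date", ""))).isoformat()
    except (TypeError, ValueError):
        pass

    # Prefer the plain-text part; fall back to an empty message body.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""

    return {
        "source": sender,
        "title": subject,
        "severity": guess_severity(subject),
        "time": time,
        "message": text,
    }


raw_email = (
    "From: UPS Monitor <ups@example.com>\n"
    "Subject: CRITICAL: UPS on battery\n"
    "Date: Mon, 01 Jan 2024 12:00:00 +0000\n"
    "\n"
    "Input power lost.\n"
)
event = normalize_notification(raw_email)
print(event)
```

A dictionary like this could then be forwarded to whatever alert manager is in use; how Grafana OnCall would ingest it is not something this sketch tries to cover.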
The other supporting infrastructure I need (and will also need to monitor):
- MQTT broker. I’ve been using Mosquitto, so I’ll probably stick with it.
- SMTP forwarder/relay for some internal legacy systems (on Windows XP) that need to send out email but don’t have modern TLS (and shouldn’t be anywhere near the Internet).
- Dynamic DNS updater. Because of reasons, I am currently using DreamHost as the DNS provider and updating it with a cron-driven script I got from somewhere, which I will link to later.
- VPN. I’ve been naively using OpenVPN for a while, and I’ve been experimenting a bit with Nebula overlays. I’ll probably settle on something else.
So that’s where we are going.