Is tech monitoring software still worth talking about?

It's 2016, and the number one complaint I hear from sysadmins is still about monitoring software. The complaints have evolved with time, and every organization seems to have its own challenges. Despite this, monitoring software seems to be one of the most universal frustrations in modern IT. In the small business world, simply …

  1. Cirdan

    Monitoring software has its place

    With great power comes great responsibility.

    ...Cirdan...

    p.s. Any professional organizations, like a trade guild, which band together for the common good of sysadmins across corporations? Would supporting the F/OSS solutions together make them more worthy?

  2. Norphy

    https://www.youtube.com/watch?v=bqeGxMgVOHI

  3. Alister

    My armadillo needs a new shell, the old one got bash'd...

    1. Trevor_Pott Gold badge

      Thanks for that.

  4. 1T_Dave

    The days of monolithic monitoring software are hopefully close to an end. I haven't come across a good one yet! There are plenty of building blocks available in AWS and Azure that you can use to build a really good monitoring system. Oh, and monitoring is no longer good enough! Don't just monitor and alert a human. Fix it!

    1. Electron Shepherd

      I haven't come across a good one yet!

      Clearly, you haven't looked at the one I help develop! :)

      Don't just monitor and alert a human. Fix it

      We get this quite a lot, but in reality, a lot of problems that a system administrator needs to know about can't be fixed automatically, or if they can, any automated solution is probably the wrong one. For example:

      1. An unusually high number of audit failures are logged against a production SQL Server, and these are coming from inside the network. How would you fix that automatically? It may be a disgruntled employee trying to "hack" the system, or it could be a genuine mistake made in good faith by someone who simply needs some training. An automated system can't know.

      2. A server is running low on disk space. What's the automatic response? Delete the oldest files? Delete the biggest files? Somehow automatically reconfigure the SAN to allocate more space? None of those is the right answer - the only practical way to do it is to get a human expert to look at the situation and decide.

      3. A process is burning 100% CPU across all cores and slowing everything down. The possible solutions are to force-terminate the process or lower its priority to allow other processes to run. Neither of those two is the "right" answer - they don't solve the problem, just mask it.

      1. 1T_Dave

        Oh no! Another developer who thinks they've cracked it! They all say the same.

        1. So what should happen in these circumstances? You want the Audit failures to stop? How could this happen automatically? Block IP? Granted once you've "fixed" it you'd want a human to investigate.

        2. Yes I want more space automatically allocated. In the cloud that is possible! Oh and why are you using "disk" to store persistent data?

        3. Drop the server out the load balancer and automatically rebuild another one or add more servers. You'd then want to investigate what happened, but any customer impact is mitigated.

        Granted all these require software that's built in a certain way... Maybe the problem isn't with the monitoring systems, it's with the software they're trying to monitor?

        1. Electron Shepherd

          You want the Audit failures to stop? Block IP

          This misunderstands the problem - an auto block treats the symptoms, not the cause. If you have a disgruntled employee, or someone who doesn't know how to do their job properly, an IP block really isn't going to help.

          Yes I want more space automatically allocated

          You can't allocate "more disk space in the cloud" when the disk in question is full of SQL Server log files on an internal production box. Even if it is possible, simply allocating disk space may well just mask an underlying configuration problem, and not actually solve the problem.

          Drop the server out the load balancer and automatically rebuild another one or add more servers

          First, even if this was in some sort of load-balanced situation (e.g. a web server farm), simply dropping the box and rebuilding doesn't address the underlying problem, which is probably a software bug somewhere that needs fixing. Second, there's a lot of systems out there that can't simply have more servers allocated - not everything is built that way.

          The role of monitoring is to detect problems, proactively if possible and reactively when not. The role of the system administrator is to make sure that, whenever possible, the problem doesn't re-occur, and that's very difficult to automate.

          1. 1T_Dave

            OK - I get you aren't going to be able to fix everything... The 'fix it' part isn't my major gripe, monolithic monitoring systems are. I want a loosely coupled monitoring system that gives me flexibility and doesn't trap me in a vendor's way of thinking. AWS CloudWatch is the best I've seen.

            P.S. I'm not responsible for anything on-premise.

            1. Anonymous Coward

              Cloud automatically adding more space

              Well I'm sure the fact they can do that has nothing to do with the fact that they make more money the more storage they allocate to you...

              If something is broken and consuming space needlessly, do you really want an automated process adding more space to cover it up? I want to see the look on your boss' face when you have a 500% increase in your monthly billing because your cloud provider 'helpfully' allocated more space every time you started running out, instead of monitors letting you know it was full, a human investigating, and noticing you were only running out of space because something was broken.

  5. m0th3r

    The big problem is SNMP

    What I find is that most monitoring systems base themselves on SNMP. After digging into these systems, I've found SNMP is HORRIBLY complex, and while its original intent may have been good, it is now a pool of dung, where each vendor implements their own interpretation.

    Thus, when you approach a monitoring software vendor or team (in case of open-source types) with a request to add a device, or fix issues that make a device's graphs flat, you are met with the IT equivalent of the Maginot Line, since SNMP is such a ton of work to get through. The leader of Observium is even actively and openly aggressive towards "wireless ISPs and their shit", for example.

    I've settled on LibreNMS for our WiFi network, and find that in order to fine-tune alerts, you need to learn a whole new template system & language. So far, when a PoE switch goes down, I get an alert for that one, plus 10 extra alerts for the devices plugged into it - aren't systems smart enough to realize that if the switch goes, so will everything plugged into it? I only need an alert for the highest-order failure point.

    Writing this as Pushover is receiving 26 alerts for a minor power glitch... I see space for disruption here.
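The "only alert on the highest-order failure point" idea above can be sketched in a few lines. This is a minimal illustration, not how LibreNMS or any other product implements it: the `parent` map, the device names and the `root_cause_alerts` function are all made up for the example, and it assumes each device depends on at most one upstream device.

```python
from typing import Dict, List, Optional, Set


def root_cause_alerts(parent: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return only the down devices that have no down ancestor.

    `parent` maps each device to the device it depends on (None at the top).
    """
    roots = []
    for device in sorted(down):
        ancestor = parent.get(device)
        seen = {device}          # guards against accidental loops in the map
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                # Something upstream is already down, so stay quiet.
                suppressed = True
                break
            seen.add(ancestor)
            ancestor = parent.get(ancestor)
        if not suppressed:
            roots.append(device)
    return roots


# Hypothetical topology: one PoE switch feeding three access points.
topology = {
    "core-router": None,
    "poe-switch-1": "core-router",
    "ap-1": "poe-switch-1",
    "ap-2": "poe-switch-1",
    "ap-3": "poe-switch-1",
}

print(root_cause_alerts(topology, {"poe-switch-1", "ap-1", "ap-2", "ap-3"}))
```

With the switch and its three access points all down, only the switch is reported. If a single access point fails on its own, it is still reported because nothing upstream of it is down.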

  6. Anonymous Coward

    SNMP

    I wouldn't say that the problem with SNMP is that it's horribly complex, because you can't have a solution that's flexible without some degree of complexity. The biggest problem I've had with SNMP is with OIDs that aren't persistent between reboots.

    1. John Stoffel

      Re: SNMP

      This is a huge issue, especially on NetApps, where if you add/delete volumes, the OIDs for a volume change, so tracking disk usage over time is ... challenging ... to say the least.

      1. Mark 32

        Re: SNMP

        Give up trying to monitor NetApp via SNMP, use the Web service API
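For devices where SNMP is the only option, one common workaround for shifting row indexes is to key stored history on a stable name column instead of on the OID index. The sketch below assumes the poller has already walked a "name" column and a "used space" column into plain dictionaries keyed by row index. No real NetApp OIDs or SNMP library calls are used, and the volume names and numbers are invented.

```python
from typing import Dict, List


def rekey_by_name(names: Dict[int, str], values: Dict[int, int]) -> Dict[str, int]:
    """Join two SNMP table columns on their row index and key the result by name."""
    return {names[idx]: values[idx] for idx in names if idx in values}


# Hypothetical walk results before and after a volume was deleted:
# the row indexes shift, but the volume names do not.
before_names = {1: "vol_home", 2: "vol_scratch", 3: "vol_db"}
before_used = {1: 500, 2: 120, 3: 900}
after_names = {1: "vol_home", 2: "vol_db"}
after_used = {1: 510, 2: 905}

history: Dict[str, List[int]] = {}
for names, used in ((before_names, before_used), (after_names, after_used)):
    for volume, amount in rekey_by_name(names, used).items():
        history.setdefault(volume, []).append(amount)

print(history["vol_db"])
```

Here `vol_db` moves from index 3 to index 2 between polls, yet its two samples end up in the same series. This only helps when the device exposes a name that really is stable.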

  7. vgrig_us

    All that needed is...

    All that's needed is a common templating system and automated template builds from MIB files... A common interface for sharing between different monitoring systems wouldn't hurt either.

    Looks like a good project for Linux Foundation’s Core Infrastructure Initiative to take on, no?

    PS and for all of you singing praises to proprietary solutions - no way in hell I'm paying per monitor: I'd rather learn XML and write templates for OSS monitoring myself.

    1. Electron Shepherd

      Re: All that needed is...

      All that's needed is a common templating system and automated template builds from MIB files

      Until you have something that isn't monitored via SNMP. To take just one example - connect to a remote web server, validate its SSL certificate, and warn you if it's due to expire soon. I don't know of any way to do that via SNMP.

      At the operating system level, for Windows there's lots of really useful information that simply isn't exposed via SNMP. This is true to a lesser extent on Linux, which does broadly have better SNMP support - but try monitoring the contents of the logs in /var via SNMP.

      1. vgrig_us

        Re: All that needed is...

        There may be no way to check an SSL certificate with SNMP, but pretty much any monitoring software can do it - Nagios has a plugin. Templating is needed for devices (physical or virtual) - an SSL certificate check is generic and device independent, so there's no need to involve SNMP.

        The log file problem isn't really an SNMP one - that's content, not an event... Though again, Nagios has plenty of plugins for that (yes, a client install is probably required).

        As for Windows, well - MS just had to do it their own way: no surprise here.
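The certificate check discussed in this thread (connect to a remote web server, validate its certificate, warn if it expires soon) is small enough to sketch with Python's standard library. This is an illustration only, not the Nagios plugin mentioned above. The function names and the 30-day threshold are assumptions.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect, validate the certificate chain and hostname, return notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # wrap_socket raises ssl.SSLError if validation fails.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str, warn_days: int = 30) -> str:
    """One-line status for a host; warn_days is an arbitrary example threshold."""
    remaining = days_until_expiry(fetch_not_after(host))
    status = "WARNING" if remaining < warn_days else "OK"
    return f"{status}: {host} certificate expires in {remaining:.0f} days"
```

Calling `print(check("example.com"))` needs network access. An expired or otherwise invalid certificate surfaces as an exception from `fetch_not_after`, which a real check would catch and report as a failure.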

  8. John Stoffel

    Monitoring tools suck... but it's defining the metrics that's hard

    The hard part is getting everyone to agree on what is important to monitor and what isn't. And then the problem becomes finding a solution to measure that metric, and then to display it in a meaningful way. And then how do you do alerts so that they are meaningful? And at 3am?

    Monitoring is one of the hardest jobs to do, because we humans are very very very good at filtering, until we get overwhelmed and we shut down. Or just ignore it.

    There's nothing worse than a monitoring system which screams at you or the users so much that you just stop paying attention to something you can't fix, adjust, or monitor.

    But back to metrics. My personal bugaboo was a WAN accelerator appliance vendor who had/have a great product, it really does a good job. But when performance fell off a cliff because someone was pushing 100Mbit/s through a 10Mbit/s pipe and causing all the traffic to get congested, it became hell to *find* that stream/application/endpoint-pair which was driving all the traffic.

    There were all sorts of shiny knobs and buttons and graphs, which all helped the developers I feel, but none that ever really helped the end user sysadmin who's dealing with an irate user saying "The WAN is slow again!".

    See? It all goes back to the end-user-testing story I just finished reading about, where not understanding the end user's needs, and testing LIKE an end-user is the key.

    John
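Finding the endpoint pair that is driving the traffic is, at its core, a group-and-sum over flow records. A minimal sketch, assuming flow data is available as `(source, destination, bytes)` tuples. That record shape and the sample addresses are invented here and do not come from any particular appliance or flow export format.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Flow = Tuple[str, str, int]  # (source, destination, bytes) - assumed record shape


def top_talkers(flows: Iterable[Flow], n: int = 3) -> List[Tuple[Tuple[str, ...], int]]:
    """Sum bytes per endpoint pair and return the n heaviest pairs."""
    totals: Counter = Counter()
    for src, dst, nbytes in flows:
        # Treat A->B and B->A as the same conversation.
        pair = tuple(sorted((src, dst)))
        totals[pair] += nbytes
    return totals.most_common(n)


# Made-up sample records for illustration only.
sample = [
    ("10.0.0.5", "10.1.0.9", 40_000),
    ("10.0.0.7", "10.1.0.2", 1_200),
    ("10.1.0.9", "10.0.0.5", 900_000),
    ("10.0.0.8", "10.1.0.2", 3_500),
]

for (a, b), total in top_talkers(sample, n=2):
    print(f"{a} <-> {b}: {total} bytes")
```

The same grouping works per application or per port if the records carry that field. The point is that the "who is filling the pipe?" view should be the first screen, not something buried behind the graphs.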

  9. phuzz Silver badge

    So nobody's going to recommend what they're using? I'll go first then.

    Back when I was working at a Windows only company I found Spiceworks to be really good. It could autoadd basically everything, and even had a ticketing system built in.

    Since then I've used Cacti, which is ok, but is purely SNMP based, and we've now moved on to Zabbix which works fine with SNMP, but also has its own monitoring client for computers and servers, and a proxy system, so you don't need all of the boxes on your secure network having access to your main monitoring server.

    Zabbix is a bit of a faff to set up until you get used to its terminology, but as far as I can tell, so are all monitoring systems. They're just about to release version 3 which I've not tried yet.

    1. Fazal Majid

      Zabbix

      There's a dearth of thorough reviews of open-source monitoring software, but I will take open-source over proprietary or hosted solutions any day.

      We use Zabbix (with the PostgreSQL backend) to manage just shy of a hundred physical servers and around 500 containers. Like any serious piece of software, there is a learning curve, and the terminology is sometimes confusing because it is written by Russians, not native English speakers, but I haven't found it particularly difficult to set up. It's certainly easier than Nagios, Ganglia or MON, and actually usable by non-technical users like support or management.

      My main beef with it is that it assumes "no news is good news" and will ignore items (metrics) that are not sending data, which usually means the system is down or hung so badly the agent is not responding either. Ad-hoc querying and graphing capabilities are also somewhat crude, e.g. "build me a screen (dashboard) of CPU vs. swap for all machines in host group 'database servers'". The PHP-based web UI is a bit tired and it would be nice to have modern JS/canvas-based interactive graphs, but it is serviceable.

      That said the template system is fairly flexible and powerful if you give some forethought to design, it does have the ability to handle dependencies so as to reduce the flood of downstream alerts, and is fairly easy to extend. Performance is better than a Python/Perl/Ruby solution like ZenOSS, but you will still need to dedicate a system past 100 monitored hosts/VMs or so.
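The "no news is good news" complaint above comes down to treating silence as a failure in its own right. A minimal sketch of that check, independent of Zabbix: it assumes you can obtain the timestamp of the most recent sample for each item, and the item names and the 300-second threshold are invented for the example.

```python
from typing import Dict, List


def stale_items(last_seen: Dict[str, float], now: float, max_age: float = 300.0) -> List[str]:
    """Return items whose most recent sample is older than max_age seconds."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)


# Hypothetical last-sample timestamps (in seconds) for three items.
now = 10_000.0
last_seen = {
    "db1:cpu.load": now - 30,
    "db2:cpu.load": now - 1_200,   # agent silent for 20 minutes
    "web1:cpu.load": now - 299,
}

print(stale_items(last_seen, now))
```

Only `db2:cpu.load` is flagged here. In practice the threshold should be a few multiples of the polling interval so that one late sample does not raise an alert.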

    2. ColonelNZ

      Current company uses Zabbix and I have to say I'm Not A Fan. Web UI is atrocious for finding the information you actually want and writing monitoring scripts is needlessly complex.

      I have to say the best I've come across is OMD/Check_MK. It uses the well proven Nagios core but smooths out a lot of the niggles such as auto generating the monitors, and providing a distributed monitoring system.

      1. Down not across

        I concur on the GUI of Zabbix. In the interest of fairness I should probably mention I only ran it briefly as I was evaluating various options and probably didn't spend enough time with it.

        I kinda liked nxmc, but it had its own niggles.

        I've always ended up falling back to Cacti for some reason, probably because its graphing just works and graphs are useful for trends/history.

        For me the ideal solution would be fully modular. Both for collecting/monitoring and for GUI/alerting.

        I prefer to be able to store at least few months worth of statistics online (for pro-active monitoring/trending).

        Alerting should be very flexible. Ideally it would also have some idea of topology so that if a switch goes down you don't necessarily (unless you want to) get alerts for every device connected to it. Alerts should also have the ability to auto-escalate (for example, mail->sms->sms-to-alt-number->sms-to-group-of-numbers).

        Obviously it needs to do SNMP (both collecting and acting as a trap receiver).

        Ideally it should work with or without an agent on target host (and if it does use an agent it should be possible to either push or pull).

        Oh and don't want it to be written in Java either.

        Then you start getting to niceties/fluff like maybe an Android app for lightweight dashboard/alarm panel.

        To be fair there are quite a few open source solutions that come close for most parts.
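The auto-escalation chain in the wish list above (mail, then sms, then an alternate number, then a group) can be expressed as a small table of delays. A minimal sketch: the delays in `POLICY` and the `channels_due` function are assumptions made for illustration, and actually sending the notifications is left out.

```python
from typing import List, Tuple

# (minutes since the alert opened, channel) - hypothetical timings for the
# mail -> sms -> sms-to-alt-number -> sms-to-group-of-numbers chain.
POLICY: List[Tuple[int, str]] = [
    (0, "mail"),
    (15, "sms"),
    (30, "sms-to-alt-number"),
    (60, "sms-to-group-of-numbers"),
]


def channels_due(minutes_open: int, acknowledged: bool,
                 policy: List[Tuple[int, str]] = POLICY) -> List[str]:
    """Return the channels that have come due for an alert nobody has acknowledged."""
    if acknowledged:
        # Someone has picked it up, so escalation stops.
        return []
    return [channel for delay, channel in policy if minutes_open >= delay]


print(channels_due(20, acknowledged=False))
```

After 20 minutes without an acknowledgement, both mail and sms have come due. Once someone acknowledges, the list is empty and nothing further goes out.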

  10. MasterofDisaster

    Monitoring is critical to "broken glass" policing

    If you're not monitoring, it is very likely the overall organization sees more "broken glass" than they should, thus forming a view of IT and the overall system. In some places that's okay (lots of IT teams carry forward with a poor view of their efforts), but if the "broken glass" is something mission-critical to the company then it is hard to see not having monitoring specifically tuned to that mission-critical element. For example, if you manage an IP-based video surveillance network and it is very visible when it fails (e.g. bank heist with no video footage), you'd be foolish not to have a purpose-built monitoring tool for at least that purpose. The less broken glass, the more likely there are multiple monitoring approaches (purpose built for the mission-critical stuff, generic for the other systems). In other (and fewer) words, monitoring is a reflection of the corporate importance of what is monitored.

  11. jamesb2147

    Interesting discussion of FOSS solutions

    Here's my $0.02:

    I wore the network and monitoring hats as part of an IT team of 15 supporting a university of 2000 students and ~500 faculty and staff. I was *the* network admin/engineer/architect.

    Zabbix looked a bit strange, though I didn't invest much time in it.

    Spiceworks is 80% terrible at network monitoring. Great for configuration backups on our network, though! Couldn't use it for desktops since we were mostly Macs.

    Zenoss has similar network monitoring functionality to Spiceworks but is 90% better at it. Seemed to have very limited server monitoring IIRC. We used this for a bit to monitor the network in between other solutions. Never got the trap receiver working properly.

    Cacti was too limited for our needs, though good at what it did (if ugly).

    MRTG was much the same as Cacti.

    NAGIOS was, by all accounts, extremely powerful. Never did get it configured though as it all seemed too much of a PITA. Yes, we even tried a couple of wrappers that were supposed to address using the GUI to add monitors, etc.

    Xymon was a PITA to configure and maintain though decent at what it did.

    PRTG was what we eventually settled on. It was reasonably priced and did both network and server monitoring competently. There was a lot to learn but the system could be configured in a basic way with advanced configuration applied as time allowed for learning. It was not the best system available, and I wouldn't sing its praises from the hilltops, but it's probably worth checking out even if you hate proprietary solutions. It is reasonably competent, with alerts based on groups, dependency monitoring (router goes down and stop monitoring all the downstream switches and do NOT alert me again), SSL expiry monitors, frequent updates which occasionally add real features (and they fixed several SSL vulns in a matter of a few months), etc.
