The market for network monitoring solutions is crowded with offerings from a multitude of vendors, both open-source and commercial. Solutions come in different shapes and sizes, but there is still no "one size fits all". Since this space is hard to navigate for a beginner (and in some cases even a seasoned) sysadmin, in this post I will try to provide some guiding lights to simplify the selection.

Major Requirements

So first things first - why do we need Network Monitoring (NM)? The answer is pretty straightforward: we need to know when something's broken. This leads us to two major functional requirements for an NM solution:

  1. Detect failure
  2. Notify responsible personnel

Let's split those into more detailed ones to understand how deep this rabbit hole can go.

Detection

Let’s review what factors usually define good detection:

  1. Wide range of supported parameters
  2. Flexible schedule
  3. Lowest number of false positives

Wide range of supported parameters

You should look for solutions that cover all the parameters you know you need to monitor. As a bare minimum, you should focus on:

  1. Availability
  2. Performance (CPU load)
  3. Capacity (Memory & Disk)

You may also want to consider log-file or SNMP-trap feeds into your NM solution as additional sources of events.

However, don't limit yourself to those, as you never know what you will need tomorrow. While NM solution vendors try to pack in as many parameters as possible, they can still miss one that is vital for you right now, so you should always check whether the solution allows you to extend it with custom checks.
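
As an illustration, here is a minimal custom check in Python following the widely used Nagios plugin convention (print a one-line status, exit 0 for OK, 1 for WARNING, 2 for CRITICAL); the thresholds are arbitrary, and most solutions that support external checks accept something very similar:

    #!/usr/bin/env python3
    # Minimal custom check in the Nagios plugin style:
    # print a one-line status and exit 0 (OK), 1 (WARNING) or 2 (CRITICAL).
    import shutil
    import sys

    WARN_PCT = 80   # warn at 80% disk usage (arbitrary threshold)
    CRIT_PCT = 90   # critical at 90% (arbitrary threshold)

    def main() -> int:
        usage = shutil.disk_usage("/")
        used_pct = usage.used / usage.total * 100
        if used_pct >= CRIT_PCT:
            print(f"DISK CRITICAL - {used_pct:.1f}% used")
            return 2
        if used_pct >= WARN_PCT:
            print(f"DISK WARNING - {used_pct:.1f}% used")
            return 1
        print(f"DISK OK - {used_pct:.1f}% used")
        return 0

    if __name__ == "__main__":
        sys.exit(main())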

If you want to go beyond problem detection to root cause analysis, you might find it useful to review my post "Use Network Monitoring for troubleshooting".

Flexible schedule

In an ideal world, NM will notify you of a problem before your users do. The schedule (or check frequency) plays an important role in achieving this goal, so an NM solution should give you the ability to define the check schedule that suits you best. You should look for the possibility to define:

  1. Check intervals - heavily used services will require more frequent checks, while less popular ones are fine with longer intervals
  2. Service hours - ideally, NM should support different intervals at different times of day or week - perhaps some systems are not used during the weekend, or you want to ignore load spikes at night caused by batch jobs. (That one is rather a false-positives concern.)

Hint:

Some monitoring solutions allow you to define maintenance windows, so you won't get alerted while installing updates and won't impact your SLA figures.
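
To make intervals, service hours and maintenance windows concrete, here is a minimal sketch of how such a schedule could be modelled; all service names, fields and windows are invented for illustration and do not map to any particular product:

    # Hypothetical per-service schedule: check interval, service hours and
    # maintenance windows (all names and values are illustrative).
    from datetime import datetime, time

    SCHEDULE = {
        "checkout-api": {
            "interval_s": 60,                             # busy service: check every minute
            "service_hours": (time(0, 0), time(23, 59)),  # watched around the clock
        },
        "batch-reports": {
            "interval_s": 600,                            # low-priority: every 10 minutes
            "service_hours": (time(8, 0), time(18, 0)),   # ignored outside office hours
        },
    }

    MAINTENANCE = [
        # (service, window start, window end) - alerts suppressed inside the window
        ("checkout-api", datetime(2024, 1, 6, 2, 0), datetime(2024, 1, 6, 4, 0)),
    ]

    def check_is_due(service: str, now: datetime, last_run: datetime) -> bool:
        cfg = SCHEDULE[service]
        start, end = cfg["service_hours"]
        in_hours = start <= now.time() <= end
        return in_hours and (now - last_run).total_seconds() >= cfg["interval_s"]

    def alert_suppressed(service: str, now: datetime) -> bool:
        return any(s == service and start <= now <= end
                   for s, start, end in MAINTENANCE)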

Lowest number of false positives

False positives are the curse of network monitoring, as they can derail your monitoring initiative altogether. A false positive is when your NM alerts you about something that hasn't happened. Why is this bad? Because you will end up either with maintenance overhead or with ignoring ALL alerts.

Solution vendors usually don't ship total crap that reports "1" when reading "0", so it's important to understand where false positives come from while looking for a proper NM solution. Modern solutions generate false positives mainly because they can't make full sense of two things:

  1. Your environment and the dependencies within it - for example, the solution may keep checking services behind a firewall after the firewall itself has failed
  2. Your business workflow - for example, it may report high CPU utilization at night when you run compute-intensive batch jobs

The first case is addressed by vendors implementing sophisticated correlation algorithms - while those can reduce the number of false positives, there is still a lot of room for improvement. The second case is completely out of the vendor's control, which brings us to another criterion of good detection:

Flexible condition definition - this is a must-have feature to look for if you want a lower number of false positives. A good solution will allow you to analyze more than one sample of a specific metric/parameter, apply math functions like sum(), avg() and delta() to those samples, or even transform the metric's value.
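
Here is a small sketch of what such a condition could look like in practice - alerting on the average of the last N samples instead of a single reading, so a one-off spike doesn't trigger an alert (thresholds and window size are arbitrary):

    # Trigger on the average of the last N samples, not on a single reading.
    from collections import deque

    def avg(samples):
        return sum(samples) / len(samples)

    def delta(samples):
        # Change across the window - useful for "grew by X" conditions.
        return samples[-1] - samples[0]

    class MetricWindow:
        def __init__(self, size: int = 5):
            self.samples = deque(maxlen=size)

        def add(self, value: float):
            self.samples.append(value)

        def should_alert(self, threshold: float) -> bool:
            # Require a full window so one bad sample can't fire the alert.
            return (len(self.samples) == self.samples.maxlen
                    and avg(self.samples) > threshold)

    cpu = MetricWindow(size=5)
    for v in [40, 95, 42, 41, 43]:
        cpu.add(v)
    print(cpu.should_alert(threshold=80))  # False - a single spike is ignored
    for v in [90, 92, 95, 91, 93]:
        cpu.add(v)
    print(cpu.should_alert(threshold=80))  # True - load is sustained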

Hint:

Some solutions allow you to define alert-triggering conditions based on a set of metrics analyzed together - for example, alerting only when all cluster nodes or all LACP group members go down.
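
In code, such a composite condition can be as simple as evaluating all member states together; a tiny illustration with invented node names:

    # Alert only when the whole cluster is down, not on the first failed node.
    node_up = {"node-a": False, "node-b": False, "node-c": False}

    def cluster_down(states: dict) -> bool:
        return not any(states.values())

    if cluster_down(node_up):
        print("CRITICAL: entire cluster unreachable")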

Notification

Characteristics of a good notification are:

  1. Actionable - this is not really a requirement for the solution, but rather a recommendation on how you should implement notifications - to make them work, they must be relevant!
  2. Timely - we want to know about a failure before our users notice it (better yet, before it even happens). When evaluating solutions, look at delivery methods beyond e-mail: SMS, push notifications etc.
  3. Right audience - your monitoring solution should be capable of routing different notifications to different teams - high network or storage latency notifications should go to network and storage admins respectively (see the routing sketch after this list). Again, relevance is the key to effective notifications! Note that an irrelevant notification is perceived as a false positive by your team.
  4. Reliable - if you rely primarily on e-mail for notification delivery, think again: what would you do if your mail server went down? If you rely on external APIs for notification delivery, what would you do if DNS went down? The answer is SMS - look for native SMS support (not via a 3rd-party service provider) in your monitoring solution; otherwise, plan for some custom development. Depending on your SLAs, you might want to consider redundant SMS modems. No reliability discussion is complete without "monitoring of monitoring" - you have to be sure that your monitoring system is working, so consider another system which will (at a minimum) monitor the primary one.
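
To illustrate the routing and fallback points together, here is a hedged sketch of a notification router; the team names, channels and the delivery stub are all invented:

    # Route an alert to the owning team and fall back to the next channel
    # (ultimately SMS) when the preferred one fails.
    ROUTES = {
        "network": ["email:netops@example.com", "sms:+15550100"],
        "storage": ["email:storage@example.com", "sms:+15550101"],
    }

    def send(channel: str, message: str) -> bool:
        # Stub transport: a real implementation would talk to the mail
        # gateway or a locally attached SMS modem and report success.
        print(f"-> {channel}: {message}")
        return True

    def notify(team: str, message: str):
        for channel in ROUTES.get(team, []):
            if send(channel, message):
                return  # delivered - stop at the first channel that works
        raise RuntimeError(f"all channels failed for team {team!r}")

    notify("network", "High latency on core switch uplink")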

Design considerations

It's generally good practice to research how your next monitoring solution is designed to understand whether it's a good fit for your organization. Below are a few criteria to help you choose the right one.

Agent-based vs. agent-less

Agent-based solutions usually come with the benefit of a greater number of supported parameters. One of their major drawbacks is increased maintenance, because you need to maintain the agents. Agent-based solutions are also more firewall-friendly.

Agent-less solutions, on the other hand, do not add to the maintenance burden, but are limited in the number of supported parameters and are less flexible.

Pull vs. Push

The difference between the two lies in how data is delivered from servers to the monitoring server and how discovery is done. It's a controversial topic in the monitoring industry, but the truth is, you have to decide what suits your needs best.

Pull is an approach where the monitoring solution pulls metrics from the servers/hosts. It also scans the network for new targets. The benefit of this approach is that it captures hosts which might otherwise be missed; the drawback is that rogue hosts might fill up your monitoring system.

Push is when servers/hosts push metrics into the monitoring solution. Target discovery is done by the hosts themselves registering with the monitoring system. While this keeps your monitoring system uncluttered, you might remain unaware of some rogue hosts on your network. Generally, this approach is more firewall-friendly when you need to collect metrics across the network perimeter.
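
A minimal push-style agent might look like the sketch below - the host posts its own metrics outbound over HTTP, which is what makes the model firewall-friendly; the endpoint URL and payload shape are assumptions, not any product's real API:

    # Push model: the monitored host delivers metrics to the server itself.
    import json
    import time
    import urllib.request

    MONITORING_URL = "http://monitor.example.com/api/v1/metrics"  # hypothetical

    def push_metric(host: str, name: str, value: float):
        payload = json.dumps({
            "host": host,
            "metric": name,
            "value": value,
            "timestamp": int(time.time()),
        }).encode()
        req = urllib.request.Request(
            MONITORING_URL, data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=5)  # outbound-only connection

    # A pull system inverts this: the server connects to every host and
    # requests the same values on its own schedule.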

Performance/Scalability

For network monitoring solutions, performance is usually measured in the number of metrics collected per minute. Most solutions have no issues collecting several thousand metrics per minute unless limited by database performance. A few Xeon cores should easily handle an environment of 400 hosts (assuming 50 metrics per host collected each minute).

But what if you operate a private cloud and have to scale to 500 physical servers (or more), each hosting 50 virtual machines reporting 50 metrics every minute? This is where few (if any) solutions shine. It's not a CPU or network throughput issue; the bottleneck is database performance, specifically its disk I/O. Back to our private cloud example: this scale requires the database to perform 1.25 million writes per minute (~20k writes/second) - to put the numbers in perspective, that's enough load to keep 100 15K SAS drives spinning 24/7. While database disk I/O optimization is nothing new, approaches such as buffered writes and SSD caching provide only limited relief. Add 5 dashboards, each displaying 5 sets of 10 metrics over 12-hour intervals and refreshing every 10 seconds, and you end up with a total of 18k metric values read per second. At that scale I would recommend looking for a solution which supports a time series database (TSDB) as its storage back-end for collected data.
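
The arithmetic behind those figures is easy to verify:

    # Write load: 500 physical servers x 50 VMs x 50 metrics per minute
    hosts = 500 * 50                          # 25,000 monitored hosts
    writes_per_min = hosts * 50               # 1,250,000 writes/minute
    print(round(writes_per_min / 60))         # ~20,833 writes/second

    # Read load: 5 dashboards x 5 sets x 10 metrics, each showing a 12-hour
    # window of one-minute samples (720 points), fully redrawn every 10 s
    series = 5 * 5 * 10
    points_per_series = 12 * 60
    print(series * points_per_series // 10)   # 18,000 values read/second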

Thinking of scaling to a public cloud with a million machines? Here's a video you will find interesting: https://www.youtube.com/watch?v=likpVWB5Lvo.

High availability and Fault tolerance

If the availability of the monitoring solution is essential for your business, you should evaluate what options the solution provides to keep the flow of metrics uninterrupted. Specifically, you should pay attention to the following questions:

Availability pattern - does the solution implement/support running in Active/Active mode? Some solutions lock the database for exclusive use, making it impossible for another server to connect to the same database. Another challenge in Active/Active mode is failure notification - if it isn't properly implemented, you will end up with duplicate notifications for each failure.

Fault tolerance - is the solution designed to recover from network failures? Agent-based solutions can implement local caching of metric values during periods when the server is unavailable and push them through as soon as connectivity is restored - that way you won't have any gaps in your graphs.
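
A sketch of that caching behaviour, with the transport stubbed out and assuming a bounded in-memory buffer (a real agent would likely persist to disk):

    # Buffer samples locally while the server is unreachable, flush on recovery.
    from collections import deque

    def send_to_server(sample: dict) -> bool:
        # Stub transport: a real agent would POST the sample and return
        # False on connection errors.
        return True

    class BufferingAgent:
        def __init__(self, maxlen: int = 10_000):
            self.buffer = deque(maxlen=maxlen)  # bounded: oldest samples drop first

        def record(self, sample: dict):
            self.buffer.append(sample)
            self.flush()

        def flush(self):
            while self.buffer:
                if not send_to_server(self.buffer[0]):
                    return                  # server still down - keep buffering
                self.buffer.popleft()       # delivered - safe to discard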

Functional requirements

GUI

Surprisingly, the user interface can determine whether a network monitoring implementation will be successful or merely average. The GUI of a monitoring solution, in fact, has two major aspects - administration/configuration and visualisation.

Administrative - when the administrative interface is designed without thorough UX work behind it, its logic may be non-intuitive and simple tasks may become cumbersome, increasing your maintenance costs and decreasing acceptance.

Visualisation - the dashboard of a monitoring solution should be capable of giving you an overview of the health of your services. Consider the major metrics you plan to put on such a dashboard to understand whether the solution is a good fit. Bonus feature: the dashboard should provide the ability to drill down into the details of selected metrics.

Documentation

The completeness and accessibility of documentation are not exclusive to network monitoring solutions - the better the documentation, the faster you can move.

Security/Encryption

When assessing NM solutions from a security perspective, you should consider the following aspects:

Access controls - Role-Based Access Control (RBAC) is one of the key requirements if your solution will cross departmental or organizational borders. Depending on the solution, these controls can be either simple (like assigning permissions to manage groups of hosts) or fine-grained, controlling which actions (create/delete/modify/view) are allowed on which objects. Multi-tenancy considerations should be addressed where applicable.

Encryption - traditionally, NM solutions were oriented toward monitoring systems within the datacenter, so only a few vendors made security one of their key design priorities. Today, with the IT landscape spreading beyond datacenters to hosted service providers and clouds, encrypting monitoring traffic is becoming important as it can reduce maintenance costs (e.g. of VPNs). While agent-less systems rely mostly on the underlying protocols (SSH, SNMP etc.) for encryption, agent-based systems should have encryption built in.

APIs/extensibility

As with any other solution, you should consider the options for extending functionality and integrating with your existing tools. Some vendors have an open architecture and a great plug-in ecosystem around their product.

Some ideas you may want to consider:

  • Integration with your ticket-tracking system to reflect your incident response workflow (a sketch follows this list)
  • Integration with an inventory system - to add hosts to monitoring, or to populate your inventory with data collected by the monitoring solution
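
As a taste of the first idea, here is a hedged sketch of alert-to-ticket glue; the tracker URL, fields and queue name are invented, not any real tracker's API:

    # Open a ticket in the tracker whenever a new alert fires.
    import json
    import urllib.request

    TRACKER_URL = "http://tickets.example.com/api/issues"  # hypothetical endpoint

    def open_ticket(alert: dict):
        payload = json.dumps({
            "title": f"[monitoring] {alert['host']}: {alert['summary']}",
            "priority": "high" if alert["severity"] == "critical" else "normal",
            "queue": "infrastructure",
        }).encode()
        req = urllib.request.Request(
            TRACKER_URL, data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=5)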

Reporting

Depending on your environment, you might also want to look into the reporting capabilities of monitoring solutions. A few reasons why you might want this:

  • Statistics - to know which hosts are the most risky - Top 10 hosts by CPU, RAM or disk usage are good candidates to start with
  • SLA - depending on the requirements of your business stakeholders, you might need to present monthly reports on important services and whether they meet the defined service levels (the ability to define maintenance windows is much more important here - see the worked example after this list)
  • Incident reports - you might want to review incidents/alerts for a defined period (say, weekly) to identify which hosts were causing problems
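
The SLA point is easiest to see with numbers - availability is computed against in-scope time, i.e. excluding agreed maintenance windows (all figures below are made up):

    month_minutes = 30 * 24 * 60   # 43,200 minutes in a 30-day month
    maintenance = 4 * 60           # 4 hours of planned maintenance
    downtime = 50                  # 50 minutes of unplanned outage

    in_scope = month_minutes - maintenance
    availability = (in_scope - downtime) / in_scope * 100
    print(f"{availability:.3f}%")  # ~99.884% - just misses a 99.9% target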

Conclusion

There are a lot of things to consider when choosing a network monitoring solution. The solution space is pretty wide and each vendor comes with its own offering, but what to choose depends entirely on your requirements. In this post I've tried to list the most important considerations, based on my own experience.