The network is not reliable

One of the first mistakes made by a developer building his first proper distributed application is to assume that the network is reliable. When he learns through painful experience that the network isn't actually reliable, he starts to think about monitoring the parts of his application that live on different machines, especially on the application server - sometimes known as the middle tier in a 3-tier application. And now our intrepid hero really starts to dig a hole for himself.

His first instinct is usually to use some form of "heartbeat". The idea behind this is that each remote component or service transmits a regular heartbeat across the local network to an associated heartbeat monitor running on a central machine. Each heartbeat monitor has a list of components that it's supposed to watch, and it does something like displaying an application status page on the intranet with a "smiley" face to represent a healthy component, and a "frowny" face for any component that has missed a heartbeat.
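The bookkeeping behind such a status page can be sketched in a few lines of Python. This is only an illustration, not the code from any real monitoring product: the class name HeartbeatMonitor, the 10-second timeout and the injectable clock are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each watched component.

    Hypothetical sketch: the names and the default timeout are assumptions.
    """

    def __init__(self, components, timeout_seconds=10.0, clock=time.monotonic):
        self._clock = clock
        self._timeout = timeout_seconds
        # Treat every component as having just reported when the monitor starts.
        now = clock()
        self._last_seen = {name: now for name in components}

    def record_heartbeat(self, component):
        """Call this whenever a heartbeat message arrives from a component."""
        if component not in self._last_seen:
            raise KeyError(f"not a watched component: {component}")
        self._last_seen[component] = self._clock()

    def status(self):
        """Map each component to the face shown on the status page."""
        now = self._clock()
        return {
            name: "smiley" if now - seen <= self._timeout else "frowny"
            for name, seen in self._last_seen.items()
        }
```

Note how little the monitor actually knows: a "frowny" face only means that no heartbeat arrived within the timeout, and nothing in the data says why.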

Our hero congratulates himself on his design, and rushes to add this feature to his application. An operator is then assigned to monitor the application's status page, and report any problems to the application team. All is well with the world. Until...

The operator phones our hero at 2 AM to say that component ABC on server XYZ has stopped broadcasting a heartbeat. This is an application-critical component, so he needs to respond fairly fast. The developer uses Terminal Services to log into machine XYZ, but after 5 minutes he's still waiting for the login to happen because the server doesn’t seem to be responding. So he phones the server support team to report a non-functioning server. They take 5 more minutes to respond, but then they report that they can log into server XYZ without a problem.

Having established that he's going across a different network segment to the server support team, he rings the network support team to report a possible router or switch issue. After another 10 minutes, the network support team phones back to say that the network appears to be functioning normally. Sure enough, when the developer tries to log into server XYZ again, the login works perfectly and the heartbeat status page is now showing a smiley face again for component ABC.

Our hero never does establish what went wrong, or why.

The problem is that it's almost impossible for a monitor to figure out what's happening when the components that it's monitoring are remote from the monitor. There are just too many failures (some of them quite ingenious) that can occur between the monitor and its associated components. Here's a short list, by no means complete:

  • There’s a problem with Windows on the remote machine that prevents the heartbeat from reaching the network stack.
  • There’s a problem with the network card on the remote machine.
  • There’s a problem with one of the network cables - it might be faulty or no longer plugged in properly.
  • There’s a problem with the network itself, such as a faulty DNS server, or a faulty network switch or router that partitions the network.
  • There’s a problem with the network card on the monitoring machine.
  • There’s a problem with the operating system on the monitoring machine that prevents the heartbeat from reaching the monitor.
  • The monitor is busy handling other heartbeats, and therefore is unable to process the new heartbeat in a timely fashion.
  • There’s a problem with the monitor. The monitor's process has terminated or hung in some way, or the thread that’s supposed to process the heartbeat has terminated.
  • The monitoring machine is either down or disconnected from the network for some reason.

So the remote heartbeat monitor has no way of knowing what's really wrong. The problem could be with the component being monitored or with any of the hardware or software sitting between the monitor and the component. To perform remote diagnosis of what's really wrong is a seriously non-trivial problem.

There are other problems with this application monitoring design pattern. First, it's not easy for a remote monitor to take any corrective action, such as restarting a component that appears to be dead in the water. Second, when you have several distributed applications running on your local network, the number of heartbeat messages can rise to a significant proportion of your total network traffic. Although any good network should be optimised to handle a large number of small heartbeat messages, it's plain from the scenario above that most of these heartbeats are useless in the real world.

One way of avoiding these problems is to use a local monitor design pattern. If you place a heartbeat monitor (such as a Windows service) on every machine that hosts any part of your distributed application, that monitor is able to analyze a problem in much more detail than is possible with a remote monitor. When a heartbeat is missed, the local monitor can check for problems such as low memory, low disk space, or high processor utilization. It can sometimes determine whether a problematic component is completely dead or is just hung. If necessary, it can kill and restart a dead component, or take some other corrective action. The local monitor can also watch the overall health of the machine on which it’s running and provide advance warning about problems such as low disk space that might affect other components running on the machine.
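As a rough sketch of what those local checks might look like, the Python below assumes that the local monitor launches each component itself as a child process, which is a simplification for the example rather than a requirement of the pattern. The names LocalComponent and low_disk_space, the 10-second heartbeat timeout and the 10% disk threshold are all hypothetical.

```python
import shutil
import subprocess
import sys
import time


def low_disk_space(path, min_free_fraction=0.10):
    """Return True when the volume holding `path` is below the free-space threshold."""
    usage = shutil.disk_usage(path)
    return usage.free < min_free_fraction * usage.total


class LocalComponent:
    """A component started and watched by the local monitor (hypothetical design)."""

    def __init__(self, command, heartbeat_timeout=10.0):
        self.command = command
        self.heartbeat_timeout = heartbeat_timeout
        self.start()

    def start(self):
        self.process = subprocess.Popen(self.command)
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def diagnose(self):
        # poll() returns None while the process is still running.
        if self.process.poll() is not None:
            return "dead"
        # The process exists but has gone quiet: treat it as hung.
        if time.monotonic() - self.last_heartbeat > self.heartbeat_timeout:
            return "hung"
        return "healthy"

    def recover(self):
        """Kill the process if it is still running, then start a fresh one."""
        if self.process.poll() is None:
            self.process.kill()
        self.process.wait()
        self.start()
```

The distinction between "dead" and "hung" is only available because the monitor shares a machine with the process; a remote monitor sees the same missed heartbeat in both cases.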

All this is possible because diagnosis of local failure is much easier and more reliable than diagnosis of remote failure. You still need one or more remote monitors to watch the local monitors and present the aggregated results, but these remote monitors won’t generate anywhere near the amount of network traffic they did in the original scenario. As an added benefit, each remote monitor doesn’t need to maintain a complex and ever-changing list of application components to watch, as this list can now be kept local to each machine. Instead, each remote monitor has a much smaller list of local monitors to watch, preferably one per machine. Because the local heartbeats are aggregated before being pushed to the remote monitors, your network is no longer flooded with (mainly useless) heartbeat messages and you have much more reliable diagnostics. 
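To make the aggregation concrete, here is a minimal Python sketch of the single report a local monitor might push to a remote monitor, together with a count of the messages crossing the network per reporting interval. The JSON layout, the function names and the machine and component counts in the usage note are invented for illustration, and the count assumes that components and local monitors report at the same interval.

```python
import json


def build_machine_report(machine, component_states):
    """Collapse per-component states into one message for the remote monitor."""
    problems = {
        name: state
        for name, state in component_states.items()
        if state != "healthy"
    }
    return json.dumps(
        {
            "machine": machine,
            "healthy": not problems,
            "components_watched": len(component_states),
            "problems": problems,
        },
        sort_keys=True,
    )


def messages_per_interval(components_per_machine, aggregated):
    """Heartbeat messages crossing the network in one reporting interval."""
    if aggregated:
        return len(components_per_machine)       # one report per machine
    return sum(components_per_machine.values())  # one heartbeat per component
```

With a made-up layout of 12 components on one machine and 8 on another, messages_per_interval returns 20 without aggregation and 2 with it, and the remote monitor's watch list shrinks from twenty component names to two machine names.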

[Update: typo fixed]