How to Design Software — Monitoring Systems

Explore the conceptual architecture of monitoring systems and learn how to build your own system observation engine.

Joseph Gefroh
10 min read · Feb 6, 2022


Have you ever seen products with status pages? Ever wonder how it all works and how to make your own? Most companies don’t do it themselves — they use a 3rd-party vendor like StatusPage combined with automated reporting tools like Pingdom. If you’re using this in a production environment, chances are you’ll want to use a vendor.

However, if you’re curious about the design approach and possible ways to build something like this quickly, then this is the article for you!

A screenshot of a monitoring page.

Firstly — a conceptual exploration of “Monitoring”

The purpose of a monitoring system is simple: determine whether something is working or not.

That statement has a bit more nuance than meets the eye, so let’s take a closer look.

What are you monitoring?

It’s important to recognize that the “something” that gets monitored can be anything. Oftentimes, it is a system within your control, such as an application server. Other times, it is a vendor system. Perhaps it is a mix of both, such as a technical process that needs to contact a vendor system.

Approaching it from the perspective that the monitored target is arbitrary keeps your thinking from going down the path of a specific implementation.

What does it mean to “work”?

A system that is working can mean a lot of different things. The most binary definition is whether the system is up (working) or down (not working).

That’s an incomplete answer, though.

  • If the system is up, but not doing what it was intended to, does that mean it is working?
  • What if it is doing what it is intended to do, but only for half of our users? What about 1% of our users?
  • Does a system have to perform flawlessly to be considered “working”?
  • If the system is doing what it is supposed to, but delayed, is it “working”?

Given this ambiguity, it is clear that from a design perspective, what it means to be “working” is ours to define. The acceptable limits of which errors are tolerated, the thresholds, are also determined by us as the operators of the system.

Following this train of thought, “working” is an arbitrary definition.
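
For instance, here is a minimal sketch of one such definition, with a hypothetical, operator-chosen threshold: the system counts as working while its error rate stays below 1%.

# A hypothetical, operator-chosen definition of "working": the system
# counts as up while fewer than 1% of sampled requests have failed.
ERROR_RATE_THRESHOLD = 0.01

def working?(failed_requests, total_requests)
  return true if total_requests.zero? # no traffic observed; assume up

  (failed_requests.to_f / total_requests) < ERROR_RATE_THRESHOLD
end

working?(3, 1_000)  # => true  (0.3% error rate)
working?(50, 1_000) # => false (5% error rate)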

What are we building?

A map of various monitoring system components

If we refine the answers to our conceptual questions above, we are left with the essence of monitoring: we are building something that can tell whether another system is functioning based on the parameters we define.

In essence — observing. There are, of course, other elements of monitoring such as alerting, recovering, etc. — we’ll get to that later.

The basic parts of a monitoring system

The Monitor

The Monitor is the component that contains the general logic for monitoring. Note that it doesn’t contain the logic for monitoring a specific system, but rather the algorithm for monitoring:

  • Step 1: Check a system and get a result
  • Step 2a: If the result is “working”, do something
  • Step 2b: If the result is “not working”, do something
  • Step 3: Repeat step 1

If we break that down into a single section of code, we’ll get the following:

loop do
  result = check_system

  if result.working?
    puts 'system up'
  else
    puts 'system down'
  end
end

Obviously, some details here are missing — we’ll get to this.

The Configuration

Remember — one of our key constraints is the fact that the system being monitored is arbitrary. This means that the Monitor doesn’t actually know how to talk to or check the system.

The configuration defines what the Monitor actually does. This is a piece of data, articulated as code or stored in a database.

configuration = {
  frequency: "2.days",
  target: "PaymentVendor"
}

One could easily write code to interpret this:

def is_time_to_check_system?(frequency)
  interval_number, interval_unit = frequency.split('.')
  interval_seconds = if interval_unit == 'days'
    interval_number.to_i * 24 * 60 * 60
  elsif interval_unit == 'minutes'
    interval_number.to_i * 60
  end

  # last_observed_at is assumed to hold the time of the previous check
  last_observed_at + interval_seconds < Time.now
end

The resulting algorithm:

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system
  if result.working?
    puts 'system up'
  else
    puts 'system down'
  end
end

Where the configuration lives is not particularly relevant to the monitoring system — it could be in the code, in a file, or even in the database if you want users to be able to set up these configurations themselves or want them defined easily during runtime!
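
For instance, a sketch of loading the same configuration from a hypothetical config/monitors.yml file instead of hardcoding it:

require 'yaml'

# A sketch, assuming a hypothetical config/monitors.yml containing:
#   frequency: "2.days"
#   target: "PaymentVendor"
configuration = YAML.safe_load(
  File.read('config/monitors.yml'),
  symbolize_names: true
)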

The Integration

This is the part of the system that actually knows how to contact the thing being monitored. You’ll want to make sure it has a consistent calling interface for the monitor to use, to allow for multiple kinds of integrations to be monitored.

Let’s suppose you’re monitoring a 3rd-party payment system that has an API that always returns a response with a status code:

  • SUCCESS if the request succeeded
  • FAILED if the request failed

You can create a library that calls this API endpoint and interprets the response into a standard format your monitoring system can understand.

class PaymentVendorMonitorIntegration < MonitorIntegration
  def call(params = {})
    vendor_response = VendorLib.http_call(params)
    to_monitoring_system_result(vendor_response)
  end

  def to_monitoring_system_result(vendor_response)
    {
      status: vendor_response.status == 'SUCCESS' ? 'UP' : 'DOWN',
      message: vendor_response.body['message']
    }
  end
end

Here, the standard interface for a MonitorIntegration is #call, and the result contains a status and a message. While an actual implementation will contain many more things, this is fundamentally the core of determining a system’s status.

def check_system(target)
  monitor_class = "#{target}MonitorIntegration".constantize
  monitor_class.new.call
end

You could then integrate it with the code above:

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system(configuration[:target])
  if result.working?
    puts 'system up'
  else
    puts 'system down'
  end
end

# is_time_to_check_system?(frequency) and check_system(target) as defined above

The Standardized Response

The important part of monitoring multiple systems is standardizing the response, regardless of what each system gives you.

You could map the to_monitoring_system_result into a class:

class MonitorResult
  def initialize(params = {})
    @params = params
  end

  def working?
    @params[:status] == 'UP'
  end
end

Now, when you receive a result from a monitor, you can expect to be able to call working? and have it return a consistent result.
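
One way to wire this together, as a sketch, is to have check_system wrap the integration’s raw hash in a MonitorResult before returning it:

# A sketch: wrap the integration's raw hash in a MonitorResult so every
# caller can rely on #working?, regardless of which system was checked.
def check_system(target)
  monitor_class = "#{target}MonitorIntegration".constantize
  MonitorResult.new(monitor_class.new.call)
end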

That’s the basics of observing!

Obviously, your code will have to change to support different integration targets using some sort of system identifier.
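
For example, a sketch with one configuration per target, iterated over in the loop. EmailVendor is hypothetical and would need its own EmailVendorMonitorIntegration class; a real version would also track last_observed_at per target.

configurations = [
  { frequency: '2.days',    target: 'PaymentVendor' },
  { frequency: '5.minutes', target: 'EmailVendor' }
]

loop do
  configurations.each do |configuration|
    next unless is_time_to_check_system?(configuration[:frequency])

    result = check_system(configuration[:target])
    puts result.working? ? 'system up' : 'system down'
  end
end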

There are multiple iterations available from here:

  • Supporting more advanced system checks, such as running more complex scripts
  • Supporting specificity of endpoints, or adding parameters to checks
  • Moving configurations into the database to allow for user-configured checks

These I leave as an exercise to the reader.

What do you want to do after you monitor a system?

If a tree falls in the forest and no living thing is around to hear it, does it make a sound? Does it matter if it did?

There’s no point in monitoring something if you don’t have an action item arising out of it. What we do after we determine whether something is working or not working matters just as much as determining it in the first place.

Our monitor above just logs to the console — that’s not useful. You’ll want to add additional behaviors beyond just observing. These may include:

  • Alerting — Letting people know when the system is operating outside of defined norms.
  • Tracking — Recording statuses over time so that trends and long-term empirical baselines can be established.
  • Recovering — Triggering automatic actions that can help the system recover.

We’ll explore an approach for each of these.

Alerting

Suppose you want to send a Slack message if a critical system is down. You can do something like:

def send_incident_notification
  SlackNotification.send_to('#outages', 'System is down!')
end

Which could go here:

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system(configuration[:target])
  if result.working?
    puts 'system up'
  else
    send_incident_notification
  end
end

Tracking

You’ll likely want to have a history of whether a system is down or not so you can report and track uptime over a larger period of time.

The naive approach

The first approach is simple: log the result of every check in your favorite data store.

# Store the status in your favorite data store
# (MonitorResult is assumed here to be a database-backed record)
def log_monitor_result(result)
  MonitorResult.create!(status: result.working?,
                        occurred_at: Time.now)
end

You can incorporate it easily after getting the result:

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system(configuration[:target])
  log_monitor_result(result)

  if result.working?
    puts 'system up'
  else
    send_incident_notification
  end
end

Once you have a history of statuses, you can easily create a chart that shows uptime and availability at whatever frequency you desire.

Assigning a “0” or “1” to the status lets you chart downtimes.
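
As a sketch, assuming the naive approach above where every check is logged with a boolean status, uptime over a window is just the share of checks that reported up:

# A sketch: uptime as the percentage of logged checks that reported the
# system as up within a given time window.
def uptime_percentage(since)
  results = MonitorResult.where('occurred_at >= ?', since)
  return 100.0 if results.empty?

  (results.where(status: true).count * 100.0 / results.count).round(2)
end

uptime_percentage(Time.now - 24 * 60 * 60) # uptime over the last day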

A smarter approach

This naive approach has a key downside — you’ll store a lot of records.

  • If you’re monitoring just one system every minute, that’s 1,440 records / day, or 525,600 per year.
  • If you’re monitoring multiple things, such as 50 different systems, that’s 72,000 records / day or 26,280,000 per year.

A better approach isn’t to log every attempt, but to log only if the status has changed.

This would only log the things we actually care about — outages or recoveries — and allow us to assume the last known status for any time period afterwards.

Logging only changes reduces the storage footprint considerably.

This has a best-case storage requirement of 1 database record if your system never goes down.

def is_different_than_last_status?(result)
  previous_result = MonitorResult.order(occurred_at: :desc).first
  previous_result.try(:status) != result.working?
end

Our new algorithm looks like this:

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system(configuration[:target])
  if is_different_than_last_status?(result)
    log_monitor_result(result)
  end

  if result.working?
    puts 'system up'
  else
    send_incident_notification
  end
end

Future iterations

Tracking additional things can help you monitor and manage even more — you could easily track:

  • Additional data like HTTP status codes
  • Response payloads to inform outage resolution
  • Response times for performance management

and more!

The only limit on tracking is what you decide to store.
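
A sketch of what a richer tracked record might capture; the extra columns and the http_status, body, and duration_ms readers on the result are hypothetical:

def log_monitor_result(result)
  MonitorResult.create!(
    status: result.working?,
    http_status: result.http_status,      # e.g. 200 or 503
    response_body: result.body,           # to inform outage resolution
    response_time_ms: result.duration_ms, # for performance management
    occurred_at: Time.now
  )
end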

Recovering

If you have a reliable way to make the system function properly, you can execute this in the event of downtime fairly easily, creating self-recovering systems.

def attempt_system_recovery
  send_restart_command # I leave this to your imagination
end

Incorporating it when your system is seen as down can help automate recoveries.

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system(configuration[:target])
  if is_different_than_last_status?(result)
    log_monitor_result(result)
  end

  if result.working?
    puts 'system up'
  else
    send_incident_notification
    attempt_system_recovery
  end
end

This is, of course, a proof of concept, and not something you should use in production — unconstrained automated system recoveries can easily cause your system to never recover, or do things that make the incident even worse. Be judicious!
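
One way to stay judicious, as a sketch: cap the number of automated attempts and back off between them. The @recovery_attempts counter here is hypothetical state you would need to track per monitored system and reset once the system reports up again.

# A sketch of constrained recovery: give up after a few attempts and
# back off between them, rather than restarting forever.
MAX_RECOVERY_ATTEMPTS = 3

def attempt_system_recovery
  @recovery_attempts ||= 0
  return if @recovery_attempts >= MAX_RECOVERY_ATTEMPTS # escalate to a human

  @recovery_attempts += 1
  sleep(30 * @recovery_attempts) # simple linear backoff
  send_restart_command
end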

A final generalization

What do Alerting, Tracking, and Recovering all have in common? That’s right — their lifecycle is tied to the monitoring system.

When a monitor is triggered, and a result is received, these are ultimately actions that occur. One could take the system a step further and generalize it into a MonitorAction.

class MonitorAction
  # Performs the action's side effect.
  def do(config, result)
  end

  # Decides whether the action should run for this check's result.
  def should_do?(config, result)
  end
end

Implementations of Alerting, Tracking, and Recovering could look like:

class AlertMonitorAction < MonitorAction
  def do(config, result)
    SlackNotification.send_to('#outages', 'System is down!')
  end

  def should_do?(config, result)
    !result.working? # alert only when the system is down
  end
end

class TrackMonitorAction < MonitorAction
  def do(config, result)
    MonitorResult.create!(status: result.working?,
                          occurred_at: Time.now)
  end

  def should_do?(config, result)
    previous_result = MonitorResult.order(occurred_at: :desc).first
    previous_result.try(:status) != result.working? # log only changes
  end
end

class RecoverMonitorAction < MonitorAction
  def do(config, result)
    send_restart_command
  end

  def should_do?(config, result)
    !result.working? # attempt recovery only when the system is down
  end
end

The algorithm could call a perform_actions function that checks whether something should happen:

def perform_actions(configuration, result)
  configuration[:actions].each do |action_name|
    action = "#{action_name}MonitorAction".constantize.new
    if action.should_do?(configuration, result)
      action.do(configuration, result)
    end
  end
end

The algorithm would be changed to:

configuration = {
  frequency: "2.days",
  target: "PaymentVendor",
  actions: ['Alert', 'Track'] # whatever you want to run
}

loop do
  next unless is_time_to_check_system?(configuration[:frequency])

  result = check_system(configuration[:target])

  perform_actions(configuration, result)
end

You can see how this could evolve even further to support different actions, parameterized actions per configuration, or even actions before the system is checked.

You could even move the configuration into the database so that it could be defined by the user.
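
As a sketch, assuming a hypothetical monitor_configurations table with target, frequency, and actions columns:

# A sketch of database-backed configurations. MonitorConfiguration and
# its columns (target, frequency, actions) are hypothetical.
class MonitorConfiguration < ApplicationRecord
end

loop do
  MonitorConfiguration.find_each do |record|
    configuration = {
      frequency: record.frequency,
      target: record.target,
      actions: record.actions # e.g. ["Alert", "Track"]
    }
    next unless is_time_to_check_system?(configuration[:frequency])

    result = check_system(configuration[:target])
    perform_actions(configuration, result)
  end
end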

Actions can be generalized and moved into the database for per-configuration behavior.

That’s it!

Monitoring systems are conceptually fairly simple.

However, a real production system is much more: production monitors run in the background, off the main application, and have some sort of user-facing interface to set them up. There could be even more specificity, like monitoring individual endpoints, status interpretation beyond a binary “up” / “down”, and so on.
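
As one illustration, here is a sketch of moving checks off the main application with a background job. MonitorCheckJob is an assumption, it reuses the hypothetical MonitorConfiguration model from the earlier sketch, and it would be scheduled periodically (for example with sidekiq-cron) instead of running in a loop:

# A sketch of backgrounding checks with Sidekiq (hypothetical wiring).
class MonitorCheckJob
  include Sidekiq::Job

  def perform(target)
    configuration = MonitorConfiguration.find_by!(target: target)
    result = check_system(configuration.target)
    perform_actions({ actions: configuration.actions }, result)
  end
end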

Real-life complexity comes from ensuring the monitor itself functions reliably, performs recovery safely when needed, and tracks at the granularity you need.

Did you like this article? Let me know in the comments, or connect with me on LinkedIn!

This article is a part of my series How to Design Software.

Become a Medium member and help support my writing as well as thousands of other authors!
