The Hund Blog

Monitoring Cron Jobs

Jun 23, 2017

Jonathan la Cour

Hund's webhook integrations are robust solutions for reporting statuses and custom metrics to your status page. These are ideal if you need to report information from your platform, or if Hund does not integrate with a monitoring service you use.

Today we will be looking at using webhooks to report the health and run times of cron jobs.

Reporting Statuses

Upon creation of a webhook component, you are provided with a unique webhook URL and key:

Webhook Component

We can send a POST request to the provided webhook URL with the component's current status. The reported status must be a valid status integer (-1: Outage, 0: Degraded, 1: Operational). If you do not provide a status, a status of 1 is assumed.

We shall start off by making a simple shell script, report_status.sh:

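#!/bin/sh
# Usage: report_status.sh WEBHOOK_KEY WATCHDOG_ID EXIT_CODE
# Map the job's exit code to a status: 1 (Operational) on success, -1 (Outage) otherwise.
if [ "$3" -eq 0 ]; then status=1; else status=-1; fi

curl -fsSm 10 --retry 3 -X POST -H "X-WEBHOOK-KEY: $1" \
  "https://status.example.com/state_webhook/watchdog/$2?status=$status"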

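This script takes three arguments: your webhook key, your watchdog ID, and an exit code. The cURL command runs silently except for error messages (-sS), treats HTTP error responses as failures (-f), times out each request after 10 seconds (-m 10), and retries up to 3 times on transient errors.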

Example Job:

0 23 * * * ~/backup.sh; ~/report_status.sh xzji2pvfhm39o04d6cr1 594aa7927884281f3a83c289 $?

This cron job runs every day at 23:00. Our report_status.sh script will always* run with backup.sh's exit code (i.e. $?).

Just like that, you are reporting the status of your cron job. However, what if the server is down, or it cannot resolve your status page's hostname? That is where the dead man's switch feature comes in.

* report_status.sh will not run if set -e is used and backup.sh fails with a non-zero exit code.
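If the job and the report live in a wrapper script that uses set -e, the caveat above can be sidestepped by capturing the exit code explicitly. The following is a minimal sketch under that assumption; run_job and job_ec are hypothetical names, not part of Hund's tooling.

```shell
#!/bin/sh
set -e

# run_job (hypothetical helper): run the given command and store its exit
# code in job_ec. Because the failure is handled by the || branch, set -e
# does not abort the script when the command exits non-zero.
run_job() {
  job_ec=0
  "$@" || job_ec=$?
}

# Intended use inside a wrapper script (paths and IDs are placeholders):
#   run_job "$HOME/backup.sh"
#   "$HOME/report_status.sh" "$WEBHOOK_KEY" "$WATCHDOG_ID" "$job_ec"
```

With this pattern the report step still runs after a failed backup, and it receives the real exit code rather than never being reached.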

Dead Man's Switch

We want to ensure our component's status is accurate, so we need to report an outage when the job stops running for any reason. Webhook watchdogs have a "dead man's switch" option that can report an outage when the component's status is not received when expected:

Dead Man's Switch

Here, we have configured the dead man's switch to expect our status report every day, with a consecutive check threshold of 1. With such a low threshold, the webhook watchdog is considered "dead" as soon as a single expected report is missed, and an outage fires immediately.

A one-minute cron job can safely use a low consecutive check threshold without worrying about overly sensitive reporting. If such a cron job were failing intermittently, Hund's event backend would notify you of the outage only once, upon the first failure (depending on your consecutive check threshold). A restoration notification fires only after the intermittent failures have stopped.

Reporting Run Times

It would be valuable to see how long our cron job takes to run. Webhook metric providers are for reporting arbitrary datasets, so let us define our new run_time metric:

Metric Definition

Here, we have defined the metric's title and a custom y supremum. The y supremum is assumed to be our least upper bound for the data.

This new run_time metric we have defined can be used to initialize new run_time metrics on other components if desired.

Now we need to report our metric to the metric provider URL presented earlier. We will start off again with a simple shell script, report_time.sh:

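#!/bin/sh
# Usage: report_time.sh WEBHOOK_KEY METRIC_PROVIDER_ID START_TIMESTAMP
# Elapsed seconds since START_TIMESTAMP. bc omits the leading zero for values
# below 1 (e.g. .482), so sed restores it. %3N (milliseconds) requires GNU date.
duration=$(echo "$(date +%s.%3N) - $3" | bc | sed -r 's/^(-?)\./\10./')

curl -fsSm 10 --retry 3 -X POST -H "X-WEBHOOK-KEY: $1" \
  -H "Content-Type: application/json" \
  -d "{ \"metrics\": { \"run_time\": [ { \"y\": $duration } ] } }" \
  "https://status.example.com/state_webhook/metrics/$2"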

This script takes three arguments: your webhook key, your metric provider ID, and a start timestamp. This cURL command is sending a POST request with our run_time data.
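The escaped quotes in the -d argument are easy to get wrong. As a sketch, the same JSON body could be built with printf instead. Here run_time_payload is a hypothetical helper name, and the payload shape is copied from the script above rather than from any additional API documentation.

```shell
#!/bin/sh
# run_time_payload (hypothetical helper): print the JSON body that
# report_time.sh sends for a single run_time data point.
run_time_payload() {
  # $1: duration as a JSON-valid number, e.g. 0.482
  printf '{ "metrics": { "run_time": [ { "y": %s } ] } }' "$1"
}

# Intended use in report_time.sh (not executed here):
#   curl ... -d "$(run_time_payload "$duration")" ...
```

Single-quoting the format string keeps the JSON readable, and the duration is the only value substituted in.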

Example Job:

* * * * * st=$(date +\%s.\%3N); ~/generate_reports.sh; ec=$?; ~/report_time.sh qwy7k84r0fx16mcivlza 594d85087884283e510d4a5a $st; ~/report_status.sh qwy7k84r0fx16mcivlza 594d85087884283e510d4a58 $ec

Note: % must be escaped as \% in a crontab, because cron converts an unescaped % into a newline.

Here we are reporting both the cron job's status as well as its run time.

We capture the start and end timestamps with millisecond precision using date +%s.%3N (the %N format is a GNU date extension). In report_time.sh, we use bc (the basic calculator) to compute the duration, because the shell's built-in arithmetic only handles integers. bc omits the leading zero for sub-second values (for example, .482), so we use sed to restore it and produce a JSON-valid number.
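To see the leading-zero fix in isolation, the sed expression can be wrapped in a small function. This is only an illustrative sketch: normalize_duration is a hypothetical name, and it uses sed -E, which both GNU and BSD sed accept, in place of the script's -r.

```shell
#!/bin/sh
# normalize_duration (hypothetical helper): add the leading zero that bc
# leaves out, so ".482" becomes "0.482" and "-.5" becomes "-0.5".
# Values that already have an integer part pass through unchanged.
normalize_duration() {
  printf '%s\n' "$1" | sed -E 's/^(-?)\./\10./'
}

normalize_duration .482    # prints 0.482
normalize_duration 12.034  # prints 12.034
```

In the replacement, \1 restores the optional minus sign and the literal 0. supplies the missing digit before the decimal point.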

Alternatively, we could report durations in whole milliseconds by using date +%s%3N, which needs neither bc nor sed.
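A minimal sketch of that millisecond variant, assuming GNU date for %3N. The elapsed_ms function is a hypothetical name, and the metric's y supremum would need to be expressed in milliseconds rather than seconds.

```shell
#!/bin/sh
# elapsed_ms (hypothetical helper): print the whole milliseconds elapsed
# between two timestamps produced by `date +%s%3N` (GNU date).
#   $1: start timestamp in milliseconds
#   $2: end timestamp in milliseconds (defaults to the current time)
elapsed_ms() {
  end=${2:-$(date +%s%3N)}
  # Integer shell arithmetic is enough here, so bc and sed are not needed.
  echo $(( end - $1 ))
}

# Intended use in a cron job (remember to escape % as \% in the crontab):
#   st=$(date +%s%3N); ~/generate_reports.sh; elapsed_ms "$st"
```

Because the result is always an integer, it is already a valid JSON number and can be placed directly in the payload.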

After some time, an interesting graph has developed on our component:

Run Time Graph

Summary

There exist many services for monitoring cron jobs, but we can see that Hund provides a more robust solution than others, packaged in a status page service.

Hund's webhook metric provider allows us to report arbitrary datasets that can provide great insight into performance, while the webhook component provides confidence in operational health.

You can try Hund for free today, no credit card required.