Prevent client frustration by reducing downtime

We've all been there: the phone rings and suddenly it's panic stations. A client site is down, they're panicking and upset, and we have to drop everything to find out what's happening. This doesn't happen often, but when it does it puts the client and the whole dev team in a spin. It's certainly not predictable, it's only somewhat preventable, and it's completely frustrating.

We currently carry out a monthly site review for every client who works with us on a retainer basis. These reviews, which catch things like unnecessary load issues, recurring internal errors and shifting traffic patterns, go a long way toward preventing downtime. Occasionally, though, something unpredictable will bring a site down anyway, and we'd rather be one step ahead of the client: we want to know their site is down before they do, so we can jump on the issue and fix the underlying problem before it becomes a bigger one. We can then inform the client that there was an issue, but that we addressed it right away. If I were a client I'd love that, and since we do love our clients, we decided to set up a simple process that gives us that monitoring at little additional overhead cost to us, and no additional cost to our retainer clients.

There are a number of services out there that will monitor your site and notify you when it goes down. Here's a nice list. These are mostly paid services, though, and with the number of clients we have on retainer, they could become cost prohibitive or force us to raise our retainer fees. So, to keep costs down, I went looking for a way to do this in house and came upon something that works very well indeed.

This free PHP script from serviceuptime.com polls a URL and determines whether it's responsive. The script ships as an HTML form that lets you submit one URL at a time, check the status of the FTP, HTTP and POP services on that domain, and report back what it found. I needed something that would do this automatically for all of the sites we have on retainer. I realized that I could pass the domain and the service to check via the URL, in the format "?psc_host=example.com&psc_http=1", where psc_host is the domain we want to check and psc_http is the service we want to check.
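For reference, the parameter handling presumably boils down to something like this sketch (only the parameter names come from the actual form; the internals here are my guess, not the script's real code):

<?php
// Hypothetical sketch of how the checker reads its query parameters.
// Only the parameter names are taken from the real script.
$host       = isset($_GET['psc_host']) ? $_GET['psc_host'] : '';
$check_http = !empty($_GET['psc_http']);

if ($host !== '' && $check_http) {
    // Poll the HTTP service on $host and report what was found...
}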

So I wrote a simple little bash script that runs on cron and calls this URL for each of our retainer websites. On the PHP side, I added email alerting using PHP's simple mail() function, so I get a message whenever the response is anything but a "200 OK".
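The alerting tweak amounts to something like the following sketch; assume $host is the domain being checked and $status is the HTTP status line the script read back (both variable names are mine, not the original script's):

<?php
// Hypothetical sketch of the mail() alert; variable names are illustrative.
if (strpos($status, '200 OK') === false) {
    $subject = "Site down: {$host}";
    $body    = "Status check for {$host} returned: {$status}";
    mail('alerts@example.com', $subject, $body);
}

And here's the bash script itself: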

#!/bin/bash

# Run checks on various retainer domains

export PATH=/usr/local/git/bin:$PATH

LOGFILE=/Users/X/Scripts/siteuptime.log

echo "==== Start ====" >> $LOGFILE 2>&1
date >> $LOGFILE

# Just add a site to this list.
# Array elements are separated by whitespace, so one
# domain per line (as below) works fine.
retainer_sites=(
        site1.ca
        site2.ca
        site3.ca
        site4.ca
)

# Loop through all retainer sites and check each one by calling
# the site checker PHP script with curl, suppressing all output.
for i in "${retainer_sites[@]}"; do
        echo "Checking: ${i}" >> $LOGFILE 2>&1
        curl --silent --compressed "http://sitechecker.com/?psc_host=${i}&psc_http=1" > /dev/null 2>&1
done

echo "==== End ====" >> $LOGFILE 2>&1

Once you've written the script you can add it to cron. The entry below uses the /etc/crontab format (note the user field) and runs the script every minute.

*     *       *       *       *       root  /Users/X/Scripts/siteuptime.sh
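If you prefer a per-user crontab (installed with crontab -e) over /etc/crontab, drop the user column:

*     *       *       *       *       /Users/X/Scripts/siteuptime.sh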

I now receive a notification via email whenever a client site becomes unresponsive. And for those of you worried about this polling inflating bandwidth: the PHP script polls the server using fsockopen() and fgets(), reads at most the first 4000 bytes (roughly 4 KB) of the response, and stops there. So it shouldn't create a spike in bandwidth.
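The polling core of such a script looks roughly like this; it's a sketch of the fsockopen()/fgets() approach described above, not the script's verbatim source:

<?php
// Rough sketch of the capped-read poll; illustrative, not the real script.
$host     = 'example.com';
$max_read = 4000;   // stop reading after 4000 bytes to keep bandwidth low

$fp = fsockopen($host, 80, $errno, $errstr, 10);   // 10-second connect timeout
if (!$fp) {
    echo "Connection failed: {$errstr} ({$errno})\n";
} else {
    fwrite($fp, "GET / HTTP/1.1\r\nHost: {$host}\r\nConnection: close\r\n\r\n");
    $response = '';
    while (!feof($fp) && strlen($response) < $max_read) {
        $response .= fgets($fp, 1024);   // read at most 1 KB per call
    }
    fclose($fp);
    // The first line of the response holds the status, e.g. "HTTP/1.1 200 OK".
    echo strtok($response, "\r\n") . "\n";
}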