On Call Support Automation

Automating 24/7 Monitoring for a Weekend

This is the story of how I was tasked with monitoring our website all weekend, but wrote a script to do it instead.

Introduction

I did not realize it at the time, but one Friday morning, a storm was brewing. We received an innocuous ticket from a client reporting that orders appeared to be stuck. At that point, the ticket was assigned to my colleague, and I didn't poke around much further.

At around 3:00 p.m., however, the ticket was transferred to me, as my colleague was planning on logging off early. Naturally, the ticket was to be prioritized, but there was no cause for alarm. During the knowledge transfer, we noticed that the scheduled job that was supposed to be processing orders appeared to be stuck. The job, which usually takes only a few seconds, had been running for over an hour and had still not completed. When checking the queue, we noticed an abnormal number of orders. What we later found out was that the client had been running a special promotion in which certain products were free with the sign-up of a subscription. This resulted in an absurd number of new orders coming in.

I double-checked to make sure that orders were still being processed, albeit a little slowly. I still wasn't sure that there was a bigger problem at hand, and in addition, a senior developer was looking into the root cause.

One particularly unfortunate behavior was that when the scheduled job ran, it would first grab all the orders in the queue and then process them first in, last out (FILO). Orders that were added after the scheduled job started would not be processed until the job ran again. Since there were thousands of orders coming in, some unlucky orders had to wait longer and longer, especially because the scheduled job had been restarted several times (we were unsure whether it was stuck or not).

At 5:00 p.m., I synced with the senior developer and the project manager (PM) to ensure that we were all on the same page. The job was running and processing as expected, but it was a little slow.
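
To make the FILO problem described above concrete, here is a minimal TypeScript sketch of that behavior. It is not the actual job's code: the numeric order IDs, the `capacity` cutoff (standing in for a run that gets restarted before it finishes), and the function names `runJob` and `runsUntilFirstOrderProcessed` are all assumptions made for illustration.

```typescript
// Hypothetical model: an order is just its arrival sequence number.
type Order = number;
type Policy = "FILO" | "FIFO";

/**
 * One run of the scheduled job: snapshot everything queued right now,
 * then process at most `capacity` orders from that snapshot.
 * Orders that arrive after the snapshot must wait for the next run.
 */
function runJob(
  queue: Order[],
  capacity: number,
  policy: Policy,
): { processed: Order[]; remaining: Order[] } {
  const snapshot = [...queue];
  const processed: Order[] = [];
  while (processed.length < capacity && snapshot.length > 0) {
    // FILO takes the newest order in the snapshot; FIFO takes the oldest.
    const next = policy === "FILO" ? snapshot.pop()! : snapshot.shift()!;
    processed.push(next);
  }
  return { processed, remaining: snapshot };
}

/**
 * Simulate repeated runs with `arrivalsPerRun` new orders before each run.
 * Returns the run in which the very first order (#0) is processed,
 * or null if it is still waiting after `maxRuns` runs.
 */
function runsUntilFirstOrderProcessed(
  policy: Policy,
  arrivalsPerRun: number,
  capacity: number,
  maxRuns: number,
): number | null {
  let queue: Order[] = [];
  let nextId = 0;
  for (let run = 1; run <= maxRuns; run++) {
    for (let i = 0; i < arrivalsPerRun; i++) {
      queue.push(nextId++);
    }
    const { processed, remaining } = runJob(queue, capacity, policy);
    if (processed.includes(0)) {
      return run;
    }
    queue = remaining;
  }
  return null;
}
```

With five arrivals before each run and room for three orders per run, `runsUntilFirstOrderProcessed("FIFO", 5, 3, 10)` returns `1`, while the same call with `"FILO"` returns `null`: as long as new orders keep arriving at least as fast as they are processed, the oldest order never reaches the front.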
Doing the math, it would take over 1.5 days to finish processing, even if no new orders came in. We all agreed on this assessment, and the PM let us know that he would talk to the client and tell them what we had found. The senior developer works in a different time zone, so his shift had actually been over for more than four hours. With that in mind, I told him to log off and reiterated that if there were any further problems, I would handle them.

Of course, at this point, there was nothing to do but wait, so I monitored for maybe another 10 minutes before stepping away from my computer to take a break. I had also made dinner plans with my significant other, and at around 6:00 p.m., I began getting ready to leave. I reminded myself that I should check my messages before leaving, but I was not too worried. Besides, I had my phone with me, so they could always reach me that way.

Where the Trouble Begins

I picked up my significant other and was on my way to the restaurant when I got a message from the Big Boss (my boss's boss). He was trying to figure out what was going on and put out a fire that had been raging. I realized I also had a missed call. I was still driving, so I pulled into a parking lot and called the Big Boss back. It turned out that after I had left, the PM had not been able to explain to the client what had happened or assure them that the orders were still being processed. I realized then that I had forgotten to check my computer before leaving for dinner.

Once home, I jumped on a call with the senior developer and the Big Boss and saw that I had missed about a dozen messages. I was told that the client was having a meltdown and was not convinced that orders were going through.
They even suggested that they start manually processing orders to get through the whole backlog (this would have taken more than a week and been prone to errors). The Big Boss was barely able to talk the client down and assured them that we would handle the situation. We again confirmed that the orders were being processed, even though it was slow. To keep the client happy, the Big Boss told me that I would have to take turns monitoring the orders to make sure that they were still being processed. We were to give status updates every hour until the orders were completely processed.

Long term, we needed to speed up order processing and ensure that the queue was FIFO instead of FILO. We had never had an issue before because the client would have at most 10 orders in a day. On this particular lucky Friday, we had received well into the thousands. In the short term, however, this meant that we would have to take turns staying up all night to keep the client happy. At this point, I was just happy that I was not getting fired (as it was mostly my fault that the issue had escalated this far). As such, I offered to take the graveyard shift.

The Automation

In the middle of the night, I realized that it was somewhat stupid to stay up all night just to count by hand how many orders were remaining. I noticed that logging in to the back-office/admin section of the website required only basic auth. Naturally, I started working on a Puppeteer script that logs in, goes to the right page/tab, counts how many orders are remaining, and logs the number into a Google Sheets document. From there, I used the timestamp and number of orders to plot a chart in Sheets. To automate the script runs, I created a scheduled task in Windows that runs a batch script, which in turn runs Puppeteer.

At the end of the day, I probably should have just checked my messages before I left for dinner.
Writing the script honestly took the better part of a working day and was totally not worth it, but at the very least, it was fun.

Lessons Learned:

Always have your phone with you when you are on call.
Always double-check that an issue is completely resolved from the client's perspective before assuming all work is done.
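
For anyone curious what the monitoring script from The Automation might look like, here is a rough TypeScript sketch of the flow: log in, open the page, count the remaining orders, and record a timestamped row. It is not the original script. It assumes the site uses HTTP Basic authentication (which Puppeteer handles through `page.authenticate`), and the `MonitorConfig` fields, the CSS selector, and the `appendRow` callback that stands in for the Google Sheets write are all hypothetical. The `PageLike` interface is a deliberately narrow stand-in for the few Puppeteer `Page` methods used, so the logic can be tried against a fake page.

```typescript
// Narrow stand-in for the Puppeteer Page methods this sketch relies on.
// A real Puppeteer page offers authenticate, goto, and $$eval with richer types.
interface PageLike {
  authenticate(credentials: { username: string; password: string }): Promise<void>;
  goto(url: string, options?: { waitUntil?: "load" | "networkidle2" }): Promise<unknown>;
  $$eval(selector: string, pageFunction: (elements: unknown[]) => number): Promise<number>;
}

// All values here are assumptions; the original post does not give them.
interface MonitorConfig {
  adminUrl: string; // hypothetical back-office page listing pending orders
  username: string;
  password: string;
  orderRowSelector: string; // hypothetical selector matching one pending order
}

// One spreadsheet row: ISO timestamp and number of orders still waiting.
type SheetRow = [string, number];

/** Log in with basic auth, open the admin page, and count pending orders. */
async function countRemainingOrders(
  page: PageLike,
  config: MonitorConfig,
): Promise<number> {
  await page.authenticate({
    username: config.username,
    password: config.password,
  });
  await page.goto(config.adminUrl, { waitUntil: "networkidle2" });
  // The callback runs in the page and simply counts the matched elements.
  return page.$$eval(config.orderRowSelector, (rows) => rows.length);
}

/**
 * Take one reading and hand it to `appendRow`, a caller-supplied function
 * that would write the row to a spreadsheet (not implemented here).
 */
async function recordSnapshot(
  page: PageLike,
  config: MonitorConfig,
  appendRow: (row: SheetRow) => Promise<void>,
  now: Date = new Date(),
): Promise<SheetRow> {
  const remaining = await countRemainingOrders(page, config);
  const row: SheetRow = [now.toISOString(), remaining];
  await appendRow(row);
  return row;
}
```

In a real run, the page would come from Puppeteer's `puppeteer.launch()` and `browser.newPage()`, and `recordSnapshot` would be called once per scheduled invocation. Keeping the spreadsheet write behind a callback means the counting logic can be checked without touching either the website or Google Sheets.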

Jan 29, 2026