Blueprint for a Monitoring Stack

At one point in my career, I spent over two years building a monitoring stack. It started out the way many do: with people staring at dashboards, hoping to divine the secrets of production from ripples in the space-time continuum before an outage occurred.

Over these two years we were able to transform not just the technology used, but the entire way the organization viewed monitoring, eventually removing the need for a NOC altogether.

I’ll walk you through the final design, which was responsible for everything from data acquisition to alerting and much besides. In this post I’ll go over some of the design decisions we made, why we made them, and some guidance for anybody designing their own monitoring stack.

The Cost of Backups

I’ve recently been putting together a series on how to easily run backups on Kubernetes, and it struck me that there’s a range of theory that underpins the decisions one must make when designing a backup system. This theory is not often discussed, and in many cases you “take backups in case something bad happens” without a clear understanding of why different backups should be taken, how long they should be retained, or when you should be taking them.

In this post I’ll go over some of the costs associated with backups (both direct and indirect) and how they will affect the decisions you make when designing a system that uses backups.

Rotating Backups

Recently I wrote a blog post on how to execute scheduled backups using Kubernetes CronJobs. In that post I showed how easily one could dump backup files to an S3 bucket on a schedule using some trivially simple containers, but the astute among you will have noticed that I didn’t touch on the topic of backup rotation…

Backup rotation is the process of removing old or extraneous backups to make the best use of your available storage space. Before I write a post on how we do that, I’d like to discuss what backup rotation implies and how it should be done to maximize business value and minimize risk.
