At one point in my career, I spent over two years building a monitoring stack. It started out
the way many do: with people staring at dashboards, hoping to divine the secrets of production
from ripples in the space-time continuum before an outage occurred.
Over those two years we were able to transform not just the technology used, but the entire
way the organization viewed monitoring, eventually removing the need for a NOC altogether.
I’ll walk you through the final design, which was responsible for everything from data acquisition
to alerting and much else besides. In this post I’ll go over some of the design decisions we made,
why we made them, and some guidance for anybody designing their own monitoring stack.
I’ve been putting together a recent series on how to easily run backups
on Kubernetes and it struck me that there’s a range of theory that underpins
the decisions one must make when designing a backup system. This theory is
not often discussed and in many cases you “take backups in case something bad
happens” without having a clear understanding of why different backups should
be taken, how long they should be retained for or when you should be taking
them.
In this post I’ll go over some of the costs associated with backups (both
direct and indirect) and how those costs affect the various decisions you make when
designing a system that uses backups.
Recently I wrote a blog post on how to execute scheduled backups using
Kubernetes CronJobs. In that post I showed how easily one
could dump backup files to an S3 bucket on a schedule using some trivially
simple containers, but the astute among you will have noticed that I didn’t
touch on the topic of backup rotation…
Backup rotation is the process of removing old or extraneous backups to make
the best use of your available storage space. Before I write a post on how we
do that, I’d like to discuss what backup rotation implies and how it should
be done to maximize business value and minimize risk.
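To make the idea concrete, here is a minimal sketch of an age-based rotation policy. The retention windows, function name, and the "keep dailies for a week, then weeklies for a month" scheme are illustrative assumptions of mine, not the policy from the follow-up post:

```python
from datetime import datetime, timedelta

def select_backups_to_delete(backup_dates, retain_daily=7, retain_weekly=4):
    """Return the backups that fall outside a simple retention window:
    keep every backup from the last `retain_daily` days, then only one
    backup per week (Sundays here) for `retain_weekly` weeks.
    NOTE: the specific windows are hypothetical defaults."""
    now = max(backup_dates)
    keep = set()
    for d in backup_dates:
        age = now - d
        if age <= timedelta(days=retain_daily):
            keep.add(d)  # recent: keep every daily backup
        elif age <= timedelta(weeks=retain_weekly) and d.weekday() == 6:
            keep.add(d)  # older: keep only the weekly (Sunday) backup
    return sorted(d for d in backup_dates if d not in keep)

# Example: 30 consecutive daily backups; the oldest non-Sunday ones
# outside both windows are the candidates for deletion.
dates = [datetime(2024, 1, 1) + timedelta(days=i) for i in range(30)]
doomed = select_backups_to_delete(dates)
```

The point of a scheme like this is exactly the trade-off discussed above: older backups are thinned out rather than deleted outright, so you trade storage cost against how far back (and how precisely) you can restore.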