Designing Human Systems

Recently I was having a conversation with a colleague who asserted that we (SREs) are broadly the types of engineer who, if given the choice, try to focus on perfecting the fundamentals. This surprised me, because if you were to ask me about my views on engineering, I’d probably lean in a slightly different direction.

My personal view on SRE is that it’s a game of balance. We’re not Software Engineers, we’re not Operations Engineers and we’re also not Security Engineers. We tread a fine line in the middle, pushing on aspects of the broader (humans included) system to help it find a stable equilibrium in which it delivers maximum value for all stakeholders. That kind of balancing requires a very pragmatic, flexible approach and often depends more on the subtleties of the system at hand than a rigidly theoretical approach can offer.

With that in mind, I think that as engineers, we need to focus on building systems that support that healthy equilibrium. Doing so means balancing a wide range of requirements from different, often competing, stakeholders while attempting to divine what the future may bring. In my experience, however, all of this becomes much easier to deal with if you can solve two key problems: velocity and observability.

Before I dive into that, let’s quickly talk about that experience.

Read more »

App Updates

Today I work as an SRE, surrounded by dozens of complex systems designed to streamline the process of taking the code we write and exposing it to customers. It’s easy to forget that software deployment itself is a problem that many developers have not yet solved.

Today I’d like to run you through a process I recently implemented for Git Tool to enable automated updates with minimal fuss. It’s straightforward, easy to implement and works without any fancy tooling.

Read more »

Live Unit Testing .NET Core - Where are my tests?

So you’re sitting in front of your computer, wondering why your unit tests won’t show up in Visual Studio’s Live Test Window. They appear fine in the normal Tests Window and they run without problems. You haven’t done anything weird and all you want is to be able to see whether your code works.

You’re not alone and there is a solution!

Read more »

.gitignore unicode

Have you ever run into a situation where Git just refused to obey your commands? No, I’m not talking about that time you “typo-ed” git commit and ended up git reset --hard-ing your repository back to the dawn of the universe, I’m talking about it really, truly, ignoring you.

I have, so let me tell you a story about what happened and how I fixed it so that you can avoid future hair-loss and avoid questioning the nature of your reality.

Read more »

Organizing your Development Directory

As an engineer, I like to think that I help fix problems. That’s what I’ve tried to do most of my life and career and I love doing so to this day. It struck me, though, that there was one problem which has followed me around for years without due attention: the state of my development directories.

That’s not to say that they are disorganized; I’ve spent hours deliberating over the best way to arrange them such that I can always find what I need. Yet I often end up having to resort to some dark incantation involving find to locate the project I was certain sat under my Work folder.

No more, I’ve drawn the line and decided that if I can’t fix the problem, automation damn well better be able to!

I’d like to introduce you to my new, standardized (and automated), development directory structure and the tooling I use to maintain it. With any luck, you’ll find it useful and it will enable you to save time, avoid code duplication and more easily transition between machines.

Read more »

Python Iterators, Next

How iterators work in Python, details about the next function and a lesson from production

Recently we had an outage. It was a small one, by all accounts, and as a result of the way our system is designed, it didn’t impact any users or lose any data, and it wasn’t in any way noticeable to anybody except us. It did happen, though, and that’s a problem.

The cause of this outage was pretty simple: engineer A designed a nice new feature in library X; engineer B liked this feature and decided to use it in service Y. This is a daily occurrence and is generally a very good thing: new, cleaner solutions help you constantly refactor away technical debt and improve the readability and all-important maintainability of your code.

This time, however, it went wrong and caused an outage so let’s talk about how that happened and take a detour through the land of Python iterators at the same time.
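
As a quick refresher before the detour, here is a minimal sketch of how the built-in next function behaves. This is not the code from the outage; the helper name first_or_default is a hypothetical one made up for illustration.

```python
def first_or_default(items, default=None):
    """Return the first element of any iterable, or `default` if it is empty.

    Hypothetical helper, used here only to illustrate next().
    """
    iterator = iter(items)          # iter() turns an iterable into an iterator
    return next(iterator, default)  # the second argument suppresses StopIteration


numbers = iter([1, 2])
first = next(numbers)   # consumes the first element
second = next(numbers)  # consumes the second element

try:
    next(numbers)       # the iterator is exhausted, so this raises
    exhausted = False
except StopIteration:
    exhausted = True
```

The thing to keep in mind is that next without a default signals exhaustion by raising StopIteration, while next with a default quietly returns that default instead.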

Read more »

Backups and Google Cloud Storage

Configuration
  Creating a Cloud Storage Bucket
    How to create the bucket
    What permissions to set
  Configuring Access to your Bucket
    Enabling interoperability support
    Accessing your bucket via s3
  Configuring your Backup Task
    The s3 compatible endpoint
  Configuring your Bucket Policies
    Content expiration and cold-storage rules
    Access permissions
  Migrating Old Backups
    Using the transfer tool
    Using the mc …

Read more »

Blueprint for a Monitoring Stack

At one point in my career, I spent over two years building a monitoring stack. It started out the way many do: with people staring at dashboards, hoping to divine the secrets of production from ripples in the space-time continuum before an outage occurred.

Over those two years we were able to transform not just the technology used, but the entire way the organization viewed monitoring, eventually removing the need for a NOC altogether.

I’ll walk you through the final design, which was responsible for everything from data acquisition to alerting and much besides. In this post I’ll go over some of the design decisions we made, why we made them and some guidance for anybody designing their own monitoring stack.

Read more »

Scaling for Latency with Async I/O

I’ve just spent the last month rewriting the core component in a monitoring stack which is responsible for protecting the availability of a billion-dollar-per-year franchise. The purpose of this rewrite was to improve the ability of our engineers to implement new features in a safe, quick and easy way - what we delivered ended up offering a four-orders-of-magnitude performance and efficiency improvement over our previous system.

Let’s talk about how that happened, why it was possible and how we achieved that without it being a focal point of the redesign. I’m going to discuss evented input-output, often referred to as async.

Hopefully, by the time you’ve finished reading this article you should have a good grasp of what evented IO is, how it works and some of the situations in which it has a lot to offer - as well as some of the significant advantages it has over alternative approaches when we start talking about large scale production systems.
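
To give a flavour of the idea, here is a minimal sketch using Python’s asyncio. It is not the monitoring component described above; fake_request is a hypothetical stand-in for a network call and the delays are arbitrary. It only shows that many waiting operations can overlap on a single thread.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Hypothetical stand-in for a network call.

    asyncio.sleep hands control back to the event loop while "waiting".
    """
    await asyncio.sleep(delay)
    return name


async def run_all(delay, count):
    """Start `count` simulated requests at once and wait for all of them."""
    pending = [fake_request(f"req-{i}", delay) for i in range(count)]
    # gather() runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(*pending)


def timed_run(delay=0.05, count=10):
    """Run the batch and report how long it took in wall-clock seconds."""
    start = time.perf_counter()
    results = asyncio.run(run_all(delay, count))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Because every simulated request spends its time waiting rather than computing, the whole batch should finish in roughly the time of one request instead of the sum of all of them.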

Read more »

The Cost of Backups

I’ve been putting together a recent series on how to easily run backups on Kubernetes and it struck me that there’s a range of theory that underpins the decisions one must make when designing a backup system. This theory is not often discussed and in many cases you “take backups in case something bad happens” without having a clear understanding of why different backups should be taken, how long they should be retained for or when you should be taking them.

In this post I’ll go over some of the costs associated with backups (both the direct and indirect) and how those will affect various decisions you make when designing a system that uses backups.

Read more »

Patterns for APIs

If you’ve built a production API before, you’ll know that they tend to evolve over time. This evolution is not only unavoidable, it is a natural state that any active system will exist in until it is deprecated.

Realizing and designing to support this kind of evolution in a proactive way is one of the aspects that differentiates a mature API from the thousands that litter the Wall of Shame.

At the same time, it is important that your API remains easy to use and intuitive, maximizing the productivity of developers who will make use of it.

Read more »

Rotating Backups with Kubernetes

Recently I wrote a blog post on how to execute scheduled backups using Kubernetes CronJobs. In that post I showed how easily one could dump backup files to an S3 bucket on a schedule using some trivially simple containers, but the astute among you will have noticed that I didn’t touch on the topic of backup rotation…

Backup rotation is the process of removing old or extraneous backups to make the best of your available storage space and in this post I’ll go over the approach I use to keep track of the backups that are important to us.
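
To make the definition concrete, here is a minimal sketch of one possible rotation rule. It is an assumption for illustration, not the approach from the post: always keep the newest few backups and expire anything else that is older than a cut-off. The function name select_expired and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def select_expired(backups, now, keep_latest=7, max_age=timedelta(days=30)):
    """Return the names of backups that may be deleted, newest first.

    backups: dict mapping a backup name to the datetime it was taken.
    The `keep_latest` newest backups are never selected, whatever their age.
    """
    newest_first = sorted(backups, key=backups.get, reverse=True)
    protected = set(newest_first[:keep_latest])
    return [
        name
        for name in newest_first
        if name not in protected and now - backups[name] > max_age
    ]
```

Protecting the newest backups regardless of age is a deliberate choice here, so that a stalled backup job can never cause the rotation to delete the last good copies.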

Read more »

Rotating Backups

Recently I wrote a blog post on how to execute scheduled backups using Kubernetes CronJobs. In that post I showed how easily one could dump backup files to an S3 bucket on a schedule using some trivially simple containers, but the astute among you will have noticed that I didn’t touch on the topic of backup rotation…

Backup rotation is the process of removing old or extraneous backups to make the best of your available storage space and before I write a post on how we do that, I’d like to discuss what backup rotation implies and how it should be done to maximize business value and minimize risk.

Read more »

Scheduled Backups with Kubernetes

It’s a poorly hidden fact that I love Kubernetes. After spending months running everything from Marathon DCOS and CoreOS to Rancher and Docker Swarm in production, Kubernetes is the only container orchestration platform that has struck me as truly “production ready” and I have been running it for the past year as a result.

While functionality when I first started using it (v1.4) was somewhat patchy and uninteresting, some of the more recent updates have been making sizeable strides towards addressing the operations challenges we face on a daily basis.

With v1.8, Kubernetes has promoted the CronJob controller to batch/v1beta1, making it available by default for people to play with. Sounds like the perfect time to show you how we use CronJobs to manage automated, scheduled backups within our environments.

Read more »

Relational and Document DBs

One of the most interesting discussions to have with people, notably those with traditional database experience, is that of the relationship between an off-the-shelf RDBMS and some modern NoSQL document stores.

What makes this discussion so interesting is that there’s invariably a lot of opinion drawn from (often very valid) experience one way or another. The truth is that there simply isn’t a silver-bullet database solution and that by better understanding the benefits and limitations of each, one can make vastly better decisions on their adoption.

Read more »

Out of the Box Docker

Docker has become an incredibly prevalent tool in the development and operations realms in recent months. Its combination of developer-friendly configuration and simple operational management makes it a very attractive prospect for companies and teams looking to adopt CI and CD practices.

In most cases, you’ll see Docker used to deploy applications in much the same way as a zip file or virtual machine image. This is certainly the most common use case for Docker, but by no means the extent of its functionality.

In this post I’m going to discuss some of the more interesting problems we’ve used Docker to solve and why it serves as a great solution to them.

Read more »

Dockerizing Aurelia

Aurelia is a modern web application framework in the spirit of Angular, with an exceptionally concise and accessible developer experience and a standards-compliant implementation. It is hands down my favorite web framework right now and one I’d strongly recommend for most projects.

One of Aurelia’s greatest claims to fame is the incredible productivity you can achieve, enabling you to build a full web application in just days, if not hours.

When building the application becomes that fast, spending a day putting together your deployment pipelines to roll out your application becomes incredibly wasteful, so how can we avoid that?

Well, Docker offers us a great way to deploy and manage the life-cycle of production applications. It enables us to deploy almost anywhere, with minimal additional effort and in a highly reproducible fashion.

In this post I’ll go over the process of Dockerizing an existing Aurelia web application built with WebPack; however, the same process applies to those built using SystemJS.

Read more »

Signing Git Commits using Keybase

With the increasing popularity of Git as a tool for open source collaboration, not to mention distribution of code for tools like Go, being able to verify that the author of a piece of code is indeed who they claim to be has become absolutely critical.

This requirement extends beyond simply ensuring that malicious actors cannot modify the code we’ve published, something GitHub and its kin (usually) do a very good job of preventing. The simple fact is that by adopting code someone else has written, you are entrusting your clients' security to them - you best be certain that trust is wisely placed.

Using Git’s built in support for PGP signing and pairing it with Keybase provides you with a great framework on which to build and verify that trust. In this post I’ll go over how one sets up their development environment to support this workflow.

Read more »

Feeling Lucky

Anybody who has worked in the development world for a significant portion of time will have built up a vast repertoire of abbreviations to describe how they solve problems. Everything from TDD to DDD and, my favourites, FDD and HDD. There are so many in fact that you’ll find a website dedicated to naming and shaming them.

I’m not one to add another standard to the mix… Oh who am I kidding, let me introduce you to Chance Driven Development.

[Image: XKCD, “Standards”]

Read more »


Inki is a small proof of concept project I’ve been working on which is designed to manage transient, single-use, SSH keys for an automated remediation tool our team is in the process of building.

In this blog post I’ll go over some of the design decisions motivating a tool like Inki, some of its interesting implementation details and the questions we’re hoping it will allow us to answer.

Read more »