I’ve been having a lot of fun over the past couple of weeks dipping into the ops side of things – I’ve been experimenting with an alternative to the existing infrastructure at my day job, and have been learning a ton. There’s a long way to go, but I think I’ve got the start of an interesting, flexible setup.
At the heart of this cluster is a configuration server; I wanted to have new instances easily register themselves and discover existing instances and services. After looking at a few alternatives, I ended up going with etcd, in large part because of its curl-friendly interface (which also powers my experimental Ember.js web UI for etcd, wetcd).
For monitoring, I wanted something that’d easily adjust to instances spinning up and down. Sensu looked like a good fit, so I’ve been giving that a go. It introduces dependencies that might scare some people off (Ruby on every client, for instance), but it’s incredibly flexible.
The piece that I think gets overlooked most frequently in infrastructure setups is log aggregation – when you’re adding and removing instances willy-nilly, you really need a solid, central place to view and analyze logs (especially if you want to keep people away from direct access to the servers). I’m loving Logstash for this, especially since it just added some great features as it hit 1.2. For getting the logs to Logstash, I’m relying on good ol’ rsyslog. Finally, I’m using Kibana 3 to view and analyze the logs.
So here’s what happens: when a new instance comes up, it first asks the configuration server where monitoring, logging, and the like live, so it can point its own configuration at them. Once provisioning is complete, the instance notifies the configuration server that it is available for whatever role it’s playing.
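To make that flow a bit more concrete, here’s a rough Python sketch of the two steps (discover, then register). It assumes etcd’s v2 HTTP keys API; the server address, the `services/<name>` and `instances/<role>/<id>` key layout, and the 60-second TTL are all made up for illustration and aren’t necessarily what my setup uses.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the configuration server (hypothetical).
ETCD = "http://127.0.0.1:4001"


def key_url(base, key):
    """Build the URL for a key under etcd's v2 keys API."""
    return f"{base.rstrip('/')}/v2/keys/{key.strip('/')}"


def parse_value(body):
    """Extract the stored value from the JSON body of a v2 GET response."""
    return json.loads(body)["node"]["value"]


def registration_request(base, role, instance_id, address, ttl=60):
    """Prepare a PUT that announces this instance under its role.

    The key layout (instances/<role>/<id>) is a hypothetical convention.
    """
    data = urllib.parse.urlencode({"value": address, "ttl": ttl}).encode()
    url = key_url(base, f"instances/{role}/{instance_id}")
    return urllib.request.Request(url, data=data, method="PUT")


def discover(base, service):
    """Step 1: ask the configuration server where a service lives."""
    with urllib.request.urlopen(key_url(base, f"services/{service}")) as resp:
        return parse_value(resp.read())


def register(base, role, instance_id, address):
    """Step 2: once provisioning is complete, announce availability."""
    request = registration_request(base, role, instance_id, address)
    with urllib.request.urlopen(request) as resp:
        return resp.status
```

A provisioning script would call something like `discover(ETCD, "logging")` before writing out its config and `register(ETCD, "web", "i-123", "10.0.0.9")` as its last step. Setting a TTL on the registration means an instance has to keep refreshing its key, so one that dies without deregistering simply ages out.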
All boxes get a standard set of monitoring checks by default, in addition to checks specific to whatever they’re running (nginx, Redis, etc.); some of these checks get forwarded on to the analytics server. Finally, all logs get shipped to the log server via rsyslog. Coming from my naive, developer-oriented background, I’m shocked at how well it all works together.
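The “standard set plus role-specific extras” idea is easy to sketch. The check names and commands below are placeholders I’ve invented for illustration, not real Sensu plugin names, and how the result gets rendered into actual monitoring config is left out.

```python
# Checks every box gets, mapped to a (hypothetical) command.
STANDARD_CHECKS = {
    "cpu": "check-cpu",
    "disk": "check-disk",
    "memory": "check-memory",
}

# Extra checks keyed by what the box is running (also hypothetical).
ROLE_CHECKS = {
    "nginx": {"nginx_process": "check-process nginx"},
    "redis": {"redis_ping": "check-redis-ping"},
}


def checks_for(roles):
    """Return the standard checks plus any extras for the given roles."""
    checks = dict(STANDARD_CHECKS)  # copy so the defaults are never mutated
    for role in roles:
        checks.update(ROLE_CHECKS.get(role, {}))
    return checks
```

A box running nginx and Redis would get `checks_for(["nginx", "redis"])`: the three defaults plus one extra check per service. A role with no entry in `ROLE_CHECKS` just falls back to the defaults.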
There’s more to talk about (for instance, why I prefer managing all of this with Ansible instead of Chef or Puppet), but I’d love to hear what you all think. Hit me up on Twitter with comments and questions!